Week of 130902

WLCG Operations Call details

To join the call, at 15.00 CE(S)T, by default on Monday and Thursday (at CERN in 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs: WLCG Service Incident Reports
  • Broadcasts: Broadcast archive
  • Operations Web

General Information

  • General Information: CERN IT status board, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - LHC Page 1


Monday

Attendance:

  • local: Ignacio, Maarten, Maria D, Przemek, Xavi
  • remote: Dimitri, Jeremy, Joel, Onno, Pavol, Pepe, Rolf, Sang-Un, Tiju, Tommaso, Wei-Jen

Experiments round table:

  • ATLAS reports (raw view) -
    • T0
      • ntr
    • T1
      • FTS3 at RAL (after patching last week) is putting considerable stress on the network (ND, GGUS:96923) or the storage systems (DE, various sites); it looks like more tuning of the settings will be needed

  • CMS reports (raw view) -
    • all quiet, not much activity
    • GGUS tickets still open:
    • New GGUS tickets:
      • GGUS:96971 ("Job Failure at T1_IT_CNAF"): seems to have been a GPFS glitch, investigating. Things are running smoothly as of now.
    • We still have a SNOW ticket open: INC:365019 on CERN Argus servers having overload problems and not balancing properly. Not critical at the moment (not much traffic), but not solved either. NEWS: a test setup with proper DNS aliasing has been put in place; tests will follow soon.

  • ALICE -
    • KIT: job submission failures starting Sat morning (unscheduled downtime declared in GOCDB); site fully drained of ALICE jobs on Sun; OK again since ~13:00 today

Sites / Services round table:

  • ASGC - ntr
  • GridPP - ntr
  • IN2P3 - ntr
  • KISTI
    • on Sep 10 will start a 36-h downtime to rearrange services in the computer center
  • KIT
    • on Sat morning job submission to Grid Engine stopped working, apparently due to 2 WNs that ran out of disk space and thereby became black holes; a restart of the services did not help; today the support line advised changing a few parameters, and the system has been working again since ~13:00
  • NLT1 - ntr
  • PIC - ntr
  • RAL - ntr

  • databases - ntr
  • GGUS/SNow - ntr
  • grid services - ntr
  • storage - ntr

AOB:

  • Note: the next meeting will be held on Friday.

Thursday: Jeûne genevois holiday

  • The meeting will be held on Friday instead.

Friday

Attendance:

  • local: Eddie, Maarten, Przemek, Xavi
  • remote: Boris, Gareth, Jeremy, Kyle, Lisa, Pepe, Rolf, Tommaso, Wei-Jen, WooJin

Experiments round table:

  • ATLAS reports (raw view) -
    • T0
      • ntr
    • T1:
      • FTS3: still some tuning necessary, but ND ticket is solved

  • CMS reports (raw view) -
    • all substantially quiet
    • GGUS tickets still open:
      • GGUS:96816 ("Debug transfer failing from RAL_Disk to Warsaw", lcg-support@gridppNOSPAMPLEASE.rl.ac.uk now involved, network checks being done)
      • GGUS:96843 ("Failing transfer from Wisconsin to IIHE", SSL handshakes being investigated... could be a peculiar SE SSL configuration)
      • GGUS:96912 ("CVMFS problems", waiting for reply, assigned to ICM) - NO REPLY AT ALL
      • GGUS:96971 ("Job Failure at T1_IT_CNAF"): solved, apparently forgot to close ticket?
    • New GGUS:
      • GGUS:96990 ("Bristol->RAL transfers") - in progress; seems to be just high load on a single Gftp, a second is being added.
    • We had again a couple of cases of "CVMFS black hole nodes" (KIT and RAL) - already solved (GGUS:96662 and GGUS:96757)
    • We still have a SNOW ticket open: INC:365019 on CERN Argus servers having overload problems and not balancing properly. Testing is ongoing on a daily basis; as far as we are concerned, we see an improvement (lack of errors).

  • ALICE -
    • NTR

Sites / Services round table:

  • ASGC - ntr
  • FNAL - ntr
  • GridPP - ntr
  • IN2P3 - ntr
  • KIT
    • transfer failures due to FTS picking expired proxies; this looks like an old problem that suddenly came back somehow; FTS developers have been contacted
  • NDGF - ntr
  • OSG - ntr
  • PIC
    • Mon Sep 16 downtime for upgrade of dCache to 2.2 series
  • RAL - ntr

  • dashboards - ntr
  • databases - ntr
  • storage - ntr

AOB:

Topic revision: r5 - 2013-09-06 - MaartenLitmaath
 