Week of 130401
Daily WLCG Operations Call details
To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
- Dial +41227676000 (Main) and enter access code 0119168, or
- To have the system call you, click here
The SCOD rota for the next few weeks is at ScodRota
WLCG Availability, Service Incidents, Broadcasts, Operations Web
General Information
Monday: Easter
Monday holiday
- The meeting will be held on Tuesday instead.
Tuesday
Attendance:
- local: Raja, Maarten, Jan, Jerome, Stefan
- remote: Peter, Xavier, Rolf, Wei-Jen, Oliver, Onno, Lisa, Lucia, Rob, Pepe, Gareth, Jeremy, Roger
Experiments round table:
- CMS reports (raw view) -
- LHC / CMS
- Re-reconstruction of 2012 data is in the tails; load at the T1 sites is small
- CERN / central services and T0
- The Frontier system was under high load over the weekend: a FastSim workflow was misconfigured to use the FullSim job splitting, which produced very short jobs that made many accesses to the Squid caches for alignment and calibration constants. If you see failures in SAM tests and/or HammerCloud tests because of failed access to Frontier, please open a Savannah ticket to get the SiteReadiness calculation corrected.
- Tier-1:
- Tier-2:
- ALICE -
- NTR
- Xavier: ALICE asked why jobs were lost last week; the reason was a reboot of the VOBOX
- Maarten: Yes, this was also reported in the WLCG Ops meeting last week, but some instabilities were also seen after the reboot
- LHCb reports (raw view) -
- Mainly user jobs with some MC ongoing.
- T0:
- No SAM tests displayed on the SUM dashboard - solved now (GGUS:92924). Solution not very clear though.
- T1:
- RAL: Continuing to have occasional problems with setting up the job environment.
Sites / Services round table:
- FNAL: NTR
- KIT: At 8:30 am today one fileserver showed issues; the hardware is currently being replaced. For the moment 6 x 30 TB are not available for ATLAS
- CNAF: Kernel upgrade is finished. Pledges are available as of today
- ASGC: NTR
- RAL: NTR
- NL-T1: NTR
- NDGF: NTR
- PIC: Last week's scheduled DT went well, but after coming back online the Chimera system was unstable and was therefore rolled back to the previous version. CPU pledges installed as of today.
- IN2P3: NTR
- OSG: NTR
- GridPP: NTR
- Batch Services: NTR
- Storage: Announcement: next Monday the "file update functionality" for CASTOR will be removed for ATLAS, CMS and LHCb. ALICE was already running without it.
AOB:
Thursday
Attendance:
- local: Maarten, Jan, Jarka, Raja, Jerome, MariaD, Stefan
- remote: Salvatore, Michael, John, David, Jeremy, Pepe, Ronald, Lisa, Rob, Xavier, Wei-Jen, Roger, Peter
Experiments round table:
- CMS reports (raw view) -
- LHC / CMS
- Re-reconstruction of 2012 data is in the tails; load at the T1 sites is small
- CERN / central services and T0
- Tier-1:
- GGUS ticket due to xrootd fallback file unavailability for a few files at FNAL: GGUS:93064
- Tier-2:
- LHCb reports (raw view) -
- Mainly user jobs with some MC ongoing.
- T0:
- T1:
- RAL: Overnight many jobs failed while setting up the job environment.
Sites / Services round table:
- KIT: NTR
- BNL: On Tuesday there was a brief SRM outage because the SRM DB had grown to a size that was no longer serviceable; the DB size has now been reduced and adjusted.
- RAL: Currently looking at the CVMFS problem reported by LHCb
- ASGC: NTR
- GridPP: NTR
- PIC: NTR
- NDGF: NTR
- NL-T1: Announcing a DT in 2 weeks (18th April) for network maintenance; both storage and CPU will be down
- FNAL: NTR
- OSG: NTR
- Storage: NTR
- Dashboards:
- Raja: Do you know why the tests were not displayed last week?
- Maarten: The SAM machine was down and needed to be rebooted
- GGUS: NTR
- Grid Services: NTR
AOB: