Week of 131202
WLCG Operations Call details
To join the call, at 15.00 CE(S)T, by default on Monday and Thursday (at CERN in 513 R-068), do one of the following:
- Dial +41227676000 (Main) and enter access code 0119168, or
- To have the system call you, click here
The SCOD rota for the next few weeks is at ScodRota
WLCG Availability, Service Incidents, Broadcasts, Operations Web
General Information
Monday
Attendance:
- local: Stefan, Manuel, Felix, Luca, Maarten
- remote: Andrei, Michael, Stefano, Lisa, Onno, Alexei, Tiju, WooJin, Pepe, Christian, Rob
- apologies: Sang-Un, Rolf
Experiments round table:
- ATLAS
- Central services
- T0/T1
- FZK-LCG2: A problem with the internal DNS caused the certificate update to fail, leaving the two FTS nodes a01-013-12{8,9} running with a few expired certificates, which affected transfers to Prague, CSCS and other sites. Resolved (GGUS:99364).
- NDGF-T1: One pool problem (GGUS:99277).
- INFN-T1: Was set back to production on Friday. Looks OK.
- TAIWAN-LCG2: Looks OK.
- CMS
- ru-PNPI: During the installation of SL6 a severe power incident occurred, which burned out power supplies in the disk array. CMS disk space is not available for the moment.
- LHCb
- Main activity is simulation at all sites.
- T0:
- T1:
- Network interruption at PIC over the weekend.
- GRIDKA: problems with downloading files, presumably due to failures getting metadata from the local SRM.
Sites / Services round table:
- BNL: NTR
- CNAF: NTR
- FNAL: NTR
- NL-T1: Two pool nodes are currently down because of hardware maintenance on the attached storage controller. On December 5th there will be maintenance on a power feed; not yet submitted to GOCDB.
- RAL: NTR
- KIT: NTR
- PIC: Incident on Saturday, started around 18:25 local time because of a power supply problem; it affected only the WAN and was fixed around 22:30. All local jobs kept running OK.
- NDGF: dCache upgrade is done and pools are coming back now
- OSG: Problem with the SAM availability and reliability reports, where the data transfer was not working correctly. The problem is fixed and the reports should be cleared up shortly.
- ASGC: The Castor server used by CMS for HC and SAM tests is down; it will be brought back ASAP.
- KISTI: Scheduled downtime on December 4th from 06:00 to 09:00 UTC for a network intervention. The network bandwidth for KISTI-CERN will be 2 Gbps after the intervention.
- IN2P3: Downtime on December 10th: major update of the network equipment (routers, for IPv6) and minor updates for CVMFS, dCache servers, the mass storage system and the batch system controller. Consequences: total network outage in the morning (the Grid Operations portal will also be unavailable at that time, so no downtime notifications during 2 hours). The batch downtime already starts in the evening of December 9th; back in the evening of the 10th.
- Storage: EOS ALICE downtime this morning for an upgrade to the latest version. The Castor upgrade for ATLAS & CMS happened this morning; the one for ALICE & LHCb will be on Wednesday.
- Grid Services: Outage of the batch service on Saturday morning, it was not possible to submit jobs, because of a faulty configuration. Down from 06:00 to 12:00. SIR
AOB:
Thursday
Attendance:
Experiments round table:
Sites / Services round table:
AOB: