Week of 130923
WLCG Operations Call details
To join the call, at 15.00 CE(S)T, by default on Monday and Thursday (at CERN in 513 R-068), do one of the following:
- Dial +41227676000 (Main) and enter access code 0119168, or
- To have the system call you, click here
The SCOD rota for the next few weeks is at ScodRota
WLCG Availability, Service Incidents, Broadcasts, Operations Web
General Information
Monday
Attendance:
- local: AndreaV/SCOD, Vitor/Grid, Ken/CMS, MariaD/GGUS, Belinda/Storage, Ivan/Dashboard
- remote: Michael/BNL, Onno/NLT1, John/RAL, Sang-Un/KISTI, Wei-Jen/ASGC, Christian/NDGF, Lisa/FNAL, Rob/OSG, Rolf/IN2P3, Salvatore/CNAF, Doug/ATLAS
Experiments round table:
- ATLAS reports (raw view) -
- Central Services
- FT3-Pilot (GGUS:97359): a problem with a cached proxy affected functional tests to all sites (solved).
- AFS: at ~13:03 on 19 Sept, a spurious rm process on /afs/cern.ch/atlas/offline/* removed RW areas, including PanDA client areas needed by HammerCloud. Computing operations promptly restored the needed area from tape when alerted on 20 Sept. The exact cause of the rm is unknown (INC:388802). ATLAS investigation continues.
- T0
- T1
- CMS reports (raw view) -
- It has been fairly quiet until just a few hours ago!
- Appears to be a CVMFS problem at KIT, see GGUS:97505 and GGUS:97506.
- SAV:139882 and SAV:139885 appear to indicate EOS transition teething pains at FNAL.
- No one seems to be complaining about network connectivity to Russia, nor has anyone given me any update on the status.
- Why does CERN external networking show as degraded to 60% when it appears that all services are green? The service has been degraded for a week; I think the usual culprit is the (irrelevant) link to the Wigner center, but at the moment that looks green too. [MariaD: this may be due to the many interventions last week. AndreaV: will follow up - discussed this with Edoardo after the meeting]
- LHCb reports (raw view) -
- Main activity is MC productions
- Fall incremental stripping campaign will be launched this coming Monday, 30 Sep (see also WLCG Operations Coordination meeting, 19 Sep)
- T0: ntr
- T1:
- IN2P3: Currently in DT for SL6 upgrade [Rolf/IN2P3: sorry, we made a mistake: IN2P3 has been marked in DT since yesterday evening, but the downtime and interventions will actually start this evening]
- SARA: Currently in DT
- GRIDKA: In DT as of 14.00 today
Sites / Services round table:
- Michael/BNL: ntr
- Onno/NLT1:
- today's maintenance was successful
- announcement: on October 8 we will move from LHCOPN to LHCONE
- John/RAL: disk server issue for ALICE, being fixed
- Sang-Un/KISTI: ntr
- Wei-Jen/ASGC: ntr
- Christian/NDGF: ntr
- Lisa/FNAL: ntr
- Rob/OSG: ntr
- Rolf/IN2P3: nta
- Salvatore/CNAF: ntr
- Pavel/KIT [via email]: CVMFS problems are not yet quite understood and under investigation
- Belinda/Storage:
- issue with non-resolving IP address for EOS ALICE over the weekend, now fixed but a more robust fix is being developed (this is a known issue)
- EOS ATLAS crashed this morning, fixed by restarting, investigations ongoing
- Vitor/Grid: ntr
- Ivan/Dashboard: ntr
- MariaD/GGUS: Due to personal circumstances in the GGUS team, the GGUS release planned for this Wednesday, 2013/09/25, is postponed. The new release date will be decided as soon as possible and then announced.
AOB: none
Thursday
Attendance:
- local: AndreaV/SCOD, Jan/Storage, Ken/CMS, AleDG/ATLAS, Stefan/LHCb, Vitor/Grid, MariaD/GGUS
- remote: Dennis/NLT1, Rolf/IN2P3, Kyle/OSG, Pavel/KIT, Wei-Jen/ASGC, Sang-Un/KISTI, Gareth/RAL, Lisa/FNAL, Jeremy/GridPP, Lucia/CNAF, Francesco/CNAF
Experiments round table:
- ATLAS reports (raw view) -
- RAL FTS3 update: done on the 25th, but it did not go well; a patch was applied today and the situation now seems stable. Comments from RAL are welcome.
- AutoPilotFactory: a problem was observed on the 25th, which led to some sites having a low number of pilots. This has been fixed.
- CMS reports (raw view) -
- Very quiet over the past few days.
- A few problems at KIT, GGUS:97506 and GGUS:97558, looks like CVMFS difficulties, being worked on. [Pavel/KIT: will upgrade to the new CVMFS software version when this is deployable]
- What's the status of the network link to Russia? I don't know, and also don't know how to be informed. Looks like transfer tests still fail to most sites. [AndreaV/SCOD: will follow up with the network team]
- ALICE -
- various T2s have moved or are moving to CVMFS
- [Francesco/CNAF: debugging tape issues for ALICE, found many files corrupted (around 1-2%), ALICE is already aware that we are working on this]
- LHCb reports (raw view) -
- Main activity is MC productions
- Fall incremental stripping campaign will be launched this coming Monday, 30 Sep (see also WLCG Operations Coordination meeting, 19 Sep)
- T0: ntr
- T1:
- RAL: one disk server down, technicians are looking into the issue, at the moment the files are declared unreachable within LHCb [Gareth/RAL: we hope to give back the disk server later today, presently doing some checksumming to ensure that there is no data corruption]
Sites / Services round table:
- Christian/NDGF [via email]: we've had a fibre break in Copenhagen this morning. It has affected the OPN connection to Norway. Some ATLAS and ALICE data affected. Clusters are trying to find alternate connections. The cable provider estimates that the link will be restored at 21:00 CEST.
- Dennis/NLT1: site problems with mass storage system, cannot write to tape, hopefully should be solved by tomorrow morning
- Rolf/IN2P3: downtime finished successfully, now at 50% of SLC6, and ALICE is also using CVMFS like the other experiments
- Kyle/OSG: ntr
- Pavel/KIT: performing CREAM update to EMI3, three CREAMs are down, this will be completed on Monday
- Wei-Jen/ASGC: ntr
- Sang-Un/KISTI: ntr
- Gareth/RAL: ntr
- Lisa/FNAL: ntr
- Jeremy/GridPP: ntr
- Lucia/CNAF: ntr
- AndreaV/SCOD: Edoardo Martelli clarified the issue with external network SLS reported by CMS on Monday. USLHCNet has decommissioned some devices in the US that were no longer relevant to the network for CERN and the SLS status. IT-CS is working with USLHCNet to define the exact devices to be monitored: until this is clarified, a (fake) degradation of external network connectivity is expected to show up in SLS.
- Vitor/Grid: ntr
- Jan/Storage:
- there was a short interruption yesterday for EOSCMS, fixed immediately
- a problem was also found on EOSCMS due to software version mismatches, no data loss, doing some cleanup
- planning Castor upgrades on October 7 and then the following week, will be announced
- MariaD/GGUS: yesterday's GGUS release was postponed, but if people feel that they still need some features released before CHEP, then please get in touch with me
AOB:
- AndreaV/SCOD: please be aware that the CERN OpenDays will take place this weekend.