Week of 190826
WLCG Operations Call details
- For remote participation we use the Vidyo system. Instructions can be found here.
General Information
- The purpose of the meeting is:
- to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
- to announce or schedule interventions at Tier-1 sites;
- to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
- to provide important news about the middleware;
- to communicate any other information considered interesting for WLCG operations.
- The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
- The SCOD rota for the next few weeks is at ScodRota
- General information about the WLCG Service can be accessed from the Operations Portal
- Whenever a particular topic that requires information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-scod@cern.ch, so that the SCOD can make sure the relevant parties have time to collect the required information, or invite the right people to the meeting.
Best practices for scheduled downtimes
Monday
Attendance:
- local: Olga (Computing), Maarten (ALICE), Michal (ATLAS), Julia (WLCG), Vladimir (LHCb), Miro (DB, Chair)
- remote: Di (TRIUMF), Xavier (KIT), Mike (ASGC), Xin (BNL), Marcelo (CNAF), Sang-Un (KISTI), Jens (NDGF), Dave (FNAL), Christoph (CMS)
Experiments round table:
- ATLAS reports
- Activities:
- pilot2/singularity migration converging, remaining sites/queues handled one by one
- ongoing reprocessing campaign, i.e. intense tape staging
- sub-optimal staging performance at PIC and FZK, to be followed up with the sites
- Issues
- Transfers from INFN-T1 tapes were failing with "Communication error on send" (GGUS:142805)
- Transfers from IN2P3-CC tapes were failing with "Changing file state because request state has changed" (GGUS:142818)
- FTS channel was saturated
- Transfers from BNL-ATLAS are failing with "File is unavailable" (GGUS:142841)
- urgent jobs stuck at ANALY_BNL_MCORE
- the queue has only one running job
- email thread with US people started
- CMS reports
- Reprocessing of Run2 data is ongoing
- Will involve restaging of RAW from tape
- LHCb reports
- Activity:
- MC, user jobs and data restripping.
- Massive staging at all T1s
- Issues:
- RAL:
- GGUS:142350; still issues accessing files on ECHO. Under investigation.
- CNAF:
Sites / Services round table:
- ASGC:
- Downtime for DPM DB update (T1/T2):
- 2019-08-26T01:00:00 ~ 2019-08-26T10:00:00 (UTC)
- Finished smoothly; the downtime is over and the site is starting to receive jobs again
- BNL: Recent issues with the dCache Chimera name server caused intermittent transfer failures; under investigation (GGUS:142841)
- CNAF: Back in action with full electrical continuity since Wednesday afternoon
- Issues with the main router on restart. This router was already scheduled to be replaced at the beginning of September
- StoRM issues mentioned by LHCb and ATLAS (also in tickets) seemed to be related to the router issue
- EGI: NC
- FNAL: NTR
- IN2P3: NC
- JINR: dCache and Enstore upgraded to the latest version (5.2.3)
- KISTI: NTR
- KIT:
- Reminder: Nearline storage outage from Sep 16th till Sep 27th for all VOs!
- Reminder: Online storage outage on Sep 17th for all VOs (GOCDB:27575). We will try and keep the batch farm operative, so you still may send us jobs and work with remote storage at other sites.
- NDGF: NTR
- NL-T1: NC
- NRC-KI: NC
- OSG: NC
- PIC: NC
- RAL: NC
- TRIUMF: NTR
- CERN computing services:
- BDII service degraded over the weekend. Issue now resolved. (OTG:0051959)
- Reduced capacity in HTCondor next Monday/Tuesday (OTG:0051883) due to planned intervention (OTG:0051379).
- CERN storage services: NTR
- CERN databases: NTR
- GGUS: NTR
- Monitoring: NTR
- MW Officer: NTR
- Networks: NTR
- Security: NTR
AOB: