Week of 150615
WLCG Operations Call details
- At CERN the meeting room is 513 R-068.
- For remote participation we use the Vidyo system. Instructions can be found here.
General Information
- The purpose of the meeting is:
- to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
- to announce or schedule interventions at Tier-1 sites;
- to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
- to provide important news about the middleware;
- to communicate any other information considered interesting for WLCG operations.
- The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
- The SCOD rota for the next few weeks is at ScodRota
- General information about the WLCG Service can be accessed from the Operations Web
- Whenever a particular topic needs to be discussed at the daily meeting and requires information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch, to make sure that the relevant parties have the time to collect the required information or to invite the right people to the meeting.
Monday
Attendance:
- local: Ilija (ATLAS), Jerome (batch and grid services), Maarten (SCOD + ALICE), Massimo (storage), Prasanth (databases), Stefan (LHCb), Xavi (storage)
- remote: Christian (NDGF), Christoph (CMS), Felix (ASGC), Kyle (OSG), Lisa (FNAL), Michael (BNL), Onno (NLT1), Preslav (CMS), Rolf (IN2P3), Sang-Un (KISTI), Sonia (CNAF), Tiju (RAL), Vladimir (LHCb)
Experiments round table:
- ATLAS reports (raw view) -
- FTS upgrade to fix the issue with StoRM: all the FTS servers have been upgraded, thanks a lot!
- CERN-PROD: the high failure rate was understood and fixed thanks to quick interactions with CERN-IT DSS. GGUS:114293 was reopened today, as the fix may not have been complete.
- CMS reports (raw view) -
- File access problems at CCIN2P3: GGUS:114343
- Rolf: that issue concerns our T2
- Possible file corruption issues at CERN EOS: GGUS:114304 (see the checksum-verification sketch after this report)
- Massimo: for different experiments we have seen a strong correlation with the network incident of June 11
- Some files seem not to migrate to CASTOR tape at CERN: GGUS:114282
- File transfer issues from FNAL to RAL: GGUS:114275
- Tape staging test started at CERN: GGUS:114283 (for information logging)
- Any news regarding the P5-Wigner network link?
- Xavi: this is still being followed up by the network experts
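- Regarding the possible EOS file corruption above (GGUS:114304), a minimal sketch of how suspect files could be cross-checked once copied out of EOS, assuming adler32 checksums are used and the catalogue values are known; the file path and checksum value below are placeholders, not taken from the ticket:

    # Minimal sketch: compare a local copy's adler32 checksum against the
    # value recorded in the catalogue (hypothetical file list and values).
    import zlib

    # (path, expected adler32 as 8-digit hex) -- placeholder entries
    suspect_files = [
        ("/tmp/copied_from_eos/file1.root", "0a1b2c3d"),
    ]

    def adler32_hex(path, chunk_size=1024 * 1024):
        """Compute the adler32 checksum of a file, reading it in chunks."""
        value = 1  # adler32 seed value
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                value = zlib.adler32(chunk, value)
        return format(value & 0xFFFFFFFF, "08x")

    for path, expected in suspect_files:
        actual = adler32_hex(path)
        status = "OK" if actual == expected else "MISMATCH (possibly corrupted)"
        print(f"{path}: catalogue={expected} local={actual} -> {status}")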
Sites / Services round table:
- ASGC: ntr
- BNL: ntr
- CNAF: ntr
- FNAL: ntr
- GridPP:
- IN2P3:
- reminder: downtime tomorrow
- JINR:
- KISTI: ntr
- KIT:
- NDGF:
- downtime tomorrow 10:00-16:00 CEST for dCache upgrades
- NL-T1:
- this morning the SARA squid service crashed and was restarted
- there was a DNS issue affecting the computing cluster at SARA; it has been fixed, but the issue might not fully be gone until all previously cached information has expired (see the DNS TTL sketch after this report)
- the NIKHEF farm seemed rather quiet, maybe due to the squid problem?
- after the meeting: ALICE had 2k jobs running throughout the day...
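- As an illustration of the DNS caching point above, a minimal sketch that asks a caching resolver for a record and prints the TTL it reports; once the TTL of a stale cached entry has counted down to zero, the corrected record should be served. The hostname and resolver address are placeholders:

    # Sketch: show the TTL currently reported by a (caching) resolver for a host.
    # Requires the "dig" utility; hostname and resolver below are placeholders.
    import subprocess

    HOSTNAME = "wn001.example.sara.nl"   # hypothetical worker node name
    RESOLVER = "127.0.0.1"               # hypothetical local caching resolver

    result = subprocess.run(
        ["dig", f"@{RESOLVER}", "+noall", "+answer", HOSTNAME, "A"],
        capture_output=True, text=True, check=True,
    )
    # Each answer line looks like: "<name> <ttl> IN A <address>"
    for line in result.stdout.splitlines():
        fields = line.split()
        if len(fields) >= 5:
            name, ttl, _, rtype, addr = fields[:5]
            print(f"{name} -> {addr} (record type {rtype}, cached TTL {ttl}s)")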
- NRC-KI:
- OSG: ntr
- PIC: ntr
- RAL:
- reminder: network maintenance Wed afternoon
- TRIUMF:
- CERN batch and grid services: ntr
- CERN storage services:
- tomorrow new HW will be added; this should be transparent, but could potentially affect any experiment
- Databases:
- tomorrow the CMS DB firewall rules will be moved from DB triggers to iptables
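- As a rough illustration (not the actual CMS configuration), a minimal sketch of the kind of host-level rules that replace DB-trigger based filtering, assuming the standard Oracle listener port 1521 and a placeholder client network:

    # Sketch: host-level firewall rules of the kind that replace DB-trigger
    # based filtering. Networks below are placeholders, not the real CMS setup.
    import subprocess

    ORACLE_LISTENER_PORT = "1521"          # standard Oracle listener port
    ALLOWED_SOURCE = "10.0.0.0/8"          # hypothetical allowed client network

    rules = [
        # accept Oracle listener traffic from the allowed network
        ["iptables", "-A", "INPUT", "-p", "tcp", "-s", ALLOWED_SOURCE,
         "--dport", ORACLE_LISTENER_PORT, "-j", "ACCEPT"],
        # drop Oracle listener traffic from anywhere else
        ["iptables", "-A", "INPUT", "-p", "tcp",
         "--dport", ORACLE_LISTENER_PORT, "-j", "DROP"],
    ]

    for rule in rules:
        subprocess.run(rule, check=True)   # needs root privileges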
- GGUS:
- Grid Monitoring:
- MW Officer:
AOB:
Thursday
Attendance:
- local: Ilija (ATLAS), Stefan (SCOD), (MW), Oliver Gutsche (CMS), Sang-Un (KISTI), Massimo (Storage), Jerome (Grid Services), Maarten (ALICE)
- remote: Felix (ASGC), Michael (BNL), Sonia (CNAF), Lisa (FNAL), Rolf (IN2P3), Thomas (KIT), Christian (NDGF), Dennis (NL-T1), Rob (OSG), John (RAL)
Experiments round table:
- ATLAS reports (raw view) -
- Central Services
- Large backlog of FTS transfers from CERN to BNL, at the moment 30k files. We raised the number of active transfers to 30, 60 and now 100, but the rate is still hovering around 1 GB/s. We could possibly go to 200, but it is not clear whether that would help. We have contacted the site (see the queue-monitoring sketch after this report).
- SARA - some directories (and the data in them) need to be deleted as central deletion fails: GGUS:114430
- RAL - transfers with RAL as destination fail due to a full DATADISK: GGUS:114391
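- To make the backlog discussion above more concrete, a small sketch that counts queued transfers on the CERN-to-BNL link via the FTS3 REST interface; the server URL, storage endpoints, query parameters and certificate paths are assumptions and should be checked against the FTS3 REST documentation before use:

    # Sketch: count queued file transfers on one FTS link via the FTS3 REST API.
    # The server URL, query parameters and certificate paths are assumptions.
    import requests

    FTS_SERVER = "https://fts3.cern.ch:8446"           # hypothetical FTS3 endpoint
    CERT = ("/tmp/x509up_u1000", "/tmp/x509up_u1000")  # proxy cert/key (placeholder)

    params = {
        "source_se": "srm://srm-eosatlas.cern.ch",     # placeholder source SE
        "dest_se": "srm://dcsrm.usatlas.bnl.gov",      # placeholder destination SE
        "state_in": "SUBMITTED",                       # queued-but-not-started jobs
    }

    response = requests.get(f"{FTS_SERVER}/jobs", params=params,
                            cert=CERT, verify="/etc/grid-security/certificates")
    response.raise_for_status()
    jobs = response.json()
    print(f"Queued FTS jobs CERN -> BNL: {len(jobs)}")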
- CMS reports (raw view) -
- The rolling network intervention on Tuesday June 16th was successful, thanks. It did not cause any problems.
- Some files seem not to migrate to CASTOR tape at CERN: GGUS:114282
- needs to be reassigned to Castor support!
- File transfer issues from FNAL to RAL: GGUS:114275
- on hold for now; the suspicion is that the slowness coincides with CMS jobs on the RAL farm causing heavy WAITIO on the storage nodes; this cannot be confirmed right now because the load has decreased (see the iowait sketch after this report)
- Lisa: regarding transfers from FNAL to RAL, the ticket was updated and the slowness is back. John: will look into it soon
- Tape staging test at CERN ongoing: GGUS:114283
- Intermittent EOS read issues from CMS Tier-0: GGUS:113389
- Any news regarding the P5-Wigner network link?
- Massimo: a report on this will be given in the WLCG Ops Coordination meeting today at 15:30
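- As an illustration of the WAITIO observation above, a minimal sketch that samples /proc/stat on a storage node and reports the fraction of CPU time spent in iowait; the sampling interval is arbitrary and the sketch is not part of the original report:

    # Sketch: estimate the fraction of CPU time spent waiting on I/O ("iowait")
    # from two samples of /proc/stat on a Linux host.
    import time

    def read_cpu_times():
        """Return the aggregate CPU counters from the first line of /proc/stat."""
        with open("/proc/stat") as f:
            fields = f.readline().split()
        # fields: ["cpu", user, nice, system, idle, iowait, irq, softirq, ...]
        return [int(v) for v in fields[1:]]

    before = read_cpu_times()
    time.sleep(5)                      # sample interval
    after = read_cpu_times()

    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    iowait_fraction = deltas[4] / total if total else 0.0   # index 4 = iowait
    print(f"iowait over the last 5s: {iowait_fraction:.1%}")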
- LHCb reports (raw view) -
- Validating data processing workflow
- T1
- RAL - problem with mapping for a few users; already fixed.
Sites / Services round table:
- ASGC: NTR
- BNL: NTR
- CNAF: NTR
- FNAL: NTR
- GridPP: NR
- IN2P3: Downtime yesterday went well. Some problems on Monday with electrical equipment; some worker nodes were affected, solved by now.
- JINR: NR
- KISTI: HW maintenance on the network link to Amsterdam this morning 07:00 - 08:00, which was transparent.
- KIT: NTR
- NDGF: NTR
- NL-T1: request from SARA to ATLAS concerning a GGUS ticket requesting the removal of several directories; SARA would like confirmation from ATLAS for those. Ilija: will come back to you
- NRC-KI: NR
- OSG: multicore accounting has been fixed and confirmation was received that the correct numbers are now in the APEL system. The previous few months will now be updated with the correct numbers
- PIC: NR
- RAL: a CMS disk server was unavailable this morning; it should be back tomorrow morning
- TRIUMF: NR
- CERN batch and grid services:
- CERN storage services: Last Tuesday there was a transparent upgrade of services. Yesterday there was an incident: all CASTOR DBs went down and needed to be rebooted. Chasing files corrupted in connection with the June 11 network incident.
- Databases: NR
- GGUS:
- Release scheduled for the 24th of June. Downtime announced on GOCDB. The service might not be available during the intervention. More info about the release is available.
- Grid Monitoring: NR
- MW Officer: NTR
AOB: