Week of 140421
WLCG Operations Call details
- At CERN the meeting room is 513 R-068.
- For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
- Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
- To have the system call you, click here
- In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.
General Information
- The SCOD rota for the next few weeks is at ScodRota
- General information about the WLCG Service can be accessed from the Operations Web
Monday: Easter Monday holiday
- The meeting will be held on Tuesday instead.
Tuesday
Attendance:
- local: Felix (ASGC), Maarten (SCOD), Maria A (WLCG), Ulrich (CERN grid services)
- remote: Alexander (NLT1), Antonio (CNAF), Kyle (OSG), Lisa (FNAL), Oliver (CMS), Rolf (IN2P3), Sang-Un (KISTI), Tiju (RAL), Xavier (KIT)
Experiments round table:
- CMS reports
- Processing/production situation
- Queues pretty much full with 13 TeV preparation MC production and digitization/reconstruction
- EOS trouble on April 19th
- Jan caught it on our support list before I could even open a ticket (thanks, Jan!)
- Lockup of EOSCMS namespace? Solved by restart?
- CERN: xrootd trouble on April 21st
- JINR: xrootd trouble
- a file was announced as being served by JINR but was not accessible afterwards
- GGUS:104810
- ALICE
- KIT
- network usage by jobs was OK over the long weekend
- the job cap was removed this morning and the number of concurrent jobs rose from 1k to 3k+ without incident
Sites / Services round table:
- ASGC - ntr
- CNAF - ntr
- FNAL - ntr
- IN2P3 - ntr
- KISTI - ntr
- KIT - nta
- NLT1
- yesterday morning one dCache pool node crashed; it was restarted this morning and is running OK again
- OSG
- GGUS:104806 opened for documentation on adding, removing or changing resources in REBUS
- Maria: pledges or resources published by the BDII?
- Kyle: site names
- Maria: REBUS has its own support unit, so the ticket just needs to get routed there
- Maarten: will check after the meeting
- the ticket has been routed OK, but the expert may be unavailable this week
- RAL - ntr
- CERN grid services
- this morning an AFS intervention went wrong, leaving local jobs unable to obtain AFS tokens; the problem was fixed shortly after noon
- we plan to ramp down the number of CEs submitting to the remaining SLC5 resources; further announcements will follow
AOB:
Thursday
Attendance:
- local: David (CMS), Felix (ASGC), Maarten (SCOD), Marcin (databases), Maria A (WLCG), Stefan (LHCb), Xavi (CERN storage)
- remote: Dennis (NLT1), Jeremy (GridPP), Kyle (OSG), Lisa (FNAL), Michael (BNL), Pepe (PIC), Rolf (IN2P3), Sang-Un (KISTI), Sonia (CNAF), Thomas (KIT), Tiju (RAL), Ulf (NDGF), Xavier (KIT)
Experiments round table:
- ATLAS reports
- Central services/T0
- T1
- FZK-LCG2: file transfers failing with an SRM_GET_TURL error on the TURL request (GGUS:104864)
- RAL-LCG2: file transfer errors with "SRM_FILE_UNAVAILABLE". RAL admins reported a problem with a disk server (ELOG:49033)
- CMS reports
- Rather quiet two days -- generally trying to keep things busy with 13 TeV MC production and accompanying reco, with legacy reprocessing of MC to complete Run 1 analyses
- Rereco of 2011 HI data nearly complete -- successful use of CERN HLT and AI resources for reprocessing.
- "Nearly complete" -- waiting on ~dozen files inaccessible in CERN EOS INC:538717
(open)
- GGUS:104827 (closed): transfer errors between Beijing & FNAL -- sites need to remember to use SHA-1 certificates until FNAL upgrades dCache to allow SHA-2 in a few weeks (a short sketch for checking a certificate's signature algorithm follows this report)
- SAV:143092 (open): the HPSS outage at IN2P3 last week is affecting outgoing file transfers this week
- Rolf: GGUS:104842 appears to be about the same issue?
- David: looks like it
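As a side note on the SHA-1/SHA-2 item above: a minimal sketch of how a site could check which hash algorithm a certificate is signed with, assuming the third-party Python "cryptography" package is available; the default path below is only an illustrative example, not a prescribed location.

```python
# Sketch (not part of the CMS report): print the signature hash algorithm of an
# X.509 certificate, e.g. to see whether it is SHA-1 or SHA-2 (sha256/sha512) signed.
# Requires the third-party "cryptography" package; the default path is an example only.
import sys

from cryptography import x509
from cryptography.hazmat.backends import default_backend


def signature_hash(path):
    """Return the hash name used in the certificate signature, e.g. 'sha1' or 'sha256'."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        cert = x509.load_pem_x509_certificate(data, default_backend())
    except ValueError:
        # Fall back to DER encoding if the file is not PEM.
        cert = x509.load_der_x509_certificate(data, default_backend())
    return cert.signature_hash_algorithm.name


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/etc/grid-security/hostcert.pem"
    print("{0}: signed with {1}".format(path, signature_hash(path)))
```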
- ALICE
- KIT: the firewall and OPN link were saturated between 16:30 and 21:30 yesterday due to a large batch of jobs running an older analysis tag that does not have the patch for using the local SE when possible
- the job cap was lowered to 1.5k for the night, then lifted again this morning
- LHCb reports
- Incremental Stripping campaign finished - many thanks to all T1 sites!
- MC simulation, Working Group production and user jobs.
- T0:
- Migration of user data from CASTOR to EOS started on Tuesday
- Xavi: ~100k out of a few M files still to be done
- T1:
- GridKa: some 102 input files for the stripping were not accessible to jobs (GGUS:104852). Fixed Wed evening and the jobs succeeded.
Sites / Services round table:
- ASGC - ntr
- BNL - ntr
- CNAF - ntr
- FNAL - ntr
- GridPP
- do we need to be concerned about concurrent major downtimes at T1 sites?
- after some discussion and correction of erroneous information, the conclusion was that such clashes should be avoided as far as reasonably possible, but that there currently seems to be no cause for concern
- major interventions usually are announced well in advance and (where feasible) agreed with the experiments concerned
- over the past many years there has rarely, if ever, been a clash that caused a big nuisance
- sometimes the T0 has been down, with a much bigger impact, and we just dealt with it
- IN2P3 - nta
- KISTI - ntr
- KIT
- the ATLAS ticket has been solved
- NDGF
- high ALICE traffic observed, but the system is coping OK
- Maarten: probably due to increased activity in preparation for Quark Matter 2014 (May 19-24); furthermore, the issue that was observed for KIT can in principle also happen at other sites
- Maria: you were going to open a ticket about better values for GLUE Validator limits?
- NLT1
- ATLAS opened 2 tickets:
- GGUS:102716 about slow transfers from NIKHEF to BNL: we did some tuning and are now waiting for a reply
- GGUS:104769 about SARA transfer failures: the cause does not seem to be at SARA
- yesterday the dCache SRM door was down for ~1 hour; this has happened before and has been reported to the developers
- OSG - ntr
- PIC
- our multi-core queue is seeing significant use by ATLAS and CMS; we have re-tuned the scheduler and will report on our observations in the Multi-core WG
- RAL
- the ATLAS ticket has been solved
- on Tue April 29 there will be a site outage for network maintenance, 07:00-17:00 local time
- CERN storage - nta
- databases - ntr
AOB: