Week of 150413
WLCG Operations Call details
- At CERN the meeting room is 513 R-068.
- For remote participation we use the Vidyo system. Instructions can be found here.
General Information
- The purpose of the meeting is:
- to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
- to announce or schedule interventions at Tier-1 sites;
- to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
- to provide important news about the middleware;
- to communicate any other information considered interesting for WLCG operations.
- The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
- The SCOD rota for the next few weeks is at ScodRota
- General information about the WLCG Service can be accessed from the Operations Web
- Whenever a particular topic needs to be discussed at the daily meeting and requires information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch, to make sure that the relevant parties have the time to collect the required information or to invite the right people to the meeting.
Monday
Attendance:
- local: Alessandro (ATLAS), Belinda (storage), Lorena (databases), Maarten (SCOD + ALICE), Raja (LHCb)
- remote: Alexey (PIC), Di (TRIUMF), Dimitri (KIT), Felix (ASGC), John (RAL), Lisa (FNAL), Onno (NLT1), Rob (OSG), Rolf (IN2P3), Sang-Un (KISTI)
Experiments round table:
- ATLAS reports
- Central Services
- Nothing urgent to report: a couple of squids were down at 2 Tier-1s; GGUS tickets were opened and the services were restarted.
- The DDM load has not yet been moved back to the FTSes at RAL; this will be done in the next hours/days.
- John: various experts will be occupied at CHEP
- Alessandro: we are in communication via e-mail and this week would be OK for them; we will follow up offline
- LHCb reports
- User and MC jobs are running in the system. The failure rate of the jobs is very low.
- CERN & Tier-1s running fine
- Problem of failing LHCb jobs solved promptly by RAL admins last Friday evening
- FTS transfer problems to two Tier-2 sites: Grif (GGUS:112985), Nipne (GGUS:112044)
- Alessandro: are the issues due to the FTS? if so, we would be interested
- Raja: one looks due to the network, the other due to the local SE
- VOMS-Admin issues waiting for solution from VOMS admins: GGUS:112282 and GGUS:112279
- Maarten: the first ticket is waiting for a new release, I will "ping" the second
Sites / Services round table:
- ASGC: ntr
- BNL:
- CNAF:
- FNAL: ntr
- GridPP:
- IN2P3:
- batch system outage for upgrade: draining starting on Sun Apr 26, remaining jobs killed Mon morning Apr 27, service back at noon
- JINR:
- KISTI: ntr
- KIT: ntr
- NDGF:
- NL-T1:
- last week the disk storage was expanded to the new pledge level
- the computing cluster was expanded with 100 nodes of 24 cores each and 8 GB RAM per core
- a Squid service went down during the weekend and was restarted OK
- NRC-KI:
- OSG:
- issues with the latest monthly WLCG accounting report are being followed up, a few sites have been ticketed
- PIC:
- running with 70% of the CPU capacity; the remainder is waiting for full recovery of the power and cooling infrastructure
- RAL:
- the CASTOR upgrade to 2.1.14-15 that was canceled last week will happen this Wed
- TRIUMF: ntr
- CERN batch and grid services:
- CERN storage services:
- CASTOR-ATLAS upgrade to 2.1.15-3 tomorrow 09:00-12:00 (the other 3 experiments already have it)
- Databases: ntr
- GGUS:
- Grid Monitoring:
- MW Officer:
AOB:
Thursday
Attendance:
- local: Alessandro (ATLAS), Andrea (MW Officer), Lorena (databases), Maarten (SCOD + ALICE), Raja (LHCb)
- remote: Alexey (PIC), Dennis (NLT1), Di (TRIUMF), Felix (ASGC), Jens (NDGF), Kyle (OSG), Lisa (FNAL), Matteo (CNAF), Pavel (KIT), Rolf (IN2P3), Sang-Un (KISTI), Tiju (RAL), Tommaso (CMS)
Experiments round table:
- ATLAS reports
- Central Services
- CASTOR intervention done yesterday OK; the EOS namespace was also restarted (quota setting issue, now solved)
- FTS3 at RAL was put back into preproduction: https://fts3-test.gridpp.rl.ac.uk:8446 serves 237 DDM endpoints, fts3.cern.ch 244, and fts.usatlas.bnl.gov 138.
- CMS reports
- all quiet, not much activity at the sites (waiting to launch RECO for the 2015 MC samples)
- the EU redirector Xrootd.ba.infn.it had problems twice during the week: a daemon crashed due to "too many threads" (but apparently still below the configured limit). Being investigated with the developers. It should have no effect on readiness (it gets a "Warning"), but for 2 sites (T2_HU_Budapest and T2_IN_TIFR) it became a "Critical"; not sure why, investigating.
- reached the quota limit on the EOS unmerged area, since the T0 now also writes there. Understanding the situation and setting new quotas.
- some sites are complaining about "hot files" being requested much more than the average. This is understood at our SW level: the first pile-up file is always opened, in order to read its metadata. We are trying to find solutions (see the sketch after this report).
- some small issues reported:
- GGUS:113049 (CERN) Castor-T0CMS instance in bad shape for ~24 hours, then recovered. Not sure we understood why.
- GGUS:113036 T0->T1 tests were failing for a few days, which turned all the T1s red for readiness. Mostly corrected, apart from 2 days still red which will need another correction.
- GGUS:112942 (CERN) ARGUS not answering correctly for some DNs. Being followed up; for the moment it is cured by adding servers, but the real issue is not understood (as far as I understand).
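A hypothetical illustration of the "hot file" effect mentioned in the CMS report above: if every job opens the first file of a shared pile-up list to read metadata, that single file receives the requests of all jobs, whereas a randomized choice would spread the load. The file names, list size and the randomization are illustrative assumptions only, not the actual CMS software or its adopted solution.

import random
from collections import Counter

# Hypothetical shared pile-up dataset: 100 files (names are made up).
pileup_files = ["pileup_%03d.root" % i for i in range(100)]

def metadata_file_always_first(files):
    # Behaviour described in the report: every job opens the FIRST
    # pile-up file to read metadata, so that one file becomes "hot".
    return files[0]

def metadata_file_randomized(files):
    # One conceivable mitigation (an assumption, not the adopted fix):
    # read the metadata from a randomly chosen file to spread the load.
    return random.choice(files)

# Simulate 10000 jobs and count metadata opens per file.
jobs = 10000
hot = Counter(metadata_file_always_first(pileup_files) for _ in range(jobs))
spread = Counter(metadata_file_randomized(pileup_files) for _ in range(jobs))
print("always-first busiest file:", hot.most_common(1))    # 10000 opens of one file
print("randomized busiest file:  ", spread.most_common(1)) # roughly 100 opens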
- LHCb reports
- Mostly user and MC jobs in the system.
- FTS transfer problems to Nipne continuing (GGUS:112044)
- VOMS-Admin issues waiting for solution from VOMS admins / devs: GGUS:112282 and GGUS:112279
- Raja: thanks to RAL for the smooth CASTOR upgrade on Wed
Sites / Services round table:
- ASGC: ntr
- BNL:
- CNAF: ntr
- FNAL: ntr
- GridPP:
- IN2P3: ntr
- JINR:
- KISTI: ntr
- KIT: ntr
- NDGF: ntr
- NL-T1: ntr
- NRC-KI:
- OSG: ntr
- PIC:
- CMS opened a ticket against PIC because of slow response times experienced by jobs:
- there was too much activity by CMS, running at 1.5 times their pledge plus doing many transfers etc.
- the ticket was therefore bounced back to CMS support
- RAL:
- the CASTOR upgrade on Wed went OK
- TRIUMF:
- next Wed from 17:00 to 21:00 UTC dCache will be upgraded to 2.10.24 and other services will be upgraded too
- CERN batch and grid services:
- CERN storage services:
- Databases: ntr
- GGUS:
- Grid Monitoring:
- MW Officer:
AOB:
--
AndreaSciaba - 2015-02-27