Week of 130218
Daily WLCG Operations Call details
To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
- Dial +41227676000 (Main) and enter access code 0119168, or
- To have the system call you, click here
- The scod rota for the next few weeks is at ScodRota
WLCG Availability, Service Incidents, Broadcasts, Operations Web
General Information
Monday
Attendance: local(Alex B, Belinda, Maarten, Stefan, Steve, Xavier);remote(Boris, Elizabeth, Gareth, Lisa, Michael, Onno, Rolf, Torre, Wei-Jen).
Experiments round table:
- ATLAS reports -
- T0/Central services
- Tier1s
- Sat pm: transfer failures to Taiwan. Attributed by the site to busy disk servers; OK again and ticket closed Sun night. GGUS:91581
- Sun pm: Source errors in transfers from TRIUMF-LCG2 and other CA sites. FTS cannot contact non-CA FTS servers. Site is working on it. GGUS:91588
- Tier 2 calibration centers
- Sun am: ES CALIBDISK failures of functional test transfers; SRM down at IFIC, all file transfers failing. Failure in one RAID group, now offline; Lustre and SRM being restored. GGUS:91586
- FYI: ATLAS AMOD(s) for this and next week not yet identified.
- LHCb reports -
- Ongoing activity as before: reprocessing, some prompt-processing, MC and user jobs.
- T0: NTR
- T1:
Sites / Services round table:
- ASGC
- ATLAS and CMS jobs affected by the CVMFS 2.0.19 cache filling up due to a known bug; for now mitigated by manual cleanups (see the sketch after this round table); should be fixed in the upcoming 2.1.7 release expected in a few days
- BNL - ntr
- FNAL - ntr
- IN2P3 - ntr
- NDGF - ntr
- NLT1
- during the weekend there was 1 dCache pool node stuck, restarted yesterday night
- OSG - ntr
- RAL
- some ongoing issues with the batch system not starting enough jobs, being investigated
- dashboards - ntr
- GGUS
- NB!!! The Italian Tier1 needs to update the host certificate for their ticketing system (ticketing.cnaf.infn.it). The change will be on Wednesday 2013/02/20 around 9:30am CET. A short interruption of the interface with GGUS may be noticed, as the server needs to be rebooted. Details in Savannah:135912
- grid services - ntr
- storage
- during the weekend EOS-LHCb was unstable; after SW updates earlier today its behavior looks smoother in the monitoring
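The minutes do not record how the manual CVMFS cache cleanups at ASGC were done; the sketch below is a hypothetical illustration only, assuming a node-local cron-style check that wipes the cache with the standard cvmfs_config wipecache command once it grows past a threshold. The cache path and the limit are assumptions, not ASGC's actual settings.

```python
#!/usr/bin/env python
# Hypothetical mitigation sketch for a CVMFS cache that keeps filling up:
# wipe the local cache once it exceeds an assumed size limit.
import os
import subprocess

CACHE_DIR = "/var/lib/cvmfs"   # default CVMFS cache location (assumption)
LIMIT_BYTES = 20 * 1024 ** 3   # assumed 20 GB soft limit, not a real site setting

def cache_size(path):
    """Sum the size of all files below the cache directory."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # a cache file may vanish while we scan
    return total

if __name__ == "__main__":
    if cache_size(CACHE_DIR) > LIMIT_BYTES:
        # wipecache unmounts the repositories and clears the local cache
        subprocess.check_call(["cvmfs_config", "wipecache"])
```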
AOB:
Tuesday
Attendance: local(Alex B, Eva, Maarten, Maria D, Stefan, Steve);remote(Boris, Jeremy, Lisa, Matteo, Michael, Pepe, Rob, Rolf, Ronald, Saverio, Tiju, Wei-Jen, Xavier).
Experiments round table:
- ATLAS reports -
- NTR. Most probably no one from ATLAS can connect today. Sorry.
- LHCb reports -
- Ongoing activity as before: reprocessing (CERN,IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.
- T0: NTR
- T1:
- IN2P3: NAGIOS problem still ongoing at IN2P3 (GGUS:91126).
- Rolf: the ticket is with the SAM team now; we are not aware of changes that might explain why the test works only sometimes
- Stefan: the test is failing randomly, the cause is not yet known
Sites / Services round table:
- ASGC - ntr
- BNL - ntr
- CNAF - ntr
- FNAL - ntr
- GridPP
- the multi-core ATLAS jobs mentioned in the KIT report are PROOF-Lite jobs
- IN2P3
- on March 19 there will probably be an all-day outage for electrical work, details to follow later
- KIT
- our Frontier Squid servers will be updated between 9-10 UTC tomorrow and the day after, should be transparent
- single-core queues have been misused by ATLAS users submitting multi-core jobs, ATLAS are following up
- NDGF
- we have observed transfer errors due to a network problem, being investigated
- NLT1 - ntr
- OSG - ntr
- PIC
- complete downtime on March 26 between 5-19 UTC for electrical maintenance
- RAL - ntr
- dashboards - ntr
- databases - ntr
- GGUS/SNOW
- experiments will no longer be prompted to inform Maria of important tickets that are not making progress, as such tickets can just be included in the experiment reports of the bi-weekly Operations Coordination meeting
- grid services - ntr
AOB:
Wednesday
Attendance: local(Alexei, Belinda, David, Dirk, Luca C, Maarten, Maria D, Massimo, Stefan, Steve);remote(Boris, John, Kyle, Lisa, Matteo, Michael, Pavel, Pepe, Rolf, Ron, Wei-Jen).
Experiments round table:
- ATLAS reports -
- Central services
- GGUS: Problem in opening a TEAM ticket to a specific site: the site parameters haven't been synchronized correctly with GOC DB (GGUS:91634)
- GGUS: A shifter could not open tickets due to an issue with a CRL (GGUS:91610)
- The SLS page for ATLAS HammerCloud has been out of date (last update: 8 Feb 2013): https://sls.cern.ch/sls/service.php?id=HC.ATLAS
- On Feb 12: "the web server that serves the SLS reports is decommissioned and I'm moving the thing to the new one."
- T1s and network
- FZK-LCG2: There are long-standing GGUS tickets for problems in transfers between FZK-LCG2 and UK sites (GGUS:87958, GGUS:91439)
- RRC-KI-T1: ATLAS has started integrating the RU-T1 (RRC-KI-T1) into ATLAS systems. FTS3 servers at RAL and CERN were used for test file transfers.
- Alexei: the prototype T1 will be used in a reprocessing exercise
- Alexei: next week a small reprocessing campaign will run at the T1 sites
- LHCb reports -
- Ongoing activity as before: reprocessing (CERN,IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.
- T0: NTR
- Migration CASTOR -> EOS progressing, estimated to last for another 6 weeks
- T1:
- IN2P3: NAGIOS problem still ongoing at IN2P3 (GGUS:91126); log files of failed SAM probes seem to indicate that the probe is killed by the batch system (logs uploaded to the GGUS ticket)
- Rolf: we have also involved our batch system experts
Sites / Services round table:
- ASGC - ntr
- BNL - ntr
- CNAF
- CREAM ce01-lcg will be down Feb 21-28 for upgrade to EMI-2 on SL6
- FNAL - ntr
- IN2P3 - nta
- KIT
- 3 Frontier Squid servers were upgraded OK today, the remaining 3 will be done tomorrow
- NDGF
- the transfer errors reported yesterday are still being investigated
- tomorrow there will be a short downtime of the SRM head node for security patching; it might even cure the transfer errors
- NLT1 - ntr
- OSG - ntr
- PIC - ntr
- RAL - ntr
- dashboards
- this morning the ATLAS job monitoring dashboard was affected by a DB problem, resulting in the job history having a few small gaps
- databases
- this morning one dashboard application was affected by a change in an Oracle query execution plan, fixed
- GGUS/SNOW
- The host certificate update for the Italian ticketing system (ticketing.cnaf.infn.it), announced on Monday, took place this morning and was successful.
- The next GGUS release will be in a week, on 2013/02/27.
- An interface between GGUS and the Ibergrid RT ticketing system will enter production with next week's GGUS release. The change affects PIC. In case of any problem, please open a ticket against GGUS or comment in Savannah:130314.
- grid services
- there was a problem with the batch system dispatching jobs this morning, fixed
- storage - ntr
AOB:
Thursday
Attendance: local(Alessandro, Alex B, Belinda, Luca M, Maarten, Stefan, Steve, Ueda);remote(Boris, Gareth, Lisa, Marian, Michael, Pepe, Rob, Rolf, Ronald, Saverio, Wei-Jen).
Experiments round table:
- ATLAS reports -
- Central services
- GGUS: Problem in opening a TEAM ticket to a specific site: GGUS:91634 verified: the site parameters hadn't been synchronized correctly with GOC DB/GGUS.
- GGUS: A shifter could not open tickets due to an issue with a CRL. GGUS:91610 in progress: the shifter was advised to temporarily use the account mapped to the certificate.
- SLS for ATLAS HammerCloud unavailable (in grey). Fixed, migration to the new hardware completed.
- T1s and network
- FZK-LCG2 file transfer problems from UK sites: 1340 failures "GRIDFTP_ER.:server err.500" from UKI-SCOTGRID-GLASGOW and 30 failures from UKI-NORTHGRID-LIV-HEP. GGUS:87958 in progress, updated.
- Marian: we are also looking into PerfSONAR measurements
- LHCb reports -
- Ongoing activity as before: reprocessing (CERN, IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.
- T0:
- ALARM ticket (GGUS:91690) for an AFS-hosted web service that is not responding. It serves grid jobs for configuration and setup purposes.
- Stefan: it was due to an accidental DoS by one machine
- Many failures in CASTOR->EOS migration because of different checksums in LFC and CASTOR
- Luca: working on it, please send the list of affected files
- Stefan: OK, ~300 files from 2008, also on tape; the issue is due to the presence or absence of leading zeroes in the checksum (see the sketch after this report)
- T1: NTR
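As a minimal illustration of the leading-zero issue Stefan describes: adler32 checksums are 8-digit hex strings, and if one catalogue stores a value with its leading zero dropped, a naive string comparison reports a mismatch even though the checksums are identical. The function and sample values below are illustrative only, not the actual migration code or the affected files.

```python
# Minimal sketch of the leading-zero checksum issue, with illustrative values:
# zero-padding both hex strings (or comparing them as integers) treats
# '0a1b2c3d' and 'a1b2c3d' as the same adler32 checksum.

def checksums_match(lfc_value, castor_value, width=8):
    """Compare two hex checksum strings, tolerating dropped leading zeroes."""
    return lfc_value.lower().zfill(width) == castor_value.lower().zfill(width)

if __name__ == "__main__":
    assert checksums_match("0a1b2c3d", "a1b2c3d")       # same value, zero dropped
    assert not checksums_match("0a1b2c3d", "1a1b2c3d")  # genuinely different
    print("checksum comparison examples passed")
```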
Sites / Services round table:
- ASGC
- downtime Feb 25 23:00 to Feb 26 18:00 UTC for upgrades of CASTOR, DPM and storage firmware
- BNL - ntr
- CNAF - ntr
- FNAL - ntr
- IN2P3 - ntr
- KIT - nta
- NDGF
- the transfer errors were due to a network problem, things look better now
- today's SRM maintenance went OK
- NLT1
- this morning SARA had an unscheduled outage: dCache was unavailable due to a loose fiber
- OSG - ntr
- PIC - ntr
- RAL - ntr
- dashboards - ntr
- grid services - ntr
- storage
- CASTOR DB NAS will have a HW intervention 17:30-21:30 CET, should be transparent
AOB:
Friday
Attendance: local(AndreaS, Kate, Mike, Steve, Belinda, Stefan);remote(Xavier/KIT, Gareth/RAL, Wei-Jen/ASGC, Onno/NL-T1, Michael/BNL, Matteo/CNAF, Lisa/FNAL, Rolf/IN2P3, Rob/OSG, Pepe/PIC).
Experiments round table:
- ATLAS reports -
- Central services
- GGUS: A shifter could not open tickets due to an issue with a CRL. GGUS:91610 solved: the issue with the certificate was fixed; the shifter is able to open/update GGUS tickets now.
- CERN/VOMS problems affecting ATLAS production and analysis jobs (GGUS:91704, GGUS:91706, GGUS:91710). Thanks to Maarten for the quick action and the ALARM (GGUS:91706).
- ALICE reports -
- CERN: VOMS incident (see below), alarm ticket GGUS:91706 opened yesterday evening ~20:00 [Steve: something similar happened months ago]
- CERN: EOS lost 17 files, of which 12 were dark data
- LHCb reports -
- Ongoing activity as before: reprocessing (CERN,IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.
- T0:
- ALARM ticket (GGUS:91690) for the AFS-hosted web service that was not responding; understood and fixed
- T1: NTR
Sites / Services round table:
- ASGC: had a CASTOR crash this morning
- BNL: ntr
- CNAF: ntr
- FNAL: ntr
- IN2P3: ntr
- KIT: ntr
- NL-T1:
- one dCache pool was stuck for six hours last night and was restarted this morning. This has already happened a few times; we hope that a kernel upgrade will fix the problem.
- This Monday and Tuesday, SURF-SARA will be in maintenance
- PIC: ntr
- RAL: on Tuesday morning we declared an "at risk" downtime to reboot a network switch. The effect should be minimal.
- OSG: ntr
- CERN batch and grid services: VOMS incident, a wrong host certificate was put in place on voms.cern.ch (IncidentVOMSFeb2013). The service broke at 16:10 on Thursday and was restored at 07:00 this morning (Friday). A minimal detection sketch follows after this round table.
- CERN storage services: ntr
- Dashboards: ntr
- Databases: ntr
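The incident was a wrong host certificate served by voms.cern.ch, i.e. clients failed host verification against the service. The probe below is a hypothetical sketch (not CERN's actual monitoring) of how such a mismatch can be caught: let the TLS handshake perform certificate and hostname verification and report any failure. The port number is an assumption.

```python
#!/usr/bin/env python3
# Hypothetical probe, not CERN's actual monitoring: detect a host
# certificate that does not match the service hostname by letting the
# standard library verify the chain and hostname during the TLS handshake.
import socket
import ssl

def host_cert_ok(hostname, port, timeout=10):
    """Return (True, subject) if the presented certificate verifies for
    `hostname`, otherwise (False, error message)."""
    context = ssl.create_default_context()  # verifies chain and hostname
    try:
        with socket.create_connection((hostname, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=hostname) as tls:
                return True, tls.getpeercert().get("subject")
    except (ssl.CertificateError, ssl.SSLError, OSError) as exc:
        return False, str(exc)

if __name__ == "__main__":
    # 8443 is an assumed port for the VOMS admin interface, not confirmed here
    ok, detail = host_cert_ok("voms.cern.ch", 8443)
    print("OK" if ok else "PROBLEM", detail)
```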
AOB: