Week of 121008
Daily WLCG Operations Call details
To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
- Dial +41227676000 (Main) and enter access code 0119168, or
- To have the system call you, click here
- The scod rota for the next few weeks is at ScodRota
WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments
General Information
Monday
Attendance: local(AndreaS, Luc/ATLAS, Stefan/LHCb, Jarka, Massimo, Maarten, Jerome, Ian/CMS, MariaD);remote(Onno/NL-T1, Michael/BNL, Gonzalo/PIC, Tiju/RAL, Lisa/FNAL, Dimitri/KIT, Rob/OSG, Zeeshan/NDGF).
Experiments round table:
- ATLAS reports -
- T0/WLCG
- ALARM GGUS:86788 & INC:175534: LSF bsub time too big. Solved.
- ALARM GGUS:86883 & INC:175883: Pending jobs in dedicated cluster. Solved.
- CERN GGUS:86778 & INC:175450: Get error. Problem with the CERN-PROD_TMPDISK token used to store ESD before migrating to tape. To be included in setprodpath.
- FTS bug: all sites should deploy the fix (today PIC, RAL, TRIUMF); see GGUS:81844. Mail sent to all clouds.
- T1
- NDGF-T1 GGUS:86770: Functional test transfers to DATADISK failing. FTS overwrite option issue?
- SARA GGUS:86889: Transfers from SARA to CA failing with "available CRL expired". FTS bug suspected; TRIUMF has installed the patch. Ongoing.
- sr #132764: Transfer failures because the "Robot: ATLAS Data Management" proxy expired at Lyon & TRIUMF. Linked to the FTS bug, see bug #98002.
- Maarten: the FTS developers are aware of the bug and are looking further into it; the VOMS developers are also involved. Hopefully an rpm solving all these problems will be released in the near future.
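Several of the items above come down to a proxy expiring unnoticed (the FTS delegation bug, the "Robot: ATLAS Data Management" proxy at Lyon and TRIUMF). As an illustration only, a minimal check of the remaining proxy lifetime can be scripted as below; it assumes a standard VOMS client providing voms-proxy-info on the PATH, and the 6-hour warning threshold is an arbitrary choice, not part of any experiment's actual tooling.

# Minimal sketch (illustrative only): warn when the current VOMS proxy is
# close to expiry, the failure mode behind the "proxy expired" errors above.
# Assumes 'voms-proxy-info' is available; the 6-hour threshold is arbitrary.
import subprocess
import sys

WARN_THRESHOLD_SECONDS = 6 * 3600

def proxy_time_left():
    """Return the remaining proxy lifetime in seconds, or 0 if none/invalid."""
    try:
        out = subprocess.check_output(["voms-proxy-info", "-timeleft"])
    except (OSError, subprocess.CalledProcessError):
        return 0
    try:
        return int(out.decode().strip())
    except ValueError:
        return 0

if __name__ == "__main__":
    left = proxy_time_left()
    if left <= 0:
        print("ERROR: no valid proxy - transfers would fail with 'expired proxy'")
        sys.exit(2)
    if left < WARN_THRESHOLD_SECONDS:
        print("WARNING: proxy expires in %.1f hours - renew it soon" % (left / 3600.0))
        sys.exit(1)
    print("OK: proxy still valid for %.1f hours" % (left / 3600.0))

Note that such a check only covers the locally created proxy; the credentials delegated inside FTS itself are what the patch mentioned above addresses.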
- CMS reports -
- LHC / CMS
- CERN / central services and T0
- Filled Tier-0 job slots with the combination of production and test Tier-0 systems. Load will be high for the next few days
- Tier-1/2:
- The downtime calendar used by CMS shows a downtime for CERN until October 25. During the meeting it turned out that this is due to a long downtime declared for a rolling upgrade of the CREAM CEs. According to Maarten, this type of intervention should be totally transparent and a downtime is not necessary.
- ALICE reports -
- CERN: EOS was not working from around midnight Saturday evening until around 08:00 Sunday morning, when the head node was rebooted
- LHCb reports -
- Reprocessing at T1s and "attached" T2 sites
- User analysis at T0/T1 sites
- Prompt reconstruction at CERN + 2 attached T2s
- MC productions if resources available
- New GGUS (or RT) tickets
- T0:
- T1:
Sites / Services round table:
- BNL: ntr
- FNAL: ntr
- KIT: downtime foreseen for October 22, for maintenance work on a network component. It should last from 05:00 to 07:30.
- NDGF: FTS transfer issue for ATLAS (see ATLAS ticket above)
- NL-T1: ntr
- PIC: last weekend the CMS transfers were heavily affected by the FTS bug. Fixed today.
- RAL: ntr
- OSG: ntr
- CERN batch and grid services: ntr
- CERN storage: CASTORCMS being upgraded; LHCb EOS was upgraded earlier today; EOS for ALICE is down. Another ALICE problem this morning: a node could not be turned on.
- Dashboards: ntr
AOB:
Tuesday
Attendance: local (AndreaV, Luc, Jerome, Massimo, Nicolo, Jarka, MariaD, Eva); remote (Saverio/CNAF, Xavier/KIT, Michael/BNL, Lisa/FNAL, Ronald/NLT1, Jhen-Wei/ASGC, Rolf/IN2P3, Gareth/RAL, Rob/OSG, Zeeshan/NDGF; Joel/LHCb).
Experiments round table:
- ATLAS reports -
- T0/WLCG
- problem with releasing poolfile catalog to /afs. Solved by Arne. [Luc: what should we do in case of such problems outside working hours, send an ALARM? Massimo: yes you can send an ALARM, then the operators will follow this up. Note that there is no formal piquet for AFS (like for CASTOR), but the operators have a list of phone numbers of the experts and they will call them one after the other, also during the night]
- CERN GGUS:86778 & INC:175450: Get error. Problem with the CERN-PROD_TMPDISK token used to store ESD before migrating to tape. To be included in setprodpath. [Massimo: is the action on ATLAS for this GGUS ticket? Luc: yes, the action is now on ATLAS]
- T1
- SARA GGUS:86889: Transfers from SARA to CA failing with "available CRL expired". Fixed (FTS & FTA reconfigured with YAIM at TRIUMF).
- CMS reports -
- LHC / CMS
- CERN / central services and T0
- Filled Tier-0 job slots with the combination of production and test Tier-0 systems. Load will be high for the next few days
- Tier-1:
- FNAL: 2 files lost, retransferring: SAV:132819
- RAL: FTS delegation issue, tomcat restarted, SAV:132812 and GGUS:86775
[Maarten: the FTS developers recommend that all sites deploy the latest FTS patch. If the patch is deployed, these issues can be solved simply by a restart; if the patch is not installed, a non-trivial cleanup is needed in these cases]
- Tier-2:
- T2_US_Wisconsin: MC production failing, looks like a black hole node: SAV:132837
- T2_EE_Estonia: Power cut caused data loss (and correspondingly MC merge job failure): SAV:132839 and SAV:132846
- ALICE reports -
- CERN: since ~11:00 CEST all accesses to files on the ALICE_DISK CASTOR pool should be going via EOS, which will redirect a request to CASTOR if it does not have the file; so far this seems to be working OK
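For illustration, the redirect-on-miss scheme just described can be summarised as in the toy sketch below; all class and function names are invented for this sketch and do not correspond to the real EOS or CASTOR client interfaces, where the redirection is handled server-side.

# Illustrative sketch of the "EOS first, fall back to CASTOR" access scheme
# described above. All names here are hypothetical; the real redirection is
# done inside the storage services, not in client code.
class StorageEndpoint(object):
    """Toy stand-in for a storage service that may or may not hold a file."""
    def __init__(self, name, files):
        self.name = name
        self.files = set(files)

    def has(self, path):
        return path in self.files

def resolve_replica(path, primary, fallback):
    """Return the endpoint a client should read from: the primary if it has
    the file, otherwise the fallback (as EOS redirects to CASTOR)."""
    if primary.has(path):
        return primary
    return fallback

if __name__ == "__main__":
    eos = StorageEndpoint("EOS ALICE", ["/alice/data/run1.root"])
    castor = StorageEndpoint("CASTOR ALICE_DISK", ["/alice/data/run1.root",
                                                   "/alice/data/run0.root"])
    for f in ["/alice/data/run1.root", "/alice/data/run0.root"]:
        print("%s -> served by %s" % (f, resolve_replica(f, eos, castor).name))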
- LHCb reports -
- Reprocessing at T1s and "attached" T2 sites
- User analysis at T0/T1 sites
- Prompt reconstruction at CERN + 2 attached T2s
- MC productions if resources available
- T0:
- T1:
- GRIDKA: Input data resolution problem (GGUS:86720) and staging problems (GGUS:80794)
- [Joel: many thanks to RAL for constantly keeping us informed via Twitter! It would be nice if all T1s did the same. Gareth: thanks for the appreciation]
Sites / Services round table:
- Saverio/CNAF: ntr
- Xavier/KIT: ntr
- Michael/BNL: ntr
- Lisa/FNAL: ntr
- Ronald/NLT1: ntr
- Jhen-Wei/ASGC: ntr
- Rolf/IN2P3: ntr
- Gareth/RAL: a CASTOR upgrade for LHCb had been planned for today, but it was postponed because yesterday we noticed a small issue in ATLAS, following the upgrade done two weeks ago, which we still do not completely understand; apologies to LHCb
- Rob/OSG: ntr
- Zeeshan/NDGF: ntr
- Massimo/Storage:
- did two CASTOR interventions, one with downtime for ALICE and a transparent one for LHCb; will do one for CMS with a 3-hour downtime on Thursday [Nicolo: is the EOS intervention for CMS on Thursday confirmed? Massimo: yes]
- [AndreaV: was the AFS problem yesterday only affecting ATLAS? We noticed some AFS problems also for the LCG nightlies software. Massimo: cannot say yet, these issues are still being followed up]
- Eva/Databases: ongoing transparent intervention on the storage of the RAC holding ATLAS, LHCb offline and PDBR
- Jerome/Grid: ntr
- Jarka/Dashboard: ntr
- MariaD/GGUS: ntr
AOB: none
Wednesday
Attendance: local(AndreaS, Stefan, MariaD, LucaC, LucaM, Jerome, Alessandro, Jarka, Nicolò);remote(Salvatore/CNAF, Lisa/FNAL, Ron/NL-T1, Rob/OSG, Zeeshan/NDGF, Rolf/IN2P3-CC, Jhen-Wei/ASGC, Pavel/KIT).
Experiments round table:
- ATLAS reports -
- T0/T1
- EOS upgrade taking longer than expected
Ale: discussing the possibility of partially rolling back the upgrade, as ATLAS has some tight deadlines affected by the EOS downtime.
LucaM: a CASTOR upgrade is also foreseen for tomorrow: is that a problem?
Ale: no, that impacts different workflows, which are not as critical as those relying on EOS.
- CMS reports -
- LHC / CMS
- CERN / central services and T0
- Filled Tier-0 job slots with the combination of production and test Tier-0 systems. Load will be high for the next few days, enabling spillover to public queues.
- Tier-1:
- Replicating data from RAL to PIC, KIT and FNAL to run reprocessing
- PIC: one potentially corrupt input file causing job failures, SAV:132864
- Tier-2:
- T2_TR_METU: MC jobs and SAM tests were failing due to issues on the CE, fixed: SAV:132837 and SAV:132785
- T2_FR_IPHC: MC jobs failing, probably due to a file access problem, SAV:132869
- ALICE reports -
- CERN: EOS-ALICE instabilities due to very high numbers of concurrent clients and requests; we thank the EOS team for their ongoing debugging and tuning efforts!
LucaM: we lowered the number of concurrent opens so as not to overload the system.
- LHCb reports -
- Reprocessing at T1s and "attached" T2 sites
- User analysis at T0/T1 sites
- Prompt reconstruction at CERN + 2 attached T2s
- MC productions if resources available
- New GGUS (or RT) tickets
- T0:
- T1:
- GRIDKA: Input data resolution problem (GGUS:86720) has gone down, staging errors (GGUS:87061) have gone down, staging efficiency not high enough (GGUS:80794)
Sites / Services round table:
- ASGC: ntr
- CNAF: ntr
- FNAL: ntr
- IN2P3-CC: ntr
- KIT: ntr
- NDGF: ntr
- NL-T1: ntr
- RAL: ntr
- OSG: ntr
- CERN batch and grid services: ntr
- CERN storage: ntr
- Dashboard: ntr
- Databases: ntr
- GGUS:
- Is ASGC involved in GStat support via GGUS? We have the email address gstat-support@lists.grid.sinica.edu.tw for this Support Unit, but tickets have gone unanswered for months or years. If yes, we'd be grateful for any assistance in making this line more responsive. Tickets concerned: GGUS:64388, GGUS:84420, GGUS:82608, GGUS:81461.
[Jhen-Wei will check with his colleagues why these tickets were not answered.]
- Comments required from the Tier0 service managers: now that GGUS and SNOW Requests are fully interfaced, we'd like to include them in the monthly test ALARMs on the release date. We suggest replacing the ALICE test ALARM (currently an Incident) with a Request. How would you like the email notification in this case? See how we do the TEAM-to-ALARM upgrade test in Savannah:132626#comment3. [Ale suggests using TEAM tickets for the test]
AOB:
Thursday
Attendance: local(AndreaS, Alexandre, Massimo, Alessandro, Jerome, Nicolò, MariaD, Eva, Stefan);remote(Rolf/IN2P3-CC, WooJin/KIT, Ronald/NL-T1, Kyle/OSG, Lisa/FNAL, JhenWei/ASGC, Gonzalo/PIC, Gareth/RAL, Zeeshan/NDGF, Salvatore/CNAF).
Experiments round table:
- ATLAS reports -
- T0/T1
- The EOS upgrade was not as smooth as expected. Users report instabilities, though jobs are not affected thanks to retry policies. Thanks to IT-DSS for their support; still, we need to investigate the present situation.
- CASTOR libraries upgrade (INC:177202). We noticed that on the VOBOXes used for building the ATLAS software some new packages were pushed by Quattor, breaking the software compilation. There is a plan to fix it with PH-SFT and IT-DSS, but it will take two weeks to apply the fix on the ATLAS side.
- CMS reports -
- LHC / CMS
- CERN / central services and T0
- Backlog on T0 decreasing.
- Tier-1:
- PIC: one potentially corrupt input file causing job failures, SAV:132864, still no update in the ticket. [Gonzalo: the file's CRC checksum is OK and the file is accessible and has several replicas; still need to test opening it. Will update the ticket. A generic checksum-verification sketch follows after this list]
- CNAF: investigating several corrupt unmerged files, causing merge job failures, SAV:132720
- ASGC: several files stuck in tape migration for up to two weeks, GGUS:87262
- Tier-2:
- T2_FR_IPHC: MC jobs failing, probably due to a file access problem; the site admin reported that storage should be OK now, SAV:132869
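Regarding the potentially corrupt PIC input file above (SAV:132864), a common first check is to recompute the file checksum on an accessible replica and compare it with the value recorded in the catalogue. The sketch below is generic: it assumes an Adler32 checksum (as widely used on the grid), and the file path and reference value are placeholders, not the actual file.

# Generic sketch: recompute a file's Adler32 checksum and compare it to the
# value recorded in the catalogue. Path and reference checksum below are
# placeholders, not the actual SAV:132864 file.
import zlib

def adler32_of_file(path, chunk_size=1024 * 1024):
    """Compute the Adler32 checksum of a file, reading it in chunks."""
    value = 1  # Adler32 starting value
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            value = zlib.adler32(chunk, value)
    return value & 0xffffffff  # force an unsigned 32-bit result

if __name__ == "__main__":
    local_copy = "/tmp/suspect_input_file.root"   # placeholder path
    catalogue_checksum = 0x1a2b3c4d               # placeholder reference value
    computed = adler32_of_file(local_copy)
    if computed == catalogue_checksum:
        print("Checksum matches (%08x): corruption unlikely, try opening the file" % computed)
    else:
        print("Checksum MISMATCH: catalogue %08x vs computed %08x" %
              (catalogue_checksum, computed))

If the checksum matches but the file still cannot be opened, the problem is more likely in the access layer than in the data itself.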
- ALICE reports -
- CERN: we thank IT-DSS for their ongoing efforts to improve the performance of EOS-ALICE, which now also receives a lot of read traffic for conditions data.
- LHCb reports -
- Reprocessing at T1s and "attached" T2 sites
- User analysis at T0/T1 sites
- Prompt reconstruction at CERN + 2 attached T2s
- MC productions if resources available
- New GGUS (or RT) tickets
- T0:
- T1:
- GRIDKA: Staging efficiency not high enough for current reprocessing activities (GGUS:80794)
- CNAF: slowness of the file system causing FTS failures and jobs not being able to download their input data, fixed now
Sites / Services round table:
- ASGC: the tickets about GStat (see yesterday's report) are being updated now. In future, we will make sure they get automatically redirected to our internal ticketing system.
- CNAF: ntr
- FNAL: ntr
- IN2P3: ntr
- KIT: The LHCOPN link to IN2P3 went down yesterday morning; it was fixed 30 minutes ago.
- NDGF: ntr
- NL-T1: ntr
- PIC: On Tuesday evening, during a consistency check, we accidentally deleted 200,000 ATLAS files (SAV:98070). ATLAS was informed and is taking action.
- RAL: ntr
- OSG: working on a potential attachment synchronisation problem with GGUS
- CERN batch and grid services: ntr
- CERN storage services
- This week we upgraded EOSALICE and EOSATLAS; the latter was unusable until 04:30, and this morning it was down for 30 minutes due to a restart; still investigating. An issue causing slowdowns for ALICE was understood and fixed. For this reason, we cancelled the EOSCMS upgrade foreseen for today; it will be done next week if possible.
- CASTOR is down for a scheduled upgrade, and the CASTOR client was upgraded yesterday. Issues with it were not expected because it had been tested on LXPLUS. If necessary, it will be downgraded on the relevant ATLAS VOBOXes.
- Dashboards: ntr
- Databases: ntr
- GGUS:
AOB:
Friday
Attendance: local(AndreaS, AlexandreB, Alessandro, Jerome, Stefan, Maarten, Massimo);remote(Stefano/CMS, Salvatore/CNAF, Lisa/FNAL, Jeremy, Rolf/IN2P3-CC, Xavier/KIT, Oscar/NDGF, Ronald/NL-T1, Dimitrios/RAL, Kyle/OSG).
Experiments round table:
- ATLAS reports -
- T0/T1
- CERN-PROD ALARM GGUS:87285: CASTOR disk server HW problem. Still to be understood whether the disk server will come back online today. [Massimo: the vendor is looking into it; the machine is now up but we cannot yet connect to it; if the problem is a broken disk, files might have been lost. Alessandro: it would not be a disaster, as lost data can be regenerated]
- ATLAS DDM functional tests had not been working since Monday. The problem is now understood and fixed (INC:177415): to create the new datasets ATLAS submits LSF jobs, and those were landing on SLC6 WNs.
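The fix above essentially ensures that the dataset-creation jobs land on worker nodes with the expected OS. As an illustration only, LSF supports this through a resource-requirement string at submission time; the resource tag, queue name and wrapper below are assumptions for the sketch, not the actual ATLAS DDM configuration.

# Illustrative sketch: submit an LSF job constrained to a given OS flavour,
# so that it does not land on (e.g.) SLC6 worker nodes by accident.
# The resource string "type==SLC5_64" and the queue name are assumptions,
# not the actual CERN/ATLAS configuration.
import subprocess

def submit_on_slc5(command, queue="8nh"):
    """Submit 'command' via bsub, restricted to SLC5 nodes (hypothetical tag)."""
    bsub_cmd = [
        "bsub",
        "-q", queue,                    # target queue (assumed name)
        "-R", "select[type==SLC5_64]",  # LSF resource requirement (assumed tag)
        command,
    ]
    return subprocess.call(bsub_cmd)

if __name__ == "__main__":
    rc = submit_on_slc5("/path/to/create_functional_test_datasets.sh")
    print("bsub exit code: %d" % rc)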
- CMS reports -
- Note: CMS Computing Run Coordinator change: Stefano takes over from Nicolò as of noon today
- LHC / CMS
- Machine development, then physics running at full luminosity from Saturday morning.
- CERN / central services and T0
- Tier-1:
- FNAL: temporary failure in SAM tests; the squid test was unable to load the local config, now OK, SAV:132922
- Tier-2:
- ALICE reports -
- CERN: the EOS-ALICE conditions data read load was reduced by temporarily re-enabling the old SE from which the data were copied to EOS; improvements on the EOS side should already allow switching back, probably on Monday morning
- LHCb reports -
- Reprocessing at T1s and "attached" T2 sites
- User analysis at T0/T1 sites
- Prompt reconstruction at CERN + 2 attached T2s
- MC productions if resources available
- New GGUS (or RT) tickets
- T0:
- CERN: cleaning TMPDIR on lxbatch (GGUS:86039) [Stefan: the number of failures caused by this problem has decreased to an almost negligible level]
- T1:
- GRIDKA: Staging efficiency not high enough for current reprocessing activities (GGUS:80794), data access problems for jobs reading from the tape cache (GGUS:87318). [Xavier: we have not made progress on this and cannot expect to for the rest of the year, but at least we can solve critical issues very fast. An overview is in GGUS:87061. Alessandro: in view of the reprocessing campaign due to start in two weeks, ATLAS would need to know the maximum sustainable data rate from tape and possibly prestage well in advance; see the back-of-the-envelope sketch after this list]
- IN2P3: "buffer" disk space not migrated fast enough to tape storage (GGUS:87293
), risked to run full on disk storage, mitigated by moving user space into disk space and increasing of transfer rate, FTS transfers failing because of "expired proxy" (GGUS:87321
)
- CNAF: "buffer" disk space increasing, transfer rate to tape storage increased
Sites / Services round table:
- CNAF: the SIR for the LHCb storage outage will be ready next week
- FNAL: ntr
- IN2P3-CC: ntr
- KIT: ntr
- NDGF: ntr
- NL-T1: the SARA tape system is in downtime, so some files on tape are not available.
- RAL: ntr
- OSG: ntr
- CERN batch and grid services: ntr
- CERN storage services: ntr
- Dashboards: ntr
AOB:
--
JamieShiers - 18-Sep-2012