Week of 121008

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local(AndreaS, Luc/ATLAS, Stefan/LHCb, Jarka, Massimo, Maarten, Jerome, Ian/CMS, MariaD);remote(Onno/NL-T1, Michael/BNL, Gonzalo/PIC, Tiju/RAL, Lisa/FNAL, Dimitri/KIT, Rob/OSG, Zeeshan/NDGF).

Experiments round table:

  • ATLAS reports -
    • T0/WLCG
      • ALARM GGUS:86788 & INC:175534 LSF bsub time too big. Solved.
      • ALARM GGUS:86883 & INC:175883 Pending jobs in dedicated cluster. Solved.
      • CERN GGUS:86778 & INC:175450 Get error. Problem with the CERN-PROD_TMPDISK token used to store ESD before migrating to tape. To be included in setprodpath.
      • FTS bug: all sites should deploy the fix (today: PIC, RAL, Triumf); see GGUS:81844. Mail sent to all clouds.
    • T1
      • NDGF-T1 GGUS:86770 Functional test transfers to DATADISK failing. FTS overwrite option issue?
      • SARA GGUS:86889 Transfers from SARA to the CA cloud failing with "available CRL expired". FTS bug suspected; Triumf has installed the patch. Ongoing.
      • sr #132764 Transfer failures because the "Robot: ATLAS Data Management" proxy expired at Lyon & Triumf. Linked to the FTS bug, see bug #98002.
    • Maarten: the FTS developers are aware of the bug and are looking further into it; the VOMS developers are also involved. Hopefully an RPM solving all these problems will be released in the near future. (A minimal proxy-lifetime check related to this symptom is sketched below.)
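
A minimal sketch, not from the meeting, of how a shifter or site admin might watch for the "proxy expired" symptom mentioned above before transfers start failing: it assumes voms-proxy-info is available on the host and simply warns when the remaining lifetime of the local proxy (the one used for FTS delegation) drops below a hypothetical threshold.

    import subprocess
    import sys

    # Hypothetical threshold: warn when less than 2 hours of proxy lifetime remain.
    MIN_SECONDS = 2 * 3600

    def proxy_time_left():
        """Return the remaining lifetime of the current VOMS proxy, in seconds."""
        out = subprocess.run(["voms-proxy-info", "-timeleft"],
                             capture_output=True, text=True, check=True)
        return int(out.stdout.strip())

    if __name__ == "__main__":
        left = proxy_time_left()
        if left < MIN_SECONDS:
            print("WARNING: proxy expires in %d s - renew and re-delegate" % left)
            sys.exit(1)
        print("Proxy OK, %d s left" % left)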

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • Filled Tier-0 job slots with the combination of production and test Tier-0 systems. Load will be high for the next few days
    • Tier-1/2:
      • NTR
    • The downtime calendar used by CMS shows a downtime for CERN until October 25. During the meeting it turned out to be due to a long downtime declared for a rolling upgrade of the CREAM CEs. According to Maarten, this type of intervention should be completely transparent and does not require a downtime.

  • ALICE reports -
    • CERN: EOS was not working from around midnight on Saturday night until around 08:00 on Sunday morning, when the head node was rebooted

  • LHCb reports -
    • Reprocessing at T1s and "attached" T2 sites
    • User analysis at T0/T1 sites
    • Prompt reconstruction at CERN + 2 attached T2s
    • MC productions if resources available
    • New GGUS (or RT) tickets
    • T0:
    • T1:

Sites / Services round table:

  • BNL: ntr
  • FNAL: ntr
  • KIT: downtime foreseen for October 22, for maintenance work on a network component. It should last from 0500 to 0730.
  • NDGF: FTS transfer issue for ATLAS (see ATLAS ticket above)
  • NL-T1: ntr
  • PIC: last weekend the CMS transfers were heavily affected by the FTS bug. Fixed today.
  • RAL: ntr
  • OSG: ntr
  • CERN batch and grid services: ntr
  • CERN storage: CASTORCMS is being upgraded; LHCb EOS was upgraded earlier today; EOS for ALICE is down. Another ALICE problem this morning: a node could not be turned on.
  • Dashboards: ntr

AOB:

Tuesday

Attendance: local (AndreaV, Luc, Jerome, Massimo, Nicolo, Jarka, MariaD, Eva); remote (Saverio/CNAF, Xavier/KIT, Michael/BNL, Lisa/FNAL, Ronald/NLT1, Jhen-Wei/ASGC, Rolf/IN2P3, Gareth/RAL, Rob/OSG, Zeeshan/NDGF; Joel/LHCb).

Experiments round table:

  • ATLAS reports -
    • T0/WLCG
      • Problem with releasing the PoolFileCatalog to /afs. Solved by Arne. [Luc: what should we do in case of such problems outside working hours, send an ALARM? Massimo: yes, you can send an ALARM and the operators will follow it up. Note that there is no formal piquet for AFS (as there is for CASTOR), but the operators have a list of phone numbers of the experts and will call them one after the other, also during the night]
      • CERN GGUS:86778 & INC:175450 Get error. Problem with the CERN-PROD_TMPDISK token used to store ESD before migrating to tape. To be included in setprodpath. [Massimo: is the action on ATLAS for this GGUS ticket? Luc: yes, the action is now on ATLAS]
    • T1
      • SARA GGUS:86889 Transfers failing from SARA to CA with available CRL expired. Fixed (FTS & FTA reconfigured with YAIM at Triumf).

  • CMS reports -
    • LHC / CMS
      • Machine development
    • CERN / central services and T0
      • Filled Tier-0 job slots with the combination of production and test Tier-0 systems. Load will be high for the next few days
    • Tier-1:
      • FNAL: 2 files lost, retransferring: SAV:132819
      • RAL: FTS delegation issue, tomcat restarted SAV:132812 and GGUS:86775 [Maarten: the FTS developers recommend that all sites deploy the latest FTS patch. If the patch is deployed, these issues can be solved simply by a restart; if the patch is not installed, a non-trivial cleanup is needed in these cases]
    • Tier-2:
      • T2_US_Wisconsin: MC production failing, looks like a black hole node: SAV:132837
      • T2_EE_Estonia: Power cut caused data loss (and correspondingly MC merge job failure): SAV:132839 and SAV:132846

  • ALICE reports -
    • CERN: since ~11:00 CEST all accesses to files on the ALICE_DISK CASTOR pool should be going via EOS, which will redirect a request to CASTOR if it does not have the file; so far this seems to be working OK

  • LHCb reports -
    • Reprocessing at T1s and "attached" T2 sites
    • User analysis at T0/T1 sites
    • Prompt reconstruction at CERN + 2 attached T2s
    • MC productions if resources available
    • T0:
    • T1:
      • GRIDKA: Input data resolution problem (GGUS:86720) and staging problems (GGUS:80794)
      • [Joel: many thanks to RAL for constantly keeping us informed via Twitter! It would be nice if all T1s did the same. Gareth: thanks for the appreciation]

Sites / Services round table:

  • Saverio/CNAF: ntr
  • Xavier/KIT: ntr
  • Michael/BNL: ntr
  • Lisa/FNAL: ntr
  • Ronald/NLT1: ntr
  • Jhen-Wei/ASGC: ntr
  • Rolf/IN2P3: ntr
  • Gareth/RAL: a CASTOR upgrade for LHCb had been planned for today, but it was postponed because yesterday we noticed a small issue in ATLAS after the upgrade two weeks ago, which we still do not completely understand. Apologies to LHCb.
  • Rob/OSG: ntr
  • Zeeshan/NDGF: ntr

  • Massimo/Storage:
    • did two CASTOR interventions, one with downtime for ALICE and a transparent one for LHCb; will do one for CMS with a 3-hour downtime on Thursday [Nicolo: is the EOS intervention for CMS on Thursday confirmed? Massimo: yes]
    • [AndreaV: was the AFS problem yesterday only affecting ATLAS? We noticed some AFS problems also for the LCG nightlies software. Massimo: cannot say yet, these issues are still being followed up]
  • Eva/Databases: ongoing transparent intervention on the storage of the RAC holding ATLAS, LHCb offline and PDBR
  • Jerome/Grid: ntr
  • Jarka/Dashboard: ntr
  • MariaD/GGUS: ntr

AOB: none

Wednesday

Attendance: local(AndreaS, Stefan, MariaD, LucaC, LucaM, Jerome, Alessandro, Jarka, Nicolò);remote(Salvatore/CNAF, Lisa/FNAL, Ron/NL-T1, Rob/OSG, Zeeshan/NDGF, Rolf/IN2P3-CC, Jhen-Wei/ASGC, Pavel/KIT).

Experiments round table:

  • ATLAS reports -
    • T0/T1
      • EOS upgrade taking longer than expected

Ale: discussing the possibility to partially roll back the upgrade, as ATLAS has some tight deadlines affected by the EOS downtime.
LucaM: a CASTOR upgrade is also foreseen for tomorrow: is that a problem?
Ale: no, that impacts different workflows, which are not as critical as those relying on EOS.

  • CMS reports -
    • LHC / CMS
      • Machine development
    • CERN / central services and T0
      • Filled Tier-0 job slots with the combination of production and test Tier-0 systems. Load will be high for the next few days, enabling spillover to public queues.
    • Tier-1:
      • Replicating data from RAL to PIC, KIT and FNAL to run reprocessing
      • PIC: one potentially corrupt input file causing job failures, SAV:132864
    • Tier-2:
      • T2_TR_METU: MC jobs and SAM tests were failing due to issues on the CE; fixed: SAV:132837 and SAV:132785
      • T2_FR_IPHC: MC jobs failing, probably due to a file access problem: SAV:132869

  • ALICE reports -
    • CERN: EOS-ALICE instabilities due to very high numbers of concurrent clients and requests; we thank the EOS team for their ongoing debugging and tuning efforts!

LucaM: we lowered the number of file opens so as not to overload the system.

  • LHCb reports -
    • Reprocessing at T1s and "attached" T2 sites
    • User analysis at T0/T1 sites
    • Prompt reconstruction at CERN + 2 attached T2s
    • MC productions if resources available
    • New GGUS (or RT) tickets
    • T0:
    • T1:
      • GRIDKA: Input data resolution problems (GGUS:86720) have gone down and staging errors (GGUS:87061) have gone down, but the staging efficiency is still not high enough (GGUS:80794)

Sites / Services round table:

  • ASGC: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3-CC: ntr
  • KIT: ntr
  • NDGF: ntr
  • NL-T1: ntr
  • RAL: ntr
  • OSG: ntr
  • CERN batch and grid services: ntr
  • CERN storage: ntr
  • Dashboard: ntr
  • Databases: ntr

  • GGUS:
    • Is ASGC responsible for GStat support via GGUS? We have the email address gstat-support@lists.grid.sinica.edu.tw for this Support Unit, but tickets have not been answered for months or years. If yes, we'd be grateful for any assistance in getting this line more responsive. Tickets concerned: GGUS:64388, GGUS:84420, GGUS:82608, GGUS:81461.
      [Jhen-Wei will check with his colleagues why these tickets were not answered.]
    • Requires comments from the Tier-0 service managers: now that GGUS and SNOW Requests are fully interfaced, we'd like to include them in the monthly test ALARMs on release date. We suggest replacing the ALICE test ALARM (currently an Incident) with a Request. How would you like the email notification in this case? See how we do the TEAM-to-ALARM upgrade test in Savannah:132626#comment3. [Ale suggests using TEAM tickets for the test]

AOB:

Thursday

Attendance: local(AndreaS, Alexandre, Massimo, Alessandro, Jerome, Nicolò, MariaD, Eva, Stefan);remote(Rolf/IN2P3-CC, WooJin/KIT, Ronald/NL-T1, Kyle/OSG, Lisa/FNAL, JhenWei/ASGC, Gonzalo/PIC, Gareth/RAL, Zeeshan/NDGF, Salvatore/CNAF).

Experiments round table:

  • ATLAS reports -
    • T0/T1
      • EOS upgrade was not as smooth as expected. Users report instabilities, though jobs are not affected thanks to retry policies. Thanks to IT-DSS for their support; we still need to investigate the present situation.
      • CASTOR libraries upgrade (INC:177202). We noticed that on the VOBOXes used for building the ATLAS software some new packages were pushed by Quattor, breaking the software compilation. There is a plan to fix this with PH-SFT and IT-DSS, but it will take two weeks to apply the fix on the ATLAS side.

  • CMS reports -
    • LHC / CMS
      • Machine development
    • CERN / central services and T0
      • Backlog on T0 decreasing.
    • Tier-1:
      • PIC: one potentially corrupt input file causing job failures, SAV:132864, still no update in the ticket. [Gonzalo: the file CRC checksum is OK and the file is accessible and has several replicas; we still need to test opening it. Will update the ticket] (a checksum-verification sketch follows this report)
      • CNAF: investigating several corrupt unmerged files, causing merge job failures, SAV:132720
      • ASGC: several files stuck in tape migration for up to two weeks, GGUS:87262
    • Tier-2:
      • T2_FR_IPHC: MC jobs failing, probably for file access problem - site admin reported that storage should be OK now, SAV:132869
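
A minimal sketch, assuming an Adler32-style checksum as commonly used in WLCG storage bookkeeping, of the kind of verification Gonzalo describes above: compute the checksum of a locally accessible copy of the suspect file and compare it with the catalogue value. The file name and reference checksum are hypothetical placeholders.

    import zlib

    def adler32_of(path, chunk_size=1024 * 1024):
        """Compute the Adler32 checksum of a file, reading it in chunks."""
        value = 1  # Adler32 seed value
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                value = zlib.adler32(chunk, value)
        return value & 0xFFFFFFFF

    if __name__ == "__main__":
        # Hypothetical local copy of the suspect file and its catalogue checksum.
        local_copy = "suspect_file.root"
        catalogue_checksum = "89abcdef"
        computed = "%08x" % adler32_of(local_copy)
        status = "match" if computed == catalogue_checksum.lower() else "MISMATCH"
        print("computed %s, catalogue %s -> %s" % (computed, catalogue_checksum, status))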

  • ALICE reports -
    • CERN: we thank IT-DSS for their ongoing efforts to improve the performance of EOS-ALICE, which now also receives a lot of read traffic for conditions data.

  • LHCb reports -
    • Reprocessing at T1s and "attached" T2 sites
    • User analysis at T0/T1 sites
    • Prompt reconstruction at CERN + 2 attached T2s
    • MC productions if resources available
    • New GGUS (or RT) tickets
    • T0:
    • T1:
      • GRIDKA: Staging efficiency not high enough for current reprocessing activities (GGUS:80794)
      • CNAF: slowness of the file system caused FTS failures and prevented jobs from downloading their input data; fixed now

Sites / Services round table:

  • ASGC: the tickets about Gstat (see yesterday's report) are being updated now. For the future, will make sure they get automatically redirected to our internal ticketing system.
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: The LHCOPN link to IN2P3 went down yesterday morning and it was fixed 30' ago.
  • NDGF: ntr
  • NL-T1: ntr
  • PIC: On Tuesday evening, during a consistency check, we accidentally deleted 200,000 ATLAS files (SAV:98070). ATLAS was informed and is taking action.
  • RAL: ntr
  • OSG: working on a potential attachment synchronisation problem with GGUS
  • CERN batch and grid services: ntr
  • CERN storage services
    • This week we upgraded EOSALICE and EOSATLAS; the latter was unusable until 04:30, and this morning it was down for 30' due to a restart; still investigating. An issue causing slowdowns for ALICE was understood and fixed. For this reason we cancelled the EOSCMS upgrade foreseen for today; it will be done next week if possible.
    • CASTOR is down for a scheduled upgrade and the CASTOR client was upgraded yesterday. Issues with it were not expected because it had been tested on LXPLUS. If necessary, it will be downgraded on the relevant ATLAS VOBOXes.
  • Dashboards: ntr
  • Databases: ntr
  • GGUS:

AOB:

Friday

Attendance: local(AndreaS, AlexandreB, Alessandro, Jerome, Stefan, Maarten, Massimo);remote(Stefano/CMS, Salvatore/CNAF, Lisa/FNAL, Jeremy, Rolf/IN2P3-CC, Xavier/KIT, Oscar/NDGF, Ronald/NL-T1, Dimitrios/RAL, Kyle/OSG).

Experiments round table:

  • ATLAS reports -
    • T0/T1
      • CERN-PROD ALARM GGUS:87285: CASTOR disk server HW problem. Still to be understood whether the disk server will come back online today or not. [Massimo: the vendor is looking into it; the machine is now up but we cannot yet connect to it; if the problem is a broken disk, files might have been lost. Alessandro: it would not be a disaster, as lost data can be regenerated]
      • ATLAS DDM functional tests had not been working since Monday. The problem is now understood and fixed (INC:177415): to create the new datasets ATLAS sends LSF jobs, and those were landing on SLC6 WNs.

  • CMS reports -
    • Note: CMS Computing Run Coordinator change: Stefano takes over from Nicolò as of noon today
    • LHC / CMS
      • Machine development, then physics running at full luminosity from Saturday morning.
    • CERN / central services and T0
      • Backlog on T0 cleared.
    • Tier-1:
      • FNAL: temporary failure in SAM tests - squid test unable to load local config, now OK SAV:132922
    • Tier-2:
      • NTR

  • ALICE reports -
    • CERN: EOS-ALICE conditions data read load was reduced by temporarily re-enabling the old SE from which the data were copied to EOS; improvements on the EOS side should already allow switching back - probably Monday morning

  • LHCb reports -
    • Reprocessing at T1s and "attached" T2 sites
    • User analysis at T0/T1 sites
    • Prompt reconstruction at CERN + 2 attached T2s
    • MC productions if resources available
    • New GGUS (or RT) tickets
    • T0:
      • CERN: cleaning TMPDIR on lxbatch (GGUS:86039) [Stefan: the number of failures caused by this problem has decreased to an almost negligible level]
    • T1:
      • GRIDKA: Staging efficiency not high enough for current reprocessing activities (GGUS:80794); data access problems for jobs reading from the tape cache (GGUS:87318) [Xavier: we have not made progress on this and cannot expect to for the rest of the year, but at least we can solve critical issues very fast. An overview is in GGUS:87061. Alessandro: in view of the reprocessing campaign due to start in two weeks, ATLAS would need to know the maximum sustainable data rate from tape and possibly prestage well in advance] (a back-of-envelope prestaging estimate is sketched after this list)
      • IN2P3: "buffer" disk space not migrated to tape storage fast enough (GGUS:87293); the disk storage risked filling up, mitigated by moving user space into the disk space and by increasing the transfer rate. FTS transfers failing because of an "expired proxy" (GGUS:87321)
      • CNAF: "buffer" disk space usage increasing; the transfer rate to tape storage was increased

Sites / Services round table:

  • CNAF: the SIR for the LHCb storage outage will be ready next week
  • FNAL: ntr
  • IN2P3-CC: ntr
  • KIT: ntr
  • NDGF: ntr
  • NL-T1: the SARA tape system is in downtime, so some files on tape are not available.
  • RAL: ntr
  • OSG: ntr
  • CERN batch and grid services: ntr
  • CERN storage services: ntr
  • Dashboards: ntr

AOB:

-- JamieShiers - 18-Sep-2012
