Week of 150330

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Monday

Attendance:

  • local: Alessandro (ATLAS), Belinda (storage), Iain (grid services), Lisa (FNAL), Maarten (SCOD + ALICE)
  • remote: Christoph (CMS), Di (TRIUMF), Felix (ASGC), Kyle (OSG), Matteo (CNAF), Onno (NLT1), Rolf (IN2P3), Sang-Un (KISTI), Tiju (RAL), Ulf (NDGF)

Experiments round table:

  • CMS -
    • EOS space where CMS Tier-0 applications write ran over quota - GGUS:112726
      • Clearly a mistake by CMS
      • EOS team kindly increased quota for CMS
      • Needs cleanup campaign by CMS
      • SRM SAM failures likely related (GGUS:112719)

  • ALICE -
    • CASTOR at CERN: offline activities hampered by disk server issues
      • one went offline unnoticed for a while (rebooted OK)
      • others suffered network issues (ongoing)
    • Team ticket GGUS:112716 for VOMS service managers: Sao Paulo admin certificate being rejected as expired, while the proxy worked OK in other usage (being investigated).

Sites / Services round table:

  • ASGC: ntr
  • BNL:
  • CNAF: ntr
  • FNAL: ntr
  • GridPP:
  • IN2P3: ntr
  • JINR:
  • KISTI: ntr
  • KIT:
  • NDGF:
    • We'll have a reboot of the headnodes tomorrow (Tuesday) morning. Mostly security patches requiring reboots, but also minor dCache work.
  • NL-T1: ntr
  • NRC-KI:
  • OSG: ntr
  • PIC:
  • RAL: ntr
  • TRIUMF: ntr

  • CERN batch and grid services:
    • ALICE VOMS issue being looked into
    • various user tickets received about CASTOR operations failing in the batch farm; will assign them to the CASTOR team instead
  • CERN storage services:
    • CASTOR-CMS upgrade foreseen for Tue Apr 7, 09:00-12:00 (TBC)
    • CASTOR-ATLAS upgrade foreseen for Wed Apr 8, 09:00-12:00 (TBC)
  • Databases:
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

Thursday

Attendance:

  • local: Jerome (grid services), Maarten (SCOD + ALICE)
  • remote: Christoph (CMS), Dennis (NLT1), Di (TRIUMF), Felix (ASGC), Jeremy (GridPP), John (RAL), Michael (BNL), Rob (OSG), Rolf (IN2P3), Sang-Un (KISTI), Thomas (KIT), Ulf (NDGF)

Experiments round table:

  • CMS -
    • Tier-1 sites rather busy with DIGI-RECO of Upgrade MonteCarlo
    • Again some problems with Squid/Frontier infrastructure
      • Still not fully understood

  • ALICE -
    • CASTOR at CERN:
      • reads and writes failed as of 04:00 CEST today
      • due to various other activities on the server side, the disk servers did not have enough transfer slots left
      • the number of slots has been increased and the problem was cured, thanks!
    • team ticket GGUS:112716 for the VOMS service managers, mentioned on Monday:
      • Sao Paulo admin certificate was being rejected as expired, while the proxy worked OK in other usage
      • the cause turned out to be an expired CRL of the Brazilian CA!
      • that in turn was caused by the CRL URL being firewalled for IPv6
        • this is at least the 4th such incident with a CA...
      • GGUS:112774 opened for the VOMS devs to improve the error message...
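
    A minimal sketch of how a stale CRL like the one above could be spotted early, assuming the CRL's expiry line has already been obtained with `openssl crl -in <file> -noout -nextupdate`. The helper names and the dates in the usage lines are hypothetical and not taken from the ticket; month names are parsed with the default C locale.

    ```python
    from datetime import datetime, timezone


    def crl_next_update(openssl_line: str) -> datetime:
        """Parse a line such as 'nextUpdate=Apr  2 12:00:00 2015 GMT' into a UTC datetime."""
        # Keep only the part after 'nextUpdate='.
        _, _, value = openssl_line.strip().partition("=")
        # Whitespace in the format matches one or more spaces, so 'Apr  2' parses too.
        parsed = datetime.strptime(value, "%b %d %H:%M:%S %Y GMT")
        return parsed.replace(tzinfo=timezone.utc)


    def crl_is_expired(openssl_line: str, now: datetime) -> bool:
        """Return True when 'now' is at or past the CRL's nextUpdate time."""
        return now >= crl_next_update(openssl_line)


    # Hypothetical values, for illustration only.
    line = "nextUpdate=Mar 28 00:00:00 2015 GMT"
    print(crl_is_expired(line, datetime(2015, 3, 30, tzinfo=timezone.utc)))
    ```

    Such a check only reports that the local copy is stale; it does not show why the refresh failed (here, the CRL URL being unreachable over IPv6).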

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF:
  • FNAL:
  • GridPP: ntr
  • IN2P3: ntr
  • JINR:
  • KISTI: ntr
  • KIT: ntr
  • NDGF: ntr
  • NL-T1: ntr
  • NRC-KI:
  • OSG: ntr
  • PIC:
  • RAL:
    • CVMFS 2.1.20 (next version) has been running OK in production for 3 weeks
    • CASTOR upgrade to 2.1.14-15 next Wed
  • TRIUMF: ntr

  • CERN batch and grid services: ntr
  • CERN storage services:
  • Databases:
  • GGUS:
    • Last test alarm of the release (INFN-T1) solved one week after its creation.
  • Grid Monitoring:
  • MW Officer:

AOB:

  • ATTENTION: next meeting on Tuesday April 7!
  • Have a good Easter break!

-- AndreaSciaba - 2015-02-27

Topic revision: r11 - 2015-04-09 - MaartenLitmaath