Week of 150316

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or invite the right people to the meeting.

Monday

Attendance:

  • local: Maria D. (SCOD), Maarten (ALICE), Massimo Lamanna (CERN Data mgnt), Ale di Gi (ATLAS), Mark Slater (LHCb), Ulrich S. (CERN Grid Services).
  • remote: Dimitri (KIT), Hung-Te Lee (ASGC), Rolf Rumler (IN2P3), Michael Ernst (BNL), Onno Zweers (NL_T1), Tiju (RAL), Ulf (NDGF), Sang-Un (KISTI), Pepe Flix (PIC and CMS), Matteo (CNAF), Kyle (OSG), Di Q. (Triumf).

Experiments round table:

ATLAS, ALICE and CERN Data Mgnt requested a SIR (Service Incident Report) from the CERN network team on last Thu-Fri's trouble: the CERN Service Status Board gave some information, but the issues were not clear to people until a solution was announced after 9am on Fri. Maria D., as SCOD, will follow up with the network team.

  • ALICE -
    • CERN:
      • CASTOR team restored previous behavior of staging through Xrootd - thanks!
      • 2 team tickets opened on Fri because of issues caused by the network problem
      • 1 team ticket GGUS:112354 opened on Sun because of expired PK-GRID-CA CRL

  • LHCb reports (raw view) -
    • Distributed computing dominated by Monte Carlo and user activities.
      • T0: VOMS tickets : 112279, 112281
      • T1:
        • Issues with large number of SRM requests to PIC and IN2P3. We're investigating what the cause is.
Discussion at the meeting, with contributions from NL_T1, PIC and IN2P3, led to the conclusion that the dCache update might be linked to the many SRM requests being queued. Mark/Pepe will open a GGUS ticket for the dCache dev. team, and sites will add their experience to the ticket diary.

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: not connected
  • GridPP: not connected
  • IN2P3: nta. Commented on the LHCb SRM issue above.
  • JINR: not connected
  • KISTI: ntr
  • KIT: ntr
  • NDGF: will do dCache updates (probably 2.12.2) on Wednesday morning 7:30 - 8:30 UTC (from Ulf by email).
  • NL-T1: nta. Commented on the LHCb SRM issue above. They have a single instance so they can't see which is the affected VO but observe getTURL problems since the dCache upgrade to v.2.10.20.
  • NRC-KI: not connected
  • OSG: ntr
  • PIC: working on the SRM problem presented by LHCb
  • RAL: ntr
  • RRC-KI: not connected
  • TRIUMF: ntr

  • CERN batch and grid services: ntr
  • CERN storage services: ntr
  • Databases: not present
  • GGUS: not present
  • Grid Monitoring: not present
  • MW Officer: reports on Thursdays

AOB:

Thursday

Attendance:

  • local: Maria D. (SCOD), Maarten (ALICE), Ale di Gi & David Cameron (ATLAS), Mark Slater (LHCb), Ulrich S. (CERN Grid Services), Andrea Manzi (MW Officer), Pablo Saiz (GGUS & Grid Monitoring).
  • remote: Thomas Hartmann (KIT), Rolf Rumler (IN2P3), Dennis van Dok (NL_T1), Gareth Smith (RAL), Ulf (NDGF), Pepe Flix (PIC and CMS), Matteo (CNAF), Kyle (OSG), Di Q. (Triumf).

Experiments round table:

  • ATLAS reports (raw view) -
    • ADCR database (used by many ADC services) down for a few hours on Wed morning

  • CMS reports (raw view) -
    • Since the CMS Computing and Offline week is ongoing, most likely nobody will be available to join
      • Please address follow-up questions to Christoph
    • Rather successful test of distributed PromptRECO using ~50% of Tier-1 CPU resources in multi-core mode
    • Had some issues with EOS, all under investigation by experts:

  • ALICE -
    • CASTOR at CERN: instabilities in the last few days, affecting both online and offline transfers
      • root cause was Oracle deciding to change the execution plan of some standard (and already optimal) queries
      • on top of that there were a few badly behaved nodes plus some problems in the monitoring
      • thanks for the prompt support in these matters!

  • LHCb reports (raw view) -
    • T0: VOMS tickets : 112279, 112281
    • T1:
      • Recent issue with dCache SRMs being overloaded by gfal_getturlfromsurl requests: found to be due to a bug fix in the srm_ifce library, which is used by gfal 1. A temporary fix, rolling back to the previous version, is in place. LHCb will soon construct TURLs using string manipulation instead. Many thanks to both the sites and the dCache developers for the help in debugging this. Ticket is GGUS:112413. Sites (mostly IN2P3 and PIC) that put a special set-up in place to debug this issue may restore the original parameters if they wish.
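
As an illustration of the string-manipulation approach mentioned in the LHCb report, a minimal sketch could look like the following. It is not LHCb's actual implementation. The function name `surl_to_turl`, the door endpoint `door.example.org:1094`, the `root` protocol and the example paths are all hypothetical; the real door host, port and protocol are site-specific and would have to come from site configuration.

```python
def surl_to_turl(surl, door, scheme="root"):
    """Build a TURL from a SURL without contacting the SRM.

    `door` is the site's transfer endpoint as "host:port" (assumed to be
    known from configuration); `scheme` is the transfer protocol.
    """
    prefix = "srm://"
    if not surl.startswith(prefix):
        raise ValueError("not an SRM SURL: %s" % surl)
    rest = surl[len(prefix):]
    if "?SFN=" in rest:
        # Full form: srm://host:port/srm/managerv2?SFN=/path/to/file
        path = rest.split("?SFN=", 1)[1]
    else:
        # Short form: srm://host[:port]/path/to/file
        slash = rest.find("/")
        path = rest[slash:] if slash >= 0 else ""
    if not path.startswith("/"):
        raise ValueError("no absolute file path in SURL: %s" % surl)
    # No SRM request is issued here, so nothing is pinned or queued at the SE.
    return "%s://%s%s" % (scheme, door, path)


if __name__ == "__main__":
    # Hypothetical SURL and door, for illustration only.
    print(surl_to_turl(
        "srm://srm.example.org:8443/srm/managerv2?SFN=/pnfs/example.org/data/lhcb/file.dst",
        "door.example.org:1094",
    ))
```

The trade-off of this approach is that the client must know each site's door endpoint and path mapping in advance, since the SRM is no longer asked to provide them.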

Sites / Services round table:

  • ASGC: not connected
  • BNL: not connected
  • CNAF: ntr
  • FNAL: not connected
  • GridPP: not connected
  • IN2P3: ntr
  • JINR: not connected
  • KISTI: not connected
  • KIT: Two days ago a network problem was diagnosed (30% packet loss). Now fixed.
  • NDGF: Technical info from Ulf, entered in the Vidyo chat window: LHCb is making the wrong library call to check whether a file exists. They currently check out a TURL, which pins the file and causes work at the SE; they could just do a stat()-equivalent instead. NB!! dCache 2.10.latest, 2.11.latest and 2.12.2 are buggy: the pool might crash on startup if it was shut down while a file was being written! NDGF experienced this yesterday while updating to dCache 2.12.2. They had to stop, patch and then proceed with the update.
  • NL-T1: ntr
  • OSG: ntr
  • PIC: Pepe is grateful to Mark (LHCb) for the help with debugging the SRM problem. Nothing else to report.
  • RAL: Investigating a problem with a pair of network routers. The primary failed at 8am UTC today and the other didn't take over automatically, as it should have. If they need to cause a stoppage for further testing they will publish a warning in GOCDB.
  • TRIUMF: ntr

  • CERN batch and grid services: ntr
  • CERN storage services: not present
  • Databases: not present
  • GGUS:
    • Network maintenance on central KIT infrastructure might affect GGUS from 23-Mar-15 06:00:00 to 23-Mar-15 09:00:00
    • Next GGUS release from 25-Mar-15 07:00:00 to 25-Mar-15 09:00:00 UTC.
  • Grid Monitoring: ntr
  • MW Officer: StoRM rel. 1.11.8 is available and includes a fix for a critical issue. It is being deployed at CNAF first. More at the WLCG Ops Coord meeting in a few minutes.

AOB: The CERN network team provided this page describing the 12-13 March incident. The article was considered insufficient in detailing the impact on services and the time it took to recover. Further discussion at the WLCG Ops Coord meeting a few minutes later concluded that the T0 manager (Maite) will discuss a clear SIR format with the T0 network team.

-- AndreaSciaba - 2015-02-27

Topic attachments

  • MB-Mar-15.pptx (r1, 2868.8 K, 2015-03-17 09:38, PabloSaiz)
Topic revision: r19 - 2015-03-19 - MaartenLitmaath