Week of 150608

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic requiring information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, to make sure that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Monday

Attendance:

  • local: Maria Alandes (Chair), Daniele Bonacorsi (CMS), Nacho Barrientos (Grid and Batch), Andrei Dumitru (DB), Alessandro di Girolamo (ATLAS), Maarten Litmaath (ALICE), Stefan Roiser (LHCb)
  • remote: Lisa Giacchetti (FNAL), Michael Ernst (BNL), Jose Flix (pic), Kyle Gross (OSG), Dmytro Karpenko (NDGF), Felix Lee (ASGC), Di Qing (TRIUMF), Rolf Rumler (IN2P3), Gareth Smith (RAL), Sang Un Ahn (KISTI), Sonia (CNAF), Onno Zweers (NL T1)

Experiments round table:

Sonia summarises the details reported in the GGUS ticket. Alessandro explains that a fix is needed ASAP, as the issue could become critical if it is not resolved within the next one or two days. Alessandro fears that this could be related to the GFAL 2 upgrade in FTS 3 carried out last week. It seems to affect only FTS 3 and StoRM. It is very likely that the problem is on the StoRM side, but the cause is not clear yet. Sonia will convey the urgency of this issue to the StoRM experts.

  • CMS reports ( raw view) -
    • Services
      • AFS: Jun-07, Vincenzo (Innocente) reported slowness of scram/cmsRun, with cmsRun mostly waiting on "afs_cv_wa". Jan confirmed the release being used is on the overloaded volume "q.cms.ibv1_s6a491"
      • we are considering adding the AFS console per-volume monitoring to our (CMS) shift instructions
      • CVMFS: Jun-04, SLS availability dropped to ~60%. 2 Stratum-1 services affected (ASGC and RAL)
        • restored within hours; no actions taken, no GGUS ticket opened either
    • question: P5->T0(Wigner) packet loss discussed at the WLCG 'daily’ call on Jun-04 with CERN-IT/DSS. Any recent news?
    • T1:
      • since May-04: T1_FR_CCIN2P3 debugging stuck outgoing transfers, GFAL-related, testing a solution
        • followed up in several GGUS tickets; the parent is GGUS:113510, with the same symptoms also reported in others, e.g. GGUS:113937. Sebastien and CCIN2P3 are following closely
        • last update by Sebastien on Jun-05: wrote an lcg-utils-based script that does the work and put it into production, restarting the PhEDEx staging agent. Need to monitor whether any other actions are required (e.g. is the buffer->disk copy automatic or does it need intervention?), but good progress.
      • since Jun-05: 2 files stuck in a transfer request to DESY, being served from FNAL
        • debugged: 1 file invalidated, and 1 file (slow transfer from FNAL) rerouted from Rome (GGUS:114125, in progress)
      • since Jun-05: the number of running jobs at FNAL dropped below 500, with thousands of pending jobs
        • debugged: 2 issues identified at the glidein-factory level: "a corrupt record in the Accountant and an invalid priority factor caused negotiation issues". Both resolved by now; the last GGUS-documented check showed 1300+ 8-core pilots starting (GGUS:114124, still being watched, but to be closed soon)

  • ALICE -
    • high activity
    • CERN: CASTOR file access instabilities were solved Sun morning - thanks!
      • the number of activity slots per disk server was doubled.

  • LHCb reports ( raw view) -
    • first offline data workflow processing over the weekend
    • T0
      • offline processing so far at CERN only
      • identified a merging job running for 39.5 hours (GGUS:114152), asked the LSF team to check

Nacho reports that apparently nothing is wrong with the relevant batch node, but he has passed the ticket to the batch experts, who are now looking at it.

Sites / Services round table:

  • ASGC: NTR
  • BNL: NTR
  • CNAF: NTR
  • FNAL: NTR
  • GridPP: NA
  • IN2P3: There will be a downtime on Tuesday 16th. The batch system will start draining the previous night. dCache will be unavailable from 9am to 2pm. All services should be back in operation in the evening.
  • JINR: NA
  • KISTI: NTR
  • KIT: NA
  • NDGF: There was a power outage in Copenhagen during the weekend. Storage is partially down and computing is fully unavailable. All services are expected to be back in operation by tomorrow.
  • NL-T1: The dCache space manager has been unstable since the upgrade to 2.10. Transfer errors occasionally occur and the space manager needs to be restarted; this has happened a few times. Maria asks whether the dCache developers have been contacted and Onno replies not yet; they are doing internal investigations for the time being.
  • NRC-KI: NA
  • OSG: NTR
  • PIC: NTR
  • RAL: A 1-hour outage is scheduled for Wednesday 7th of June at 13:00 UTC.
  • TRIUMF: NTR

  • CERN batch and grid services:
  • CERN storage services: NA
  • Databases: NTR
  • GGUS: NTR
  • Grid Monitoring: NTR
  • MW Officer: NTR

AOB:

Thursday

Attendance:

  • local: Maria Alandes (Chair), Christoph Paus (CMS), Alberto Rodriguez (Grid and Batch), Alessandro di Girolamo (ATLAS), Ilija Vukotic (ATLAS), Maarten Litmaath (ALICE), Stefan Roiser (LHCb), Andrea Manzi (MW Officer)
  • remote: Lisa Giacchetti (FNAL), Michael Ernst (BNL), Jose Flix (pic), Rob Quick (OSG), Dmytro Karpenko (NDGF), Felix Lee (ASGC), Di Qing (TRIUMF), Rolf Rumler (IN2P3), Gareth Smith (RAL), Lucia (CNAF), Dennis van Dok (NL T1), Thomas Hartmann (KIT), Christoph Wissing (CMS)

Experiments round table:

  • ATLAS reports ( raw view) -
    • FTS upgrade: it would be really useful if the FTS3 servers could be upgraded today. Otherwise jobs can get stuck, since we use the FTS server "at the destination" and the new FTS3 workflow combines staging and transfers, so it is not easy to isolate the sources (see the sketch after this report).
    • CERN-PROD: high failure rate on ATLAS dedicated cluster. Now almost understood thanks to the quick interactions (and fixes) with CERN-IT PES
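For illustration, below is a rough, hypothetical sketch of the kind of FTS3 submission that combines staging (bring-online) with a transfer, using the fts3-rest "easy" Python bindings; the endpoint and SURLs are placeholders, not actual ATLAS paths, and the parameters used in production may well differ.

    # Hypothetical example only: submit a transfer with a bring-online (staging)
    # step to an FTS3 server via the fts3-rest "easy" Python bindings.
    # Endpoint and SURLs below are placeholders.
    import fts3.rest.client.easy as fts3

    endpoint = "https://fts3-example.cern.ch:8446"            # placeholder FTS3 REST endpoint
    context = fts3.Context(endpoint)

    source = "srm://se.example.org/path/to/tape/file"         # placeholder tape-backed source
    destination = "srm://dest-se.example.org/path/to/file"    # placeholder destination

    transfer = fts3.new_transfer(source, destination)
    # bring_online asks the source SE to stage the file from tape first;
    # copy_pin_lifetime keeps the staged copy pinned while the transfer runs.
    job = fts3.new_job([transfer], bring_online=3600, copy_pin_lifetime=3600)

    job_id = fts3.submit(context, job)
    print("submitted job", job_id)

Since one FTS3 job now covers both the staging and the transfer, a partially upgraded set of servers makes failures harder to attribute, which is the concern expressed above.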

It is clarified during the meeting that neither CMS nor LHCb has suffered from the latest FTS issue, because they do not yet use it for staging at CNAF.

  • CMS
    • Intermittent EOS write issues from CMS Tier-0 GGUS:113389
    • Any updates on the network link P5-Wigner?
      • Addendum after the meeting: there is a (stuck) ticket: RQF0464111; thanks to Daniele for tracking it down for me
    • Another addendum: CERN storage experts reported that one CMS file was found corrupted on CASTOR at CERN; it could be replaced from a still-existing disk replica

Regarding the P5-Wigner issue, Maria mentions that she pinged the CERN experts last Monday; they were very busy but will have a look at the problem as soon as possible. Maria will check again with them. Christoph agrees to provide more details, and a ticket if one exists, after the meeting. Other experiments do not seem to be affected by this.

  • ALICE -
    • CERN: half of myproxy.cern.ch was timing out during the night (team ticket GGUS:114251)

  • LHCb reports ( raw view) -
    • offline processing progressing well
    • T0
      • SRM_FILE_BUSY issue re-surfaced for pit export (GGUS:114188)
      • identified a merging job running for 39.5 hours (GGUS:114152); asked the LSF team to check; re-submission of the jobs took 2 hours
    • T1
      • problems accessing SARA storage continue from various other sites (and from SARA/NIKHEF); when checking with one of the sites, it turned out they had not deployed the latest fetch-crl version, which fixes the outdated-CRL problem (HTML header issue)
      • moving to T1 and T2 sites for the first offline processing; first runs distributed to CNAF, SARA & RAL, workflows all running fine

Maarten suggests sending an EGI broadcast after the meeting to remind sites to update the fetch-crl package to the latest version. ATLAS explains they do not see this issue because they use FTS3 for file transfers, and the fetch-crl package is up to date there. LHCb uses direct access through GFAL from the WN to the storage. ALICE is not affected by this and CMS does not use SARA. This may explain why the other experiments are not suffering from it.
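For illustration, a minimal, hypothetical sketch of what such direct GFAL access from a worker node can look like, using the gfal2 Python bindings; the SURL is a placeholder, and on a WN whose CRLs are outdated a call like this would typically fail with an SSL/certificate error rather than returning metadata.

    # Hypothetical example only: direct access to an SRM storage element from a
    # worker node via the gfal2 Python bindings (the kind of access LHCb describes).
    # The SURL below is a placeholder, not a real LHCb file.
    import gfal2

    ctx = gfal2.creat_context()
    surl = "srm://srm.example.org/pnfs/example.org/data/lhcb/some/file"

    try:
        info = ctx.stat(surl)              # metadata lookup directly against the SE
        print("size:", info.st_size)
    except gfal2.GError as err:
        # With an expired/outdated CRL on the WN this typically shows up as an
        # SSL handshake / certificate verification error.
        print("access failed:", err)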

Sites / Services round table:

  • ASGC: NTR
  • BNL: NTR. Michael confirms FTS will be upgraded within the next few hours
  • CNAF: NTR
  • FNAL: NTR
  • GridPP: NA
  • IN2P3: NTR
  • JINR: NA
  • KISTI: NA
  • KIT: A short network outage happened on Tuesday night. Everything was restored within 2 hours.
  • NDGF: NTR
  • NL-T1: Dennis explains that all WNs at NIKHEF have fetch-crl up to date. If issues are seen coming from them, a GGUS ticket should be opened, because normally this should not happen.
  • NRC-KI: NA
  • OSG: Accounting records for the month of May will be resubmitted, since some numbers are wrong.
  • PIC: NTR
  • RAL: Reminder: a 1-hour outage is scheduled for Wednesday 7th of June at 13:00 UTC.
  • TRIUMF: NTR

  • CERN batch and grid services:
    • A faulty router caused network issues last night (OTG0022230), affecting the whole infrastructure and possibly some grid services such as MyProxy.
    • CERN FTS GFAL upgrade 2.9.2 -> 2.9.3 at 10:00 CEST on Monday 15th (OTG0022253). The intervention should last 5 to 10 minutes and be fully transparent. This version has been running on the pilot since Tuesday(?) this week.
  • CERN storage services: NA
  • Databases: NA
  • GGUS: NA
  • Grid Monitoring:
    • ATLAS request for recomputation: GGUS:114067
    • ATLAS request for recomputation: GGUS:114182 Closed
    • LHCb and CMS request for recomputation: GGUS:114099
      • CMS request closed
  • MW Officer: NA

It is mentioned that requests for recomputation should be submitted per affected VO. One ticket for two VOs is wrong; LHCb did not seem to see the one opened for LHCb and CMS.

AOB:
