Week of 141020

 

WLCG Operations Call details

 

  • At CERN the meeting room is 513 R-068.
 
  • For remote participation we use the Vidyo system. Instructions can be found here.
 

General Information

 

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.
 

Monday

Attendance:

  • local: Maria Dimou (chair & notes), Kate Dziedziniewicz-Woycik (CERN DB), Tsung-Hsun Wu (ASGC), Michail Salichos (LHCb), Maarten Litmaath (ALICE), Alessandro di Girolamo (ATLAS).
  • remote: Ulf Tigerstedt (NDGF), Onno Zweers (NL_T1), Dea-Han Kim (KISTI), Michael Ernst (BNL), Tiju Idicula (RAL), Lisa Giachetti (FNAL), Lucia (CNAF), Rob Quick (OSG), Dimitri (KIT), José Hernandez (CMS).

Experiments round table:

 

  • ATLAS reports (raw view) -
    • Daily Activity overview
      • FTS3 at RAL seems to be in trouble, but the status of FTS in general is not yet clear: many links are empty, no FTS jobs... To be followed up. Update: understood by noon; the cause was on the ATLAS Central Catalog side.
      • Rucio metadata chat: the DB schema update can't be done this week; next week may be possible. Discussions on the details of the various metadata.
    • CentralService/T0/T1s
      • GGUS:109418: after the VOMS server was fixed, there was another problem with the propagated VOMS attributes. Shaojun (PandaServer) said he fixed something; we have to ask for details.
        • The ATLAS proxies should be valid for 96 hours, i.e. the VOMS server issues should not be so critical.
        • In Rucio the proxy is renewed every hour with a 96-hour validity. To be checked that alerts are sent out when the renewal does not work.
 
  • CMS reports (raw view) -
    • Fairly quick recovery of CMS services after the power cut at CERN on Oct 16th. Everything working at ~1am. Just needed to restart the stuck CERN CMS xrootd redirector the next morning (GGUS:109392).
    • Global xrootd redirector intermittently unavailable since Oct 16th. Investigating (GGUS:109435)
    • The production DB for the VOMS service at CERN went down on Saturday (GGUS:109421). It was not possible to obtain VOMS proxies. It took ~12 hours to recover.
    • On Saturday the CMS Frontier launchpad servers were overloaded by a production workflow downloading an overly large file (> 1GB), a tarball of files required for event generation. The file was too large to be cached and was therefore downloaded by every single job. The workflow was aborted. An alternative way to handle such files is being studied.
  ATLAS asked CERN IT DB to communicate a new policy statement for the current era, in which WLCG critical services (e.g. VOMS, FTS) depend on Databases on Demand. Since databases are a critical service themselves, a 'best effort' support level is not sufficient for any of them. Even when a service restart fixes a problem, an expert investigation is necessary to confirm that the restart was the right action.
  • ALICE -
    • central services: fast recovery after the power was restored Thu evening
    • CERN: VOMS servers were working again Sat evening before ALICE VOBOXes ran out of their proxies
 
  • LHCb reports (raw view) -
    • MC and User jobs. Prestaging from tapes for future processing.
    • T0: A GGUS ticket was opened for the LHCb VOMS servers (GGUS:109417)
    • T0: Ongoing testing using FTS3 for staging files from archives
    • T1: NTR
 

Sites / Services round table:

 

  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • GridPP: not connected
  • IN2P3: not connected
  • JINR: not connected
  • KISTI: ntr
  • KIT: ntr
  • NDGF:
    • 30' downtime tomorrow Tue 21/10 to upgrade the dCache headnodes. Database machines will be rebooted. All in GOCDB.
    • A network problem appeared this morning. Now it is fixed.
    • xrootd v.4.0.3 in EPEL7 contains a serious problem that leads to memory corruption and an eventual daemon crash. It was discovered at logrotate time: the logrotate script running at 3am systematically coincided with (triggered?) the crash.
  • NL-T1: Next Thursday, warning downtime to fix a hardware issue with a dCache pool. Files will be unavailable for about 1 hour.
  • OSG: ntr
  • PIC: ntr (sent by email)
  • RAL:
    • An ARC CE issue is being investigated.
    • The FTS3 upgrade went well today.
  • RRC-KI: not connected
  • TRIUMF: not connected
 
  • CERN batch and grid services: ntr
  • CERN storage services: not present
  • Databases: Last Friday's power cut affected the integrations database only. The CMS online DB is being moved from Point 5 to the Computer Centre for puppetisation. It will be unavailable for 30'.
  • GGUS: not present
  • Grid Monitoring: not present
  • MW Officer: only on Thursdays.
AOB:

 

Thursday

Attendance:

  • local: Maria Dimou (chair & notes), Michail Salichos (LHCb), Marcin Blaszczyk (CERN DB), Alessandro Fiorot (CERN Data Mgnt), Hervé Rousseau (CERN Data Mgnt), Felix Lee (ASGC), Andrea Manzi (MW Officer), Daniel Pek (CERN Grid Services).
  • remote: Alessandro di Girolamo (ATLAS), Ulf Tigerstedt (NDGF), Dea-Han Kim (KISTI), Lisa Giachetti (FNAL), Lucia (CNAF), Gareth (RAL), José Hernandez (CMS), Pavel Weber (KIT), Rolf Rumler (IN2P3), Dennis van Dok (NL_T1 by email).

Experiments round table:

 

  • ATLAS reports (raw view) -
    • Daily Activity overview
      • ProdSys2: we want to have 10/15%. The problem is that physics validation is not done yet. We don't need anything too complex (it's not various releases, just ProdSys1 vs ProdSys2 for sample A), but there is no way to have a lightweight validation. The validation people would like to do the validation on rel 19, but we don't have a sample A for it; it will take a while to produce it, process it, etc. The plan is: we go on producing MC with ProdSys2 and do the physics validation with rel 19, but in 2/3 weeks. Once the physics validation is done, if we see problems (remote chance), we will redo the sample. Low-priority work will go into ProdSys2. We are now waiting for Jose and Andreas Hoecker on a rule to define which samples should go into ProdSys2. It will be MCORE. Bulk submission could be the tricky part. To be checked that the various parts of the chain can be run with different numbers of cores (merging is done internally, so we need to check ProdSys2).
      • ProdSys2 Rucio validation: we will inject a chain, starting with evgen without inputs. Containers are part of post-production, so we won't care about them until the end of the chain.
      • Rucio slowness understood: an HAProxy issue. Graphite monitoring to be improved to show the real Rucio performance.
        • rucio auth: PandaServer can go directly to rucio without authentication
        • metrics to be created -- up to Rucio to define them. metrics
      • Metadata: the number of events already exists, but it should also be reported per dataset. Lumiblock is similar (but not needed). Provenance, physigroups and transient were also agreed.
    • CentralService/T0/T1s  

  • CMS reports (raw view) -
    • Today between 17h and 18h the ELOG will be migrated to puppet-managed machines in the CERN AI infrastructure.
    • The CMS CVMFS installation server has been down since the power cut. The original ticket GGUS:109474, opened on Mon 20th, was escalated to ALARM ticket GGUS:109500 since new CMS releases needed to be published urgently. The reaction to the alarm ticket was prompt: the machine in question was rebooted.
 
  • ALICE -
    • CERN: CEs were publishing bad job numbers (GGUS:109538)
      • this time a bug was found that was responsible
    • most pilot jobs are observed to exit after running just a single task, while they normally have plenty of time left for other tasks; this is being debugged
 
  • LHCb reports (raw view) -
    • MC and User jobs. Prestaging from tapes for Stripping21 campaign has started.
    • Starting from November, the staging load will increase steadily.
    • T0: NTR
    • T1: Launched replication of more than 30K files and discovered a few hiccups, mostly at CASTOR; requested support from RAL (Shaun DeWitt). It was discovered that CASTOR uses the minimum of the bringonline timeout and the pin lifetime. Since the pin lifetime was initially set to 1K seconds, transfers were timing out; once it was increased, they started progressing again.
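
The CASTOR behaviour reported by LHCb can be illustrated with a minimal sketch. It assumes only what is stated above: the effective limit is the smaller of the bringonline timeout and the pin lifetime, and the pin lifetime was 1K seconds. The function names, the bringonline timeout of 28800 s and the staging duration of 3600 s are hypothetical values chosen for illustration; they are not taken from CASTOR or FTS3.

```python
def effective_staging_limit(bringonline_timeout_s: int, pin_lifetime_s: int) -> int:
    """Return the limit that actually applies: the smaller of the two settings."""
    return min(bringonline_timeout_s, pin_lifetime_s)


def staging_times_out(staging_duration_s: int,
                      bringonline_timeout_s: int,
                      pin_lifetime_s: int) -> bool:
    """True if a staging request of the given duration exceeds the effective limit."""
    limit = effective_staging_limit(bringonline_timeout_s, pin_lifetime_s)
    return staging_duration_s > limit


# Hypothetical numbers, except the 1000 s pin lifetime reported above.
BRINGONLINE_TIMEOUT_S = 28800   # assumption for illustration
STAGING_DURATION_S = 3600       # assumption for illustration

# With the initial pin lifetime the short value wins and the request times out.
print(staging_times_out(STAGING_DURATION_S, BRINGONLINE_TIMEOUT_S, 1000))

# After raising the pin lifetime the bringonline timeout becomes the limit.
print(staging_times_out(STAGING_DURATION_S, BRINGONLINE_TIMEOUT_S, 28800))
```

The point of the sketch is that raising only the bringonline timeout would not have helped: as long as the pin lifetime is the smaller value, it determines when transfers time out.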
 

Sites / Services round table:

 

  • ASGC: ntr
  • BNL: not connected
  • CNAF: ntr
  • FNAL: ntr
  • GridPP: not connected
  • IN2P3: ntr
  • JINR: not connected
  • KISTI: could not be heard
  • KIT: ntr
  • NDGF: Despite several network problems (the network to CERN was down for 3 hours today, and the network to other sites was down last night), all services continued functioning.
  • NL-T1: ntr
  • OSG: not connected
  • PIC: not connected
  • RAL: Investigating ARC CE problems.
  • RRC-KI: not connected.
  • TRIUMF: not connected.
 
  • CERN batch and grid services:
  • CERN storage services: ntr
  • Databases: The move of the CMS online DB to the CERN CC for puppetisation was done on Tuesday successfully.
  • GGUS: not present.
  • Grid Monitoring: not present.
  • MW Officer: Will try to find an environment to test xrootd v.4.0.4 which came out yesterday with the fix for the logrotate problem, reported by NDGF on Monday.

AOB: Time change to winter time this Sunday 26/10 (one more hour to sleep!)

Topic revision: r17 - 2014-10-23 - MaartenLitmaath
 