Week of 161212

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic needs to be discussed at the daily meeting requiring information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch to make sure that the relevant parties have the time to collect the required information or to invite the right people to the meeting.

Tier-1 downtimes

Experiments may experience problems if two or more of their Tier-1 sites are inaccessible at the same time. Therefore Tier-1 sites should do their best to avoid scheduling a downtime classified as "outage" in a time slot overlapping with an "outage" downtime already declared by another Tier-1 site supporting the same VO(s). The following procedure is recommended:

  1. A Tier-1 should check the downtimes calendar to see if another Tier-1 already has an "outage" downtime in the desired time slot.
  2. If there is a conflict, another time slot should be chosen.
  3. If stronger constraints do not allow choosing another time slot, the Tier-1 will point out the existence of the conflict to the SCOD mailing list and at the next WLCG operations call, to discuss it with the representatives of the experiments involved and the other Tier-1.

As an additional precaution, the SCOD will check the downtimes calendar for Tier-1 "outage" downtime conflicts at least once during his/her shift, for the current and the following two weeks; in case a conflict is found, it will be discussed at the next operations call, or offline if at least one relevant experiment or site contact is absent.
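
A minimal sketch of the overlap check described above, using hypothetical in-memory downtime entries (in practice the information comes from the downtimes calendar linked below):

    from datetime import datetime

    # (site, supported VOs, severity, start, end) - hypothetical example entries
    downtimes = [
        ("IN2P3-CC", {"alice", "atlas", "cms", "lhcb"}, "OUTAGE",
         datetime(2016, 12, 14, 8), datetime(2016, 12, 14, 18)),
        ("pic", {"atlas", "cms", "lhcb"}, "OUTAGE",
         datetime(2016, 12, 13, 7), datetime(2016, 12, 13, 17)),
    ]

    def conflicts(new_site, new_vos, start, end):
        """Existing Tier-1 OUTAGE downtimes that overlap the proposed slot
        and share at least one VO with the requesting Tier-1."""
        hits = []
        for site, vos, severity, s, e in downtimes:
            if site == new_site or severity != "OUTAGE":
                continue
            if start < e and s < end and new_vos & vos:
                hits.append((site, sorted(new_vos & vos)))
        return hits

    # Example: a Tier-1 proposes an outage on Dec 14, 09:00-13:00 for ATLAS and CMS
    print(conflicts("RAL-LCG2", {"atlas", "cms"},
                    datetime(2016, 12, 14, 9), datetime(2016, 12, 14, 13)))
    # -> [('IN2P3-CC', ['atlas', 'cms'])], i.e. another slot should be chosen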

Links to Tier-1 downtimes

ALICE   ATLAS   CMS    LHCb
        BNL     FNAL

Monday

Attendance:

  • local: Alberto (monitoring), Andrea (MW Officer), David (databases), Gavin (computing), Hugo (storage), Julia (WLCG), Maarten (SCOD + ALICE), Maria A (WLCG), Maria D (WLCG), Marian (networks), Petr (ATLAS), Vincent (security), Xavi (storage)
  • remote: Andrew (NLT1), Antonio (CNAF), David (FNAL), Di (TRIUMF), Dmytro (NDGF), FaHui (ASGC), Federico (LHCb), John (RAL), Kyle (OSG), Pepe (PIC), Preslav (KIT), Rolf (IN2P3), Sang-Un (KISTI), Stephan (CMS), Victor (JINR), Xin (BNL)

Experiments round table:

  • ATLAS reports ( raw view) -
    • Production
      • High level of running jobs ~250k (except during Sunday/Monday morning, because of issues with FTS)
      • updated fairshare to finish derivation tasks (data 2015 already exists, data 2016 and MC 2015 must be reprocessed with the latest fixes)
      • problematic overlay transformation tasks - job efficiency was low during Wednesday
      • reprocessing beam-spot data (reading data from tape); prepared 600M evgen MC events for Xmas to keep the available resources busy
      • PanDA pilot update (problems with wrong eventType in traces fixed; this had an impact on Rucio metadata / behavior)
    • Services
      • migration of 120 virtual machines to new OpenStack / hardware (mostly live migration transparent to users, only two machines needed manual intervention)
      • Frontier services overloaded (overlay transformation tasks; created a special tag and a PanDA queue with a limited number of running tasks, redirected more sites to the CERN Frontiers)
      • OS upgrade to CC7 for all Rucio services
        • one server was automatically pushed to the load balancer before it was fully configured - immediately fixed
        • a missing proxy certificate on the Rucio servers caused deletion problems (again promptly fixed and did not cause any damage - just less free scratch space)
        • 5 sites with DPM+WebDAV servers configured a long time ago by YAIM had an incompatible list of ciphers for HTTPS - fixed without causing trouble
        • expired proxy certificate on FTS
      • FTS - expired ATLAS proxy
        • our script that delegates a refreshed proxy certificate did not work on CC7, so the existing proxy (valid for 96 hours) expired at 23:00 on Saturday
        • traced to the key stored in the credential cache on the FTS server, which had been signed with MD5; CC7 no longer supports this signature algorithm (see the sketch below)
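
        A minimal sketch of the signature check implied above, assuming a local proxy file at a hypothetical path (the FTS credential cache itself lives in its database; this only illustrates how an MD5-signed element can be spotted with the Python cryptography package):

          import re
          from cryptography import x509
          from cryptography.hazmat.backends import default_backend

          with open("/tmp/x509up_atlas", "rb") as f:   # hypothetical proxy file path
              pem_data = f.read()

          # a proxy file typically contains the proxy certificate, a key and the issuer chain;
          # extract only the certificate blocks and report their signature hash algorithms
          for block in re.findall(b"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----",
                                  pem_data, re.DOTALL):
              cert = x509.load_pem_x509_certificate(block, default_backend())
              algo = cert.signature_hash_algorithm.name
              print(cert.subject.rfc4514_string(), "->", algo)
              if algo == "md5":
                  print("  !! MD5-signed element - regenerate this credential")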

  • CMS reports ( raw view) -
    • no major issues
    • Large Monte Carlo production for Moriond 2017 in progress.
    • Data on two tapes lost at JINR, about 4 TB (GGUS:125473, GGUS:125501). We expect more tapes to go bad due to the heavy seeking caused by interleaved datasets. Discussions with the site/storage admins are in progress to set up file families and tune Enstore/dCache at the site. Some datasets are being regenerated to overcome the staging bottleneck and stay on schedule with the MC production.
      • Stephan: around 800 files were lost from the first tape, but they had replicas elsewhere; the second tape may be similarly affected
      • Maarten: if the matter gets a lot worse, a short Service Incident Report will be requested from JINR
    • Tier-0 finished processing of 2016 collision data
      • ~1.5 PB transfer backlog
        • Xavi: can you provide more details? Would you need more GridFTP doors?
        • Stephan: the backlog is in transfers from EOS to T1 sites; it is "just" work to be done, nothing bad
      • flocking of production jobs to the Tier-0 pool was enabled on Friday
    • Some remaining issues with CouchDB/ASO on the user analysis side due to high load.
    • PhEDEx/FTS upgrade progressing at sites: 5/7 Tier-1 and 15/56 Tier-2 sites still need to upgrade

  • ALICE -
    • CERN: VOMS service misbehaving since Sat late evening
      • Team ticket GGUS:125521 opened Sun evening
      • see computing services report below
    • Very high activity since yesterday, 100k+ jobs for many hours today

  • LHCb reports ( raw view) -
    • CERN: VOMS service misbehaving since Sat late evening
    • Productions running fine. More data than expected from proton-ion and ion-proton collisions.

Sites / Services round table:

  • ASGC: NTR
  • BNL: NTR
  • CNAF: NTR
  • EGI:
  • FNAL:
    • we will upgrade FTS and PhEDEx on Dec 14
  • GridPP:
  • IN2P3: On Sunday afternoon, IN2P3-CC had an incident with its NIS server, which had to be rebooted. We are sorry that, because of this, 110 grid jobs terminated in error, mainly from ALICE.
  • JINR: All is OK, but almost no stage-in requests from the middle of the week until this morning.
  • KISTI: NTR
  • KIT: ALICE XRootD servers were under high load and were failing during the weekend. They were restarted and are OK now. A portion of job submissions to the ARC CEs is failing due to GridFTP timeouts.
  • NDGF: NTR
  • NL-T1: NTR
  • NRC-KI:
  • OSG:
    • we will continue following up with CMS sites on their PhEDEx upgrades
  • PIC:
    • reminder of tomorrow's downtime for upgrading dCache etc.
  • RAL: We are planning to update firmware on some disk servers on Wednesday 14th. We are testing this at the moment. We will set a GOCDB warning.
    • John: some 25 to 30 machines are involved
  • TRIUMF: NTR

  • CERN computing services:
    • lcg-voms2.cern.ch failures since Sat evening (OTG:0034703)
      • incident resolved as of Mon morning
      • Gavin: there appears to have been an IPv6-related problem after the service VMs
        were migrated during the weekend; we have asked the network team to look into it
  • CERN storage services:
    • FTS: an issue with credential renewal affected all ATLAS jobs starting from Sat 10 at 23:00 CET (OTG:0034713). The problem also affected the other 2 FTS instances (BNL and RAL). While a mitigation has been applied, a permanent solution to the issue has also been found (to be circulated soon by the devs to the FTS admins).
    • CASTOR upgrade for ALICE and ATLAS Tue Dec 13, 10:00-12:00 (OTG:0034674)
    • CERN AFS external disconnection test postponed to Wed February 1st 2017 09:00 CET (CERN AFS servers will not be available from non-CERN networks. Within CERN, the service stays available.)
      • Federico: LHCb would like the AFS intervention to be postponed a few weeks more,
        to avoid potential problems on WNs that happen to have AFS mounted;
        we would like to avoid unnecessary disruptions while preparing for Moriond
      • Xavi: we will follow up with the AFS team
  • CERN databases:
    • ALICE Online DB switchover test:
      • A Data Guard switchover test will be performed for the ALICE online (ALIONR) database on Wednesday 14th of December. At 10:00 the primary database in P2 and the standby database in the CC will switch roles. There will be a short time at the beginning of the intervention when services will be down. Then IP aliases will be swapped between the 2 databases and services started on the CC cluster. Later on, at 14:00, the switchover back will be performed to restore the original configuration. The aim of this exercise is to verify that all ALICE Online applications are able to survive the disconnection and reconnection to the CC cluster and that they can use the CC cluster for daily operations.
      • Consequences: ALICE Online DB services will not be available at 10:00 and 14:00 for approx. 15-20 minutes. There is also a risk that some applications will not work after the switchover to the CC - identifying such issues is the goal of this exercise (see the reconnection sketch after this report).
      • SSB Article: OTG:0034578
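
      As an illustration of the "survive the disconnection and reconnection" requirement above, a minimal, hypothetical reconnection sketch (the DSN, account and timings are made up; this is not ALICE Online code). Since the IP aliases are swapped between the two databases, reconnecting with unchanged connection parameters should suffice:

        import time
        import cx_Oracle  # assumes the Oracle client libraries are available

        def query_with_reconnect(sql, dsn, user, password, retries=30, wait=30):
            """Run a query, reconnecting if the DB goes away during the switchover."""
            for attempt in range(1, retries + 1):
                try:
                    conn = cx_Oracle.connect(user=user, password=password, dsn=dsn)
                    try:
                        cur = conn.cursor()
                        cur.execute(sql)
                        return cur.fetchall()
                    finally:
                        conn.close()
                except cx_Oracle.DatabaseError as exc:
                    # errors such as ORA-12514 or ORA-03113 are expected while roles are swapped
                    print("attempt %d failed (%s), retrying in %ds" % (attempt, exc, wait))
                    time.sleep(wait)
            raise RuntimeError("database still unreachable after the switchover window")

        # hypothetical usage:
        # rows = query_with_reconnect("select sysdate from dual",
        #                             "alionr.cern.ch/ALIONR", "reader", "secret")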

  • GGUS: Reminder! Year-End info on GGUS: During the Year End holiday period, if you observe a problem with GGUS and you cannot submit a ticket via https://ggus.eu, please contact the KIT site operator (info in https://goc.egi.eu/portal). The KIT Emergency telephone number is +49 721 60828383
  • Monitoring:
    • Draft reports for the November 2016 availability sent around
  • MW Officer: NTR
  • Networks: NTR
  • Security: NTR

AOB:

  • Indico down today 17:00-20:00 UTC for SW update (OTG:0032127)

  • Julia thanked Maria Alandes and Maria Dimou each with flowers for their
    many contributions to WLCG Operations Coordination in the past years
    and wished them well in their new activities in the IT department!