Week of 180820

WLCG Operations Call details

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Portal
  • Whenever a particular topic needs to be discussed at the operations meeting requiring information from sites or experiments, it is highly recommended to announce it by email to wlcg-scod@cern.ch to allow the SCOD to make sure that the relevant parties have the time to collect the required information, or to invite the right people to the meeting.

Best practices for scheduled downtimes



Attendance:

  • local: Alberto (Monit), Borja (Chair, Monit), Gavin (Computing), Julia (WLCG), Marian (Network), Michal (ATLAS), Vladimir (LHCb)
  • remote: Darren (RAL), Dave (FNAL), David B (IN2P3), Di (TRIUMF), Jeff (OSG), Kenneth (CMS), Marcelo (FNAL), Xavier (KIT), Xin (BNL), Victor (JINR)

Experiments round table:

  • ATLAS reports -
    • Activities:
      • normal activities
    • Problems
      • T1_DATADISKs getting full in the last week - there are a few PB of small files to delete - they are now mixed with bigger files so that deletion can keep up
      • EOSATLAS namespace powercycle incident (OTG:0045385) - after the issue was resolved, transfers started working but the deletion was still failing (GGUS:136727)
      • central service monitoring grey (GGUS:136733) - there was an issue with the meter cluster during Thursday night; it was solved during the morning and backlog processing finished in the afternoon
      • Taiwan-LCG2 power cut (GGUS:136690) - solved
      • RAL ECHO problems with swapping nodes (bug in Ceph) - solved on Friday
      • transfers from pic to all clouds fail with "Transfer canceled because the gsiftp performance marker timeout of 360 seconds has been exceeded, or all performance markers during that period indicated zero bytes transferred" (GGUS:136778) - started this morning

  • CMS reports -
    • Our difficulties with short jobs continue; for instance, we had a workflow in the system that wanted to create 130M jobs that would run only a few minutes each. We are busy getting these cleaned out of the system and trying to understand why such jobs would need to exist in the first place.
    • This has jammed up the production system and kept us down to an average over the week of 125k cores for production, although we are recovering now; at this writing the value is 172k cores.
    • Analysis averaged 72k cores, presumably picking up some of the production slack.
    • I haven't dug into the details, but the downtime at T1_UK_RAL didn't show up properly in our downtime calendar until Thursday.
    • Have we gotten the SIR from T1_DE_KIT about their incident of the previous week?

SIR from KIT was already there at the time of the meeting.

  • ALICE -
    • Apologies: ALICE operations experts will not attend today
    • Normal activity levels
    • KIT: 15 files lost due to a damaged tape cartridge

Sites / Services round table:

  • BNL: Migrating NFS home directories to a new appliance this Tuesday, should be transparent to grid jobs though.
  • EGI: NC
  • IN2P3: NTR
  • KIT:
    • Uploaded a SIR about the incident with the CMS dCache database of the week before last.
    • Instabilities with ARCs - in particular arc-[13]-kit.gridka.de - continue. Mounting working directories via NFS seems to be unable to sustain the I/O load reliably.
  • NL-T1: NTR (Cannot join. Suddenly I get: Device is not compatible with this version. Has Vidyo been upgraded?)

Yes, Vidyo was updated a couple of weeks ago.

  • NRC-KI: NC
  • OSG: NTR
  • PIC: NC
  • RAL: The upgrade of Echo was completed successfully on Thursday (16/8/18), with greatly reduced memory usage. The cluster was allowed to recover overnight. Everything appeared to be working well on Friday (17/8/18), and there is currently no evidence of data loss. We therefore ended the downtime on Friday at 12:00 UTC (17/8/18). As a precaution for the weekend, we limited the ATLAS (and CMS) quota on our batch farm to 50% of its nominal amount. Assuming we encounter no problems, we intend to lift this on Monday (20/8/18).
  • TRIUMF: Data migration is still ongoing.

  • CERN computing services:
    • Working with the security team on the response to the L1 Terminal Fault vulnerabilities. Batch nodes will likely need a reboot over the next weeks (no huge rush yet). Service nodes and their hypervisors will also need a reboot. Detailed announcements will be made via the IT SSB.
  • CERN storage services: NC
  • CERN databases: NC
  • GGUS: NC
  • Monitoring: NTR
  • MW Officer: NC
  • Networks: GGUS:136332 GGUS:136606 IHEP-CN to RU network performance issues - current route goes via Internet2
  • Security: NTR


Topic revision: r19 - 2018-08-20 - BorjaGarridoBear