Week of 200518

WLCG Operations Call details

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • Whenever a particular topic requiring information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-scod@cern.ch, so that the SCOD can make sure the relevant parties have time to collect the required information, or can invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local:
  • remote: Miro (Chair, DB), Christoph (CMS), Maarten (ALICE), Vladimir (LHCb), Xavier (KIT), Jose (PIC), Dave (FNAL), Darren (RAL), Onno (NL-T1), Alberto (Monitoring), Vincent (Security), Ivan (ATLAS), Julia (WLCG), Borja (Monitoring), Elena (CNAF), Kate (DB)

Experiments round table:

  • ATLAS: NTR

  • ALICE -
    • NTR

  • LHCb reports -
    • Activity:
      • Ongoing WG MC productions, Heavy Ion 2018 stripping validation
    • Issues: Nothing new to report

Sites / Services round table:

  • ASGC: NC
  • BNL: NTR
  • CNAF: NTR
  • EGI: NC
  • FNAL:
    • Suffered a sitewide DNS failure on Saturday in which CNAMEs were wiped out for about 4-5 hours. During the incident, some of the FNAL endpoints disappeared from ETF and the ETF check_mk. Maarten pointed out in the meeting that this means the CMS VO feed depends on valid DNS entries, which needs to be followed up on the CMS side: ETF should have continued to show the hosts in its interface, marked as failing. Will follow up in CMS; reopened the CMS GGUS ticket: GGUS:146998
  • IN2P3: a dCache server crashed yesterday morning and was restarted this morning. An LHCb TEAM ticket (#147004) was raised for file unavailability and is now closed.
  • JINR: NTR
  • KISTI: NC
  • KIT:
    • The CMS dCache database crashed on Tuesday morning with a segmentation fault. We spent the entire day restoring it from Monday's full backup. We are not sure of the cause. It might be related to the frequently run dump procedure, which is supposed to produce a dump within 10 minutes but now somehow takes many hours, so that two dump procedures overlapped (and 90 minutes later the database crashed). We continue investigating; the cron job producing the dumps has been suspended for the time being.
    • Tomorrow at 10 a.m. CEST our tape system will be in downtime. GOC-DB downtimes have been announced separately for LHCb and for the others.
  • NDGF: NC
  • NL-T1: Downtime tomorrow: https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=28736
  • NRC-KI: NC
  • OSG: NC
  • PIC: NTR
  • RAL: NTR
  • TRIUMF: NTR (Canadian Holiday today)

  • CERN computing services: NTR
  • CERN storage services: NTR
  • CERN databases: NTR
  • GGUS: NTR
  • Monitoring:
    • We had an issue running the MonALISA collector for XRootD transfers, which resulted in the loss of 24 hours of data, between 05/13 10:00 AM and 05/14 10:00 AM. This was the first time this has happened, and we have already identified the changes needed to make sure no data are lost if it happens again. Sorry for the inconvenience.
    • OTG:0056514 - Issue with ETF HTCondor pool caused SAM/ETF CMS job submissions to be stale for several hours
  • MW Officer: NTR
  • Networks: NTR
  • Security: Ongoing incident(s) in multiple HPC sites in multiple European countries (and beyond): https://csirt.egi.eu/academic-data-centers-abused-for-crypto-currency-mining/ (TLP:WHITE). A new broadcast will be sent soon.

AOB:

Topic revision: r21 - 2020-05-18 - DavidMason
 