Week of 181022

WLCG Operations Call details

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Portal
  • Whenever a particular topic requiring information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-scod@cern.ch, so that the SCOD can make sure the relevant parties have time to collect the required information, or to invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local: Julia (WLCG), Kate (WLCG, DB, chair), Maarten (WLCG, ALICE), Vincent (security), Ivan (ATLAS), Roberto (storage), Marcelo (LHCb, CNAF)
  • remote: Andrew (NL-T1), Darren (RAL), David (IN2P3), Dmytro (NDGF), Di (TRIUMF), Dave M (FNAL), Vincenzo (EGI), Sang-Un (KISTI), Xin (BNL)

Experiments round table:

  • ATLAS reports -
    • Transfers
      • Over the weekend we saturated several links (CERN-PIC, IN2P3) with a pile-up premixing test workflow. The workflow has been aborted and the problem should now be mitigated

  • CMS reports -
    • The disk space situation became so critical last week that there was an emergency stop of production around Monday
      • Only Release Validation and recovery workflows were kept running
      • Some cleaning was done on Tuesday and Wednesday
      • The most important campaigns were resumed on Thursday and Friday
    • CERN EOS file access issues GGUS:137830
      • Likely related to a high-I/O workflow targeting ~30k cores at CERN
    • Repeated problems reaching websites hosted on CERN EOS GGUS:137768

  • ALICE -
    • NTR

  • LHCb reports -
    • Activity
      • Data reconstruction for 2018 data
      • User and MC jobs
    • Site Issues

Sites / Services round table:

  • ASGC: nc
  • BNL: NTR
  • CNAF: ATLAS queues are in drain mode for the Storage and Farming intervention scheduled for tomorrow (23/10)
    • The ALICE queues will not be closed
  • EGI: NTR
    • Vincenzo asked for a list of the sites affected by the TLS issue. Maarten replied that he has a list of both the SEs and the CREAM CEs affected. An EGI broadcast is planned; to be followed up offline.
  • FNAL: NTR
  • IN2P3: NTR
  • JINR: Still restoring broken files from tape after last week's hardware crash of a RAID adapter. The process is slow, but no files have been lost so far.
  • KISTI: NTR
  • KIT:
    • Two pool nodes for LHCb froze on Friday and Saturday last week; both had to be rebooted several times. We assume this was caused by an issue with the BIOS version, which we have to resolve as soon as possible.
    • Tomorrow the dCache instance "cmssrm-kit.gridka.de" will be in downtime for updates of dCache, Postgres and GPFS, as well as for the deployment of IPv6 and of high availability for the SRM endpoint.
  • NDGF: Two CEs are having trouble at the moment, and one tape endpoint is not functioning well either. The issues are being investigated.
  • NL-T1: Retired ~1000 cores of 6-year-old compute nodes and added ~1700 new cores (AMD EPYC 7551P)
  • NRC-KI: nc
  • OSG: nc
  • PIC: nc
  • RAL: NTR
  • TRIUMF: Last Friday around 10:00 am (local time) there was a site-wide power outage, caused by a contractor accidentally drilling into a high-voltage duct bank that serves power to the TRIUMF site. Power was restored around 5:00 pm (local time) on Saturday, and all nodes at the old data centre were back online before 8:30 pm. Although we had already moved the majority of services and storage to the new data centre, the WebDAV service for storage was unavailable during the outage and our computing capacity was reduced. In addition, an issue with the diesel backup system during the outage took down lab core network services such as the DNS server for about 3 hours, which also affected our Tier-1 services at the other data centre.

  • CERN computing services: nc
  • CERN storage services: NTR
  • CERN databases: Rolling intervention on CMSR to replace a memory module tomorrow morning (OTG:0046496)
  • GGUS: NTR
  • Monitoring: nc
  • MW Officer: The latest Globus packages were released to EPEL production on Oct 16. There is no urgent need to update: all TLS versions are supported in the released version (a quick protocol-check sketch follows after this list).
  • Networks: NTR
  • Security: NTR
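
For sites wanting to verify whether one of their SE or CE endpoints is affected by the TLS issue discussed above, a direct probe of the negotiated protocol can help. Below is a minimal sketch using only the Python standard library; the endpoint name and port are hypothetical placeholders, not services named in these minutes:

    import socket
    import ssl

    def negotiated_tls_version(host, port, timeout=10):
        # Connect to host:port over TLS and report the negotiated protocol.
        context = ssl.create_default_context()
        # Grid host certificates are often issued by CAs absent from the
        # default trust store; skip verification for this protocol probe only.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return tls.version()  # e.g. 'TLSv1.2' or 'TLSv1.3'

    if __name__ == "__main__":
        # Hypothetical endpoint; substitute the SE or CE to be checked.
        print(negotiated_tls_version("se.example.org", 8443))

An endpoint still limited to legacy protocols will either fail the handshake with an ssl.SSLError or report 'TLSv1' / 'TLSv1.1' here, whereas a service running the updated middleware should negotiate TLS 1.2 or newer.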

AOB:
