Week of 200427

WLCG Operations Call details

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-scod@cern.ch, so that the SCOD can make sure the relevant parties have time to collect the required information, or can invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local: -
  • remote: Andrei (DB), Andrew (TRIUMF), Borja (Chair, Monitoring), Christoph (CMS), Darren (RAL), Dave (FNAL), Elena (CNAF), Gavin (Computing), Maarten (ALICE), Maria (Storage), Pepe (PIC), Renato (LHCb), Sabine (ATLAS), Sang-Un (KISTI), Xin (BNL)

Experiments round table:

  • ATLAS reports -
    • Stable running except on Wednesday 22: the cvmfs deployment of a rucio release (no issue with the release itself) led to massive SIGBUS failures and massive site blacklisting. Rolling back to the previous config and killing the zombie jobs that appeared allowed job slots to be recovered. The issue was difficult to diagnose (no meaningful logs) and was fully understood on Thursday: the rucio path in cvmfs had been partly overwritten, and a mixture of cached and new .pyc files was deployed in the pilot, breaking links. This was not detected by the test. The deployment procedure has been updated to prevent this. Running at full capacity again since the end of Thursday afternoon.
    • Deployment of unified queues continuing
    • Running 40k-50k COVID jobs (will decrease when CERN-P1 is used elsewhere)
    • Network question: INFN-T1 is complaining because it receives too much data over the general network. This was due to data consolidation activities, with data coming from LRZ, which has very good connectivity but is not connected to LHCONE (GGUS:146661). How should this be handled (multihop does not seem a good idea, and FTS cannot limit transfers with respect to the network)? What are the site (T1) requirements?

We will involve the Network team so they can provide advice on the question. Maarten already mentioned that he does not think there is an immediate solution, but it should be discussed, as it is a valid complaint.

Here is the advice provided by the Network team:

  • The initial recommendation from the Network team is for LRZ to join LHCONE.
  • If this is not doable, the advice is to send this data via sites that can support such heavy loads, such as CERN.

  • ALICE -
    • NTR

  • LHCb reports -
    • Activity:
      • Preparing for the next Stripping round
      • Ongoing WG MC productions
    • Issues:

Sites / Services round table:

  • ASGC: NC
  • BNL: NTR
  • CNAF: NTR
  • EGI: NC
  • FNAL: NTR
  • IN2P3: NTR
  • JINR: NTR
  • KISTI: NTR
  • KIT: NTR
  • NDGF: NC
  • NL-T1: NC
  • NRC-KI: NC
  • OSG: NC
  • PIC: NTR
  • RAL: NTR
  • TRIUMF: NTR

  • CERN computing services: NTR
  • CERN storage services: NTR
  • CERN databases: A number of Oracle databases, including CMSONR, will be unavailable due to a network intervention next Wednesday morning (OTG:0054761)
  • GGUS:
    • A new release is planned for Wed this week
      • Release notes
      • A downtime has been scheduled for 06:30-09:30 UTC
      • Test alarms will be submitted as usual
  • Monitoring:
    • Final reports for March 2020 availability sent around
  • MW Officer: NC
  • Networks: NTR
  • Security: NTR

AOB:

Topic revision: r21 - 2020-04-27 - MaartenLitmaath
 