Week of 180604

WLCG Operations Call details

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Portal
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-scod@cern.ch. This allows the SCOD to make sure that the relevant parties have time to collect the required information, or to invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local: Kate (chair, WLCG, DB), Borja (monitoring), Herve (storage), Vincent (storage)
  • remote: Konrad K (LHCb), Darren (RAL), Andrew (NL-T1), Sang Un (KISTI), Marcelo (CNAF), Christoph (CMS), Di (TRIUMF), Peter L (ATLAS), Dave M (FNAL), Xin (BNL), Victor (JINR), David B (IN2P3), Xavier (KIT), Jens (NDGF)

Experiments round table:

  • ATLAS reports -
    • Overall production smooth running
    • ATLAS is concerned about EOS stability
      • Response and follow-up support have been excellent. CERN-IT, please comment on the general issue.
      • For example, in the month of May significant outages were reported on the 4th, 9th, 25th and 31st.
    • On Friday night a Rucio proxy delegation issue caused all FTS transfers to fail
      • Fixed on Saturday morning, issue understood
      • A couple of misguided tickets were opened, apologies

  • CMS reports -
    • CPU usage: ~245k cores (~200k production, ~45k analysis)
    • CMS EOS outage on May 30th early morning OTG:0044168
    • Agreed to write Streamer files from CMS Storage Manager to EOS pools at Meyrin only GGUS:135413
    • Some trouble with remote xrootd access to CMS EOS at CERN GGUS:135340, GGUS:135460, GGUS:135479 (sorry, too many tickets were opened by various CMS members)
    • All data accidentally deleted last week could be recovered
      • Built-in redundancy and safety margins kicked in
      • Additional CPU resources (for full reprocessing/replaying of several days of data taking) and human effort were required, of course

  • ALICE -
    • NTR

  • LHCb reports -
    • Activity
      • Data reconstruction for 2018 data
    • Site Issues
      • NTR

Sites / Services round table:

  • ASGC: nc
  • BNL: NTR
  • CNAF: Problem with GPFS on some worker nodes that resulted in failed tests for CMS. Under investigation.
  • EGI: nc
  • FNAL: Reminder about downtime on the 20/21st of July
  • IN2P3: Scheduled maintenance on Tuesday 12th June. The dCache SE will be off for upgrade, so no jobs except for ALICE will be possible.
  • JINR: No problems in general. Major upgrade of dCache is planned from 13:00 till 22:00 on June 7 (Thursday). Downtime announced.
  • KISTI: NTR
  • KIT: NTR
  • NDGF: NTR
  • NL-T1: Machines were locking up with CPU soft lockups; the issue went away without any action. Has any other site experienced it?
  • NRC-KI: nc
  • OSG: nc
  • PIC: nc
  • RAL: NTR
  • TRIUMF: NTR

  • CERN computing services: nc
  • CERN storage services:
    • EOSCMS: A locking issue left the instance unresponsive on Wed, 30th May. Manual intervention was necessary to restart the namespace.
    • EOSATLAS: The headnode got killed and then took longer than usual to boot. The FSTs stopped (because they could not talk to the headnode) and, when restarting, asked the MGM for a full list of their local files.
      • Two bugs uncovered and fixed (EOS-2600 and EOS-2601) in the next release, 4.2.24
      • Full report in an e-mail to eos-announce-atlas@cern.ch on 25/05 at 10:56 CEST
Christoph reminded the meeting about the xrootd issue. Herve reported that it was discussed in the EOS-CMS meeting earlier today. The European redirectors are suspected as the culprit, but more investigation is needed before a restart is requested. Peter asked about the expected new release. Herve reported that it will be available within days and deployed transparently.

  • CERN databases:
  • GGUS: NTR
  • Monitoring:
    • Draft reports for the May 2018 availability sent around
  • MW Officer:
  • Networks: NTR
  • Security: NTR

AOB:

Topic revision: r18 - 2018-06-04 - KateDziedziniewicz
 