Week of 180813

WLCG Operations Call details

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Portal
  • Whenever a particular topic needs to be discussed at the operations meeting and requires information from sites or experiments, it is highly recommended to announce it by email to wlcg-scod@cern.ch, so that the SCOD can make sure the relevant parties have time to collect the required information or to invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local: Alberto (Monitoring), Andrei (DB), Borja (Chair/Monitoring), Jiri (ATLAS), Roberto (Storage), Vincent (Security)
  • remote: Christoph (CMS), Dave M (FNAL), David B (IN2P3), Di (TRIUMF), Jens (NDGF), Marcelo (CNAF), Sang Un (KISTI), Stefan (LHCB), Xavier (KIT)

Experiments round table:

  • ATLAS reports -
    • 290-350k grid job slots used, with a peak of 500k job slots on HPCs (CORI)
    • changes in the pilot led to changes in the protocol used to access some resources (see the sketch after this report)
      • CERN EOS is used via SRM by some jobs
      • fix planned for this week (change in Rucio)
    • wrong ST used at PIC for a small fraction of files; issue understood
    • TRIUMF tapes blacklisted for writing (for the duration of a migration)
    • T1_DATADISKs getting full; deletion in progress
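
The Rucio change above concerns which protocol the pilot is handed when replicas are resolved. As a hedged illustration only (not the actual fix), the sketch below shows how a Rucio client can restrict replica resolution to a given scheme, e.g. root instead of srm; the DID is a placeholder and a configured Rucio client environment is assumed.

    # Illustrative sketch, assuming a configured Rucio client (rucio.cfg, valid proxy).
    # The DID below is a placeholder, not a real ATLAS dataset or file.
    from rucio.client import Client

    client = Client()
    dids = [{'scope': 'some_scope', 'name': 'some_file'}]  # hypothetical DID

    # Request only root (xrootd) PFNs, so no SRM endpoint is handed to the job.
    for replica in client.list_replicas(dids, schemes=['root']):
        for pfn in replica['pfns']:
            print(pfn)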

  • CMS reports -
    • CPU resource usage over the week: ~145k cores for Production and ~68k cores for Analysis
      • The CMS Global Pool is suffering from many far-too-short jobs (less than 30 min)
        • Mainly caused by buggy workflows
        • Leads to poor pool utilization and poor CPU efficiency
    • Data loss at KIT: GGUS:136673
      • ~23k files from disk
      • ~2700 files that were meant to go to tape
      • ~70k temporary files waiting to be merged into larger files

  • ALICE -
    • Apologies: ALICE operations experts will not attend today
    • NTR, at least until this afternoon

  • LHCb reports -
    • Activity
      • Data reconstruction for 2018 data
      • User and MC jobs
    • Site Issues
      • NTR

Sites / Services round table:

  • ASGC: NC
  • BNL: NTR
  • CNAF: NTR
  • EGI: NC
  • FNAL: NTR
  • IN2P3: IN2P3-CC will be in scheduled maintenance on Tuesday, September 18th. As usual, details will be available one week before the event.
  • JINR: NTR
  • KISTI: NTR
  • KIT:
    • Several ARC-CEs were disabled by jobs with very large output files (17-22 GB each) filling up the disks used for input/output staging.
    • On Monday a multipath issue made the metadata of our large GPFS cluster inaccessible for about one hour. As a consequence, access to the GPFS storage backend of the ALICE, ATLAS and CMS SEs hung for a while.
    • On Thursday the database of the CMS dCache SE crashed with no space left on the device. Production was moved to the warm stand-by node within the next hour. However, the very next day the database was switched back again, because it became apparent that the stand-by node was lagging about one week behind in changes (which had caused the disk-space exhaustion in the first place); see the replication-lag sketch below. Now CMS only has to invalidate about 320k files that were created between Thursday and Friday...

Jiri asked about the discrepancy between the numbers reported by KIT and CMS; there was no initial idea of the reason, and it will be sorted out between KIT and CMS.
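
As context for the KIT report above, here is a hedged sketch of the kind of replication-lag check it alludes to; it does not describe KIT's actual setup and assumes the dCache SE database is PostgreSQL with streaming replication, with psycopg2 available and placeholder connection details.

    # Illustrative sketch only: measure how far a warm stand-by PostgreSQL node
    # lags behind the primary before switching production to it.
    # Host, database name and credentials are placeholders.
    import psycopg2

    conn = psycopg2.connect(host='standby.example.org', dbname='dcache',
                            user='dcache', password='...')
    with conn, conn.cursor() as cur:
        # On a hot standby this returns the time since the last replayed transaction.
        cur.execute("SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag")
        print("standby replay lag:", cur.fetchone()[0])
    conn.close()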

  • NDGF:
    • One CE is being rebuilt. Another one is in scheduled downtime. So CE capacity is limited this week.
    • dCache pool upgrade on Wednesday. ATLAS and ALICE data will be offline for 5-10 minutes during the morning (not all pools are upgraded at the same time, so as not to affect the ability to write).
  • NL-T1: NTR
  • NRC-KI: NC
  • OSG: NTR
  • PIC: NC
  • RAL: NC
  • TRIUMF: The migration of all disk data to the new storage system at the new data centre is still ongoing and will probably need another two weeks to complete. The network was more or less saturated by this, so data transfers to our tape system at the new data centre were affected; we had to ask ATLAS to delay the T0 data export to our tape system.

  • CERN computing services: NC
  • CERN storage services: NTR
  • CERN databases: NTR
  • GGUS: NC
  • Monitoring: NTR
  • MW Officer: NC
  • Networks: NC
  • Security: EGI's Communication Challenge is mostly finished; a few sites have not responded yet.

Vincent mentioned a bug discovered in a tool used by CMS (SCRAM) that could cause it to go into an endless loop. Bug report, Potential fix

AOB:
