Week of 180219

WLCG Operations Call details

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Portal
  • Whenever a particular topic requiring information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local: Kate (WLCG, DB), Julia (WLCG), Maarten (ALICE, WLCG), Belinda (storage), Michal (ATLAS), Vincent (sec), Marian (networks), Alberto (monitoring), Borja (monitoring)
  • remote: Jens (NDGF), Dennis (NL-T1), Stephan (CMS), Marcelo (CNAF), Xavier (KIT), Darren (RAL), Di Qing (TRIUMF), Dave M (FNAL), Vladimir (LHCb), Elizabeth (OSG), Pepe (PIC)

Experiments round table:

  • ATLAS reports ( raw view) -
    • Activities:
      • normal activities
      • second half of reprocessing started
    • Problems
      • some overlay tasks are causing Frontier degradation - the cap on the number of jobs was decreased
      • Rucio overload - sites could see a decrease in the number of multi-core jobs and an increase in transferring jobs
      • Deletion at BNL failed (GGUS:133551) - configuration updated
      • Transfers to RAL-LCG2-ECHO fail with "Address already in use" (GGUS:133399) - fixed
      • Transfer failures from INFN-T1 via RAL FTS (GGUS:133320)
      • Transfer from CERN-PROD_datadisk fail with "No such file or directory" (GGUS:133414) - still being investigated
      • Transfers from IN2P3-CC_DATATAPE fail with "Changing file state because request state has changed" (GGUS:133545)
      • Transfers from TAIWAN-LCG2_DATADISK fail with SRM_INVALID_PATH (GGUS:133546) - files not stored into the DPM successfully, will be declared lost

  • CMS reports ( raw view) -
    • no major grid computing issues
    • SAM3 wasn't updating for a few days at the end of last week; it has recovered
    • running smoothly at about 160k cores
      • over 40% analysis share during the week, dropping to about 25% over the weekend
    • mid-week global run of the experiment this week

  • ALICE -
    • Normal activity level on average
      • It was low on Wed and Thu

  • LHCb reports ( raw view) -
    • Activity
      • HLT farm fully running
      • MC simulation and user jobs
      • 2017 data restripping is about to be started
    • Site Issues
      • NTR

Sites / Services round table:

  • ASGC: nc
  • BNL: nc
  • CNAF: Recovery followup
    • Still working on the 1st power line to restore full continuity
      • At present UPS only for storage and network
      • Waiting for the green light from support to power the farm on again (another power cut is probable)
    • Tomorrow starting the cabling of new storage (2017 tender)
      • When ready, it will be used for LHCb
    • Ready to put the disk storage for CMS and ATLAS into production
      • Coordination with the experiments is needed to reopen services
      • ATLAS tape buffer already in production
    • ALICE: reinstalling servers (disk ready)
    • CMS and ATLAS have opened a ticket to trace restart activities
    • Installation of WNs at CINECA will start next week
      • 500 Gbps VPN (RTT: 0.4 ms)
      • ~ 170 kHS06
      • CentOS 7
      • Singularity
    • ATLAS ticket for issues with data transfers between RAL and CNAF (GGUS:133320): the issue should be fixed, but still needs to be tested.
    • GDB presentation on 14th February

| Service | VO | Status | Expected restart date | Readiness | GGUS ticket | CNAF comment | VO comment |
| Electric power line | - | Maint. 20.02, 21.02 | | Not Ready | | Primordial for the rest | |
| Tape buffer | ALICE | OK | - | Production | | | |
| Tape buffer | ATLAS | OK | - | Production | | | |
| Tape buffer | CMS | OK | | Ready | | | |
| Tape buffer | LHCb | Being Rebuilt | | Not Ready | | | |
| Disk | ALICE | Parity OK | | Ready | | | |
| Disk | ATLAS | Parity OK | | Ready | | | |
| Disk | CMS | Degraded Parity | | Ready | | raid5 in a few LUNs, raid6 in the others | Disks to be replaced |
| Disk | LHCb | Degraded Parity | Production in March | Not Ready | | raid5 in all LUNs | |
| Computing farm | - | | | Not Ready | | | |

Maarten expressed thanks to CNAF for both the extensive report and the work done

  • EGI: nc
  • FNAL: NTR
  • IN2P3: IN2P3-CC will have a scheduled maintenance on Tuesday, March 13th. As usual, details will be available one week before the event.
  • JINR: NTR
  • KISTI: nc
  • KIT: Issues with ATLAS' tape dCache pools on Saturday early morning resolved by on-call engineer.
  • NDGF: Downtime starting after today's meeting for one of our biggest CEs, Abisko. Should be done within 48 hours.
  • NL-T1:
    • We are still having problems with our virtualisation platform which hosts most of our core services.
    • Investigations have pointed to issues on the iSCSI layer (the spectre microcode patches are probably not at fault here).
    • There have been nearly daily occurrences of a sudden drop in throughput to the backend storage with stalling systems and services as a result.
    • This resulted --again-- in service unavailability over the weekend until we managed to migrate at least a few key services to another platform.
    • There will be some planned downtime the coming week to migrate other VMs away from the troubled systems while we keep investigating the iSCSI issues with the vendor.
  • NRC-KI: nc
  • OSG: NTR
  • PIC: ATLAS overlay jobs have caused storage overload and the number of job slots had to be reduced
  • RAL: NTR
  • TRIUMF: NTR

  • CERN computing services: NTR
  • CERN storage services:
    • EOSALICE is suffering from frequent crashes these days: identified root cause, fixed version will be deployed today or tomorrow
  • CERN databases: the ALICE online database suffered a short downtime on Friday; it was related to a storage intervention done earlier that day
  • GGUS: NTR
  • Monitoring:
    • SAM monthly reports: Waiting for recomputations on the draft distributed last week.
    • XrootD ATLAS FAX (US) data not included in the WLCG reports.
  • MW Officer: NC
  • Networks: NTR
  • Security: NTR

AOB:
