Week of 200406

WLCG Operations Call details

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • Whenever a particular topic needs to be discussed at the operations meeting requiring information from sites or experiments, it is highly recommended to announce it by email to wlcg-scod@cern.ch to allow the SCOD to make sure that the relevant parties have the time to collect the required information, or to invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local: -
  • remote: Andrew (TRIUMF), Borja (Chair, Monitoring), Darren (RAL), Dave (FNAL), David (IN2P3), Gavin (Computing), Julia (WLCG), Maarten (ALICE), Michal (ATLAS), Pepe (PIC), Remy (Storage), Vincent (Security), Vladimir (LHCb), Xin (BNL)

Experiments round table:

  • ATLAS reports -
    • Activities:
      • Ongoing reprocessing
      • COVID-19
        • We have decided to focus on "Folding@home"
        • Currently 10% of T0 resources (4k slots) to test the workflow in production
        • Under deployment: Unpledged resources i.e. Sim@P1
        • Possible to add further grid sites within existing distributed computing infrastructure
    • Issues:
      • "gsiftp performance marker timeout" transfer failures from IN2P3-CC (GGUS:146325)
      • IN2P3-CC Frontier was degraded (GGUS:146353)
        • one node was affected by OpenStack incident
      • "No such file or directory" transfer failures from INFN-T1 (GGUS:146367)
        • problems with restarting services after a reboot - fixed
      • FZK-LCG2: job stage-out timeouts (GGUS:146356)
        • being investigated with the help of rucio devs
      • Transfers to FZK tapes failed with "Request timed out" (GGUS:146386)
        • a communication issue for the dCache tape cache pool
      • Deletion failures at RAL (GGUS:146360)
        • the deletion rate was too high - after it dropped, errors disappeared
      • Deletion failures at BNL (GGUS:146365)
        • the deletion rate was too high - after it dropped, errors disappeared
      • deletion errors at INFN-T1 (GGUS:146411)
        • file limits for storm-webdav service increased
      • monitoring

  • CMS reports -
    • Likely nobody from CMS will be available for the call due to overlapping meetings - sorry.
    • In any case, no major problems.

  • ALICE -
    • Mostly business as usual.
    • No major issues.
    • Folding@Home contributions started on Fri:
      • Up to 5k concurrent jobs (up to 4% of the resources).
      • 30k+ jobs done so far, < 1k errors.
      • ALICE site contributions are shown on the CERN team page.

Sites / Services round table:

  • ASGC: NC
  • BNL: HTCondor was upgraded on the CEs, after which there was a problem with group quota scheduling (starving mcore jobs). A workaround is in place and production is back to normal. Investigation continues with the HTCondor team.
  • CNAF: NTR
  • EGI: NC
  • FNAL: Access to Covid-19 flows open, for the time being only tests are submitted.
  • IN2P3:
    • The 110 TB ALICE disk server that was down last week was put back into production on Tuesday, March 31st (RAID card changed).
    • SAM tests failing for ATLAS and CMS: under investigation.
  • JINR: NTR
  • KISTI: NC
  • KIT:
    • Downtime on Wednesday was a success in most regards.
      • We've received reports from LHCb that access to some files is very slow (GGUS:146379). So far we could not find the origin of that issue.
      • Additionally, we've updated dCache to the latest released version. In the aftermath of that update, however, we experienced unreliable internal communication between certain dCache services, as reported by ATLAS (GGUS:146386). Those problems have been resolved since Friday.
    • Regarding the stage-out issues reported by ATLAS (GGUS:146356), we're waiting for more information about the actual task performed during stage-out. As far as we can tell, the computing node, storage element and network are all working just fine. Due to a complete lack of information about the stage-out, we cannot help resolve the issues any further.
  • NDGF: NTR
  • NL-T1: NC
  • NRC-KI: NC
  • OSG: NC
  • PIC: Routine incidents in the datacentre are being accumulated and resolved once per week by one person going on site to fix them. This is transparent for the users and working fine so far.
  • RAL: NTR
  • TRIUMF: NTR

  • CERN computing services: NTR
  • CERN storage services:
    • EOSATLAS instance update tomorrow, 7th of April, at 10:00 (OTG:0055687)
    • The production FTS ATLAS instance will be down on the 7th of April from 9:00 to 9:30 due to a database host and storage migration (OTG:0055665)
  • CERN databases:
  • GGUS: NTR
  • Monitoring:
    • Draft reports for the March 2020 availability sent around
    • Proposal to change current FTS efficiency plot from "average" to "time weighted average"
      • See attached slides
      • Agreed already by some VOs

Consensus on going ahead with the FTS efficiency change.
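
For readers without the slides, the difference between the two kinds of average can be illustrated with a minimal sketch. It assumes that "time weighted" means weighting each efficiency sample by the duration it covers; the definition actually used in the slides may differ, and the function names and sample values below are hypothetical.

```python
from typing import Sequence


def plain_average(efficiencies: Sequence[float]) -> float:
    """Unweighted mean: every sample counts equally."""
    return sum(efficiencies) / len(efficiencies)


def time_weighted_average(
    efficiencies: Sequence[float], durations_s: Sequence[float]
) -> float:
    """Mean in which each sample is weighted by the time it covers.

    Assumption: this is one common reading of "time weighted average",
    not necessarily the definition used for the FTS plot.
    """
    if len(efficiencies) != len(durations_s):
        raise ValueError("efficiencies and durations_s must have equal length")
    total = sum(durations_s)
    if total <= 0:
        raise ValueError("total duration must be positive")
    return sum(e * d for e, d in zip(efficiencies, durations_s)) / total


# Hypothetical samples: one hour at 100% efficiency, ten minutes at 50%.
effs = [1.0, 0.5]
durs = [3600.0, 600.0]

plain = plain_average(effs)
weighted = time_weighted_average(effs, durs)
print(f"plain={plain:.3f} weighted={weighted:.3f}")
```

With these made-up samples the short low-efficiency interval pulls the plain average down more than the time-weighted one, which is the kind of effect such a change would address.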

  • MW Officer: NC
  • Networks: NTR
  • Security: NTR

AOB:

  • NOTE: the operations meeting next Mon will be virtual.
  • You may provide relevant incidents, announcements etc. for the operations record.
  • Have a good Easter break!
Topic revision: r27 - 2020-04-09 - MaartenLitmaath
 