  • local: -
  • remote: Andrew (TRIUMF), Borja (Chair, Monitoring), Darren (RAL), Dave (FNAL), David (IN2P3), Gavin (Computing), Julia (WLCG), Maarten (ALICE), Michal (ATLAS), Pepe (PIC), Remy (Storage), Vincent (Security), Vladimir (LHCb), Xin (BNL)

Experiments round table:

  • ATLAS reports -
    • Activities:
      • Ongoing reprocessing
      • COVID-19
        • We have decided to focus on "Folding@Home"
        • Currently using 10% of T0 resources (4k slots) to test the workflow in production
        • Under deployment: unpledged resources, i.e. Sim@P1
        • Possible to add further grid sites within existing distributed computing infrastructure
    • Issues:
      • "gsiftp performance marker timeout" transfer failures from IN2P3-CC (GGUS:146325)
      • IN2P3-CC Frontier was degraded (GGUS:146353)
        • one node was affected by OpenStack incident
      • "No such file or directory" transfer failures from INFN-T1 (GGUS:146367)
        • problems with restarting services after a reboot - fixed
      • FZK-LCG2: job stage-out timeouts (GGUS:146356)
        • being investigated with the help of rucio devs
      • Transfers to FZK tapes failed with "Request timed out" (GGUS:146386)
        • a communication issue for the dCache tape cache pool
      • Deletion failures at RAL (GGUS:146360)
        • the deletion rate was too high - after it dropped, errors disappeared
      • Deletion failures at BNL (GGUS:146365)
        • the deletion rate was too high - after it dropped, errors disappeared
      • Deletion errors at INFN-T1 (GGUS:146411)
        • file limits for storm-webdav service increased
      • monitoring

  • CMS reports -
    • Likely nobody from CMS will be available for the call due to overlapping meetings - sorry.
    • In any case, no major problems

  • ALICE -
    • Mostly business as usual.
    • No major issues.
    • Folding@Home contributions started on Fri:
      • Up to 5k concurrent jobs (up to 4% of the resources).
      • 30k+ jobs done so far, < 1k errors.
      • ALICE site contributions are shown on the CERN team page.

Sites / Services round table:

  • ASGC: NC
  • BNL: HTCondor was upgraded on the CEs, after which there was a problem with group quota scheduling (starving mcore jobs). A workaround is in place and production is back to normal. Investigation continues with the HTCondor team.
  • EGI: NC
  • FNAL: Access to COVID-19 flows is open; for the time being only tests are submitted.
  • IN2P3:
    • The 110 TB ALICE disk server that was down last week was put back into production last Tuesday, March 31st (RAID card changed)
    • SAM tests failing for ATLAS and CMS: under investigation.
  • KIT:
    • Downtime on Wednesday was a success in most regards.
      • We've received reports from LHCb that access to some files is very slow (GGUS:146379). So far we could not find the origin of that issue.
      • Additionally, we've updated dCache to the latest released version. In the aftermath of that update, however, we experienced unreliable internal communication between certain dCache services, as reported by ATLAS (GGUS:146386). Those issues have been resolved since Friday.
    • Regarding the stage-out issues reported by ATLAS (GGUS:146356), we're waiting for more information about the actual task performed during stage-out. As far as we can tell, the computing node, storage element and network are all working just fine. Due to a complete lack of information about the stage-out, we cannot help resolve the issues any further.
  • NL-T1: NC
  • NRC-KI: NC
  • OSG: NC
  • PIC: Routine incidents in the datacentre are accumulated and fixed once per week by one person going on site. This is transparent for the users and working fine so far.
  • RAL: NTR

  • CERN computing services: NTR
  • CERN storage services:
    • EOSATLAS instance update tomorrow, 7th of April, at 10 AM. OTG:0055687
    • Production FTS ATLAS instance will be down on the 7th of April from 9:00 to 9:30 due to a database host and storage migration. OTG:0055665
  • CERN databases:
  • Monitoring:
    • Draft reports for the March 2020 availability sent around
    • Proposal to change current FTS efficiency plot from "average" to "time weighted average"
      • See attached slides
      • Agreed already by some VOs

Consensus on going ahead with the FTS efficiency change.
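
For readers without the slides, the difference between the two averaging schemes can be sketched as follows. This is only an illustration, not the monitoring team's implementation: it assumes that "time weighted average" means weighting each efficiency sample by the length of the time bin it covers, and the function names and the (efficiency, duration) sample layout are hypothetical.

```python
from typing import Sequence, Tuple

# Hypothetical sample layout: (efficiency in [0, 1], bin duration in seconds).
Sample = Tuple[float, float]


def plain_average(samples: Sequence[Sample]) -> float:
    """Unweighted mean: every bin counts equally, whatever its length."""
    if not samples:
        raise ValueError("no samples")
    return sum(eff for eff, _ in samples) / len(samples)


def time_weighted_average(samples: Sequence[Sample]) -> float:
    """Mean in which each bin's efficiency is weighted by its duration (assumed definition)."""
    total_time = sum(duration for _, duration in samples)
    if total_time <= 0:
        raise ValueError("total duration must be positive")
    return sum(eff * duration for eff, duration in samples) / total_time


# One hour at 100% efficiency followed by a ten-minute bin at 0%.
example = [(1.0, 3600.0), (0.0, 600.0)]
plain = plain_average(example)
weighted = time_weighted_average(example)
```

In this example the plain average treats the short bad bin as equal to the long good one, while the time-weighted value stays closer to the efficiency seen for most of the period.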

  • MW Officer: NC
  • Networks: NTR
  • Security: NTR


  • NOTE: the operations meeting next Mon will be virtual.
  • You may provide relevant incidents, announcements etc. for the operations record.
  • Have a good Easter break!

