Week of 230403

WLCG Operations Call details

  • The connection details for remote participation are provided on this agenda page.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota.
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to the wlcg-scod list (at cern.ch). This allows the SCOD to make sure that the relevant parties have time to collect the required information, or to invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local: Julia (WLCG), Maarten (ALICE + WLCG)
  • remote: Andrew P (NLT1), Andrew W (TRIUMF), Christoph (CMS), David B (IN2P3-CC), David M (FNAL), Doug (BNL), Gavin (computing), Marian (networks + monitoring), Priscilla (ATLAS), Steven (storage), Vincenzo (CNAF), Xavier (KIT)

Experiments round table:

  • CMS reports (raw view) -
    • CMS is also going to enable user data management with Rucio
      • Requires a few changes in the name space, with proper permissions and mapping of the Rucio proxy
      • A number of GGUS tickets have been opened with sites to coordinate the implementation

  • ALICE
    • NTR

Sites / Services round table:

  • ASGC:
  • BNL: Recent Data Carousel activity has triggered a large number of errors of this type: SOURCE SRM_GET_TURL The source file is not ONLINE. The issue has been reported to the FTS developers and is documented in JIRA tickets FTS-1901 and FTS-1900.
  • CNAF: Tomorrow GARR will carry out an intervention on the LHC/OPN access router. We only expect a performance degradation, not a network outage. https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=33802
  • EGI:
  • FNAL: NTR
  • IN2P3: NTR
  • JINR:
  • KISTI:
  • KIT: In an attempt to fix ongoing issues with the CMS WebDAV SAM tests, we updated the dCache version for the WebDAV doors on Thursday. The update was successful and the observed Java exceptions went away. Sadly, the SAM tests are still failing for a reason that has not yet been determined.
    • Xavier: the tests failed only during weekends
    • Christoph: might be due to productions launched for those weekends
    • David M: those may also have caused errors in our dCache
  • NDGF:
  • NL-T1: Nikhef - A RAID 6 storage array failed last week when two discs failed during the rebuild of a third failed disc. This put 80 TB of local copies of ATLAS data at risk. The hardware support engineer was able to recover the array sufficiently to let it rebuild into a degraded but working state. Another rebuild to replace the second failed disc is currently underway. In parallel with the rebuild, the data is being migrated from the affected dCache pools to known good pools. Currently it appears that one file has been lost, but we will know for sure once the data migration is as complete as possible.
  • NRC-KI:
  • OSG:
  • PIC: Tomorrow, Tuesday April 4th, PIC will be in complete scheduled downtime due to the building's yearly electrical maintenance. PIC will be in OUTAGE from 04-04-2023 04:00 [PIC local time] until 04-04-2023 23:59 [PIC local time]. HTCondor will be stopped at the beginning of the downtime. Have a nice Easter!
  • RAL: NTR
  • TRIUMF: NTR

  • CERN computing services: NTR
  • CERN storage services: NTR
  • CERN databases:
  • GGUS:
    • The monthly update on Wed last week ran into serious problems, resulting in a downtime of 26 hours
      • The update was successful in the end
      • The alarm tests were incomplete and will be redone Wed this week
      • A short SIR has been added to the WLCG SIR archive
    • Instabilities of the e-mail engine were noticed in the last few days
      • Ticket update e-mails were not always sent to all parties involved
      • The cause lies in a reset of a critical field when a ticket is updated, which then also prevents further updates of the given ticket...
        • Update: tickets can still be updated via e-mail
      • The problem is being investigated with top priority
  • Monitoring:
    • Distributed draft SiteMon availability/reliability reports for March 2023
  • Middleware: NTR
  • Networks: High packet loss on the LHCOPN link to KISTI - significant drop in throughput to/from T1s - KREOnet was contacted
  • Security:

AOB:

  • NOTE: the operations meeting next Mon will be virtual.
  • You may provide relevant incidents, announcements etc. for the operations record.
  • Have a good Easter break!
Topic revision: r16 - 2023-04-03 - MaartenLitmaath
 