Week of 191028

WLCG Operations Call details

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Portal
  • Whenever a particular topic requiring information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-scod@cern.ch, so that the SCOD can make sure the relevant parties have time to collect the required information, or can invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local: Alberto (Monitoring), Borja (Chair, Monitoring), Claire (ATLAS), Olga (Computing), Renato (LHCb), Roberto (Storage), Vincent (Security)
  • remote: Andrew (NL-T1), Caio (CMS), Darren (RAL), Dave (FLAB), David (IN2P3), Di (TRIUMF), Marcelo (CNAF), Mike (ASGC), Pepe (PIC), Xavier (KIT)

Experiments round table:

Vincent mentioned that, while we do have probes to check whether services are up and running, we have none to check who has access. In any case, it seems it was read-only access, but such things should be prevented.

  • ALICE -
    • NTR, at least until Sun late evening

  • LHCb reports (raw view) -
    • Activity:
      • MC, user jobs and data restripping.
      • Continuing staging (tape recall) at all T1s
    • Issues:
      • CNAF: Files cannot be accessed (ticket: GGUS:143816, in progress)
      • GridKa: 1 ARC CE unavailable (ticket: GGUS:143814, in progress)
      • IN2P3: It's the only T1 that is not ready for singularity. No representative!

Renato raised two points for IN2P3: LHCb needs a representative to talk to, and he asked when the site will update to CC7 and provide singularity. David mentioned that a new representative will be elected in early November and that they had just had a meeting on the adaptation to singularity; he will give more news about this when ready.

Sites / Services round table:

  • ASGC:
    • Will have a downtime from 2019-11-04T02:00:00 to 2019-11-04T14:00:00 (UTC) for power maintenance.
  • BNL: NTR
  • CNAF: NTR
  • EGI: NC
  • FNAL: On Friday, October 25, a water leak was found in pipes serving the building that houses the CMS T1 tape libraries. The water supply to the building had to be shut down, preventing humidity control for the libraries. Humidity levels dropped below safe operating levels by Friday evening, and the decision was taken to pause the CMS tape libraries (no tape activity) to prevent damage/data loss. They remained paused through the weekend, and we are evaluating options today, as the repair looks to be a lengthy process. Once water and humidity control are restored to the building, the tape libraries are expected to remain paused for an additional 1-2 days to re-acclimatize the tapes before tape access can be restored.
  • IN2P3: IN2P3-CC will be in maintenance on Nov. 26th, a Tuesday. As usual details will be available one week before the event. CEs and SEs are foreseen to be in downtime for the whole day.
  • JINR: NTR
  • KISTI: NTR
  • KIT:
    • GGUS:143699: We've recovered from the dCache database incident for LHCb on Saturday, Oct 19th. The usual service quality and reliability are provided again. However, we found that 76 files were lost from dCache due to the incident, 39 of which LHCb was able to recover from other sources.
    • GGUS:143814: We've just declared a downtime in GOCDB regarding the inability to successfully submit new jobs to ARC-CEs 3 and 4 at KIT (GOCDB:27913). The issue is under investigation and we will of course end the downtime as early as possible.
  • NDGF: NC
  • NL-T1: NTR
  • NRC-KI: NC
  • OSG: NC
  • PIC: A couple of downtimes announced, Pepe will add more information about them.
  • RAL: NTR
  • TRIUMF: NTR

  • CERN computing services: On Friday the Argus nodes returned authentication errors because daemons were not restarted after a certificate update. GGUS:143812.
  • CERN storage services: NTR
  • CERN databases: NC
  • GGUS: NTR, at least until Sun late evening
  • Monitoring: NTR
  • MW Officer: NC
  • Networks: NTR
  • Security: NTR

AOB:

  • NOTE: Next week the CHEP 2019 Conference will be taking place in Adelaide.
  • The operations meeting next Mon will be virtual.
  • You may provide relevant incidents, announcements etc. for the operations record.


Topic revision: r19 - 2019-10-30 - MaartenLitmaath
 