Week of 230424

WLCG Operations Call details

  • The connection details for remote participation are provided on this agenda page.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • Whenever a particular topic needs to be discussed at the operations meeting requiring information from sites or experiments, it is highly recommended to announce it by email to the wlcg-scod list (at cern.ch) to allow the SCOD to make sure that the relevant parties have the time to collect the required information, or invite the right people at the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local: Julia (WLCG), Maarten (Alice + WLCG), Panos (WLCG, Chair)
  • remote: Steve (CERN computing), Alessandro (ATLAS), Borja (Monitoring), Christoph (CMS), Daren (RAL), David B. (IN2P3-CC), David M. (FNAL), Dimitrios (ATLAS), Douglas (BNL), Ivan (ATLAS), Mario (ATLAS), Petr (ATLAS), Priscilla (ATLAS), Ville (), Xavier (KIT)

Experiments round table:

  • ATLAS reports
    • SSO/VOMS incidents: details at the following link
      • SSO is vital for online data-taking experts to access P1 resources. The lack of SSO could have resulted in data loss if the LHC had dumped the beams and prepared for the next fill, as experts were unable to connect to make the necessary changes.
      • Due to the expired certificate, the ATLAS job submission infrastructure could not keep the ATLAS Grid resources utilised for approximately 12 hours on 22.04.2023.
        • Maarten: Which certificate?
        • Priscilla: The CERN Grid CA one.
    • RU-Protvino-IHEP "suspended" in GGUS, not possible to open a ticket with the site (related to EGI decision WLCG MB December 2022). How should we communicate issues with the site?
      • Maarten: Contact info can be found in GocDB and people can request the necessary role to access this information.
      • Ivan: Asked for instructions on how one can request GocDB roles.
      • Maarten: Will follow up and provide a link to the instructions

  • CMS reports
    • CMS is in contact with various sites to get the 2023 pledges configured consistently in all relevant systems
      • Direct communication with site contacts, no GGUS
    • Otherwise nothing major to highlight
    • Panos: The CMS Rucio certificate expired together with the CERN Grid CA one on Saturday; since then CMS Rucio has been unable to do any transfers. A new certificate will be issued and uploaded to the production nodes.
    • Maarten: The issue with the Rucio certificate is probably related to the expiration of the CERN Grid CA one, since a certificate cannot expire after the certificate that has been used to sign it.

  • ALICE
    • NTR
      • Maarten: ALICE was not affected in any major way by the CERN Grid CA certificate expiration.

  • LHCb reports
    • Xavier: From Wednesday to Friday it looked like LHCb had lost all site availability.
    • Borja: ETF test was not reporting to SAM properly. Things are normal since Friday.
    • Maarten: A/R corrections will need to be applied to the affected period.
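
Maarten's remark in the CMS report, that a certificate cannot outlive the certificate used to sign it, can be illustrated with a minimal sketch. It is not the validation logic of any particular service: the function name `effective_expiry` and the host certificate date are hypothetical, and only the CA notAfter value is taken from these minutes. The usable lifetime of a chain is assumed to end at the earliest notAfter among its certificates.

```python
from datetime import datetime, timezone


def effective_expiry(chain_not_after):
    """Return the earliest notAfter in a chain; the chain is unusable after it."""
    if not chain_not_after:
        raise ValueError("empty certificate chain")
    return min(chain_not_after)


# notAfter of the old CERN Grid CA certificate, as quoted in these minutes.
old_ca_not_after = datetime(2023, 4, 22, 11, 20, 16, tzinfo=timezone.utc)
# Hypothetical notAfter of a host certificate signed by that CA.
host_not_after = datetime(2023, 9, 1, 0, 0, 0, tzinfo=timezone.utc)

# The chain stops being usable when the CA certificate expires,
# regardless of the later date carried by the host certificate.
print(effective_expiry([host_not_after, old_ca_not_after]))
```

In this picture, a service certificate signed by the old CA stops working when the CA certificate expires, which matches the need to issue a new certificate for CMS Rucio.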

Sites / Services round table:

  • ASGC:
  • BNL:
  • CNAF: NTR
  • EGI:
  • FNAL: both dCache services are smoothly running the latest version
  • IN2P3: disks on XRootD service for ALICE lost on April 15th have been restored. Data are available again.
  • JINR:
  • KISTI:
  • KIT: Downtime at-risk tomorrow to update the NXOS of a storage switch pair.
  • NDGF: Some Vega ATLAS dCache pools crashed on Sunday and today. Restarted.
  • NL-T1: NTR
  • NRC-KI:
  • OSG:
  • PIC: NTR
  • RAL: NTR
  • TRIUMF:

  • CERN computing services:
    • WLCG IAM upgrade to 1.8.1, 26th at 08:30 1 OTG:0076817
    • HammerCloud ATLAS submission node - VM migration - 09:30 Wednesday 26th - Transparent except for increased network latency for a few seconds.
    • Old CERN intermediate certificate expired at the weekend.
      • Package CERN-CA-certs-20220329-1 was released more than one year ago, containing the certificate that has now expired:
                    # openssl x509 -in /etc/ssl/certs/CERN_Grid_Certification_Authority.pem  -noout -subject -dates      
                    subject= /DC=ch/DC=cern/CN=CERN Grid Certification Authority
                    notBefore=Apr 22 11:10:16 2013 GMT
                    notAfter=Apr 22 11:20:16 2023 GMT
                    
        and also its replacement:
                    # openssl x509 -in /etc/ssl/certs/CERN_Grid_Certification_Authority\(1\).pem  -noout -subject -dates 
                    subject= /DC=ch/DC=cern/CN=CERN Grid Certification Authority
                    notBefore=Mar 29 08:24:22 2022 GMT
                    notAfter=Mar 29 08:34:22 2032 GMT
                    
        This was meant to provide a transparent migration; however:
        • Several services (e.g. internal batch web tool, WLCG IAM) were using an old static container image.
        • Several services (e.g. myproxy ) required a restart to pick up new cert.
        • Several services (e.g. hammercloud, roger; CC7 and python3 urllib3 at least) do not ignore the expired certificate and fail.
        • Several services (e.g. SSO) had configuration hardcoded to a particular CA file.
        • Some combination of all the above.
    • Package CERN-CA-certs-20230421-1.el7 has now been released that no longer contains the expired certificate.
    • Please report continuing service problems; one of the above remedies should help.
    • The next expiry is at least on a Monday morning, in 2032.
    • Lots of actions will follow.
      • Mario: There is another certificate in the CERN bundle expiring on July 29th 2023.
      • Maarten: Will make a note to follow this up.
      • Douglas: Do we have instructions on how people can update their browsers to use the new CERN CA certificate?
      • Maarten: The new certificate and the necessary instructions should be available on ca.cern.ch. It should be enough to just add the new certificate without removing the old one (at least for Firefox).
      • added after the meeting: please see https://ca.cern.ch/cafiles/certificates/list.aspx?ca=grid
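
The `openssl x509 ... -noout -subject -dates` output quoted above can be turned into a simple expiry check, for instance to spot a certificate in the bundle that is close to expiring. The following is a minimal sketch, assuming the output format shown in these minutes and an English/C locale for month names. The helper names `parse_openssl_dates` and `days_until_expiry`, and the reference time used in the example, are hypothetical and not part of any CERN tooling.

```python
from datetime import datetime, timezone

# Format of the notBefore=/notAfter= values printed by `openssl x509 -dates`.
# %b is locale dependent; an English/C locale is assumed here.
OPENSSL_DATE_FORMAT = "%b %d %H:%M:%S %Y GMT"


def parse_openssl_dates(output):
    """Parse `openssl x509 -noout -subject -dates` output into a dict."""
    info = {}
    for line in output.splitlines():
        key, sep, value = line.strip().partition("=")
        if not sep:
            continue  # not a key=value line
        key = key.strip()
        value = value.strip()
        if key in ("notBefore", "notAfter"):
            parsed = datetime.strptime(value, OPENSSL_DATE_FORMAT)
            info[key] = parsed.replace(tzinfo=timezone.utc)
        elif key == "subject":
            info[key] = value
    return info


def days_until_expiry(info, now):
    """Whole days from `now` until notAfter (negative once expired)."""
    return (info["notAfter"] - now).days


# Output for the old certificate, copied from these minutes.
sample = """\
subject= /DC=ch/DC=cern/CN=CERN Grid Certification Authority
notBefore=Apr 22 11:10:16 2013 GMT
notAfter=Apr 22 11:20:16 2023 GMT
"""
old_ca = parse_openssl_dates(sample)
# Hypothetical reference time on the day of this meeting.
meeting_day = datetime(2023, 4, 24, 15, 0, tzinfo=timezone.utc)
print(old_ca["subject"], days_until_expiry(old_ca, meeting_day))
```

A negative result marks a certificate that has already expired. Running the same check over every file in a CA directory with a threshold of a few months would flag upcoming expiries early.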

  • CERN storage services:
  • CERN databases:
  • GGUS:
    • A new release is planned for Wed this week
      • Release notes
      • A downtime has been scheduled from 06:00 to 08:00 UTC
      • Test alarms should be submitted as usual
  • Monitoring:
    • Final SiteMon reports for March 2023 sent around
  • Middleware: NTR
  • Networks: NTR
  • Security:

AOB:

  • Mario: ATLAS has switched to the Tape REST API for CERN (CTA) and KIT (dCache) without any problems. We will continue with the rest of the sites.
  • Space tokens presently still require SRM in dCache, a replacement by Quotas is being discussed.

  • NOTE: the operations meeting next Mon will be virtual.
  • You may provide relevant incidents, announcements etc. for the operations record.
  • Have a good May Day holiday!
Topic revision: r30 - 2023-04-24 - JosepFlix
 