WLCG Operations Coordination Minutes, July 7th, 2022

Highlights

Agenda

https://indico.cern.ch/event/1179217/

Attendance

  • local:
  • remote: Andrew (TRIUMF), Borja (monitoring), David Cameron (ATLAS + ARC), David M (FNAL), Doug (BNL), Gianfranco (Bern), Giuseppe (CMS), Julia (WLCG), Maarten (ALICE + WLCG), Mark (LHCb + Birmingham), Matt (Lancaster), Ofer (BNL), Panos (WLCG), Pepe (PIC), Stephan (CMS), Tigran (dCache)
  • apologies:

Operations News

  • the next meeting is planned for Sep 1

Special topics

SRR for federated storage

see the presentation

Discussion

  • Gianfranco:
    • LGTM, but what would be the timeline?

  • Tigran:
    • the technical features are already there
    • it is up to sites to configure such artificial extra shares

  • Julia:
    • the WSSA development is expected to be done in a matter of weeks
    • then we will need to validate the code with the SRR for NDGF

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Mostly business as usual
  • Run 3 started OK

ATLAS

  • Successful start to Run 3!
  • Smooth running with 500-700k cores including 200k from HPC
  • Legacy VOMS servers incident last week (OTG:0071740)
    • No impact on central services since IAM VOMS was working, but the incident unfortunately coincided with an ATLAS tutorial and caused problems for registering new users

Discussion

  • Maarten:
    • a big issue was discovered by accident when we tried to
      understand why one user account kept getting disabled in IAM
    • VOMS-Admin was hammering the VOMS DB with failing connections
    • as a protective measure, the DB then banned the VOMS hosts
    • we needed to keep VOMS-Admin switched off to prevent the
      more critical VOMS services from being impacted
    • a quick fix finally became available by Tue late evening
    • it got implemented and deployed on Wed morning
    • a proper fix was implemented in the days that followed
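
The failure mode described above (a client repeatedly retrying failed DB connections until the DB bans the host) is commonly mitigated with capped exponential backoff. The following is only a minimal sketch of that general technique, not the fix that was deployed for VOMS-Admin, whose details are not recorded in these minutes. The function name `connect_with_backoff`, the `connect` callable and all parameter values are hypothetical.

```python
import random
import time


def connect_with_backoff(connect, max_attempts=5, base_delay=1.0,
                         max_delay=60.0, sleep=time.sleep):
    """Call connect() until it succeeds, waiting longer after each failure.

    `connect` is any zero-argument callable that raises ConnectionError
    on failure (a hypothetical stand-in for a real DB client call).
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up instead of retrying forever
            # Upper bound doubles per attempt and is capped at max_delay;
            # the random draw ("jitter") spreads retries from many clients.
            upper = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, upper))
```

With the defaults, a persistently failing client makes at most 5 connection attempts and then raises the error to its caller, rather than hammering the server indefinitely.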

CMS

  • June was rather quiet, few issues at CMS
    • accidental SAM and HammerCloud dataset deletion at sites, restored quickly
  • running smoothly with 375k cores
    • usual production/analysis split of 75% and 25%
    • significant contribution from HPCs: 20k to 70k cores
    • main production activity Run 2 ultra-legacy Monte Carlo
  • impact of Russian invasion/sanctions significant for CMS Tier-1
    • tape data relocation ongoing
  • deletion of unused datasets and tape space recovery
  • waiting on Python 3 version/port of HammerCloud
  • EOS Erasure Coding issues at CERN and Vienna not yet resolved
  • WebDAV commissioning effectively complete
    • all Tier-0,1,2 sites ready/in production
    • most Tier-3 sites ready/in production; one more site is expected to be ready very soon
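
For reference, the core figures quoted in the CMS report can be turned into absolute numbers with a little arithmetic. The sketch below assumes that the 20k to 70k HPC contribution is measured in cores and is counted within the 375k total; the minutes do not state this explicitly. The function names are illustrative.

```python
# Figures quoted in the CMS report.
TOTAL_CORES = 375_000
PRODUCTION_SHARE = 0.75
ANALYSIS_SHARE = 0.25
HPC_CORES_RANGE = (20_000, 70_000)  # assumed to be cores within the total


def split_cores(total, production_share, analysis_share):
    """Return (production, analysis) core counts for the given shares."""
    return round(total * production_share), round(total * analysis_share)


def hpc_fraction_range(total, hpc_range):
    """Return the HPC contribution as (low, high) fractions of the total."""
    low, high = hpc_range
    return low / total, high / total


production, analysis = split_cores(TOTAL_CORES, PRODUCTION_SHARE, ANALYSIS_SHARE)
hpc_low, hpc_high = hpc_fraction_range(TOTAL_CORES, HPC_CORES_RANGE)
print(f"production: ~{production:,} cores, analysis: ~{analysis:,} cores")
print(f"HPC share: {hpc_low:.1%} to {hpc_high:.1%}")
```

Under these assumptions the split is roughly 281k production cores and 94k analysis cores, with HPCs providing about 5% to 19% of the total.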

LHCb

  • Generally smooth running
  • No significant issues with respect to Run 3 start
  • Discussions (and some progress) ongoing on long-standing Tier 1 tickets/issues
  • Running 100+K jobs at present

  • Question: what is the best way for sites to indicate reboot campaigns of worker nodes in GOCDB?

Discussion

  • Maarten:
    • sites should at least inform their customers when
      it is known there will be a significant impact
      • e.g. many jobs getting killed
    • the GOCDB is the official place to indicate any outages
    • an at-risk / warning outage can be declared when the
      impact is expected to be minor
      • e.g. temporary capacity reduction due to draining + reboot

Task Forces and Working Groups

GDPR and WLCG services

Discussion

  • Julia:
    • customization would typically concern the data retention period

Accounting TF

  • NTR

dCache upgrade TF

  • The majority of sites have enabled SRR on the dCache frontend; some tickets are pending, waiting for content validation.

Information System Evolution TF

  • NTR

IPv6 Validation and Deployment TF

Detailed status here.

Monitoring

  • New flow for XRootD transfer monitoring has been enabled at a few sites; the data produced is being validated/corrected as needed
  • Site network monitoring script has been deployed at AGLT2; integration in MONIT has been requested

Network Throughput WG


WG for Transition to Tokens and Globus Retirement

Discussion

  • Maarten:
    • there will be an AuthZ WG update in the GDB next week

Action list

Creation date | Description | Responsible | Status | Comments

Specific actions for experiments

Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments

Specific actions for sites

Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments

AOB

-- JuliaAndreeva - 2022-07-05
Topic revision: r10 - 2022-07-11 - MaartenLitmaath
 