WLCG Operations Coordination Minutes, May 6, 2021

Highlights

Agenda

https://indico.cern.ch/event/1035151/

Attendance

  • local:
  • remote: Alberto (monitoring), Alessandra (ATLAS + Manchester + WLCG), Andreas (KIT), Andrew (TRIUMF), Borja (monitoring), Catalin (EGI), Christoph (CMS), Concezio (LHCb), Daniel (security), David Cameron (ATLAS + ARC), David Cohen (Technion), Eric (IN2P3), Giuseppe (CMS), Julia (WLCG), Maarten (ALICE + WLCG), Marian (networks + monitoring), Matt (Lancaster), Mihai (FTS), Nikolay (monitoring), Panos (WLCG), Pedro (monitoring), Renato (CBPF), Shawn (networks), Stephan (CMS), Thomas (DESY)
  • apologies:

Operations News

  • on June 3-4 there will be the WLCG Token Hackathon for CEs

  • hence the next Ops Coordination meeting is planned for June 10

Special topics

SiteMon migration to CRIC topology, wrap up

see the presentations

Discussion

  • Julia:
    • this concludes a long process of several months during which
      a significant number of obstacles were found and fixed:
      thanks to all parties involved!

DC monitoring workshop summary

see the presentation

Discussion

  • Julia: is the info collected from sites static?
  • Alessandra: yes, e.g. the expected bandwidth
  • Julia: might we put such info into CRIC?
  • Alessandra:
    • it would imply development work
    • we need to allow comments and descriptions
  • Julia: we can look into it when the requirements have become clear

  • Alessandra: Xrootd info needs to be massaged and there must be no duplicate info
  • Julia:
    • ALICE info in the transfer dashboard comes from Xrootd clients
    • didn't we foresee changing to client-side monitoring for all experiments?
  • Alessandra: the Xrootd monitoring architecture is to be reviewed indeed

  • Julia:
    • the CRIC usage for network aspects is being discussed with Shawn and Marian
    • it concerns network topology and perfSONAR
    • there will be another meeting next week
  • Alessandra: I will join

  • Borja:
    • the Monit team would like to be more directly involved in these matters
    • that approach has worked well for the needs of CMS and ATLAS in the past
  • Alessandra:
    • we have a Mattermost channel for those who actually are driving activities
    • I will send the details

  • Julia: are there regular meetings?
  • Alessandra: only between the core people ATM
  • Julia: with regular Ops Coordination reports, more volunteers may be found
  • Alessandra:
    • we already have regular reports to DOMA and the GDB
    • discussion slots could be helpful instead

Middleware News

Discussion

  • Maarten: is ATLAS OK with the campaign to get ARC sites to upgrade to >= 6.10.2?
  • David Cameron:
    • yes, but note there will be a 6.12 release soon to adjust the APEL workflow
    • sites would need to upgrade to that before the old message brokers are stopped
    • it looks unlikely that all sites will have upgraded by the end of May
    • it would hence be good if those brokers could still be there in June
  • Catalin: we can look into that
  • Julia: there was no response yet from sites willing to help validate the new code
  • Maarten:
    • the ARC team could send another message e.g. to the wlcg-arc-ce-discuss list
    • we do not want to bother sites with multiple upgrades in quick succession
    • we thus wait for 6.12 to become available and validated

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Occasional drops in activity have occurred due to:
    • lack of MC productions;
    • Task Queue DB overloads.
  • More use of 8-core job slots is expected in the coming months.
    • Already in production or tested at a few sites.
    • Driven by new JAliEn CE services on the sites' VOboxes.

ATLAS

  • Mostly stable production with around 500-700k cores running, including a substantial contribution (up to 250k powerful cores) from the new Vega HPC in Slovenia
  • HTTP-TPC - ATLAS deadline 31 May for T1/2
    • 17 sites still on SRM/GridFTP, mainly:
      • 5 OSG sites waiting for Xrootd update
      • 6 Storm / 2 Echo sites
  • Several new campaigns coming up including new workflows (resimulation of existing HITS, new fast simulation, and MC overlay)
  • Oracle 19c update completed for ATLAS offline DB (CERN, TRIUMF, IN2P3) mostly smoothly. ADC DB (panda, rucio) to be scheduled
  • Bug in ARC CE interaction with HTCondor batch system versions 8.9.9 and 9.0.0 - sites should not upgrade to these HTCondor versions until the bug is fixed in ARC
  • Started discussing and defining operational procedures for critical services for situations where SNOW response is too slow

Discussion

  • Julia: we will follow up on the support procedures for critical services

CMS

  • Christoph Wissing will be DOMA Coordinator from CMS side
  • running smoothly at around 340k cores
    • usual production/analysis split of 3:1
    • main production activity Run 2 ultra-legacy Monte Carlo
    • HPC allocations contributing about 30k cores
  • CMS FTS3 stuck/overload end of April
    • GSIFTP/GridFTP session limit reached at several sites
  • WebDAV storage endpoint setup and commissioning in progress
    • good progress, about 11 Tier-2 disk sites remaining
    • will start on Tier-3 disk sites next/soon
  • don't have replies yet from all sites regarding 2021 pledge availability, breakdown, and locally managed space quota

LHCb

  • Stable production activities at an average of 130k cores
    • Dominated by MC production
    • Run2 data reprocessing campaigns started
      • 2017 data staged from T0/T1 tapes, ongoing validations, then in production
      • 2018 and 2016 will follow
    • ongoing P8 --> EOS/CTA tests
  • long-standing tickets at RAL
    • GGUS:142350 open since July 2019 on vector reads
      • random failures of user (and some production) jobs
    • GGUS:150653 HTTPS transfers on the RAL FTS instance fail with permission errors
    • GGUS:150898 on checksum issues

Task Forces and Working Groups

GDPR and WLCG services

Accounting TF

  • Many problems were detected with T2 sites not reporting properly to APEL. They are being followed up through GGUS tickets.

Discussion

  • Renato:
    • we have not published any accounting data for a year now
    • we have a ticket open with the APEL team, but it moves very slowly
  • Julia:
    • please send the ticket number and we will follow up

Archival Storage WG

Containers WG

CREAM migration TF

Details here

Summary:

  • 90 tickets
  • 76 done: 35 ARC, 38 HTCondor, 1 both, 2 none
  • 5 sites have or plan for ARC, 2 are considering it
  • 5 sites have or plan for HTCondor, 1 is considering it, 2 consider using SIMPLE
  • 1 site has switched to K8s with accounting pending

dCache upgrade TF

  • A campaign will soon start to upgrade to a new version that allows proper SRR to be provided

DPM upgrade TF

  • 6 sites (out of 49) to go. Some of those which have upgraded are still being tested for TPC

StoRM upgrade TF

  • 14 sites upgraded, 2 StoRM storages decommissioned, 7 still to upgrade. TPC testing for the sites that have upgraded is ongoing.

Information System Evolution TF

  • SiteMon application migrated to CRIC topology
  • First prototype of the network topology in CRIC is being evaluated by the network team
  • Progressing with SRR deployment. Starting a new dCache upgrade campaign to enable proper SRR generation

IPv6 Validation and Deployment TF

Detailed status here.

Monitoring

Network Throughput WG


  • perfSONAR infrastructure - 4.3.4 is the latest release (security and bugfix release, please update ASAP)
  • WLCG/OSG Network Monitoring Platform
    • Work is on-going to resolve the issues reported in collaboration with the perfSONAR developers - most of the issues already fixed, waiting for 4.4 release
  • DC Monitoring Workshop (https://indico.cern.ch/event/1027287/) follow up
    • Planning to run a campaign for all sites to update to 4.4 (which should be released within the next 2-3 weeks) and fix issues with perfSONAR toolkits
    • Review existing dashboards and update to cover DC use cases (will look into dashboards combining FTS/perfSONAR and network traffic)
    • We plan to make network alerting service available before DC starts (it's currently in development with initial testing already on-going)
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.

Traceability WG

Transition to Tokens and Globus Retirement WG

  • token transition milestone backlog has been reduced
  • further details in WLCGTokensGlobusWG

Action list

Creation date Description Responsible Status Comments

Specific actions for experiments

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

Specific actions for sites

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

AOB

  • Thomas:
    • Sylabs have just announced a community fork of Singularity
    • how might that affect our dependence on Singularity?
  • Maarten: that should surely be discussed in the WLCG Containers WG
  • Stephan:
    • Dave Dykstra is already involved, let's wait for his assessment
Topic revision: r11 - 2021-05-10 - MaartenLitmaath
 