WLCG Operations Coordination Minutes, April 1, 2021

Highlights

Agenda

https://indico.cern.ch/event/1023225/

Attendance

  • local:
  • remote: Alberto (monitoring), Alessandra D (Napoli), Alessandra F (Manchester + ATLAS), Alessandro (EGI), Andrew (TRIUMF), Catalin (EGI), Concezio (LHCb), Dave M (FNAL), David B (IN2P3-CC), David Cohen (Technion), Eric (IN2P3), Federico (LHCb), Giuseppe (CMS), Julia (WLCG), Maarten (ALICE + WLCG), Marian (networks + monitoring), Matt (EGI), Nikolay (monitoring), Panos (WLCG), Ron (NLT1), Stephan (CMS)
  • apologies:

Operations News

  • The next meeting is planned for May 6

  • Migration of APEL publisher from ActiveMQ
    • ARC and APEL phase-out of ActiveMQ
      • ARC team not able to join the meeting today
      • We need +1 month to test changes properly compared to timeline proposed by APEL
      • We need help in recruiting WLCG production sites to perform testing - can you help with that?
      • We assume the service is up and running for testing?

Discussion

  • Alessandro:
    • the ActiveMQ infrastructure is run by GRNET
    • because of maintenance concerns, it should be phased out ASAP
    • postponing that until the end of May looks doable, though

Special topics

Monitoring workshop introduction

see the presentation

Discussion

  • Alberto:
    • we will participate, show what is available and see what needs to be added

  • Julia:
    • we will have various matters to consider:
      • can the sites' own monitoring be exposed?
      • can we correlate info for the same transfer coming from various sources?
      • does the Xrootd monitoring provide what we need?
      • etc.

  • Alessandra F:
    • by the end of Sep we should have enough for the first step
    • we could hold the workshop around the end of April
    • it could look like a pre-GDB and serve as a kick-off event
    • we will also need to monitor tape staging rates

  • Dave M: is the scope just about September for now?

  • Maarten:
    • we can discuss both long- and short-term goals
    • by September we would already like to have more available than what exists now

  • Alessandra F:
    • there is a lot of information already available today
    • we need to extract it from experiment dashboards and put it together
    • and see if sites can export their data

Middleware News

  • Useful Links
  • Baselines/News
    • HTCondor 8.8.13 was released on March 23
      • It contains the ARC client fix needed for ARC CE >= 6.10.1
      • CMS pilot factories are being upgraded
      • ATLAS pilot factories had already been upgraded with a pre-release
      • SAM ETF has been upgraded for CMS
      • SAM ETF for ATLAS has been upgraded today
      • ALICE, LHCb and other DIRAC VOs are not affected
      • when all upgrades have been done and things look OK for some days,
        an ARC CE upgrade campaign will be launched in collaboration with EGI

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Normal to high activity on average.
  • No major issues.
  • Preparing for Run-3 multicore MC jobs at a number of sites.

ATLAS

  • Apologies, probably no one from ATLAS can attend the meeting today
  • Mostly stable production with around 500k cores running, including 100k from HLT farm and 30k from BOINC
  • Several new campaigns coming up including new workflows (re-simulation of existing HITS and MC overlay)
  • Some ARC sites will require availability corrections for the whole of March due to the wait for ETF to update to the newest HTCondor release
  • HTTP-TPC status: 56 out of 94 sites done (including T3 and special resources)
    • Progress blocked at some sites waiting for dCache restart to pick up new CAs
  • CREAM status:
    • No more ATLAS jobs to CREAM CEs from today (from ATLAS or ETF)
    • SARA on track with ARC CE
    • Still 4 other sites (2 T2, 2 T3) with no alternative ready, but negligible loss of resources

CMS

  • running smoothly at around 310k cores
    • usual production/analysis split of 3:1
    • main production activity Run 2 ultra-legacy Monte Carlo
    • HPC allocations contributing about 15k cores
  • new HTCondor release with ARC-CE v6.10 hash DN fix deployed in CERN, UCSD factory and ETF (Fermilab factory to be upgraded)
  • need to switch DN used for Rucio data transfers; sites using gridmap file need to add new DN by next week! (VOMS extension authorization preferred by CMS)
  • made another push to switch remaining single-core CE entries to multi/eight-core entries; last Tier-2 sites switched, now only Tier-3 sites with single-core entries
  • WebDAV storage endpoint setup and commissioning in progress
    • target date is May 1st
    • good progress, about 75% of Tier-1,2s ready
  • switched ARC-CE with working A-REX interface to use it for SAM tests until new HTCondor/ARC-CE interface is ready; Thanks Marian!
  • contacted sites regarding 2021 pledge availability, breakdown, and locally managed space quota
  • other VOs seeing xrootd auth failures after daylight-savings transition: first hour of certificate validity not usable until service restart

LHCb

  • running smoothly at 120-140k cores: an increase of ~10% because of:
    • better utilization of existing resources
    • running over-pledge on a few sites
    • dominated (90%) by MC production with Run1 and Run2 conditions
    • HPC: a few hundred cores at SDumont (Brazil). Starting to exploit MareNostrum but it's not easy
  • Castor to CTA migration
    • "done". But several issues arose when going into production. Some tickets open, led to fixes in middleware.
    • main issue is still in using RAL. GGUS:151081
  • Decommissioning of CREAM-CEs almost completely done.
  • On payload isolation: using Singularity on several WLCG CEs. No issues found ATM.

Task Forces and Working Groups

GDPR and WLCG services

  • Updated list of services
  • Some progress has been achieved in the approval of the CERN ROPOs. We need to review the list of the WLCG services which need to enable a privacy notice. It is quite possible that for some of them the ROPOs have been approved.

Accounting TF

  • The Accounting task force meeting discussed the plan for integrating the new benchmark into the accounting workflow. The plan will be presented at the April GDB.

Archival Storage WG

Containers WG

CREAM migration TF

Details here

Summary:

  • 90 tickets
  • 70 done: 34 ARC, 33 HTCondor, 1 both, 2 none
  • 6 sites have or plan for ARC, 2 are considering it
  • 10 sites have or plan for HTCondor, 1 is considering it, 2 consider using SIMPLE
  • 1 site has switched to K8s with accounting pending
  • 0 tickets without reply
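
As a quick consistency check on the summary, the breakdown of the done tickets can be added up and compared with the total. This is only a minimal sketch in Python: the variable names are illustrative assumptions, and the figures are copied from the list above.

```python
# Figures copied from the CREAM migration summary; names are illustrative only.
total_tickets = 90
done_breakdown = {"ARC": 34, "HTCondor": 33, "both": 1, "none": 2}

# Sum the per-solution counts to check them against the "70 done" figure.
done = sum(done_breakdown.values())
open_tickets = total_tickets - done
done_fraction = done / total_tickets

print(f"done: {done}, open: {open_tickets}, done fraction: {done_fraction:.0%}")
```

The breakdown sums to the reported number of done tickets, leaving the remaining tickets for the sites that still have or plan a migration path.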

dCache upgrade TF

  • Almost done. One site to go.

DPM upgrade TF

  • 80% of sites upgraded.

StoRM upgrade TF

  • 70% of sites upgraded

Information System Evolution TF

  • ATLAS migrated to atlas-cric. AGIS will be retired in the coming months

IPv6 Validation and Deployment TF

Detailed status here.

Monitoring

  • Julia:
    • the transition from SiteMon to CRIC had to be further postponed
    • several inconsistencies were found and not all could be resolved yet
    • the CRIC team will keep working on those matters with high priority
    • the expectation is for the transition to be possible in a few weeks

Network Throughput WG


  • perfSONAR infrastructure - 4.3.4 is the latest release (security and bugfix release, please update ASAP)
  • WLCG/OSG Network Monitoring Platform
    • Work is on-going to resolve the issues reported in collaboration with the perfSONAR developers
  • EU perfSONAR User Workshop will take place 14-15th of April (https://events.geant.org/event/388/overview)
  • EU project ARCHIVER reported initial feedback to the contractors based on the perfSONAR cloud testing (DESY, PIC and CERN participated in this activity, EBI will join)
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
    • AGLT2 inbound - Potential network issue was confirmed for all inbound traffic to AGLT2 (due to latency EU is more impacted than US)
    • CNAF -> AGLT2 - Reported by ATLAS, under investigation, confirmed to be separate from the issue above
    • SAMPA/RNP <- GEANT - Under investigation, performance degraded, RNP following up with GEANT ops
    • CERN -> Australia-ATLAS - LHCONE routing issues - already resolved

Traceability WG

Transition to Tokens and Globus Retirement WG

  • token transition milestone document has a large backlog to be processed
  • further details in WLCGTokensGlobusWG

Action list

Creation date Description Responsible Status Comments

Specific actions for experiments

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

Specific actions for sites

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

AOB

Topic revision: r17 - 2021-04-06 - MaartenLitmaath
 