WLCG Operations Coordination Minutes, Dec 2, 2021
Highlights
Agenda
https://indico.cern.ch/event/1101195/
Attendance
- local:
- remote: Alastair (RAL), Alberto (monitoring), Alessandra D (Napoli), Alessandra F (ATLAS + Manchester + WLCG), Alexey (CRIC), Andrew (TRIUMF), Borja (monitoring), Brian (RAL), Christoph (CMS), Darren (RAL), Dave M (FNAL), David Cameron (ATLAS + ARC), David Cohen (Technion), Edoardo (networks), Eric (IN2P3), Giuseppe (CMS), Henryk (LHCb + NCBJ-CIS), Julia (WLCG), Maarten (ALICE + WLCG), Marian (networks + monitoring), Masahiko (Tokyo), Matt D (Lancaster), Nikolay (monitoring), Panos (CRIC), Pedro (monitoring), Riccardo (WLCG), Rizart (WLCG), Shawn (networks + MWT2), Stephan (CMS), Thomas (DESY)
- apologies:
Operations News
- the next meeting is planned for Jan 27
Special topics
Network topology in CRIC. Information per site to be provided.
see the presentation
Discussion
- Stephan:
- the plans look geared towards standard grid sites,
whereas we also have HPC, cloud and opportunistic resources
- on the one hand the requested configuration details seem complex,
on the other hand they may be too simplistic for all use cases
- for example, let's assume MWT2 uses Google resources that were
already used by another site: how might that be handled?
- Shawn:
- the information in CRIC is meant for static sites
- cloud resources are often hidden behind what sites expose
- support of dynamic sites is challenging, it could come in version 2
- Stephan:
- we already declare temporary RC sites in CRIC
- it would be good to allow those network details to be exported in JSON format
- the association of specific networks to specific experiments could
be incompatible with opportunistic use of computing resources
- Shawn:
- such associations are not meant to limit the use of resources
- they allow resources to be matched with their main customers
- Stephan:
- can the first match be made to have the highest priority?
- we want to avoid having to list many subnets
- Shawn:
- we can consider that indeed
- such matters are exactly what we wanted to discuss
- Julia:
- there will be a convenient API for the use cases we want to support
- Alexey:
- JSON exports of those details are already possible now
- opportunistic resources for ATLAS are always declared in CRIC
- Shawn:
- static sites are OK, dynamic resources may be tricky
- Alastair:
- the requirements on page 2 look fine, but then the presentation
shifts more and more toward "give us all your information"
- CRIC will never be the source of truth
- it will at most have a copy of a site's own configuration details
- the focus should be on the LHCONE use cases
- Shawn:
- the CRIC network information is about well-defined sites
- it will help us with several important use cases
- it also presents us with opportunities to identify issues with
the quality of the provided data
- Alastair:
- sites not on LHCONE are less likely to be dedicated to WLCG and
it may be more difficult to get such information from them
- Julia:
- we can at least implement consistency and sanity checks
- Maarten:
- the proposed changes need to be implemented and realistic example
configurations should be provided before the campaign gets launched
- Julia:
- we will fix ambiguities and do tests with a few sites, there is no rush
- a further presentation will be given in the Dec 8 GDB
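The "first match has the highest priority" idea and the consistency checks raised in this discussion can be sketched as follows. This is only an illustrative sketch, not the CRIC data model or API: the entry layout, the site names, the helper names `first_match` and `overlapping_entries`, and the documentation-range subnets are all assumptions made for the example.

```python
import ipaddress

# Hypothetical ordered list of (subnet, site, experiment) entries.
# Order expresses priority: earlier entries win. All values are made up.
NETWORK_ROUTES = [
    ("192.0.2.0/25", "SITE-A", "cms"),
    ("192.0.2.0/24", "SITE-A", "atlas"),
    ("2001:db8::/32", "SITE-B", "atlas"),
]


def first_match(address, routes=NETWORK_ROUTES):
    """Return (site, experiment) for the first listed subnet containing address."""
    ip = ipaddress.ip_address(address)
    for cidr, site, experiment in routes:
        net = ipaddress.ip_network(cidr)
        # Compare only within the same IP version (IPv4 vs IPv6).
        if ip.version == net.version and ip in net:
            return site, experiment
    return None  # address not covered by any declared subnet


def overlapping_entries(routes=NETWORK_ROUTES):
    """Sanity check: report subnet pairs that overlap, where ordering matters."""
    nets = [ipaddress.ip_network(cidr) for cidr, _, _ in routes]
    pairs = []
    for i, a in enumerate(nets):
        for b in nets[i + 1:]:
            if a.version == b.version and a.overlaps(b):
                pairs.append((str(a), str(b)))
    return pairs
```

With first-match semantics a site could list one broad subnet last as a catch-all instead of enumerating many small ones, and the overlap report shows where the order of entries actually changes the result.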
WLCG Monitoring Task Force
see the presentation
Discussion
- Alessandra F:
- the presence of the experiments in the TF is fundamental
- Julia:
- we may also work with some experiments just on specific topics
Middleware News
- Useful Links
- Baselines/News
Tier 0 News
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- Mostly business as usual
- Thanks to all sites and best wishes for 2022
ATLAS
- Running 600-700k cores with up to 300k from opportunistic EuroHPC
- Mainly Run 2 reprocessing and multi-core event generation
- SAM/ETF tests and Harvester can now submit with tokens to OSG 3.6 CEs
- No issues seen so far with Frontier test (redirecting all traffic to CERN since 15 Nov)
- Artificial stress now being added to find the limits
- SRM+HTTPS: fts3-atlas was upgraded on Tuesday and gfal configured to allow the use of https in srm <-> srm transfers. All smooth. Today Rucio was configured to allow https also in mixed-protocol transfers, https <-> srm.
- Happy end of year holidays to all!
CMS
- running smoothly at around 320k cores
- usual production/analysis split of 3:1
- HPC allocations contributing between 5k and 60k cores
- main production activity Run 2 ultra-legacy Monte Carlo
- processing of parked B data progressing well
- successful operation of the detector/DAQ during pilot beam test and CRUZET
- Identity and Access Management, IAM, server moved into production
- WebDAV SAM test made mandatory, commissioning at a few sites ongoing
- slow local/remote data access investigation continues at RAL
- coordinating xrootd v5 upgrade with sites
LHCb
- Smooth running at 140k cores
- Due to issues, reprocessing (stripping) of 2016 is waiting for the second request for validation
- Best wishes for 2022
Task Forces and Working Groups
GDPR and WLCG services
Accounting TF
- A meeting to discuss integration of the new benchmark in the accounting workflow was held on Nov 25
Archival Storage WG
Containers WG
CREAM migration TF
This TF has been closed as of Dec 1.
Details here
Final summary:
- 90 tickets
- 84 done: 39 ARC, 40 HTCondor, 1 both, 1 K8s, 3 none
- 6 unsolved
dCache upgrade TF
- A campaign to enable SRR at the dCache sites will be launched at the beginning of next year
Information System Evolution TF
- Progress in enabling network topology in CRIC. See presentation
IPv6 Validation and Deployment TF
Detailed status here.
Monitoring
Network Throughput WG
- perfSONAR infrastructure - 4.4.1 is the latest release (please update, we also recommend rebooting all nodes after update)
- WLCG/OSG Network Monitoring Platform
- Work is ongoing to resolve issues reported to the perfSONAR team
- Issue causing perfSONARs to hit resource limits (number of threads) will be fixed in the next release
- Meeting with CRIC team took place last week to discuss use cases for the perfSONAR topology in CRIC
- Recent and upcoming WG updates:
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
Traceability WG
Transition to Tokens and Globus Retirement WG
- Progressing via Authorization WG meetings
- Example: design of token exchange workflows involving FTS and Rucio
Discussion
- Stephan:
- what is the status of token support in ARC?
- we may need to be concerned about discontinuation of
X509 support in the US pilot factories
- David Cameron:
- the latest version of ARC supports tokens for jobs that
do not require the CE to do any data handling
- the majority of sites can thus take advantage already
- Maarten:
- we foresee CE upgrade campaigns early next year,
but they may take many months, as usual
- it would be good for experiments to allow X509 still to
be used for ARC CEs for the time being
- and for HTCondor CEs in EGI
- HTCondor-G (sic) should be fine with that
- as of ~Feb, HTCondor CEs in OSG will only support tokens
- the last HTCondor CE version featuring X509 job submission
reaches its EOL in the autumn of next year
- EGI sites should have upgraded by that time
- Julia:
- do all experiments know what they need to do for tokens?
- Maarten:
- they are all represented in the Authorization WG
- ATLAS and CMS already prepared for the changes in OSG
- for ALICE and LHCb these matters were less urgent
- DIRAC developers started implementing token support months ago
- for ALICE it is on the JAliEn roadmap
Action list
Specific actions for experiments
Specific actions for sites
AOB
- THANKS for your help in making 2021 another successful year for WLCG!
- Further challenges and opportunities await us in 2022...