WLCG Operations Coordination Minutes, March 4, 2021
Highlights
Agenda
https://indico.cern.ch/event/1012467/
Attendance
- local:
- remote: Alessandro P (EGI), Andrea C (CNAF), Andrea R (CNAF), Andreas H (DESY-ZN), Andrew (TRIUMF), Andrii Lytovchenko (DIRAC), Carmelo (CNAF), Christoph (CMS), Cristian (EOS), Daniel (security), Daniele (CNAF), Dave D (FNAL), David B (IN2P3-CC), David Cameron (ARC + ATLAS), David Cohen (Weizmann), David S (ATLAS), Eisaku Sakane (NII), Enrico (CNAF), Eric (IN2P3), Fabrizio (DPM), Federica Agostini, Federico S (LHCb), Federico V (CNAF), Francesco (CNAF), Frederique (LAPP), Giuseppe (CMS), Hannah (CERN AAI), Jeny (OSG security), Lucia (CNAF), Maarten (ALICE + WLCG), Marian (networks + monitoring), Matt (Lancaster), Mihai (FTS), Nikita (CMS), Panos (WLCG), Petr (ATLAS), Philippe (ATLAS), Stephan (CMS), Steven (FNAL), Stu (FNAL), Tanya (FNAL), Valeria (EGI), Vanessa (IN2P3-CC), Vincenzo (CNAF), Xinli Liu
- apologies:
Operations News
- the next meeting is planned for April 1 (sic)
Special topics
ARC CE release update and plans
see the presentation
Discussion
- Maarten:
- some LHC experiments use personal instead of robot certificates for some workflows
- in those cases the DNs are not associated with personal activities, though
- Federico:
- we also use personal certificates for some cases
- is the REST interface stable?
- David Cameron:
- the interface specification is stable
- it may take some months before we can declare REST ready for production
- if you use ARC clients, you are shielded from such details
- let us know if you have questions about the interface
- Maarten:
- hopefully in the next 2 weeks we can upgrade the remaining HTCondor-G submitters
- ATLAS already did their pilot factories
- CMS and SAM ETF are waiting for the official HTCondor release
- ALICE and LHCb are not affected
- when the submitters are OK, we will do a campaign to get all ARC CEs updated
- this matter may essentially be solved in the next month or two
- Stephan:
- does the REST interface still need a lot of work?
- we would need a second campaign to have sites upgrade ARC CEs for that
- could we wait and have a single campaign instead?
- David Cameron:
- we may still need a few months for the REST interface
- the privacy issues need to be solved much sooner
- Maarten:
- also mind that the new interface might have performance or reliability issues
- debugging those could take yet more time
- Stephan:
- we found LDAP updates lagging a lot when ARC CEs are very busy
- would that be avoided with the REST interface?
- David Cameron:
- LDAP performance can be affected by various things
- please get in touch and we can help debug such matters
WLCG transition to tokens and Globus retirement
see the presentation
NOTE - the presentation was updated on March 5 with these changes:
- page 12, 1st item: the reference to SRM has been removed
- in DOMA a plan has been agreed to phase out SRM later
- page 13, 4th item: it now refers to Dave Dykstra's token client
Discussion
Page 3:
- Steven:
- have you considered how US VOs like DUNE may be affected by such changes?
- for example, EOS needs lists of users from VOMS-Admin
- US VOs use SciTokens
- Maarten:
- those are good points
- we had already identified EOS as a case for which a solution is needed
- we also need to take SciTokens into consideration
- Andrea C:
- we could add an interface to IAM for making grid-mapfiles if needed
- we have the SCIM API for the equivalent for tokens
- Maarten:
- we could add a legacy interface, but it might only be useful for a few years
- the legacy interface is for X509, which we will move away from
- hopefully we can avoid having to put development effort there
- we need a similar interface for tokens
- Stephan:
- can we have a graceful transition from VOMS-Admin to IAM?
- can they live side by side for some months?
- when IAM becomes the master, maybe we can keep a stale VOMS-Admin for a while?
- Maarten:
- we intend to populate IAM initially from VOMS-Admin
- this allows relevant parties to try IAM functionalities in parallel
- in other respects the infrastructure will keep working as before
- when things look sufficiently OK for a VO, we switch to IAM for that VO
- we do not foresee syncing IAM back into VOMS-Admin
- we should have solutions for the relevant VOMS-Admin use cases beforehand
Page 7:
- Andrea C:
- the VOMS library could be used for dealing with X509 proxies
- it does not depend on Globus
- Maarten:
- that is a nice option for SW that needs to support X509 longer
Page 12:
- Andrea C:
- IAM already has MyProxy-like functionality that can be developed further
- Maarten:
- to be followed up in the AuthZ WG
- David Cameron:
- could MyProxy be developed further for X509 and have Globus GSI replaced?
- Maarten:
- if needed, that can be considered indeed
Page 13:
- Dave D:
- I developed a token client, not a service
- it consists of htgettoken and htvault-config
- it has already been integrated with HTCondor workflows
- Maarten:
- I will correct that in the presentation (done)
- I think we would like to have equivalents for certain MyProxy functionalities
- Dave's token client might be at the core of that
- to be followed up in the AuthZ WG
- Andrea C:
- IAM already supports handling SSH keys
- it could be useful for a GSI-OpenSSH equivalent
- Maarten:
- in any case GSI-OpenSSH can wait while we have bigger matters to deal with
Page 14:
- Dave D:
- HTCondor said at the OSG All Hands Meeting that they will support GSI as long as the GSI library remains supported
- David Cameron:
- when OSG CEs no longer support X509, will that affect what jobs can do?
- Maarten:
- payloads can still be equipped with X509 proxies
- some VOs already give payloads their own proxies
- each affected VO will need to look into how to run on OSG resources
- Stephan:
- OSG sites would need to upgrade a few months before Feb 2022, by Oct-Nov
- that would give us only about half a year to sort out the remaining issues
- Maarten:
- the timeline is not cast in concrete
- it was a reasonable proposal that can still be moved by some months
- OSG certainly do not want to disrupt their flagship VOs CMS and ATLAS
- Steven:
- other VOs should not be disrupted either!
WG for Transition to Tokens and Globus Retirement
- please find further details here
Middleware News
- UMD repository is currently down because of a security incident
- The intention is to have it restored by Friday EOB
- Useful Links
- Baselines/News
Tier 0 News
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- High to very high activity levels on average.
- New record reached: 183k concurrently running jobs.
- No major issues.
ATLAS
- Mostly stable production with 450-500k cores running
- DBoD issue on 28 Feb:
- 6pm harvester job submission system stopped working due to OTG:0062530
- 11pm ALARM ticket GGUS:150774 was submitted; thanks to DBoD experts the problem was solved
- After comments in Monday’s WLCG ops meeting ATLAS procedures were reviewed and found to be adequate
- HTCondor-ARC issue: harvester HTCondor machines updated with patched binary on 2 March
- Pushing ahead with plan to move all possible sites to HTTP-TPC by end of May
- In parallel we encourage sites to move away from GridFTP (to HTTP or Xrootd) for local access
- CREAM decommissioning:
- 5 sites still rely on CREAM CE:
- We propose stopping ATLAS job submission to all CREAM CEs at the end of March
CMS
- running smoothly at around 330k cores
- usual production/analysis split of 3:1
- main processing activities:
- Run 2 ultra-legacy Monte Carlo
- Run 2 pre-UL Monte Carlo
- Run 2 re-miniAOD
- HPC allocations contributing about 30k cores
- sites ran very well last month; thanks to CNAF/CINECA, KIT, and RAL for contributing well above/double pledge!
- enough processing/analysis work remaining in the queue
- most CMSWeb services migrated into Kubernetes instance; a few services remaining on the to-do list
- ARC-CE v6.10 instances at CMS sites are all locally patched; thanks to colleagues at JINR for identifying the hash DN issue and patch, and to the HTCondor team for addressing this quickly; CMS plans to switch quickly once the fixed HTCondor version is released next week.
- WebDAV storage endpoint setup and commissioning in progress
- target date is May 1st
- good head start due to ongoing volunteer campaign
- a presentation on the CERN e-group replacement plan in one of the next WLCG Ops Coordination meetings would be appreciated
- hoping for an agreement on a long-lifetime CentOS replacement; this also has a very important impact on online/data recording/detector operations
LHCb
Task Forces and Working Groups
GDPR and WLCG services
Accounting TF
Archival Storage WG
Containers WG
CREAM migration TF
Details here
Summary:
- 90 tickets
- 63 done: 31 ARC, 30 HTCondor, 2 none
- 8 sites have or plan for ARC, 4 are considering it
- 13 sites have or plan for HTCondor, 3 are considering it, 3 consider using SIMPLE
- 1 ticket without reply
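The ticket totals above can be cross-checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and only the figures for the total and the completed tickets are taken from the summary. The per-site planning figures are left out, since the summary does not say how they map onto the open tickets.

```python
# Figures copied from the CREAM migration TF summary above.
total_tickets = 90
done_by_target = {"ARC": 31, "HTCondor": 30, "none": 2}

# The per-target breakdown should add up to the quoted number of done tickets.
done = sum(done_by_target.values())
open_tickets = total_tickets - done

print(f"done: {done} of {total_tickets} ({done / total_tickets:.0%})")
print(f"still open: {open_tickets}")
```

The breakdown adds up to the 63 done tickets quoted in the summary, which leaves 27 tickets still open.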
dCache upgrade TF
DPM upgrade TF
StoRM upgrade TF
Information System Evolution TF
IPv6 Validation and Deployment TF
Detailed status here.
Monitoring
Network Throughput WG
- perfSONAR infrastructure - 4.3.3 is the latest release
- WLCG/OSG Network Monitoring Platform
- Agreed that CRIC will store the aggregated perfSONAR topology (GOCDB/OSG/NREN/etc.)
- Work on publishing directly from perfSONAR toolkits - on-going
- An issue was identified with central configuration (psconfig/PWA), which is being investigated in collaboration with perfSONAR developers (psconfig degraded for now)
- the psconfig/PWA lead developer has left; we're waiting for his replacement to follow it up
- Another potential bug has been identified, this time directly in the toolkit, which causes workload issues and impacts testing - a detailed bug report was sent to the developers
- EU project ARCHIVER has started to use perfSONARs to test cloud connectivity (DESY, PIC and CERN are participating in this activity)
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
- AGLT2 inbound - a potential network issue was confirmed for all inbound traffic to AGLT2 (due to latency, EU is more impacted than US)
- CNAF -> AGLT2 - reported by ATLAS, under investigation; right now this looks to be separate from the issue above
Traceability WG
Action list
Specific actions for experiments
Specific actions for sites
AOB