WLCG Operations Coordination Minutes, March 4, 2021

Highlights

Agenda

https://indico.cern.ch/event/1012467/

Attendance

  • local:
  • remote: Alessandro P (EGI), Andrea C (CNAF), Andrea R (CNAF), Andreas H (DESY-ZN), Andrew (TRIUMF), Andrii Lytovchenko (DIRAC), Carmelo (CNAF), Christoph (CMS), Cristian (EOS), Daniel (security), Daniele (CNAF), Dave D (FNAL), David B (IN2P3-CC), David Cameron (ARC + ATLAS), David Cohen (Weizmann), David S (ATLAS), Eisaku Sakane (NII), Enrico (CNAF), Eric (IN2P3), Fabrizio (DPM), Federica Agostini, Federico S (LHCb), Federico V (CNAF), Francesco (CNAF), Frederique (LAPP), Giuseppe (CMS), Hannah (CERN AAI), Jeny (OSG security), Lucia (CNAF), Maarten (ALICE + WLCG), Marian (networks + monitoring), Matt (Lancaster), Mihai (FTS), Nikita (CMS), Panos (WLCG), Petr (ATLAS), Philippe (ATLAS), Stephan (CMS), Steven (FNAL), Stu (FNAL), Tanya (FNAL), Valeria (EGI), Vanessa (IN2P3-CC), Vincenzo (CNAF), Xinli Liu
  • apologies:

Operations News

  • the next meeting is planned for April 1 (sic)

Special topics

ARC CE release update and plans

see the presentation

Discussion

  • Maarten:
    • some LHC experiments use personal instead of robot certificates for some workflows
    • in those cases the DNs are not associated with personal activities, though

  • Federico:
    • we also use personal certificates for some cases
    • is the REST interface stable?

  • David Cameron:
    • the interface specification is stable
    • it may take some months before we can declare REST ready for production
    • if you use ARC clients, you are shielded from such details
    • let us know if you have questions about the interface

  • Maarten:
    • hopefully in the next 2 weeks we can upgrade the remaining HTCondor-G submitters
    • ATLAS already did their pilot factories
    • CMS and SAM ETF are waiting for the official HTCondor release
    • ALICE and LHCb are not affected
    • when the submitters are OK, we will do a campaign to get all ARC CEs updated
    • this matter may essentially be solved in the next month or two

  • Stephan:
    • does the REST interface still need a lot of work?
    • we would need a second campaign to have sites upgrade ARC CEs for that
    • could we wait and have a single campaign instead?

  • David Cameron:
    • we may still need a few months for the REST interface
    • the privacy issues need to be solved much sooner

  • Maarten:
    • also mind the new interface might have performance or reliability issues
    • debugging those could take yet more time

  • Stephan:
    • we found LDAP updates lagging a lot when ARC CEs are very busy
    • would that be avoided with the REST interface?

  • David Cameron:
    • LDAP performance can be affected by various things
    • please get in touch and we can help debug such matters

WLCG transition to tokens and Globus retirement

see the presentation

NOTE - the presentation has been updated on March 5 with these changes:

  • page 12, 1st item: the reference to SRM has been removed
    • in DOMA a plan has been agreed to phase out SRM later
  • page 13, 4th item: it now refers to Dave Dykstra's token client

Discussion

Page 3:

  • Steven:
    • have you considered how US VOs like DUNE may be affected by such changes?
    • for example, EOS needs lists of users from VOMS-Admin
    • US VOs use SciTokens

  • Maarten:
    • those are good points
    • we had already identified EOS as a case for which a solution is needed
    • we also need to take SciTokens into consideration

  • Andrea C:
    • we could add an interface to IAM for making grid-mapfiles if needed
    • we have the SCIM API for the equivalent for tokens

  • Maarten:
    • we could add a legacy interface, but it might only be useful for a few years
    • the legacy interface is for X509, which we will move away from
    • hopefully we can avoid having to put development effort there
    • we need a similar interface for tokens

  • Stephan:
    • can we have a graceful transition from VOMS-Admin to IAM?
    • can they live side by side for some months?
    • when IAM becomes the master, maybe we can keep a stale VOMS-Admin for a while?

  • Maarten:
    • we intend to populate IAM initially from VOMS-Admin
    • this allows relevant parties to try IAM functionalities in parallel
    • in other respects the infrastructure will keep working as before
    • when things look sufficiently OK for a VO, we switch to IAM for that VO
    • we do not foresee syncing IAM back into VOMS-Admin
    • we should have solutions for the relevant VOMS-Admin use cases beforehand

Page 7:

  • Andrea C:
    • the VOMS library could be used for dealing with X509 proxies
    • it does not depend on Globus

  • Maarten:
    • that is a nice option for SW that needs to support X509 longer

Page 12:

  • Andrea C:
    • IAM already has MyProxy-like functionality that can be developed further

  • Maarten:
    • to be followed up in the AuthZ WG

  • David Cameron:
    • could MyProxy be developed further for X509 and have Globus GSI replaced?

  • Maarten:
    • if needed, that can be considered indeed

Page 13:

  • Dave D:
    • I developed a token client, not a service
    • it consists of htgettoken and htvault-config
    • it has already been integrated with HTCondor workflows

  • Maarten:
    • I will correct that in the presentation (done)
    • I think we would like to have equivalents for certain MyProxy functionalities
    • Dave's token client might be at the core of that
    • to be followed up in the AuthZ WG

  • Andrea C:
    • IAM already supports handling SSH keys
    • it could be useful for a GSI-OpenSSH equivalent

  • Maarten:
    • in any case GSI-OpenSSH can wait while we have bigger matters to deal with

Page 14:

  • Dave D:
    • HTCondor said at the OSG All Hands Meeting that they will support GSI as long as the GSI library remains supported

  • David Cameron:
    • when OSG CEs no longer support X509, will that affect what jobs can do?

  • Maarten:
    • payloads can still be equipped with X509 proxies
    • some VOs already give payloads their own proxies
    • each affected VO will need to look into how to run on OSG resources

  • Stephan:
    • OSG sites would need to upgrade a few months before Feb 2022, by Oct-Nov
    • that would give us only about half a year to sort out the remaining issues

  • Maarten:
    • the timeline is not cast in concrete
    • it was a reasonable proposal that can still be moved by some months
    • OSG certainly do not want to disrupt their flagship VOs CMS and ATLAS

  • Steven:
    • other VOs should not be disrupted either!

WG for Transition to Tokens and Globus Retirement

  • please find further details here

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • High to very high activity levels on average.
    • New record reached: 183k concurrently running jobs.
  • No major issues.

ATLAS

  • Mostly stable production with 450-500k cores running
  • DBoD issue on 28 Feb:
    • 6pm: the harvester job submission system stopped working due to OTG:0062530
    • 11pm: ALARM ticket GGUS:150774 was submitted; thanks to the DBoD experts the problem was solved
    • After comments in Monday’s WLCG ops meeting ATLAS procedures were reviewed and found to be adequate
  • HTCondor-ARC issue: harvester HTCondor machines updated with patched binary on 2 March
  • Pushing ahead with plan to move all possible sites to HTTP-TPC by end of May
    • In parallel we encourage sites to move away from GridFTP (to HTTP or Xrootd) for local access
  • CREAM decommissioning:

CMS

  • running smoothly at around 330k cores
    • usual production/analysis split of 3:1
    • main processing activities:
      • Run 2 ultra-legacy Monte Carlo
      • Run 2 pre-UL Monte Carlo
      • Run 2 re-miniAOD
    • HPC allocations contributing about 30k cores
    • sites ran very well last month; thanks to CNAF/CINECA, KIT, and RAL for contributing well above/double pledge!
    • enough processing/analysis work remaining in the queue
  • most CMSWeb services migrated into Kubernetes instance; a few services remaining on the to-do list
  • ARC-CE v6.10 instances at CMS sites are all locally patched; thanks to colleagues at JINR for identifying the hash DN issue and the patch, and to the HTCondor team for addressing this quickly; CMS plans to switch quickly after the HTCondor fix is released next week.
  • WebDAV storage endpoint setup and commissioning in progress
    • target date is May 1st
    • good head start due to ongoing volunteer campaign
  • a presentation on the CERN e-group replacement plan in one of the next WLCG Ops Coordination meetings would be appreciated
  • hoping for an agreement on a long-lifetime CentOS replacement; this also has a very important impact on online/data recording/detector operations

LHCb

  • Federico S: NTR

Task Forces and Working Groups

GDPR and WLCG services

Accounting TF

Archival Storage WG

Containers WG

CREAM migration TF

Details here

Summary:

  • 90 tickets
  • 63 done: 31 ARC, 30 HTCondor, 2 none
  • 8 sites have or plan for ARC, 4 are considering it
  • 13 sites have or plan for HTCondor, 3 are considering it, 3 consider using SIMPLE
  • 1 ticket without reply
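
As a quick consistency check on the figures above, the following minimal sketch recomputes the totals. The variable names are illustrative only, and the numbers are copied from the summary; nothing else is assumed.

```python
# Figures copied from the CREAM migration summary; names are illustrative only.
total_tickets = 90
done_by_target = {"ARC": 31, "HTCondor": 30, "none": 2}

# The per-target counts should add up to the reported number of done tickets.
done = sum(done_by_target.values())
open_tickets = total_tickets - done
done_fraction = done / total_tickets

print(f"done: {done}, open: {open_tickets}, done fraction: {done_fraction:.0%}")
```

The planning counts (sites that have, plan or consider ARC, HTCondor or SIMPLE) are left out because the summary does not say whether those groups overlap.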

dCache upgrade TF

DPM upgrade TF

StoRM upgrade TF

Information System Evolution TF

IPv6 Validation and Deployment TF

Detailed status here.

Monitoring

Network Throughput WG


  • perfSONAR infrastructure - 4.3.3 is the latest release
  • WLCG/OSG Network Monitoring Platform
    • Agreed that CRIC will store the aggregated perfSONAR topology (GOCDB/OSG/NREN/etc.)
    • Work on publishing directly from perfSONAR toolkits - on-going
      • An issue was identified with central configuration (psconfig/PWA), which is being investigated in collaboration with perfSONAR developers (psconfig degraded for now)
      • The psconfig/PWA lead developer has left; we are waiting for his replacement to follow it up
    • Another potential bug has been identified, this time directly in the toolkit, which causes workload issues and impacts testing - detailed bug report was sent to the developers
  • EU project ARCHIVER has started to use perfSONARs to test cloud connectivity (DESY, PIC and CERN are participating in this activity)
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
    • AGLT2 inbound - potential network issue was confirmed for all inbound traffic to AGLT2 (due to latency EU is more impacted than US)
    • CNAF -> AGLT2 - reported by ATLAS, under investigation, right now this looks to be separate from the issue above

Traceability WG

Action list

Creation date | Description | Responsible | Status | Comments

Specific actions for experiments

Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments

Specific actions for sites

Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments

AOB

Topic revision: r9 - 2021-03-08 - DaveDykstra
 