WLCG Operations Coordination Minutes, March 2, 2023
Highlights
Agenda
https://indico.cern.ch/event/1259089/
Attendance
- local:
- remote: David B (IN2P3-CC), David Cameron (ATLAS), Stephan (CMS), Borja (WLCG Monitoring), David Cohen (IL T2 sites), Andrea (TRIUMF), David Mason (FNAL), Eric (IN2P3), Panos (WLCG), Adrian Coveney (APEL), Federico Stagni (LHCb), Domenico Giordano (WLCG benchmarking), Gonzalo (WLCG benchmarking), Giuseppe Bagliesi (CMS), Matthew Steve Dodge (EGI), David South (ATLAS), Marian Babik (WLCG networks), Stefano Dal Pra (CNAF), Christoph Wissing (DOMA, CMS), Alessandra Forti (ATLAS, UK T2s), Max Fischer (KIT), Thomas Hartmann (DESY), Natalia (WLCG), Julia (WLCG)
- apologies:
Operations News
Special topics
Readiness for switching to a new benchmark
Discussion after presentation of Domenico and Gonzalo
- For how long do we intend to keep the benchmark?
Domenico: The expectation is to keep the proposed configuration for as long as possible. As with HS06, "23" is a tag; no HEPScore24 is expected. Any necessary changes will be discussed and approved by the MB.
- Thomas mentioned that for new hardware HEPScore and HEPSPEC give almost the same result, while for old hardware HEPScore is lower than HEPSPEC. Is this expected?
Domenico: Yes, this is expected and understood. It is the reason why re-benchmarking of old resources is not recommended.
- Stefano: Can we report two benchmarks?
Julia: The specification allows two benchmarks to be reported for the same resource in the dictionary of the normalized records, though APEL will consume only one for the time being. If two are reported, APEL will consume HEPScore.
- Thomas H.: A Puppet configuration for the benchmark is needed for the sites.
Domenico: It already exists at CERN. We will help other sites with it.
- Domenico: We would like benchmarking results to be reported to the central repository at CERN. For this, the site needs to authenticate to the CERN message queue (MQ). This can work with a host certificate; the DN has to be communicated to the support team of the MQ at CERN.
- Token authentication is not yet enabled for the message queue at CERN; some development would be needed.
- Julia: Do we need to broadcast to the sites the call for validation of the benchmarking suite during this month?
Domenico: Yes, it is needed.
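The rule Julia described for records carrying two benchmarks (HEPScore wins when both are present) can be sketched as follows. This is only an illustration of the selection logic: the key names `HEPscore23` and `HEPSPEC` and the flat dictionary layout are assumptions made for the example, not taken from the record specification or from APEL code.

```python
# Hypothetical key names and layout; the real record specification may differ.
PREFERRED_ORDER = ("HEPscore23", "HEPSPEC")


def pick_benchmark(benchmarks):
    """Return (name, value) of the benchmark to use for normalisation.

    `benchmarks` maps a benchmark name to its value for one resource.
    The first name in PREFERRED_ORDER that is present is chosen, so
    HEPScore takes precedence when both benchmarks are reported.
    """
    for name in PREFERRED_ORDER:
        if name in benchmarks:
            return name, benchmarks[name]
    raise KeyError("no known benchmark in record")


if __name__ == "__main__":
    # Illustrative values only, not measured results.
    print(pick_benchmark({"HEPSPEC": 12.4, "HEPscore23": 11.9}))
    print(pick_benchmark({"HEPSPEC": 10.0}))
```

A site reporting only HEPSPEC for old hardware would still be handled, which matches the recommendation not to re-benchmark old resources.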
Discussion after presentation of Julia and Adrian
- Adrian: The APEL client won't be ready for the 1st of April.
- Julia: What is the timeline for the new version of the APEL client?
Adrian: End of April, since it will be implemented by the new team member, who needs some time to get familiar with the code. This will not include the upgrade to Python 3, which will come later. This should not be a big problem, since many sites might use third-party clients to generate the records.
- Authentication to AMS, the message queue used by APEL (different from the one used for reporting benchmarking results to CERN).
Adrian: In order to authenticate, the DNs must be registered per service in GocDB.
- Is token authentication foreseen?
Adrian: Yes, it is foreseen. AMS is ready to use tokens, but for production accounting there is a component which translates certificates to tokens.
- Thomas: We see scalability issues with APEL. Will the DB stay the same?
Adrian: For the immediate future, yes.
- Julia: We are preparing complete documentation for the pioneer sites, which will validate the new dataflow and start testing as soon as possible.
Middleware News
- Useful Links
- Baselines/News
- Next week a broadcast will be sent about WLCG advice on
the next Linux OS options and associated MW plans
- The text is under review this week
- On Tuesday, the following broadcast was sent about (RH)EL 9 vs. SHA-1 CAs:
- As has been discussed in recent WLCG Ops meetings, there is a mismatch
between the default security policies of RHEL 9 + derivatives and
the use of SHA-1 by a number of CAs in IGTF.
- RHEL 9 + derivatives and other recent Linux versions come with
OpenSSL v3, which disables a number of legacy algorithms.
In addition, RHEL 9 + derivatives disable SHA-1 by default.
- Unfortunately, SHA-1 is still used in root certificates of various CAs.
- Though their number is steadily decreasing, at this time it appears
we cannot yet declare SHA-1 unsupported across the infrastructure,
because too many sites and resources would be affected.
- Instead of re-enabling SHA-1, Red Hat have suggested the IGTF CA
distribution could be adjusted in a way that should cause its CAs
to be trusted irrespective of any dependencies on SHA-1.
- However, such adjustments would need to be tested with all
relevant middleware that has any business with certificates,
which may take a considerable amount of time.
- This then implies that clients and services running RHEL 9 or a
derivative (AlmaLinux 9, Rocky Linux 9) will need to enable SHA-1
for the time being.
- The minimal way to do that is as follows:
update-crypto-policies --set DEFAULT:SHA1
- This matter is to be revisited when either Red Hat's suggestion has
been made to work across our infrastructure, or the remaining use
cases depending on SHA-1 have become negligible, which may take
many more months.
Stephan: Can we have a list of the CAs that are affected?
After the meeting Maarten added the list (see the CMS report below).
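To see whether a host already carries the setting from the broadcast, one can inspect the configured policy string. The sketch below assumes that the policy is stored as a single non-comment line in `/etc/crypto-policies/config`; that path and the helper names are assumptions of this example. It only looks for the `SHA1` subpolicy named in the broadcast and does not judge other policies that may also permit SHA-1.

```python
def has_sha1_subpolicy(policy):
    """True if a policy string such as 'DEFAULT:SHA1' lists the SHA1 subpolicy."""
    _base, *subpolicies = policy.strip().split(":")
    return "SHA1" in (sub.upper() for sub in subpolicies)


def read_policy(path="/etc/crypto-policies/config"):
    """Return the first non-empty, non-comment line of the policy file."""
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith("#"):
                return line
    return ""


if __name__ == "__main__":
    try:
        policy = read_policy()
    except OSError:
        print("policy file not found; is this a RHEL 9 derivative?")
    else:
        print(policy, "-> SHA1 subpolicy set:", has_sha1_subpolicy(policy))
```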
Tier 0 News
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- Lowish to normal activity on average in the last weeks
- More sites have been switched from single- to 8-core pilots
- Other sites are planned to follow in the coming weeks
- Some sites will be able to support whole-node jobs instead
- Job submission token configuration details will soon be communicated
ATLAS
- Mostly smooth running with 500-700k slots on average
- Issues on Alma9 with old CA certificates using SHA-1
- Planning to move Harvester to HTCondor10 this month
- Only token auth with HTCondorCE
- Only ARC REST interface
- Some sites still have not upgraded their HTCondorCEs to support tokens, so they will be cut off, but with no real impact on total resources
CMS
- started taking cosmic data with the CMS detector
- overall smooth running, no major issues
- good core usage between 170k and 410k cores
- production pressure lost for about 10 hours last Thursday
- usual production/analysis split of about 3:1
- significant contribution from HPCs peaking at over 70k
- main production activity Run 2 ultra-legacy Monte Carlo and Run 3
- tape writing backlog at JINR decreasing nicely with config adjustment
- waiting on python3 version/port of HammerCloud
- working with our DPM sites to migrate to other storage technology
- limited by operations manpower/expertise
- token migration progressing steadily
- going through old, unused clients with excessive scope in IAM and cleaning things up
- plan to explore IAM logging and traceability in the coming weeks
- ETF updated with new xrootd version and probes including IAM-issued token probes moved to production instance; Many Thanks to Marian Babik!
- native xrootd config ready; work on dCache config continuing
- looking forward to 24x7 production IAM support by CERN
- does WLCG have a list of root CAs with SHA-1 that could be shared?
- or what is the latest expiration?
Provided after the meeting
- These CA root certificates still feature SHA1:
ASGCCA-2007, ArmeSFo, BYGCA, CESNET-CA-Root, CNIC, DFN-GridGermany-Root,
DZeScience, DigiCertAssuredIDRootCA-Root, DigiCertGridCA-1-Classic,
DigiCertGridRootCA-Root, DigiCertGridTrustCA-Classic, GermanGrid,
IHEP-2013, KEK, LIPCA, MARGI, QuoVadis-Root-CA2, RDIG, RomanianGRID,
SRCE, SiGNET-CA, TRGrid, seegrid-ca-2013
- Their end dates typically are (sometimes many) years from now
- They will need to be re-issued using SHA2 instead,
which is not a trivial process in IGTF,
but many others already did so and more are to follow
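For sites wanting to check a given CA file themselves, a rough sketch is shown below. It assumes the certificate has first been dumped to text with `openssl x509 -noout -text -in <file>` and simply looks at the first `Signature Algorithm:` line of that output. The function names and the sample text are illustrative and not part of any WLCG or IGTF tool.

```python
import re

# Matches lines such as "Signature Algorithm: sha1WithRSAEncryption".
SIG_RE = re.compile(r"Signature Algorithm:\s*(\S+)")


def signature_algorithm(openssl_text):
    """Return the first signature algorithm name in the text dump, or None."""
    match = SIG_RE.search(openssl_text)
    return match.group(1) if match else None


def is_sha1_signed(openssl_text):
    """True if the reported signature algorithm name mentions SHA-1."""
    algorithm = signature_algorithm(openssl_text)
    return algorithm is not None and "sha1" in algorithm.lower()


if __name__ == "__main__":
    # Made-up excerpt in the layout printed by openssl; not a real CA.
    sample = """Certificate:
    Data:
        Version: 3 (0x2)
    Signature Algorithm: sha1WithRSAEncryption
    """
    print(signature_algorithm(sample), is_sha1_signed(sample))
```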
LHCb
- Full system again for the past few weeks -- this followed a rather long period of under-utilization of the resources
- Still using single-core queues almost everywhere. Single-core jobs constitute 99% of the jobs run. Things are about to change, and we'll verify and then activate multi-core queues at all Tier1s and Tier2Ds (Tier2s with Disk resources).
- Moving to use "SingularityCE" everywhere (this is following a long-standing security issue -- isolation).
- We ticketed the few sites where this was not yet possible.
- From that point on, LHCb Pilots will start failing whenever Singularity is not available
- DIRACOS2 (the conda-based environment for DIRAC installations) is dropping support for OpenSSL 1.1 (-> OpenSSL 3.0.0). Most notably this means a new xrootd version -- only site affected atm is RAL
- New ETF tests in pre-prod. Plan to add more token-related tests.
- DIRAC support for submission to HTCondor and AREX CEs with tokens validated this morning.
Task Forces and Working Groups
GDPR and WLCG services
Accounting TF
- Progressing with the integration of the new benchmark in the accounting workflow. See presentation.
Information System Evolution TF
- Most CRIC instances have been upgraded to use the new SSO
- Following the request of LHCb to provide CE and queue information via the CRIC API, a prototype of the loader that will pull data from BDII is ready. The next step is to iterate with LHCb experts to see whether all corner cases are covered.
- In order to support the CMS migration to tokens, CMS CRIC has been populated with all the new groups that should be synced to IAM. The sync script is ready to be deployed as a cron job; it should be deployed in the OpenShift cluster that runs the actual service.
- According to our information, the CERN BDII instance is not used by any clients; we are planning to switch it off shortly
IPv6 Validation and Deployment TF
Detailed status here.
Monitoring
- The plan for the monitoring flow for dCache storage with the XRootD protocol has been discussed between the WLCG Monitoring Task Force, dCache experts and the FNAL dCache support team. The implementation and next steps for the development of the prototype have been agreed. dCache developers will prepare a forwarding script to send data to the CERN messaging system
Network Throughput WG
- perfSONAR infrastructure - 4.4.6 is the latest release
- WLCG/OSG Network Monitoring Platform
- Proposed that all T1s subscribe to the perfSONAR alarms on bandwidth decreases
- Preparing details on this and testing the new subscription portal
- perfSONAR Alarms Dashboard (psDash) is now in production (https://ps-dash.uc.ssl-hep.org/)
- On-going efforts to migrate to the direct publishing of measurements from perfSONARs
- Draft plan for network mini-challenges and milestones in preparation for the DC24 (includes perfSONAR milestones):
- Recent and upcoming WG updates:
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
WG for Transition to Tokens and Globus Retirement
- Further progress with the CE token support campaign on EGI
- 115 of 133 tickets have already been solved
- only a few sizable T2 sites remain, the rest are small sites
Action list
Specific actions for experiments
Specific actions for sites
AOB
- Next meeting is planned for the 6th of April