WLCG Operations Coordination Minutes, Nov 11, 2021
Highlights
- Following the discussion with the LHCC referees at the end of summer, in the beginning of the next year we plan a pre-GDB to discuss what can be done to minimize the effort needed for WLCG operations, both experiment and central operations. Experiment input on how we can better organize this discussion is very much welcome.
- perfSONAR: please update to v4.4.1 ASAP and reboot the nodes.
Agenda
https://indico.cern.ch/event/1094950/
Attendance
- local:
- remote: Alberto (monitoring), Alessandra D (Napoli), Andreas (KIT), Borja (monitoring), Catalin (EGI), Chien-De (ASGC), Christoph (CMS), Dave M (FNAL), David Cameron (ATLAS), David Cohen (Technion), Felix (ASGC), Gavin (T0), Giuseppe (CMS), Henryk (LHCb), Julia (WLCG), Maarten (ALICE + WLCG), Marian (networks + monitoring), Matt D (Lancaster), Mihai (FTS), Miltiadis (WLCG), Nikolay (monitoring), Panos (WLCG), Pedro (monitoring), Renato (EGI), Riccardo (WLCG), Rizart (WLCG), Stephan (CMS), Thomas (DESY)
- apologies:
Operations News
- the next meeting is planned for Dec 2
Special topics
Monitoring of Data Challenges. Lessons learned and plans for improvements.
see the presentation
Discussion
- Julia:
- real-time monitoring is different from transfer accounting
- if the main information is only received when a file is closed,
only average rates can be shown after transfers have finished
- Rizart:
- FTS data sources contain state changes in real time
- Borja:
- Xrootd transfers only have the information at file close time
- FTS dashboards can have more information
- Julia:
- has real-time data been compared with data on file close?
- Borja:
- the throughput numbers per transfer could be compared
- the resolution currently is 1 h, we cannot see shorter periods
- Julia:
- the 1-hour cut-off is correlated with the average transfer duration
- Borja:
- the FTS also has a real-time dashboard that we could start from
and then strip what we do not want
- Mihai:
- real-time information is available, but not forwarded yet
- to be looked into
- Julia:
- another desirable feature would be to have site monitoring info
available from a central location: are there plans for that?
- Borja:
- there are no plans except for revamping the Xrootd monitoring
- Julia:
- site monitoring information would allow us to do consistency
checks with the central monitoring
- the data challenge summary document being prepared may well
have some conclusions on that matter
- Maarten:
- integrating site monitoring could be quite expensive in terms of
effort needed in the monitoring team and at the sites
- sites have different systems, at different maturity levels
- development efforts to get what we want
- operations efforts to ensure the information remains reliable
- we would need to apply the usual 80-20 rule
- Julia:
- let's wait for the conclusions of the summary document
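Illustration of the point raised above that close-time information only yields averages: a minimal sketch, not the actual monitoring code. The record layout (`ClosedTransfer`) and the assumption that bytes are spread uniformly over the transfer duration are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ClosedTransfer:
    """Hypothetical record, available only once the file is closed."""
    start_s: float    # transfer start, in seconds
    end_s: float      # transfer end (file close), in seconds
    bytes_moved: int  # total bytes transferred


def average_rate(t: ClosedTransfer) -> float:
    """Average throughput in bytes/s; assumes end_s > start_s."""
    return t.bytes_moved / (t.end_s - t.start_s)


def binned_bytes(transfers, bin_s=3600.0):
    """Spread each transfer's bytes uniformly over its duration
    (an assumption) and sum them per time bin: {bin_index: bytes}."""
    bins = {}
    for t in transfers:
        rate = average_rate(t)
        first = int(t.start_s // bin_s)
        last = int((t.end_s - 1e-9) // bin_s)
        for b in range(first, last + 1):
            lo = max(t.start_s, b * bin_s)
            hi = min(t.end_s, (b + 1) * bin_s)
            bins[b] = bins.get(b, 0.0) + rate * (hi - lo)
    return bins
```

Any burst or stall inside a transfer is flattened into its average, which is why plots built from such records cannot resolve periods much shorter than the typical transfer duration.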
Middleware News
- Useful Links
- Baselines/News
- Followup of the fallout from the Brazilian CA renewal:
- canl-java v2.6.0 has been tested OK with dCache
- it will be backported to supported branches, ETA Dec
- the StoRM team plan to update from v2.5.0, ETA Dec
Tier 0 News
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- Mostly business as usual
- The tape challenge was very successful
ATLAS
- Running 700-800k cores with 300k from opportunistic EuroHPC
- Run 2 data and MC reprocessing ongoing with heavy staging from tape
- CERN network outage on 15 Oct
- Most services recovered quickly after the outage
- Some pilot factories could not connect to DBoD (due to OTG:0066959), leading to some sites draining over the weekend
- Data on almost all DPM sites is publicly accessible and indexed by Google
- Tickets opened for sites to restart dCache after Brazil CA update
- Starting Monday 15th, Frontier traffic will be redirected from T1s (Lyon and TRIUMF) to CERN
- We plan to switch SAM/ETF tests to use VOMS extensions from IAM next week
- All sites have updated LSC files except
CMS
- Good CPU usage above 300k cores on average, with sizable contribution from HPC
- main activity Run 2 ultra-legacy Monte Carlo
- WebDAV deployment done for Tier1/2, now at Tier-3 level
- SAM tests for WebDAV are ready to go in production
- Planning to add IAM to CMS production VOMSes list today
- Data challenges and tape tests finished, producing the final report
- (Partial) network outage at CERN on Friday late afternoon (Oct 15th): OTG:0066817
- Several CMS services affected, particularly CMS webservices
- Most issues could quickly be fixed
- Main issue: voms-admin clients failing (voms-proxy-init still working, though)
- MonIT monitoring unavailable due to HDFS outage (Nov 1st) caused by DNS lookup errors OTG:0067144
Discussion
- Julia:
- the data challenge cleanup will be the first such campaign with Rucio?
- Dave M:
- deletion campaigns always have to be run carefully
- we will for the first time exercise that aspect of Rucio indeed
- the campaign will be run centrally
- sites will need to check the deletions
LHCb
- Smooth running at 140k cores
- Low number of MC, WG and Analysis production requests in the queue
- Reprocessing (stripping) of 2016 is waiting for the request for validation
- Tape data challenge finished on 22/10/2021
- ~10GB/s of throughput was achieved
- Cleaning of the test data is still ongoing
- It helped to detect and solve different problems and bottlenecks
Task Forces and Working Groups
GDPR and WLCG services
Accounting TF
- A technical meeting took place to discuss the details of integrating the new benchmark into the accounting workflow. It will be followed by a wider discussion at the WLCG Accounting Task Force meeting on the 25th of November.
Archival Storage WG
Containers WG
CREAM migration TF
Details here
Summary:
- 90 tickets
- 84 done: 39 ARC, 40 HTCondor, 1 both, 1 K8s, 3 none
- 1 site plans for ARC, 1 is considering it
- 2 sites have or plan for HTCondor, 1 is considering it
No change since last month.
dCache upgrade TF
DPM upgrade TF
StoRM upgrade TF
Information System Evolution TF
- Further progress with the network topology implementation in CRIC. A new version is deployed for validation. The next step is the perfSONAR topology in CRIC.
IPv6 Validation and Deployment TF
Detailed status here.
Monitoring
Network Throughput WG
- perfSONAR infrastructure - 4.4.1 is the latest release (please update ASAP, we also recommend rebooting all nodes after update)
- WLCG/OSG Network Monitoring Platform
- Work is ongoing to resolve issues reported to the perfSONAR team: a number of issues have already been fixed, but some are still open
- Recent and upcoming WG updates:
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
Traceability WG
Transition to Tokens and Globus Retirement WG
- CMS are starting to use their IAM VOMS endpoint in production
Action list
Specific actions for experiments
Specific actions for sites
AOB