WLCG Operations Coordination Minutes, Dec 3, 2020
Highlights
Agenda
https://indico.cern.ch/event/980181/
Attendance
- local:
- remote: Alberto (monitoring), Chris (LHCb), Christoph (CMS), Concezio (LHCb), Dave M (FNAL), David Cameron (ATLAS), David Cohen (Technion), Eric F (IN2P3), Giuseppe (CMS), Julia (WLCG), Maarten (ALICE + WLCG), Marian (networks + monitoring), Matt (Lancaster), Nikolay (monitoring), Pedro (monitoring), Pepe (PIC), Ron (NLT1), Shawn (networks), Stephan (CMS), Thomas (DESY)
- apologies:
Operations News
- the next meeting is planned for Feb 4
Special topics
WLCG/FTS/Xrootd transfer dashboards update
see the presentation
Discussion
- Pepe: how complete is the data? are the GLED collectors getting data from all
Xrootd services in WLCG?
- Nikolay: we recently needed to migrate the GLED collectors to CentOS 7 and
contacted the experiments to verify things from their perspectives
- Pedro: so far we have relied on experiments providing their sites with
configuration instructions for the monitoring info to be sent to MONIT
- Pepe: can local and external Xrootd accesses be distinguished?
- Pedro: the info has a flag and we only use remote accesses for WLCG dashboards
- Pepe: is there a distinction between TPC vs. job activities?
- Pedro: no
- Stephan: after the migration we found that a large fraction of the sites were no longer
reporting, and we checked the configurations with those sites
- Pedro:
- there was similar feedback from ATLAS
- maybe these matters should be checked at the WLCG level?
- the FTS dashboards can be trusted, the Xrootd ones are unclear
- Pepe: CMS expects their Xrootd traffic to be reported accurately
- Stephan:
- there are way more Xrootd servers than FTS hosts
- not so easy to get all of them to report correctly
- US sites send their info to US collectors, not directly to MONIT
- Julia:
- US sites were supposed to use US collectors, other sites to use CERN
- how is that nowadays?
- Pedro:
- we have all that under control
- the trouble is that some servers do not report at all
- Stephan:
- the responsibility for the US collectors has shifted from UCSD to OSG
- some bugs were discovered and the services are being revamped
- Julia:
- it would be good if US experience could be shared for deployments elsewhere
- we also should think about how to check if the monitoring is working
- Stephan:
- this is work in progress
- we will get back to these matters
- Maarten: to decide how much effort to spend, how useful are these dashboards?
- David Cameron: after the migration we found many ATLAS sites not reporting
and we just decided to switch off reporting altogether!
- Julia:
- the FTS dashboards only show part of the traffic
- that is why we decided to add the Xrootd dashboards
- Pepe: either we provide sites with clear instructions or those dashboards cannot be trusted
- Pedro:
- different dashboards can serve different purposes
- the FTS dashboards are useful for operations and accounting
- the Xrootd dashboards have sites missing at least for ATLAS
- can an absence of 10% of the Xrootd traffic be tolerated for the WLCG views?
- Julia:
- do the WLCG views avoid double counting of Xrootd transfers via FTS?
- with the progress in DOMA TPC, those are steadily becoming more numerous
- Pedro, Nikolay:
- Julia:
- we would like to have a complete picture for WLCG
- we need to get back to this next year
- check which fraction is missing
- determine proper configurations for sites and MONIT
- Stephan:
- let's wait for the new US collectors to get set up
- after the meeting Stephan checked the timeline and confirmed that
the change/upgrade of the US collectors is scheduled for mid 2021.
- Dave M:
- is the scope of these efforts correct?
- we would like to understand how our networks are used
- we may not want to have such monitoring depend on fragile site configurations
- Pedro:
- these are transfer dashboards: we need to monitor transfer activities
- Maarten:
- one day we might obtain our usage details from generic network monitoring
- for now we can use such transfer dashboards and hopefully they add up to
what we see on the networks
- Julia:
- those dashboards are rather for the global picture, less for operations
- Thomas:
- are sites supposed to look at them regularly?
- Maarten:
- not now, maybe when they have been much enhanced in the future
- Julia:
- the improvements that were presented look good
- we will get back to these matters next year
- Julia:
- also storage accounting is using InfluxDB: do you advise moving away from it?
- Nikolay:
- we did that because ES + Spark perform much better for high-cardinality data
- InfluxDB is OK for other use cases and we still use it indeed
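As an illustration of the high-cardinality point (not from the meeting; index and field names are hypothetical), a minimal sketch of the kind of distinct-count query that ES handles well, using the elasticsearch Python client:
```python
# Hypothetical sketch: approximate distinct counts over a high-cardinality field
# (e.g. file paths), the kind of aggregation that motivated moving from InfluxDB
# to Elasticsearch + Spark. Index and field names are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

query = {
    "size": 0,
    "query": {"range": {"timestamp": {"gte": "now-7d"}}},
    "aggs": {
        # fields like file path or client host can take millions of distinct values,
        # which time-series tag indexes handle poorly
        "distinct_paths": {"cardinality": {"field": "file_lfn"}},
        "bytes_per_site": {
            "terms": {"field": "src_site", "size": 50},
            "aggs": {"total_bytes": {"sum": {"field": "bytes"}}},
        },
    },
}

result = es.search(index="xrootd-transfers-*", body=query)
print(result["aggregations"]["distinct_paths"]["value"])
```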
WLCG Network Throughput WG update
see the presentation
Discussion
- Julia, Maarten: consider using CRIC for your topology info?
- Marian: CRIC looks OK for WLCG sites and experiments,
but we also need correlations for other sites
- Julia:
- we can work with you to supply information that is currently missing
- for the WLCG sites, all network topology information will be made available in the WLCG CRIC
- for sites and VOs beyond WLCG we propose you deploy a dedicated CRIC instance.
In this case someone should take ownership of this instance and its content.
- you can already start with the WLCG sites
- we need to understand how much additional info covering non-WLCG sites would imply.
If it is not too much, that info could also be hosted in the WLCG CRIC.
- Marian: sounds OK
- Shawn:
- we need a single source of truth per entity
- for WLCG sites the info should come from the WLCG CRIC
- for other sites we are already working with Edoardo
- Julia: we will set up a meeting to discuss the technical details
Middleware News
- Useful Links
- Baselines/News
Tier 0 News
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- High activity on average.
- No major problems.
- Sites in the UK and possibly elsewhere were affected by a RAL Stratum-1 misconfiguration:
- HTTP headers asked clients to cache CVMFS catalogs for 3 days instead of 61 seconds
(see the header check sketched after this report).
- The problem started when the service was upgraded to CentOS 7 on Nov 19.
- The issue was fixed on Dec 2, but its effects may linger for up to 3 days more.
- Other VOs presumably were affected too (GGUS:149736).
- A small number of sites did not yet upgrade their VOBOX to CentOS 7.
- They have been requested to do so ASAP.
- Thanks to all sites and best wishes for 2021!
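Regarding the Stratum-1 caching issue above, a minimal sketch (not from the meeting; host and repository names are placeholders) of how a site could spot-check the caching headers its Stratum-1 serves for a repository manifest; CVMFS metadata objects are expected to be cacheable for only about 61 seconds:
```python
# Hypothetical spot-check of the caching headers a CVMFS Stratum-1 serves for a
# repository manifest (.cvmfspublished). Host and repository names are placeholders.
import re
import requests

url = "http://stratum1.example.org/cvmfs/alice.cern.ch/.cvmfspublished"
resp = requests.head(url, timeout=10)

cache_control = resp.headers.get("Cache-Control", "")
match = re.search(r"max-age=(\d+)", cache_control)
max_age = int(match.group(1)) if match else None

print(f"Cache-Control: {cache_control!r} -> max-age = {max_age}")
# CVMFS metadata should only be cacheable for ~61 seconds; a value around
# 259200 (3 days) is the kind of misconfiguration described above.
if max_age is None or max_age > 61:
    print("WARNING: catalogs/manifest may be cached far too long by clients and proxies")
```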
ATLAS
- Stable Grid production in the past weeks with up to ~450k concurrently running grid job slots with the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis, including ~90k slots from the HLT/Sim@CERN-P1 farm and ~10k slots from BOINC.
- Occasional additional peaks of ~100k job slots from HPCs.
- Started a large production of MC DAOD_PHYS (Run 3 analysis format), including reading input AOD from tape through data carousel
- Expect to continue filling resources as normal over Xmas break, nothing special planned
- Started deletion from T1 MCTAPE, expected to take several weeks to complete
- HTTP-TPC migration: 11 dCache, 16 DPM, 1 StoRM
- Thanks to all for keeping things running smoothly throughout a challenging year and all the best for 2021!
CMS
- CMS collaboration meeting this week
- running smoothly at around 300k cores
- usual production/analysis split of 3:1
- main processing activities:
- Run 2 ultra-legacy Monte Carlo
- Run 2 pre-UL Monte Carlo
- on track or ahead on HPC allocation use
- migration to Rucio complete
- Castor to CTA migration in progress
- Castor access closed, CTA write to start Dec 7th
- migration of CREAM-CEs incomplete
- CERN factory end-of-life reached
- 0/7/3 Tier-1/2/3 sites with CREAM-CE(s) remaining
- concerned about one large Russian site
- VM SL6 upgrade/migration almost done
- HammerCloud monitoring outage due to a late and failed SL6 upgrade attempt
LHCb
- running smoothly at 100-120k cores
- dominated (90%) by MC production with Run 1 and Run 2 conditions
- HPC: using CINECA Marconi-A2 partition (KNL, ~4k cores), a few hundred cores at SDumont (Brazil)
- Castor to CTA migration
- Archive/recall tests successful
- Tested XRootD TPC transfers from T0 to T1, with tape workflow
- Need to resolve the T0 to StoRM (HTTP only) case
- DAQ tests being scheduled
- migration planned for end of January 2021
- Ongoing review of ETF tests
- getting rid of outdated ones (e.g. MJF, CREAM)
- Decommissioning of CREAM-CEs still ongoing
Discussion
- Maarten: in the recent WLCG Storage workshop we concluded that
HTTPS/WebDAV was going to be the default TPC protocol,
which means StoRM does not need to support Xrootd TPC
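For illustration, a minimal sketch (not from the meeting; endpoints and paths are hypothetical) of an HTTP third-party copy between two WebDAV doors using the gfal2 Python bindings:
```python
# Hypothetical sketch of an HTTP third-party copy (TPC) between two WebDAV
# endpoints with the gfal2 Python bindings; URLs and paths are placeholders.
import gfal2

src = "davs://source-se.example.org:443/vo/data/test.file"
dst = "davs://dest-se.example.org:443/vo/data/test.file"

ctx = gfal2.creat_context()
params = ctx.transfer_parameters()
params.overwrite = True        # replace the destination if it already exists
params.checksum_check = True   # compare source and destination checksums
params.timeout = 3600          # seconds

# With davs:// URLs on both sides, the copy is negotiated as a third-party
# transfer: the data flows directly between the two storage endpoints.
ctx.filecopy(params, src, dst)
```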
Task Forces and Working Groups
GDPR and WLCG services
Accounting TF
Archival Storage WG
Containers WG
CREAM migration TF
Details here
Summary:
- 90 tickets
- 39 done: 21 ARC, 18 HTCondor
- 11 sites plan for ARC, 10 are considering it
- 18 sites plan for HTCondor, 9 are considering it, 6 consider using SIMPLE
- 1 ticket without reply
dCache upgrade TF
- 37 sites out of 41 have been upgraded to 5.2.15 or higher
DPM upgrade TF
- 27 out of 49 DPM sites have migrated to DPM 1.14 and enabled macaroons
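As background on the macaroon part, a minimal sketch (not from the meeting; endpoint, path and proxy location are placeholders) of requesting a macaroon from a WebDAV door via the standard macaroon-request interface:
```python
# Hypothetical sketch: request a macaroon from a DPM/dCache WebDAV door.
# Endpoint, path and proxy location are placeholders; authentication uses an X.509 proxy.
import json
import requests

url = "https://dpm.example.org/dpm/example.org/home/myvo/some/dir/"
body = {
    "caveats": ["activity:DOWNLOAD,LIST"],  # restrict what the token may be used for
    "validity": "PT1H",                     # ISO 8601 duration: valid for one hour
}

resp = requests.post(
    url,
    data=json.dumps(body),
    headers={"Content-Type": "application/macaroon-request"},
    cert="/tmp/x509up_u1000",                   # VOMS proxy (placeholder path)
    verify="/etc/grid-security/certificates",   # grid CA directory
)
resp.raise_for_status()
print(resp.json()["macaroon"])  # can then be used as a bearer token for HTTP(-TPC) access
```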
StoRM upgrade TF
- Still waiting for release 1.11.19 in UMD
- added after the meeting: it is there, but not all components carry that version number
Information System Evolution TF
- Migration of AGIS to ATLAS CRIC is progressing
- New functionality has been enabled in CMS CRIC to provide user info for CMS Rucio
IPv6 Validation and Deployment TF
Detailed status here.
Monitoring
see WLCG/FTS/Xrootd transfer dashboards update
MW Readiness WG
This WG is finally closed after more than 2.5 years of inactivity.
The MW Officer role is also discontinued as of 2021.
Network Throughput WG
see WLCG Network Throughput WG update
Traceability WG
Action list
Specific actions for experiments
Specific actions for sites
AOB
- THANKS for your help in making 2020 a successful year for WLCG !
- Further challenges and opportunities await us in 2021...