WLCG Operations Coordination Minutes, Oct 1, 2020
Highlights
Agenda
https://indico.cern.ch/event/959819/
Attendance
- local:
- remote: Alessandra (Napoli), Andrew (TRIUMF), Christoph (CMS), Concezio (LHCb), David B (IN2P3-CC), David Cameron (ATLAS), David Cohen (Technion), David S (ATLAS), Federico (LHCb), Giuseppe (CMS), Horst (US-ATLAS + OU), Johannes (ATLAS), Julia (WLCG), Laurence (VOMS), Maarten (WLCG + ALICE), Matt (Lancaster), Nikolay (CMS), Panos (WLCG), Stephan (CMS), Xavier (KIT)
- apologies:
Operations News
- the next meeting is planned for Nov 5
Special topics
WLCG storage workshop
see the presentation
Discussion
- David B: can the Google Doc also be used for input?
- Julia: yes, we will send the link
WLCG Critical Services update
Please have a look at the new version of the WLCG Critical Services page.
Its contents and layout have changed according to ideas and observations
that were discussed and agreed in Ops Coordination meetings earlier this year.
The new version adds functionality to help identify the services whose
implementation and operation need the most attention, in order to prevent
outages as far as feasible and/or minimize their effects. For such services
the use of HA systems, load-balancing and/or hot-standby setups deserves
the most consideration. In that respect it may be particularly useful to view
the services provided by CERN-IT in descending order of
weighted maximum criticality,
as defined earlier on that page.
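As an illustration of how such an ordering could be derived, the minimal Python sketch below ranks a few services by a weighted maximum criticality; the service names, scores and weighting formula are hypothetical placeholders, not the definition used on the Critical Services page itself.

```python
# Minimal sketch: rank services by a "weighted maximum criticality".
# The scores and the weighting are hypothetical placeholders; the
# authoritative definition is the one on the Critical Services page.

# Illustrative per-experiment criticality scores (1 = low, 10 = high)
services = {
    "VOMS":       {"ALICE": 10, "ATLAS": 10, "CMS": 10, "LHCb": 10},
    "FTS":        {"ATLAS": 9, "CMS": 9, "LHCb": 9},
    "CERN batch": {"ALICE": 7, "ATLAS": 7, "CMS": 7, "LHCb": 7},
}

def weighted_max_criticality(scores):
    """Hypothetical metric: highest score, weighted by the fraction of experiments using the service."""
    return max(scores.values()) * len(scores) / 4  # 4 LHC experiments

# Print the services in descending order of weighted maximum criticality
for name, scores in sorted(services.items(),
                           key=lambda item: weighted_max_criticality(item[1]),
                           reverse=True):
    print(f"{name:12s} {weighted_max_criticality(scores):5.1f}")
```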
VOMS upgrade and discussion on usage of the VOMS legacy clients by the experiments
see the presentations
- Laurence first gave a brief description of events:
- the service was first upgraded on Tue Sep 22
- on Wed Sep 23, tickets were opened by ATLAS and LHCb
- the voms-proxy-init2 C++ legacy client failed to verify its received proxies
- the service was then rolled back and fully working again by the end of Wednesday afternoon
- the VOMS devs provided a fix on Fri afternoon
- the service was upgraded to the fixed version on Monday this week
- Laurence:
- the focus for testing had been on VOMS-Admin because of its GDPR changes
- nobody expected any change of behavior affecting any VOMS client
- the officially supported voms-proxy-init3 Java client worked fine
- Maarten:
- the VOMS devs found it actually was the supported C++ library that broke,
because of a server-side change that came with the move from SL6 to CentOS 7
- that library is also used by other MW, e.g. LCMAPS, used by various services
- the tickets named some services that could not handle the broken proxies
- in the future we should try to do more tests and avoid such late discoveries (see the sketch at the end of this topic)
- there were test instances with the new version, but the focus was on VOMS-Admin
- David Cameron: it may be difficult to test all possible consumers of proxies
- Laurence:
- there is a trade-off between testing efforts and potential benefits
- we can just point the service alias to hosts with a new release,
while keeping the old hosts on stand-by for roll-back if needed
- Horst: wasn't XRootD v5 also affected?
- David Cameron: possibly, at least StoRM appeared to be
- Christoph: mind that OSG distributions only contain VOMS v2 clients!
- Maarten:
- AFAIK, OSG even forked the VOMS sources
- in theory that implies even more phase space for incompatibilities
- Federico: LHCb want to avoid greatly increasing the DIRACOS size just for VOMS
- Julia:
- it appears we have several reasons for continued use of the C++ legacy clients
- we will check with the devs if those can again become supported
The devs answered on Monday:
[...]
To summarize, the statement is:
- Last week's upgrade incident was caused by an incompatibility in the
server's VOMS attribute certificate encoding logic;
- clients v3 are the supported ones
- we do not plan to break compatibility with voms-clients v2 and are
happy to merge any contribution from the community in the code base
(but won't invest development effort in that code base)
The recent changes in VOMS(-Admin) will also be presented in an upcoming GDB meeting.
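Following the point above about testing such changes earlier, the sketch below outlines one possible pre-upgrade check that creates proxies with both client generations against a test endpoint and verifies they can be read back; the use of the dteam VO, the option syntax and the assumption that voms-proxy-init2, voms-proxy-init3 and voms-proxy-info are all installed are illustrative assumptions, not a tested recipe.

```python
# Minimal sketch of a pre-upgrade compatibility check: create a proxy with both
# VOMS client generations against a test server and verify the result can be
# read back. The VO, commands and option syntax are assumptions that may need
# adjusting to the local installation.
import subprocess
import sys

TEST_VO = "dteam"  # hypothetical test VO, as also suggested in the LHCb report

def run(cmd):
    """Run a command and return (exit code, combined stdout/stderr)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr

def check_client(init_cmd):
    """Create a proxy with the given client and check that it can be verified."""
    rc, out = run([init_cmd, "-voms", TEST_VO])
    if rc != 0:
        return False, f"{init_cmd} failed:\n{out}"
    rc, out = run(["voms-proxy-info", "-all"])
    if rc != 0 or TEST_VO not in out:
        return False, f"proxy created by {init_cmd} could not be verified:\n{out}"
    return True, out

if __name__ == "__main__":
    failures = 0
    for client in ("voms-proxy-init2", "voms-proxy-init3"):
        ok, msg = check_client(client)
        print(f"{client}: {'OK' if ok else 'FAIL'}")
        if not ok:
            print(msg)
            failures += 1
    sys.exit(failures)
```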
Middleware News
- Useful Links
- Baselines/News
Tier 0 News
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- Mostly business as usual
- No major problems
- CERN: significant loss of batch capacity Sep 28-30 (GGUS:148808)
ATLAS
- Stable Grid production in the past weeks with up to ~450k concurrently running grid job slots with the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis, including ~90k slots from the HLT/Sim@CERN-P1 farm and ~10k slots from Boinc. Occasional additional peaks of ~100k job slots from HPCs.
- Will stop running Folding@Home jobs this Friday, 2 October.
- No other major operational issues apart from the usual storage or transfer related problems at sites. A PanDA Pilot code update that affected job brokering and accounting was rolled back last week.
- VOMS server update and roll-back (23/24/28/29 Sep) (INC2558102) affected job submission due to “bad” proxy on Harvester submission hosts.
- TPC migration (https://its.cern.ch/jira/browse/ADCINFR-166): continuing to slowly move ready dCache and DPM sites to https-TPC in production. Status 30 Sep: 6 dCache, 10 DPM. Unfortunately, new bugs keep being discovered in middleware or storage software: Davix, EOS, StoRM, Echo, DPM + XRootD 5
- Robot certificates are becoming hard to obtain and manage due to the requirement that the owner be a permanent CERN staff member
Discussion
- Maarten, Julia: let us know if you need help in the robot certificate discussion
CMS
- global run of detector, MWGR#4, Oct 7th to 9th
- first usage of GPUs in production for HLT sw: test successful
- running smoothly at around 260k cores
- usual production/analysis split of 4:1
- nevertheless significant queue of both production and analysis jobs
- main processing activities:
- Run 2 ultra-legacy Monte Carlo
- Run 2 pre-UL Monte Carlo
- special attention given to HPC utilization
- migration to Rucio ongoing
- users encounter effects of the ongoing dataset synchronization
- consistency checks for nanoAOD started with Rucio
- multi-hop CTA write testing successful, recall testing next
- migration of CREAM-CEs continuing
- 1 Tier-1, 12 Tier-2 and 3 Tier-3 sites with CREAM CE(s) remaining
- VM migration for CERN hardware decommissioning progressing
- about 40 machines remaining on the Oct 30th deadline list
- deadline kindly moved by Jan to Nov 30th for two machines
- SAM3 switch to Sitemon complete
- we would like to propose improved downtime handling: an OK state overrides a scheduled downtime in the availability calculation
- if a site works and processes jobs, it really is available
- sites should be rewarded for finishing a downtime earlier
- CMS policy is to consider OK/WARNING over DOWNTIME
- easier for the MonIT team to use the same policy for all VOs
- a significant remaining feature is a test lifetime per profile
- a one-day lifetime does not match the CMS 15 min test frequency
Discussion
- the proposal to have DOWNTIME overridden by OK and WARNING status was accepted (see the sketch below)
- SAM-3 was already supposed to work that way
- examples were given of A/R corrections that were applied for such reasons
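To make the accepted policy concrete, the sketch below shows an availability calculation in which OK and WARNING results override a scheduled downtime; the status names, the time-bin granularity and the handling of downtime bins without passing tests are illustrative assumptions, not the actual MonIT/Sitemon implementation.

```python
# Minimal sketch of the accepted policy: a time bin inside a scheduled downtime
# counts as available if the site nevertheless returned OK or WARNING results.
# Status names and binning are illustrative, not the MonIT/Sitemon implementation.

def availability(bins):
    """bins: list of (test_status, in_scheduled_downtime) tuples, one per time bin."""
    available = counted = 0
    for status, in_downtime in bins:
        if status in ("OK", "WARNING"):
            available += 1           # OK/WARNING overrides DOWNTIME
            counted += 1
        elif in_downtime:
            continue                 # downtime without passing tests stays excluded
        else:
            counted += 1             # failures outside downtime count against the site
    return available / counted if counted else None

# Example: a site that ends its downtime early and passes tests during the
# declared downtime window gets credit for those bins.
day = [("CRITICAL", False)] * 6 + [("OK", True)] * 6 + [("OK", False)] * 12
print(f"availability = {availability(day):.2f}")  # 0.75, vs 0.67 if downtime bins were simply excluded
```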
LHCb
- Smooth operations and productions, no major problems
- CREAM decommissioning
- T1 sites: SARA/NIKHEF, RRCKI to be completed
- T2 sites: mostly small sites
- Information services
- providing information from the DIRAC CS to CRIC
- consuming information from BDII (for opportunistic resources) and MJF to discover new CEs
- CTA migration
- functional tests in October, migration early 2021
- workflow discussed with CTA team
- still some unknowns with respect to the protocol (XRootD? HTTPS?)
- working on a pre-production instance
- dCache namespace migration
- disk and tape files share the same namespace on dCache T1 sites, only the SRM space token distinguishes them
- we need to split the namespace (no SRM)
- we do this directly in the storage DB (help provided by dCache dev, many thanks!!!)
- GridKa acted as guinea pig, migration took place in July (thanks!)
- PIC, IN2P3 and SARA are currently testing the procedure
- using the VOMS legacy client (C++ based)
- Java needed for non-legacy VOMS
- this would be our only dependency on Java, which up to now we never had to distribute in DIRACOS.
- If we need that, the size of DIRACOS would grow considerably
- switching to non-legacy not difficult, but given the above we prefer not to switch and insist on keeping the C++ libraries working
- for testing purposes, we could foresee a test server exposing the dteam VO, so that new VOMS releases could be tested with the DIRAC certification instances.
Task Forces and Working Groups
GDPR and WLCG services
Accounting TF
Archival Storage WG
Containers WG
CREAM migration TF
Details here
Summary:
- 90 tickets
- 28 done: 14 ARC, 14 HTCondor
- 13 sites plan for ARC, 14 are considering it
- 21 sites plan for HTCondor, 12 are considering it, 7 consider using SIMPLE
- 3 tickets without reply
dCache upgrade TF
- The dCache upgrade is almost done: 42 sites have migrated to 5.* or 6.*, only 2 sites to go. The SRR issue and SRR deployment still need to be followed up.
DPM upgrade TF
- 11 sites have already been upgraded to 1.14.0 or higher and have enabled macaroons
StoRM upgrade TF
- The new upgrade cycle, to version 1.11.18, recently started; 2 sites have upgraded so far.
Information System Evolution TF
- CRIC has been successfully used for pledge definition instead of REBUS. Several minor issues were quickly fixed.
IPv6 Validation and Deployment TF
Detailed status here.
Monitoring
MW Readiness WG
Network Throughput WG
Traceability WG
Action list
Specific actions for experiments
Specific actions for sites
AOB
- Julia: shall we switch to Zoom for the next meeting?