WLCG Operations Coordination Minutes, Sep 1, 2022
Highlights
Agenda
https://indico.cern.ch/event/1188546/
Attendance
- local:
- remote: Alessandra (Napoli), Alessandro (EGI), Andrea (WLCG), Andrew (TRIUMF), Christoph (CMS), David Cameron (ATLAS + ARC), David Cohen (Technion), Doug (BNL), Giuseppe (CMS), Maarten (ALICE + WLCG), Marian (networks + ETF), Mark (LHCb + Birmingham), Panos (WLCG), Pavel (KIT), Shawn (networks + AGLT2), Stefano (CNAF), Stephan (CMS), Thomas (DESY), Xavier (KIT), ufa
- apologies: Julia (WLCG)
Operations News
- the next meeting is planned for Sep 29
Special topics
Future WLCG helpdesk
see the presentation
Discussion
- Alessandro:
- there also are requirements from EOSC-Future and other parties
- we need to keep track of all requirements together somewhere
- the EGI JIRA could be used for that
- this presentation should also be given to the EGI OMB
- Pavel:
- we should indeed merge the requirements in some way
- a presentation to the EGI OMB can be scheduled
- David Cohen:
- what will happen to the GGUS ticket history?
- many tickets may provide a valuable knowledge base
- Maarten:
- we consider preserving them as static HTML pages for a while
- however, there are e.g. GDPR issues to be concerned about
- and the tickets' relevance will steadily decline over time
- Pavel:
- the new helpdesk has knowledge base functionality
- some of its knowledge might be imported from GGUS
- Maarten:
- the new helpdesk will need to have several features for WLCG
- we will not just copy everything that GGUS can do today
- some things were never really used and can finally be dropped
- we may also come up with new things to consider
- we will organize followup meetings for experiment input etc.
- Pavel:
- we have a Google Doc and Sheet for first proposals etc.
- from that we derive a prioritized list of features to be implemented
- we expect to have a first version deployed early next year
- Maarten:
- we then will start trying it out with some support units
- we need to have everything working well before the end of 2024, when the Remedy version used by GGUS becomes unsupported
- GGUS will disable ticket submission when the new helpdesk is sufficiently ready to take over
- old tickets are handled in GGUS until it has to be shut down
- Alessandro:
- the evolution of the new helpdesk and the transition from GGUS are funded through EGI-ACE etc.
KIT storage SIR
see the presentation
Discussion
- Maarten:
- the incident looks mostly due to human error?
- Xavier:
- well, it should also be noted that dCache had received return codes that suggested those files were safe on GPFS, when in fact they had only been successfully received by the GPFS buffer from which they are sent to disk servers
- Maarten:
- maybe that is due to configuration, but receiving confirmation only when the files are on the disk servers might well imply a big, possibly unacceptable impact on performance
- we can live with such occasional, small-scale (!) incidents
Middleware News
- Useful Links
- Baselines/News
- we intend to launch a short questionnaire about the experiences, if any, that sites have gathered on the various EL8- and possibly EL9-compatible distributions
- results to be presented in Ops Coordination or the GDB
Tier 0 News
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
ATLAS
- Smooth running with 500-700k cores including HLT farm since beam stopped last week
- Removing GridFTP support completely from BNL tape systems revealed some dependencies in FTS, which will be fixed in a release to be deployed next week
- Robot cert used for all Rucio transfers will change owner and hence DN. We will do thorough testing and notify sites of any problems.
- All ATLAS jobs have used Apptainer instead of Singularity since 4 August (ATLAS-maintained version in CVMFS with fallback to local install)
- Some concerns from sites about enabling dCache SRR API, since it exposes a lot of admin information too
Discussion
- Thomas:
- not clear if the issue is with the dCache SRR or REST interface?
- Xavier:
- David Cameron:
- Maarten:
- maybe open an issue in the dCache tracker as needed?
- will inform Julia of this potential concern
CMS
- August was a rather quiet month, very few issues for CMS
- running smoothly between 350k and 450k cores
- usual production/analysis split of 75% and 25%
- significant contribution from HPCs up to 70k cores
- main production activity Run 2 ultra-legacy Monte Carlo
- VOC working with groups on CERN VM migration campaign
- waiting on python3 version/port of HammerCloud
LHCb
- Generally smooth running
- No significant issues continuing from the summer
- Significant progress on a couple of problematic and long-standing tickets (many thanks to those involved!):
- GGUS:155120: Slow deletions at RAL. XRoot proxies found to serialise deletion requests. Config fix incoming.
- GGUS:153653: Socket timeouts at NL-T1. Found to be due to ROOT attempting to open all files at once and hitting limits on single storage pool nodes. Config changed to increase the number of connections, and a bug report opened against hadd (https://github.com/root-project/root/issues/11276).
- With some fluctuations, we have been running nearly 200K jobs reliably all summer, mostly MC generation
Task Forces and Working Groups
GDPR and WLCG services
Accounting TF
dCache upgrade TF
Information System Evolution TF
IPv6 Validation and Deployment TF
Detailed status here.
Monitoring
- We remind the experiments and the xrootd and dCache development teams to provide input for the draft of the minimum schema for transfer monitoring, which should be used for all WLCG transfers: https://docs.google.com/document/d/1tBECfGHk4AybPoorpEe2WiBwYH9zodv-4shiW1RGUv4/edit
- The deadline is 30 September.
Network Throughput WG
- perfSONAR infrastructure - 4.4.4 is the latest release
- WLCG/OSG Network Monitoring Platform
- New alarms detecting LHCOPN/LHCONE routing changes and sudden drops in throughput are now available
- Each alarm comes with a link to a dashboard that shows all the details
- On-going development and testing of the direct publishing of measurements from perfSONARs
- Recent and upcoming WG updates:
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
WG for Transition to Tokens and Globus Retirement
- on Aug 22, the AuthZ WG published v1.0 of the WLCG Token Transition Timeline
- contains a summary of what was done in the last 2.5 years
- describes an optimistic set of milestones to work towards
- updates will be published from time to time as needed
Action list
Specific actions for experiments
Specific actions for sites
AOB