WLCG Operations Coordination Minutes, Nov 14, 2019
Highlights
Agenda
https://indico.cern.ch/event/862780/
Attendance
- local: Andrei (LHCb), Borja (monitoring), Eddie (ATLAS), Julia (WLCG), Maarten (ALICE + WLCG), Marian (monitoring + networks), Vincent (security)
- remote: Alessandra D (Napoli), Catalin (EGI), Christoph (CMS), Di (TRIUMF), Eric (IN2P3-CC), Igor (NRC-KI), Johannes (ATLAS), LAPP, Laurent (LAL), Matt (Lancaster), Rob (Chicago + USATLAS), Ron (NLT1), Sabine (ATLAS), Stephan (CMS), Steve (Liverpool)
- apologies:
Operations News
- the next meeting is planned for Thu Dec 12
- please let us know if that date would be very inconvenient
News from the CVMFS service team at CERN
In the next two weeks, CVMFS repositories hosted at CERN will gradually start using a new signing key.
Clients accessing such repositories must have the public keys cern-it4.cern.ch.pub and cern-it5.cern.ch.pub
available in /etc/cvmfs/keys/cern.ch/. To avoid service interruptions, sites are recommended to upgrade the
CVMFS client to the latest version (2.6.3-1) at their earliest convenience. The minimum required client
version is 2.1.20. If the cvmfs-config-default package is used, version 1.3-1 is the minimum required.
Sites in EGI or OSG can also switch to cvmfs-config-egi or cvmfs-config-osg, respectively.
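A minimal readiness check along these lines could be run on worker nodes; the sketch below (in Python) takes the key names, key directory and version numbers from the announcement above, while the rpm queries are an assumption that only applies to RPM-based systems:

#!/usr/bin/env python3
# Minimal readiness check for the CERN CVMFS key rollover described above.
# Key names, key directory and version numbers come from the announcement;
# the rpm queries are an assumption valid for RPM-based systems only.

import os
import subprocess

KEY_DIR = "/etc/cvmfs/keys/cern.ch"
REQUIRED_KEYS = ["cern-it4.cern.ch.pub", "cern-it5.cern.ch.pub"]
MIN_CLIENT = (2, 1, 20)        # minimum required cvmfs client version
MIN_CONFIG_DEFAULT = (1, 3)    # minimum cvmfs-config-default version, if used

def rpm_version(package):
    """Return the installed version of an RPM package as a tuple of ints,
    or None if the package is not installed."""
    try:
        out = subprocess.check_output(
            ["rpm", "-q", "--qf", "%{VERSION}", package]).decode().strip()
    except (OSError, subprocess.CalledProcessError):
        return None
    return tuple(int(x) for x in out.split(".") if x.isdigit())

def main():
    # 1. the two new public keys must be present
    for key in REQUIRED_KEYS:
        path = os.path.join(KEY_DIR, key)
        print("%-55s %s" % (path, "OK" if os.path.isfile(path) else "MISSING"))
    # 2. the client must be at least 2.1.20 (2.6.3-1 recommended)
    client = rpm_version("cvmfs")
    print("cvmfs client %s: %s" % (client,
          "OK" if client and client >= MIN_CLIENT else "TOO OLD / NOT FOUND"))
    # 3. cvmfs-config-default, if installed, must be at least 1.3
    cfg = rpm_version("cvmfs-config-default")
    if cfg is not None:
        print("cvmfs-config-default %s: %s" % (cfg,
              "OK" if cfg >= MIN_CONFIG_DEFAULT else "TOO OLD"))

if __name__ == "__main__":
    main()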
Special topics
Central deployment of WLCG services and WLCG SLATE security WG
see the presentation
- Stephan:
- the security review is a good add-on, but how will ops effort be reduced?
- isn't there a shift of effort from the sites toward central teams in the experiments?
- Rob:
- taking Xcache as an example, I might deploy it myself once or twice,
then allow the central team to do it from then on
- they could do many sites in one go, instead of having to ask sites,
check whether the operation was done at each site, etc.
(a sketch of this idea follows at the end of this discussion)
- Stephan:
- still, there could be more ops effort needed in the central team
- Julia:
- there would be an overall reduction of effort
- Rob:
- remember DQ2, the predecessor of Rucio, for which a central expert
even needed root access on service hosts at sites
- we could again have 1 person responsible for all instances of a service across WLCG
- Maarten:
- we already have similar experience with VOBOX services:
owned by sites, operated by experiments
- Maarten:
- could we have a hybrid solution in which SLATE is used directly by sites,
while only some services are deployed centrally?
- Rob:
- that looks the likely scenario indeed
- easy services like Frontier-Squid can be done today,
while there is a long road for many others to follow
- e.g. a first version of an HTCondor CE deployment is being tested at 1 site
- Maarten:
- do you envisage all services at your site to be deployed through SLATE?
- Rob:
- all would be done with Kubernetes, some through SLATE
- on Dec 10 we will have a Pre-GDB on Kubernetes etc.
to discuss how we can make good use of Kubernetes across WLCG
- there might be regional deployments in the future
- batch services could use Kubernetes
- Julia:
- who would be responsible for configuration recipes?
- Rob:
- it is still early days
- we may need to add tools for configurations across sites
- Xcache is easy, HTCondor CE is complex
- here we look for the community to contribute and provide feedback
- it is about a new deployment paradigm and learning the skills
- Ron:
- is it aimed at all WLCG sites, not just the small ones?
- Rob:
- indeed, e.g. my own site MWT2 is not small
- at FNAL services can be run only by employees,
but the SLATE model would also simplify matters there
- Maarten:
- would FNAL be OK with certain services being run by CMS?
- Rob:
- to be discovered
- USCMS would be in favor of simplifying operations,
but the main focus would rather be on the T2 sites
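As a concrete illustration of the "many sites in one go" idea above, a minimal sketch of a central check of the same service across several sites' Kubernetes clusters could look as follows; the kubeconfig contexts, namespace and deployment name are hypothetical, and it assumes the central team uses the standard kubernetes Python client:

#!/usr/bin/env python3
# Sketch of a central operator checking the same service on several sites'
# Kubernetes clusters in one go. The kubeconfig contexts, the namespace and
# the deployment name are hypothetical.

from kubernetes import client, config

SITE_CONTEXTS = ["site-a", "site-b", "site-c"]   # hypothetical, one per site
NAMESPACE = "wlcg-services"                      # hypothetical namespace
DEPLOYMENT = "frontier-squid"                    # hypothetical deployment name

def check_site(context):
    """Report whether the deployment is fully available on one site's cluster."""
    config.load_kube_config(context=context)     # switch to this site's cluster
    apps = client.AppsV1Api()
    dep = apps.read_namespaced_deployment(DEPLOYMENT, NAMESPACE)
    ready = dep.status.ready_replicas or 0
    desired = dep.spec.replicas or 0
    state = "OK" if desired > 0 and ready == desired else "DEGRADED"
    print("%-10s %-8s %d/%d replicas ready" % (context, state, ready, desired))

if __name__ == "__main__":
    for ctx in SITE_CONTEXTS:
        try:
            check_site(ctx)
        except Exception as exc:                 # e.g. unreachable cluster
            print("%-10s ERROR    %s" % (ctx, exc))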
Middleware News
- Useful Links
- Baselines/News
Tier 0 News
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- Sites, please plan to move to CentOS / EL 7 as soon as possible
- Normal to high activity
- High analysis train activity in preparation for Quark Matter 2019, Nov 3-9, Wuhan
- Continuing afterwards
- No major issues
ATLAS
- Smooth Grid production over the past weeks, with ~330k concurrently running grid job slots filled by the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis. In addition, ~80k job slots from the HLT/Sim@CERN-P1 farm.
- No major issues apart from the usual storage or transfer related problems at sites.
- We recommend that sites not use Let's Encrypt certificates
- Follow-up discussions with Tier1 sites and dCache developers on data/tape carousel campaign
- Plan to add Kubernetes to the set of critical services for ATLAS in https://twiki.cern.ch/twiki/bin/view/LCG/WLCGCritSvc#ATLAS
CMS
- smooth running up to 260k cores during the last month
- UltraLegacy re-processing
- 2017: almost complete
- 2018: RAW data staging has started
- 2016: Only after 2017+2018 are finished
- B-parked data reconstruction in the tails
- completed 2018 heavy-ion data processing
- large core usage end of October due to HPC resources/NERSC
- fire protection at P5 restored and high-level trigger farm now available again
- currently no plans to use/support Let's Encrypt certificates
- do we all have a sufficiently close understanding of what we mean by the entries in the WLCG service catalogue?
Discussion
- Julia: please let us know if some non-trivial change is needed in the list of critical services
LHCb
- smooth usage of grid resources in the last month, with an average of 101k running jobs
- Main activities: Monte Carlo production (88%), user jobs (6%), Monte Carlo reconstruction (4%), working group productions (0.5%)
- Small issues
- problems with file access and file deletion on ECHO at RAL
- Singularity support
- T2-D sites gradually migrating to CC7
Discussion
- Maarten:
- mind that the Containers WG has provided recommendations
allowing sites to support Singularity in 2 different ways:
it would be good if LHCb jobs had a fall-back strategy
at sites that cannot (yet) apply the preferred method (see the sketch below)
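One possible shape of such a fall-back, as a minimal sketch on the pilot/job side: try the site-installed (typically setuid) Singularity first and fall back to an unprivileged build, e.g. one distributed via CVMFS, run with user namespaces. The CVMFS path, the --userns flag and the probe command are illustrative assumptions, not LHCb's actual implementation:

#!/usr/bin/env python3
# Sketch of a pilot-side fall-back for launching a payload in Singularity.
# The CVMFS path, the --userns flag and the probe command are illustrative
# assumptions.

import subprocess

# candidate invocations, tried in order:
#  1. site-installed (typically setuid) singularity from PATH
#  2. an unprivileged build distributed via CVMFS, using user namespaces
CANDIDATES = [
    ["singularity"],
    ["/cvmfs/oasis.opensciencegrid.org/mis/singularity/bin/singularity",  # hypothetical path
     "--userns"],
]

def find_singularity(image):
    """Return the first candidate that can run a trivial command in the image
    (assumes /bin/true exists inside the image), or None if none works."""
    for base in CANDIDATES:
        probe = base[:1] + ["exec"] + base[1:] + [image, "true"]
        try:
            if subprocess.call(probe) == 0:
                return base
        except OSError:
            pass          # binary not available, try the next candidate
    return None

def run_in_container(image, command):
    """Run the payload command inside the image with the first working setup."""
    base = find_singularity(image)
    if base is None:
        raise RuntimeError("no working Singularity installation found")
    return subprocess.call(base[:1] + ["exec"] + base[1:] + [image] + command)

# example (hypothetical image and payload):
# run_in_container("/cvmfs/lhcb.cern.ch/containers/slc6", ["echo", "hello"])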
Task Forces and Working Groups
Upgrade of the T1 storage instances for TPC
- Julia:
- T1 sites, please look into your plans for the required TPC support and update the table; sites may already want to schedule a downtime for the upgrade
- the tentative deadline is the end of this year
GDPR and WLCG services
Accounting TF
- In a couple of days we will send a request to the T1s to validate the October accounting data (CPU, tape and disk usage) using a new accounting validation workflow
Archival Storage WG
Containers WG
CREAM migration TF
dCache upgrade TF
DPM upgrade TF
- 25 sites have been upgraded and re-configured with DOME; 4 of those are not yet updated in CRIC
- 9 sites have been upgraded but still need to be re-configured
- 7 sites are migrating to different storage solutions
- 12 sites still need to upgrade and re-configure; most of them plan to accomplish this by the end of the year
Information System Evolution TF
IPv6 Validation and Deployment TF
Detailed status here.
Machine/Job Features TF
Monitoring
- Marian: I would like to present an update on ETF in the near future
- Julia, Maarten: let's plan that for our Dec meeting
MW Readiness WG
Network Throughput WG
Squid Monitoring and HTTP Proxy Discovery TFs
- These TFs are finally nearing completion. The last few tasks should be completed soon.
Traceability WG
Action list
Specific actions for experiments
Specific actions for sites
AOB