WLCG Operations Coordination Minutes, Nov 14, 2019

Highlights

Agenda

https://indico.cern.ch/event/862780/

Attendance

  • local: Andrei (LHCb), Borja (monitoring), Eddie (ATLAS), Julia (WLCG), Maarten (ALICE + WLCG), Marian (monitoring + networks), Vincent (security)
  • remote: Alessandra D (Napoli), Catalin (EGI), Christoph (CMS), Di (TRIUMF), Eric (IN2P3-CC), Igor (NRC-KI), Johannes (ATLAS), LAPP, Laurent (LAL), Matt (Lancaster), Rob (Chicago + USATLAS), Ron (NLT1), Sabine (ATLAS), Stephan (CMS), Steve (Liverpool)
  • apologies:

Operations News

  • the next meeting is planned for Thu Dec 12
    • please let us know if that date would be very inconvenient

News from the CVMFS service team at CERN

Over the next two weeks, CVMFS repositories hosted at CERN will gradually start using a new signing key.
Clients accessing these repositories must have the public keys cern-it4.cern.ch.pub and
cern-it5.cern.ch.pub available in /etc/cvmfs/keys/cern.ch/. To avoid service interruptions,
sites are advised to upgrade the CVMFS client to the latest version (2.6.3-1) at their earliest convenience.
The minimum required client version is 2.1.20. If the cvmfs-config-default package is used, its minimum required version is 1.3-1.
Sites in EGI or OSG can also switch to cvmfs-config-egi or cvmfs-config-osg, respectively.
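
A site could verify the two requirements above with a small script. The following is only a minimal sketch: the key file names, the key directory and the minimum versions are taken from the announcement, while the helper names (parse_version, missing_keys, meets_minimum) are hypothetical, and the version comparison is a simplified numeric one rather than a full RPM version comparison.

```python
from pathlib import Path

# Values taken from the announcement above.
REQUIRED_KEYS = ("cern-it4.cern.ch.pub", "cern-it5.cern.ch.pub")
KEY_DIR = Path("/etc/cvmfs/keys/cern.ch")
MIN_CLIENT = "2.1.20"
MIN_CONFIG_DEFAULT = "1.3-1"


def parse_version(text):
    """Split e.g. '2.6.3-1' into (2, 6, 3, 1) for numeric comparison.

    Simplification (assumption): only purely numeric components are
    handled; anything else raises ValueError.
    """
    return tuple(int(part) for part in text.replace("-", ".").split("."))


def missing_keys(key_dir=KEY_DIR):
    """Return the required public key files that are absent from key_dir."""
    return [name for name in REQUIRED_KEYS
            if not (Path(key_dir) / name).is_file()]


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    absent = missing_keys()
    print("missing keys:", ", ".join(absent) if absent else "none")
```

The installed versions are not detected here; they would have to be supplied by the site, for instance from its package manager, and passed to meets_minimum together with MIN_CLIENT or MIN_CONFIG_DEFAULT.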

Special topics

Central deployment of WLCG services and WLCG SLATE security WG

see the presentation

  • Stephan:
    • the security review is a good add-on, but how will ops effort be reduced?
    • isn't there a shift from the sites toward central teams in the experiments?
  • Rob:
    • taking Xcache as an example, I might deploy it myself once or twice,
      then allow the central team to do it from then on
    • they could do many sites in one go, instead of having to ask sites,
      check whether the operation was done at each site, etc.
  • Stephan:
    • still, there could be more ops effort needed in the central team
  • Julia:
    • there would be an overall reduction of effort
  • Rob:
    • remember DQ2, the predecessor of Rucio, for which a central expert
      even needed root access on service hosts at sites
    • we could again have 1 person responsible for all instances of a service across WLCG
  • Maarten:
    • we already have similar experience with VOBOX services:
      owned by sites, operated by experiments

  • Maarten:
    • could we have a hybrid solution in which SLATE is used directly by sites,
      while only some services are deployed centrally?
  • Rob:
    • that indeed looks like the likely scenario
    • easy services like Frontier-Squid can be done today,
      while there is a long road for many others to follow
    • e.g. a first version of an HTCondor CE deployment is being tested at 1 site

  • Maarten:
    • do you envisage all services at your site to be deployed through SLATE?
  • Rob:
    • all would be done with Kubernetes, some through SLATE
    • on Dec 10 we will have a Pre-GDB on Kubernetes etc.
      to discuss how we can make good use of Kubernetes across WLCG
    • there might be regional deployments in the future
    • batch services could use Kubernetes

  • Julia:
    • who would be responsible for configuration recipes?
  • Rob:
    • it is still early days
    • we may need to add tools for configurations across sites
    • Xcache is easy, HTCondor CE is complex
    • here we look for the community to contribute and provide feedback
    • it is about a new deployment paradigm and learning the skills

  • Ron:
    • is it aimed at all WLCG sites, not just the small ones?
  • Rob:
    • indeed, e.g. my own site MWT2 is not small
    • at FNAL services can be run only by employees,
      but the SLATE model would simplify matters also there
  • Maarten:
    • would FNAL be OK with certain services being run by CMS?
  • Rob:
    • to be discovered
    • USCMS would be in favor of simplifying operations,
      but the main focus would rather be on the T2 sites

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Sites please plan moving to CentOS / EL 7 as soon as possible
  • Normal to high activity
    • High analysis train activity in preparation for Quark Matter 2019, Nov 3-9, Wuhan
    • Continuing afterwards
  • No major issues

ATLAS

  • Smooth Grid production over the past weeks, with ~330k concurrently running grid job slots and the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis. In addition, ~80k job slots came from the HLT/Sim@CERN-P1 farm.
  • No major issues apart from the usual storage or transfer related problems at sites.
  • We recommend that sites not use Let's Encrypt certificates
  • Follow-up discussions with Tier1 sites and dCache developers on data/tape carousel campaign
  • Plan to add Kubernetes to the set of critical services for ATLAS in https://twiki.cern.ch/twiki/bin/view/LCG/WLCGCritSvc#ATLAS

CMS

  • smooth running up to 260k cores during the last month
    • UltraLegacy re-processing
      • 2017: almost complete
      • 2018: RAW data staging has started
      • 2016: Only after 2017+2018 are finished
    • B-parked data reconstruction in the tails
    • completed 2018 heavy-ion data processing
    • large core usage end of October due to HPC resources/NERSC
    • fire protection at P5 restored and high-level trigger farm now available again
  • currently no plans to use/support Let's Encrypt certificates
  • do we all have a close enough understanding of what we mean by the entries in the WLCG service catalogue?

Discussion

  • Julia: please let us know if some non-trivial change is needed in the list of critical services

LHCb

  • smooth usage of grid resources in the last month, with an average of 101k jobs running
  • Main activity: Monte Carlo production (88%), user jobs (6%), Monte Carlo reconstruction (4%), work group productions (0.5%)
  • Small issues
    • problems accessing and deleting files on ECHO at RAL
  • Singularity support
    • Still waiting for IN2P3
  • T2-D sites gradually migrating to CC7

Discussion

  • Maarten:
    • mind that the Containers WG has provided recommendations
      allowing sites to support Singularity in 2 different ways:
      it would be good if LHCb jobs had a fall-back strategy
      at sites that cannot (yet) apply the preferred method

Task Forces and Working Groups

Upgrade of the T1 storage instances for TPC

  • Julia:
    • T1 sites please look into your plans for the required TPC support and update the table. T1 sites might already plan downtime for the upgrade.
    • the tentative deadline is the end of this year

GDPR and WLCG services

Accounting TF

  • In a couple of days we will send a request to the T1s to validate the October accounting data (CPU, tape and disk usage) using a new accounting validation workflow

Archival Storage WG

Containers WG

CREAM migration TF

dCache upgrade TF

DPM upgrade TF

  • 25 sites upgraded and re-configured with DOME. Of those, 4 are not updated in CRIC
  • 9 upgraded but to be re-configured
  • 7 sites are migrating to different storage solutions
  • 12 still to go (upgrade and re-configuration). Most of them plan to accomplish the task by the end of this year
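
For a quick view of overall progress, the four counts above can be tallied. This is only a sketch under the assumption that the four categories are disjoint and together cover all DPM sites tracked by the task force, which the minutes do not state explicitly; the dictionary labels and the summarize helper are hypothetical.

```python
# Counts copied from the status list above; labels are shortened paraphrases.
dpm_status = {
    "upgraded and re-configured with DOME": 25,
    "upgraded, still to be re-configured": 9,
    "migrating to a different storage solution": 7,
    "still to upgrade and re-configure": 12,
}


def summarize(status):
    """Return the total site count and each category's share in percent."""
    total = sum(status.values())
    shares = {label: round(100 * count / total, 1)
              for label, count in status.items()}
    return total, shares


if __name__ == "__main__":
    total, shares = summarize(dpm_status)
    print(f"sites tracked (assuming disjoint categories): {total}")
    for label, share in shares.items():
        print(f"  {label}: {dpm_status[label]} ({share}%)")
```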

Information System Evolution TF

IPv6 Validation and Deployment TF

Detailed status here.

Machine/Job Features TF

Monitoring

  • Marian: I would like to present an update on ETF in the near future
  • Julia, Maarten: let's plan that for our Dec meeting

MW Readiness WG

Network Throughput WG


Squid Monitoring and HTTP Proxy Discovery TFs

  • These TFs are finally nearing completion. The last few tasks should be completed soon.

Traceability WG

Action list

| Creation date | Description | Responsible | Status | Comments |

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments |

Specific actions for sites

| Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments |

AOB


Topic revision: r17 - 2019-11-18 - MaartenLitmaath