WLCG Operations Coordination Minutes, Aug 29, 2019

Highlights

Agenda

https://indico.cern.ch/event/840490/

Attendance

  • local: Andrea M (CERN data mgmt), Boris (WLCG), Concezio (LHCb), Julia (WLCG), Maarten (ALICE + WLCG), Renato (CBPF + LHCb)
  • remote: ? (KIT), Alessandra (Manchester + ATLAS + WLCG), Andrea S (GRIF), Christoph (CMS), Dave (FNAL), Di (TRIUMF), Eric (IN2P3-CC), Felix (ASGC), Igor (NRC-KI), Johannes (ATLAS), Luca (CNAF), Marcelo (CNAF), Mike (ASGC), Miro (databases + WLCG), Panos (WLCG), Pepe (PIC), Prasun (Kolkata), Ron (NL-T1), Simon (TRIUMF), Thomas (DESY), Tigran (dCache), Vikas (Kolkata)
  • apologies:

Operations News

  • The next meeting is planned for Oct 3
    • Please let us know if that date would pose a major inconvenience

Special topics

CNAF outage

see the report

  • it has been uploaded into the WLCGServiceIncidents archive
  • a few more details are in the experiment reports below

Upgrade of storage at the WLCG sites to versions required for TPC

  • Alessandra: the DOMA TPC WG have set the end of this year as a tentative deadline

Review of the situation at the T1 sites

  • Tier1 site support teams: if you have not yet checked and updated your site information in CRIC, please do so ASAP.

Detailed instructions on how to proceed

New accounting workflow

see the presentation

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • NTR

ATLAS

  • Smooth Grid production over the past weeks, with ~320k concurrently running grid job slots and the usual mix of MC generation, simulation, reconstruction, derivation production, user analysis and a dedicated reprocessing campaign (see below). In addition, ~90k job slots from the HLT/Sim@CERN-P1 farm when it was not used for TDAQ purposes. Some periods of additional HPC contributions, with peaks of ~50k concurrently running job slots running simulation via the EventService.
  • Since August 8th we have been running a special reprocessing campaign of 2018 data (~7 PB, ~3 million files) using the data carousel setup. This requires a stage-in of all RAW inputs from tape at the Tier1s. A few notes and possible future improvements:
    • Expected throughput: staging 7 PB in 2 weeks corresponds to ~5.8 GB/s overall, i.e. ~580 MB/s for a Tier1 with a 10% share (see the arithmetic sketch after this report)
    • INFN-T1 could not be used due to its ~2 weeks downtime - CERN CTA was used very successfully instead.
    • Observed contention in the data export along the tape -> disk buffer -> data disk path at TRIUMF and IN2P3-CC, since the tape staging was faster than the copying of the data away from the disk buffer - room for improvement on the FTS and dCache side; more news to come from the FTS experts.
    • Suboptimal tape staging performance observed at FZK and PIC
    • Tier1s: are the file pins respected on the tape disk buffer?
    • Improve the ATLAS WFMS task release threshold w.r.t. the optimal fraction of input files available on disk after staging from tape
  • Could the dCache team please have a look and implement the following space token writing feature request, which is critical for TPC and non-GridFTP writes: https://github.com/dCache/dcache/issues/3920
  • CentOS7 site migration still not finished: https://twiki.cern.ch/twiki/bin/view/AtlasComputing/CentOS7Deployment?sortcol=2;table=2;up=0#sorted_table We have set the analysis queues at 13 sites to job-broker-off mode and will do the same for the corresponding production queues on September 15th if there is no visible progress.
  • We are now in the tails of switching to the new PanDA worker node pilot version 2 + Singularity. Only CentOS7 queues are being moved. Excluding CERN-P1, the job slots are converted/not converted as follows: ~300k pilot2, ~260k pilot2+singularity, ~20k pilot1 (still to be migrated to pilot2+singularity+CentOS7)
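
For reference, a minimal sketch of the arithmetic behind the staging rates quoted in the reprocessing item above, assuming decimal units (1 PB = 10^15 bytes), a 14-day window and a 10% Tier1 share:

# Back-of-the-envelope check of the staging rates quoted above.
total_bytes = 7e15           # ~7 PB of 2018 RAW data (decimal units assumed)
window_s = 14 * 24 * 3600    # 2-week campaign window in seconds

overall_rate = total_bytes / window_s  # combined rate over all Tier1s
tier1_rate = 0.10 * overall_rate       # share for a Tier1 holding 10% of the data

print(f"overall:   {overall_rate / 1e9:.1f} GB/s")  # ~5.8 GB/s
print(f"10% Tier1: {tier1_rate / 1e6:.0f} MB/s")    # ~580 MB/s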

CERN ATLAS FTS incident

see the presentation

  • Maarten: besides Rucio, could also the FTS apply a request cap to avoid overload?
  • Andrea M: to be looked into

  • Julia: might an overloaded FTS redirect clients to other instances?
  • Andrea M: once the FTS becomes able to refuse requests when overloaded,
    the client (e.g. Rucio) would need to make its own decision
  • Johannes: Rucio currently has static FTS assignments per region, no load-balancing
  • Julia: Rucio could be made to react to FTS overload errors and fail over to other instances (see the sketch below)

  • Julia: there will be updates on FTS improvements in the next meetings
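
For illustration only, a minimal sketch of the failover pattern discussed above: a submitter tries a list of FTS instances in turn and moves on to the next one when an instance signals overload. The instance URLs, the /jobs path and the use of HTTP 429/503 as overload signals are assumptions made for this sketch, not the actual FTS REST API or Rucio behaviour.

import json
import urllib.error
import urllib.request

# Hypothetical FTS instances a client could fail over between (placeholder URLs).
FTS_INSTANCES = [
    "https://fts-a.example.org:8446",
    "https://fts-b.example.org:8446",
]

OVERLOAD_CODES = {429, 503}  # assumed "overloaded" signals for this sketch


def submit_with_failover(job: dict) -> str:
    """Try each FTS instance in turn; skip instances that report overload."""
    payload = json.dumps(job).encode()
    for base in FTS_INSTANCES:
        req = urllib.request.Request(
            base + "/jobs",  # placeholder submission path, not the real API
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                return f"submitted via {base}: HTTP {resp.status}"
        except urllib.error.HTTPError as exc:
            if exc.code in OVERLOAD_CODES:
                continue  # instance overloaded: fail over to the next one
            raise         # any other error is a real failure
    raise RuntimeError("all FTS instances reported overload")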

Discussion

  • Julia: concerning the dCache feature request, we will follow up with the devs

  • Julia: now that a sizable number of DPM sites have already upgraded,
    did the situation with the deletion errors improve?
  • Johannes:
    • we still see issues every week and have to keep opening tickets
    • for the next meeting we can prepare some stats

CMS

  • Rather smooth running over the summer
    • No major site issues, but some (longer than usual) delays due to site staff being on vacation
  • Main activities
    • Reconstruction of 'parked' b-physics events ~75% done
    • Reconstruction of HI data almost finished
    • Started the full reconstruction of Run2 data and MC
      • This will keep us busy for most of LS2
    • All of this requires quite some tape staging
  • Migration to CRIC continues
    • Legacy CMS siteDB now frozen

LHCb

  • Smooth production of both Monte Carlo and real data
    • No major site issues
    • The CNAF outage slightly delayed our re-processing campaign of Run1 data, but we are now catching up
  • As we are starting to use containers, the following requirements apply (the LHCb VO card has been updated accordingly; a check sketch follows after this list)
    • CentOS7
    • Singularity
    • user namespaces
    • /cvmfs/cernvm-prod.cern.ch (in addition to /cvmfs/lhcb.cern.ch, /cvmfs/lhcb-condb.cern.ch, /cvmfs/grid.cern.ch) on the worker nodes
    • we will be ticketing sites (starting with the Tier1s, then the Tier2s) that are not yet ready with those.
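
A minimal sketch, assuming it is run directly on a worker node, of how a site could verify the requirements listed above; the CentOS7 test is a simple heuristic on /etc/os-release and the paths are the CVMFS repositories listed.

import os
import shutil

def check_lhcb_container_requirements() -> dict:
    """Check the worker-node requirements listed above; returns name -> bool."""
    checks = {}

    # CentOS7: heuristic check of /etc/os-release
    try:
        with open("/etc/os-release") as f:
            osrel = f.read().lower()
        checks["CentOS7"] = "centos" in osrel and 'version_id="7' in osrel
    except OSError:
        checks["CentOS7"] = False

    # Singularity available on the PATH
    checks["singularity"] = shutil.which("singularity") is not None

    # Unprivileged user namespaces enabled
    try:
        with open("/proc/sys/user/max_user_namespaces") as f:
            checks["user namespaces"] = int(f.read()) > 0
    except OSError:
        checks["user namespaces"] = False

    # Required CVMFS repositories mounted on the worker node
    for repo in ("cernvm-prod.cern.ch", "lhcb.cern.ch",
                 "lhcb-condb.cern.ch", "grid.cern.ch"):
        checks["/cvmfs/" + repo] = os.path.isdir("/cvmfs/" + repo)

    return checks

if __name__ == "__main__":
    for name, ok in check_lhcb_container_requirements().items():
        print(("OK      " if ok else "MISSING ") + name)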

Task Forces and Working Groups

GDPR and WLCG services

Accounting TF

  • See presentation of the new accounting workflow

Archival Storage WG

Containers WG

CREAM migration TF

dCache upgrade TF

DPM upgrade TF

  • Julia:
    • already >50% are at the required version, but only a subset have switched to DOME
    • we will open tickets for the remaining sites

  • Renato:
    • mind it is not just an upgrade, but a major configuration change
    • after a lot of help from the DPM devs, our instance at CBPF is fine now

Information System Evolution TF

  • The MONIT team has started the integration with CRIC to get rid of the dependency on REBUS

IPv6 Validation and Deployment TF

Detailed status here.

Machine/Job Features TF

Monitoring

  • Status report presentation is planned for the October meeting

MW Readiness WG

Network Throughput WG


Squid Monitoring and HTTP Proxy Discovery TFs

  • Nothing to report

Traceability WG

Action list

Creation date | Description | Responsible | Status | Comments
03 Nov 2016 | Review VO ID Card documentation and make sure it is suitable for multicore | WLCG Operations | In progress | GGUS:133915

Specific actions for experiments

Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments

Specific actions for sites

Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments

AOB
