DRAFT

WLCG Operations Coordination Minutes, March 5, 2020

Highlights

Agenda

https://indico.cern.ch/event/893845/

Attendance

  • local:
  • remote:
  • apologies:

Operations News

CERN databases down March 11 morning

On Wed March 11 from 08:00 CET there will be a network intervention
affecting many hundreds of databases at CERN. In particular:

  • All Oracle databases will be inaccessible for at least a few minutes: OTG:0054761

  • More than 600 DB-on-demand instances will be down: OTG:0054760

Many of those DBs are back-ends for grid services and other kinds of services.
The respective service owners should already have been informed.

The downtimes are planned to be over by 09:00 CET,
but unexpected fallout cannot be excluded.

Special topics

CERN CEPH postmortem

See the presentation

Migration from CREAM. Status overview.

See the presentation

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Mostly business as usual
  • Some fallout from the Ceph incident at CERN on Thu Feb 20

ATLAS

  • Smooth and stable Grid production with ~430k concurrently running grid job slots with the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis, including ~90k slots from the HLT/Sim@CERN-P1 farm.
  • Large amounts of derivation production for the Moriond conferences put some pressure on grid disk space; deletions are expected this week to alleviate the situation.
  • Recovered from the second CERN CEPH incident rather quickly after experts restarted affected systems.
  • No other major issues apart from the usual storage- or transfer-related problems at sites.
  • Continuing the RAW/DRAW reprocessing campaign in data/tape carousel mode: finished with data18 and started with data17 (Feb 19), approaching 50% done. SARA is not being used for the moment and its share is being staged from CERN/CTA. data16 is expected to start in about 2 weeks.
  • Started TPC testing using WebDAV in production with a few sites; this uncovered a couple of storage-related problems, so testing is stopped until they are fixed (a sketch of such a copy request follows this list).
  • Grand unification of PanDA queues ongoing.
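
TPC over WebDAV replaces GridFTP third-party copy with an HTTP COPY request: the client sends COPY to one storage endpoint and names the other endpoint in a Source (pull mode) or Destination (push mode) header, so the data flows directly between the two storages. Below is a minimal pull-mode sketch in Python; the endpoint URLs and token are hypothetical placeholders, and the full WLCG HTTP-TPC profile adds credential delegation and performance-marker handling not shown here.

    # Hedged sketch of an HTTP third-party-copy (TPC) request over WebDAV.
    # All endpoint URLs and the token below are hypothetical placeholders.
    import requests

    SOURCE = "https://source-se.example.org/path/to/file"  # hypothetical
    DEST = "https://dest-se.example.org/path/to/file"      # hypothetical
    TOKEN = "..."  # bearer token or macaroon, obtained out of band

    # Pull mode: ask the destination storage to fetch the file itself,
    # so the payload never passes through the client issuing the request.
    resp = requests.request(
        "COPY",
        DEST,
        headers={
            "Source": SOURCE,  # where the destination should pull from
            "Authorization": f"Bearer {TOKEN}",
        },
    )
    print(resp.status_code)  # on success, the body streams progress markers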

CMS

  • global run with the first Phase-2 detector configuration in mid-February
  • running at about 265k cores during last month
    • usual production/analysis mix (80%/20%)
    • ultra-legacy re-reconstruction of 2017 and 2018 data in progress
    • large pre-UL and UL Monte Carlo activities
  • CERN CEPH outage and IPv6 networking issues
    • manual intervention was needed to recover
    • global pool managers were moved to Fermilab (which revealed the performance impact of monitoring queries)
    • both happened at fortunate times and were thus not too disruptive

LHCb

  • business as usual, ~100k concurrent jobs
  • Tier0+Tier1: running legacy stripping of Run1+2 pp collision data, 2016 nearly done, 2017 just started, all other years completed
    • 2016 data stripping was impacted by the CERN CEPH incident of February 20, but recovery was eventually managed manually
  • stripping of p-Ion and Ion-Ion collision data will follow
  • MC production running elsewhere

Task Forces and Working Groups

Upgrade of the T1 storage instances for TPC

GDPR and WLCG services

Accounting TF

Archival Storage WG

Containers WG

CREAM migration TF

See the presentation

Documentation: CreamMigrationTaskForce

dCache upgrade TF

DPM upgrade TF

Information System Evolution TF

IPv6 Validation and Deployment TF

Detailed status here.

Machine/Job Features TF

Monitoring

MW Readiness WG

Network Throughput WG


  • The LHCOPN/LHCONE Asia meeting (March 8-9) was rescheduled to take place during the HSF/WLCG workshop; as that workshop has been postponed, a half-day virtual meeting will be held instead (on May 13)
  • perfSONAR infrastructure status: version 4.2.3 was released this week
  • OSG/WLCG infrastructure
    • New dashboards were created to provide a high-level overview of packet loss, throughput, latency and traceroutes (https://atlas-kibana.mwt2.org/s/networking/goto/20dd25907d61df98a0b85b1dfaed54e1)
    • Still a work in progress, but feedback is welcome
    • The dashboards already show some results of the analytical studies performed as part of the SAND project
    • The aim is to make it easier to identify new issues that are hard to spot via the experiments' data management systems (network instabilities that can impact network performance in non-deterministic ways); a sketch of querying the underlying measurement archive follows this list
  • Looking into dublin-traceroute (an extension of paris-traceroute) and its possible integration in perfSONAR
  • 100 Gbps perfSONAR testbed mailing list to join: http://cern.ch/simba3/SelfSubscription.aspx?groupName=wlcg-perfsonar-100g
  • WLCG Network Throughput Support Unit: see the TWiki for a summary of recent activities.
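
As an illustration of the data behind those dashboards, measurements can be pulled directly from a perfSONAR node's esmond measurement archive over its REST interface. A minimal sketch follows, assuming a hypothetical host name; the field names follow the esmond JSON layout (metadata entries carrying per-event-type base URIs), so treat the details as an approximation rather than a reference.

    # Hedged sketch: fetch recent packet-loss measurements from the esmond
    # measurement archive of a perfSONAR node. The host name is hypothetical.
    import requests

    HOST = "psonar.example.org"  # hypothetical perfSONAR node
    base = f"https://{HOST}/esmond/perfsonar/archive/"
    day = {"time-range": 86400}  # last 24 hours, in seconds

    # Metadata query: which measurement streams have packet-loss-rate data?
    meta = requests.get(base, params={"event-type": "packet-loss-rate", **day})

    for m in meta.json():
        for ev in m["event-types"]:
            if ev["event-type"] != "packet-loss-rate":
                continue
            # Follow the per-event-type URI to the actual time series.
            points = requests.get(f"https://{HOST}{ev['base-uri']}", params=day)
            for p in points.json():
                print(m["source"], "->", m["destination"], p["ts"], p["val"])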

Traceability WG

Action list

Creation date | Description | Responsible | Status | Comments

Specific actions for experiments

Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments

Specific actions for sites

Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments

AOB
