WLCG Operations Coordination Minutes, Nov 5, 2020

Highlights

Agenda

https://indico.cern.ch/event/970604/

Attendance

  • local:
  • remote: Andrew (TRIUMF), Christoph (CMS), Concezio (LHCb), Dave M (FNAL), David B (IN2P3-CC), David C (ATLAS), Federico (LHCb), Giuseppe (CMS), Hector (CMS), Maarten (ALICE + WLCG), Marian (monitoring + networks), Matt D (Lancaster), Panos (WLCG), Paolo (CMS), Renato (LHCb + CBPF + ROC_LA), Stephan (CMS), Thomas (DESY)
  • apologies:

Operations News

  • the next meeting is planned for Dec 3

Special topics

ETF update

see the presentation

Discussion

  • Thomas: what about the history of past tests for debugging?
  • Marian:
    • all useful outputs are sent to Monit, which keeps the history
    • ETF only has results from the current job and the previous job
  • Maarten: more logs on ETF and/or Monit to be considered only when really needed

  • Maarten: the CREAM EOL might be postponed by a few months
    • it was tied to the end of EOSC-hub, which got extended, AFAIU
    • clarified after the meeting: the CREAM EOL remains Dec 2020!

  • Renato: some ROC_LA sites may be unable to migrate before next year

  • David C: ATLAS may need to support CREAM for a few months more

  • Stephan: for CMS the Dec timeline remains OK
  • Marian: that will already allow HTCondor to be upgraded on the CMS instances

  • Maarten:
    • self-subscription for notifications is not urgent
    • people can send an e-mail or open a ticket
    • maybe it can be done via the flexible GUI of CRIC?
  • Marian: will check with the CRIC devs

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Normal to high activity on average.
    • Increased use of Singularity (through JAliEn), first at CERN.
  • No major problems.
  • CERN: migration from CASTOR to CTA.
    • Many thanks to the IT-ST group!
  • F@H has been stopped on the grid since Oct 5.

ATLAS

  • Stable Grid production in the past weeks with up to ~450k concurrently running grid job slots with the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis, including ~90k slots from the HLT/Sim@CERN-P1 farm and ~10k slots from BOINC. Occasional additional peaks of ~70k job slots from HPCs.
  • Started medium-term planning for Grid workloads in the next 6-12 months. Run-3 simulation will start towards the end of next year.
  • Deletion of obsolete data from T1 DATATAPE (13PB) was completed over the first two weeks of October
  • Deletion from MCTAPE (33PB) will start at the end of November
  • Migration to CRIC is almost complete, chasing down the last AGIS users
  • World-readable data on DPM: was brought up a year ago, needs some action
  • 1/3 of DPM sites still run SLC6, will there be a campaign to upgrade them?
  • TPC migration: 8 dCache, 14 DPM now using HTTP-TPC
    • We plan to switch off HTTP-TPC on the following sites not supporting macaroons at the end of the year to allow including larger sites:
      • FMPhI-UNIBA DPM - upgraded, not working yet, being worked on
      • NCG-INGRID-PT StoRM - still using YAIM to configure, being worked on
      • UKI-SOUTHGRID-OX-HEP DPM - agreed to move part of the storage to XCache, so it is unclear how much to push them, if at all
      • INFN-MILANO-ATLASC StoRM - unanswered ticket
      • GRIF-LAL DPM - Emmanouil is currently taking over admin duties here
      • SE-SNIC-T2 dCache - unanswered ticket
      • Australia-ATLAS DPM - ticket effectively unanswered, perhaps assigned to the wrong address
      • TR-10-ULAKBIM DPM - upgraded, not working yet, ongoing ticket

Discussion

  • Maarten:
    • we will follow up on the DPM HTTP read access issue
    • please send example endpoints via e-mail

  • Maarten:
    • we had no intention to follow up on services remaining too long on SL6
    • we consider the OS a standard Linux system administration matter
    • site admins must not count on reminders or campaigns for OS upgrades
    • we do such things only for grid middleware and the WN environment
      • e.g. to have the desired support for Singularity or experiment SW
    • production services should run supported MW on a supported OS

CMS

  • running smoothly at around 260k cores
    • usual production/analysis split of 4:1
    • main processing activities:
      • Run 2 ultra-legacy Monte Carlo
      • Run 2 pre-UL Monte Carlo
    • on track or beyond on HPC allocation use
  • migration to Rucio progressing well
    • in the tail of dataset synchronization due to a bug
    • data recall from tape for analysis working
    • CTA testing ongoing
    • need FTS v3.10 to guarantee that data has been written to tape
  • migration of CREAM-CEs continuing
    • 0/11/3 Tier-1/2/3 sites with CREAM-CE(s) remaining
  • VM migration for CERN hardware decommissioning ongoing
    • good progress
    • requested/received extensions for a few machines
  • SL 6 migration ongoing

LHCb

  • smooth running at ~110k cores, MC-dominated
  • migration to CTA ongoing
    • read/write tests EOS/CTA performed
    • data transfers from the pit to be exercised in the near future
  • some cleanup of ETF tests needed (e.g. machine-job-features)

Task Forces and Working Groups

GDPR and WLCG services

Accounting TF

Archival Storage WG

Containers WG

CREAM migration TF

Details here

Summary:

  • 90 tickets
  • 32 done: 16 ARC, 16 HTCondor
  • 14 sites plan for ARC, 11 are considering it
  • 20 sites plan for HTCondor, 10 are considering it, 7 consider using SIMPLE
  • 2 tickets without reply

dCache upgrade TF

DPM upgrade TF

StoRM upgrade TF

Information System Evolution TF

IPv6 Validation and Deployment TF

Detailed status here.

Monitoring

MW Readiness WG

Network Throughput WG


  • An update on WG activities and plans will be presented at WLCG ops coordination (tentatively in Dec)
  • perfSONAR infrastructure - 4.3.0 was released this week
    • Release notes https://www.perfsonar.net/releasenotes-2020-11-02-4-3-0.html
    • The release focused on the Python 3 migration, but it also contains important changes in PWA and adds new testing tools: ethr and s3 benchmark
    • A bug impacting around 36 nodes in OSG/WLCG (out of 166 that auto-updated) was identified and reported to the developers yesterday; a bug-fix release (4.3.1) is in the works
  • WLCG/OSG Network Monitoring Platform
    • Work on publishing directly from perfSONAR toolkits - testing started for USATLAS/USCMS sites
    • AGLT2 had a major outage due to air conditioning failure this week, which impacted some of the central services (psconfig, psetf, psmad)
  • EU project ARCHIVER plans to use perfSONAR to test cloud connectivity
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.

Traceability WG

Action list

Creation date Description Responsible Status Comments

Specific actions for experiments

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

Specific actions for sites

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

AOB

Topic revision: r12 - 2020-11-09 - MaartenLitmaath
 