WLCG Operations Coordination Minutes, Nov 11, 2021

Highlights

  • Following the discussion with the LHCC referees at the end of the summer, we plan a pre-GDB early next year to discuss how to minimize the effort needed for WLCG operations, both in the experiments and in central operations. Experiment input on how best to organize this discussion is very welcome.

  • perfSONAR: please update to v4.4.1 ASAP and reboot the nodes.

Agenda

https://indico.cern.ch/event/1094950/

Attendance

  • local:
  • remote: Alberto (monitoring), Alessandra D (Napoli), Andreas (KIT), Borja (monitoring), Catalin (EGI), Chien-De (ASGC), Christoph (CMS), Dave M (FNAL), David Cameron (ATLAS), David Cohen (Technion), Felix (ASGC), Gavin (T0), Giuseppe (CMS), Henryk (LHCb), Julia (WLCG), Maarten (ALICE + WLCG), Marian (networks + monitoring), Matt D (Lancaster), Mihai (FTS), Miltiadis (WLCG), Nikolay (monitoring), Panos (WLCG), Pedro (monitoring), Renato (EGI), Riccardo (WLCG), Rizart (WLCG), Stephan (CMS), Thomas (DESY)
  • apologies:

Operations News

  • the next meeting is planned for Dec 2

Special topics

Monitoring of Data Challenges. Lessons learned and plans for improvements.

see the presentation

Discussion

  • Julia:
    • real-time monitoring is different from transfer accounting
    • if the main information is only received when a file is closed,
      only average rates can be shown after transfers have finished
  • Rizart:
    • FTS data sources contain state changes in real time
  • Borja:
    • Xrootd transfers only have the information at file close time
    • FTS dashboards can have more information
  • Julia:
    • has real-time data been compared with data on file close?
  • Borja:
    • the throughput numbers per transfer could be compared
    • the resolution is currently 1 h, so shorter periods cannot be seen
  • Julia:
    • the 1-hour cut-off is correlated with the average transfer duration
  • Borja:
    • the FTS also has a real-time dashboard that we could start from
      and then strip what we do not want
  • Mihai:
    • real-time information is available, but not forwarded yet
    • to be looked into
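
The point about file-close records can be illustrated with a minimal sketch. It assumes each record carries only a start time, an end time and a byte count; this record layout and the function name `binned_average_rate` are hypothetical and do not reflect the actual MonIT, FTS or Xrootd schemas. With only that information, the bytes can at best be spread uniformly over the transfer duration, so any variation in rate within a transfer is invisible:

```python
from collections import defaultdict


def binned_average_rate(records, bin_seconds=3600):
    """Return the average rate (bytes/s) per time bin.

    records: iterable of (start, end, nbytes) tuples, times in seconds.
    This layout is an assumption made for illustration only.
    """
    bytes_per_bin = defaultdict(float)
    for start, end, nbytes in records:
        duration = end - start
        if duration <= 0:
            continue  # skip malformed or zero-length records
        # Only the average rate of the whole transfer is known.
        rate = nbytes / duration
        t = start
        while t < end:
            bin_index = int(t // bin_seconds)
            bin_end = (bin_index + 1) * bin_seconds
            chunk_end = min(end, bin_end)
            # Attribute to this bin the bytes moved at the average rate.
            bytes_per_bin[bin_index] += rate * (chunk_end - t)
            t = chunk_end
    return {b: total / bin_seconds for b, total in sorted(bytes_per_bin.items())}


if __name__ == "__main__":
    # One transfer spanning two hourly bins, one spanning half of each.
    example = [(0, 7200, 7200), (1800, 5400, 3600)]
    print(binned_average_rate(example))
```

Real-time state changes, as mentioned for the FTS data sources, would be needed to show anything finer than these per-transfer averages.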

  • Julia:
    • another desirable feature would be to have site monitoring info
      available from a central location: are there plans for that?
  • Borja:
    • there are no plans except for revamping the Xrootd monitoring
  • Julia:
    • site monitoring information would allow us to do consistency
      checks with the central monitoring
    • the data challenge summary document being prepared may well
      have some conclusions on that matter
  • Maarten:
    • integrating site monitoring could be quite expensive in terms of
      effort needed in the monitoring team and at the sites
      • sites have different systems, at different maturity levels
      • development efforts to get what we want
      • operations efforts to ensure the information remains reliable
    • we would need to apply the usual 80-20 rule
  • Julia:
    • let's wait for the conclusions of the summary document

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Mostly business as usual
  • The tape challenge was very successful

ATLAS

  • Running 700-800k cores with 300k from opportunistic EuroHPC
  • Run 2 data and MC reprocessing ongoing with heavy staging from tape
  • CERN network outage on 15 Oct
    • Most services recovered quickly after the outage
    • Some pilot factories could not connect to DBoD (due to OTG:0066959), leading to some sites draining over the weekend
  • Data on almost all DPM sites is publicly accessible and indexed by Google
  • Tickets opened for sites to restart dCache after Brazil CA update
  • Starting Monday the 15th, Frontier traffic will be redirected from T1s (Lyon and TRIUMF) to CERN
  • We plan to switch SAM/ETF tests to use VOMS extensions from IAM next week

CMS

  • Good CPU usage above 300k cores on average, with a sizable contribution from HPC
    • main activity Run 2 ultra-legacy Monte Carlo

  • WebDAV deployment done for Tier-1/2 sites, now proceeding at Tier-3 level
  • SAM tests for WebDAV are ready to go in production
  • Planning to add IAM to CMS production VOMSes list today
  • Data challenges and tape tests finished, producing the final report

  • (Partial) network outage at CERN on Friday late afternoon (Oct 15th): OTG:0066817
    • Several CMS services affected, particularly CMS webservices
    • Most issues could quickly be fixed
    • Main issue: voms-admin clients failing (voms-proxy-init still working)

  • MonIT monitoring unavailable due to HDFS outage (Nov 1st) caused by DNS lookup errors: OTG:0067144

Discussion

  • Julia:
    • will the data challenge cleanup be the first such campaign with Rucio?
  • Dave M:
    • deletion campaigns always have to be run carefully
    • indeed, we will exercise that aspect of Rucio for the first time
    • the campaign will be run centrally
    • sites will need to check the deletions

LHCb

  • Smooth running at 140k cores
    • Low number of MC, WG and Analysis production requests in the queue
  • Reprocessing (stripping) of 2016 data is waiting for the validation request
  • Tape data challenge finished on 22/10/2021
    • ~10 GB/s of throughput was achieved
    • Cleaning of the test data is still ongoing
    • It helped to detect and solve different problems and bottlenecks

Task Forces and Working Groups

GDPR and WLCG services

Accounting TF

  • A technical meeting took place to discuss the details of integrating the new benchmark into the accounting workflow. It will be followed by a wider discussion at the WLCG Accounting Task Force meeting on the 25th of November.

Archival Storage WG

Containers WG

CREAM migration TF

Details here

Summary:

  • 90 tickets
  • 84 done: 39 ARC, 40 HTCondor, 1 both, 1 K8s, 3 none
  • 1 site plans for ARC, 1 is considering it
  • 2 sites have or plan for HTCondor, 1 is considering it

No change since last month.

dCache upgrade TF

DPM upgrade TF

StoRM upgrade TF

Information System Evolution TF

  • Further progress with the network topology implementation in CRIC. A new version is deployed for validation. The next step is the perfSONAR topology in CRIC.

IPv6 Validation and Deployment TF

Detailed status here.

Monitoring

Network Throughput WG


Traceability WG

Transition to Tokens and Globus Retirement WG

  • CMS are starting to use their IAM VOMS endpoint in production

Action list

Creation date | Description | Responsible | Status | Comments

Specific actions for experiments

Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments

Specific actions for sites

Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments

AOB

Topic revision: r10 - 2021-11-15 - MaartenLitmaath
 