WLCG Operations Coordination Minutes, Oct 1, 2020

Highlights

Agenda

https://indico.cern.ch/event/959819/

Attendance

  • local:
  • remote: Alessandra (Napoli), Andrew (TRIUMF), Christoph (CMS), Concezio (LHCb), David B (IN2P3-CC), David Cameron (ATLAS), David Cohen (Technion), David S (ATLAS), Federico (LHCb), Giuseppe (CMS), Horst (US-ATLAS + OU), Johannes (ATLAS), Julia (WLCG), Laurence (VOMS), Maarten (WLCG + ALICE), Matt (Lancaster), Nikolay (CMS), Panos (WLCG), Stephan (CMS), Xavier (KIT)
  • apologies:

Operations News

  • the next meeting is planned for Nov 5

Special topics

WLCG storage workshop

see the presentation

Discussion

  • David B: can the Google Doc also be used for input?
  • Julia: yes, we will send the link

WLCG Critical Services update

Please have a look at the new version of the WLCG Critical Services page. Its contents and layout have been changed according to ideas and observations discussed and agreed in Ops Coordination meetings earlier this year. The new version adds functionality to help identify the services whose implementation and operation need the most attention, in order to prevent outages as much as feasible and/or to minimize their effects. For such services the use of HA systems, load balancing and/or hot-standby setups should be considered first. In that respect it may be particularly useful to view the services provided by CERN-IT in descending order of weighted maximum criticality, as defined earlier on that page.
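The ranking idea can be illustrated with a small sketch. The service names, scores and per-experiment weights below are made up for illustration; the actual model is defined on the Critical Services page itself:

```python
# Illustrative sketch only: services are ranked by "weighted maximum
# criticality", here taken as the maximum over experiments of
# weight * criticality. All numbers below are invented examples.
services = [
    # (service, {experiment: (criticality score, experiment weight)})
    ("VOMS",    {"ATLAS": (10, 1.0), "CMS": (10, 1.0)}),
    ("FTS",     {"ATLAS": (8, 1.0),  "LHCb": (7, 0.8)}),
    ("MyProxy", {"CMS":   (5, 0.5)}),
]

def weighted_max_criticality(per_vo):
    """Maximum over experiments of weight * criticality."""
    return max(c * w for c, w in per_vo.values())

# Rank services so that those most deserving of HA, load-balancing
# and/or hot-standby setups come first.
ranked = sorted(services,
                key=lambda s: weighted_max_criticality(s[1]),
                reverse=True)
for name, per_vo in ranked:
    print(name, weighted_max_criticality(per_vo))
```

Viewing the CERN-IT services in this order highlights where redundancy efforts would pay off most.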

VOMS upgrade and discussion on usage of the VOMS legacy clients by the experiments

see the presentations

  • Laurence first gave a brief description of events:
    • the service was first upgraded on Tue Sep 22
    • on Wed Sep 23, tickets were opened by ATLAS and LHCb
    • the voms-proxy-init2 C++ legacy client failed to verify the proxies it received
    • the service was then rolled back and fully working again by the end of Wednesday afternoon
    • the VOMS devs provided a fix on Fri afternoon
    • the service was upgraded to the fixed version on Monday this week

  • Laurence:
    • the focus for testing had been on VOMS-Admin because of its GDPR changes
    • nobody expected any change of behavior affecting any VOMS client
    • the officially supported voms-proxy-init3 Java client worked fine

  • Maarten:
    • the VOMS devs found it was actually the supported C++ library that broke,
      due to a server-side change introduced by the move from SL6 to CentOS 7
    • that library is also used by other MW, e.g. LCMAPS, used by various services
    • the tickets named some services that could not handle the broken proxies
    • in the future we should try to do more tests and avoid such late discoveries
    • there were test instances with the new version, but the focus was on VOMS-Admin

  • David Cameron: it may be difficult to test all possible consumers of proxies

  • Laurence:
    • there is a trade-off between testing efforts and potential benefits
    • we can just point the service alias to hosts with a new release,
      while keeping the old hosts on stand-by for roll-back if needed

  • Horst: wasn't also XRootD v5 affected?

  • David Cameron: possibly, at least StoRM appeared to be

  • Christoph: mind that OSG distributions only contain VOMS v2 clients!

  • Maarten:
    • AFAIK, OSG even forked the VOMS sources
    • in theory that implies even more phase space for incompatibilities

  • Federico: LHCb want to avoid greatly increasing the DIRACOS size just for VOMS

  • Julia:
    • it appears we have several reasons for continued use of the C++ legacy clients
    • we will check with the devs if those can again become supported

The devs answered on Monday:

[...]
To summarize, the statement is:

- Last week's upgrade incident was caused by an incompatibility in the
  server's VOMS attribute certificate encoding logic;
- the v3 clients are the supported ones;
- we do not plan to break compatibility with voms-clients v2 and are
  happy to merge any contribution from the community in the code base
  (but won't invest development effort in that code base)

The recent changes in VOMS(-Admin) will also be presented in an upcoming GDB meeting.

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Mostly business as usual
  • No major problems
    • CERN: significant loss of batch capacity Sep 28-30 (GGUS:148808)

ATLAS

  • Stable Grid production in the past weeks with up to ~450k concurrently running grid job slots with the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis, including ~90k slots from the HLT/Sim@CERN-P1 farm and ~10k slots from Boinc. Occasional additional peaks of ~100k job slots from HPCs.
  • Will stop running Folding@Home jobs this Friday, 2 October.
  • No other major operational issues apart from the usual storage- or transfer-related problems at sites. A PanDA Pilot code update that affected job brokering and accounting was rolled back last week.
  • VOMS server update and roll-back (23/24/28/29 Sep) (INC2558102) affected job submission due to “bad” proxies on Harvester submission hosts.
  • TPC migration (https://its.cern.ch/jira/browse/ADCINFR-166): continuing to slowly move ready dCache and DPM sites to https-TPC in production. Status 30 Sep: 6 dCache, 10 DPM. Unfortunately, new bugs keep being discovered in middleware or storage software: Davix, EOS, StoRM, Echo, DPM + XRootD 5.
  • Robot certificates are becoming hard to obtain and manage due to the requirement that the owner be a permanent CERN staff member
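The https-TPC (third-party copy) transfers mentioned above work by sending an HTTP COPY request to one storage endpoint, asking it to exchange the file directly with the other endpoint. A minimal sketch of composing a pull-mode request follows; the URLs and token are placeholders, and real transfers are orchestrated by FTS against real storage endpoints (dCache, DPM, EOS, StoRM, Echo, ...):

```python
# Sketch of composing an HTTP-TPC pull request: the COPY is sent to the
# *destination* endpoint, which then pulls the file from the source.
# Endpoint URLs and the bearer token are hypothetical placeholders.

def build_tpc_pull_headers(source_url: str, source_token: str) -> dict:
    """Headers for a pull-mode COPY sent to the destination endpoint."""
    return {
        "Source": source_url,  # where the destination should pull from
        # Credential the destination forwards when contacting the source:
        "TransferHeaderAuthorization": f"Bearer {source_token}",
        "Credential": "none",  # token-based auth, no gridsite delegation
    }

headers = build_tpc_pull_headers(
    "https://source.example.org:443/vo/data/file1",  # hypothetical
    "TOKEN",                                         # hypothetical
)
# A real client would then issue something like:
#   conn = http.client.HTTPSConnection("dest.example.org", 443)
#   conn.request("COPY", "/vo/data/file1", headers=headers)
```

The long tail of bugs listed above comes from each storage implementation having to support this COPY negotiation consistently on both the active and the passive side.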

Discussion

  • Maarten, Julia: let us know if you need help in the robot certificate discussion

CMS

  • global run of detector, MWGR#4, Oct 7th to 9th
  • first usage of GPUs in production for HLT sw: test successful
  • running smoothly at around 260k cores
    • usual production/analysis split of 4:1
    • nevertheless significant queue of both production and analysis jobs
    • main processing activities:
      • Run 2 ultra-legacy Monte Carlo
      • Run 2 pre-UL Monte Carlo
    • special attention given to HPC utilization
  • migration to Rucio ongoing
    • users may encounter datasets whose synchronization is still ongoing
    • consistency checks for nanoAOD started with Rucio
    • multi-hop CTA write testing successful, recall testing next
  • migration of CREAM-CEs continuing
    • 1 Tier-1, 12 Tier-2 and 3 Tier-3 sites with CREAM CE(s) remaining
  • VM migration for CERN hardware decommissioning progressing
    • about 40 machines remaining on the Oct 30th deadline list
    • deadline kindly moved by Jan to Nov 30th for two machines
  • SAM3 switch to Sitemon complete
    • we would like to propose improved downtime handling: an OK state overrides a scheduled downtime in the availability calculation
      • if a site works and processes jobs, it really is available
      • sites should be rewarded for finishing a downtime early
      • the CMS policy is to consider OK/WARNING over DOWNTIME
      • it is easier for the MonIT team to use the same policy for all VOs
    • the significant remaining feature is a per-profile test lifetime
      • a one-day lifetime does not match the CMS 15-minute test frequency

Discussion

  • the proposal to have DOWNTIME overridden by OK and WARNING status was accepted
  • SAM-3 was already supposed to work that way
  • examples were given of A/R corrections that were applied for such reasons
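The accepted policy can be sketched as follows. The status names and the minimal time-binned model are illustrative only, not the actual MonIT/SiteMon implementation:

```python
# Sketch of the accepted policy: OK/WARNING test results override a
# scheduled DOWNTIME in the availability calculation, so a site that
# finishes its downtime early and processes jobs counts as available.

def effective_status(test_status: str, in_scheduled_downtime: bool) -> str:
    if test_status in ("OK", "WARNING"):
        return test_status      # site demonstrably works: count it as up
    if in_scheduled_downtime:
        return "DOWNTIME"       # excluded from the availability denominator
    return test_status          # e.g. CRITICAL counts as unavailable

def availability(samples):
    """samples: list of (test_status, in_scheduled_downtime) per time bin."""
    eff = [effective_status(s, d) for s, d in samples]
    counted = [s for s in eff if s != "DOWNTIME"]  # downtime bins excluded
    if not counted:
        return None
    up = sum(s in ("OK", "WARNING") for s in counted)
    return up / len(counted)

# A site that ends its downtime early and starts passing tests:
print(availability([("CRITICAL", True), ("OK", True), ("OK", False)]))
```

Under the old behavior the second bin would have been discarded as downtime; under the accepted policy the passing test counts toward availability.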

LHCb

  • Smooth operations and productions, no major problems
  • CREAM decommissioning
    • T1 sites: SARA/NIKHEF, RRCKI to be completed
    • T2 sites: most small sites
  • Information services
    • providing information from the DIRAC CS to CRIC
    • consuming information from BDII (for opportunistic resources) and MJF to discover new CEs
  • CTA migration
    • functional tests in October, migration early 2021
    • workflow discussed with CTA team
    • still some unknowns with respect to the protocol (XRootD? HTTPS?)
    • working on a pre-production instance
  • dCache namespace migration
    • disk and tape files share the same namespace on dCache T1 sites, only the SRM space token distinguishes them
    • we need to split the namespace (no SRM)
    • we do this directly in the storage DB (help provided by dCache dev, many thanks!!!)
    • GridKa acted as guinea pig, migration took place in July (thanks!)
    • PIC, IN2P3 and SARA currently testing the procedure
      • short outages foreseen
  • using the VOMS legacy client (C++ based)
    • the non-legacy VOMS clients require Java
    • this would be our only Java dependency; up to now we have never had to distribute Java in DIRACOS
    • if we need it, the size of DIRACOS would grow considerably
    • switching to the non-legacy clients is not difficult, but given the above we prefer not to switch and instead insist on keeping the C++ libraries working
    • for testing purposes, a test server exposing the dteam VO could be foreseen, so that new VOMS releases could be tested with the DIRAC certification instances

Task Forces and Working Groups

GDPR and WLCG services

Accounting TF

  • NTR

Archival Storage WG

Containers WG

CREAM migration TF

Details here

Summary:

  • 90 tickets
  • 28 done: 14 ARC, 14 HTCondor
  • 13 sites plan for ARC, 14 are considering it
  • 21 sites plan for HTCondor, 12 are considering it, 7 consider using SIMPLE
  • 3 tickets without reply

dCache upgrade TF

  • The dCache upgrade campaign is almost done: 42 sites have migrated to 5.* or 6.*, with only 2 sites to go. The SRR issue and SRR deployment still need to be followed up.

DPM upgrade TF

  • 11 sites have already been upgraded to 1.14.0 or higher and have enabled macaroons

StoRM upgrade TF

  • The new upgrade cycle, to version 1.11.18, recently started; 2 sites have upgraded so far.

Information System Evolution TF

  • CRIC has been successfully used for pledge definition instead of REBUS. Several minor issues were quickly fixed.

IPv6 Validation and Deployment TF

Detailed status here.

Monitoring

MW Readiness WG

Network Throughput WG


Traceability WG

Action list

Creation date Description Responsible Status Comments

Specific actions for experiments

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

Specific actions for sites

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

AOB

  • Julia: shall we switch to Zoom for the next meeting?

  • there were no objections
Topic revision: r13 - 2020-10-05 - MaartenLitmaath
 