WLCG Operations Coordination Minutes, Oct 14, 2021

Highlights

  • A joint HEPiX-GDB event to discuss the future Linux strategy is planned for Oct 25.
    More details will follow once the agenda has been finalized.

Agenda

https://indico.cern.ch/event/1085877/

Attendance

  • local:
  • remote: Alessandra (Manchester + ATLAS + WLCG), Borja (monitoring), Chien-De (ASGC), Christoph (CMS), Dave M (FNAL), David B (IN2P3-CC), David Cameron (ATLAS), David Cohen (Technion), David Collados (databases), Eric F (IN2P3), Eric G (databases), Eva (databases), Federico (LHCb), Giuseppe (CMS), Henryk (LHCb), Joao (FTS), Julia (WLCG), Luca (storage), Maarten (ALICE + WLCG), Matt (Lancaster), Mihai (FTS), Miltiadis (WLCG), Panos (WLCG), Philippe (ATLAS), Renato (EGI), Riccardo (WLCG), Stephan (CMS), Steven (FTS)
  • apologies:

Operations News

  • GGUS synchronization toward ServiceNow was broken from Sep 8 to Oct 4
    • CERN and FNAL were affected
    • Service managers had to consult GGUS tickets directly

  • the next meeting is planned for Nov 11

Special topics

WLCG Network Challenge

see the presentation

Discussion

  • Julia:
    • transfers were killed when they did not complete in time:
      were those failures included in the stats?
  • Riccardo, Alessandra:
    • they were taken into account when they had reached the FTS level
    • earlier timeouts probably should also be taken into account
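
The accounting question above can be illustrated with a minimal sketch. It assumes a simplified transfer record with only two flags; the `Transfer` class, its field names, and the sample data are hypothetical and do not reflect the actual FTS or Rucio monitoring schema. It only shows how the failure rate changes once transfers that were killed before reaching FTS are counted as well.

```python
from dataclasses import dataclass


@dataclass
class Transfer:
    """Hypothetical, simplified transfer record (not the real monitoring schema)."""
    reached_fts: bool  # True if the transfer was submitted to FTS
    succeeded: bool    # True if the transfer completed in time


def failure_rate(transfers, include_pre_fts=False):
    """Fraction of failed transfers among those that are counted.

    By default only transfers that reached FTS are counted; with
    include_pre_fts=True, earlier timeouts are counted as failures too.
    """
    counted = [t for t in transfers if include_pre_fts or t.reached_fts]
    if not counted:
        return 0.0
    failed = sum(1 for t in counted if not t.succeeded)
    return failed / len(counted)


# Illustrative data: three transfers reached FTS (one failed),
# one was killed before it ever reached FTS.
sample = [
    Transfer(reached_fts=True, succeeded=True),
    Transfer(reached_fts=True, succeeded=True),
    Transfer(reached_fts=True, succeeded=False),
    Transfer(reached_fts=False, succeeded=False),
]

rate_fts_only = failure_rate(sample)
rate_all = failure_rate(sample, include_pre_fts=True)
```

With this sample the rate rises from one third to one half when the earlier timeout is included, which is the kind of shift in the plots that the discussion anticipates.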

  • Julia:
    • the failure rates for CMS look rather high?
  • Riccardo:
    • CMS used a test infrastructure instead of the production one:
      • it needed to be debugged
    • the storage for tests was essentially borrowed from production
    • when a test RSE got full, Rucio was unaware of it
    • the error analyses are still ongoing

  • Julia:
    • could we have functionality in common to avoid such failures?
  • Riccardo:
    • the differences were due to different implementations of the
      Rucio services used in the challenge
    • in the ESCAPE project we cannot have dedicated implementations:
      we just use the upstream version with minimal customization

  • Julia:
    • a potential way to save effort would be through common Rucio operations.
      Could the fact that CMS and ATLAS have different implementations in Rucio
      be an obstacle for common Rucio operations?
  • Riccardo:
    • Most probably implementation differences won't have an impact on operations.

  • Julia:
    • quite a number of people were involved in the challenge?
  • Riccardo:
    • central operations needed significant effort
    • it could be reduced by changing the way the tests are performed
    • the selection of unique Rucio data sets can be improved
    • while the big achievements on the monitoring side are acknowledged,
      more effort would be desirable there
    • a common dashboard would avoid having to check many dashboards
  • Julia:
    • the WLCG common dashboard?
  • Riccardo:
    • it currently has less power: fewer choices and not all the data
    • failures should also be included, which will change the pictures
  • Alessandra:
    • about 10 people were involved, which is OK for the first attempt
      • the next time we should need fewer people
    • a lot of time was spent on debugging the CMS test infrastructure
    • the DC dashboard should not be separate from the WLCG dashboard
      • the DC functionality is useful also for normal operations
      • some improvements will need to be implemented
        • consistent inclusion of Xrootd stats
        • more selection options

  • Alessandra:
    • quite some work was needed to agree on a monitoring data format
  • Julia:
    • the Xrootd monitoring redesign needs to be consistent with FTS
  • Alessandra:
    • the FTS also needs to be used consistently by the experiments
      • LHCb currently use it differently compared to ATLAS and CMS

  • Julia:
    • the goal is to have reliable monitoring covering the 4 experiments which
      can be used for operations and not just for data challenges

FTS service incident report

see the presentation

Discussion

  • David Cameron:
    • what is the reason for backing up the DB?
  • Steven:
    • the configuration details need to be backed up: link limits etc.
    • at the moment the backups are transparent
    • the workflow for our DB thus can be similar to that of other instances
  • Maarten:
    • can the configuration be exported to disk and backed up from there?
    • the FTS should then also be able to import it from disk
    • not urgent, but it would allow DB backups to be stopped
  • Steven:
    • we will look into having the configuration on EOS
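
The export and import idea raised above could look roughly like the following sketch. It is not the FTS API or its configuration schema: the dictionary layout, the key names, and the storage endpoint are assumptions made purely for illustration. The point is only that a configuration dumped to a file on disk (for example on a backed-up file system) can be restored without relying on a DB backup.

```python
import json
import os


def export_config(config, path):
    """Write the configuration to disk as JSON.

    The data is written to a temporary file first and then renamed,
    so a crash cannot leave a half-written export behind.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(config, fh, indent=2, sort_keys=True)
    os.replace(tmp_path, path)


def import_config(path):
    """Read a previously exported configuration back from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Hypothetical configuration content; real link limits live in the FTS DB.
example_config = {
    "link_limits": [
        {"source": "*", "destination": "davs://storage.example.org", "max_active": 100},
    ],
}
```

A round trip through `export_config` and `import_config` should return a dictionary equal to the original one.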

  • David Cameron:
    • after ATLAS had noticed the original problem,
      it took some time before the FTS team reacted
  • Steven:
    • we had in fact already posted an outage on the Thursday
      and were in touch with Mario from ATLAS
    • we were too optimistic about resolving the incident
    • we learned GGUS should be used to communicate such issues
  • Maarten:
    • the timing of the incident was awkward:
      • main developer Mihai was on holidays, but still helped out
      • Steven had only recently joined the team and was learning the ropes
    • instead of GGUS, experiment operations lists can also be used

  • Julia:
    • there were other instabilities in August: were they related?
  • Steven:
    • indeed, they were all understood
    • for example, to create the DB replica, a backup needed to be made,
      which had an impact on the performance of the primary...
  • Mihai:
    • the sosreport fix was implemented toward the end of August
    • one other host was affected in the meantime

  • Renato:
    • when do you expect an FTS release with those various fixes?
  • Steven:
    • the DB query fix has already been released
    • the MySQL 8 transition will be released later
      • we first want to upgrade the experiment instances at CERN
      • we also have to clean up a quick Python hack we did for MySQL 8
        • our Python code needs to be ported to Python 3

Middleware News

  • The extension of the Brazilian ANSPGrid CA caused operational issues
    • At least dCache and VOMS-Admin instances were affected
      • Through the CAnL library that deals with authentication
      • The issue resembles what happened when the Swiss CA was renewed
      • The new CA is masked by the old CA in memory
      • The only recourse for affected services is to be restarted

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • High activity on average, no major issues
    • New record reached: 191k concurrent jobs
  • Run-3 preparations:
    • Steady progress toward phasing out the legacy AliEn system (Perl)
    • It is being replaced with the JAliEn framework (Java):
      • Supporting Run-3 multi-core as well as Run-1/Run-2 legacy jobs
        • Run-3 jobs currently make up a small fraction
        • Their numbers are expected to grow a lot before Run-3 starts
      • At most sites, all jobs are expected to use 8 cores each
        • For 8-core Run-3 payloads or multiple single-core payloads
      • The payloads are by default run through Singularity
      • Already in use at CERN and several T1 and T2 sites
      • Site VOboxes will look simpler as well
  • Tape challenge (Oct 11-15) going fine so far
    • Transfer rates can be seen here

ATLAS

  • Running 700-800k cores with 300k from opportunistic EuroHPC
  • Reprocessing of full Run 2 data and MC has started.
    • Expect to stage 18 PB RAW and 15 PB MC HITS from tape over the next months
  • Data challenges mostly successful so far
  • All ATLAS databases successfully updated to Oracle 19 without major downtime
  • Plans for DPM retirement, official EOL?

Discussion

  • David Cameron:
    • can we set an EOL deadline for the DPM?
    • do we need a TF to get DPM sites to migrate to other storage MW?
  • Maarten:
    • we originally agreed to support the DPM throughout Run 3
    • the port to CentOS 8 was already largely prepared
    • then came the early EOL of CentOS 8
    • CentOS Stream 8 also does not cover Run 3
    • CentOS Stream 9 could be an option, but we have very little effort
    • the best would be for sites to migrate before the end of Run 3
  • Julia:
    • we plan to have several GDB presentations from sites that
      have decided how and to which system they will migrate
    • other sites can take inspiration from such examples
    • we then can see if a TF is needed to chase up those sites
  • David Cameron:
    • DPM sites should check the presentation that was given by
      DPM developer Petr Vokac on Oct 7 in the ATLAS S&C Week

CMS

  • Rather good CPU usage, 320k on average
    • Significant contributions from HPCs (several 10k cores)
  • Main Campaigns
    • Finishing Legacy Reprocessing of Run-2 MC & Data
    • Preparations for Run-3 productions
    • Reprocessing of parked B-sample
      • Requires staging of RAW files at CERN, several PB
  • Update of Oracle DB with no issues (some services in scheduled downtime, of course)
  • Successfully tested HTCondor driven submission on GPU equipped nodes
  • The Brazilian root CA certificate update was a reminder that Singularity images based on obsolete base images require extra maintenance and/or good planning. CMS plans to switch to using root CA certificates and VOMS information from CVMFS.
  • WebDAV deployment
    • now at Tier-3 level (deployment at a couple of Tier-2 sites not yet complete)
    • transfer of user analysis jobs (CRAB) output switched to WebDAV where possible
    • 50% of FTS transfers are now WebDAV
  • SAM ETF pre-production uses the VOMS extension from the new Identity and Access Management (IAM) server; sites were checked and informed about issues accepting the extension from IAM

LHCb

  • Smooth running at 140k cores
  • A reprocessing (stripping) of 2017 and 2018 has been completed. 2016 will follow
  • Tape data challenge started on Oct 11

Task Forces and Working Groups

GDPR and WLCG services

Accounting TF

  • NTR

Archival Storage WG

Containers WG

CREAM migration TF

Details here

Summary:

  • 90 tickets
  • 84 done: 39 ARC, 40 HTCondor, 1 both, 1 K8s, 3 none
  • 1 site plans for ARC, 1 is considering it
  • 2 sites have or plan for HTCondor, 1 is considering it
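
The figures in the summary above can be cross-checked with a few lines of arithmetic. The numbers are taken from the summary; the variable names are only illustrative.

```python
# Ticket counts as listed in the summary above.
total_tickets = 90
done_by_target = {"ARC": 39, "HTCondor": 40, "both": 1, "K8s": 1, "none": 3}

# The per-target breakdown should add up to the number of tickets done.
total_done = sum(done_by_target.values())

# Tickets that are not yet done.
open_tickets = total_tickets - total_done

print(f"done: {total_done}, still open: {open_tickets}")
# prints "done: 84, still open: 6"
```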

dCache upgrade TF

DPM upgrade TF

StoRM upgrade TF

Information System Evolution TF

  • A dCache version that fixes the SRR problem is ready. An upgrade campaign for dCache sites will start soon.
  • A new version of the network topology implementation has been deployed to the CRIC development service for evaluation

IPv6 Validation and Deployment TF

Detailed status here.

Monitoring

Network Throughput WG


Traceability WG

Transition to Tokens and Globus Retirement WG

  • A campaign was launched on Sep 6 for sites to deploy additional LSC
    files to support the IAM VOMS endpoints of the LHC experiments
    • 48 out of ~150 tickets still not declared solved
    • A reminder EGI broadcast was sent on Sep 29
  • CMS and ATLAS have started using their new endpoints on limited scales
    • E.g. SAM preprod tests
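
For sites that still need to add the LSC files, the following is a minimal sketch of what such a file contains. It assumes the usual layout of one `<hostname>.lsc` file per VOMS server under a per-VO subdirectory of the `vomsdir`, holding the subject DN of the server certificate on the first line and the DN of its issuing CA on the second. The host name and DNs used in the usage note are placeholders, not the real IAM endpoints; the actual values are given in the campaign tickets.

```python
from pathlib import Path


def write_lsc(vomsdir, vo, host, subject_dn, issuer_dn):
    """Create <vomsdir>/<vo>/<host>.lsc with the two DNs, one per line.

    Returns the path of the file that was written.
    """
    vo_dir = Path(vomsdir) / vo
    vo_dir.mkdir(parents=True, exist_ok=True)
    lsc_path = vo_dir / f"{host}.lsc"
    lsc_path.write_text(f"{subject_dn}\n{issuer_dn}\n", encoding="utf-8")
    return lsc_path
```

A call could look like `write_lsc("/etc/grid-security/vomsdir", "myvo", "voms.example.org", "/DC=org/DC=example/CN=voms.example.org", "/DC=org/DC=example/CN=Example CA")`, where all arguments are hypothetical values to be replaced by those from the ticket.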

Discussion

  • Stephan:
    • was a deadline given to the sites?
  • Maarten:
    • sites were asked to implement the support ASAP in Oct
  • Stephan:
    • not all OSG sites are ready either
    • we will follow up with them

  • Stephan:
    • several services are dependent on VOMS-Admin for grid-mapfiles
      • example: EOS
    • what is the plan for such services to use alternatives?
  • Julia:
    • what other services may be affected?
  • Maarten:
    • these matters have been discussed in the Authorization WG
    • we have catalogued current VOMS-Admin query use cases here
    • we contacted the EOS devs a while ago and will remind them
    • we will also follow up with other relevant services
    • we have to see if VOMS-Admin can be switched off
      for some experiments already by the end of this year

Action list

Creation date | Description | Responsible | Status | Comments

Specific actions for experiments

Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments

Specific actions for sites

Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments

AOB

Topic revision: r10 - 2021-10-18 - MaartenLitmaath