WLCG Operations Coordination Minutes, June 24, 2021

Highlights

Agenda

https://indico.cern.ch/event/1050924/

Attendance

  • local:
  • remote: Andrew (TRIUMF), Chien-De (ASGC), Christoph (CMS), Concezio (LHCb), Dave M (FNAL), David Cameron (ATLAS), David Cohen (Technion), Felix (ASGC), Giuseppe (CMS), Julia (WLCG), Maarten (ALICE + WLCG), Manuel (computing), Matt D (Lancaster), Matt V (EGI), Natalie (CERN IT Service Management), Nikolay (monitoring), Panos (WLCG), Pinja (security), Stephan (CMS)
  • apologies:

Operations News

  • the next meeting will be on Sep 2

Special topics

Support workflows for critical WLCG services at CERN

see the presentation

Discussion

  • David Cameron:
    • the proposals look OK, will check with colleagues

  • Stephan:
    • the connection from tickets to support lines could be streamlined further
    • if GGUS allowed a subcategory to be selected, it could be mapped directly
      to the relevant support unit in SNow

  • Maarten:
    • so far we have not discussed any changes on the GGUS side and we will need
      to be careful with what we ask of that small team, who should rather be
      working on the new system that is expected to replace their current setup
    • they would either need to do a special case for CERN, which is not nice,
      or be prepared for other sites to ask for such things as well, even worse

  • Julia:
    • last week we discussed essentially splitting ROC_CERN into multiple units,
      each dedicated to a class of services, e.g. one for storage issues, etc.
    • that idea was not supported, because it is deemed beneficial for all
      potentially relevant experts to become aware of ongoing incidents,
      because there are interdependencies between services
    • as the amount of such tickets generally is low, there was no worry that
      experts might become overloaded with irrelevant notifications
    • this is how we have been working successfully for many years already

  • Maarten:
    • we can ask the GGUS team to have CMS tickets for ROC_CERN assigned
      directly to the 3rd line support, as is done for team tickets
    • there generally are very few such CMS tickets per week and they anyway
      get reassigned to 3rd line support by the IT helpdesk
    • it would allow an unnecessary delay to be avoided
    • we know that for CMS it should be safe to do so, whereas we probably
      do not want to do that for all user tickets

  • Natalie:
    • for SNow tickets to target service experts directly, we can create a
      special web form that is prefilled to some extent, asking the user
      just to provide a few details and then it can be submitted

  • Maarten:
    • for our critical services we will have convenient SNow mechanisms,
      but what about other services?
    • for example, when such a service is down, I would not want to be asked
      by the helpdesk what OS my laptop is running...

  • Natalie:
    • in general, please use the feedback mechanism to point out where the
      followed procedures were inadequate in some way
    • the helpdesk only do what they have been asked to by the service managers

  • Stephan:
    • we generally have a good user experience with GGUS, whereas for SNow
      it has not always been easy to find the correct support unit in the portal
      • we sometimes needed to do a Google search to find the group that
        runs a particular service and then open a ticket via their web pages

  • Natalie:
    • the service portal is a generic catch-all for any service at CERN
    • the web pages for a given service may offer targeted SNow web forms

  • Julia:
    • there were other concerns brought up as well
    • for example, incidents are not always properly published on the IT Service Status Board
    • we propose collecting feedback, if any, from the experiments and
      reporting it at the IT Department's weekly C5 meeting when needed

  • Natalie:
    • the IT-experiment meetings might be better to discuss such matters

  • Julia:
    • those meetings are very useful indeed, but only occur on a monthly basis
    • the weekly C5 meetings would be more timely for incidents that occurred
    • we will see which meeting can serve which purpose

  • Stephan:
    • we would still like to be able to open a GGUS ticket e.g. for the FTS team
      and have it go directly to the FTS experts

  • Maarten:
    • it would imply changes both on the GGUS side and on the SNow side
    • those changes may be expensive and, as explained above, possibly
      undesirable for various reasons

  • Julia:
    • we will study that further and come back to it

  • Concezio:
    • for LHCb the proposed changes look OK, will check with colleagues

Middleware News

  • Useful Links
  • Baselines/News
    • Reminder: please upgrade your ARC CEs to >= 6.12.0 for accounting and privacy reasons!
      • Will be followed up via a joint EGI-WLCG campaign this summer

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Mostly business as usual, normal activity levels

ATLAS

  • Smooth running on up to 700k cores, including up to 250k from Vega HPC
  • HTTP-TPC:
    • 85% of T1/2 sites done
    • Rest are mainly waiting on new xrootd version deployment and testing
  • Two successful Run-3 scale tests, one this week and one two weeks ago: 8GB/s P1->EOS->CTA & T1s
  • The dCache default transfer timeout is 2 hours. ATLAS has a lower limit of 0.5MB/s on the transfer rate, which means that transfers of files bigger than roughly 4GB (about 3.6GB at exactly that rate) may be killed by the storage. We would like the storage timeout to be increased
  • ARC 6.12.0 fixes issue with HTCondor batch system 9.0.0
  • ADCR database upgrade to Oracle 19 scheduled for 27 September - most ADC services will be down for up to 8 hours

Discussion

  • the dCache default timeout will be checked with the developers
  • at least the most affected sites would need to adjust their configurations
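
The file-size threshold quoted in the ATLAS report follows from simple arithmetic, sketched below. The 2-hour timeout and the 0.5MB/s lower limit come from the report; the use of decimal units (1GB = 1000MB), the function names and the 10GB example file are assumptions made only for illustration.

```python
# Worked arithmetic for the transfer timeout concern in the ATLAS report.
# From the minutes: 2-hour dCache default timeout, 0.5 MB/s lower rate limit.
# Assumption (not stated in the minutes): decimal units, 1 GB = 1000 MB.

TIMEOUT_S = 2 * 60 * 60    # dCache default transfer timeout in seconds
MIN_RATE_MB_S = 0.5        # ATLAS lower limit on the transfer rate


def max_size_gb(timeout_s, rate_mb_s):
    """Largest file size (GB) that can be moved within timeout_s at rate_mb_s."""
    return timeout_s * rate_mb_s / 1000.0


def required_timeout_s(size_gb, rate_mb_s):
    """Timeout (seconds) needed to move a file of size_gb at rate_mb_s."""
    return size_gb * 1000.0 / rate_mb_s


# Size above which a transfer at the minimum rate exceeds the default timeout.
print(max_size_gb(TIMEOUT_S, MIN_RATE_MB_S))  # 3.6

# Timeout in hours that a hypothetical 10 GB file would need at the minimum rate.
print(required_timeout_s(10, MIN_RATE_MB_S) / 3600)
```

At exactly the minimum rate the limit is 3.6GB, consistent with the roughly 4GB quoted in the report. Any site-specific timeout would have to be chosen from the largest file size expected there, which the minutes do not specify.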

CMS

  • running smoothly at around 340k cores
    • usual production/analysis split of 3:1
    • main production activity Run 2 ultra-legacy Monte Carlo
    • HPC allocations contributing between 10k and 40k cores
  • new disk pledges are slowly filling; thanks to sites providing additional resources for 2021!
  • WebDAV storage endpoint setup and commissioning in progress
    • slower than anticipated but steady progress
    • in production use at 34 Tier-1,2 sites
  • slow local/remote data access investigation continues at RAL

LHCb

  • smooth running at ~120k cores
  • reprocessing (stripping) of 2017 data to start soon, 2018 and 2016 will follow
  • throughput tests P8 --> EOS/CTA on June 22nd, nominal values reached and sustained for one day
  • some progress on the RAL tickets (xroot vector reads, slow checksumming), but the issues are still not solved

Task Forces and Working Groups

GDPR and WLCG services

Accounting TF

  • NTR

Archival Storage WG

Containers WG

CREAM migration TF

Details here

Summary:

  • 90 tickets
  • 79 done: 37 ARC, 38 HTCondor, 1 both, 1 K8s, 2 none
  • 3 sites have or plan for ARC, 2 are considering it
  • 5 sites have or plan for HTCondor, 1 is considering it, 2 consider using SIMPLE

dCache upgrade TF

  • The upgrade campaign is done. Another campaign will need to be started to enable SRR once a dCache release with proper SRR generation is out

DPM upgrade TF

  • 3 sites to go

StoRM upgrade TF

  • 4 sites still to go

Information System Evolution TF

  • NTR

IPv6 Validation and Deployment TF

Detailed status here.

Monitoring

  • Julia:
    • a meeting was held yesterday to discuss Xrootd monitoring improvements
    • many stakeholders were represented to discuss issues, ideas and plans
    • the minutes of that meeting provide a comprehensive summary

Network Throughput WG


Traceability WG

Transition to Tokens and Globus Retirement WG

  • a very successful CE & Factory token hackathon was held June 3-4
    • please check out the materials if you are interested
  • the token transition milestone document backlog has also been much reduced in recent weeks
  • further details in WLCGTokensGlobusWG

Action list

Creation date | Description | Responsible | Status | Comments

Specific actions for experiments

Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments

Specific actions for sites

Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments

AOB

-- JuliaAndreeva - 2021-06-18

Topic revision: r9 - 2021-06-25 - MaartenLitmaath
 