WLCG Operations Coordination Minutes, June 4, 2020

Highlights

Agenda

https://indico.cern.ch/event/924764/

Attendance

  • local:
  • remote: Alberto (monitoring), Alessandra (Napoli), Andrew (TRIUMF), Borja (monitoring), Christoph (CMS), Concezio (LHCb), Dave M (FNAL), David B (IN2P3-CC), David C (Technion), Eric F (IN2P3), Eric G (databases), Eva (databases), Federico (LHCb), Henryk (NCBJ-CIS), Johannes (ATLAS), Julia (WLCG), Maarten (ALICE + WLCG), Nikolay (monitoring), Pedro (monitoring), Pepe (PIC), Renato (LHCb)
  • apologies:

Operations News

  • the next meeting is planned for July 2

Special topics

CERN ORACLE outage followup

DB post-mortem 27 May 2020

Discussion

  • Maarten: the DBs are currently "at risk" - when do you expect to resolve that?

  • Eva:
    • we first want to understand what happened, in collaboration with NetApp
    • we then can schedule an intervention for which everything may need to be stopped
    • we aim to resolve this matter in the next few weeks

  • Julia: do the experiments want to raise particular issues from their reports?

  • Christoph: why did the PhEDEx agent configuration need to change?

  • Eva:
    • the primary DB for PhEDEx suffered a lost write
    • we decided it was safest to switch to the standby DB instead
    • that DB is on a different cluster that is not open to the world
    • we expected to have time to work with the PhEDEx experts to prepare for that,
      but the outage forced us to make the switch earlier than planned
    • the setup will stay like that from now on

  • Christoph:
    • good, we will not need to reconfigure PhEDEx yet again
    • we intend to retire it by autumn

WLCG Critical Services review followup

see the presentation

Discussion

  • Julia: what do the experiments think?

  • Johannes: looks OK for ATLAS

  • Christoph: probably also for CMS, to be checked with Stephan

  • Concezio: looks OK for LHCb

  • Dave M:
    • there are potential issues with the granularity of the urgency levels:
      • in practice there may hardly be a distinction between 1 and 2 days
      • a weekend actually lasts longer than 48h

  • Maarten:
    • good points!
    • we can still change those levels and their definitions
    • these proposals were just to get the ball rolling

  • Dave M:
    • might changes in the timing categories conflict with the MoU?
    • would this be another reason for the MoU to be revised?

  • Julia: the MoU is concerned with response times, not resolution times

  • Maarten:
    • the MoU was created when we had no experience with operations at our current scale
    • we have known for years that it ought to be updated to reflect today's reality,
      but that may not happen any time soon
    • meanwhile we can be practical and update our tables now
    • mind that those tables do not promise anything
    • they mostly serve to identify which services deserve the most attention:
      to have them set up properly and try to limit the effects of incidents
    • that also ties in with the work of the Business Continuity WG in CERN-IT

  • Julia:
    • we will ask concerned parties to provide further feedback
    • we then can decide in our next meeting

SAM questionnaire

see the "Minutes" note and the Migration document

Discussion

  • Maarten:
    • it seems better to send the new reports first to the experiments only:
      • there are too many significant differences
      • the new UI has usability issues and it may thus be hard to understand those differences
      • we do not want to deal with a storm of messages from concerned sites

  • some discussion followed

  • Concezio:
    • the set of LHCb tests is being revised
    • some of them may no longer apply to some sites

  • Renato: first let the experiments check the results

  • Maarten: I only get NDGF when I ask for the ALICE T1 sites in the new UI

  • Borja: the UI works for us, we have used it to analyze the differences

  • Johannes: the problem could be dependent on the browser

  • Maarten:
    • the UI has to work in my browser
    • if I can run into such a problem, so can any site admin

  • Pedro, Borja, Nikolay:
    • we will send the new reports only to the experiments for now
    • in the next 2 weeks we will work with the experiments to see
      what needs to be fixed before the sites are involved
    • the document attached to the agenda should explain the differences

  • Maarten:
    • ATLAS and CMS have listed quite a number of issues encountered in the UI
    • the MONIT team may not be able to fix all of them within the proposed timeline
    • can the team already foresee running the old infrastructure a while longer?

  • Borja:
    • we can run the old infrastructure longer if needed
    • issues with the new infrastructure can be followed up through SNow tickets

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Mostly business as usual
  • CERN DB outage had very little impact; there were no complaints
  • 10 T2/T3 sites are excluded from F@H jobs as of Sunday afternoon,
    because such jobs have started failing a lot at those sites,
    while normal ALICE jobs run fine there

ATLAS

  • Stable Grid production with up to ~400k concurrently running grid job slots with the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis, including ~40-90k slots from the HLT/Sim@CERN-P1 farm and ~15k slots from Boinc. Occasional additional peaks of 200k job slots from HPCs.
  • About 60k job slots used for Folding@Home jobs since 4 April. 50% from ~55 different grid sites via opt-in and 50% at CERN-P1
  • DB and related problems (27 May and later):
    • Problems with DBoD during that day affected Harvester grid job submission (OTG0056733)
    • Rucio and PanDA (and all other CERN hosted DBs) affected by Oracle storage troubles (OTG0056746)
    • Rucio recovered quickly in the evening after some manual service restarts, which were probably not necessary
    • PanDA recovered quickly in the evening after some manual service restarts
    • All ATLAS databases are online/available at CERN and ATLAS DB Applications are back to normal. No data loss as far as we can tell.
    • The ATLR DB distribution to the Tier 1s was broken until Tuesday afternoon; fixed by CERN IT in the meantime.
    • Grid jobs had been running with Frontier/Squid conditions data access in CERN-only mode; Lyon and TRIUMF access was switched off 22-27 May.
    • Grid jobs are partially still affected by Frontier/Squid troubles. The Oracle troubles triggered a bug in Frontier launchpads and squid version 4.11-2.1. Frontier/Squid experts advised sites to restart site squids, but the Frontier launchpads also require restarts. Working on a fix.
  • No other major issues apart from the usual storage or transfer related problems at sites.
  • Moved all data out of Glasgow and Milano due to on-going storage troubles.
  • Grand unification of production+analysis queues in PanDA on-going - about 80% done.
  • Test of TPC in production on-going as discussed in WLCG DOMA context
  • DRAW_RPVLL reprocessing of run 2 data using Data Carousel mode now concluded. Next use of DC is to produce one year of our new DAOD_PHYS+PHYSLITE format, with the AOD input staged from tape

CMS

  • COVID-19 Compute
    • Full HLT for Folding@Home (~60k cores with HT enabled)
    • On the order of 2% of pledged cores, excluding some regions (IT, ES, US, which have their own F@H efforts)
  • Main processing activities
    • re-NanoAOD campaign(s)
  • Database outage on May 27th
    • Major impact on PhEDEx (transfer system) and CRAB (user submission tool for the Grid)
      • Both depend on Oracle
    • PhEDEx agents at sites needed patches due to changes in the Oracle setup
      • About a dozen sites received the patch instructions via GGUS (all issues solved now)
    • Indirect dependence on DBoD via VOMS services for CMS
      • For a certain period of time, no fresh proxies could be created
  • Jumbo frame issue (INC:2355684) still being worked on (with rather high activity in the recent days!)
  • A CA certificate expired in a Singularity container that is launched by CMS pilots
    • Uncovered a few weak points in our procedures for maintaining these containers
    • CMS is looking into improving the management of those containers
  • Migration to Rucio for data management
    • As of June, CMS handles the NanoAOD data tier via Rucio
    • No changes required by sites
      • A few sites received GGUS tickets to properly enable the Rucio proxy at their site

Discussion

  • Maarten: containers could get the CAs from CVMFS, e.g. under grid.cern.ch (see the sketch after this discussion)

  • Christoph: to be discussed further with the container experts in CMS

  • Dave M:
    • IIRC, in OSG the recommendation appeared to go in the opposite direction:
      rely on the site to provide the CAs if possible

  • Maarten:
    • that imposes a requirement on the sites that is not really necessary
    • and there have been plenty of issues with CAs and/or CRLs at sites
    • LHCb have been running fine with CAs and CRLs from grid.cern.ch for a few years
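
A minimal sketch of the CVMFS idea discussed above, assuming the conventional grid.cern.ch layout: a pilot wrapper points grid clients inside the container at the CA directory published on CVMFS via X509_CERT_DIR, so no CA bundle needs to be baked into (and expire inside) the container image. The CVMFS path, image name and helper function below are illustrative assumptions, not the CMS implementation.

    import os
    import subprocess

    # CA certificates and CRLs published on CVMFS (path assumed here;
    # check the actual layout of the grid.cern.ch repository)
    CVMFS_CERT_DIR = "/cvmfs/grid.cern.ch/etc/grid-security/certificates"

    def run_in_container(command):
        """Hypothetical helper: run a command inside the pilot's
        Singularity container, preferring the CVMFS-provided CAs."""
        env = dict(os.environ)
        if os.path.isdir(CVMFS_CERT_DIR):
            # Most grid clients honour X509_CERT_DIR, so the container
            # image does not need its own (potentially expiring) CA bundle
            env["X509_CERT_DIR"] = CVMFS_CERT_DIR
        return subprocess.run(
            ["singularity", "exec", "--bind", "/cvmfs",
             "pilot_image.sif"] + command,
            env=env,
            check=True,
        )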

LHCb

  • DB related problems:
    • LHCbDIRAC Bookkeeping affected by Oracle storage troubles (OTG0056746)
    • The VOMS DB being inaccessible also impacted LHCbDIRAC Web Portal operations and user logins
    • All services recovered quickly in the evening without the need for manual interventions
    • No data loss as far as we can tell
  • Otherwise business as usual
  • Running user jobs and the usual MC / working group productions
  • No major issues to report concerning sites

Task Forces and Working Groups

GDPR and WLCG services

  • Updated list of services
  • The WLCG MB agreed that for experiment and central WLCG services hosted by CERN, the WLCG Privacy Notice would be published only after the corresponding CERN RoPO (Record of Processing Operations) has been approved by CERN

Accounting TF

  • The T1 and T2 accounting reports generated by CRIC become official. The May accounting reports will be generated ONLY by CRIC; EGI accounting report generation will be disabled
  • REBUS UI is being redirected to CRIC.
  • Accounting validation UI in CRIC has been improved following feedback received from the sites

Archival Storage WG

Containers WG

CREAM migration TF

Details here

Summary:

  • 90 tickets
  • 8 done: 4 ARC, 4 HTCondor
  • 18 sites plan for ARC, 15 are considering it
  • 23 sites plan for HTCondor, 15 are considering it, 8 consider using SIMPLE
  • 14 tickets on hold, to be continued in the coming weeks / months
  • 7 tickets without reply
    • response times possibly affected by COVID-19 measures

dCache upgrade TF

  • NTR

DPM upgrade TF

  • For Third Party Copy (TPC), sites need to upgrade to version 1.14, which is not yet released. When it is released, the task force will do another round of upgrades.

StoRM upgrade TF

  • Just started. More info
  • Initial status
    • According to CRIC there are 24 sites which run StoRM and are used by the WLCG VOs. The number could be higher, since for some sites CRIC lacks information about the storage implementation and the version the sites are running.
    • 5 sites are currently running StoRM instances with version 1.11.15 and 2 are running 1.11.17
    • There are 3 sites with unknown StoRM version in CRIC and BDII
    • 13 sites need to upgrade

Information System Evolution TF

  • To provide or edit pledges, the CRIC UI should be used instead of the REBUS one. The REBUS UI is being disabled.

Discussion

  • Concezio: we now have a US T2 that wants to pledge resources for LHCb - can they use CRIC?

  • Julia: yes, the MIT pledge for LHCb is already there and has been approved by the WLCG Project Office; the LHCb consumption at MIT should show up in the accounting report for May

IPv6 Validation and Deployment TF

Detailed status here.

Machine/Job Features TF

Monitoring

MW Readiness WG

Network Throughput WG


Traceability WG

Action list

Creation date Description Responsible Status Comments

Specific actions for experiments

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

Specific actions for sites

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

AOB
