WLCG Operations Coordination Minutes, April 2, 2020

Highlights

Agenda

Attendance

  • local:
  • remote: Aleksandr A (ATLAS), Alessandra D (Napoli), Alessandra F (ATLAS + Manchester), Alexander U (ATLAS), Alexandre Bonvin (Utrecht), Alexei, Andrea (WLCG), Andreas (KIT), Andrew (TRIUMF), Catalin (EGI), Cesare (MPCDF), Christoph (CMS), Concezio (LHCb), Costin (ALICE), Cécile Barbier, Dario (ATLAS), Dave M (FNAL), David B (IN2P3-CC), David Cameron (ATLAS), David Cohen (Technion), David S (ATLAS), Doug (ATLAS), Eric (IN2P3), Federico (LHCb), Felice (CMS), Giuseppe B (CMS), Giuseppe La Rocca (EGI), Ivan (ATLAS), James (ATLAS), Jeny (FNAL), Johannes (ATLAS), Julia (WLCG), Liz (FNAL + CMS), Maarten (ALICE + WLCG), Marco (Padova), Marian (monitoring + networks), Matt D (Lancaster), Matt V (EGI), Nicolo (ATLAS), Pepe (PIC), Peter (ATLAS), Petr (Prague + ATLAS), Renato (LHCb + CBPF + ROC_LA), Ricardo (SAMPA), Riccardo (WLCG), Rod (ATLAS), Ron (NLT1), Shawn (MWT2 + ATLAS), Stefano (CNAF), Stephan (CMS), Thomas (DESY), Torsten (Wuppertal), Victor (CMS), Vincent (security)
  • apologies:

Operations News

  • the next meeting is planned for May 7
    • please let us know if that date would pose a major inconvenience

Special topics

COVID-19 impact on WLCG operations

WLCG computing resources for COVID-19 research

Note: FH denotes Folding at Home

  • Federico:
    • not convinced running FH would be the best approach, other initiatives might be better
    • would we run some of it alongside LHCb workloads?
    • also depends on the perspective of sites
    • we can furnish our expertise in running workloads across the grid

  • James:
    • sites can directly contribute to other initiatives
    • FH is easy to integrate into our workflows
    • experiments could direct such jobs to sites that agree

  • Thomas:
    • are we sure there will be enough work to run?
    • Rosetta at Home did not have enough so far
    • the FH client is incompatible with other BOINC work!

  • Federico: as we cannot know how much work is queued, pilots may just die

  • Andreas:
    • we should provide a list of running projects
    • sites can then pick one before the experiments try to do something themselves
    • KIT already doing that for resources above the pledge

  • James:
    • there are docs in a number of places
    • the CERN task force has concluded that FH would be the best option so far

  • David S: we should not just run what is possible, it has to be useful

  • Julia:
    • in principle we could even run a service creating such jobs
    • the usefulness of that is not known today

  • Dave M:
    • we would need to interact with experts of those domains
    • OSG and EGI are also running initiatives
    • FNAL is already involved there

  • Federico:
    • EGI are e.g. already running WeNMR (see the presentation)
    • WLCG lacks expertise in those areas

  • Matt V:
    • Alexandre Bonvin will talk about WeNMR
    • EGI will have a call with OSG and come back to WLCG

  • David Cohen:
    • sites will need to know what resource numbers we are talking about
    • they may need to get agreement from funding agencies

  • Julia: indeed, and we should find the most effective contributions

  • Pepe:
    • resources are to be used for official purposes
    • there is more flexibility for amortized and other HW beyond pledge

  • Liz:
    • different countries and funding agencies will have different policies
    • sites should talk to their funding agencies

  • Alessandra F: WLCG cannot enforce anything

  • Christoph:
    • what sites do with resources beyond pledge is their decision
    • for running jobs in question on pledged resources we would need to know:
      • what fraction?
      • which application(s)?
      • through which channel(s)?

  • Alessandra F:
    • the best application is currently unknown
    • here we want to decide what we can do using the experiment infrastructures
    • and avoid unnecessary duplication of efforts

  • Costin: an experiment can reach all its sites

  • Federico:
    • it is not for us to operate the application(s)
    • biomed people should do that

  • Alessandra F: some interaction with people from WHO etc. might be needed

  • James:
    • the CERN task force are doing that
    • for now, FH was the only concrete proposal

  • Costin: in order not to waste effort, can we go ahead?

  • Maarten:
    • we have to be careful there
    • small-scale proofs of concept are OK at this stage
    • bigger activities could e.g. lead to issues between sites and funding agencies
    • we do not have a full plan at this time

  • Johannes:
    • in the experiments we can control the scale of these activities
    • and we could already use unpledged resources like the online farm

  • Christoph:
    • experiments cannot control the use of unpledged resources at sites
    • several sites are already using unpledged resources for related purposes

  • David S:
    • we can come to a suggestion for how to run things
    • and avoid unnecessary duplication

  • Dave M: WLCG can do the communication part

  • Julia: we will follow up in our own task force

Follow up comments after the meeting

Simone Campana could not join the meeting because of overlap with another meeting he had to attend. There are a few follow-up comments/clarifications from Simone:

  • Why FH? As James Catmore pointed out, WLCG picked FH because the Fight-against-COVID TF at CERN recommended it, while further discussions there are ongoing.
  • Concerns of the sites if they use resources allocated to LHC for COVID-19 research. At the moment we agreed to do this at Citizen Science level, again as recommended by the TF, i.e. a few thousand cores. Even at 10k cores, this is 1% of WLCG, so we do not expect an impact on WLCG activities, considering also that normally there is a 20% beyond-pledge capacity the experiments benefit from. I will mention this activity at the next RRB and ask the Funding Agencies for feedback. The situation is different if a site or a country decides to dedicate a large fraction of its resources to some initiative. We allow flexibility there, but that site or group of sites should document it and explain it to its funding agency.

EGI initiatives. HADDOCK application.

presentation

  • Alexandre:
    • we are talking to OSG to see if our jobs can run there as well
    • our computing model has been opportunistic so far
    • sites decide if they want to support us e.g. for backfilling
    • the work volume depends on the user activity
    • it also is limited by the scalability of the portal(s)

  • James: have you contacted the CERN task force?

  • Alexandre:
    • not yet
    • at this time we are not limited by computing resources
    • we can flag jobs that are related to COVID-19 research

  • Andreas: whom to approach for such jobs?

  • Alexandre:
    • first enable the enmr.eu VO on your resources
    • we do not depend on CVMFS today, as we found it unreliable at several sites
    • instead, our jobs bring their payload of 1 to 20 MB in their input sandbox
    • the job output is typically around 5 to 20 MB
    • jobs have typically been short
    • through DIRAC we can make them longer with larger outputs
    • each site supporting these jobs will need to be enabled in DIRAC
    • if desired, the site can be tagged to receive only jobs related to COVID-19

  • Julia: in the meeting between EGI and OSG, is there a WLCG representative?

  • Matt V: not yet, but we will follow up on that

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Mostly business as usual so far, despite COVID-19 measures everywhere!
  • Thanks very much to the site admins!
  • Current emphasis is on data analysis, which requires little additional disk space.
  • Productions that need a lot of disk space are postponed until pledges are available.

ATLAS

  • no COVID-19 related problems so far
  • Smooth and stable Grid production with ~430k concurrently running grid job slots with the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis, including ~90k slots from the CERN-P1 farm. Occasional additional bursts of ~100k jobs from NERSC/Cori.
  • Finishing the RAW/DRAW reprocessing campaign in data/tape carousel mode with data15 within the next week.
  • No other major issues apart from the usual storage or transfer related problems at sites.
  • Feedback on APEL accounting question: keep it simple !
  • Grand unification of PanDA queues ongoing, and non-gridFTP TPC tests in production
  • Feedback on Google CA bundle for TPC to GCS: CloudStorageIntegration - will move ahead with it.
  • Would like to raise criticality of services for CEPH and DBoD to 8,9

CMS

  • no COVID-19 related interruptions of the CMS computing infrastructure so far
  • jumbo frame issue at CERN impacting several sites, INC:2355684
    • after network maintenance, March 11th, OTG:0054668
    • we expected this to be corrected quickly; does anybody know what the issue is?
  • running at about 250k cores during last month
    • usual production/analysis mix (80%/20%)
    • ultra-legacy re-reconstruction of 2016 in validation
    • Run 2 Monte Carlo production is largest activity, large batch of Phase-2 events delivered

Discussion

  • Maarten:
    • the jumbo frame issue does not appear to be easy to resolve
    • the ticket is currently waiting for input from the affected site
  • Liz:
    • an issue with jumbo frames already hit us in the middle of Run 2
    • this could be wider than 1 site
  • Stephan:
    • at the moment this is not a big problem, affecting a limited area of work
    • we would like to have a solution, even if it implies changes on our side
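
As a side note on diagnosing such problems: a common check for whether jumbo frames survive a given path is a do-not-fragment ping sized to the jumbo MTU. The sketch below only works out the payload size; the 9000-byte jumbo MTU, IPv4 header without options, and Linux ping syntax are standard assumptions, not details from the minutes.

```python
# Compute the ICMP payload size for a do-not-fragment ping that exactly
# fills a jumbo frame, and print the corresponding diagnostic command.
# 9000 is the usual jumbo MTU; adjust if a site uses another value.

MTU = 9000
IP_HEADER = 20    # IPv4 header, no options
ICMP_HEADER = 8   # ICMP echo header

payload = MTU - IP_HEADER - ICMP_HEADER
print(payload)                                    # 8972
print(f"ping -M do -s {payload} <remote-host>")   # Linux ping syntax
```

If such a ping fails while a 1472-byte payload (standard 1500 MTU) works, some hop on the path is not passing jumbo frames.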

LHCb

see here

Task Forces and Working Groups

GDPR and WLCG services

  • Updated list of services
  • Detailed discussion of how to enable the privacy notice for all our services has been postponed. We will have a dedicated meeting with experiment contacts, most probably next Thursday

Accounting TF

  • T1 reports generated by CRIC were sent around for validation; T2 reports for March will be sent

Archival Storage WG

Containers WG

CREAM migration TF

Details here

Summary:

  • 90 tickets
  • 5 done: 2 ARC, 3 HTCondor
  • 18 sites plan for ARC, 12 are considering it
  • 22 sites plan for HTCondor, 14 are considering it, 7 consider using SIMPLE
  • 15 tickets on hold, to be continued in a number of months
  • 14 tickets without reply
    • response times possibly affected by COVID-19 measures

dCache upgrade TF

  • 34 sites are running versions > 5.2.0

http://wlcg-cric.cern.ch/core/service/list/?type=se&show_5=0&show_6=1&state=ACTIVE&impl=dcache&version=5.

  • 9 to go; some of them had planned an upgrade but postponed it due to COVID-19
  • 2 plan to move to DPM
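
The "> 5.2.0" criterion used by the TF is a plain dotted-version comparison; a minimal sketch (hypothetical helper, threshold taken from the TF status above):

```python
# Check a dotted dCache version string against the upgrade TF target (5.2.0).
# Hypothetical helper; assumes plain numeric x.y.z version strings.

TARGET = (5, 2, 0)

def above_target(version: str) -> bool:
    """Return True if the version is strictly above 5.2.0."""
    return tuple(int(p) for p in version.split(".")) > TARGET

print(above_target("5.2.14"))  # True
print(above_target("4.6.1"))   # False
```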

Discussion

  • Maarten: nowadays it does not seem like a good idea for sites to move to DPM

  • Julia: we will follow up with them

  • Stephan:
    • one of those sites is a CMS site that already had a DPM
    • they want to consolidate their grid storage into just one system

DPM upgrade TF

  • 34 sites upgraded and reconfigured with DOME

http://wlcg-cric.cern.ch/core/service/list/?type=se&show_5=0&show_6=1&state=ACTIVE&impl=dpm&version=DOME&show_11=0&show_18=0

Out of those, 15 are running 1.13.2 with DOME

  • 6 upgraded but without DOME yet; they are working on it
  • 1 to upgrade and re-configure, in progress
  • 1 site is suspended for operations
  • 9 moving away from DPM

Information System Evolution TF

  • REBUS has been in read-only mode since the beginning of April. Pages for editing information have been redirected to CRIC
  • Thanks a lot to Federico for providing an API from DIRAC for the LHCb topology information. It will be used by CRIC and Storage Space Accounting

IPv6 Validation and Deployment TF

Detailed status here.

Machine/Job Features TF

Monitoring

MW Readiness WG

Network Throughput WG


Traceability WG

Action list

Creation date Description Responsible Status Comments

Specific actions for experiments

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

Specific actions for sites

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

AOB

Topic revision: r27 - 2020-04-06 - LorneL