Contribution of the WLCG resources for COVID-19 research

Introduction

Various initiatives were started to help computational tasks required for COVID-19 research run on the WLCG infrastructure. Folding@Home and Rosetta@Home are the applications that have already been tried on the WLCG infrastructure. Moreover, the "Fight against COVID" task force at CERN indicated that F@H is an application WLCG might consider. WLCG Operations Coordination put together a group of community experts to provide support and coordination for this activity.

This activity consists of:

  • collecting information about various national initiatives
  • integrating experiment job submission frameworks with F@H applications (and possibly others), testing submission, understanding constraints, ramping up
  • coordinating with sites and funding agencies regarding possible fraction of the resources involved in this activity
  • providing necessary technical documentation and support for the sites willing to participate

More info

Information about national initiatives

Italy (CNAF Tier-1 and the Tier-2s in Pisa, Rome, Bari, Legnaro, Frascati, Napoli and Milano)
  • Application run on the infrastructure: a custom application provided by an INFN spin-off operating in the field of drug design. It comes as a compiled executable, expected to run for ~11 days on 32 threads.
  • Job submission framework: the LHC experiment frameworks could not be used, since no WLCG queue allows 32 threads and 10-15 days of execution time.
  • Resource type: mainly CPU, though GPU could be used.
  • Specific requirements: AVX2 is needed. The compiled executable needs CC7 and gcc9; solved by using gcc 9.2 from the CMS environment on CVMFS. On older (SL6) nodes it was run via a Singularity container.
  • Scale of the activity: topped at 30k CPU cores (so close to 1000 32-thread jobs). The assumption was to stay that busy for ~2 weeks, but the jobs were already close to done after 8-10 days.
  • Possible issues: the variability between sites (SL6 vs CC7, HTCondor vs LSF, shared POSIX disk vs dCache/DPM, ...) was the biggest obstacle.
  • Contacts: Tommaso.Boccali@cern.ch, luca.dellagnello@cnaf.infn.it

UK
  • Application run on the infrastructure: Folding@Home (https://stats.foldingathome.org/team/246309). The "Ferguson code" from IC has also been installed, tested and debugged, and other code from the national RAMP initiative has been containerised and tested. Working with F@H to develop Rucio usage and test data transfers. (A sketch of querying the public team statistics follows this table.)
  • Job submission framework: DIRAC, direct submission, and the ATLAS COVID work.
  • Resource type: CPU + GPUs, both local and as part of ATLAS submission.
  • Scale of the activity: 5k cores (3% in terms of HS06 in May) plus some fraction of the ATLAS workload running F@H.
  • Possible issues: talked with quite a number of different projects, but only F@H was really ready and able to use large-scale resources.
  • Contact: Alessandra.Forti@cern.ch

CERN
  • Application run on the infrastructure: F@H with a little Rosetta@Home.
  • Job submission framework: none, static deployment.
  • Resource type: old hardware, due to be retired soon.
  • Scale of the activity: 8192 cores.
  • Possible issues: limited external connectivity.
  • Contact: tim.bell@cern.ch

France (Tier-1)
  • Application run on the infrastructure: application used by the Laboratory for Therapeutic Innovation (LIT), which operates in the field of drug design.
  • Job submission framework: submission to the batch system directly.
  • Resource type: CPU.
  • Specific requirements: this community has been using the CC-IN2P3 computing resources for many years, but requested a large increase of the available CPU.
  • Scale of the activity: 10k slots for some weeks.
  • Possible issues: ramping up from a few jobs submitted per day to many thousands (?).
  • Comments: 10k slots were requested at CC-IN2P3, while at the same time 30k slots were requested at another French computing centre.
  • Contact: eric.fede@cc.in2p3.fr

Spain
  • Comments: PIC is running F@H on GPUs, installed and running. The UAM Tier-2 is running Rosetta@Home. We have asked our funding agency and it is OK to run these initiatives as background on national WLCG resources, following the recent WLCG MB request. We could accept these types of workloads sent centrally by the LHC VOs. Last update: 8/4/2020.
  • Contact: jflix@pic.es

NDGF
  • Application run on the infrastructure: custom application for population spread.
  • Job submission framework: direct submission to the batch system.
  • Resource type: CPU, MPI.
  • Scale of the activity: 1-10M core-hours.
  • Possible issues: turning proof-of-concept Python into scalable C; the O(n^2) scaling with population size is hard when doing full-country population studies.
  • Comments: running on the underlying HPC resources in Sweden with higher priority than other communities (including WLCG).
  • Contact: maswan@ndgf.org

Canada
  • TRIUMF: the Tier-1 is running F@H inside Docker containers on compute nodes that are out of warranty (CPU). ~500 cores, able to scale up to 4500 cores if the F@H client can get work units; most of the time the client could not get work units to run. TRIUMF deployed the F@H suite on all of the SUN blades, equivalent to about ~500 cores, as part of the TRIUMF_CANADA team. The goal is to ramp up slowly to a few kcores using F@H and possibly also Rosetta@Home. Contact: dqing@triumf.ca
  • Compute Canada: running F@H in VMs on the Arbutus cloud (which also hosts CA-VICTORIA-WESTGRID-T2), as independent OpenStack VMs running FAHClient, managed by terrafold (Ansible and Terraform). CPU and GPU: ~10k vCPUs and 300 vGPUs. The VMs are consistently fully occupied with available work units. Contact: rptaylor@uvic.ca

Germany (Tier-1, GridKa)
  • Application run on the infrastructure: Folding@Home, Rosetta@Home and WeNMR.
  • Job submission framework: COBalD/TARDIS as job factory. Configuration available at COBalD/TARDIS Folding@Home configurations; stats available at GridKa Grafana. Following the last Ops Coordination meeting, WeNMR has been contacted and jobs are now being received via DIRAC.
  • Resource type: CPUs and GPUs.
  • Scale of the activity: ~10000 cores.
  • Comments: ensuring good CPU utilization required some tuning; see the configuration for details.
  • Contacts: Manuel.Giffels@kit.edu, Andreas.Petzold@kit.edu

Netherlands (LHC Tier-1: SURFsara and Nikhef)
  • Application run on the infrastructure: Rosetta@Home and WeNMR (see Rosetta@Home and WeNMR for info). Rosetta@Home stats: https://boinc.bakerlab.org/rosetta/top_users.php (team Nifhack).
  • Resource type: CPUs.
  • Scale of the activity: ~5000 cores.
  • Comments: Rosetta@Home is running as a burn-in test of a new cluster.
  • Contact: templon@nikhef.nl

TW-ASGC
  • Application run on the infrastructure: 1. the Taiwan Tier-2 supports F@H through the ATLAS infrastructure; 2. CryoEM through the local DiCOS system.
  • Job submission framework: 1. PanDA; 2. DiCOS web app (DiCOS).
  • Resource type: 1. CPU for F@H; 2. GPU for CryoEM.
  • Scale of the activity: 1. CPU: up to 664 cores; 2. GPU: 104.
  • Contact: felix@twgrid.org
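
Several of the entries above (e.g. the UK one) track their contribution through the public Folding@Home team statistics. Below is a minimal sketch of checking a team's score programmatically; it assumes a public JSON endpoint of the form https://api.foldingathome.org/team/<id> and the field names "name", "score" and "wus", which should all be verified against the current Folding@Home stats API.

# Minimal sketch: query the public Folding@Home statistics for a team.
# Assumption: the stats service exposes JSON at https://api.foldingathome.org/team/<id>
# with fields such as "name", "score" and "wus"; verify against the current API docs.
import json
import urllib.request

TEAM_ID = 246309  # team referenced in the UK entry above


def team_summary(team_id: int) -> dict:
    url = f"https://api.foldingathome.org/team/{team_id}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


if __name__ == "__main__":
    info = team_summary(TEAM_ID)
    # Field names are assumptions; adjust to the actual response schema.
    print(info.get("name"), info.get("score"), info.get("wus"))
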

Submission of F@H applications via experiment job submission frameworks

ALICE

  • Contributing to Folding@Home project through the ALICE workload management system
  • FAHClient configured to start immediately and run on a single CPU core
    • fetch a previously saved piece of work (or start from scratch)
    • run it for 12h (using the timeout command)
    • upload the new snapshot (cores/ and work/) back to the same place (a minimal wrapper sketch is shown after this list)
  • Running ~6k concurrent jobs Grid-wide
  • ALICE informed the sites of its plan to run F@H on site resources (not more than 5%). No complaints have been received; the sites that responded asked whether they could do more.
  • Added a considerable amount of resources from the new FLP farm (part of the ALICE O2 computing facilities) in early May
    • Allows testing of the equipment with known and permanent load
  • Efficiency 85-90%, high success rate
  • Saving intermediate results is beneficial and can improve efficiency and success rate
  • MonALISA plots:
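
The FAHClient wrapper described above restores a saved work unit, folds for a bounded time and saves the snapshot again. Below is a minimal sketch of that cycle, not the actual ALICE implementation: the snapshot location, the single-CPU option name and the use of a Python wrapper (instead of the timeout command) are assumptions.

# Minimal sketch of the checkpoint/run/upload cycle described above.
# Assumptions: FAHClient is available in PATH, the job runs in an empty
# working directory, and SNAPSHOT_DIR is a hypothetical shared area where
# previous cores/ and work/ directories were saved.
import os
import shutil
import subprocess

SNAPSHOT_DIR = "/shared/fah-snapshots/slot-0001"   # hypothetical location
RUN_SECONDS = 12 * 3600                            # 12 h wall-clock budget


def restore_snapshot() -> None:
    """Fetch a previously saved piece of work, or start from scratch."""
    for d in ("cores", "work"):
        src = os.path.join(SNAPSHOT_DIR, d)
        if os.path.isdir(src):
            shutil.copytree(src, d)


def run_client() -> None:
    """Run FAHClient on a single CPU core for at most RUN_SECONDS."""
    cmd = ["FAHClient", "--cpus", "1"]              # option name is an assumption
    try:
        subprocess.run(cmd, timeout=RUN_SECONDS)
    except subprocess.TimeoutExpired:
        pass                                        # expected: stop after the time budget


def save_snapshot() -> None:
    """Upload the new snapshot (cores/ and work/) back to the same place."""
    for d in ("cores", "work"):
        if os.path.isdir(d):
            dst = os.path.join(SNAPSHOT_DIR, d)
            shutil.rmtree(dst, ignore_errors=True)
            shutil.copytree(d, dst)


if __name__ == "__main__":
    restore_snapshot()
    run_client()
    save_snapshot()
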

ATLAS

  • Running as an Analysis-type job submitted with prun, using CVMFS-based image distribution
  • Dedicated VOMS group/production role: /atlas/covid/Role=Production
  • Data management in Rucio using the dedicated scope group.covid (a minimal client sketch is shown at the end of this section)
  • Dedicated “COVID” L1 global share, applied to all ATLAS distributed resources

  • Jobs are running about half and half on unpledged/pledged resources, stable configuration since end of April
    • 30k slots (about 1/3) from the HLT farm, used instead of simulation (Sim@P1)
    • 30k slots (about 10%) from sites, via opt-in agreement, 55 WLCG sites included

  • CPU resources are collected under 'ATLAS_CPU' donor, shown here
  • Running a mixture of 1-core and 8-core tasks
  • About 60k slots of CPU resources running in total, occasionally more due to additional available analysis global share

  • Also run on the limited GPU resources available to ATLAS (currently 6 active sites), see here
  • Active GPU queues in the CERN and LHC Experiments team, running via ATLAS central submission:
    • ANALY_MANC_GPU_TEST
    • ANALY_BNL_GPU_ARC
    • ANALY_MWT2_GPU
    • ANALY_INFN-T1_GPU
    • ANALY_QMUL_GPU_TEST
    • DESY-HH_GPU (New! Local shared resources with CMS)

  • ATLAS monitoring and documentation here
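
The data-management bullet above mentions the dedicated Rucio scope group.covid. Below is a minimal sketch of inspecting that scope with the Rucio Python client; it assumes a working ATLAS Rucio configuration and a valid grid proxy in the environment, and the wildcard pattern is only a placeholder, not an actual dataset naming convention.

# Minimal sketch: list datasets in the dedicated COVID scope via the Rucio client.
# Assumptions: a working Rucio configuration (RUCIO_ACCOUNT, rucio.cfg) and a valid
# grid proxy; the wildcard pattern below is only a placeholder.
from rucio.client import Client


def list_covid_dids(pattern: str = "*"):
    client = Client()
    # list_dids(scope, filters, ...) yields DID names matching the filter
    return list(client.list_dids("group.covid", {"name": pattern}))


if __name__ == "__main__":
    for name in list_covid_dids():
        print("group.covid:" + name)
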

CMS

  • Work done in the context of the CERN Against COVID-19 task force
  • Running stably on about 64k job slots, divided into three main categories:
    • Opportunistic resources in the CMS central WM infrastructure (e.g. submission infrastructure machines): ~700 cores running 8-core Folding@Home jobs.
    • CMS Grid distribution to Global Pool slots: EU sites, via opt-in agreement, excluding sites already involved at the site or national level. Once the scaling capability of the setup was demonstrated, this was explicitly limited to 3-4k cores running 4-core Folding@Home jobs.
    • CMS nodes at Point 5: running 7.5k Folding@Home jobs with 8 cores each on bare metal, with an allocation of 30k physical cores (60k virtual), for a total of 700k HS06:
      • 12400 cores (24800 virtual) from the online HLT farm
      • 16000 cores (32000 virtual) from the permanent HLT cloud
      • 1700 cores (3400 virtual) from the DAQ Readout Unit machines
  • Great flexibility demonstrated by the submission infrastructure:
    • direct injection via Condor, seamless handling of a non-CMS application (a submission sketch is shown after this list)
    • accurate internal monitoring via standard CERN MONIT infrastructure
    • potential application to backfilling Grid slots with CMS jobs (e.g. CMS@Home tasks) in the future
  • Monitoring of the HLT jobs via
    • Folding@Home native GUI client
    • automated scripts querying the F@H integrated telnet "pyon" server (see the sketch after this list)
    • dedicated Kibana dashboard
  • More information can be found on the CMS talk at the last GDB meeting.
  • Cumulative score over the past weeks:
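
The "direct injection via Condor" bullet above can be illustrated with the HTCondor Python bindings. The sketch below is an assumption-laden example, not the actual CMS submission infrastructure: run_fah.sh is a hypothetical wrapper script starting FAHClient, and the bindings are assumed to be version 9 or later.

# Minimal sketch: direct injection of a Folding@Home wrapper job via HTCondor's
# Python bindings. Assumptions: htcondor bindings >= 9 are installed, this host
# can talk to a schedd, and run_fah.sh is a hypothetical wrapper script that
# starts FAHClient with the desired core count.
import htcondor


def submit_fah_job(cores: int = 8, count: int = 1) -> int:
    sub = htcondor.Submit({
        "executable": "run_fah.sh",          # hypothetical wrapper script
        "request_cpus": str(cores),
        "output": "fah_$(ClusterId).$(ProcId).out",
        "error": "fah_$(ClusterId).$(ProcId).err",
        "log": "fah_$(ClusterId).log",
    })
    schedd = htcondor.Schedd()
    result = schedd.submit(sub, count=count)
    return result.cluster()


if __name__ == "__main__":
    print("submitted cluster", submit_fah_job())
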
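On the monitoring side, FAHClient exposes a local command interface (by default on TCP port 36330) whose replies are wrapped in the PyON format, as used by the automated scripts mentioned above. Below is a minimal sketch of querying it; the port, the local-only access and the "queue-info" command are taken from the commonly documented FAHClient third-party interface and should be checked against the client version in use.

# Minimal sketch: query the FAHClient local command interface ("pyon" server).
# Assumptions: FAHClient runs on localhost with the default command port 36330
# and allows connections from 127.0.0.1; "queue-info" is one of the commands
# whose reply is delimited by "PyON ... ---" markers.
import socket

HOST, PORT = "127.0.0.1", 36330


def fah_command(command: str) -> str:
    """Send one command to FAHClient and return the raw reply text."""
    with socket.create_connection((HOST, PORT), timeout=10) as sock:
        sock.recv(4096)                          # discard the welcome banner
        sock.sendall((command + "\n").encode())
        chunks = []
        sock.settimeout(2)
        try:
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
                if b"---" in data:               # end marker of a PyON reply
                    break
        except socket.timeout:
            pass
        return b"".join(chunks).decode(errors="replace")


if __name__ == "__main__":
    print(fah_command("queue-info"))
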

LHCb

  • Running Folding@Home since April 20th on ~600 nodes of the old HLT farm at Point 8.
  • Executing multi-threaded workloads in 24-thread job slots (1 job per node), i.e. ~15000 busy slots out of the 35000 available in the HLT farm
  • No re-configurations of the LHCb workload management system (DIRAC) needed. The HLT farm was reconfigured by the online team and the nodes used for F@H are simply not visible from the offline production infrastructure.
  • Very stable and smooth running since the beginning; a boost was seen after the passkey was enabled (a minimal client configuration sketch is shown after the plot below)
  • Report given at the last GDB
  • Cumulative distribution:

production_day_total.png
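
For the 24-thread slots mentioned above, the client configuration can be expressed in FAHClient's config.xml. Below is a minimal sketch that writes such a configuration; it assumes the usual config.xml schema with <option v='...'/> attributes and a single CPU slot, and the user/team/passkey values are placeholders rather than LHCb's actual identity. Option names should be checked against the FAHClient documentation.

# Minimal sketch: write a FAHClient config.xml for a single 24-CPU slot.
# Assumptions: the standard config.xml schema with v='...' attributes; the
# user, team and passkey values below are placeholders.
CONFIG_TEMPLATE = """<config>
  <user v='{user}'/>
  <team v='{team}'/>
  <passkey v='{passkey}'/>
  <cause v='COVID_19'/>
  <slot id='0' type='CPU'>
    <cpus v='{cpus}'/>
  </slot>
</config>
"""


def write_config(path: str = "config.xml", cpus: int = 24) -> None:
    with open(path, "w") as f:
        f.write(CONFIG_TEMPLATE.format(
            user="SOME_USER", team="0", passkey="0" * 32, cpus=cpus))


if __name__ == "__main__":
    write_config()
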

Experiment experts

  • ALICE : Costin Grigoras and Maarten Litmaath
  • ATLAS: David South
  • CMS: Felice Pantaleo
  • LHCb: Federico Stagni

OSG contribution

Technical documentation

Related meetings and presentations

-- JuliaAndreeva - 2020-04-01

Topic attachments
  • ALICE_resource_usage_Jan_Apr_2020.pdf (902.9 K, 2020-05-26, MaartenLitmaath)
  • production_day_total.png (13.5 K, 2020-05-25, ConcezioBozzi): LHCb cumulative points
  • score-hourly.png (153.5 K, 2020-05-26, AndreaBocci): hourly Folding@Home score
  • score-total.png (69.4 K, 2020-05-26, AndreaBocci): cumulative Folding@Home score