WLCG GDB: Integration of GPUs in WLCG
This twiki collects information about the integration of GPUs in the WLCG computing infrastructure. The topic has been brought up in the WLCG MB, and we were asked to bring back information from the LHC experiments.
Each experiment, please answer the questions below, so we can collect this feedback and bring it back to the WLCG MB.
References to public presentations and documents are welcome, but please summarize their main conclusions here, thanks!
*Note:* We agreed to repeat the same questions every year, until further notice.
QUESTIONS
- What is the current status of integration of GPUs for offline computing, at the software level?
- Are you already using GPUs for offline computing activities? Please comment on the experience.
- If using these resources for offline, do you account for the GPU usage in some way?
- Are there any future plans on GPU resource demands or plans for future utilization at sites? Please indicate mid-term and long-term plans (if available).
- How many GPUs are needed for offline for the next two years?
- Any other plans, e.g. on FPGAs?
- Other comments or questions?
ANSWERS (2023)
ALICE
=> What is the current status of integration of GPUs for offline computing, at the software level?
=> Are you already using GPUs for offline computing activities? Please comment on the experience.
=> If using these resources for offline, do you account for the GPU usage in some way?
=> Are there any future plans on GPU resource demands or plans for future utilization at sites? Please indicate mid-term and long-term plans (if available).
=> How many GPUs are needed for offline for the next two years?
=> Any other plans, e.g. on FPGAs?
=> Other comments or questions?
ATLAS
=> What is the current status of integration of GPUs for offline computing, at the software level?
=> Are you already using GPUs for offline computing activities? Please comment on the experience.
=> If using these resources for offline, do you account for the GPU usage in some way?
=> Are there any future plans on GPU resource demands or plans for future utilization at sites? Please indicate mid-term and long-term plans (if available).
=> How many GPUs are needed for offline for the next two years?
=> Any other plans, e.g. on FPGAs?
=> Other comments or questions?
CMS
=> What is the current status of integration of GPUs for offline computing, at the software level?
=> Are you already using GPUs for offline computing activities? Please comment on the experience.
=> If using these resources for offline, do you account for the GPU usage in some way?
=> Are there any future plans on GPU resource demands or plans for future utilization at sites? Please indicate mid-term and long-term plans (if available).
=> How many GPUs are needed for offline for the next two years?
=> Any other plans, e.g. on FPGAs?
=> Other comments or questions?
LHCb
=> What is the current status of integration of GPUs for offline computing, at the software level?
=> Are you already using GPUs for offline computing activities? Please comment on the experience.
=> If using these resources for offline, do you account for the GPU usage in some way?
=> Are there any future plans on GPU resource demands or plans for future utilization at sites? Please indicate mid-term and long-term plans (if available).
=> How many GPUs are needed for offline for the next two years?
=> Any other plans, e.g. on FPGAs?
=> Other comments or questions?
CONCLUSIONS
Main conclusions derived from the experiment answers.
ALICE:
ATLAS:
CMS:
LHCb:
General:
ANSWERS (2022)
ALICE
=> What is the current status of integration of GPUs for offline computing, at the software level?
ALICE uses a common O2 software framework for online and offline computing, which is capable of offloading certain reconstruction steps to GPUs. Currently, the TPC clusterization, tracking, and track-model compression run fully on GPUs, and in the near future ALICE plans to move the ITS and TRD reconstruction onto GPUs as well. ALICE uses a framework that runs generic code on different GPU backends; currently supported are NVIDIA CUDA, AMD ROCm, and OpenCL 2.1 or higher with Clang C++ for OpenCL kernel language support.
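As a hedged illustration of the single-source approach described above (not the actual O2 abstraction layer; the GPUkernel/GPUdevice macros and the clusterize kernel are invented for this sketch), one common technique selects backend qualifiers at compile time, so the same kernel source builds with nvcc for CUDA, with an analogous macro set for ROCm or OpenCL, or as plain C++ on a host compiler:
<verbatim>
#include <cstdio>
#ifdef __CUDACC__            // building with nvcc: real GPU kernels
  #define GPUkernel  __global__
  #define GPUdevice  __device__
#else                        // plain host compiler: CPU fallback
  #define GPUkernel
  #define GPUdevice
#endif

GPUdevice inline float weight(float q) { return 0.5f * q; }

// One kernel source; a ROCm/HIP or OpenCL backend would plug in the same
// way via its own macro definitions.
GPUkernel void clusterize(float* charges, int n) {
#ifdef __CUDACC__
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) charges[i] = weight(charges[i]);
#else
    for (int i = 0; i < n; ++i) charges[i] = weight(charges[i]);
#endif
}

int main() {
    float c[4] = {8, 6, 4, 2};
#ifdef __CUDACC__
    float* d = nullptr;
    cudaMalloc(&d, sizeof c);
    cudaMemcpy(d, c, sizeof c, cudaMemcpyHostToDevice);
    clusterize<<<1, 4>>>(d, 4);
    cudaMemcpy(c, d, sizeof c, cudaMemcpyDeviceToHost);
    cudaFree(d);
#else
    clusterize(c, 4);        // identical call path on a CPU-only build
#endif
    std::printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
}
</verbatim>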
=> Are you already using GPUs for offline computing activities? Please comment on the experience.
ALICE uses the Run 3 EPN farm as an offline computing farm, including the farm's GPUs. Since ALICE has a common framework (see above) for online and offline, this is an automatic feature. As there has been no large-scale processing of real Run 3 data yet, only MC-simulated (for Pb-Pb) and LHC test-beam (for p-p) offline reconstructions have been performed using GPUs. ALICE plans to use the GPUs extensively for the offline processing of the data taken in 2022 and afterwards.
=> If using these resources for offline, do you account for the GPU usage in some way?
We do not have a realistic basis for GPU accounting yet, in relation to the HS06 CPU scores. The size and composition of the EPN farm is based on internal benchmarking with the O2 software and cannot be used to publish absolute accounting data. We expect to be able to establish an equivalence between GPU and CPU performance once we can run the same software on CPUs with known HS06 characteristics. The necessary hooks are already implemented in the code.
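A minimal sketch of the equivalence idea above, assuming the same reconstruction workload can be timed on a reference CPU of known HS06 score and on a GPU node; all numbers are invented for illustration:
<verbatim>
// Hypothetical GPU-accounting arithmetic: derive an HS06-equivalent for a
// GPU node from the throughput ratio against a CPU of known HS06 score.
#include <cstdio>

int main() {
    // Reference CPU: known HS06 score and measured event throughput (assumed).
    const double cpuHS06       = 250.0;  // HS06 of reference CPU node
    const double cpuEventsPerS = 4.0;    // events/s of the workload on that CPU

    // GPU node: same workload, measured throughput (assumed).
    const double gpuEventsPerS = 60.0;

    // HS06-equivalent of the GPU node for this workload.
    const double gpuHS06eq = cpuHS06 * (gpuEventsPerS / cpuEventsPerS);
    std::printf("GPU node HS06-equivalent: %.1f\n", gpuHS06eq);
    return 0;
}
</verbatim>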
=> Are there any future plans on GPU resource demands or plans for future utilization at sites? Please indicate mid-term and long-term plans (if available).
One of the next goals in the ALICE software development is offloading more of the offline reconstruction to GPUs. In the EPN farm, around 90% of the compute capacity (measured with respect to the ALICE reconstruction software) comes from the GPUs, so it is desirable to offload 90% of the total workload to GPUs. There are sufficient GPU resources in various labs to test the software on GPU types other than the ones available in the EPN farm (AMD MI50), together with studies of memory and CPU dependencies. Once the software offloading process is at an advanced stage, after the first year of data taking, ALICE may ask for a larger GPU deployment at a computing centre to run offline reconstruction with GPUs at a larger scale. In addition, we are investigating the possibility of using HPCs equipped with GPUs for the same offline reconstruction tasks.
=> How many GPUs are needed for offline for the next two years?
A more precise estimate of additional GPU needs for ALICE can be made after the project goals outlined in the previous answer are achieved and evaluated.
=> Any other plans, e.g. on FPGAs?
ALICE uses FPGAs for detector-specific local online processing, but there are currently no plans to employ them for offline reconstruction.
ATLAS
=> What is the current status of integration of GPUs for offline computing, at the software level?
Very limited at the moment; several R&D projects investigating GPU usage for the HL-LHC era are ongoing. The Athena offline software supports the integration of GPUs.
=> Are you already using GPUs for offline computing activities? Please comment on the experience.
Yes, GPUs are already integrated into offline computing and are used for limited analysis use cases, e.g. ML training. Several sites have made small-scale GPU resources available to all ATLAS users through PanDA.
=> If using these resources for offline, do you account for the GPU usage in some way?
GPU usage is measured in the same way as CPU usage, but the benchmarks needed for proper accounting are missing. GPUs are considered unpledged.
=> Are there any future plans on GPU resource demands or plans for future utilization at sites? Please indicate mid-term and long-term plans (if available).
Towards the end of Run 3, prototypes of event generation, detector simulation, and/or reconstruction software able to use GPUs may become available, so sufficient resources should be available for testing and validation. Large-scale GPU deployment at sites would be needed by the start of HL-LHC (~2028), assuming the offline software has been developed to use them.
=> How many GPUs are needed for offline for the next two years?
=> Any other plans, e.g. on FPGAs?
Not for offline. For online use, FPGAs are being seriously considered on the timescale of the HL-LHC. The simulation strategy for triggers running on FPGAs will be discussed once that decision has been made, around 2026.
=> Other comments or questions?
In the short term, GPU resources used for limited analysis or R&D must be considered unpledged. However, it is good to start developing a pledging framework within WLCG now, for possible future use cases.
CMS
=> What is the current status of integration of GPUs for offline computing, at the software level?
The asynchronous use of accelerators in multithreaded CMSSW workflows is fully supported by the framework. More specifically, all the components for the use of NVIDIA GPUs are present in the software stack to exploit the GPU-equipped HLT farm for Run 3 (online and offline use the same software release). Submission of workloads to Grid worker nodes equipped with NVIDIA GPUs is fully supported by the workload management, both for user and production jobs.
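As a hedged sketch of what asynchronous accelerator use in a multithreaded framework looks like at the CUDA level (this is not CMSSW code; the reconstruct kernel and onDone callback are invented), work is enqueued on a stream and completion is signalled by a host callback, so CPU threads are not blocked while the GPU runs:
<verbatim>
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for one offloaded reconstruction step (invented).
__global__ void reconstruct(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Runs on a CPU thread once all preceding work in the stream is done;
// a framework would use this to hand the task back to its scheduler.
void CUDART_CB onDone(void* userData) {
    std::printf("GPU work for task %ld finished\n", (long)userData);
}

int main() {
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Enqueue the kernel and a completion callback; both calls return
    // immediately, so the calling CPU thread can pick up other work.
    reconstruct<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaLaunchHostFunc(stream, onDone, (void*)1L);

    cudaStreamSynchronize(stream);  // demo only; a framework would not block here
    cudaFree(d);
    return 0;
}
</verbatim>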
=> Are you already using GPUs for offline computing activities? Please comment on the experience.
CMS uses GPUs offered opportunistically by sites for large-scale validation of the HLT reconstruction code, machine learning studies, and user jobs. The experience is very positive; the use of standard tools makes it easy to take advantage of these accelerators, both for production dataset requests and user jobs.
=> If using these resources for offline, do you account for the GPU usage in some way?
A strategic goal of CMS is to efficiently use all of the resources made available to us. Presently we do not account for GPU usage and consider all accelerators opportunistic.
=> Are there any future plans on GPU resource demands or plans for future utilization at sites? Please indicate mid-term and long-term plans (if available).
CMS will follow WLCG guidance on how to include GPUs in our resource request once CMS has a broader set of commissioned applications and sees that using GPUs would be cost effective. CMS is engaged in the HEPScore WG and has already provided software stacks and workflows to benchmark x86, ARM, PPC, and NVIDIA GPUs. The CMS effort and plan for the exploitation of GPUs is twofold, involving both computing tools and data processing software. CMS's existing computing tools are already able to schedule work at Grid sites on GPU-equipped worker nodes in a standard way, by specifying this requirement along with the usual parameters such as CMSSW configurations, input and output datasets, and maximum RSS memory needed. This transparent integration of GPU resources into our Global Pool is being improved in close collaboration with the HTCondor development team. On the data processing software side, the only application for which a sizable part of the calculations can currently be offloaded to a GPU is the High Level Trigger reconstruction (about 30% of the runtime). Large-scale validation on the Grid has been performed during the last few weeks in preparation for data taking. CMS plans to propagate the algorithms currently used at the HLT on GPUs to Phase-2 and Run 3 offline workflows, and to add more during Run 3.
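As a back-of-the-envelope illustration (not a CMS measurement) of what a ~30% offloadable runtime fraction implies for overall job throughput, Amdahl's law bounds the speedup regardless of how fast the GPU runs that fraction:
<verbatim>
// Amdahl's-law estimate: if a fraction f of the runtime can be offloaded
// and the GPU makes that part s times faster, the overall speedup is
// 1 / ((1 - f) + f / s). The speedup values s are purely illustrative.
#include <cstdio>

int main() {
    const double f = 0.30;                // offloadable fraction (from the text)
    for (double s : {2.0, 10.0, 1e9}) {   // assumed GPU speedups for that fraction
        const double speedup = 1.0 / ((1.0 - f) + f / s);
        std::printf("GPU speedup %-10.0f -> overall speedup %.2fx\n", s, speedup);
    }
    return 0;
}
</verbatim>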
=> How many GPUs are needed for offline for the next two years?
See previous answer.
=> Any other plans, e.g. on FPGAs?
There is no official plan in CMS to use FPGAs offline. However, R&D is ongoing in this direction in the area of ML inference.
LHCb
=> What is the current status of integration of GPUs for offline computing, at the software level?
LHCb has developed the Allen framework, which will be used in Run 3 to execute on GPUs the first stage of the High Level Trigger application (HLT1), performing partial event reconstruction and selections. HLT1 will run online on the event-building farm, but in principle it could also run offline; for example, offline GPUs could be used to emulate HLT1 in simulation. However, Allen can also be compiled for CPUs, and we will indeed use the CPU version for emulating HLT1. The corresponding compute work would be larger than that needed on GPUs, but it is in any case negligible with respect to the complete simulation workload.
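A minimal sketch of the scale argument above, with invented per-event times (none of these numbers come from LHCb): even if the CPU build of HLT1 is ten times more expensive than the GPU one, it remains a percent-level addition to full simulation.
<verbatim>
// Illustrative arithmetic only; all per-event times are assumptions made
// up to show why a CPU-based HLT1 emulation is negligible next to full
// detector simulation.
#include <cstdio>

int main() {
    const double simPerEvent     = 100.0;  // s/event, full simulation (assumed)
    const double hlt1GpuPerEvent = 0.1;    // s/event, HLT1 on GPU (assumed)
    const double cpuSlowdown     = 10.0;   // assumed CPU/GPU cost ratio for HLT1

    const double hlt1CpuPerEvent = hlt1GpuPerEvent * cpuSlowdown;
    std::printf("HLT1-on-CPU overhead vs simulation: %.1f%%\n",
                100.0 * hlt1CpuPerEvent / simPerEvent);
    return 0;
}
</verbatim>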
=> Are you already using GPUs for offline computing activities? Please comment on the experience.
Not really. We are aware of a few LHCb analyses that use GPUs for e.g. training of Machine-Learning estimators or maximum-likelihood fits, but they use resources on institute clusters which are outside our distributed computing infrastructure.
=> If using these resources for offline, do you account for the GPU usage in some way?
Not applicable, given the previous remark; in any case, we are not aware of how many resources are currently involved.
=> Are there any future plans on GPU resource demands or plans for future utilization at sites? Please indicate mid-term and long-term plans (if available).
No firm plans have been laid out yet.
On the mid-term scale, some preliminary discussions have started in the context of the use cases mentioned in the answer to the second question above, and in the context of possible analysis facilities.
On a longer timescale, we are considering extending the Allen framework to later processing stages (e.g. the second stage of the high-level trigger, HLT2, performing full event reconstruction and selections), in the context of the future Run 4/Run 5 LHCb Upgrades.
On the simulation side, we are developing fast simulations that use machine-learning techniques. These fast simulations are currently implemented on CPUs, but they might be ported to GPUs in the future.
All of the above is purely speculative at this point, so we do not have any GPU resource demands at sites. We might opportunistically use the GPUs of the online event-builder farm when they are not in use for data taking.
=> Any other plans, e.g. on FPGAs?
We are performing some R&D work on FPGAs and other devices (e.g. TPUs) within the online context, targeting the future LHCb Run 4/Run 5 Upgrades.
=> How many GPUs are needed for offline for the next two years?
Given the previous answers, we do not require any GPUs for offline activities in the next year.
=> Other comments or questions?
In the offline world, LHCb is totally dominated by simulation (which uses above 90% of our compute resources). Therefore, we would welcome developments towards porting (parts of) the simulation to GPUs, provided they prove advantageous in terms of performance and required resources.
CONCLUSIONS
Main conclusions derived from the experiment answers.
ALICE: runs offline software on GPUs within O2, a common framework for online and offline computing. The software is compatible with a variety of GPU models. ALICE plans to use GPUs extensively for offline processing from Run 3 onwards. After the first year of data taking, there are plans to use some GPU resources outside the O2/EPN farm.
ATLAS: R&D projects investigating GPUs are ongoing. GPUs are used for some analysis (ML training), with small-scale resources available at some sites. Large-scale GPU deployment at sites would be needed by the start of the HL-LHC (~2028), assuming the offline software has been developed to use them.
CMS: CMSSW supports running on GPUs, both at the HLT and at Grid sites; the integration is being improved in close collaboration with the HTCondor team. GPUs are used opportunistically for some machine learning studies and user jobs. CMS plans to propagate the algorithms currently used at the HLT on GPUs to Phase-2 and Run 3 offline workflows, and to add more during Run 3.
LHCb: the Allen framework runs HLT1 on GPUs online; offline, the CPU build of Allen will be used to emulate HLT1 in simulation, and the software is ready for this. A few LHCb analyses use GPUs, e.g. for training machine-learning estimators or maximum-likelihood fits, but they use resources on institute clusters outside the distributed computing infrastructure. No GPU resource demands at sites are planned yet.
General: GPU usage is marginal (mostly user analysis). There are, in general, no short-term plans to include GPUs from sites. FPGAs are considered only for online use; there are no plans for offline yet. There are no GPU benchmarks at the moment, which affects the accounting; all of these resources are treated as opportunistic for now, pending guidance from WLCG, and the usage is not yet at scale. This survey can be conducted again in one year, to see what has changed.
-- JosepFlix - 2022-02-18