The ratio of the event throughput rate with GPU acceleration to the CPU-only rate as a function of the number of ATLAS trigger (Athena) processes running on the CPU. Separate tests were performed with Athena configured to execute only Inner Detector tracking (ID), only calorimeter topological clustering (Calo), or both (ID & Calo). The system was configured either to perform the work on the CPU or to offload it to one or two GPUs. The system consisted of two Intel(R) Xeon(R) E5-2695 v3 14-core CPUs with a clock speed of 2.30 GHz and two NVIDIA GK210GL GPUs in a Tesla K80 module. The input was a simulated $t\bar{t}$ dataset converted to the raw detector output format (bytestream). An average of 46 minimum-bias events per simulated collision were superimposed, corresponding to an instantaneous luminosity of 1.7×10³⁴ cm⁻²s⁻¹. The ID track seeding takes about 30% of the event processing time on the CPU and is accelerated by about a factor of 5 on the GPU; as a result, the throughput increases by about 35% with GPU acceleration for up to 14 Athena processes. The calorimeter clustering algorithm takes about 8% of the event processing time on the CPU and is accelerated by about a factor of 2 on the GPU; however, the effect of the acceleration is offset by a small increase in the time of the non-accelerated code, so a small decrease in throughput is observed when offloading to the GPU.
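As a cross-check of the quoted numbers, an Amdahl's-law style estimate (a back-of-the-envelope sketch, not part of the measurement, and neglecting the offload overhead) relates the throughput ratio to the fraction $f$ of CPU event time that is offloaded and the speed-up $s$ of that fraction:

$$ \frac{R_{\mathrm{GPU}}}{R_{\mathrm{CPU}}} \approx \frac{1}{(1-f) + f/s} $$

With $f \approx 0.30$ and $s \approx 5$ for the ID track seeding this gives about 1.3, consistent with the ~35% gain quoted above; with $f \approx 0.08$ and $s \approx 2$ for the calorimeter clustering it gives only about 1.04, small enough to be cancelled by the increase in the non-accelerated time.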
Event throughput rates with and without GPU acceleration as a function of the number of ATLAS trigger (Athena) processes running on the CPU. Separate tests were performed with Athena configured to execute only Inner Detector tracking (ID), only calorimeter topological clustering (Calo), or both (ID & Calo). The system was configured either to perform the work on the CPU or to offload it to one or two GPUs. The system consisted of two Intel(R) Xeon(R) E5-2695 v3 14-core CPUs with a clock speed of 2.30 GHz and two NVIDIA GK210GL GPUs in a Tesla K80 module. The input was a simulated $t\bar{t}$ dataset converted to the raw detector output format (bytestream). An average of 46 minimum-bias events per simulated collision were superimposed, corresponding to an instantaneous luminosity of 1.7×10³⁴ cm⁻²s⁻¹. A significant rate increase is seen when the ID track seeding is offloaded to the GPU: the ID track seeding takes about 30% of the event processing time on the CPU and is accelerated by about a factor of 5 on the GPU. A small rate decrease is observed when the calorimeter clustering is offloaded to the GPU: the calorimeter clustering takes about 8% of the event processing time on the CPU and is accelerated by about a factor of 2 on the GPU, but the effect of the acceleration is offset by a small increase in the time of the non-accelerated code. There is only a relatively small increase in rate when the number of Athena processes is increased above the number of physical cores (28).
The time-averaged mean number of ATLAS trigger (Athena) processes in a wait state pending the return of work offloaded to the GPU, as a function of the number of Athena processes running on the CPU. Separate tests were performed with Athena configured to execute only Inner Detector tracking (ID), only calorimeter topological clustering (Calo), or both (ID & Calo). The system was configured to offload work to one or two GPUs. The system consisted of two Intel(R) Xeon(R) E5-2695 v3 14-core CPUs with a clock speed of 2.30 GHz and two NVIDIA GK210GL GPUs in a Tesla K80 module. The input was a simulated $t\bar{t}$ dataset converted to the raw detector output format (bytestream). An average of 46 minimum-bias events per simulated collision were superimposed, corresponding to an instantaneous luminosity of 1.7×10³⁴ cm⁻²s⁻¹. When offloaded to the GPU, the ID track seeding takes about 8% of the total event processing time, so the average number of Athena processes waiting is less than 1 for up to about 12 Athena processes. The offloaded calorimeter clustering takes about 4% of the total event processing time, so the average number of Athena processes waiting is less than 1 for up to about 25 Athena processes.
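A simple occupancy estimate (a sketch under stated assumptions, not part of the measurement) reproduces the quoted break-even points: if each Athena process spends a fraction $f$ of its wall time waiting for offloaded work, then with $N$ processes the time-averaged number waiting is roughly

$$ \langle N_{\mathrm{wait}} \rangle \approx N \cdot f , $$

which stays below 1 for $N \lesssim 1/f$, i.e. about 12 processes for the ID track seeding ($f \approx 0.08$) and about 25 for the calorimeter clustering ($f \approx 0.04$). The estimate ignores queueing once the GPU approaches saturation.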
Breakdown of the time per event for Inner Detector track seeding offloaded to a GPU, showing the time fraction for the kernels running on the GPU (GPU execution) and the overhead associated with offloading the work (other). Track seeding consists of the formation of triplets of hits compatible with a track. The overhead comprises the time to convert data structures between CPU and GPU data formats, the data transfer time between CPU and GPU, and the Inter-Process Communication (IPC) time, which accounts for the transfer of data between the ATLAS trigger (Athena) processes and the process handling communication with the GPU. The system consisted of two Intel(R) Xeon(R) E5-2695 v3 14-core CPUs with a clock speed of 2.30 GHz and two NVIDIA GK210GL GPUs in a Tesla K80 module. Measurements were made using one GPU and with 12 Athena processes running on the CPU. The input was a simulated $t\bar{t}$ dataset converted to the raw detector output format (bytestream). An average of 46 minimum-bias events per simulated collision were superimposed, corresponding to an instantaneous luminosity of 1.7×10³⁴ cm⁻²s⁻¹.
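For reference, the bookkeeping assumed in these breakdowns can be written as a simple sum of the components named in the caption (the symbols are introduced here only for illustration):

$$ t_{\mathrm{offload}} \approx t_{\mathrm{IPC}} + t_{\mathrm{convert}} + t_{\mathrm{transfer}} + t_{\mathrm{GPU}} , $$

where the first three terms make up the "other" category and the last is the "GPU execution" time.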
Breakdown of the time per event for Inner Detector tracking offloaded to a GPU, showing the time fraction for the Counting, Doublet Making and Triplet Making kernels running on the GPU (GPU execution) and the overhead associated with offloading the work (other). The Counting kernel determines the number of pairs of Inner Detector hits, and the Doublet and Triplet Making kernels form combinations of two and three hits, respectively, compatible with a track. The overhead comprises the time to convert data structures between CPU and GPU data formats, the data transfer time between CPU and GPU, and the Inter-Process Communication (IPC) time, which accounts for the transfer of data between the ATLAS trigger (Athena) processes and the process handling communication with the GPU. The system consisted of two Intel(R) Xeon(R) E5-2695 v3 14-core CPUs with a clock speed of 2.30 GHz and two NVIDIA GK210GL GPUs in a Tesla K80 module. Measurements were made using one GPU and with 12 Athena processes running on the CPU. The input was a simulated $t\bar{t}$ dataset converted to the raw detector output format (bytestream). An average of 46 minimum-bias events per simulated collision were superimposed, corresponding to an instantaneous luminosity of 1.7×10³⁴ cm⁻²s⁻¹.
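The counting-then-making pattern described above is a standard way of sizing GPU output buffers before filling them. The kernel below is a minimal CUDA sketch of such a counting step for doublet formation, not the ATLAS implementation: the spacepoint layout, cut variables and names are placeholders, and the real selection uses detector-specific criteria.

```cuda
// Minimal sketch of a doublet "counting" kernel (NOT the ATLAS code).
// One thread per inner spacepoint counts the outer spacepoints that pass a
// crude compatibility cut; a prefix sum over doubletCount would then give
// each thread its write offset for a subsequent "doublet making" kernel.
#include <cuda_runtime.h>

struct Spacepoint { float x, y, z, r; };   // r = transverse radius (assumed layout)

__global__ void countDoublets(const Spacepoint* inner, int nInner,
                              const Spacepoint* outer, int nOuter,
                              float minDeltaR, float maxDeltaZ,
                              int* doubletCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nInner) return;

    int count = 0;
    for (int j = 0; j < nOuter; ++j) {
        float dR = outer[j].r - inner[i].r;           // radial separation
        float dZ = outer[j].z - inner[i].z;           // longitudinal separation
        if (dR > minDeltaR && fabsf(dZ) < maxDeltaZ)  // placeholder doublet cut
            ++count;
    }
    doubletCount[i] = count;  // later prefix-summed to allocate the doublet buffer
}
```

A triplet-making step would repeat the same count-then-fill pattern over the stored doublets.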
Breakdown of the time per event for the ATLAS trigger process (Athena) running Inner Detector (ID) track seeding on the CPU or offloaded to a GPU, showing the time fraction for the Counting, Doublet Making and Triplet Making kernels running on the GPU (GPU execution) and the overhead associated with offloading the work (other). The Counting kernel determines the number of pairs of ID hits, and the Doublet and Triplet Making kernels form combinations of two and three hits, respectively, compatible with a track. The overhead comprises the time to convert data structures between CPU and GPU data formats, the data transfer time between CPU and GPU, and the Inter-Process Communication (IPC) time, which accounts for the transfer of data between the Athena processes and the process handling communication with the GPU. The system consisted of two Intel(R) Xeon(R) E5-2695 v3 14-core CPUs with a clock speed of 2.30 GHz and two NVIDIA GK210GL GPUs in a Tesla K80 module. Measurements were made with one GPU and 12 Athena processes running on the CPU. Athena was configured to run only ID tracking. The input was a simulated $t\bar{t}$ dataset converted to the raw detector output format (bytestream). An average of 46 minimum-bias events per simulated collision were superimposed, corresponding to an instantaneous luminosity of 1.7×10³⁴ cm⁻²s⁻¹.
Breakdown of the time per event for calorimeter clustering offloaded to a GPU, showing the time fraction for the kernels running on the GPU (GPU execution) and the overhead associated with offloading the work (other). The overhead comprises the time to convert data structures between CPU and GPU data formats, the data transfer time between CPU and GPU, and the Inter-Process Communication (IPC) time, which accounts for the transfer of data between the ATLAS trigger (Athena) processes and the process handling communication with the GPU. The system consisted of two Intel(R) Xeon(R) E5-2695 v3 14-core CPUs with a clock speed of 2.30 GHz and two NVIDIA GK210GL GPUs in a Tesla K80 module. Measurements were made using one GPU and with 14 Athena processes running on the CPU. The input was a simulated $t\bar{t}$ dataset converted to the raw detector output format (bytestream). An average of 46 minimum-bias events per simulated collision were superimposed, corresponding to an instantaneous luminosity of 1.7×10³⁴ cm⁻²s⁻¹.
Breakdown of the time per event for calorimeter clustering offloaded to a GPU, showing the time fraction for the Classification, Tagging and Growing kernels running on the GPU (GPU execution) and the overhead associated with offloading the work (other). The Classification kernel identifies calorimeter cells that will initiate (seed), propagate (grow), or terminate a cluster; the Tagging kernel assigns a unique tag to seed cells; and the Growing kernel associates neighbouring growing or terminating cells to form clusters. The overhead comprises the time to convert data structures between CPU and GPU data formats, the data transfer time between CPU and GPU, and the Inter-Process Communication (IPC) time, which accounts for the transfer of data between the ATLAS trigger (Athena) processes and the process handling communication with the GPU. The system consisted of two Intel(R) Xeon(R) E5-2695 v3 14-core CPUs with a clock speed of 2.30 GHz and two NVIDIA GK210GL GPUs in a Tesla K80 module. Measurements were made using one GPU and with 14 Athena processes running on the CPU. The input was a simulated $t\bar{t}$ dataset converted to the raw detector output format (bytestream). An average of 46 minimum-bias events per simulated collision were superimposed, corresponding to an instantaneous luminosity of 1.7×10³⁴ cm⁻²s⁻¹.
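As an illustration of the Classification step described above, the kernel below is a minimal CUDA sketch, not the ATLAS implementation: one thread classifies one calorimeter cell by its signal-to-noise ratio against seed/grow/terminate thresholds passed in as parameters. The array layout, tag values and names are assumptions.

```cuda
// Minimal sketch of a cell "classification" kernel (NOT the ATLAS code).
// Each thread labels one calorimeter cell as a potential cluster seed,
// growing cell or terminal cell according to its |E|/noise significance.
#include <cuda_runtime.h>

enum CellTag { NONE = 0, TERMINAL = 1, GROW = 2, SEED = 3 };

__global__ void classifyCells(const float* energy, const float* noise, int nCells,
                              float seedThr, float growThr, float termThr,
                              int* tag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nCells) return;

    float snr = fabsf(energy[i]) / noise[i];   // cell significance
    if      (snr > seedThr) tag[i] = SEED;     // can initiate a cluster
    else if (snr > growThr) tag[i] = GROW;     // can propagate a cluster
    else if (snr > termThr) tag[i] = TERMINAL; // can only terminate a cluster edge
    else                    tag[i] = NONE;
}
```

The Tagging and Growing steps would then assign unique tags to the seed cells and iteratively propagate those tags to neighbouring growing or terminating cells, as described in the caption.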
Breakdown of the time per event for the ATLAS trigger process (Athena) running calorimeter clustering on the CPU and offloaded to a GPU, showing the time for the Classification, Tagging and Growing kernels running on the GPU (GPU execution) and the overhead associated with offloading the work (other). The Classification kernel identifies calorimeter cells that will initiate (seed), propagate (grow), or terminate a cluster; the Tagging kernel assigns a unique tag to seed cells; and the Growing kernel associates neighbouring growing or terminating cells to form clusters. The overhead comprises the time to convert data structures between CPU and GPU data formats, the data transfer time between CPU and GPU, and the Inter-Process Communication (IPC) time, which accounts for the transfer of data between the ATLAS trigger (Athena) processes and the process handling communication with the GPU. There is a small increase in the execution time of the non-accelerated code when the calorimeter clustering is offloaded to the GPU. The system consisted of two Intel(R) Xeon(R) E5-2695 v3 14-core CPUs with a clock speed of 2.30 GHz and two NVIDIA GK210GL GPUs in a Tesla K80 module. Measurements were made using one GPU and with 14 Athena processes running on the CPU. Athena was configured to run only calorimeter clustering. The input was a simulated $t\bar{t}$ dataset converted to the raw detector output format (bytestream). An average of 46 minimum-bias events per simulated collision were superimposed, corresponding to an instantaneous luminosity of 1.7×10³⁴ cm⁻²s⁻¹.
Transverse impact parameter distributions for the simulated tracks correctly reconstructed by the GPU-accelerated tracking algorithm and by the standard CPU-only algorithm. The reference CPU algorithm was the FastTrackFinder, consisting of a track-seed (spacepoint-triplet) maker and combinatorial track following; the GPU algorithm was the FastTrackFinder with a GPU-accelerated track-seed maker. The simulated tracks were required to have pT > 1 GeV and |η| < 2.5.
Transverse momentum distributions for the simulated tracks correctly reconstructed by the GPU-accelerated tracking algorithm and by the standard CPU-only algorithm. The reference CPU algorithm was the FastTrackFinder, consisting of a track-seed (spacepoint-triplet) maker and combinatorial track following; the GPU algorithm was the FastTrackFinder with a GPU-accelerated track-seed maker. The simulated tracks were required to have pT > 1 GeV and |η| < 2.5.
Track reconstruction efficiency as a function of the simulated track azimuth for the GPU-accelerated tracking algorithm and the standard CPU-only algorithm. The reference CPU algorithm was the FastTrackFinder, consisting of a track-seed (spacepoint-triplet) maker and combinatorial track following; the GPU algorithm was the FastTrackFinder with a GPU-accelerated track-seed maker. The simulated tracks were required to have pT > 1 GeV and |η| < 2.5.
Track reconstruction efficiency as a function of the simulated track transverse momentum for the GPU-accelerated tracking algorithm and the standard CPU-only algorithm. The reference CPU algorithm was the FastTrackFinder, consisting of a track-seed (spacepoint-triplet) maker and combinatorial track following; the GPU algorithm was the FastTrackFinder with a GPU-accelerated track-seed maker. The simulated tracks were required to have pT > 1 GeV and |η| < 2.5.
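For clarity, the efficiency plotted in the two figures above is presumably defined per bin of the simulated-track quantity (transverse momentum or azimuth) as

$$ \varepsilon(x) = \frac{N^{\mathrm{sim}}_{\mathrm{matched}}(x)}{N^{\mathrm{sim}}(x)} , $$

i.e. the fraction of simulated tracks within the stated acceptance (pT > 1 GeV, |η| < 2.5) that are matched to a reconstructed track. This restates the captions and is not an additional measurement detail.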
Cluster growing algorithm timing. Timing of the calorimeter cluster growing phase of the CPU Topological Clustering (blue line) and of the GPU Topological Automaton Cluster (TAC) algorithm (red dashed line) for the full detector. The TAC time includes the processing time and the overheads: data conversion and transfer. The execution time of the algorithms was measured using a data sample of QCD di-jet events with leading-jet transverse momentum above 20 GeV and a fixed number of 40 simultaneous interactions per bunch crossing. The Topological Clustering runs on a single CPU core of an AMD FX-8320 processor (3.5 GHz) and the TAC runs on an NVIDIA GTX 650 card.
Cluster growing algorithm timing. Timing of the calorimeter cluster growing phase of the CPU Topological Clustering (blue line) and of the GPU Topological Automaton Cluster (TAC) algorithm (red dashed line) for the full detector. The TAC time includes the processing time and the overheads: data conversion and transfer. The execution time of the algorithms was measured using a data sample of inclusive top quark pair production with 138 simultaneous interactions per bunch crossing. The Topological Clustering runs on a single CPU core of an AMD FX-8320 processor (3.5 GHz) and the TAC runs on an NVIDIA GTX 650 card.
Timing of the GPU Topological Automaton Cluster (TAC) cluster-conversion overhead (purple line) and of the cluster growing (green dashed line). The remaining 5 ms of the TAC total execution time is a constant overhead due to the cell data conversion, data transfer and Inter-Process Communication (IPC). The execution time of the algorithms was measured using a data sample of QCD di-jet events with leading-jet transverse momentum above 20 GeV and a fixed number of 40 simultaneous interactions per bunch crossing. The data conversion runs on a single CPU core of an AMD FX-8320 processor (3.5 GHz) and the cluster growing runs on an NVIDIA GTX 650 card.
Timing of the GPU Topological Automaton Cluster (TAC) cluster-conversion overhead (purple line) and of the cluster growing (green dashed line). The remaining 5 ms of the TAC total execution time is a constant overhead due to the cell data conversion, data transfer and Inter-Process Communication (IPC). The execution time of the algorithms was measured using a data sample of inclusive top quark pair production with 138 simultaneous interactions per bunch crossing. The data conversion runs on a single CPU core of an AMD FX-8320 processor (3.5 GHz) and the cluster growing runs on an NVIDIA GTX 650 card.
Relative transverse energy difference of the matched calorimeter clusters reconstructed using the standard CPU cell-clustering algorithm and the logically similar algorithm ported to the GPU. These are raw clusters, before the execution of the cluster-splitting algorithm. The algorithms differ in the way they group the less significant cells: on the CPU these cells belong to the first cluster that reaches them, while on the GPU they belong to the cluster with the most energetic seed, resulting in the difference observed for lower transverse energy clusters. The x-axis shows the CPU cluster transverse energy in GeV. The y-axis shows the corresponding transverse energy difference, CPU minus GPU, divided by the CPU cluster transverse energy. Clusters are matched using the group of cluster seed cells, a unique cluster identifier that is independent of the algorithm used. The data sample consists of QCD di-jet events with leading-jet transverse momentum above 20 GeV and a fixed number of 40 simultaneous interactions per bunch crossing. The Topological Clustering runs on a single CPU core of an AMD FX-8320 processor (3.5 GHz) and the TAC runs on an NVIDIA GTX 650 card.
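Written out, the quantity plotted on the y-axis of the figure above is

$$ \frac{\Delta E_{\mathrm{T}}}{E_{\mathrm{T}}} = \frac{E_{\mathrm{T}}^{\mathrm{CPU}} - E_{\mathrm{T}}^{\mathrm{GPU}}}{E_{\mathrm{T}}^{\mathrm{CPU}}} , $$

evaluated per matched cluster pair as described in the caption.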