
Trigger Software Upgrade Public Results

Introduction

Approved plots that can be shown by ATLAS speakers at conferences and similar events. Please do not add figures on your own. Contact the responsible project leader in case of questions and/or suggestions. Follow the guidelines on the trigger public results page.

Phase-I Upgrade public plots

ATL-COM-DAQ-2016-116 ATLAS Trigger GPU Demonstrator Performance Plots

The ratio of event throughput rates with GPU acceleration to the CPU-only rates as a function of the number of ATLAS trigger (Athena) processes running on the CPU. Separate tests were performed with Athena configured to execute only Inner Detector tracking (ID), only Calorimeter topological clustering (Calo), or both (ID & Calo). The system was configured either to perform the work on the CPU or to offload it to one or two GPUs. The system consisted of two Intel Xeon E5-2695 v3 14-core CPUs with a clock speed of 2.30 GHz and two NVidia GK210GL GPUs in a Tesla K80 module. The input was a simulated ttbar dataset converted to a raw detector output format (bytestream). An average of 46 minimum-bias events per simulated collision were superimposed, corresponding to an instantaneous luminosity of 1.7×10^34 cm^-2 s^-1. The ID track seeding takes about 30% of the event processing time on the CPU and is accelerated by about a factor of 5 on the GPU; as a result, throughput increases by about 35% with GPU acceleration for up to 14 Athena processes. The Calorimeter clustering algorithm takes about 8% of the event processing time on the CPU and is accelerated by about a factor of 2 on the GPU; however, the effect of the acceleration is offset by a small increase in the time of the non-accelerated code, so a small decrease in speed is observed when offloading to the GPU.
png eps pdf
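
As a rough cross-check of the gains quoted above, the expected throughput change from offloading a fraction of the per-event time can be estimated with Amdahl's law. The minimal sketch below uses the approximate fractions and speed-up factors from the caption and neglects the extra overhead in the non-accelerated code:

<verbatim>
# Rough Amdahl's-law estimate of the throughput gain from GPU offloading.
# f : fraction of the per-event CPU time spent in the offloaded algorithm
# s : speed-up factor of that algorithm on the GPU
# The numbers are the approximate values quoted in the caption above.

def throughput_gain(f, s):
    """Relative throughput increase when a fraction f of the work is sped up by s."""
    return 1.0 / ((1.0 - f) + f / s) - 1.0

# Inner Detector track seeding: ~30% of the event time, ~5x faster on GPU
print("ID  : ~{:.0%} gain".format(throughput_gain(0.30, 5.0)))   # ~32%, consistent with the ~35% observed

# Calorimeter clustering: ~8% of the event time, ~2x faster on GPU
print("Calo: ~{:.0%} gain".format(throughput_gain(0.08, 2.0)))   # ~4%, easily offset by the added overhead
</verbatim>

The ~32% estimate for ID is consistent with the observed ~35% gain, while the ~4% estimate for Calo shows why the small increase in the non-accelerated code is enough to cancel the benefit.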
Event throughput rates with and without GPU acceleration as a function of the number of ATLAS trigger (Athena) processes running on the CPU. Separate tests were performed with Athena configured to execute only Inner Detector tracking (ID), only Calorimeter topological clustering (Calo), or both (ID & Calo). The system was configured either to perform the work on the CPU or to offload it to one or two GPUs. The system consisted of two Intel Xeon E5-2695 v3 14-core CPUs with a clock speed of 2.30 GHz and two NVidia GK210GL GPUs in a Tesla K80 module. The input was a simulated ttbar dataset converted to a raw detector output format (bytestream). An average of 46 minimum-bias events per simulated collision were superimposed, corresponding to an instantaneous luminosity of 1.7×10^34 cm^-2 s^-1. A significant rate increase is seen when the ID track seeding is offloaded to the GPU: the seeding takes about 30% of the event processing time on the CPU and is accelerated by about a factor of 5 on the GPU. A small rate decrease is observed when the calorimeter clustering is offloaded to the GPU: the clustering takes about 8% of the event processing time on the CPU and is accelerated by about a factor of 2 on the GPU, but the effect of the acceleration is offset by a small increase in the time of the non-accelerated code. There is only a relatively small increase in rate when the number of Athena processes is increased above the number of physical cores (28).
png eps pdf
The time-averaged mean number of ATLAS trigger (Athena) processes in a wait state, pending the return of work offloaded to the GPU, as a function of the number of Athena processes running on the CPU. Separate tests were performed with Athena configured to execute only Inner Detector tracking (ID), only Calorimeter topological clustering (Calo), or both (ID & Calo). The system was configured to offload work to one or two GPUs. The system consisted of two Intel Xeon E5-2695 v3 14-core CPUs with a clock speed of 2.30 GHz and two NVidia GK210GL GPUs in a Tesla K80 module. The input was a simulated ttbar dataset converted to a raw detector output format (bytestream). An average of 46 minimum-bias events per simulated collision were superimposed, corresponding to an instantaneous luminosity of 1.7×10^34 cm^-2 s^-1. When offloaded to the GPU, the ID track seeding takes about 8% of the total event processing time, so the average number of Athena processes waiting is less than 1 for up to about 12 Athena processes. The offloaded calorimeter clustering takes about 4% of the total event processing time, so the average number of Athena processes waiting is less than 1 for up to about 25 Athena processes.
png eps pdf
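
A back-of-the-envelope estimate of these wait-state numbers: if each of N identical Athena processes spends a fraction w of its wall time waiting for the GPU, the time-averaged number of waiting processes is roughly N × w while the GPU is far from saturated. A minimal sketch using the fractions quoted in the caption:

<verbatim>
# Estimate of the mean number of Athena processes waiting on the GPU.
# Valid while the GPU is far from saturated, i.e. queueing effects are ignored.

def mean_waiting(n_processes, wait_fraction):
    return n_processes * wait_fraction

# ID track seeding offloaded: ~8% of the event time -> <1 waiting up to ~12 processes
print(mean_waiting(12, 0.08))   # ~0.96

# Calorimeter clustering offloaded: ~4% of the event time -> <1 waiting up to ~25 processes
print(mean_waiting(25, 0.04))   # ~1.0
</verbatim>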
Breakdown of the time per event for Inner Detector track seeding offloaded to a GPU, showing the time fraction for the kernels running on the GPU (GPU execution) and the overhead associated with offloading the work (other). Track seeding consists of the formation of triplets of hits compatible with a track. The overhead comprises the time to convert data structures between CPU and GPU data formats, the data-transfer time between CPU and GPU, and the Inter-Process Communication (IPC) time that accounts for the transfer of data between the ATLAS trigger (Athena) processes and the process handling communication with the GPU. The system consisted of two Intel Xeon E5-2695 v3 14-core CPUs with a clock speed of 2.30 GHz and two NVidia GK210GL GPUs in a Tesla K80 module. Measurements were made using one GPU and with 12 Athena processes running on the CPU. The input was a simulated ttbar dataset converted to a raw detector output format (bytestream). An average of 46 minimum-bias events per simulated collision were superimposed, corresponding to an instantaneous luminosity of 1.7×10^34 cm^-2 s^-1.
png pdf
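
The offloading overhead described above follows a client-server pattern: each Athena process converts its data, sends it over IPC to a single process that owns the GPU, and blocks until the result comes back. The sketch below illustrates only that message flow, with Python multiprocessing queues standing in for the real IPC mechanism; it is not the actual trigger implementation.

<verbatim>
# Minimal sketch of the offload pattern: a worker sends its (already converted)
# input over IPC to a single process owning the GPU and waits for the result.
# The GPU work is faked with a sleep; everything here is illustrative only.
import multiprocessing as mp
import time

def gpu_server(requests, replies):
    """Single process owning the GPU; serves offload requests from the workers."""
    while True:
        payload = requests.get()
        if payload is None:          # shutdown signal
            break
        time.sleep(0.001)            # stands in for the GPU kernel execution
        replies.put(len(payload))    # stands in for the result sent back

if __name__ == "__main__":
    requests, replies = mp.Queue(), mp.Queue()
    server = mp.Process(target=gpu_server, args=(requests, replies))
    server.start()

    t0 = time.perf_counter()
    requests.put(b"\x00" * 1024)     # "converted" event data (data conversion not shown)
    result = replies.get()           # the worker blocks here: the wait state / IPC overhead
    print("offload round trip: %.3f ms" % (1e3 * (time.perf_counter() - t0)))

    requests.put(None)               # stop the server
    server.join()
</verbatim>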
Breakdown of the time per event for Inner Detector tracking offloaded to a GPU, showing the time fraction for the Counting, Doublet Making and Triplet Making kernels running on the GPU (GPU execution) and the overhead associated with offloading the work (other). The Counting kernel determines the number of pairs of Inner Detector hits, and the Doublet and Triplet Making kernels form combinations of 2 and 3 hits, respectively, compatible with a track. The overhead comprises the time to convert data structures between CPU and GPU data formats, the data-transfer time between CPU and GPU, and the Inter-Process Communication (IPC) time that accounts for the transfer of data between the ATLAS trigger (Athena) processes and the process handling communication with the GPU. The system consisted of two Intel Xeon E5-2695 v3 14-core CPUs with a clock speed of 2.30 GHz and two NVidia GK210GL GPUs in a Tesla K80 module. Measurements were made using one GPU and with 12 Athena processes running on the CPU. The input was a simulated ttbar dataset converted to a raw detector output format (bytestream). An average of 46 minimum-bias events per simulated collision were superimposed, corresponding to an instantaneous luminosity of 1.7×10^34 cm^-2 s^-1.
png pdf
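
For illustration, the combinatorial structure that the Counting, Doublet Making and Triplet Making kernels parallelise can be written as plain (CPU) Python. The data layout and cut values below are hypothetical and chosen only to show the structure; in the real kernels the separate Counting pass presumably allows the GPU output buffers to be sized before the doublets are stored.

<verbatim>
# Plain-Python sketch of spacepoint doublet/triplet formation.  Spacepoints are
# dicts with x, y, z, r (transverse radius) and layer; the cut values are
# hypothetical.  Phi wrap-around is ignored for brevity.
import itertools, math

def count_and_make_doublets(spacepoints, max_dz_dr=3.0):
    """'Counting' + 'Doublet Making': pairs of hits on adjacent layers, crude polar-angle cut."""
    spacepoints = sorted(spacepoints, key=lambda s: s["layer"])
    doublets = []
    for a, b in itertools.combinations(spacepoints, 2):
        if b["layer"] != a["layer"] + 1:
            continue
        dr = b["r"] - a["r"]
        if dr > 0 and abs((b["z"] - a["z"]) / dr) < max_dz_dr:
            doublets.append((a, b))
    return len(doublets), doublets

def make_triplets(doublets, max_dphi=0.05):
    """'Triplet Making': join doublets sharing their middle hit, crude curvature (delta-phi) cut."""
    by_first_hit = {}
    for first, second in doublets:
        by_first_hit.setdefault(id(first), []).append(second)
    triplets = []
    for inner, middle in doublets:
        for outer in by_first_hit.get(id(middle), []):
            dphi1 = math.atan2(middle["y"] - inner["y"], middle["x"] - inner["x"])
            dphi2 = math.atan2(outer["y"] - middle["y"], outer["x"] - middle["x"])
            if abs(dphi2 - dphi1) < max_dphi:
                triplets.append((inner, middle, outer))
    return triplets
</verbatim>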
Breakdown of the time per event for the ATLAS trigger process (Athena) running Inner Detector (ID) track seeding on the CPU or offloaded to a GPU, showing the time fraction for the Counting, Doublet Making and Triplet Making kernels running on the GPU (GPU execution) and the overhead associated with offloading the work (other). The Counting kernel determines the number of pairs of ID hits, and the Doublet and Triplet Making kernels form combinations of 2 and 3 hits, respectively, compatible with a track. The overhead comprises the time to convert data structures between CPU and GPU data formats, the data-transfer time between CPU and GPU, and the Inter-Process Communication (IPC) time that accounts for the transfer of data between the Athena processes and the process handling communication with the GPU. The system consisted of two Intel Xeon E5-2695 v3 14-core CPUs with a clock speed of 2.30 GHz and two NVidia GK210GL GPUs in a Tesla K80 module. Measurements were made with one GPU and 12 Athena processes running on the CPU. Athena was configured to run only ID tracking. The input was a simulated ttbar dataset converted to a raw detector output format (bytestream). An average of 46 minimum-bias events per simulated collision were superimposed, corresponding to an instantaneous luminosity of 1.7×10^34 cm^-2 s^-1.
png pdf
Breakdown of the time per event for Calorimeter clustering offloaded to a GPU, showing the time fraction for the kernels running on the GPU (GPU execution) and the overhead associated with offloading the work (other). The overhead comprises the time to convert data structures between CPU and GPU data formats, the data-transfer time between CPU and GPU, and the Inter-Process Communication (IPC) time that accounts for the transfer of data between the ATLAS trigger (Athena) processes and the process handling communication with the GPU. The system consisted of two Intel Xeon E5-2695 v3 14-core CPUs with a clock speed of 2.30 GHz and two NVidia GK210GL GPUs in a Tesla K80 module. Measurements were made using one GPU and with 14 Athena processes running on the CPU. The input was a simulated ttbar dataset converted to a raw detector output format (bytestream). An average of 46 minimum-bias events per simulated collision were superimposed, corresponding to an instantaneous luminosity of 1.7×10^34 cm^-2 s^-1.
png pdf
Breakdown of the time per event for Calorimeter clustering offloaded to a GPU, showing the time fraction for the Classification, Tagging and Growing kernels running on the GPU (GPU execution) and the overhead associated with offloading the work (other). The Classification kernel identifies calorimeter cells that will initiate (seed), propagate (grow) or terminate a cluster; the Tagging kernel assigns a unique tag to seed cells; and the Growing kernel associates neighbouring growing or terminating cells to form clusters. The overhead comprises the time to convert data structures between CPU and GPU data formats, the data-transfer time between CPU and GPU, and the Inter-Process Communication (IPC) time that accounts for the transfer of data between the ATLAS trigger (Athena) processes and the process handling communication with the GPU. The system consisted of two Intel Xeon E5-2695 v3 14-core CPUs with a clock speed of 2.30 GHz and two NVidia GK210GL GPUs in a Tesla K80 module. Measurements were made using one GPU and with 14 Athena processes running on the CPU. The input was a simulated ttbar dataset converted to a raw detector output format (bytestream). An average of 46 minimum-bias events per simulated collision were superimposed, corresponding to an instantaneous luminosity of 1.7×10^34 cm^-2 s^-1.
png pdf
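
The three kernels above implement a cellular-automaton version of topological clustering. The sketch below shows the same Classification / Tagging / Growing logic in plain Python over a generic cell graph; the signal-to-noise thresholds (4, 2, 0) are the values conventionally used for topo-clustering and are assumed here for illustration, as is the data layout.

<verbatim>
# Illustrative sketch of the Classification / Tagging / Growing steps.  Cells are
# dicts with id, E (cell energy) and noise; 'neighbours' maps a cell id to the
# list of neighbouring cell dicts.  The real GPU kernels run one thread per cell.

def classify(cells, t_seed=4.0, t_grow=2.0, t_term=0.0):
    """Label each cell as seed, grow or terminal according to |E|/noise."""
    for c in cells:
        snr = abs(c["E"]) / c["noise"]
        c["role"] = ("seed" if snr > t_seed else
                     "grow" if snr > t_grow else
                     "term" if snr > t_term else None)

def tag_seeds(cells):
    """Give every seed cell a unique cluster tag, ordered by decreasing energy."""
    seeds = sorted((c for c in cells if c["role"] == "seed"),
                   key=lambda c: c["E"], reverse=True)
    for tag, c in enumerate(seeds):
        c["tag"] = tag

def grow(cells, neighbours):
    """Iteratively propagate tags from tagged cells to untagged grow/term neighbours."""
    changed = True
    while changed:
        changed = False
        for c in cells:
            if c.get("tag") is None and c["role"] in ("grow", "term"):
                tags = [n["tag"] for n in neighbours[c["id"]]
                        if n.get("tag") is not None and n["role"] in ("seed", "grow")]
                if tags:
                    c["tag"] = min(tags)   # lowest tag = most energetic neighbouring seed
                    changed = True
</verbatim>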
Breakdown of the time per event for the ATLAS trigger process (Athena) running Calorimeter clustering on the CPU and offloaded to a GPU, showing the time for the Classification, Tagging and Growing kernels running on the GPU (GPU execution) and the overhead associated with offloading the work (other). The Classification kernel identifies calorimeter cells that will initiate (seed), propagate (grow) or terminate a cluster; the Tagging kernel assigns a unique tag to seed cells; and the Growing kernel associates neighbouring growing or terminating cells to form clusters. The overhead comprises the time to convert data structures between CPU and GPU data formats, the data-transfer time between CPU and GPU, and the Inter-Process Communication (IPC) time that accounts for the transfer of data between the ATLAS trigger (Athena) processes and the process handling communication with the GPU. There is a small increase in the execution time of the non-accelerated code when the calorimeter clustering is offloaded to the GPU. The system consisted of two Intel Xeon E5-2695 v3 14-core CPUs with a clock speed of 2.30 GHz and two NVidia GK210GL GPUs in a Tesla K80 module. Measurements were made using one GPU and with 14 Athena processes running on the CPU. Athena was configured to run only Calorimeter clustering. The input was a simulated ttbar dataset converted to a raw detector output format (bytestream). An average of 46 minimum-bias events per simulated collision were superimposed, corresponding to an instantaneous luminosity of 1.7×10^34 cm^-2 s^-1.
png pdf

ATL-COM-DAQ-2016-119 Performance plots of HLT Inner Detector tracking algorithm implemented on GPU

Transverse impact parameter distributions for the simulated tracks correctly reconstructed by the GPU-accelerated tracking algorithm and by the standard CPU-only algorithm. The reference CPU algorithm was FastTrackFinder, consisting of a track seed (spacepoint triplet) maker and combinatorial track following; the GPU algorithm was FastTrackFinder with a GPU-accelerated track seed maker. The simulated tracks were required to have pT > 1 GeV and |eta| < 2.5.
png eps pdf
Transverse momentum distributions for the simulated tracks correctly reconstructed by the GPU-accelerated tracking algorithm and by the standard CPU-only algorithm. The reference CPU algorithm was FastTrackFinder, consisting of a track seed (spacepoint triplet) maker and combinatorial track following; the GPU algorithm was FastTrackFinder with a GPU-accelerated track seed maker. The simulated tracks were required to have pT > 1 GeV and |eta| < 2.5.
png eps pdf
Track reconstruction efficiency as a function of simulated track azimuth for the GPU-accelerated tracking algorithm and the standard CPU-only algorithm. The reference CPU algorithm was FastTrackFinder, consisting of a track seed (spacepoint triplet) maker and combinatorial track following; the GPU algorithm was FastTrackFinder with a GPU-accelerated track seed maker. The simulated tracks were required to have pT > 1 GeV and |eta| < 2.5.
png eps pdf
Track reconstruction efficiency as a function of simulated track transverse momentum for the GPU-accelerated tracking algorithm and the standard CPU-only algorithm. The reference CPU algorithm was FastTrackFinder, consisting of a track seed (spacepoint triplet) maker and combinatorial track following; the GPU algorithm was FastTrackFinder with a GPU-accelerated track seed maker. The simulated tracks were required to have pT > 1 GeV and |eta| < 2.5.
png eps pdf
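
For reference, the efficiency plotted here is the fraction of simulated tracks (pT > 1 GeV, |eta| < 2.5) that are matched to a correctly reconstructed trigger track, computed per bin of the variable on the x-axis. A minimal sketch of how such a curve is filled; the binning and the container format are assumptions for illustration:

<verbatim>
# Sketch of an efficiency curve: efficiency(x) = N(matched truth tracks in bin) /
# N(truth tracks in bin).  Inputs and binning are illustrative only.
import numpy as np

def efficiency_vs_pt(truth_pt, matched_flags, bins):
    """truth_pt: pT of simulated tracks passing the selection;
       matched_flags: True if the track was correctly reconstructed."""
    truth_pt = np.asarray(truth_pt)
    matched_flags = np.asarray(matched_flags, dtype=bool)
    total, _ = np.histogram(truth_pt, bins=bins)
    passed, _ = np.histogram(truth_pt[matched_flags], bins=bins)
    with np.errstate(invalid="ignore", divide="ignore"):
        eff = np.where(total > 0, passed / total, np.nan)
    return eff

# Example usage (hypothetical arrays): 1-50 GeV in 1 GeV bins
# eff = efficiency_vs_pt(pt_of_truth_tracks, is_matched, bins=np.arange(1, 51, 1))
</verbatim>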

ATL-COM-DAQ-2016-133 Performance plots of HLT algorithms ported to GPU

Cluster Growing algorithm timing. Timing of the Calorimeter Cluster Growing phase of the CPU Topological Clustering (blue line) and the GPU Topological Automaton Cluster (TAC, red dashed line) algorithms for the full detector. The TAC time includes the processing time and the offloading overheads (data conversion and transfer). The execution time of the algorithms was measured using a data sample of QCD di-jet events with leading-jet transverse momentum above 20 GeV and a fixed number of 40 simultaneous interactions per bunch crossing. The Topological Clustering runs on a single CPU core of an AMD FX-8320 processor (3.5 GHz) and the TAC runs on an NVidia GTX650 card.
png eps pdf
Cluster Growing algorithm timing. Timing of the Calorimeter Cluster Growing phase of the CPU Topological Clustering (blue line) and the GPU Topological Automaton Cluster (TAC, red dashed line) algorithms for the full detector. The TAC time includes the processing time and the offloading overheads (data conversion and transfer). The execution time of the algorithms was measured using a data sample of inclusive top quark pair production with 138 simultaneous interactions per bunch crossing. The Topological Clustering runs on a single CPU core of an AMD FX-8320 processor (3.5 GHz) and the TAC runs on an NVidia GTX650 card.
png eps pdf
Timing of the GPU Topological Automaton Cluster (TAC) cluster-conversion overhead (purple line) and of the Cluster Growing (green dashed line). The remaining 5 ms of the TAC total execution time is a constant overhead due to the cell data conversion, data transfer and Inter-Process Communication (IPC). The execution time of the algorithms was measured using a data sample of QCD di-jet events with leading-jet transverse momentum above 20 GeV and a fixed number of 40 simultaneous interactions per bunch crossing. The data conversion runs on a single CPU core of an AMD FX-8320 processor (3.5 GHz) and the Cluster Growing runs on an NVidia GTX650 card.
png eps pdf
Timing of the GPU Topological Automaton Cluster (TAC) cluster-conversion overhead (purple line) and of the Cluster Growing (green dashed line). The remaining 5 ms of the TAC total execution time is a constant overhead due to the cell data conversion, data transfer and Inter-Process Communication (IPC). The execution time of the algorithms was measured using a data sample of inclusive top quark pair production with 138 simultaneous interactions per bunch crossing. The data conversion runs on a single CPU core of an AMD FX-8320 processor (3.5 GHz) and the Cluster Growing runs on an NVidia GTX650 card.
png eps pdf
Relative transverse energy difference of the matched calorimeter clusters reconstructed using the standard CPU cell-clustering algorithm and the logically similar algorithm ported to GPU. These are raw clusters, before the execution of the cluster-splitting algorithm. The algorithms differ in the way they group the less significant cells: on the CPU these cells belong to the first cluster that reaches them, whereas on the GPU they belong to the cluster with the most energetic seed, resulting in the difference observed for lower-ET clusters. The x-axis shows the CPU cluster transverse energy in GeV. The y-axis shows the corresponding transverse energy difference, CPU minus GPU, divided by the CPU cluster transverse energy. Clusters are matched using the group of cluster seed cells, a unique cluster identifier that is independent of the algorithm used. The data sample consists of QCD di-jet events with leading-jet transverse momentum above 20 GeV and a fixed number of 40 simultaneous interactions per bunch crossing. The Topological Clustering runs on a single CPU core of an AMD FX-8320 processor (3.5 GHz) and the TAC runs on an NVidia GTX650 card.
png eps pdf
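
The matching and the quantity plotted above can be summarised with a short sketch: clusters are matched by their set of seed-cell identifiers, which does not depend on how the two algorithms group the non-seed cells, and for each matched pair the relative difference (ET_CPU - ET_GPU) / ET_CPU is plotted against ET_CPU. The cluster container format below is an assumption for illustration.

<verbatim>
# Sketch of the CPU/GPU cluster comparison: match clusters by their seed-cell
# sets and compute the relative transverse-energy difference for each match.

def match_and_compare(cpu_clusters, gpu_clusters):
    """Each cluster: {'seed_cells': set of cell IDs, 'et': transverse energy in GeV}."""
    gpu_by_seeds = {frozenset(c["seed_cells"]): c for c in gpu_clusters}
    points = []
    for cpu in cpu_clusters:
        gpu = gpu_by_seeds.get(frozenset(cpu["seed_cells"]))
        if gpu is not None and cpu["et"] != 0.0:
            points.append((cpu["et"], (cpu["et"] - gpu["et"]) / cpu["et"]))
    return points   # (x, y) pairs for the scatter plot
</verbatim>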


Responsible: JohnBaines, TomaszBold
Subject: public
