Difference: AtlasEdinburghGPUComputing (22 vs. 23)

Revision 23 - 2014-09-12 - AndrewWashbrook

Line: 1 to 1
META TOPICPARENT name="AtlasEdinburghGroupUpgradeSoftware"

ATLAS Edinburgh GPU Computing

Line: 129 to 129


  • All source code for the exercises is available here: tar.gz


  • Source code: simple.cpp
  • To compile: g++ -o simple.exe -fopenmp simple.cpp

  • Modify simple.cpp to use OpenMP directives to enable multiple threads to execute the cout statement in parallel.
  • Modify the cout statement to print the ID of the calling thread in the parallel region.
  • Fix the number of threads to 4 using the appropriate OpenMP environment variable.
  • Repeat the above but instead use an appropriate OpenMP runtime function.
  • Why is the output always mangled? (Note: you can fix this in a later example)
  • Why is the thread index not in sequential order?
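
As a starting point for the tasks above, here is a minimal sketch of a parallel region in which each thread reports its own ID. The helper name `hello_from_threads` and the message text are illustrative assumptions and are not taken from simple.cpp. The team size is requested with the runtime function `omp_set_num_threads`. Setting the `OMP_NUM_THREADS` environment variable before launching the program is the environment-variable alternative.

```cpp
#include <cassert>
#include <iostream>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks so the sketch also builds without -fopenmp.
inline int omp_get_thread_num() { return 0; }
inline void omp_set_num_threads(int) {}
#endif

// Hypothetical helper: runs a parallel region with the requested team size
// and returns how many times each thread ID executed the region body.
std::vector<int> hello_from_threads(int requested_threads) {
    assert(requested_threads > 0);
    omp_set_num_threads(requested_threads);  // runtime alternative to OMP_NUM_THREADS
    std::vector<int> seen(requested_threads, 0);
    #pragma omp parallel
    {
        const int id = omp_get_thread_num();  // ID of the calling thread in the team
        // No synchronisation here, so text from different threads may interleave
        // and the IDs appear in whatever order the threads reach this line.
        std::cout << "hello from thread " << id << std::endl;
        seen[id] += 1;  // each thread writes only its own slot
    }
    return seen;
}
```

Calling `hello_from_threads(4)` from `main` and compiling with `-fopenmp` prints up to four lines, because the runtime may supply fewer threads than requested.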


  • Source code: src/main.cpp
  • To compile: run make
  • To execute: bin/schedule

  • This code generates a configurable “sawtooth”-like workload (i.e. a load imbalance between threads) so that the relative performance of OpenMP loop schedule types can be evaluated.
  • Use OpenMP directives to parallelise the loop contained in the main section of the code.
  • Compile and measure the execution time using 1, 2, 4 and 8 threads (using the timer provided).
  • Capture the thread allocation pattern by recording the number of calls made by each thread in a suitable array. Display the array contents to show the distribution of work amongst threads for each schedule type.
  • Apply different schedule types and chunk sizes to the loop construct to demonstrate the effect on execution time. Which type did you find to be optimal?
  • Modify the SAW_LENGTH and SAW_WIDTH variables to study how schedulers perform under different load imbalances.
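
The bookkeeping described above could be sketched as follows. This is not the code in src/main.cpp: `sawtooth_work`, `ScheduleResult` and `run_with_dynamic_schedule` are hypothetical names, and the workload is only an assumed stand-in for a sawtooth-shaped imbalance. The sketch shows one way to attach a `schedule` clause to the loop and to count the iterations each thread received.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks so the sketch also builds without -fopenmp.
inline int omp_get_thread_num() { return 0; }
inline int omp_get_max_threads() { return 1; }
#endif

// Assumed stand-in workload: cost rises within each period of saw_length
// iterations and then drops back to zero.
double sawtooth_work(int i, int saw_length) {
    double x = 0.0;
    const int steps = (i % saw_length) * 1000;
    for (int k = 0; k < steps; ++k) x += 1.0 / (k + 1.0);
    return x;
}

struct ScheduleResult {
    std::vector<int> calls;  // iterations executed by each thread
    double checksum;         // keeps the work from being optimised away
};

ScheduleResult run_with_dynamic_schedule(int n, int saw_length, int chunk) {
    assert(n >= 0 && saw_length > 0 && chunk > 0);
    std::vector<int> calls(omp_get_max_threads(), 0);
    double checksum = 0.0;
    // Swap schedule(dynamic, chunk) for static or guided to compare them.
    #pragma omp parallel for schedule(dynamic, chunk) reduction(+ : checksum)
    for (int i = 0; i < n; ++i) {
        checksum += sawtooth_work(i, saw_length);
        calls[omp_get_thread_num()] += 1;  // each thread updates only its own slot
    }
    return ScheduleResult{calls, checksum};
}
```

Printing `calls` after each run shows how many iterations each thread executed. With a static schedule the counts stay equal even when the work is not.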


Triangular matrix addition is performed on two square matrices of equal size (20 x 20).

  • Source code: src/main.cpp
  • To compile: run make
  • To execute: bin/triangular

  • Parallelise the outermost loop and run with an increasing number of threads.
  • How does the code perform compared to the serial version? Experiment with larger matrix sizes and observe the relative timing measurements.
  • Enable the choice of schedule used in the loop to be determined at runtime. How does dynamic compare to static (and guided)?
  • Add a new variable in the loop to sum up all the array elements of Result. Ensure that the total value remains consistent across repeated executions of the code.
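
The last two points can be illustrated with a short sketch. It assumes row-major storage in `std::vector` and addition over the upper triangle, neither of which is confirmed by the exercise source, and `triangular_add` is a hypothetical name. `schedule(runtime)` defers the choice of schedule to the `OMP_SCHEDULE` environment variable, and the `reduction` clause gives each thread a private partial sum so the total does not depend on thread timing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Adds the upper triangles of two n x n row-major matrices and returns the
// sum of every element written to result (layout and triangle are assumed).
double triangular_add(const std::vector<double>& a, const std::vector<double>& b,
                      std::vector<double>& result, int n) {
    const std::size_t expected = static_cast<std::size_t>(n) * n;
    assert(a.size() == expected && b.size() == expected && result.size() == expected);
    double total = 0.0;
    // The schedule is chosen at run time, e.g. OMP_SCHEDULE="dynamic,2".
    #pragma omp parallel for schedule(runtime) reduction(+ : total)
    for (int i = 0; i < n; ++i) {
        for (int j = i; j < n; ++j) {  // row i holds n - i elements: uneven work
            result[i * n + j] = a[i * n + j] + b[i * n + j];
            total += result[i * n + j];
        }
    }
    return total;
}
```

Without the `reduction` clause, `total` would be shared and updated concurrently, which is one way the sum can differ between runs.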


  • Example from EPCC Advanced OpenMP course

The code generates a grid of points in a box of the complex plane containing the upper half of the (symmetric) Mandelbrot Set. Each point c is then iterated using the map z → z² + c for a finite number of iterations (2000). If the threshold condition |z| > 2 is satisfied within that number of iterations, the point is considered to be outside the Mandelbrot Set. Counting the points inside and outside the Set then gives an estimate of its area.
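
The estimate described above can be sketched as follows. This is not a copy of area.c: the box limits (real part in [-2.0, 0.5], imaginary part in [0, 1.125]), the mid-cell sampling and the function names are assumptions made for illustration, and the iteration used is the standard Mandelbrot map z → z*z + c. The outer loop carries a `reduction` on the count of outside points.

```cpp
#include <cassert>

// True if c = (cr, ci) reaches |z| > 2 within max_iter iterations of z = z*z + c.
bool escapes(double cr, double ci, int max_iter) {
    double zr = cr, zi = ci;
    for (int it = 0; it < max_iter; ++it) {
        const double next_zr = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = next_zr;
        if (zr * zr + zi * zi > 4.0) return true;  // |z|^2 > 4 means |z| > 2
    }
    return false;
}

// Estimates the area of the set from an npoints x npoints grid (assumed box).
double mandelbrot_area(int npoints, int max_iter) {
    assert(npoints > 0 && max_iter > 0);
    const double width = 2.5, height = 1.125;  // box covering the upper half
    long num_outside = 0;
    #pragma omp parallel for reduction(+ : num_outside)
    for (int i = 0; i < npoints; ++i) {
        for (int j = 0; j < npoints; ++j) {
            const double cr = -2.0 + width * (i + 0.5) / npoints;
            const double ci = height * (j + 0.5) / npoints;
            if (escapes(cr, ci, max_iter)) ++num_outside;
        }
    }
    const double total = static_cast<double>(npoints) * npoints;
    // The factor 2 accounts for the mirrored lower half of the set.
    return 2.0 * width * height * (total - num_outside) / total;
}
```

A call such as `mandelbrot_area(2000, 2000)` matches the iteration count quoted above. Smaller grids run faster but give a coarser estimate.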

  • Source code: area.c
  • To compile: g++ -o area -fopenmp area.c
  • Parallelise the outer loop using a parallel for directive and declare all shared, private and reduction variables.
  • Check that the results are consistent, and alter the schedule clause to measure performance across an increasing number of threads.
  • Rewrite this example using OpenMP tasks. You could try any of the following methods:
    • Make the computation of each point a task, and use one thread only to generate the tasks.
    • Make each row of points a task.
    • Have all threads generate tasks.
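
One possible shape for the row-per-task variant, with a single thread generating the tasks, is sketched below. The grid coordinates, the helper names `escapes`, `outside_in_row` and `count_outside_with_tasks`, and the per-row result array are all assumptions for illustration rather than the structure of area.c.

```cpp
#include <cassert>
#include <vector>

// True if c = (cr, ci) reaches |z| > 2 within max_iter iterations of z = z*z + c.
bool escapes(double cr, double ci, int max_iter) {
    double zr = cr, zi = ci;
    for (int it = 0; it < max_iter; ++it) {
        const double next_zr = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = next_zr;
        if (zr * zr + zi * zi > 4.0) return true;
    }
    return false;
}

// Counts the points outside the set in row i of an npoints x npoints grid
// (assumed box: real part in [-2.0, 0.5], imaginary part in [0, 1.125]).
long outside_in_row(int i, int npoints, int max_iter) {
    const double cr = -2.0 + 2.5 * (i + 0.5) / npoints;
    long count = 0;
    for (int j = 0; j < npoints; ++j) {
        const double ci = 1.125 * (j + 0.5) / npoints;
        if (escapes(cr, ci, max_iter)) ++count;
    }
    return count;
}

long count_outside_with_tasks(int npoints, int max_iter) {
    assert(npoints > 0 && max_iter > 0);
    std::vector<long> per_row(npoints, 0);
    #pragma omp parallel
    {
        #pragma omp single  // one thread creates the tasks, the team executes them
        {
            for (int i = 0; i < npoints; ++i) {
                #pragma omp task firstprivate(i) shared(per_row)
                per_row[i] = outside_in_row(i, npoints, max_iter);
            }
        }  // the implicit barrier here waits for all tasks to finish
    }
    long total = 0;
    for (long c : per_row) total += c;
    return total;
}
```

Writing each row's count to its own array slot avoids the need for an atomic update. Accumulating into a shared counter under `#pragma omp atomic` would be an alternative design.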

Simple revisited

  • Get a copy of simple.cpp from the previous session and ensure that the “hello” statements are printed one at a time, so that the output is no longer mangled (but not necessarily in thread order).
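
One way to serialise the output is a `critical` construct, sketched below. The function name `say_hello_serialised` and the message text are assumptions. The stream is passed in as a parameter only so that the output can be captured for checking, whereas simple.cpp itself writes to cout.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>  // only needed when capturing output into a std::ostringstream
#include <string>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallback so the sketch also builds without -fopenmp.
inline int omp_get_thread_num() { return 0; }
#endif

// Writes one "hello" line per thread to out and returns the number of lines.
int say_hello_serialised(std::ostream& out) {
    int lines = 0;
    #pragma omp parallel
    {
        const int id = omp_get_thread_num();
        // Only one thread at a time may execute the critical block, so whole
        // lines are written without interleaving. The order is still arbitrary.
        #pragma omp critical
        {
            out << "hello from thread " << id << '\n';
            ++lines;  // protected by the same critical section
        }
    }
    return lines;
}
```

Calling `say_hello_serialised(std::cout)` prints one intact line per thread. Passing a `std::ostringstream` instead captures the same text for inspection.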

ATLAS use case: Clusterisation example

  • Provided by Ben Wynne
  • Apply OpenMP methods discussed in the slides and the exercises above to a simple clusterisation algorithm (i.e. a method of assembling adjacent deposits of charge in a detector into space points)
  • The example algorithm and optimisation methods are discussed in detail here: slides

Analysis Tools Studies

Analysis Acceleration in TMVA using GPU Computing

Line: 203 to 270
META FILEATTACHMENT attachment="C_Stepper.cpp" attr="" comment="RK4 stepper code used to compare timings" date="1321962664" name="C_Stepper.cpp" path="C++_Stepper.cpp" size="5106" stream="C++_Stepper.cpp" tmpFilename="/usr/tmp/CGItemp38533" user="wash" version="1"
META FILEATTACHMENT attachment="CUDA_Stepper.cu" attr="" comment="CUDA code for RK4 stepper" date="1321962710" name="CUDA_Stepper.cu" path="CUDA_Stepper.cu" size="8866" stream="CUDA_Stepper.cu" tmpFilename="/usr/tmp/CGItemp38558" user="wash" version="1"
META FILEATTACHMENT attachment="TMVA-GPU.tar.gz" attr="" comment="TMVA code with GPU-based MLP method" date="1365009912" name="TMVA-GPU.tar.gz" path="TMVA-GPU.tar.gz" size="2843579" user="wash" version="1"
META FILEATTACHMENT attachment="ParallelTutorial.pdf" attr="" comment="" date="1410517364" name="ParallelTutorial.pdf" path="ParallelTutorial.pdf" size="106979" user="wash" version="1"
META FILEATTACHMENT attachment="OpenMPExercises.tar.gz" attr="" comment="" date="1410517364" name="OpenMPExercises.tar.gz" path="OpenMPExercises.tar.gz" size="2320147" user="wash" version="1"