Multi Threading Task Force Entry Point

With the public release of Geant4 version 10.0 (December 2013) and the successful integration of multi-threading capabilities, the task force has been absorbed by the Run, Event and Detector Description Working Group that takes over the task-force responsibilities.
This page will remain available and will continue to be updated.
This is the twiki page for the Geant4 Multi Threading Task Force working group. The task force was created by the Geant4 SB in February 2013 with a two-year mandate and the following charges:
- collect all the relevant expertise and act as the center of gravity,
- drive the collaboration-wide effort of MT migration, and
- drive the collaboration-wide effort of documenting MT-related matters.
The Task Force coordinator is Andrea Dotti. The task force has an e-group, geant4-mt-developers, open to Geant4 collaboration members.
Announcements
Geant4 Version 10.2 was released on 6 December 2015. It can be downloaded from the Geant4 website. Consult the Release Notes for a list of updates and announcements regarding multi-threading capabilities in Geant4.
Manuals have been updated and can be downloaded and consulted from the User Documentation pages. Note that the manuals linked from this twiki have been integrated into the official Geant4 documentation. In the coming months the content of these pages will evolve to keep track of new developments and improvements in Geant4 Version 10.X.Y. Please refer to the User Documentation for information regarding Version 10.2.
Documentation and manuals
This section contains links to the current documentation regarding multi-threading. Please note that these pages are work in progress. An edited version of these pages has been merged into the Geant4 documentation manuals for Geant4 Version 10.2.
How-to multithreading for kernel developers
Geant4MTForKernelDevelopers includes documentation on how to modify and develop Geant4 code taking into account MT requirements: what split classes are, thread safety via thread-local storage, and the design of the relevant classes.
A page with details for the Hadronic WG is available here: HadronicsMTNotes
How-to multithreading for application developers
Geant4MTForApplicationDevelopers includes documentation on how to migrate user code so that it can be used with a multi-threaded build of Geant4.
QuickMigrationGuideForGeant4V10 contains the twiki version of the previous link that has become the content of the official Geant4 documentation.
Geant4 on Xeon Phi
A tentative manual to run Geant4 on Xeon Phi architectures can be found here: XeonPhiSupport
Results
Not yet fully updated with Geant4 Version 10.1 Results 
This section contains a few highlights of physics and performance results obtained with Geant4 Version 10.2 with multi-threading enabled. Note that the results should be considered preliminary. You are welcome to contact Andrea Dotti if you need more information. If you need a copy of these results for a presentation or note, you should contact Andrea Dotti and Makoto Asai for permission.
Physics Performances
Physics performance has been evaluated by comparing the simulation outputs of a simplified version of LHC calorimeters obtained with Geant4 Version 10.0.beta with multi-threading against the same simulation application run with the sequential version of Geant4. As expected, the results are statistically compatible. This set of plots compares the simulation obtained with showers of 20 GeV protons impinging on a W/liquid Ar sampling calorimeter (red sequential, blue multi-threaded). Some extracts are shown here: energy deposited in the calorimeter (total of absorber and active material), the spectrum of all secondary neutrons produced in the shower, and a similar spectrum for pions.


The following plot compares the response, as a function of pion beam energy, of three different versions of Geant4 against test-beam data (NB: in this case 10.0.beta-cand02-MT refers to a multi-threaded build of Geant4, but using a sequential version of the application).
More information can be found here.
Reproducibility
Reproducibility has been studied for the simplified calorimeter setups. Weak reproducibility is confirmed for multi-threaded Geant4: if a job is started twice with the same initial random seed, the status of the RNG engines is the same at the end of both jobs.
Strong reproducibility is confirmed in the majority of cases: re-seeding a sequential application during the event loop with the random seed of a particular event obtained with the multi-threaded application reproduces the RNG engine status in 100% of the cases. The optional radioactive decay module violates reproducibility, and this problem is being addressed. A possible non-reproducibility when using the NeutronHP module is also being investigated.
More information can be found here.
Note: The results of this presentation are very old (version 10.0.beta) and not updated, but slides 8 and 9 contain details about our reproducibility tests.
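The two notions of reproducibility above can be illustrated with a toy model. The sketch below is not Geant4 code: it assumes that per-event seeds are drawn from a single master generator and that each event re-seeds its own engine, and it emulates worker threads with a simple round-robin loop. All function names (event_seeds, simulate_event, run) are hypothetical.

```python
import random

def event_seeds(master_seed, n_events):
    """Draw one seed per event from a single master generator (assumed scheme)."""
    master = random.Random(master_seed)
    return [master.getrandbits(32) for _ in range(n_events)]

def simulate_event(seed, n_steps=100):
    """Toy 'event': re-seed a private engine, consume random numbers,
    and return a result together with the final engine state."""
    rng = random.Random(seed)
    deposit = sum(rng.random() for _ in range(n_steps))
    return deposit, rng.getstate()

def run(master_seed, n_events, n_workers):
    """Emulate workers with round-robin event assignment (no real threads)."""
    seeds = event_seeds(master_seed, n_events)
    results = {}
    for worker in range(n_workers):
        for event_id in range(worker, n_events, n_workers):
            results[event_id] = simulate_event(seeds[event_id])
    return seeds, results

# Same master seed, different number of emulated workers.
seeds_a, results_a = run(master_seed=42, n_events=20, n_workers=1)
seeds_b, results_b = run(master_seed=42, n_events=20, n_workers=4)

# Replay a single event "sequentially" from its recorded seed.
replayed = simulate_event(seeds_b[7])
```

In this model the result and final engine state of every event depend only on that event's seed, so results_a and results_b agree event by event, and replaying event 7 from its seed reproduces the state recorded in the emulated multi-threaded run.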
CPU and Memory Performances
There are two metrics that are important for multi-threaded applications:
- Linear speedup of throughput: the ideal speedup is linear with the number of threads (twice as many events can be produced in the same amount of time if the number of threads doubles). Since threads compete for resources, the real-world speedup is less than linear.
- Absolute throughput of 1 thread with respect to sequential Geant4: threads have an overhead that may reduce the absolute performance of the same application when multi-threading is activated.
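As a minimal sketch of how the two metrics above can be computed from measured throughputs, consider the following. The numbers are hypothetical placeholders, not Geant4 results, and the function names are illustrative only.

```python
def speedup(throughputs, baseline):
    """Throughput speedup relative to a baseline throughput (events/s)."""
    return {n: t / baseline for n, t in throughputs.items()}

def efficiency(throughputs, baseline):
    """Speedup divided by the number of threads (1.0 means perfectly linear)."""
    return {n: t / (n * baseline) for n, t in throughputs.items()}

# Hypothetical measurements in events/s; NOT real Geant4 results.
sequential = 10.0                            # sequential build
mt = {1: 9.8, 2: 19.4, 4: 38.0, 8: 74.0}     # MT build, keyed by thread count

# Metric 1: linearity of the speedup, relative to the 1-thread MT run.
linearity = efficiency(mt, baseline=mt[1])

# Metric 2: absolute throughput of 1 MT thread relative to sequential.
one_thread_ratio = mt[1] / sequential
```

Keeping the two baselines separate avoids mixing threading overhead (metric 2) with scaling losses (metric 1).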
For Geant4 Version 10.0 and 10.1 the collaboration has concentrated on maximizing speedup linearity, while absolute throughput has not been optimized yet. However, results show that the absolute throughput of Geant4 Version 10.1 with or without multi-threading enabled is very similar. In addition, linearity of speedup is maintained even for a large number of threads (>90% efficiency).
The most recent results can be obtained on the Geant4 Profiling and Benchmarking Results page in section 6.
The performance of development versions of Geant4 Version 10.0 has also been measured on Intel architecture and on Intel Xeon Phi (MIC) architectures, showing good linearity for a large number of threads (see XeonPhiSupport for compilation instructions):
Memory consumption as a function of the number of threads has also been measured for the Intel Xeon Phi architecture (results show a comparison of the latest public releases):
The following table shows a preliminary intercomparison of different architectures. Numbers are raw throughput obtained for a single computing unit (in general a CPU, a full card for MIC):
Finally, the following plot shows an intercomparison of the performance obtained with the same setup and application when changing compilation options:
Weak Vs Strong Scaling: how to measure MT performances
When measuring speedup as a function of the number of threads in MT mode, it is important to specify whether the measurements are performed for strong or weak scaling. Strong scaling measures the time needed to perform the simulation with the total number of events kept fixed, independently of the number of threads; weak scaling keeps the number of events per thread fixed.
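The two definitions lead to different speedup formulas. A minimal sketch, assuming wall-clock times T(n) have been measured for n threads (the timings below are hypothetical, not measurements): in strong scaling the speedup is T(1)/T(n), while in weak scaling the work grows with n, so the throughput speedup is n*T(1)/T(n).

```python
def strong_scaling_speedup(wall_times):
    """Fixed total number of events: speedup(n) = T(1) / T(n)."""
    t1 = wall_times[1]
    return {n: t1 / t for n, t in wall_times.items()}

def weak_scaling_speedup(wall_times):
    """Fixed events per thread: n threads process n times more events,
    so the throughput speedup is n * T(1) / T(n)."""
    t1 = wall_times[1]
    return {n: n * t1 / t for n, t in wall_times.items()}

# Hypothetical wall-clock times in seconds, keyed by number of threads.
strong_times = {1: 1000.0, 2: 520.0, 4: 280.0}   # same total events in each run
weak_times = {1: 100.0, 2: 102.0, 4: 105.0}      # same events per thread in each run

strong = strong_scaling_speedup(strong_times)
weak = weak_scaling_speedup(weak_times)
```

Reporting which of the two functions was used makes speedup numbers from different studies comparable.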
In general, strong scaling shows a larger deviation from ideal linearity than weak scaling because, simplifying, the job cannot be declared finished until the slowest thread completes. This last-events peeling is effectively a sequential part of the code (all other threads have already finished), and it introduces a stronger deviation from linearity. Depending on how events are scheduled to threads and on the nature of the events to be simulated (complex and long vs simple and fast), the differences may be large, as demonstrated in the following two plots, showing the exact same simulation (CMS geometry, 50 GeV pions) in the two regimes.
Weak Scaling:
Strong Scaling (fewer data points have been considered in this case):
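The effect of the slowest thread in strong scaling can be reproduced with a toy scheduler. The sketch below is an illustration under simple assumptions, not the Geant4 scheduling code: each free thread picks up the next event in order, and event durations are drawn from an exponential distribution chosen arbitrarily to mimic a mix of short and long events.

```python
import heapq
import random

def makespan(durations, n_threads):
    """Wall time of a job when each free thread takes the next event in order."""
    free_at = [0.0] * n_threads          # time at which each thread becomes free
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)   # earliest available thread
        heapq.heappush(free_at, start + d)
    return max(free_at)                  # the job ends when the slowest thread ends

# Fixed total number of events (strong scaling); toy durations, arbitrary units.
rng = random.Random(1234)
durations = [rng.expovariate(1.0) for _ in range(200)]
sequential_time = sum(durations)

strong_speedup = {n: sequential_time / makespan(durations, n)
                  for n in (1, 2, 4, 8, 16)}
```

A tiny hand-checkable case shows the tail effect: with two threads, makespan([3.0, 1.0, 1.0, 1.0], 2) is 3.0 (the ideal value), while makespan([1.0, 1.0, 1.0, 3.0], 2) is 4.0, because the long event is started last and one thread sits idle while it completes.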
There is no single correct way to measure speedup; it depends on the application type. We usually prefer to report results obtained in the weak scaling regime because they are more representative of the speedup of the throughput (events/sec) and are less sensitive to the specific type of application and events being simulated. In publications reporting this number, please specify whether weak or strong scaling was measured.
Acknowledgements: The results of this page have been produced by S.Y. Jun (FNAL), A. Ribon, G. Lestaris and W. Pokorski (CERN) and A. Dotti (SLAC).
The old version of this page (used during developments of Geant4 Version 10.0.beta), is available at revision 71 (check History at the bottom of this page).