Difference: CodeAnalysisTools (1 vs. 132)

Revision 132 2019-06-17 - MarcoCattaneo

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 612 to 612
  Also, much more information is available if the "-g" option is used. Our un-optimised debug (CMTDEB) builds have this, but our optimised (CMTCONFIG) builds do not. If you want to use this with your optimised builds, check out a private version of GaudiPolicy, update the optimisation flags in the requirements file to include "-g" as well as "-O2", and rebuild the libraries you wish to profile. Note that this will increase the size of the binaries, so if you have many libraries and a small AFS user quota you might need to increase it (email Joel Closier) or use another system without the strict AFS quotas (such as home institute systems).
Deleted:
<
<

Performance and Regression testing service

The Performance and Regression testing service can be accessed at https://lhcb-pr.web.cern.ch/lhcb-pr/. An introduction to its design and capabilities was given in this talk by Emmanouil Kiagias at the Core Software meeting on 24th April 2013.

 

Further reading

Revision 131 2019-04-25 - MichelDeCian

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 216 to 216
 gaudirun.py -T --profilerName=valgrindcallgrind --profilerExtraOptions="__cache-sim=yes __branch-sim=yes __instr-atstart=no -v __smc-check=all-non-file __dump-instr=yes __trace-jump=yes" joboptions.py |& tee profile.log
Added:
>
>

Profiling specific lines of code

Callgrind can be told to only profile certain lines of code, for example within a function. The following steps are needed:
  • Get the header files:
    local_valgrind.h, local_callgrind.h 
    e.g. from
     /cvmfs/lhcb.cern.ch/lib/lhcb/GAUDI/GAUDI_v30r5/GaudiProfiling/src/component/valgrind/local_valgrind.h 
  • Add the following to your code:
    #include "local_callgrind.h"
    // ...
    CALLGRIND_START_INSTRUMENTATION;
    // some code
    CALLGRIND_STOP_INSTRUMENTATION;
    
  • run with
      ../run gaudirun.py --profilerName=valgrindcallgrind --profilerExtraOptions="__smc-check=all-non-file  __dump-instr=yes __trace-jump=yes __instr-atstart=no (other options)" profileMyCode.py
    
  • Do not use CallgrindProfile at the same time, as it seems to interfere with this method.
 

Memory Usage Monitoring

The valgrind tool "massif" also exists, which performs detailed memory usage monitoring. Full documentation of this tool is available in section 9 of the valgrind user guide.

Revision 130 2019-04-23 - RosenMatev

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 239 to 239
 This link is a snapshot of the Atlas valgrind TWiki, which contains some useful information.

Debugging gaudirun.py on Linux with gdb

Deleted:
<
<
 The easiest way to debug with gdb is to use the built-in --gdb flag of gaudirun.py
Changed:
<
<
> gaudirun.py --gdb yourOptions.py

Note: Currently, the default gdb on lxplus is too old to be useful for gcc 4.8 (or later) builds. There is a JIRA task to make newer gdb available on login. Until it is finished, please use gdb from CVMFS or AFS by one of:

export PATH=/cvmfs/lhcb.cern.ch/lib/contrib/gdb/7.11/x86_64-slc6-gcc49-opt/bin:$PATH
>
>
gaudirun.py --gdb yourOptions.py
This picks up gdb from LCG. (LCG 95 comes with version 8.2.1, while older releases have 7.12.1.)
 
Changed:
<
<
or
>
>
Note: If possible use a debug build (or at least a build with debug symbols) as that will provide more information. This can be done by running, for example,
 
Changed:
<
<
export PATH=/afs/cern.ch/sw/lcg/external/gdb/7.11/x86_64-slc6-gcc48-opt/bin:$PATH
>
>
LbLogin -c x86_64-centos7-gcc8-dbg    # for a debug build
LbLogin -c x86_64-centos7-gcc8-opt+g  # for an optimized build with debug symbols
 
Added:
>
>

Running with a different gdb

 Alternatively, gaudirun.py applications can be run through the gdb debugger using a similar trick to the one used with valgrind: call python directly and pass gaudirun.py as an argument. Just type
Changed:
<
<
> gdb --args python `which gaudirun.py` yourOptions.py
>
>
gdb --args python `which gaudirun.py` yourOptions.py
 and then, to run the application, simply type run at the gdb command line. (The option --args tells gdb to interpret any additional options after the executable name as arguments to that application, instead of the default, which is to try to interpret them as core files.)

When it crashes, type where to get a traceback at that point.

Changed:
<
<
If possible use a debug build (see below) as that will provide more information. This can be done by running
>
>
Note: Currently, the default gdb on lxplus is too old to be useful for gcc 4.8 (or later) builds. You can either use your system gdb or use one from cvmfs with
 
Changed:
<
<
> LbLogin -c $CMTDEB

before SetupProject etc.

>
>
export PATH=/cvmfs/lhcb.cern.ch/lib/contrib/gdb/7.11/x86_64-slc6-gcc49-opt/bin:$PATH
 

Attaching GDB to a running process

Revision 129 2019-02-22 - MarcoCattaneo

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 8 to 8
 

LHCb dedicated Profiling and Regression test service

Changed:
<
<
LHCbPR is LHCb's own project to track performance issues in the general framework and the projects. Its aim is to improve the refactoring process during 2013-2015 and help developers to obtain performing code. Development of LHCbPR is ongoing.
>
>
LHCbPR is LHCb's own project to track performance issues in the general framework and the projects. Its aim is to help developers to write well-performing code.
 

Simplifying the problem, with GaudiDiff and GaudiExcise

Revision 128 2018-08-28 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 551 to 551
 + 10.21% 0.62% 07_io_bound_evl libstdc++.so.6.0.24 [.] malloc@plt

This example also highlights one open issue with profiling C++ code, which is that the function names of C++ methods in idiomatic libraries such as the STL or boost can be gigantic, far away from the API names that are used in the code, and in general hard to read. Unfortunately, there is no good solution to this problem, the best that one can do is usually to look for interesting keywords in the long-winded C++ name (_Hashtable and _M_insert in the example above) and try to associate them with specific patterns in the corresponding function's code.

Added:
>
>
Note that if you prefer to sort your call graph by caller rather than the (default) callee, use

perf report -g 'graph,0.5,caller'
 

This concludes this short introduction to perf. Here, we have only scratched the surface of what perf can do. Other interesting topics could have included...

Revision 127 2018-06-26 - HadrienGrasland

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 487 to 487
  There are several ways to obtain a call graph, each with different advantages and drawbacks:
Changed:
<
<
  • The best method in every respect, when it is available, is to use the Last Branch Record (LBR) hardware facility for this purpose. But this measurement method is only available on recent CPUs (>= Haswell for Intel).
>
>
  • The best method in almost every respect, when it can be used, is to use the Last Branch Record (LBR) hardware facility for this purpose. But this measurement method is only available on recent CPUs (>= Haswell for Intel), and there are hardware limitations on the depth of the call stacks that it can record.
 
  • A universally compatible counterpart is to periodically make a copy of the program's stack and analyze it using the program's DWARF debug information. This is the same method used by the GDB debugger to generate stack traces. Sadly, the need to make stack copies gives this profiling method very bad performance, which means that perf can only measure the program's state rarely, and thus performance profiles must be acquired over much longer periods of time (several minutes) in order to be statistically significant. The profile files will also be much bigger, and slower to analyze.
  • Sometimes, an alternative method based on sampling only the frame pointer of the program can achieve the same result at a much reduced cost, without loss of portability. Unfortunately, there is a very popular compiler performance optimization that breaks this profiling method, and even if you disable it on your code, the libraries that you use will most likely have it enabled. Therefore use of this profiling method is not recommended.
Changed:
<
<
To measure a call graph, pass the "--call-graph=<method>" switch to perf record, where <method> will be either "lbr" or "dwarf" depending on which one your hardware allows you to use. Here, I will assume the availability of LBR-based call graph profiling:
>
>
To measure a call graph, pass the "--call-graph=<method>" switch to perf record, where <method> will be either "lbr" or "dwarf" depending on which one your hardware allows you to use. Here is a DWARF-based version:
 
Changed:
<
<
> perf record --call-graph=lbr <your command> && perf report
>
>
> perf record --call-graph=dwarf <your command> && perf report
 [... an entire program execution later ... ] Samples: 208K of event 'cycles:uppp', Event count (approx.): 105972566936 Children Self Command Shared Object Symbol

Revision 126 2018-01-28 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 194 to 194
 
> callgrind_control --instr=off
Changed:
<
<

Alternative approach

>
>

Job Options Driven Profiling

Callgrind can also be configured from within job options, e.g.:

 
Deleted:
<
<
Callgrind can also be configured from within job options:
 
 def addProfile():
     from Configurables import CallgrindProfile
Line: 207 to 208
     p.DumpName = 'CALLGRIND-OUT'
     GaudiSequencer('RecoTrSeq').Members.insert(0, p)
 appendPostConfigAction(addProfile)
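For reference, a complete post-config action might look like the sketch below. The event-range property names (StartFromEventN, StopAtEventN, DumpAtEventN) and the choice of GaudiSequencer('RecoTrSeq') are assumptions, to be adapted to your own job and checked against the CallgrindProfile configurable.

 # Sketch of a full job options fragment for CallgrindProfile.
 # The event-range property names are assumptions; check them against the configurable.
 from Gaudi.Configuration import appendPostConfigAction
 from Configurables import CallgrindProfile, GaudiSequencer

 def addProfile():
     p = CallgrindProfile('CallgrindProfile')
     p.StartFromEventN = 20    # start callgrind instrumentation at this event
     p.StopAtEventN = 100      # stop the instrumentation again at this event
     p.DumpAtEventN = 100      # write out the callgrind dump at this event
     p.DumpName = 'CALLGRIND-OUT'
     # Insert the profiler at the front of the sequence to be profiled
     GaudiSequencer('RecoTrSeq').Members.insert(0, p)

 appendPostConfigAction(addProfile)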
Added:
>
>

Then to run use the following command line.

 
Changed:
<
<
gaudirun.py --profilerName=valgrindcallgrind --profilerExtraOptions="__instr-atstart=no -v __smc-check=all-non-file __dump-instr=yes __trace-jump=yes" joboptions.py |& tee out-0005.log
>
>
gaudirun.py -T --profilerName=valgrindcallgrind --profilerExtraOptions="__cache-sim=yes __branch-sim=yes __instr-atstart=no -v __smc-check=all-non-file __dump-instr=yes __trace-jump=yes" joboptions.py |& tee profile.log
 

Memory Usage Monitoring

Revision 125 2017-12-12 - MarcoCattaneo

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 592 to 592
 

Further reading

Added:
>
>
 
Line: 600 to 601
 
Changed:
<
<

ChrisRJones - 27 Feb 2006 MarcoCattaneo - 11 Mar 2013
>
>

ChrisRJones - 27 Feb 2006 MarcoCattaneo - 2017-12-12
 
META FILEATTACHMENT attachment="Atlas_UsingValgrind.pdf" attr="h" comment="Snapshot of Atlas valgrind TWiKi, taken on 10th March 2013" date="1363024926" name="Atlas_UsingValgrind.pdf" path="Atlas_UsingValgrind.pdf" size="172163" user="cattanem" version="1"
META TOPICMOVED by="ChrisRJones" date="1161806930" from="LHCb.CodeProfiling" to="LHCb.CodeAnalysisTools"

Revision 124 2017-12-06 - GraemeAStewart

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 395 to 395
  4.431148695 seconds time elapsed
Changed:
<
<
Here you can see that we start to get interesting information about the use of CPU caches. For this particular program, the cache usage pattern is that we rarely go out of the first-level CPU cache (L1), let alone all the way to the last-level cache (LLC), but that when we do reach for that one, we often need to go all the way to main memory. As main memory accesses are around 50x more costly than first-level cache accesses (the latency order of magnitudes being ~3 cpu cycles for L1 vs ~150 cycles for main memory), it is often useful to carefully examine both numbers. In this case they are ultimately a concern: even when re-scaled with this order of magnitude in mind, our last-level cache misses still have negligible impact compared to the common case of L1 cache hits.
>
>
Here you can see that we start to get interesting information about the use of CPU caches. For this particular program, the cache usage pattern is that we rarely go out of the first-level CPU cache (L1), let alone all the way to the last-level cache (LLC), but that when we do reach for that one, we often need to go all the way to main memory. As main memory accesses are around 50x more costly than first-level cache accesses (the latency order of magnitudes being ~3 cpu cycles for L1 vs ~150 cycles for main memory), it is often useful to carefully examine both numbers. In this case they are ultimately not a concern: even when re-scaled with this order of magnitude in mind, our last-level cache misses still have negligible impact compared to the common case of L1 cache hits.
  Another thing to pay attention to here is the new column of percentages on the right. CPU performance monitoring counters have some hardware limitations, the most important of which being that you can only monitor a small set of them at any given time. Here, because we are looking at a lot of different statistics at once, perf was forced to monitor only a subset of them at a time and constantly switch between them. It then interpolates the missing data samples, which has some overhead and reduces the quality of the measurement. The percentage tells you during which fraction of the total measurement time the corresponding performance counter was actually active.

Revision 123 2017-12-02 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 560 to 560
  VTune uses the same performance analysis techniques as perf, but is commercially supported by Intel. This comes with different trade-offs: you must buy a very expensive (~1k$) license if you want to use it on your personal computer, but as long as you are able to rely on the licenses that are provided by CERN Openlab (or perhaps your institution), you will be able to enjoy a very nice and powerful graphical user interface, high-quality support and documentation from Intel, and periodical tutorials from Openlab. Obviously, you shouldn't expect it to work reliably on any CPU which has not been manufactured by Intel.
Changed:
<
<
A Gaudi auditor has been provided by Sascha Mazurov to interface to the VTune Intel profiler. See IntelProfiler, IntelProfilerExample and Video tutorial on profiler installation in Gaudi, running and analyzing it from command line (without GUI)
>
>
To use the tools, see the instructions at https://twiki.cern.ch/twiki/bin/view/Openlab/IntelTools.

A Gaudi auditor has been provided by Sascha Mazurov to interface to the VTune Intel profiler. See IntelProfiler, IntelProfilerExample and Video tutorial on profiler installation in Gaudi, running and analyzing it from the command line (without GUI). Note that the Profiler is now part of the main Gaudi project, so it is no longer necessary to manually check it out from gitlab. It is not built by default though, so to enable it you must first set up the environment as per the instructions above and then build your checkout of the Gaudi project.
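As a rough illustration, the auditor can be enabled from job options along the lines of the sketch below. This is not the definitive recipe: the property names (StartFromEventN, StopAtEventN, IncludeAlgorithms) and the sequence name are assumptions, to be checked against the IntelProfilerExample page.

 # Sketch only: property and sequence names are assumptions, see IntelProfilerExample.
 from Configurables import IntelProfilerAuditor, AuditorSvc, ApplicationMgr

 profiler = IntelProfilerAuditor()
 profiler.StartFromEventN = 100   # skip the first events (initialisation, warm-up)
 profiler.StopAtEventN = 500      # stop collecting here
 profiler.IncludeAlgorithms = ["GaudiSequencer/RecoTrSeq"]  # hypothetical sequence to profile
 AuditorSvc().Auditors += [profiler]
 ApplicationMgr().AuditAlgorithms = True  # make sure algorithm auditing is active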

 
Deleted:
<
<
See https://twiki.cern.ch/twiki/bin/view/Openlab/IntelTools
 

Memory Profiling with Jemalloc

Since Gaudi v26r3 it's possible to use Jemalloc profiling tools to audit memory allocations. Instructions can be found in Gaudi doxygen pages.
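As an illustration only, the corresponding job options could look roughly like the sketch below. The JemallocProfile algorithm and its property names are assumptions here and should be checked against the Gaudi doxygen instructions; the job must also be run with the jemalloc library preloaded, as described there.

 # Sketch: dump jemalloc heap profiles for a window of events.
 # Algorithm and property names are assumptions; see the Gaudi doxygen pages.
 from Gaudi.Configuration import appendPostConfigAction
 from Configurables import JemallocProfile, GaudiSequencer

 def addJemalloc():
     jp = JemallocProfile()
     jp.StartFromEventN = 10   # first event to profile
     jp.StopAtEventN = 60      # last event to profile
     jp.DumpPeriod = 10        # hypothetical: dump a heap profile every N events
     GaudiSequencer('RecoTrSeq').Members.insert(0, jp)  # hypothetical sequence name

 appendPostConfigAction(addJemalloc)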

Revision 122 2017-12-01 - GraemeAStewart

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 323 to 323
  All the previously discussed performance analysis tools are often unable to provide a precise quantitative analysis of what happens as a program is executed on a real CPU, for different reasons:
Changed:
<
<
  • Valgrind essentially works by emulating the execution of the program on a virtual CPU. This artificially inflates the cost of CPU computations with respect to other operations (such as IO) by more than an order of magnitude, and entails that performance analysis must be based on a mathematical model of a CPU, which is in practice quite far off from what modern Intel CPUs actually do.
>
>
  • Valgrind essentially works by emulating the execution of the program on a virtual CPU. This artificially inflates the cost of CPU computations with respect to other operations (such as I/O) by more than an order of magnitude, and entails that performance analysis must be based on a mathematical model of a CPU, which is in practice quite far off from what modern Intel CPUs actually do.
 
  • Google's profiler, like other user-space sampling profilers (gprof, igprof...), is only able to tell where a program spends its time, and not why. For example, it cannot tell where CPU cache misses are happening, which complicates memory layout optimizations.
Changed:
<
<
  • Neither of these tools is able to analyze the time spent in the operating system kernel, which is important for assessing the impact of blocking IO operations or lock contention in multi-threaded code.
>
>
  • Neither of these tools is able to analyze the time spent in the operating system kernel, which is important for assessing the impact of blocking I/O operations or lock contention in multi-threaded code.
  A more precise analysis of program execution on a real machine can be obtained from tools which leverage the Performance Monitoring Counters of modern CPUs, such as the "perf" profiler of the Linux kernel or Intel's VTune Amplifier. These tools provide an accurate and detailed picture of what is going on in the CPU as a program is executing, and have a negligible impact on the performance of the program under study when used correctly.
Line: 335 to 335
  The perf profiler is a free and open source program which builds on the perf_events interface that has been integrated in the Linux kernel since Linux 2.6.31. It is highly recommended to use it with as recent a Linux kernel release as possible (at least 3.x) for the following reasons:
Changed:
<
<
  • Early 2.6.x version had some very nasty bugs, causing system lock-up for example.
>
>
  • Early 2.6.x versions had some very nasty bugs, causing system lock-up, for example.
 
  • Due to the way it operates, perf requires CPU-specific support code. This means in particular that you are unlikely to be able to leverage your CPU's full performance monitoring capabilities if your Linux kernel version is older than your CPU model.
  • Perf is evolving quickly, and new versions can also bring massive improvements in features, usability and performance.
Line: 370 to 370
  4.429156950 seconds time elapsed
Changed:
<
<
The output of perf stat contains raw statistics (on the left) and some interpretations of the numbers (on the right). Here, we can see that the program under study is not multi-threaded (as the right column points out, only 1 CPU is utilized), but makes reasonably efficient use of the single CPU core that it runs on (at 1.91 instructions per cycle, we're not too far away from the theoretical Haswell maximum for this code). One thing which this wiki page cannot expose is that if a performance number is abnormally bad, perf will also helpfully highlight it using color in your terminal.
>
>
The output of perf stat contains raw statistics (on the left) and some interpretations of the numbers (on the right). Here, we can see that the program under study is not multi-threaded (as the right column points out, only 1 CPU is utilized), but makes reasonably efficient use of the single CPU core that it runs on (at 1.91 instructions per cycle, we're not too far away from the theoretical Haswell maximum for this code). If a performance number is abnormally bad, perf will helpfully highlight it using color in your terminal (which we can't show in this wiki).
  We can ask perf for more statistics using the "-d" command line switch:
Line: 395 to 395
  4.431148695 seconds time elapsed
Changed:
<
<
Here, you can see that we start to get interesting information about the use of CPU caches. For this particular program, the cache usage pattern is that we rarely go out of the first-level CPU cache (L1), let alone all the way to the last-level cache (LLC), but that when we do reach for that one, we often need to go all the way to main memory. As main memory accesses are around 50x more costly than first-level cache accesses (the latency order of magnitudes being ~3 cpu cycles for L1 vs ~150 cycles for main memory), it is often useful to carefully examine both numbers. However, here, they are ultimately not concerning: even when re-scaled with this order of magnitude in mind, our last-level cache misses still have negligible impact compared to the common case of L1 cache hits.
>
>
Here you can see that we start to get interesting information about the use of CPU caches. For this particular program, the cache usage pattern is that we rarely go out of the first-level CPU cache (L1), let alone all the way to the last-level cache (LLC), but that when we do reach for that one, we often need to go all the way to main memory. As main memory accesses are around 50x more costly than first-level cache accesses (the latency order of magnitudes being ~3 cpu cycles for L1 vs ~150 cycles for main memory), it is often useful to carefully examine both numbers. In this case they are ultimately a concern: even when re-scaled with this order of magnitude in mind, our last-level cache misses still have negligible impact compared to the common case of L1 cache hits.
 
Changed:
<
<
Another thing to pay attention to here is the new column of percentages on the right. CPU performance monitoring counters have some hardware limitations, the most important of which being that you can only monitor a small set of them at any given time. Here, because we are looking at a lot of different statistics at once, perf was forced to only monitor a subset of them at a time and constantly switch between them, then interpolate the missing data samples, which has some overhead and reduces the quality of the measurement. The percentage tells you during which fraction of the total measurement time the corresponding performance counter was actually active.
>
>
Another thing to pay attention to here is the new column of percentages on the right. CPU performance monitoring counters have some hardware limitations, the most important of which being that you can only monitor a small set of them at any given time. Here, because we are looking at a lot of different statistics at once, perf was forced to monitor only a subset of them at a time and constantly switch between them. It then interpolates the missing data samples, which has some overhead and reduces the quality of the measurement. The percentage tells you during which fraction of the total measurement time the corresponding performance counter was actually active.
  If you know exactly which performance counters you are interested in, you can get more precise measurements by asking perf to only measure these ones, using the "-e" command line switch:
Line: 414 to 414
  4.404909465 seconds time elapsed
Changed:
<
<
As you can see, the L1-dcache-load-misses counter was now active 50% of the time, instead of 25% before, which means that we aggregated twice as much performance statistics over the same program running time. However, we lost the other performance counters. This is usually a good second step after a potential performance problem has been identified in the "generic" perf stat output.
>
>
As you can see, the L1-dcache-load-misses counter was now active 50% of the time, instead of 25% before, which means that we aggregated twice as many performance statistics over the same program run time. However, we lost the other performance counters. This is usually a good second step after a potential performance problem has been identified in the "generic" perf stat output.
  You can get a full list of the (numerous) CPU performance counters supported by perf using the "perf list" command.
Changed:
<
<
Perf stat is very powerful and has very low overhead, but it only gives you coarse-grained information. Often, you want to know where your program spends its time, and more importantly why. This information can be measured using the "perf record" and "perf report" commands. The first one analyses the performance of your program by periodically sampling which function your code is executing, and how the CPU performance counters are evolving, then correlating these two informations. The second one displays the resulting statistics in a nice textual user interface.
>
>
Perf stat is very powerful and has very low overhead, but it only gives you coarse-grained information. Often, you want to know where your program spends its time, and more importantly why. This information can be measured using the "perf record" and "perf report" commands. The first one analyzes the performance of your program by periodically sampling which function your code is executing and how the CPU performance counters are evolving, then it correlates these two pieces of information. The second one displays the resulting statistics in a nice text based user interface.
  In order to report function names, your program must be compiled with debugging symbols (as enabled by the "-g" GCC flag, or the "Debug" and "RelWithDebInfo" CMake build configurations). To get a profile that is representative of your application's actual performance, you will obviously need to leave compiler optimizations on, which is exactly the kind of scenario that CMake's built-in "RelWithDebInfo" configuration was designed for.
Line: 449 to 449
  This message warns you that perf is not currently allowed to report the names of the functions that you call within the Linux kernel. This ability can be very useful when the performance of your program is limited by system calls, and you want to understand what exactly is going on. If you have administrator rights on your machine, you can enable this feature by writing "0" in the /proc/sys/kernel/kptr_restrict pseudo-file. But we do not need this feature for this short tutorial, and perf can live without it, so we'll do without for now.
Changed:
<
<
So without much ado, let us look at the report:
>
>
So let's look at the report:
 
> perf report
Samples: 207K of event 'cycles:uppp', Event count (approx.): 106619947387                                                                                                                                                                                                                                                     
Line: 472 to 472
 
  1. 10% 07_io_bound_evl libpthread-2.26.so [.] __pthread_mutex_unlock_usercnt
[ ... shortened for brevity ... ]
Changed:
<
<
This profile was acquired on a different program than the one which I ran "perf stat" on at the beginning, and as you can see, this specific program could use more optimization work. It spends about half of its time in memory allocation related functions (malloc, free, and implementation details thereof), which is a common performance problem in idiomatic C++ code.
>
>
This profile was acquired on a different program than the one which "perf stat" ran on at the beginning. This specific program could use more optimization work as it spends about half of its time in memory allocation related functions (malloc, free, and implementation details thereof), which is a common performance problem in idiomatic C++ code.
 
Changed:
<
<
One piece of information which you will notice at the top of the table is that this profile was based on the "cycles" performance counter, which tells how many CPU clock cycles have elapsed. This is the most common performance indicator in early performance analysis, as it tells you where your program spends its time, which is what one is usually initially most interested in. However, you can use any performance counter here, using the same "-e" flag that we used with perf stat before. For example, "perf record -e L1-dcache-load-misses" would show which functions in your code are correlated with the most CPU cache misses.
>
>
One piece of information which is noticeable at the top of the table is that this profile was based on the "cycles" performance counter, which measures how many CPU clock cycles have elapsed. This is the most common performance indicator in early performance analysis, as it highlights where the program spends its time, which is what one is usually most interested in initially. However, one can use any performance counter here, using the same "-e" flag used with perf stat above. For example, "perf record -e L1-dcache-load-misses" would show which functions in your code are correlated with the most CPU cache misses.
 
Changed:
<
<
There is one important piece of information which is missing from the above report, however, and that is the reason why some specific functions were called. When doing performance analysis, it is one thing to know that one is calling malloc() too much, but it is another to know why this happens. In this case, we want to tell who called these memory allocation functions, a piece of information also known as the call graph.
>
>
There is one important piece of information which is missing from the above report, however, and that is the reason why some specific functions were called. When doing performance analysis, it is one thing to know that malloc() is called too often, but it is another to know why this happens. In this case, one wants to know who called these memory allocation functions, a piece of information also known as the call graph.
 
Changed:
<
<
There are several ways to measure a call graph, each with different advantages and drawbacks:
>
>
There are several ways to obtain a call graph, each with different advantages and drawbacks:
 
  • The best method in every respect, when it is available, is to use the Last Branch Record (LBR) hardware facility for this purpose. But this measurement method is only available on recent CPUs (>= Haswell for Intel).
Changed:
<
<
  • A universally compatible counterpart is to periodically make a copy of the program's stack and analyze it using the program's DWARF debug information. This is the same method used by the GDB debugger to generate stack traces. Sadly, the need to make stack copies gives this profiling method very bad performance, which means that perf can only measure the program's state rarely, and thus that performance profiles must be acquired over much longer periods of time (several minutes) in order to be statistically significant. The profile files will also be much bigger, and slower to analyze.
  • Sometimes, an alternative method based on sampling only the frame pointer of the program can achieve the same result at a much reduced cost, without loss of portability. But unfortunately, there is a very popular compiler performance optimization that breaks this profiling method, and even if you disable it on your code, the libraries that you use will most likely have it enabled. Therefore, use of this profiling method is not recommended.
>
>
  • A universally compatible counterpart is to periodically make a copy of the program's stack and analyze it using the program's DWARF debug information. This is the same method used by the GDB debugger to generate stack traces. Sadly, the need to make stack copies gives this profiling method very bad performance, which means that perf can only measure the program's state rarely, and thus performance profiles must be acquired over much longer periods of time (several minutes) in order to be statistically significant. The profile files will also be much bigger, and slower to analyze.
  • Sometimes, an alternative method based on sampling only the frame pointer of the program can achieve the same result at a much reduced cost, without loss of portability. Unfortunately, there is a very popular compiler performance optimization that breaks this profiling method, and even if you disable it on your code, the libraries that you use will most likely have it enabled. Therefore use of this profiling method is not recommended.
  To measure a call graph, pass the "--call-graph=<method>" switch to perf record, where <method> will be either "lbr" or "dwarf" depending on which one your hardware allows you to use. Here, I will assume the availability of LBR-based call graph profiling:
Line: 518 to 518
+ 3.30% 0.00% 07_io_bound_evl 07_io_bound_evloop.exe [.] main [ ... shortened for brevity ... ]
Changed:
<
<
Notice two new things in the report. The first one is the "Children" counter, which tells you which fraction of the elapsed CPU time was spent in a certain function or one of the functions that it calls. This allows you to tell, at a glance, which functions can be held responsible for the most execution time in your program, as opposed to which time was spent inside of each individual function. From a performance analysis perspective, that's a much more interesting information than the previously displayed "self time" alone, which is why perf report automatically sorts functions according to this criterion when you enable call graph profiling.
>
>
There are two new things in the report:
 
Changed:
<
<
The second thing is the little "+" signs in the leftmost column of the report. These signs allow you recursively explore which functions a given function is calling, using the interactive perf report UI. For example, in the following text block, I have explored where the "simulateEventLoop" function spends its time, and found out that a non-negligible fraction of it was spent inserting elements inside of a hash table (itself part of a C++ unordered_set), which in turned caused a nontrivial fraction of my dynamic memory allocations. Another time sink was the liberation of reference-counted data (from an std::shared_ptr), which caused expensive atomic operations and eventual memory liberation.
>
>
  1. The "Children" counter, which says which fraction of the elapsed CPU time was spent in a certain function or one of the functions that it calls. This allows one to tell, at a glance, which functions are responsible for the most of the execution time in your program, as opposed to which time was spent inside of each individual function. From a performance analysis perspective, this is a much more interesting piece of information than the previously displayed "self time" alone, which is why perf report automatically sorts functions according to this criterion when you enable call graph profiling.
  2. The little "+" signs in the leftmost column of the report. These signs allow a recursive exploration of which functions a given function is calling, using the interactive perf report UI. For example, in the following text block, the places where the "simulateEventLoop" function spends its time are expanded, showing that a non-negligible amount of time was spent inserting elements inside a hash table (itself part of a C++ unordered_set), which in turn caused a non-trivial fraction of dynamic memory allocations. Another time sink was the freeing of reference-counted data (from an std::shared_ptr), which caused expensive atomic operations and an eventual memory release.
 
Samples: 208K of event 'cycles:uppp', Event count (approx.): 105972566936                                                                                                                                                                                                                                                     
  Children      Self  Command          Shared Object              Symbol                                                                                                                                                                                                                                                     ?
Line: 543 to 544
 + 14.21% 1.85% 07_io_bound_evl 07_io_bound_evloop.exe [.] std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release ? + 10.21% 0.62% 07_io_bound_evl libstdc++.so.6.0.24 [.] malloc@plt
Changed:
<
<
This example also highlights one open issue with profiling C++ code, which is that the function names of C++ methods in idiomatic libraries such as the STL or boost can be gigantic, far away from the API names that you are used to calling from your code, and in general hard to read. Unfortunately, there is no good solution to this problem, the best that one can do is usually to look for interesting keywords in the long-winded C++ name (_Hashtable and _M_insert in the example above) and try to associate them with specific patterns in the corresponding function's code.
>
>
This example also highlights one open issue with profiling C++ code, which is that the function names of C++ methods in idiomatic libraries such as the STL or boost can be gigantic, far away from the API names that are used in the code, and in general hard to read. Unfortunately, there is no good solution to this problem, the best that one can do is usually to look for interesting keywords in the long-winded C++ name (_Hashtable and _M_insert in the example above) and try to associate them with specific patterns in the corresponding function's code.
 

This concludes this short introduction to perf. Here, we have only scratched the surface of what perf can do. Other interesting topics could have included...

Changed:
<
<
  • Displaying annotated source code and assembly, in order to tell which part of a given function, exactly, takes time (bearing in mind that optimizing compilers can transform the source code of the original function quite tremendously, which makes this analysis somewhat difficult).
>
>
  • Displaying annotated source code and assembly, in order to tell which part of a given function, exactly, takes time (bearing in mind that optimizing compilers can transform the source code of the original function hugely, which makes this analysis somewhat difficult).
 
  • Measuring program activity every N-th occurrence of a given event (e.g. L1 cache miss) instead of periodically, in order to more precisely pinpoint where in the code the event is occurring.
  • The great many performance counters available on modern CPUs, which ones are most useful, and how their values should be interpreted.
  • System-wide profiling, allowing one to study what happens to threads even when they "fall asleep" and call the operating system's kernel for the purpose of performing IO or locking a mutex.

Revision 121 2017-11-30 - HadrienGrasland

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 359 to 359
  Performance counter stats for 'cargo run --release':
Changed:
<
<
4428,370578 task-clock (msec) # 1,000 CPUs utilized
46 context-switches # 0,010 K/sec
0 cpu-migrations # 0,000 K/sec
6 459 page-faults # 0,001 M/sec
15 738 566 590 cycles # 3,554 GHz
30 034 797 373 instructions # 1,91 insn per cycle
2 222 188 760 branches # 501,807 M/sec
88 966 900 branch-misses # 4,00% of all branches
>
>
4428.370578 task-clock (msec) # 1.000 CPUs utilized
46 context-switches # 0.010 K/sec
0 cpu-migrations # 0.000 K/sec
6 459 page-faults # 0.001 M/sec
15 738 566 590 cycles # 3.554 GHz
30 034 797 373 instructions # 1.91 insn per cycle
2 222 188 760 branches # 501.807 M/sec
88 966 900 branch-misses # 4.00% of all branches
 
Changed:
<
<
4,429156950 seconds time elapsed
>
>
4.429156950 seconds time elapsed
  The output of perf stat contains raw statistics (on the left) and some interpretations of the numbers (on the right). Here, we can see that the program under study is not multi-threaded (as the right column points out, only 1 CPU is utilized), but makes reasonably efficient use of the single CPU core that it runs on (at 1.91 instructions per cycle, we're not too far away from the theoretical Haswell maximum for this code). One thing which this wiki page cannot expose is that if a performance number is abnormally bad, perf will also helpfully highlight it using color in your terminal.
Line: 380 to 380
  Performance counter stats for 'cargo run --release':
Changed:
<
<
4425,711186 task-clock (msec) # 0,999 CPUs utilized
          1. context-switches # 0,036 K/sec 0 cpu-migrations # 0,000 K/sec 6 431 page-faults # 0,001 M/sec 15 684 677 897 cycles # 3,544 GHz (50,10%) 29 976 594 100 instructions # 1,91 insn per cycle (62,56%) 2 208 236 648 branches # 498,956 M/sec (62,56%) 88 851 174 branch-misses # 4,02% of all branches (62,60%) 6 042 105 001 L1-dcache-loads # 1365,228 M/sec (62,08%) 11 320 634 L1-dcache-load-misses # 0,19% of all L1-dcache hits (25,06%)
      1. 870 540 LLC-loads # 0,423 M/sec (25,02%) 319 260 LLC-load-misses # 17,07% of all LL-cache hits (37,57%)
>
>
4425.711186 task-clock (msec) # 0.999 CPUs utilized
          1. context-switches # 0.036 K/sec 0 cpu-migrations # 0.000 K/sec 6 431 page-faults # 0.001 M/sec 15 684 677 897 cycles # 3.544 GHz (50.10%) 29 976 594 100 instructions # 1.91 insn per cycle (62.56%) 2 208 236 648 branches # 498.956 M/sec (62.56%) 88 851 174 branch-misses # 4.02% of all branches (62.60%) 6 042 105 001 L1-dcache-loads # 1365.228 M/sec (62.08%) 11 320 634 L1-dcache-load-misses # 0.19% of all L1-dcache hits (25.06%)
      1. 870 540 LLC-loads # 0.423 M/sec (25.02%) 319 260 LLC-load-misses # 17.07% of all LL-cache hits (37.57%)
 
Changed:
<
<
4,431148695 seconds time elapsed
>
>
4.431148695 seconds time elapsed
  Here, you can see that we start to get interesting information about the use of CPU caches. For this particular program, the cache usage pattern is that we rarely go out of the first-level CPU cache (L1), let alone all the way to the last-level cache (LLC), but that when we do reach for that one, we often need to go all the way to main memory. As main memory accesses are around 50x more costly than first-level cache accesses (the latency order of magnitudes being ~3 cpu cycles for L1 vs ~150 cycles for main memory), it is often useful to carefully examine both numbers. However, here, they are ultimately not concerning: even when re-scaled with this order of magnitude in mind, our last-level cache misses still have negligible impact compared to the common case of L1 cache hits.
Line: 407 to 407
  Performance counter stats for 'cargo run --release':
Changed:
<
<
6 023 335 754 L1-dcache-loads (74,90%)
      1. 994 495 L1-dcache-load-misses # 0,15% of all L1-dcache hits (50,16%)
      2. 407 348 LLC-loads (50,09%) 311 839 LLC-load-misses # 22,16% of all LL-cache hits (75,00%)
>
>
6 023 335 754 L1-dcache-loads (74.90%)
      1. 994 495 L1-dcache-load-misses # 0.15% of all L1-dcache hits (50.16%)
      2. 407 348 LLC-loads (50.09%) 311 839 LLC-load-misses # 22.16% of all LL-cache hits (75.00%)
 
Changed:
<
<
4,404909465 seconds time elapsed
>
>
4.404909465 seconds time elapsed
  As you can see, the L1-dcache-load-misses counter was now active 50% of the time, instead of 25% before, which means that we aggregated twice as much performance statistics over the same program running time. However, we lost the other performance counters. This is usually a good second step after a potential performance problem has been identified in the "generic" perf stat output.
Line: 454 to 454
 
> perf report
Samples: 207K of event 'cycles:uppp', Event count (approx.): 106619947387                                                                                                                                                                                                                                                     
Overhead  Command          Shared Object              Symbol                                                                                                                                                                                                                                                                  
Changed:
<
<
20,16% 07_io_bound_evl libc-2.26.so [.] _int_malloc 14,66% 07_io_bound_evl libc-2.26.so [.] _int_free 10,10% 07_io_bound_evl 07_io_bound_evloop.exe [.] std::_Hashtable<int, int, std::allocator, std::__detail::_Identity, std::equal_to, std::hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, tr
  1. ,59% 07_io_bound_evl 07_io_bound_evloop.exe [.] detail::ConditionSlotKnowledge::setupSlot
  2. ,58% 07_io_bound_evl 07_io_bound_evloop.exe [.] boost::detail::nullary_function::impl_type<boost::detail::shared_state_nullary_task<void, boost::detail::invoker<std::_Bind<detail::SingleUseBindWrapper<BenchmarkIOSvc::startConditionIO(int const&, ConditionSlotIteration const&)::{lambda(ConditionSlot
  3. ,85% 07_io_bound_evl libc-2.26.so [.] malloc
  4. ,89% 07_io_bound_evl libc-2.26.so [.] malloc_consolidate
  5. ,62% 07_io_bound_evl 07_io_bound_evloop.exe [.] BenchmarkIOSvc::startConditionIO
  6. ,34% 07_io_bound_evl libc-2.26.so [.] cfree@GLIBC_2.2.5
  7. ,70% 07_io_bound_evl libpthread-2.26.so [.] __pthread_mutex_lock
  8. ,30% 07_io_bound_evl libc-2.26.so [.] tcache_put
  9. ,15% 07_io_bound_evl 07_io_bound_evloop.exe [.] std::vector<detail::ReadySlotPromise, std::allocator<detail::ReadySlotPromise> >::~vector
  10. ,05% 07_io_bound_evl 07_io_bound_evloop.exe [.] std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release
  11. ,91% 07_io_bound_evl libc-2.26.so [.] tcache_get
  12. ,87% 07_io_bound_evl libstdc++.so.6.0.24 [.] operator new
  13. ,10% 07_io_bound_evl libpthread-2.26.so [.] __pthread_mutex_unlock_usercnt
>
>
20.16% 07_io_bound_evl libc-2.26.so [.] _int_malloc 14.66% 07_io_bound_evl libc-2.26.so [.] _int_free 10.10% 07_io_bound_evl 07_io_bound_evloop.exe [.] std::_Hashtable<int, int, std::allocator, std::__detail::_Identity, std::equal_to, std::hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, tr
  1. 59% 07_io_bound_evl 07_io_bound_evloop.exe [.] detail::ConditionSlotKnowledge::setupSlot
  2. 58% 07_io_bound_evl 07_io_bound_evloop.exe [.] boost::detail::nullary_function::impl_type<boost::detail::shared_state_nullary_task<void, boost::detail::invoker<std::_Bind<detail::SingleUseBindWrapper<BenchmarkIOSvc::startConditionIO(int const&, ConditionSlotIteration const&)::{lambda(ConditionSlot
  3. 85% 07_io_bound_evl libc-2.26.so [.] malloc
  4. 89% 07_io_bound_evl libc-2.26.so [.] malloc_consolidate
  5. 62% 07_io_bound_evl 07_io_bound_evloop.exe [.] BenchmarkIOSvc::startConditionIO
  6. 34% 07_io_bound_evl libc-2.26.so [.] cfree@GLIBC_2.2.5
  7. 70% 07_io_bound_evl libpthread-2.26.so [.] __pthread_mutex_lock
  8. 30% 07_io_bound_evl libc-2.26.so [.] tcache_put
  9. 15% 07_io_bound_evl 07_io_bound_evloop.exe [.] std::vector<detail::ReadySlotPromise, std::allocator<detail::ReadySlotPromise> >::~vector
  10. 05% 07_io_bound_evl 07_io_bound_evloop.exe [.] std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release
  11. 91% 07_io_bound_evl libc-2.26.so [.] tcache_get
  12. 87% 07_io_bound_evl libstdc++.so.6.0.24 [.] operator new
  13. 10% 07_io_bound_evl libpthread-2.26.so [.] __pthread_mutex_unlock_usercnt
 [ ... shortened for brevity ... ]

This profile was acquired on a different program than the one which I ran "perf stat" on at the beginning, and as you can see, this specific program could use more optimization work. It spends about half of its time in memory allocation related functions (malloc, free, and implementation details thereof), which is a common performance problem in idiomatic C++ code.

Line: 491 to 491
 [... an entire program execution later ... ] Samples: 208K of event 'cycles:uppp', Event count (approx.): 105972566936 Children Self Command Shared Object Symbol
Changed:
<
<
+ 46,21% 8,88% 07_io_bound_evl 07_io_bound_evloop.exe [.] boost::detail::nullary_function::impl_type<boost::detail::shared_state_nullary_task<void, boost::detail::invoker<std::_Bind<detail::SingleUseBindWrapper<BenchmarkIOSvc::startConditionIO(int const&, ConditionSlotIteration const&)::{lambda(C + 43,04% 0,02% 07_io_bound_evl 07_io_bound_evloop.exe [.] boost::executors::basic_thread_pool::worker_thread + 41,85% 0,04% 07_io_bound_evl 07_io_bound_evloop.exe [.] detail::SequentialScheduler::simulateEventLoop + 39,68% 8,32% 07_io_bound_evl 07_io_bound_evloop.exe [.] detail::ConditionSlotKnowledge::setupSlot + 36,99% 1,90% 07_io_bound_evl libstdc++.so.6.0.24 [.] operator new + 32,67% 6,89% 07_io_bound_evl libc-2.26.so [.] malloc + 26,47% 20,31% 07_io_bound_evl libc-2.26.so [.] _int_malloc + 17,39% 9,89% 07_io_bound_evl 07_io_bound_evloop.exe [.] std::_Hashtable<int, int, std::allocator, std::__detail::_Identity, std::equal_to, std::hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_trai + 15,63% 14,44% 07_io_bound_evl libc-2.26.so [.] _int_free + 14,21% 1,85% 07_io_bound_evl 07_io_bound_evloop.exe [.] std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release + 10,21% 0,62% 07_io_bound_evl libstdc++.so.6.0.24 [.] malloc@plt + 6,11% 0,52% 07_io_bound_evl 07_io_bound_evloop.exe [.] std::_Hashtable<int, int, std::allocator, std::__detail::_Identity, std::equal_to, std::hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_trai + 5,76% 0,45% 07_io_bound_evl 07_io_bound_evloop.exe [.] operator delete@plt + 5,58% 0,01% 07_io_bound_evl 07_io_bound_evloop.exe [.] boost::executors::executor_ref<boost::executors::inline_executor>::submit + 5,54% 0,18% 07_io_bound_evl 07_io_bound_evloop.exe [.] std::_Sp_counted_ptr_inplace<detail::AnyConditionData const, std::allocator<detail::AnyConditionData>, (__gnu_cxx::_Lock_policy)2>::_M_dispose + 5,26% 0,03% 07_io_bound_evl 07_io_bound_evloop.exe [.] boost::detail::shared_state_base::do_continuation + 4,15% 3,97% 07_io_bound_evl libc-2.26.so [.] malloc_consolidate + 4,02% 0,00% 07_io_bound_evl 07_io_bound_evloop.exe [.] boost::detail::future_executor_continuation_shared_state<boost::future, boost::future, ConditionSvc::setupConditions(int const&)::{lambda(boost::future&&)#1}>::launch_continuat + 3,99% 0,01% 07_io_bound_evl 07_io_bound_evloop.exe [.] boost::detail::nullary_function::impl_type<boost::detail::run_it<boost::detail::continuation_shared_state<boost::future, boost::future, ConditionSvc::setupConditions(int const&)::{lambda(boos + 3,97% 0,04% 07_io_bound_evl 07_io_bound_evloop.exe [.] ConditionSvc::setupConditions(int const&)::{lambda(boost::future&&)#1}::operator() + 3,90% 3,63% 07_io_bound_evl 07_io_bound_evloop.exe [.] BenchmarkIOSvc::startConditionIO + 3,43% 3,30% 07_io_bound_evl libc-2.26.so [.] cfree@GLIBC_2.2.5 + 3,34% 0,01% 07_io_bound_evl 07_io_bound_evloop.exe [.] boost::detail::continuation_shared_state<boost::future<std::vector<boost::future, std::allocator<boost::future > > >, ConditionSlotIteration, std::_Bind<detail::SingleUseBindWrapper<ConditionSvc::setupConditions(int const&)::{lambda(boo + 3,32% 0,00% 07_io_bound_evl 07_io_bound_evloop.exe [.] benchmark + 3,30% 0,00% 07_io_bound_evl 07_io_bound_evloop.exe [.] main
>
>
+ 46.21% 8.88% 07_io_bound_evl 07_io_bound_evloop.exe [.] boost::detail::nullary_function::impl_type<boost::detail::shared_state_nullary_task<void, boost::detail::invoker<std::_Bind<detail::SingleUseBindWrapper<BenchmarkIOSvc::startConditionIO(int const&, ConditionSlotIteration const&)::{lambda(C + 43.04% 0.02% 07_io_bound_evl 07_io_bound_evloop.exe [.] boost::executors::basic_thread_pool::worker_thread + 41.85% 0.04% 07_io_bound_evl 07_io_bound_evloop.exe [.] detail::SequentialScheduler::simulateEventLoop + 39.68% 8.32% 07_io_bound_evl 07_io_bound_evloop.exe [.] detail::ConditionSlotKnowledge::setupSlot + 36.99% 1.90% 07_io_bound_evl libstdc++.so.6.0.24 [.] operator new + 32.67% 6.89% 07_io_bound_evl libc-2.26.so [.] malloc + 26.47% 20.31% 07_io_bound_evl libc-2.26.so [.] _int_malloc + 17.39% 9.89% 07_io_bound_evl 07_io_bound_evloop.exe [.] std::_Hashtable<int, int, std::allocator, std::__detail::_Identity, std::equal_to, std::hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_trai + 15.63% 14.44% 07_io_bound_evl libc-2.26.so [.] _int_free + 14.21% 1.85% 07_io_bound_evl 07_io_bound_evloop.exe [.] std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release + 10.21% 0.62% 07_io_bound_evl libstdc++.so.6.0.24 [.] malloc@plt + 6.11% 0.52% 07_io_bound_evl 07_io_bound_evloop.exe [.] std::_Hashtable<int, int, std::allocator, std::__detail::_Identity, std::equal_to, std::hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_trai + 5.76% 0.45% 07_io_bound_evl 07_io_bound_evloop.exe [.] operator delete@plt + 5.58% 0.01% 07_io_bound_evl 07_io_bound_evloop.exe [.] boost::executors::executor_ref<boost::executors::inline_executor>::submit + 5.54% 0.18% 07_io_bound_evl 07_io_bound_evloop.exe [.] std::_Sp_counted_ptr_inplace<detail::AnyConditionData const, std::allocator<detail::AnyConditionData>, (__gnu_cxx::_Lock_policy)2>::_M_dispose + 5.26% 0.03% 07_io_bound_evl 07_io_bound_evloop.exe [.] boost::detail::shared_state_base::do_continuation + 4.15% 3.97% 07_io_bound_evl libc-2.26.so [.] malloc_consolidate + 4.02% 0.00% 07_io_bound_evl 07_io_bound_evloop.exe [.] boost::detail::future_executor_continuation_shared_state<boost::future, boost::future, ConditionSvc::setupConditions(int const&)::{lambda(boost::future&&)#1}>::launch_continuat + 3.99% 0.01% 07_io_bound_evl 07_io_bound_evloop.exe [.] boost::detail::nullary_function::impl_type<boost::detail::run_it<boost::detail::continuation_shared_state<boost::future, boost::future, ConditionSvc::setupConditions(int const&)::{lambda(boos + 3.97% 0.04% 07_io_bound_evl 07_io_bound_evloop.exe [.] ConditionSvc::setupConditions(int const&)::{lambda(boost::future&&)#1}::operator() + 3.90% 3.63% 07_io_bound_evl 07_io_bound_evloop.exe [.] BenchmarkIOSvc::startConditionIO + 3.43% 3.30% 07_io_bound_evl libc-2.26.so [.] cfree@GLIBC_2.2.5 + 3.34% 0.01% 07_io_bound_evl 07_io_bound_evloop.exe [.] boost::detail::continuation_shared_state<boost::future<std::vector<boost::future, std::allocator<boost::future > > >, ConditionSlotIteration, std::_Bind<detail::SingleUseBindWrapper<ConditionSvc::setupConditions(int const&)::{lambda(boo + 3.32% 0.00% 07_io_bound_evl 07_io_bound_evloop.exe [.] benchmark + 3.30% 0.00% 07_io_bound_evl 07_io_bound_evloop.exe [.] main
 [ ... shortened for brevity ... ]

Notice two new things in the report. The first one is the "Children" counter, which tells you which fraction of the elapsed CPU time was spent in a certain function or one of the functions that it calls. This allows you to tell, at a glance, which functions can be held responsible for the most execution time in your program, as opposed to which time was spent inside of each individual function. From a performance analysis perspective, that's a much more interesting information than the previously displayed "self time" alone, which is why perf report automatically sorts functions according to this criterion when you enable call graph profiling.

Line: 524 to 524
 
Samples: 208K of event 'cycles:uppp', Event count (approx.): 105972566936                                                                                                                                                                                                                                                     
  Children      Self  Command          Shared Object              Symbol                                                                                                                                                                                                                                                     ?
Changed:
<
<
+ 46,21% 8,88% 07_io_bound_evl 07_io_bound_evloop.exe [.] boost::detail::nullary_function::impl_type<boost::detail::shared_state_nullary_task<void, boost::detail::invoker<std::_Bind<detail::SingleUseBindWrapper<BenchmarkIOSvc::startConditionIO(int const&, ConditionSlotIteration const&)::{lambda(? + 43,04% 0,02% 07_io_bound_evl 07_io_bound_evloop.exe [.] boost::executors::basic_thread_pool::worker_thread ? - 41,85% 0,04% 07_io_bound_evl 07_io_bound_evloop.exe [.] detail::SequentialScheduler::simulateEventLoop ? - 41,80% detail::SequentialScheduler::simulateEventLoop ? - 37,31% detail::ConditionSlotKnowledge::setupSlot ? - 16,18% std::_Hashtable<int, int, std::allocator, std::__detail::_Identity, std::equal_to, std::hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, true, true> >::_M_insert<int, std::__deta? + 5,18% operator new ? 1,03% operator new@plt ? + 13,28% std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release ? + 3,85% boost::detail::shared_state_base::do_continuation ? + 39,68% 8,32% 07_io_bound_evl 07_io_bound_evloop.exe [.] detail::ConditionSlotKnowledge::setupSlot ? + 36,99% 1,90% 07_io_bound_evl libstdc++.so.6.0.24 [.] operator new ? + 32,67% 6,89% 07_io_bound_evl libc-2.26.so [.] malloc ? + 26,47% 20,31% 07_io_bound_evl libc-2.26.so [.] _int_malloc ? + 17,39% 9,89% 07_io_bound_evl 07_io_bound_evloop.exe [.] std::_Hashtable<int, int, std::allocator, std::__detail::_Identity, std::equal_to, std::hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_tra? + 15,63% 14,44% 07_io_bound_evl libc-2.26.so [.] _int_free ? + 14,21% 1,85% 07_io_bound_evl 07_io_bound_evloop.exe [.] std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release ? + 10,21% 0,62% 07_io_bound_evl libstdc++.so.6.0.24 [.] malloc@plt
>
>
+ 46.21% 8.88% 07_io_bound_evl 07_io_bound_evloop.exe [.] boost::detail::nullary_function::impl_type<boost::detail::shared_state_nullary_task<void, boost::detail::invoker<std::_Bind<detail::SingleUseBindWrapper<BenchmarkIOSvc::startConditionIO(int const&, ConditionSlotIteration const&)::{lambda(? + 43.04% 0.02% 07_io_bound_evl 07_io_bound_evloop.exe [.] boost::executors::basic_thread_pool::worker_thread ? - 41.85% 0.04% 07_io_bound_evl 07_io_bound_evloop.exe [.] detail::SequentialScheduler::simulateEventLoop ? - 41.80% detail::SequentialScheduler::simulateEventLoop ? - 37.31% detail::ConditionSlotKnowledge::setupSlot ? - 16.18% std::_Hashtable<int, int, std::allocator, std::__detail::_Identity, std::equal_to, std::hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, true, true> >::_M_insert<int, std::__deta? + 5.18% operator new ? 1.03% operator new@plt ? + 13.28% std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release ? + 3.85% boost::detail::shared_state_base::do_continuation ? + 39.68% 8.32% 07_io_bound_evl 07_io_bound_evloop.exe [.] detail::ConditionSlotKnowledge::setupSlot ? + 36.99% 1.90% 07_io_bound_evl libstdc++.so.6.0.24 [.] operator new ? + 32.67% 6.89% 07_io_bound_evl libc-2.26.so [.] malloc ? + 26.47% 20.31% 07_io_bound_evl libc-2.26.so [.] _int_malloc ? + 17.39% 9.89% 07_io_bound_evl 07_io_bound_evloop.exe [.] std::_Hashtable<int, int, std::allocator, std::__detail::_Identity, std::equal_to, std::hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_tra? + 15.63% 14.44% 07_io_bound_evl libc-2.26.so [.] _int_free ? + 14.21% 1.85% 07_io_bound_evl 07_io_bound_evloop.exe [.] std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release ? + 10.21% 0.62% 07_io_bound_evl libstdc++.so.6.0.24 [.] malloc@plt
  This example also highlights one open issue with profiling C++ code, which is that the function names of C++ methods in idiomatic libraries such as the STL or boost can be gigantic, far away from the API names that you are used to calling from your code, and in general hard to read. Unfortunately, there is no good solution to this problem; the best one can usually do is to look for interesting keywords in the long-winded C++ name (_Hashtable and _M_insert in the example above) and try to associate them with specific patterns in the corresponding function's code.
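One pragmatic trick (not part of the original example, and assuming a perf.data file is present in the current directory) is to dump the report to text and grep for such keywords, so that the long symbol names can be read in an editor or pager:
> perf report --stdio | grep -E '_Hashtable|_M_insert'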

Revision 1202017-11-28 - HadrienGrasland

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 325 to 325
 
  • Valgrind essentially works by emulating the execution of the program on a virtual CPU. This artificially inflates the cost of CPU computations with respect to other operations (such as IO) by more than an order of magnitude, and entails that performance analysis must be based on a mathematical model of a CPU, which is in practice quite far off from what modern Intel CPUs actually do.
  • Google's profiler, like other user-space sampling profilers (gprof, igprof...), is only able to tell where a program spends its time, and not why. For example, it cannot tell where CPU cache misses are happening, which complicates memory layout optimizations.
Changed:
<
<
  • Neither of these tools are able to monitor the time spent in the operating system kernel, which is important for assessing the impact of blocking IO operations or lock contention in multi-threaded code.
>
>
  • Neither of these tools is able to analyze the time spent in the operating system kernel, which is important for assessing the impact of blocking IO operations or lock contention in multi-threaded code.
  A more precise analysis of program execution on a real machine can be obtained from tools which leverage the Performance Monitoring Counters of modern CPUs, such as the "perf" profiler of the Linux kernel or Intel's VTune Amplifier. These tools provide an accurate and detailed picture of what is going on in the CPU as a program is executing, and have a negligible impact on the performance of the program under study when used correctly.
Line: 395 to 395
  4,431148695 seconds time elapsed
Changed:
<
<
Here, you can see that we start to get interesting information about the use of CPU caches. For this particular program, the cache usage pattern is that we rarely go out of the first-level CPU cache (L1), but when we do, we often need to go all the way to main memory. As main memory accesses are around 50x more costly than first-level cache accesses (the latency order of magnitude being ~3 cpu cycles for L1 vs ~150 cycles for main memory), it is often useful to carefully examine both numbers. However, here, they are ultimately not concerning: even when re-scaled with this order of magnitude in mind, our last-level cache misses still have negligible impact compared to the common case of L1 cache hits.
>
>
Here, you can see that we start to get interesting information about the use of CPU caches. For this particular program, the cache usage pattern is that we rarely go out of the first-level CPU cache (L1), let alone all the way to the last-level cache (LLC), but that when we do reach for that one, we often need to go all the way to main memory. As main memory accesses are around 50x more costly than first-level cache accesses (the latency order of magnitudes being ~3 cpu cycles for L1 vs ~150 cycles for main memory), it is often useful to carefully examine both numbers. However, here, they are ultimately not concerning: even when re-scaled with this order of magnitude in mind, our last-level cache misses still have negligible impact compared to the common case of L1 cache hits.
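As a rough back-of-the-envelope check of that last claim (numbers taken from the perf stat -d output being discussed, and ignoring the latency hiding performed by the out-of-order engine): 319 260 LLC misses x ~150 cycles is about 5 x 10^7 cycles, i.e. well below 1% of the ~1.6 x 10^10 cycles that the program consumed in total, so they can indeed be neglected here.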
  Another thing to pay attention to here is the new column of percentages on the right. CPU performance monitoring counters have some hardware limitations, the most important of which being that you can only monitor a small set of them at any given time. Here, because we are looking at a lot of different statistics at once, perf was forced to only monitor a subset of them at a time and constantly switch between them, then interpolate the missing data samples, which has some overhead and reduces the quality of the measurement. The percentage tells you during which fraction of the total measurement time the corresponding performance counter was actually active.
Line: 516 to 516
 + 3,34% 0,01% 07_io_bound_evl 07_io_bound_evloop.exe [.] boost::detail::continuation_shared_state<boost::future<std::vector<boost::future, std::allocator<boost::future > > >, ConditionSlotIteration, std::_Bind<detail::SingleUseBindWrapper<ConditionSvc::setupConditions(int const&)::{lambda(boo + 3,32% 0,00% 07_io_bound_evl 07_io_bound_evloop.exe [.] benchmark + 3,30% 0,00% 07_io_bound_evl 07_io_bound_evloop.exe [.] main
Deleted:
<
<
+ 2,88% 2,71% 07_io_bound_evl libpthread-2.26.so [.] __pthread_mutex_lock + 2,81% 0,00% 07_io_bound_evl 07_io_bound_evloop.exe [.] boost::detail::nullary_function::impl_type<boost::detail::run_it<boost::detail::continuation_shared_state<boost::future<std::vector<boost::future, std::allocator<boost::future > > >, ConditionSlotIteration, std::_Bind<detail::S + 2,73% 0,21% 07_io_bound_evl 07_io_bound_evloop.exe [.] pthread_mutex_lock@plt + 2,59% 2,32% 07_io_bound_evl libc-2.26.so [.] tcache_put + 2,49% 2,22% 07_io_bound_evl 07_io_bound_evloop.exe [.] std::vector<detail::ReadySlotPromise, std::allocator<detail::ReadySlotPromise> >::~vector + 2,43% 0,87% 07_io_bound_evl 07_io_bound_evloop.exe [.] operator new@plt + 1,95% 1,88% 07_io_bound_evl libc-2.26.so [.] tcache_get + 1,55% 0,01% 07_io_bound_evl 07_io_bound_evloop.exe [.] boost::detail::future_executor_continuation_shared_state<boost::future<std::vector<boost::future, std::allocator<boost::future > > >, ConditionSlotIteration, std::_Bind<detail::SingleUseBindWrapper<ConditionSvc::setupConditions(int cons + 1,42% 0,13% 07_io_bound_evl 07_io_bound_evloop.exe [.] pthread_mutex_unlock@plt + 1,24% 1,12% 07_io_bound_evl libpthread-2.26.so [.] __pthread_mutex_unlock_usercnt + 1,05% 0,01% 07_io_bound_evl 07_io_bound_evloop.exe [.] boost::detail::future_executor_continuation_shared_state<boost::future, void, boost::future<detail::future_union_base<__gnu_cxx::__normal_iterator<boost::future*, std::vector<boost::future, std::allocator<boost::future > > > + 1,02% 0,01% 07_io_bound_evl 07_io_bound_evloop.exe [.] boost::detail::nullary_function::impl_type<boost::detail::run_it<boost::detail::continuation_shared_state<boost::future, void, boost::future<detail::future_union_base<__gnu_cxx::__normal_iterator<boost::future*, std::vector, std::allocator<boost::future > > >::set_value
 [ ... shortened for brevity ... ]

Notice two new things in the report. The first one is the "Children" counter, which tells you which fraction of the elapsed CPU time was spent in a certain function or one of the functions that it calls. This allows you to tell, at a glance, which functions can be held responsible for the most execution time in your program, as opposed to how much time was spent inside each individual function. From a performance analysis perspective, that is much more interesting information than the previously displayed "self time" alone, which is why perf report automatically sorts functions according to this criterion when you enable call graph profiling.

Line: 564 to 551
 
  • Displaying annotated source code and assembly, in order to tell which part of a given function, exactly, takes time (bearing in mind that optimizing compilers can transform the source code of the original function quite tremendously, which makes this analysis somewhat difficult).
  • Measuring program activity every N-th occurrence of a given event (e.g. L1 cache miss) instead of periodically, in order to more precisely pinpoint where in the code the event is occurring.
  • The great many performance counters available on modern CPUs, which ones are most useful, and how their values should be interpreted.
Changed:
<
<
  • System-wide profiling, allowing to study what happens to threads even when they "fall asleep" and call the operating system's kernel for the purpose of performing IO or locking a mutex.
>
>
  • System-wide profiling, allowing one to study what happens to threads even when they "fall asleep" and call the operating system's kernel for the purpose of performing IO or locking a mutex.
  ...but this would go beyond the scope of this introductory TWiki page. For more detailed information on Linux perf, highly recommended sources of information and "cheat sheets" include the man pages of the various perf utilities and http://www.brendangregg.com/perf.html .
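As hedged illustrations of some of the topics listed above (exact options vary between perf versions; check the man pages):
> perf annotate <function>
> perf record -e L1-dcache-load-misses -c 1000 <your command>
> perf record -a -g sleep 10
The first shows annotated source/assembly for one function, the second samples every 1000th L1 cache miss instead of sampling periodically, and the third records a system-wide profile for 10 seconds (usually requires elevated privileges).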

Revision 1192017-11-28 - HadrienGrasland

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 375 to 339
 
  • Due to the way it operates, perf requires CPU-specific support code. This means in particular that you are unlikely to be able to leverage your CPU's full performance monitoring capabilities if your Linux kernel version is older than your CPU model.
  • Perf is evolving quickly, and new versions can also bring massive improvements in features, usability and performance.
Changed:
<
<
You can learn more about the improvements brought by successive perf releases in the "tracing and profiling" sections of the Linux kernel version history at https://kernelnewbies.org/LinuxVersions , and check which Linux kernel version your system is running using the uname command:
>
>
You can learn more about the improvements brought by successive perf releases in the highlights and "Tracing/profiling" sections of the Linux kernel release notes at https://kernelnewbies.org/LinuxVersions , and check which Linux kernel version your system is running using the uname command:
 
> uname -r
4.14.1-1-default
Line: 446 to 414
  4,404909465 seconds time elapsed
Changed:
<
<
As you can see, the L1-dcache-load-misses counter was now active 50% of the time, instead of 25% before, which means that we aggregated twice as much performance statistics over the same program running time. However, we lost the other performance counters. This is usually a good second step after a potential performance problem has been identified in the first step.
>
>
As you can see, the L1-dcache-load-misses counter was now active 50% of the time, instead of 25% before, which means that we aggregated twice as much performance statistics over the same program running time. However, we lost the other performance counters. This is usually a good second step after a potential performance problem has been identified in the "generic" perf stat output.
  You can get a full list of the (numerous) CPU performance counters supported by perf using the "perf list" command.
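For illustration, a few typical entries (the exact list depends on your CPU and kernel version):
> perf list
  branch-misses                     [Hardware event]
  cycles                            [Hardware event]
  L1-dcache-load-misses             [Hardware cache event]
  context-switches OR cs            [Software event]
  ...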

Perf stat is very powerful and has very low overhead, but it only gives you coarse-grained information. Often, you want to know where your program spends its time, and more importantly why. This information can be measured using the "perf record" and "perf report" commands. The first one analyses the performance of your program by periodically sampling which function your code is executing, and how the CPU performance counters are evolving, then correlating these two pieces of information. The second one displays the resulting statistics in a nice textual user interface.

Changed:
<
<
In order to report function names, your program must be compiled with debugging symbols (as enabled by the "-g" GCC flag, or the "Debug" and "RelWithDebInfo" CMake build configurations). To get a profile that is representative of your application's actual performance, you will obviously need to leave compiler optimizations on, which is exactly the kind of scenario that CMake's built-in RelWithDebInfo configuration was designed for.
>
>
In order to report function names, your program must be compiled with debugging symbols (as enabled by the "-g" GCC flag, or the "Debug" and "RelWithDebInfo" CMake build configurations). To get a profile that is representative of your application's actual performance, you will obviously need to leave compiler optimizations on, which is exactly the kind of scenario that CMake's built-in "RelWithDebInfo" configuration was designed for.
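For a standalone CMake project this would be, for example (a generic CMake invocation, outside of the LHCb build wrappers):
> cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo <path to sources>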
  When you run perf record for the first time, you will likely see a warning message like the following one:
Added:
>
>
 
> perf record <your command>
WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict.

Revision 1182017-11-28 - HadrienGrasland

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 344 to 379
 
> uname -r
4.14.1-1-default
Changed:
<
<
To install the perf profiler, use your linux distribution's package manager. The name of the package(s) to be installed varies from one distribution to another, here are some common ones:
>
>
To install the perf profiler, use your Linux distribution's package manager. The name of the package(s) to be installed varies from one distribution to another; here are some common ones:
 
  • Debian/Ubuntu: linux-tools
  • RedHat /CentOS/Fedora/SUSE: perf
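For the distributions listed above, typical install commands are (hedged, as exact package names vary between releases):
> sudo apt-get install linux-tools-common linux-tools-generic
> sudo yum install perf
> sudo zypper install perf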
Line: 370 to 404
  4,429156950 seconds time elapsed
Changed:
<
<
The output of perf stat contains raw statistics (on the left) and some interpretations of the numbers (on the right). Here, we can see that the program under study is not multi-threaded (as CPU time is equal to the elapsed time), but makes reasonably efficient use of the single CPU core that it runs on (at 1.91 instructions per cycle, we're not too far away from the theoretical Haswell maximum for this code). One thing which this wiki page cannot expose is that if a performance number is abnormally bad, perf will helpfully highlight it using color in your terminal.

We can ask perf for more statistics from perf using the "-d" command line switch:

>
>
The output of perf stat contains raw statistics (on the left) and some interpretations of the numbers (on the right). Here, we can see that the program under study is not multi-threaded (as the right column points out, only 1 CPU is utilized), but makes reasonably efficient use of the single CPU core that it runs on (at 1.91 instructions per cycle, we're not too far away from the theoretical Haswell maximum for this code). One thing which this wiki page cannot expose is that if a performance number is abnormally bad, perf will also helpfully highlight it using color in your terminal.
 
Added:
>
>
We can ask perf for more statistics using the "-d" command line switch:
 
> perf stat -d <your command>

[ ... normal program output ... ]
Line: 414 to 446
  4,404909465 seconds time elapsed
Changed:
<
<
As you can see, the L1-dcache-load-misses counter was now active 50% of the time, instead of 25% before, which means that we aggregated twice as much performance statistics over the same program running time. However, we lost the other performance counters. This is usually a good second thing to do, after a potential performance problem has been identified in the first step.
>
>
As you can see, the L1-dcache-load-misses counter was now active 50% of the time, instead of 25% before, which means that we aggregated twice as much performance statistics over the same program running time. However, we lost the other performance counters. This is usually a good second step after a potential performance problem has been identified in the first step.
 
Changed:
<
<
You can also get a full list of all CPU performance counters supported by perf using the "perf list" command.
>
>
You can get a full list of the (numerous) CPU performance counters supported by perf using the "perf list" command.
 
Changed:
<
<
Perf stat is very powerful and has very low overhead, but is only gives you coarse-grained information. Often, you want to know where your program spends its time, and more importantly why. This functionality is provided by the "perf record" and "perf report" commands. The first one analyses the performance of your program by periodically sampling which function your code is executing, and how the CPU performance counters are evolving, then correlating these two informations. The second one displays the resulting statistics in a nice textual user interface.
>
>
Perf stat is very powerful and has very low overhead, but it only gives you coarse-grained information. Often, you want to know where your program spends its time, and more importantly why. This information can be measured using the "perf record" and "perf report" commands. The first one analyses the performance of your program by periodically sampling which function your code is executing, and how the CPU performance counters are evolving, then correlating these two pieces of information. The second one displays the resulting statistics in a nice textual user interface.
 
Changed:
<
<
In order to report function names, your program must be compiled with debugging symbols (as enabled by the "-g" GCC flag, or the "Debug" and "RelWithDebInfo" CMake build configurations).
>
>
In order to report function names, your program must be compiled with debugging symbols (as enabled by the "-g" GCC flag, or the "Debug" and "RelWithDebInfo" CMake build configurations). To get a profile that is representative of your application's actual performance, you will obviously need to leave compiler optimizations on, which is exactly the kind of scenario that CMake's built-in RelWithDebInfo configuration was designed for.
  When you run perf record for the first time, you will likely see a warning message like the following one:
Deleted:
<
<
 
> perf record <your command>
WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict.
Line: 447 to 478
 [kernel.kallsyms] with build id d9c54397e4672f9850695351f23e25f24757f9b0 not found, continuing without symbols [ perf record: Captured and wrote 7.910 MB perf.data (207305 samples) ]
Changed:
<
<
This message warns you that perf is not currently allowed to report the names of functions that you call within the Linux kernel. This ability can be very useful when the performance of your program is limited by system calls, and you want to understand what exactly is going on. If you have administrator rights on your machine, you can enable this feature by writing "0" in the /proc/sys/kernel/kptr_restrict pseudo-file. But we do not need this feature for this short tutorial, and perf can live without it, so we'll do without for now.
>
>
This message warns you that perf is not currently allowed to report the names of the functions that you call within the Linux kernel. This ability can be very useful when the performance of your program is limited by system calls, and you want to understand what exactly is going on. If you have administrator rights on your machine, you can enable this feature by writing "0" in the /proc/sys/kernel/kptr_restrict pseudo-file. But we do not need this feature for this short tutorial, and perf can live without it, so we'll do without for now.
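For the record, on a machine where you do have root access, this can be done with (a kernel setting, reverted at reboot):
> echo 0 | sudo tee /proc/sys/kernel/kptr_restrict
Depending on the system you may also need to lower /proc/sys/kernel/perf_event_paranoid in order to profile kernel activity.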
  So without much ado, let us look at the report:
Deleted:
<
<
 
> perf report
Samples: 207K of event 'cycles:uppp', Event count (approx.): 106619947387                                                                                                                                                                                                                                                     
Overhead  Command          Shared Object              Symbol                                                                                                                                                                                                                                                                  
Line: 474 to 504
  This profile was acquired on a different program than the one which I ran "perf stat" on at the beginning, and as you can see, this specific program could use more optimization work. It spends about half of its time in memory allocation related functions (malloc, free, and implementation details thereof), which is a common performance problem in idiomatic C++ code.
Changed:
<
<
One piece of information which you will notice at the top of the table is that this profile was based on the "cycles" performance counter, which tells how many CPU clock cycles have elapsed. This is the most common performance indicator in early performance analysis, as it tells you where your program spends its time, which is what one is usually initially most interested in. However, you can use any performance counter here, using the same "-e" flag as we added to perf stat recently. For example, "perf record -e L1-dcache-load-misses" would measure which functions in your code are correlated with the most CPU cache misses.
>
>
One piece of information which you will notice at the top of the table is that this profile was based on the "cycles" performance counter, which tells how many CPU clock cycles have elapsed. This is the most common performance indicator in early performance analysis, as it tells you where your program spends its time, which is what one is usually initially most interested in. However, you can use any performance counter here, using the same "-e" flag that we used with perf stat before. For example, "perf record -e L1-dcache-load-misses" would show which functions in your code are correlated with the most CPU cache misses.
 
Changed:
<
<
There is one important piece of information which is missing from the above report, however, and that is the reason why some specific functions were called. When doing performance analysis, it is one thing to know that one is calling malloc() too much, but it is another to know why this happens. In this case, what we are interested in is to tell who called the memory allocation functions, a piece of information known as the call graph.
>
>
There is one important piece of information which is missing from the above report, however, and that is the reason why some specific functions were called. When doing performance analysis, it is one thing to know that one is calling malloc() too much, but it is another to know why this happens. In this case, we want to tell who called these memory allocation functions, a piece of information also known as the call graph.
  There are several ways to measure a call graph, each with different advantages and drawbacks:

  • The best method in every respect, when it is available, is to use the Last Branch Record (LBR) hardware facility for this purpose. But this measurement method is only available on recent CPUs (>= Haswell for Intel).
Changed:
<
<
  • A universally compatible counterpart is to periodically make a copy of the program's stack and analyze it using the program's DWARF debug information. This is the same method used by the GDB debugger to generate stack traces. However, the need to make stack copies gives this profiling method very bad performance, which means that perf can only measure the program's state rarely, and thus that performance profiles must be acquired over much longer periods of time (several minutes) in order to be statistically significant.
>
>
  • A universally compatible counterpart is to periodically make a copy of the program's stack and analyze it using the program's DWARF debug information. This is the same method used by the GDB debugger to generate stack traces. Sadly, the need to make stack copies gives this profiling method very bad performance, which means that perf can only measure the program's state rarely, and thus that performance profiles must be acquired over much longer periods of time (several minutes) in order to be statistically significant. The profile files will also be much bigger, and slower to analyze.
 
  • Sometimes, an alternative method based on sampling only the frame pointer of the program can achieve the same result at a much reduced cost, without loss of portability. But unfortunately, there is a very popular compiler performance optimization that breaks this profiling method, and even if you disable it on your code, the libraries that you use will most likely have it enabled. Therefore, use of this profiling method is not recommended.

To measure a call graph, pass the "--call-graph=<method>" switch to perf record, where <method> will be either "lbr" or "dwarf" depending on which one your hardware allows you to use. Here, I will assume the availability of LBR-based call graph profiling:
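A minimal sketch of the two variants (the <your command> placeholder stands for the program being profiled):
> perf record --call-graph=lbr <your command>
> perf record --call-graph=dwarf <your command>
followed in both cases by "perf report" as before.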

Line: 533 to 562
  Notice two new things in the report. The first one is the "Children" counter, which tells you which fraction of the elapsed CPU time was spent in a certain function or one of the functions that it calls. This allows you to tell, at a glance, which functions can be held responsible for the most execution time in your program, as opposed to how much time was spent inside each individual function. From a performance analysis perspective, that is much more interesting information than the previously displayed "self time" alone, which is why perf report automatically sorts functions according to this criterion when you enable call graph profiling.
Changed:
<
<
The second thing is the little "+" signs in the leftmost column of the report. These signs allow you recursively explore which functions a given function is calling, using the interactive perf report UI. For example, in the following text block, I have explored where the "simulateEventLoop" function spends its time, and found out that a non-negligible fraction of it was spent inserting elements inside of a hash table (unordered_map in C++ terms), which in turned caused a nontrivial fraction of my dynamic memory allocations. Another time sink was the liberation of reference-counted data (via an std::shared_ptr), which caused expensive atomic operations and eventual memory liberation.
>
>
The second thing is the little "+" signs in the leftmost column of the report. These signs allow you recursively explore which functions a given function is calling, using the interactive perf report UI. For example, in the following text block, I have explored where the "simulateEventLoop" function spends its time, and found out that a non-negligible fraction of it was spent inserting elements inside of a hash table (itself part of a C++ unordered_set), which in turned caused a nontrivial fraction of my dynamic memory allocations. Another time sink was the liberation of reference-counted data (from an std::shared_ptr), which caused expensive atomic operations and eventual memory liberation.
 
Samples: 208K of event 'cycles:uppp', Event count (approx.): 105972566936                                                                                                                                                                                                                                                     
  Children      Self  Command          Shared Object              Symbol                                                                                                                                                                                                                                                     ?
+   46,21%     8,88%  07_io_bound_evl  07_io_bound_evloop.exe     [.] boost::detail::nullary_function<void ()>::impl_type<boost::detail::shared_state_nullary_task<void, boost::detail::invoker<std::_Bind<detail::SingleUseBindWrapper<BenchmarkIOSvc::startConditionIO(int const&, ConditionSlotIteration const&)::{lambda(?
Line: 561 to 589
  This concludes this short introduction to perf. Here, we have only have scratched the surface of what perf can do. Other interesting topics could have included...
Changed:
<
<
  • Displaying annotated source code and assembly, in order to tell which part of a given function, exactly, takes time (bearing in mind that optimizing compilers can transform the source code of the original function quite tremendously, which makes this analysis somewhat difficult and advanced).
>
>
  • Displaying annotated source code and assembly, in order to tell which part of a given function, exactly, takes time (bearing in mind that optimizing compilers can transform the source code of the original function quite tremendously, which makes this analysis somewhat difficult).
 
  • Measuring program activity every N-th occurence of a given event (e.g. L1 cache miss) instead of periodically, in order to more precisely pinpoint where in the code the event is occurring.
  • The great many performance counters available on modern CPUs, which ones are most useful, and how their values should be interpreted.
  • System-wide profiling, allowing to study what happens to threads even when they "fall asleep" and call the operating system's kernel for the purpose of performing IO or locking a mutex.

Revision 1172017-11-28 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 160 to 160
  One very useful option is "--dump-before" which can be used for the creation of an output file before calling particular methods. Using this with for instance BooleInit::execute allows the creation of one dump per event, which can then be read in individually. I.e. a full command line could be, for Boole
Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-before="BooleInit::execute()" python `which gaudirun.py` options.py
>
>
> valgrind --tool=callgrind -v --cache-sim=yes --branch-sim=yes --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-before="BooleInit::execute()" python `which gaudirun.py` options.py
  or for Brunel
Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-before="RecInit::execute()" python `which gaudirun.py` options.py
>
>
> valgrind --tool=callgrind -v --cache-sim=yes --branch-sim=yes --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-before="RecInit::execute()" python `which gaudirun.py` options.py
  and finally for DaVinci
Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-before="DaVinciInit::execute()" python `which gaudirun.py` options.py
>
>
> valgrind --tool=callgrind -v --cache-sim=yes --branch-sim=yes --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-before="DaVinciInit::execute()" python `which gaudirun.py` options.py
  Alternatively, it is sometimes more useful to average the profiling information over several events in order to get a better overall picture. This can be done using the following options, which will produce one dump for the initialize() phase, one for execute() and a third for finalize().
Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-after="ApplicationMgr::start()" --dump-before="ApplicationMgr::stop()" python `which gaudirun.py` options.py
>
>
> valgrind --tool=callgrind -v --cache-sim=yes --branch-sim=yes --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-after="ApplicationMgr::start()" --dump-before="ApplicationMgr::stop()" python `which gaudirun.py` options.py
  Also note, valgrind generally requires a large amount of memory and CPU, and thus you may run into problems with the CPU time and virtual memory size limits in place on the general purpose lxplus nodes. If this happens, try using a private afs box (e.g. pclhcbXX.cern.ch) if such a machine is available to you. If not, you could try running on one of the lxbuildXXX machines (XXX = 035 or 042, although we should avoid all running at the same time on the same node ...).

Revision 1162017-11-28 - HadrienGrasland

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 24 to 24
 

Building valgrind and callgrind

valgrind and callgrind can be built easily using the tar files available from the respective web pages (see above). However, it appears that the size of our LHCb applications is larger than can be handled by the default values coded into valgrind. This is seen by the error

Deleted:
<
<
 
--15591:0:aspacem  Valgrind: FATAL: VG_N_SEGMENTS is too low.
Changed:
<
<
--15591:0:aspacem Increase it and rebuild. Exiting now.
>
>
--15591:0:aspacem Increase it and rebuild. Exiting now.
  It can be easily fixed, by doing as it says ! The file is coregrind/m_aspacemgr/aspacemgr-linux.c in the valgrind build directory (or coregrind/m_aspacemgr/aspacemgr.c in older releases). I found increasing the value from 2000 to 25000 seems to do the trick in all cases I've found so far. I also increased VG_N_SEGNAMES to 5000 from 1000.
Line: 48 to 45
 
[lxplus066] ~ > valgrind --version
valgrind: mmap(0x8bf5000, -1488932864) failed during startup.
Changed:
<
<
valgrind: is there a hard virtual memory limit set?
>
>
valgrind: is there a hard virtual memory limit set?
  These bugs have been fixed in the more recent versions than that installed. Unfortunately such a version is not available by default. A patched version is available in the LCG AFS area, which the following scripts provide access to :-
Deleted:
<
<
 
Changed:
<
<
> source /afs/cern.ch/lhcb/group/rich/vol4/jonrob/scripts/new-valgrind.csh
>
>
> source /afs/cern.ch/lhcb/group/rich/vol4/jonrob/scripts/new-valgrind.csh
  or
Changed:
<
<
> source /afs/cern.ch/lhcb/group/rich/vol4/jonrob/scripts/new-valgrind.sh
>
>
> source /afs/cern.ch/lhcb/group/rich/vol4/jonrob/scripts/new-valgrind.sh
  (for csh or bash like shells respectively) which currently provides the latest versions of valgrind and callgrind, with a private patch applied to allow valgrind to properly run with the LHC LCG applications.
Line: 67 to 60
 

Working with gaudirun.py

Unfortunately, valgrind cannot be used directly with gaudirun.py. There are three solutions, first you can use gaudirun.py to generate "old style" Job Options:

Deleted:
<
<
 
Changed:
<
<
> gaudirun.py -n -v -o options.opts options.py
>
>
> gaudirun.py -n -v -o options.opts options.py
  Depending on your options and the version of gaudirun.py used to generate them, options.opts may contain some lines which are not valid... If this happens simply edit the file by hand and remove these lines.
Line: 75 to 66
 Depending on your options and the version of gaudirun.py used to generate them, options.opts may contain some lines which are not valid... If this happens simply edit the file by hand and remove these lines.

Then invoke valgrind with something like :-

Deleted:
<
<
 
Changed:
<
<
> valgrind --tool=memcheck Gaudi.exe options.opts
>
>
> valgrind --tool=memcheck Gaudi.exe options.opts
  Second, and probably better since it avoids the additional step of creating old style depreciated options (which does not always work), you can run valgrind directly on the python executable and pass the full path to gaudirun.py as an argument.
Deleted:
<
<
 
Changed:
<
<
> valgrind --tool=memcheck python `which gaudirun.py` options.py
>
>
> valgrind --tool=memcheck python `which gaudirun.py` options.py
  This second method will be used in the examples below.
Line: 96 to 83
 NB2: In the profileExtraOptions "--" have to be replaced by "__" to avoid issues with the option parser.

NB3: valgrind needs to be in the path for this option to run.

Deleted:
<
<
 
Changed:
<
<
gaudirun.py --profilerName=valgrindmemcheck --profilerExtraOptions="-v __leak-check=yes __leak-check=full __show-reachable=yes __suppressions=$ROOTSYS/etc/valgrind-root.supp __suppressions=$STDOPTS/valgrind-python.supp __suppressions=$STDOPTS/Gaudi.supp" options.py
>
>
gaudirun.py --profilerName=valgrindmemcheck --profilerExtraOptions="-v __leak-check=yes __leak-check=full __show-reachable=yes __suppressions=$ROOTSYS/etc/valgrind-root.supp __suppressions=$STDOPTS/valgrind-python.supp __suppressions=$STDOPTS/Gaudi.supp" options.py
 

Working with ROOT6

Line: 104 to 89
 

Working with ROOT6

ROOT6 uses clang internally as its interpreter, and this causes valgrind some problems due to its use of 'self modifying code'. To deal with this the additional valgrind option --smc-check should be used. Add

Deleted:
<
<
 
Changed:
<
<
--smc-check=all-non-file
>
>
--smc-check=all-non-file
  To your set of command line options. This unfortunately makes valgrind much slower, but at least it works...
Added:
>
>
 There is prehaps some hope that in the future when ROOT enables full support for clang's pre-compiled-modules, the need for this might go away...

Memory Tests

Line: 118 to 102
  It is useful to add a few additional options, to improve the quality of the output. The options
Changed:
<
<
-v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --track-origins=yes
>
>
-v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --track-origins=yes
 increase the amount of information valgrind supplies.

In additional, there are some warnings which are well known about (for instance from the Gaudi framework or even thrid party libraries, like ROOT, STL or boost) and it is best to suppress these, otherwise the amount of output to deal with is huge. This can be done with --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp.

Line: 129 to 113
 To generate in the output log template suppression blocks for new warnings, add the option --gen-suppressions=all to the command line arguments.

DaVinci has the specific suppression file $DAVINCIROOT/job/DaVinci.supp, so the final full command for DaVinci is then :-

Deleted:
<
<
 
Changed:
<
<
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --smc-check=all-non-file --gen-suppressions=all --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp --suppressions=$STDOPTS/LHCb.supp --suppressions=$DAVINCIROOT/job/DaVinci.supp python `which gaudirun.py` options.py
>
>
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --smc-check=all-non-file --gen-suppressions=all --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp --suppressions=$STDOPTS/LHCb.supp --suppressions=$DAVINCIROOT/job/DaVinci.supp python `which gaudirun.py` options.py
  For Brunel, there is an additional suppressions file you can apply $BRUNELROOT/job/Brunel.supp giving :-
Deleted:
<
<
 
Changed:
<
<
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --smc-check=all-non-file --gen-suppressions=all --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp --suppressions=$STDOPTS/LHCb.supp --suppressions=$BRUNELROOT/job/Brunel.supp python `which gaudirun.py` options.py
>
>
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --smc-check=all-non-file --gen-suppressions=all --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp --suppressions=$STDOPTS/LHCb.supp --suppressions=$BRUNELROOT/job/Brunel.supp python `which gaudirun.py` options.py
  The above commands will send the output to the terminal. It is best to redirect this to file. One additional complication is the valgrind output is sent to STDERR, not STDOUT, so you will need to redirect both to file... This can be done (for bash like shells) by appending &>mem.log to the full valgrind command (for csh like shells use >& mem.log).
Line: 147 to 127
 This will produce a large output file contain detailed information on any memory problems. You can either simply read this file directly, or if you prefer use the alleyoop application to help interpret the errors.

If possible use a debug build (see below) as that will provide more information. This can be done by running

Deleted:
<
<
 
Changed:
<
<
> LbLogin -c $CMTDEB
>
>
> LbLogin -c $CMTDEB
  before any SetupProject calls.
Line: 157 to 135
 

Cache Profiling

The valgrind tool called "cachegrind" provides a simulation of the CPU caches and thus is a cache and branch-prediction profiler. In the simpliest case it can be activated with

Deleted:
<
<
 
Changed:
<
<
> valgrind --tool=cachegrind application
>
>
> valgrind --tool=cachegrind application
  See the cachegrind manual for more details.
Line: 173 to 149
 See here for more details on callgrind and kcachegrind.

In the simple case, usage is just

Deleted:
<
<
 
Changed:
<
<
> valgrind --tool=callgrind application
>
>
> valgrind --tool=callgrind application
  and where application is for example "Boole.exe options-file". This will produce an output file of the form callgrind.xxxxx, which can be read in by kcachegrind. More information on available command line options are given on the web page, or by running "callgrind --help or valgrind --help".
Line: 185 to 159
 If you wish to include in the callgrind output the same cache profiling information as provided by cachegrind, include as well the options --cache-sim=yes --branch-sim=yes

One very useful option is "--dump-before" which can be used for the creation of an output file before calling particular methods. Using this with for instance BooleInit::execute allows the creation of one dump per event, which can then be read in individually. I.e. a full command line could be, for Boole

Deleted:
<
<
 
Changed:
<
<
> valgrind --tool=callgrind -v --cache-sim=yes --branch-sim=yes --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-before="BooleInit::execute()" python `which gaudirun.py` options.py
>
>
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-before="BooleInit::execute()" python `which gaudirun.py` options.py
  or for Brunel
Deleted:
<
<
 
Changed:
<
<
> valgrind --tool=callgrind -v --cache-sim=yes --branch-sim=yes --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-before="RecInit::execute()" python `which gaudirun.py` options.py
>
>
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-before="RecInit::execute()" python `which gaudirun.py` options.py
  and finally for DaVinci
Deleted:
<
<
 
Changed:
<
<
> valgrind --tool=callgrind -v --cache-sim=yes --branch-sim=yes --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-before="DaVinciInit::execute()" python `which gaudirun.py` options.py
>
>
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-before="DaVinciInit::execute()" python `which gaudirun.py` options.py
  Alternatively, it is sometimes more useful to average the profiling information over several events in order to get a better overall picture. This can be done using the following options, which will produce one dump for the initialize() phase, one for execute() and a third for finalize().
Deleted:
<
<
 
Changed:
<
<
> valgrind --tool=callgrind -v --cache-sim=yes --branch-sim=yes --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-after="ApplicationMgr::start()" --dump-before="ApplicationMgr::stop()" python `which gaudirun.py` options.py
>
>
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-after="ApplicationMgr::start()" --dump-before="ApplicationMgr::stop()" python `which gaudirun.py` options.py
  Also note, valgrind generally requires a large amount of memory and CPU, and thus you may run into problems with the CPU time and virtual memory size limits in place on the general purpose lxplus nodes. If this happens, try using a private afs box (e.g. pclhcbXX.cern.ch) if such a machine is available to you. If not, you could try running on one of the lxbuildXXX machines (XXX = 035 or 042, although we should avoid all running at the same time on the same node ...).
Line: 217 to 183
 

Remote Control

Sometimes its useful to be able to start and stop profiling by hand. This can be done by first passing the option

Deleted:
<
<
 
Changed:
<
<
--instr-atstart=no
>
>
--instr-atstart=no
  to the valgrind command used to start the application. This will start the process running, but profiling will not start until you run (on the same machine, so start a second terminal) the command
Deleted:
<
<
 
Changed:
<
<
> callgrind_control --instr=on
>
>
> callgrind_control --instr=on
  later on you can then stop the profiling when you wish, using
Deleted:
<
<
 
Changed:
<
<
> callgrind_control --instr=off
>
>
> callgrind_control --instr=off
 

Alternative approach

Line: 248 to 207
  p.DumpName = 'CALLGRIND-OUT' GaudiSequencer('RecoTrSeq').Members.insert(0, p) appendPostConfigAction(addProfile)
Deleted:
<
<

gaudirun.py --profilerName=valgrindcallgrind --profilerExtraOptions="__instr-atstart=no -v __smc-check=all-non-file __dump-instr=yes __trace-jump=yes" joboptions.py |& tee out-0005.log
 
Added:
>
>
gaudirun.py --profilerName=valgrindcallgrind --profilerExtraOptions="__instr-atstart=no -v __smc-check=all-non-file __dump-instr=yes __trace-jump=yes" joboptions.py |& tee out-0005.log
 

Memory Usage Monitoring

Line: 260 to 215
 The valgrind tool called "massif" also exists which does some detialed memory usage monitoring. Full documentation of this tool is available in section 9 of the valgrind user guide.

Simple usage is just

Deleted:
<
<
 
Changed:
<
<
> valgrind --tool=massif python `which gaudirun.py` options.py
>
>
> valgrind --tool=massif python `which gaudirun.py` options.py
  Note that command line options exist to control the level of memory usage monitoring that is applied, whether the stack is monitored as well as the heap (by default not). See the user guide for more details. Useful additional default options are, for example
Deleted:
<
<
 
Changed:
<
<
> valgrind --tool=massif -v --max-snapshots=1000 python `which gaudirun.py` options.py
>
>
> valgrind --tool=massif -v --max-snapshots=1000 python `which gaudirun.py` options.py
  Also note, to get the most out of this tool (as with all valgrind tools) it is best to run the debug software builds.
Line: 285 to 236
  The easiest way to debug with gdb is to use the built-in --gdb flag of gaudirun.py
Changed:
<
<
> gaudirun.py --gdb yourOptions.py
>
>
> gaudirun.py --gdb yourOptions.py
  Note: Currently, the default gdb on lxplus is too old to be useful for gcc 4.8 (or later) builds. There is a JIRA task to make newer gdb available on login. Until it is finished, please use gdb from CVMFS or AFS by one of:
Changed:
<
<
export PATH=/cvmfs/lhcb.cern.ch/lib/contrib/gdb/7.11/x86_64-slc6-gcc49-opt/bin:$PATH
>
>
export PATH=/cvmfs/lhcb.cern.ch/lib/contrib/gdb/7.11/x86_64-slc6-gcc49-opt/bin:$PATH
 or
Changed:
<
<
export PATH=/afs/cern.ch/sw/lcg/external/gdb/7.11/x86_64-slc6-gcc48-opt/bin:$PATH
>
>
export PATH=/afs/cern.ch/sw/lcg/external/gdb/7.11/x86_64-slc6-gcc48-opt/bin:$PATH
  Alternatively, gaudirun.py applications can be run through the gdb debugger using a similar trick as with valgrind, to call python directly and pass gaudirun.py as an argument. Just type
Changed:
<
<
> gdb --args python `which gaudirun.py` yourOptions.py
>
>
> gdb --args python `which gaudirun.py` yourOptions.py
 and then to run the application then simply type run at the gdb command line. ( The option --args tells gdb to interpret any additional options after the executable name as arguments to that application, instead of the default which is to try and interpret them as core files...)

When it crashes, type where to get a traceback at that point.

Line: 308 to 257
 When it crashes, type where to get a traceback at that point.

If possible use a debug build (see below) as that will provide more information. This can be done by running

Deleted:
<
<
 
Changed:
<
<
> LbLogin -c $CMTDEB
>
>
> LbLogin -c $CMTDEB
  before SetupProject etc.
Line: 324 to 270
 
 > ps x | grep gaudirun
 4200 pts/7    S+     0:00 grep gaudirun
Changed:
<
<
32081 pts/6 Rl+ 46:05 /cvmfs/lhcb.cern.ch/lib/lcg/releases/LCG_68/Python/2.7.6/x86_64-slc6-gcc48-opt/bin/python /cvmfs/lhcb.cern.ch/lib/lhcb/GAUDI/GAUDI_v25r2/InstallArea/x86_64-slc6-gcc48-dbg/scripts/gaudirun.py
>
>
32081 pts/6 Rl+ 46:05 /cvmfs/lhcb.cern.ch/lib/lcg/releases/LCG_68/Python/2.7.6/x86_64-slc6-gcc48-opt/bin/python /cvmfs/lhcb.cern.ch/lib/lhcb/GAUDI/GAUDI_v25r2/InstallArea/x86_64-slc6-gcc48-dbg/scripts/gaudirun.py
  So in this case its 32081. Then it is just a matter of running :-
Deleted:
<
<
 
Changed:
<
<
> gdb `which python` 32081
>
>
> gdb `which python` 32081
  This will take a short while, loading libraries etc. When done you can then investigate, for instance :-
Deleted:
<
<
 
(gdb) where
#0  0x00007fd11e5c68ac in G4PhysicsOrderedFreeVector::FindBinLocation(double) const () at ../management/src/G4PhysicsOrderedFreeVector.cc:165
Line: 342 to 284
 #2 0x00007fd119e5a302 in RichG4Cerenkov::PostStepDoIt(G4Track const&, G4Step const&) () at /afs/cern.ch/lhcb/software/releases/GEANT4/GEANT4_v95r2p7g2/InstallArea/include/G4PhysicsVector.icc:68 #3 0x00007fd11e588130 in G4SteppingManager::InvokePSDIP(unsigned long) () at ../src/G4SteppingManager2.cc:525
Changed:
<
<
>
>
  Note that the process being debugged will be paused whilst GDB is attached. If you quit GDB, it will then continue running, and if you wish you can reattach again at a later stage.
Line: 352 to 293
 With recent version of GDB, it is possible to script the behavior of the debugger. This is quite handy when a lot of repetitive tasks is requested in the debugger itself, for example when debugging multi-threaded applications. Since GDB 7.0 (which is present on lxplus), it is distributed with python modules interacting directly with the GDB internals. One way to check that GDB has the support for this is to do:
(gdb) python print 23
Changed:
<
<
23
>
>
23
  at the GDB prompt. This must not fail.
Line: 367 to 306
  > emacs M-X gdb (or Esc-X on many keyboards) gdb python
Changed:
<
<
(gdb) run `which gaudirun.py` yourOptions.py
>
>
(gdb) run `which gaudirun.py` yourOptions.py
 At CERN gdb python may give you an error, if that is the case you should do
Changed:
<
<
/afs/cern.ch/sw/lcg/contrib/gdb/7.6/x86_64-slc6-gcc48-opt/bin/gdb python
>
>
/afs/cern.ch/sw/lcg/contrib/gdb/7.6/x86_64-slc6-gcc48-opt/bin/gdb python
  You can use the emacs toolbar to set break in lines, unset them and issue debugger commands, or you can pass them as command lines at the (gdb) prompt. In which case here are a couple of useful short-cuts:
  • (gdb) Ctrl- up-arrow/down-arrow allows to navigate through the commands you already typed
Line: 381 to 319
  Google provides a set of performance tools. For details on usage within LHCb see here.
Changed:
<
<

VTune Intel Profiler

>
>

CPU-assisted performance analysis

 
Changed:
<
<
A Gaudi auditor has been provided by Sascha Mazurov to interface to the VTune Intel profiler. See IntelProfiler, IntelProfilerExample and Video tutorial on profiler installation in Gaudi, running and analyzing it from command line (without GUI)
>
>
All the previously discussed performance analysis tools are often unable to provide a precise quantitative analysis of what happens as a program is executed on a real CPU, for different reasons:

  • Valgrind essentially works by emulating the execution of the program on a virtual CPU. This artificially inflates the cost of CPU computations with respect to other operations (such as IO) by more than an order of magnitude, and entails that performance analysis must be based on a mathematical model of a CPU, which is in practice quite far off from what modern Intel CPUs actually do.
  • Google's profiler, like other user-space sampling profilers (gprof, igprof...), is only able to tell where a program spends its time, and not why. For example, it cannot tell where CPU cache misses are happening, which complicates memory layout optimizations.
  • Neither of these tools are able to monitor the time spent in the operating system kernel, which is important for assessing the impact of blocking IO operations or lock contention in multi-threaded code.

A more precise analysis of program execution on a real machine can be obtained from tools which leverage the Performance Monitoring Counters of modern CPUs, such as the "perf" profiler of the Linux kernel or Intel's VTune Amplifier. These tools provide an accurate and detailed picture of what is going on in the CPU as a program is executing, and have a negligible impact on the performance of the program under study when used correctly.

There is a price to pay for this precision, however, which is that the functionality provided by these tools depends on your system configuration. Some functionality may only be available on recent Intel CPUs, and other may require use of a recent enough Linux kernel.

Linux perf (also known as perf_events)

The perf profiler is a free and open source program which builds on the perf_events interface that has been integrated in the Linux kernel since Linux 2.6.31. It is highly recommended to use it with as recent a Linux kernel release as possible (at least 3.x) for the following reasons:

  • Early 2.6.x version had some very nasty bugs, causing system lock-up for example.
  • Due to the way it operates, perf requires CPU-specific support code. This means in particular that you are unlikely to be able to leverage your CPU's full performance monitoring capabilities if your Linux kernel version is older than your CPU model.
  • Perf is evolving quickly, and new versions can also bring massive improvements in features, usability and performance.

You can learn more about the improvements brought by successive perf releases in the "tracing and profiling" sections of the Linux kernel version history at https://kernelnewbies.org/LinuxVersions , and check which Linux kernel version your system is running using the uname command:

> uname -r
4.14.1-1-default

To install the perf profiler, use your linux distribution's package manager. The name of the package(s) to be installed varies from one distribution to another, here are some common ones:

  • Debian/Ubuntu: linux-tools
  • RedHat /CentOS/Fedora/SUSE: perf


The simplest thing which you can do with perf is to measure aggregated CPU statistics over the course of an entire program execution. This is done using the "perf stat" command:

> perf stat <your command>

[ ... normal program output ... ]

Performance counter stats for 'cargo run --release':

       4428,370578      task-clock (msec)         #    1,000 CPUs utilized          
                46      context-switches          #    0,010 K/sec                  
                 0      cpu-migrations            #    0,000 K/sec                  
             6 459      page-faults               #    0,001 M/sec                  
    15 738 566 590      cycles                    #    3,554 GHz                    
    30 034 797 373      instructions              #    1,91  insn per cycle         
     2 222 188 760      branches                  #  501,807 M/sec                  
        88 966 900      branch-misses             #    4,00% of all branches        
 
       4,429156950 seconds time elapsed

The output of perf stat contains raw statistics (on the left) and some interpretations of the numbers (on the right). Here, we can see that the program under study is not multi-threaded (as CPU time is equal to the elapsed time), but makes reasonably efficient use of the single CPU core that it runs on (at 1.91 instructions per cycle, we are not too far away from the theoretical Haswell maximum for this code). One thing which this wiki page cannot show is that if a performance number is abnormally bad, perf will helpfully highlight it using color in your terminal.

We can ask perf for more statistics using the "-d" command line switch:

> perf stat -d <your command>

[ ... normal program output ... ]

 Performance counter stats for 'cargo run --release':

       4425,711186      task-clock (msec)         #    0,999 CPUs utilized          
               159      context-switches          #    0,036 K/sec                  
                 0      cpu-migrations            #    0,000 K/sec                  
             6 431      page-faults               #    0,001 M/sec                  
    15 684 677 897      cycles                    #    3,544 GHz                      (50,10%)
    29 976 594 100      instructions              #    1,91  insn per cycle           (62,56%)
     2 208 236 648      branches                  #  498,956 M/sec                    (62,56%)
        88 851 174      branch-misses             #    4,02% of all branches          (62,60%)
     6 042 105 001      L1-dcache-loads           # 1365,228 M/sec                    (62,08%)
        11 320 634      L1-dcache-load-misses     #    0,19% of all L1-dcache hits    (25,06%)
         1 870 540      LLC-loads                 #    0,423 M/sec                    (25,02%)
           319 260      LLC-load-misses           #   17,07% of all LL-cache hits     (37,57%)

       4,431148695 seconds time elapsed

Here, you can see that we start to get interesting information about the use of CPU caches. For this particular program, the cache usage pattern is that we rarely go out of the first-level CPU cache (L1), but when we do, we often need to go all the way to main memory. As main memory accesses are around 50x more costly than first-level cache accesses (the latency order of magnitude being ~3 cpu cycles for L1 vs ~150 cycles for main memory), it is often useful to carefully examine both numbers. However, here, they are ultimately not concerning: even when re-scaled with this order of magnitude in mind, our last-level cache misses still have negligible impact compared to the common case of L1 cache hits.

Another thing to pay attention to here is the new column of percentages on the right. CPU performance monitoring counters have some hardware limitations, the most important of which is that you can only monitor a small set of them at any given time. Here, because we are looking at a lot of different statistics at once, perf was forced to only monitor a subset of them at a time and constantly switch between them, then interpolate the missing data samples, which has some overhead and reduces the quality of the measurement. The percentage tells you during which fraction of the total measurement time the corresponding performance counter was actually active.

If you know exactly which performance counters you are interested in, you can get more precise measurements by asking perf to only measure these ones, using the "-e" command line switch:

> perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses <your command>

[ ... normal program output ... ]

 Performance counter stats for 'cargo run --release':

     6 023 335 754      L1-dcache-loads                                               (74,90%)
         8 994 495      L1-dcache-load-misses     #    0,15% of all L1-dcache hits    (50,16%)
         1 407 348      LLC-loads                                                     (50,09%)
           311 839      LLC-load-misses           #   22,16% of all LL-cache hits     (75,00%)

       4,404909465 seconds time elapsed

As you can see, the L1-dcache-load-misses counter was now active 50% of the time, instead of 25% before, which means that we aggregated twice as many samples for this statistic over the same program running time. However, we lost the other performance counters. This is usually a good second thing to do, once a potential performance problem has been identified in the first step.

You can also get a full list of all CPU performance counters supported by perf using the "perf list" command.
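
As a small usage sketch (the exact set of events shown depends on your CPU and kernel version), the list can be restricted to a category or filtered with grep:

# Show only the cache-related hardware events
> perf list cache
# Or search the full list for a keyword
> perf list | grep -i branch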


Perf stat is very powerful and has very low overhead, but it only gives you coarse-grained information. Often, you want to know where your program spends its time, and more importantly why. This functionality is provided by the "perf record" and "perf report" commands. The first one analyses the performance of your program by periodically sampling which function your code is executing, and how the CPU performance counters are evolving, then correlating these two pieces of information. The second one displays the resulting statistics in a nice textual user interface.

In order to report function names, your program must be compiled with debugging symbols (as enabled by the "-g" GCC flag, or the "Debug" and "RelWithDebInfo" CMake build configurations).
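
As a minimal sketch (the file and target names here are placeholders, not taken from any LHCb project), keeping optimisation while adding debug symbols looks like this:

# Hand-compiled test program: keep -O2, simply add -g
> g++ -O2 -g -o myprogram myprogram.cpp
# CMake-based project: the RelWithDebInfo build type combines optimisation with debug symbols
> cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..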

When you run perf record for the first time, you will likely see a warning message like the following one:

> perf record <your command>
WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict.

Samples in kernel functions may not be resolved if a suitable vmlinux
file is not found in the buildid cache or in the vmlinux path.

Samples in kernel modules won't be resolved at all.

If some relocation was applied (e.g. kexec) symbols may be misresolved
even with a suitable vmlinux or kallsyms file.

Couldn't record kernel reference relocation symbol
Symbol resolution may be skewed if relocation was used (e.g. kexec).
Check /proc/kallsyms permission or run as root.

[ ... normal program output ... ]

[ perf record: Woken up 29 times to write data ]
[kernel.kallsyms] with build id d9c54397e4672f9850695351f23e25f24757f9b0 not found, continuing without symbols
[ perf record: Captured and wrote 7.910 MB perf.data (207305 samples) ]

This message warns you that perf is not currently allowed to report the names of functions that you call within the Linux kernel. This ability can be very useful when the performance of your program is limited by system calls, and you want to understand what exactly is going on. If you have administrator rights on your machine, you can enable this feature by writing "0" into the /proc/sys/kernel/kptr_restrict pseudo-file. But we do not need this feature for this short tutorial, and perf can live without it, so we'll do without it for now.
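
For reference, if you do have root access and want kernel symbols to be resolved, the restriction can be lifted as follows (note that this is a system-wide setting, which persists until the next reboot or until you write the old value back):

# Relax kernel pointer hiding so that perf can resolve kernel symbols
> echo 0 | sudo tee /proc/sys/kernel/kptr_restrict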

So without much ado, let us look at the report:

> perf report
Samples: 207K of event 'cycles:uppp', Event count (approx.): 106619947387                                                                                                                                                                                                                                                     
Overhead  Command          Shared Object              Symbol                                                                                                                                                                                                                                                                  
  20,16%  07_io_bound_evl  libc-2.26.so               [.] _int_malloc
  14,66%  07_io_bound_evl  libc-2.26.so               [.] _int_free
  10,10%  07_io_bound_evl  07_io_bound_evloop.exe     [.] std::_Hashtable<int, int, std::allocator<int>, std::__detail::_Identity, std::equal_to<int>, std::hash<int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, tr
   8,59%  07_io_bound_evl  07_io_bound_evloop.exe     [.] detail::ConditionSlotKnowledge::setupSlot
   8,58%  07_io_bound_evl  07_io_bound_evloop.exe     [.] boost::detail::nullary_function<void ()>::impl_type<boost::detail::shared_state_nullary_task<void, boost::detail::invoker<std::_Bind<detail::SingleUseBindWrapper<BenchmarkIOSvc::startConditionIO(int const&, ConditionSlotIteration const&)::{lambda(ConditionSlot
   6,85%  07_io_bound_evl  libc-2.26.so               [.] malloc
   3,89%  07_io_bound_evl  libc-2.26.so               [.] malloc_consolidate
   3,62%  07_io_bound_evl  07_io_bound_evloop.exe     [.] BenchmarkIOSvc::startConditionIO
   3,34%  07_io_bound_evl  libc-2.26.so               [.] cfree@GLIBC_2.2.5
   2,70%  07_io_bound_evl  libpthread-2.26.so         [.] __pthread_mutex_lock
   2,30%  07_io_bound_evl  libc-2.26.so               [.] tcache_put
   2,15%  07_io_bound_evl  07_io_bound_evloop.exe     [.] std::vector<detail::ReadySlotPromise, std::allocator<detail::ReadySlotPromise> >::~vector
   2,05%  07_io_bound_evl  07_io_bound_evloop.exe     [.] std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release
   1,91%  07_io_bound_evl  libc-2.26.so               [.] tcache_get
   1,87%  07_io_bound_evl  libstdc++.so.6.0.24        [.] operator new
   1,10%  07_io_bound_evl  libpthread-2.26.so         [.] __pthread_mutex_unlock_usercnt
[ ... shortened for brevity ... ]

This profile was acquired on a different program than the one which I ran "perf stat" on at the beginning, and as you can see, this specific program could use more optimization work. It spends about half of its time in memory allocation related functions (malloc, free, and implementation details thereof), which is a common performance problem in idiomatic C++ code.

One piece of information which you will notice at the top of the table is that this profile was based on the "cycles" performance counter, which tells how many CPU clock cycles have elapsed. This is the most common performance indicator in early performance analysis, as it tells you where your program spends its time, which is usually what one is initially most interested in. However, you can use any performance counter here, using the same "-e" flag that we used with perf stat earlier. For example, "perf record -e L1-dcache-load-misses" would measure which functions in your code are correlated with the most CPU cache misses.
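
As a short sketch of that workflow (the command being profiled is a placeholder):

# Sample on L1 data cache misses instead of CPU cycles, then browse the result as usual
> perf record -e L1-dcache-load-misses <your command>
> perf report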


There is one important piece of information which is missing from the above report, however, and that is the reason why some specific functions were called. When doing performance analysis, it is one thing to know that one is calling malloc() too much, but it is another to know why this happens. In this case, what we are interested in is to tell who called the memory allocation functions, a piece of information known as the call graph.

There are several ways to measure a call graph, each with different advantages and drawbacks:

  • The best method in every respect, when it is available, is to use the Last Branch Record (LBR) hardware facility for this purpose. But this measurement method is only available on recent CPUs (>= Haswell for Intel).
  • A universally compatible counterpart is to periodically make a copy of the program's stack and analyze it using the program's DWARF debug information. This is the same method used by the GDB debugger to generate stack traces. However, the need to make stack copies gives this profiling method very bad performance, which means that perf can only measure the program's state rarely, and thus that performance profiles must be acquired over much longer periods of time (several minutes) in order to be statistically significant.
  • Sometimes, an alternative method based on sampling only the frame pointer of the program can achieve the same result at a much reduced cost, without loss of portability. But unfortunately, a very popular compiler optimization, frame pointer omission (GCC's -fomit-frame-pointer, typically enabled at -O1 and above), breaks this profiling method, and even if you disable it in your own code, the libraries that you use will most likely have it enabled. Therefore, use of this profiling method is not recommended.

To measure a call graph, pass the "--call-graph=<method>" switch to perf record, where <method> will be either "lbr" or "dwarf" depending on which one your hardware allows you to use. Here, I will assume the availability of LBR-based call graph profiling:

> perf record --call-graph=lbr <your command> && perf report
[... an entire program execution later ... ]
Samples: 208K of event 'cycles:uppp', Event count (approx.): 105972566936                                                                                                                                                                                                                                                     
  Children      Self  Command          Shared Object              Symbol                                                                                                                                                                                                                                                      
+   46,21%     8,88%  07_io_bound_evl  07_io_bound_evloop.exe     [.] boost::detail::nullary_function<void ()>::impl_type<boost::detail::shared_state_nullary_task<void, boost::detail::invoker<std::_Bind<detail::SingleUseBindWrapper<BenchmarkIOSvc::startConditionIO(int const&, ConditionSlotIteration const&)::{lambda(C
+   43,04%     0,02%  07_io_bound_evl  07_io_bound_evloop.exe     [.] boost::executors::basic_thread_pool::worker_thread
+   41,85%     0,04%  07_io_bound_evl  07_io_bound_evloop.exe     [.] detail::SequentialScheduler::simulateEventLoop
+   39,68%     8,32%  07_io_bound_evl  07_io_bound_evloop.exe     [.] detail::ConditionSlotKnowledge::setupSlot
+   36,99%     1,90%  07_io_bound_evl  libstdc++.so.6.0.24        [.] operator new
+   32,67%     6,89%  07_io_bound_evl  libc-2.26.so               [.] malloc
+   26,47%    20,31%  07_io_bound_evl  libc-2.26.so               [.] _int_malloc
+   17,39%     9,89%  07_io_bound_evl  07_io_bound_evloop.exe     [.] std::_Hashtable<int, int, std::allocator<int>, std::__detail::_Identity, std::equal_to<int>, std::hash<int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_trai
+   15,63%    14,44%  07_io_bound_evl  libc-2.26.so               [.] _int_free
+   14,21%     1,85%  07_io_bound_evl  07_io_bound_evloop.exe     [.] std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release
+   10,21%     0,62%  07_io_bound_evl  libstdc++.so.6.0.24        [.] malloc@plt
+    6,11%     0,52%  07_io_bound_evl  07_io_bound_evloop.exe     [.] std::_Hashtable<int, int, std::allocator<int>, std::__detail::_Identity, std::equal_to<int>, std::hash<int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_trai
+    5,76%     0,45%  07_io_bound_evl  07_io_bound_evloop.exe     [.] operator delete@plt
+    5,58%     0,01%  07_io_bound_evl  07_io_bound_evloop.exe     [.] boost::executors::executor_ref<boost::executors::inline_executor>::submit
+    5,54%     0,18%  07_io_bound_evl  07_io_bound_evloop.exe     [.] std::_Sp_counted_ptr_inplace<detail::AnyConditionData const, std::allocator<detail::AnyConditionData>, (__gnu_cxx::_Lock_policy)2>::_M_dispose
+    5,26%     0,03%  07_io_bound_evl  07_io_bound_evloop.exe     [.] boost::detail::shared_state_base::do_continuation
+    4,15%     3,97%  07_io_bound_evl  libc-2.26.so               [.] malloc_consolidate
+    4,02%     0,00%  07_io_bound_evl  07_io_bound_evloop.exe     [.] boost::detail::future_executor_continuation_shared_state<boost::future<ConditionSlotIteration>, boost::future<ConditionSlotIteration>, ConditionSvc::setupConditions(int const&)::{lambda(boost::future<ConditionSlotIteration>&&)#1}>::launch_continuat
+    3,99%     0,01%  07_io_bound_evl  07_io_bound_evloop.exe     [.] boost::detail::nullary_function<void ()>::impl_type<boost::detail::run_it<boost::detail::continuation_shared_state<boost::future<ConditionSlotIteration>, boost::future<ConditionSlotIteration>, ConditionSvc::setupConditions(int const&)::{lambda(boos
+    3,97%     0,04%  07_io_bound_evl  07_io_bound_evloop.exe     [.] ConditionSvc::setupConditions(int const&)::{lambda(boost::future<ConditionSlotIteration>&&)#1}::operator()
+    3,90%     3,63%  07_io_bound_evl  07_io_bound_evloop.exe     [.] BenchmarkIOSvc::startConditionIO
+    3,43%     3,30%  07_io_bound_evl  libc-2.26.so               [.] cfree@GLIBC_2.2.5
+    3,34%     0,01%  07_io_bound_evl  07_io_bound_evloop.exe     [.] boost::detail::continuation_shared_state<boost::future<std::vector<boost::future<void>, std::allocator<boost::future<void> > > >, ConditionSlotIteration, std::_Bind<detail::SingleUseBindWrapper<ConditionSvc::setupConditions(int const&)::{lambda(boo
+    3,32%     0,00%  07_io_bound_evl  07_io_bound_evloop.exe     [.] benchmark
+    3,30%     0,00%  07_io_bound_evl  07_io_bound_evloop.exe     [.] main
+    2,88%     2,71%  07_io_bound_evl  libpthread-2.26.so         [.] __pthread_mutex_lock
+    2,81%     0,00%  07_io_bound_evl  07_io_bound_evloop.exe     [.] boost::detail::nullary_function<void ()>::impl_type<boost::detail::run_it<boost::detail::continuation_shared_state<boost::future<std::vector<boost::future<void>, std::allocator<boost::future<void> > > >, ConditionSlotIteration, std::_Bind<detail::S
+    2,73%     0,21%  07_io_bound_evl  07_io_bound_evloop.exe     [.] pthread_mutex_lock@plt
+    2,59%     2,32%  07_io_bound_evl  libc-2.26.so               [.] tcache_put
+    2,49%     2,22%  07_io_bound_evl  07_io_bound_evloop.exe     [.] std::vector<detail::ReadySlotPromise, std::allocator<detail::ReadySlotPromise> >::~vector
+    2,43%     0,87%  07_io_bound_evl  07_io_bound_evloop.exe     [.] operator new@plt
+    1,95%     1,88%  07_io_bound_evl  libc-2.26.so               [.] tcache_get
+    1,55%     0,01%  07_io_bound_evl  07_io_bound_evloop.exe     [.] boost::detail::future_executor_continuation_shared_state<boost::future<std::vector<boost::future<void>, std::allocator<boost::future<void> > > >, ConditionSlotIteration, std::_Bind<detail::SingleUseBindWrapper<ConditionSvc::setupConditions(int cons
+    1,42%     0,13%  07_io_bound_evl  07_io_bound_evloop.exe     [.] pthread_mutex_unlock@plt
+    1,24%     1,12%  07_io_bound_evl  libpthread-2.26.so         [.] __pthread_mutex_unlock_usercnt
+    1,05%     0,01%  07_io_bound_evl  07_io_bound_evloop.exe     [.] boost::detail::future_executor_continuation_shared_state<boost::future<void>, void, boost::future<detail::future_union_base<__gnu_cxx::__normal_iterator<boost::future<void>*, std::vector<boost::future<void>, std::allocator<boost::future<void> > > >
+    1,02%     0,01%  07_io_bound_evl  07_io_bound_evloop.exe     [.] boost::detail::nullary_function<void ()>::impl_type<boost::detail::run_it<boost::detail::continuation_shared_state<boost::future<void>, void, boost::future<detail::future_union_base<__gnu_cxx::__normal_iterator<boost::future<void>*, std::vector<boo
+    1,00%     0,00%  07_io_bound_evl  07_io_bound_evloop.exe     [.] boost::promise<std::vector<boost::future<void>, std::allocator<boost::future<void> > > >::set_value
[ ... shortened for brevity ... ]

Notice two new things in the report. The first one is the "Children" counter, which tells you which fraction of the elapsed CPU time was spent in a certain function or one of the functions that it calls. This allows you to tell, at a glance, which functions can be held responsible for the most execution time in your program, as opposed to how much time was spent inside each individual function. From a performance analysis perspective, that is much more interesting information than the previously displayed "self time" alone, which is why perf report automatically sorts functions according to this criterion when you enable call graph profiling.
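
If you occasionally prefer the previous flat view, sorted by self time only, perf report can be asked not to accumulate the cost of children (a small sketch using a standard perf report switch):

> perf report --no-children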

The second thing is the little "+" signs in the leftmost column of the report. These signs allow you to recursively explore which functions a given function is calling, using the interactive perf report UI. For example, in the following text block, I have explored where the "simulateEventLoop" function spends its time, and found out that a non-negligible fraction of it was spent inserting elements into a hash table (unordered_map in C++ terms), which in turn caused a nontrivial fraction of my dynamic memory allocations. Another time sink was the release of reference-counted data (via std::shared_ptr), which caused expensive atomic operations and eventual memory deallocation.

Samples: 208K of event 'cycles:uppp', Event count (approx.): 105972566936                                                                                                                                                                                                                                                     
  Children      Self  Command          Shared Object              Symbol                                                                                                                                                                                                                                                     ?
+   46,21%     8,88%  07_io_bound_evl  07_io_bound_evloop.exe     [.] boost::detail::nullary_function<void ()>::impl_type<boost::detail::shared_state_nullary_task<void, boost::detail::invoker<std::_Bind<detail::SingleUseBindWrapper<BenchmarkIOSvc::startConditionIO(int const&, ConditionSlotIteration const&)::{lambda(?
+   43,04%     0,02%  07_io_bound_evl  07_io_bound_evloop.exe     [.] boost::executors::basic_thread_pool::worker_thread                                                                                                                                                                                                     ?
-   41,85%     0,04%  07_io_bound_evl  07_io_bound_evloop.exe     [.] detail::SequentialScheduler::simulateEventLoop                                                                                                                                                                                                         ?
   - 41,80% detail::SequentialScheduler::simulateEventLoop                                                                                                                                                                                                                                                                   ?
      - 37,31% detail::ConditionSlotKnowledge::setupSlot                                                                                                                                                                                                                                                                     ?
         - 16,18% std::_Hashtable<int, int, std::allocator<int>, std::__detail::_Identity, std::equal_to<int>, std::hash<int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, true, true> >::_M_insert<int, std::__deta?
            + 5,18% operator new                                                                                                                                                                                                                                                                                             ?
              1,03% operator new@plt                                                                                                                                                                                                                                                                                         ?
         + 13,28% std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release                                                                                                                                                                                                                                              ?
      + 3,85% boost::detail::shared_state_base::do_continuation                                                                                                                                                                                                                                                              ?
+   39,68%     8,32%  07_io_bound_evl  07_io_bound_evloop.exe     [.] detail::ConditionSlotKnowledge::setupSlot                                                                                                                                                                                                              ?
+   36,99%     1,90%  07_io_bound_evl  libstdc++.so.6.0.24        [.] operator new                                                                                                                                                                                                                                           ?
+   32,67%     6,89%  07_io_bound_evl  libc-2.26.so               [.] malloc                                                                                                                                                                                                                                                 ?
+   26,47%    20,31%  07_io_bound_evl  libc-2.26.so               [.] _int_malloc                                                                                                                                                                                                                                            ?
+   17,39%     9,89%  07_io_bound_evl  07_io_bound_evloop.exe     [.] std::_Hashtable<int, int, std::allocator<int>, std::__detail::_Identity, std::equal_to<int>, std::hash<int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_tra?
+   15,63%    14,44%  07_io_bound_evl  libc-2.26.so               [.] _int_free                                                                                                                                                                                                                                              ?
+   14,21%     1,85%  07_io_bound_evl  07_io_bound_evloop.exe     [.] std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release                                                                                                                                                                                          ?
+   10,21%     0,62%  07_io_bound_evl  libstdc++.so.6.0.24        [.] malloc@plt

This example also highlights one open issue with profiling C++ code, which is that the function names of C++ methods in idiomatic libraries such as the STL or boost can be gigantic, far away from the API names that you are used to calling from your code, and in general hard to read. Unfortunately, there is no good solution to this problem; the best that one can do is usually to look for interesting keywords in the long-winded C++ name (_Hashtable and _M_insert in the example above) and try to associate them with specific patterns in the corresponding function's code.


This concludes this short introduction to perf. Here, we have only scratched the surface of what perf can do. Other interesting topics could have included...

  • Displaying annotated source code and assembly, in order to tell which part of a given function, exactly, takes time (bearing in mind that optimizing compilers can transform the source code of the original function quite tremendously, which makes this analysis somewhat difficult and advanced).
  • Measuring program activity every N-th occurrence of a given event (e.g. L1 cache miss) instead of periodically, in order to more precisely pinpoint where in the code the event is occurring.
  • The great many performance counters available on modern CPUs, which ones are most useful, and how their values should be interpreted.
  • System-wide profiling, which allows studying what happens to threads even when they "fall asleep" and call the operating system's kernel in order to perform IO or lock a mutex.

...but this would go beyond the scope of this introductory TWiki page. For more detailed information on Linux perf, highly recommended sources of information and "cheat sheets" include the man pages of the various perf utilities and http://www.brendangregg.com/perf.html .

Intel VTune Amplifier

VTune uses the same performance analysis techniques as perf, but is commercially supported by Intel. This comes with different trade-offs: you must buy a very expensive (~$1k) license if you want to use it on your personal computer, but as long as you are able to rely on the licenses that are provided by CERN Openlab (or perhaps your institution), you will be able to enjoy a very nice and powerful graphical user interface, high-quality support and documentation from Intel, and periodic tutorials from Openlab. Obviously, you shouldn't expect it to work reliably on any CPU which has not been manufactured by Intel.

A Gaudi auditor has been provided by Sascha Mazurov to interface to the VTune Intel profiler. See IntelProfiler, IntelProfilerExample and Video tutorial on profiler installation in Gaudi, running and analyzing it from command line (without GUI)

See https://twiki.cern.ch/twiki/bin/view/Openlab/IntelTools

 

Memory Profiling with Jemalloc

Added:
>
>
 Since Gaudi v26r3 it's possible to use Jemalloc profiling tools to audit memory allocations. Instructions can be found in Gaudi doxygen pages.

To view the diff between two memory dumps you can use the jeprof tool:

Line: 392 to 581
  To view the diff between two memory dumps you can use the jeprof tool:
Changed:
<
<
lb-run --ext jemalloc LCG/latest jeprof --evince --base=<first>.heap <executable> <second>.heap
>
>
lb-run --ext jemalloc LCG/latest jeprof --evince --base=<first>.heap <executable> <second>.heap
 

Debug and Optimised Libraries

Line: 399 to 586
 

Debug and Optimised Libraries

To build and run in debug mode, run

Deleted:
<
<
 
Changed:
<
<
> LbLogin -c $CMTDEB
>
>
> LbLogin -c $CMTDEB
  Before any SetupProject calls.
Line: 422 to 607
 
Added:
>
>
 

Revision 1152017-11-28 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 154 to 154
  before any SetupProject calls.
Added:
>
>

Cache Profiling

The valgrind tool called "cachegrind" provides a simulation of the CPU caches and thus is a cache and branch-prediction profiler. In the simplest case it can be activated with

 > valgrind --tool=cachegrind application

See the cachegrind manual for more details.
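
By default cachegrind writes its results to a file called cachegrind.out.<pid> in the current directory; a quick per-function summary can then be obtained with (a sketch, the actual file name depends on the process ID):

> cg_annotate cachegrind.out.<pid>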

 

Code Profiling

A valgrind tool called "callgrind" also exists which does some detailed code profiling. For valgrind versions 3.1.1 and earlier, callgrind is not included in the main valgrind package and must be installed separately. As of valgrind 3.2.0, callgrind was integrated into the mainstream valgrind package. Full documentation of this tool is available in section 6 of the valgrind user guide.

Line: 172 to 182
  The options --dump-instr=yes --trace-jump=yes are also useful as they provide more information.
Added:
>
>
If you wish to include in the callgrind output the same cache profiling information as provided by cachegrind, also include the options --cache-sim=yes --branch-sim=yes
One very useful option is "--dump-before", which causes an output file to be written before a particular method is called. Using this with, for instance, BooleInit::execute allows the creation of one dump per event, which can then be read in individually. For example, a full command line could be, for Boole
Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-before="BooleInit::execute()" python `which gaudirun.py` options.py
>
>
> valgrind --tool=callgrind -v --cache-sim=yes --branch-sim=yes --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-before="BooleInit::execute()" python `which gaudirun.py` options.py
 

or for Brunel

Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-before="RecInit::execute()" python `which gaudirun.py` options.py
>
>
> valgrind --tool=callgrind -v --cache-sim=yes --branch-sim=yes --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-before="RecInit::execute()" python `which gaudirun.py` options.py
 

and finally for DaVinci

Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-before="DaVinciInit::execute()" python `which gaudirun.py` options.py
>
>
> valgrind --tool=callgrind -v --cache-sim=yes --branch-sim=yes --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-before="DaVinciInit::execute()" python `which gaudirun.py` options.py
 

Alternatively, it is sometimes more useful to average the profiling information over several events in order to get a better overall picture. This can be done using the following options, which will produce one dump for the initialize() phase, one for execute() and a third for finalize().

Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-after="ApplicationMgr::start()" --dump-before="ApplicationMgr::stop()" python `which gaudirun.py` options.py
>
>
> valgrind --tool=callgrind -v --cache-sim=yes --branch-sim=yes --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-after="ApplicationMgr::start()" --dump-before="ApplicationMgr::stop()" python `which gaudirun.py` options.py
 

Also note, valgrind generally requires a large amount of memory and CPU, and thus you may run into problems with the CPU time and virtual memory size limits in place on the general purpose lxplus nodes. If this happens, try using a private afs box (e.g. pclhcbXX.cern.ch) if such a machine is available to you. If not, you could try running on one of the lxbuildXXX machines (XXX = 035 or 042, although we should avoid all running at the same time on the same node ...).

Revision 1142017-07-17 - RosenMatev

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 378 to 378
 

Memory Profiling with Jemalloc

Since Gaudi v26r3 it's possible to use Jemalloc profiling tools to audit memory allocations. Instructions can be found in Gaudi doxygen pages.
Deleted:
<
<
A recent version of jemalloc is available on AFS in the location below; to use it, do:
export PROF_HOME=/afs/cern.ch/lhcb/group/profiling
export LD_LIBRARY_PATH=${PROF_HOME}/lib:$LD_LIBRARY_PATH
export PATH=${PROF_HOME}/bin:$PATH
 To view the diff between two memory dumps you can use the jeprof tool:
Changed:
<
<
jeprof --evince --base=<first>.heap <executable> <second>.heap
>
>
lb-run --ext jemalloc LCG/latest jeprof --evince --base=<first>.heap <executable> <second>.heap
 

Revision 1132017-06-28 - RosenMatev

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 376 to 376
 See https://twiki.cern.ch/twiki/bin/view/Openlab/IntelTools

Memory Profiling with Jemalloc

Changed:
<
<
Since Gaudi v26r3 it's possible to use Jemalloc profiling tools to audit memory allocations. Instructions can be found in Gaudi doxygen pages.
>
>
Since Gaudi v26r3 it's possible to use Jemalloc profiling tools to audit memory allocations. Instructions can be found in Gaudi doxygen pages.
  A recent version of jemalloc is available on AFS in, to use you do:

Revision 1122017-05-19 - MarcoCattaneo

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 415 to 415
 

Further reading

Added:
>
>
 

Revision 1112017-01-27 - PaulSeyfert

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 91 to 91
 Since Gaudi v23r7, a new way to invoke profilers has been introduced in Gaudi: the new --profilerName option allows to directly call valgrind (or igprof) on the Gaudi process. It is also possible to pass options directly to valgrind, with --profileExtraOptions.
Changed:
<
<
NB1: The profiler name is valgrind+tool (e.g. "valgrindmemcheck" or "valgrindmassif" or "valgrindcallgrind"), see the gaudirun.py help for more details.
>
>
NB1: The profiler name is valgrind+tool (e.g. "valgrindmemcheck" or "valgrindmassif" or "valgrindcallgrind"), see the gaudirun.py --help for more details.
 NB2: In the profileExtraOptions "--" have to be replaced by "__" to avoid issues with the option parser.
Added:
>
>
 NB3: valgrind needs to be in the path for this option to run.

Revision 1102017-01-19 - MarianStahl

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 374 to 374
 See https://twiki.cern.ch/twiki/bin/view/Openlab/IntelTools

Memory Profiling with Jemalloc

Changed:
<
<
Since Gaudi v26r3 it's possible to use Jemalloc profiling tools to audit memory allocations. Instructions can be found in Gaudi doxygen pages.
>
>
Since Gaudi v26r3 it's possible to use Jemalloc profiling tools to audit memory allocations. Instructions can be found in Gaudi doxygen pages.
  A recent version of jemalloc is available on AFS in, to use you do:

Revision 1092016-07-12 - DanielCampora

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 278 to 278
 There is a JIRA task to make newer gdb available on login. Until it is finished, please use gdb from CVMFS or AFS by one of:
Changed:
<
<
export PATH=/cvmfs/sft.cern.ch/lcg/external/gdb/7.6/x86_64-slc6-gcc48-opt/bin:$PATH export PATH=/afs/cern.ch/sw/lcg/external/gdb/7.6/x86_64-slc6-gcc48-opt/bin:$PATH
>
>
export PATH=/cvmfs/lhcb.cern.ch/lib/contrib/gdb/7.11/x86_64-slc6-gcc49-opt/bin:$PATH or
export PATH=/afs/cern.ch/sw/lcg/external/gdb/7.11/x86_64-slc6-gcc48-opt/bin:$PATH
 

Alternatively, gaudirun.py applications can be run through the gdb debugger using a similar trick as with valgrind, to call python directly and pass gaudirun.py as an argument. Just type

Revision 1082016-03-15 - MarcoCattaneo

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 376 to 376
 A recent version of jemalloc is available on AFS in the location below; to use it, do:
export PROF_HOME=/afs/cern.ch/lhcb/group/profiling
Changed:
<
<
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${PROF_HOME}/lib export PATH=$PATH:${PROF_HOME}/bin
>
>
export LD_LIBRARY_PATH=${PROF_HOME}/lib:$LD_LIBRARY_PATH export PATH=${PROF_HOME}/bin:$PATH
 

Revision 1072016-01-08 - RosenMatev

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 269 to 269
 

Debugging gaudirun.py on Linux with gdb

Changed:
<
<
gaudirun.py applications can be run through the gdb debugger using a similar trick as with valgrind, to call python directly and pass gaudirun.py as an argument. Just type
>
>
The easiest way to debug with gdb is to use the built-in --gdb flag of gaudirun.py
 > gaudirun.py --gdb yourOptions.py
 
Added:
>
>
Note: Currently, the default gdb on lxplus is too old to be useful for gcc 4.8 (or later) builds. There is a JIRA task to make newer gdb available on login. Until it is finished, please use gdb from CVMFS or AFS by one of:
 
Changed:
<
<
> gdb --args python `which gaudirun.py` yourOptions.py
>
>
export PATH=/cvmfs/sft.cern.ch/lcg/external/gdb/7.6/x86_64-slc6-gcc48-opt/bin:$PATH export PATH=/afs/cern.ch/sw/lcg/external/gdb/7.6/x86_64-slc6-gcc48-opt/bin:$PATH
 
Added:
>
>
Alternatively, gaudirun.py applications can be run through the gdb debugger using a similar trick as with valgrind, to call python directly and pass gaudirun.py as an argument. Just type
 > gdb --args python `which gaudirun.py` yourOptions.py 
 and then to run the application then simply type run at the gdb command line. ( The option --args tells gdb to interpret any additional options after the executable name as arguments to that application, instead of the default which is to try and interpret them as core files...)

When it crashes, type where to get a traceback at that point.

Revision 1062015-10-29 - BenjaminCouturier

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 362 to 362
 

Memory Profiling with Jemalloc

Since Gaudi v26r3 it's possible to use Jemalloc profiling tools to audit memory allocations. Instructions can be found in Gaudi doxygen pages.
Added:
>
>
A recent version of jemalloc is available on AFS in the location below; to use it, do:
export PROF_HOME=/afs/cern.ch/lhcb/group/profiling
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${PROF_HOME}/lib
export PATH=$PATH:${PROF_HOME}/bin

To view the diff between two memory dumps you can use the jeprof tool:

jeprof --evince  --base=<first>.heap <executable> <second>.heap

 

Debug and Optimised Libraries

To build and run in debug mode, run

Revision 1052015-10-29 - MarcoClemencic

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 359 to 359
  See https://twiki.cern.ch/twiki/bin/view/Openlab/IntelTools
Added:
>
>

Memory Profiling with Jemalloc

Since Gaudi v26r3 it's possible to use Jemalloc profiling tools to audit memory allocations. Instructions can be found in Gaudi doxygen pages.
 

Debug and Optimised Libraries

To build and run in debug mode, run

Revision 1042015-04-02 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 114 to 114
  The main use of valgrind is to perform memory tests of your code for things like memory corruptions, memory leaks and uninitialised variables, using the memcheck valgrind tool. Full documentation of this tool is available in section 4 of the valgrind user guide.
Changed:
<
<
It is useful to add a few additional options, to improve the quality of the output. The options =-v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --track-origins=yes = increase the amount of information valgrind supplies.
>
>
It is useful to add a few additional options, to improve the quality of the output. The options
-v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --track-origins=yes
increase the amount of information valgrind supplies.
  In addition, there are some warnings which are well known about (for instance from the Gaudi framework or even third-party libraries, like ROOT, STL or boost) and it is best to suppress these, otherwise the amount of output to deal with is huge. This can be done with --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp.

Revision 1032015-02-21 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 196 to 196
  Using kcachegrind takes some getting used to. One of the first things to do is add your code location (e.g. ~/cmtuser) to the list of known locations (settings -> configure kcachegrind). The kcachegrind web page contains more hints on getting started.
Added:
>
>

Remote Control

Sometimes it's useful to be able to start and stop profiling by hand. This can be done by first passing the option

--instr-atstart=no

to the valgrind command used to start the application. This will start the process running, but profiling will not start until you run (on the same machine, so start a second terminal) the command

> callgrind_control --instr=on

later on you can then stop the profiling when you wish, using

> callgrind_control --instr=off
 

Alternative approach

Callgrind can also be configured from within job options:

Revision 1022015-02-13 - GloriaCorti

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 314 to 314
 
 > emacs
 M-X gdb  (or Esc-X on many keyboards)
Changed:
<
<
/afs/cern.ch/sw/lcg/contrib/gdb/7.6/x86_64-slc6-gcc48-opt/bin/gdb python
>
>
gdb python
  (gdb) run `which gaudirun.py` yourOptions.py
Added:
>
>
At CERN, gdb python may give you an error; if that is the case you should do
/afs/cern.ch/sw/lcg/contrib/gdb/7.6/x86_64-slc6-gcc48-opt/bin/gdb python
  You can use the emacs toolbar to set break in lines, unset them and issue debugger commands, or you can pass them as command lines at the (gdb) prompt. In which case here are a couple of useful short-cuts:
  • (gdb) Ctrl- up-arrow/down-arrow allows to navigate through the commands you already typed

Revision 1012015-02-13 - GloriaCorti

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 314 to 314
 
 > emacs
 M-X gdb  (or Esc-X on many keyboards)
Changed:
<
<
gdb python
>
>
/afs/cern.ch/sw/lcg/contrib/gdb/7.6/x86_64-slc6-gcc48-opt/bin/gdb python
  (gdb) run `which gaudirun.py` yourOptions.py

Revision 1002015-01-30 - PaulSeyfert

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 196 to 196
  Using kcachegrind takes some getting used to. One of the first things to do is add your code location (e.g. ~/cmtuser) to the list of known locations (settings -> configure kcachegrind). The kcachegrind web page contains more hints on getting started.
Added:
>
>

Alternative approach

Callgrind can also be configured from within job options:

 # The imports below are usually pulled in by the standard Gaudi options preamble,
 # but are shown explicitly so that the snippet is self-contained:
 from Gaudi.Configuration import appendPostConfigAction
 from Configurables import CallgrindProfile, GaudiSequencer
 def addProfile():
     p = CallgrindProfile('CallgrindProfile')
     p.StartFromEventN = 40
     p.StopAtEventN = 90
     p.DumpAtEventN = 90
     p.DumpName = 'CALLGRIND-OUT'
     GaudiSequencer('RecoTrSeq').Members.insert(0, p)
 appendPostConfigAction(addProfile)

gaudirun.py --profilerName=valgrindcallgrind --profilerExtraOptions="__instr-atstart=no -v __smc-check=all-non-file __dump-instr=yes __trace-jump=yes" joboptions.py |& tee out-0005.log
 

Memory Usage Monitoring

The valgrind tool called "massif" also exists which does some detialed memory usage monitoring. Full documentation of this tool is available in section 9 of the valgrind user guide.

Revision 992015-01-27 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 125 to 125
 DaVinci has the specific suppression file $DAVINCIROOT/job/DaVinci.supp, so the final full command for DaVinci is then :-
Changed:
<
<
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --gen-suppressions=all --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp --suppressions=$STDOPTS/LHCb.supp --suppressions=$DAVINCIROOT/job/DaVinci.supp python `which gaudirun.py` options.py
>
>
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --smc-check=all-non-file --gen-suppressions=all --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp --suppressions=$STDOPTS/LHCb.supp --suppressions=$DAVINCIROOT/job/DaVinci.supp python `which gaudirun.py` options.py
 

For Brunel, there is an additional suppressions file you can apply $BRUNELROOT/job/Brunel.supp giving :-

Changed:
<
<
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --gen-suppressions=all --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp --suppressions=$STDOPTS/LHCb.supp --suppressions=$BRUNELROOT/job/Brunel.supp python `which gaudirun.py` options.py
>
>
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --smc-check=all-non-file --gen-suppressions=all --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp --suppressions=$STDOPTS/LHCb.supp --suppressions=$BRUNELROOT/job/Brunel.supp python `which gaudirun.py` options.py
 

The above commands will send the output to the terminal. It is best to redirect this to file. One additional complication is the valgrind output is sent to STDERR, not STDOUT, so you will need to redirect both to file... This can be done (for bash like shells) by appending &>mem.log to the full valgrind command (for csh like shells use >& mem.log).

Line: 169 to 169
 One very useful option is "--dump-before" which can be used for the creation of an output file before calling particular methods. Using this with for instance BooleInit::execute allows the creation of one dump per event, which can then be read in individually. I.e. a full command line could be, for Boole
Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="BooleInit::execute()" python `which gaudirun.py` options.py
>
>
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-before="BooleInit::execute()" python `which gaudirun.py` options.py
 

or for Brunel

Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="RecInit::execute()" python `which gaudirun.py` options.py
>
>
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-before="RecInit::execute()" python `which gaudirun.py` options.py
 

and finally for DaVinci

Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="DaVinciInit::execute()" python `which gaudirun.py` options.py
>
>
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-before="DaVinciInit::execute()" python `which gaudirun.py` options.py
 

Alternatively, it is sometimes more useful to average the profiling information over several events in order to get a better overall picture. This can be done using the following options, which will produce one dump for the initialize() phase, one for execute() and a third for finalize().

Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-after="ApplicationMgr::start()" --dump-before="ApplicationMgr::stop()" python `which gaudirun.py` options.py
>
>
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --smc-check=all-non-file --dump-after="ApplicationMgr::start()" --dump-before="ApplicationMgr::stop()" python `which gaudirun.py` options.py
 

Also note, valgrind generally requires a large amount of memory and CPU, and thus you may run into problems with the CPU time and virtual memory size limits in place on the general purpose lxplus nodes. If this happens, try using a private afs box (e.g. pclhcbXX.cern.ch) if such a machine is available to you. If not, you could try running on one of the lxbuildXXX machines (XXX = 035 or 042, although we should avoid all running at the same time on the same node ...).

Revision 982014-10-24 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 242 to 242
  before SetupProject etc.
Added:
>
>

Attaching GDB to a running process

GDB can be used to debug an already running process, which can be useful for instance to investigate hung applications. Just set up a second terminal with the same software environment as the one you wish to debug. Running the DEBUG builds is highly advised, as then you will get line numbers in any traceback information.

Then, identify the process ID for the job you wish to investigate, e.g.

 > ps x | grep gaudirun
 4200 pts/7    S+     0:00 grep gaudirun
32081 pts/6    Rl+   46:05 /cvmfs/lhcb.cern.ch/lib/lcg/releases/LCG_68/Python/2.7.6/x86_64-slc6-gcc48-opt/bin/python /cvmfs/lhcb.cern.ch/lib/lhcb/GAUDI/GAUDI_v25r2/InstallArea/x86_64-slc6-gcc48-dbg/scripts/gaudirun.py <snip>

So in this case it's 32081. Then it is just a matter of running :-

 > gdb `which python` 32081

This will take a short while, loading libraries etc. When done you can then investigate, for instance :-

(gdb) where
#0  0x00007fd11e5c68ac in G4PhysicsOrderedFreeVector::FindBinLocation(double) const () at ../management/src/G4PhysicsOrderedFreeVector.cc:165
#1  0x00007fd11e5cc6f4 in G4PhysicsVector::ComputeValue(double) () at ../management/src/G4PhysicsVector.cc:531
#2  0x00007fd119e5a302 in RichG4Cerenkov::PostStepDoIt(G4Track const&, G4Step const&) ()
    at /afs/cern.ch/lhcb/software/releases/GEANT4/GEANT4_v95r2p7g2/InstallArea/include/G4PhysicsVector.icc:68
#3  0x00007fd11e588130 in G4SteppingManager::InvokePSDIP(unsigned long) () at ../src/G4SteppingManager2.cc:525
<snip>

Note that the process being debugged will be paused whilst GDB is attached. If you quit GDB, it will then continue running, and if you wish you can reattach again at a later stage.

 

Scripting GDB with python

With recent versions of GDB, it is possible to script the behavior of the debugger. This is quite handy when many repetitive tasks are needed in the debugger itself, for example when debugging multi-threaded applications. Since GDB 7.0 (which is present on lxplus), it has been distributed with python modules interacting directly with the GDB internals. One way to check that GDB has support for this is to do:

Revision 972014-08-07 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 99 to 99
 gaudirun.py --profilerName=valgrindmemcheck --profilerExtraOptions="-v __leak-check=yes __leak-check=full __show-reachable=yes __suppressions=$ROOTSYS/etc/valgrind-root.supp __suppressions=$STDOPTS/valgrind-python.supp __suppressions=$STDOPTS/Gaudi.supp" options.py
Added:
>
>

Working with ROOT6

ROOT6 uses clang internally as its interpreter, and this causes valgrind some problems due to its use of 'self modifying code'. To deal with this the additional valgrind option --smc-check should be used. Add

--smc-check=all-non-file

To your set of command line options. This unfortunately makes valgrind much slower, but at least it works... There is perhaps some hope that in the future, when ROOT enables full support for clang's pre-compiled modules, the need for this might go away...

 

Memory Tests

The main use of valgrind is to perform memory tests of your code for things like memory corruptions, memory leaks and uninitialised variables, using the memcheck valgrind tool. Full documentation of this tool is available in section 4 of the valgrind user guide.

Revision 962014-05-20 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Revision 952014-05-15 - BenjaminCouturier

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 36 to 36
 

Usage at CERN

Changed:
<
<

Using Valgrind from LCG Externals

>
>

System Valgrind

 
Changed:
<
<
Valgrind ships with the LCG externals on AFS at CERN, and SetupProject allows adding Valgrind to the environment of your application, e.g. running:
>
>
Check if you have "/usr/bin/valgrind" on your machine. If not, on a desktop it can be installed with: "sudo yum install valgrind"
 
Changed:
<
<
SetupProject DaVinci valgrind

will set up the environment for DaVinci, adding valgrind to the PATH/LD_LIBRARY_PATH.

Alternative installation

>
>

Alternative Installation on machines with AFS

  The version of valgrind installed by default on lxplus has problems running with applications that require reasonably large amounts of memory due to bugs in its internal memory model. It also cannot run on the normal lxplus nodes due to the virtual memory limit in place - e.g. you will see
Line: 96 to 91
 Since Gaudi v23r7, a new way to invoke profilers has been introduced in Gaudi: the new --profilerName option allows to directly call valgrind (or igprof) on the Gaudi process. It is also possible to pass options directly to valgrind, with --profileExtraOptions.
Changed:
<
<
NB1: The profiler name is valgrind+tool (e.g. "valgrindmemcheck" or "valgrindmassif"), see th gaudirun.py help for more details. NB2: In the profileExtraOptions "--" have to be replaced by "__" to avoid issues with the otpion parser. NB3: valgrind needs to be in the path for this option to run. If it is not installed on the machine, "SetupProject Application valgrind" is necessary.
>
>
NB1: The profiler name is valgrind+tool (e.g. "valgrindmemcheck" or "valgrindmassif" or "valgrindcallgrind"), see the gaudirun.py help for more details. NB2: In the profileExtraOptions "--" have to be replaced by "__" to avoid issues with the option parser. NB3: valgrind needs to be in the path for this option to run.
 
gaudirun.py --profilerName=valgrindmemcheck --profilerExtraOptions="-v __leak-check=yes  __leak-check=full __show-reachable=yes __suppressions=$ROOTSYS/etc/valgrind-root.supp __suppressions=$STDOPTS/valgrind-python.supp __suppressions=$STDOPTS/Gaudi.supp" options.py
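The same mechanism works for the other valgrind tools listed above; for example, a sketch for massif (remembering the "__" convention for passing options through to valgrind):

 gaudirun.py --profilerName=valgrindmassif --profilerExtraOptions="__max-snapshots=1000" options.py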

Revision 942014-04-15 - RobLambert

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 9 to 9
 

LHCb dedicated Profiling and Regression test service

LHCbPR is LHCb's own project to track performance issues in the general framework and the projects. Its aim is to improve the refactoring process during 2013-2015 and help developers to write performant code. Development of LHCbPR is ongoing.

Added:
>
>

Simplifying the problem, with GaudiDiff and GaudiExcise

The first step in debugging is often to provide the simplest reproducible example of the problem you've encountered. Tools are available to help you with that.

  • To run only one algorithm from within a complicated sequence, take a look at GaudiExcise
  • To see what the differences really are between two Gaudi jobs, take a look at GaudiDiff
 

The "valgrind, callgrind and kcachegrind" Utilities

Valgrind is a general purpose utility for analyzing software. It contains various "tools" that perform tasks such as memory allocation checking, heap analysis and code profiling.

Revision 932014-03-10 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 104 to 104
  In addition, there are some warnings which are well known (for instance from the Gaudi framework or even third party libraries, like ROOT, STL or boost) and it is best to suppress these, otherwise the amount of output to deal with is huge. This can be done with --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp.
Added:
>
>
To suppress some general warnings from the LHCb stack, not specific to a particular application, include --suppressions=$STDOPTS/LHCb.supp.
 To generate template suppression blocks for new warnings in the output log, add the option --gen-suppressions=all to the command line arguments.
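Each new warning is then followed in the log by a template block roughly of the form below, which can be pasted into a suppression file; the fun:/obj: frame lines are taken from the actual stack, so the ones shown here are only placeholders:

{
   <insert_a_suppression_name_here>
   Memcheck:Leak
   fun:malloc
   ...
   obj:*/libSomeLHCbComponent.so
}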

DaVinci has the specific suppression file $DAVINCIROOT/job/DaVinci.supp, so the final full command for DaVinci is then :-

Changed:
<
<
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --gen-suppressions=all --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp --suppressions=$DAVINCIROOT/job/DaVinci.supp python `which gaudirun.py` options.py
>
>
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --gen-suppressions=all --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp --suppressions=$STDOPTS/LHCb.supp --suppressions=$DAVINCIROOT/job/DaVinci.supp python `which gaudirun.py` options.py
 

For Brunel, there is an additional suppressions file you can apply $BRUNELROOT/job/Brunel.supp giving :-

Changed:
<
<
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --gen-suppressions=all --suppressions=$BRUNELROOT/job/Brunel.supp --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp python `which gaudirun.py` options.py
>
>
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --gen-suppressions=all --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp --suppressions=$STDOPTS/LHCb.supp --suppressions=$BRUNELROOT/job/Brunel.supp python `which gaudirun.py` options.py
 

The above commands will send the output to the terminal. It is best to redirect this to file. One additional complication is the valgrind output is sent to STDERR, not STDOUT, so you will need to redirect both to file... This can be done (for bash like shells) by appending &>mem.log to the full valgrind command (for csh like shells use >& mem.log).
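For example (the [options] placeholder stands for the full option list shown above):

 > valgrind --tool=memcheck [options] python `which gaudirun.py` options.py &> mem.log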

Revision 922014-03-09 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 106 to 106
  To generate template suppression blocks for new warnings in the output log, add the option --gen-suppressions=all to the command line arguments.
Deleted:
<
<
To save the output to an XML file, use --child-silent-after-fork=yes --xml=yes --xml-file=memcheck.xml as well.
 DaVinci has the specific suppression file $DAVINCIROOT/job/DaVinci.supp, so the final full command for DaVinci is then :-
Changed:
<
<
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --child-silent-after-fork=yes --xml=yes --xml-file=memcheck.xml --gen-suppressions=all --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp --suppressions=$DAVINCIROOT/job/DaVinci.supp python `which gaudirun.py` options.py
>
>
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --gen-suppressions=all --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp --suppressions=$DAVINCIROOT/job/DaVinci.supp python `which gaudirun.py` options.py
 

For Brunel, there is an additional suppressions file you can apply $BRUNELROOT/job/Brunel.supp giving :-

Changed:
<
<
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --child-silent-after-fork=yes --xml=yes --xml-file=memcheck.xml --gen-suppressions=all --suppressions=$BRUNELROOT/job/Brunel.supp --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp python `which gaudirun.py` options.py
>
>
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --gen-suppressions=all --suppressions=$BRUNELROOT/job/Brunel.supp --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp python `which gaudirun.py` options.py
 

The above commands will send the output to the terminal. It is best to redirect this to file. One additional complication is the valgrind output is sent to STDERR, not STDOUT, so you will need to redirect both to file... This can be done (for bash like shells) by appending &>mem.log to the full valgrind command (for csh like shells use >& mem.log).

Line: 202 to 200
 

Understanding the output

Changed:
<
<
The Atlas valgrind TWiki contains hints on interpreting the valgrind output. If you have access to the Atlas TWiKis, go to https://twiki.cern.ch/twiki/bin/view/Atlas/UsingValgrind. If not, this link is a snapshot of the Atlas valgrind TWiKi, taken on 10th March 2013
>
>
There is a selection of applications that can be used to analyse valgrind's output. This link contains some information on these. Alleyoop and Valkyrie are installed as part of the setup above.

Valkyrie requires an XML output file. To save the output to an XML file, use the command line options --child-silent-after-fork=yes --xml=yes --xml-file=memcheck.xml as well as any others you wish to use.

This link is a snapshot of the Atlas valgrind TWiki, that contains some useful information.

 

Debugging gaudirun.py on Linux with gdb

Revision 912014-03-09 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 106 to 106
  To generate template suppression blocks for new warnings in the output log, add the option --gen-suppressions=all to the command line arguments.
Added:
>
>
To save the output to an XML file, use --child-silent-after-fork=yes --xml=yes --xml-file=memcheck.xml as well.
 DaVinci has the specific suppression file $DAVINCIROOT/job/DaVinci.supp, so the final full command for DaVinci is then :-
Changed:
<
<
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --gen-suppressions=all --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp --suppressions=$DAVINCIROOT/job/DaVinci.supp python `which gaudirun.py` options.py
>
>
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --child-silent-after-fork=yes --xml=yes --xml-file=memcheck.xml --gen-suppressions=all --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp --suppressions=$DAVINCIROOT/job/DaVinci.supp python `which gaudirun.py` options.py
 

For Brunel, there is an additional suppressions file you can apply $BRUNELROOT/job/Brunel.supp giving :-

Changed:
<
<
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --gen-suppressions=all --suppressions=$BRUNELROOT/job/Brunel.supp --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp python `which gaudirun.py` options.py
>
>
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --child-silent-after-fork=yes --xml=yes --xml-file=memcheck.xml --gen-suppressions=all --suppressions=$BRUNELROOT/job/Brunel.supp --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp python `which gaudirun.py` options.py
 

The above commands will send the output to the terminal. It is best to redirect this to file. One additional complication is the valgrind output is sent to STDERR, not STDOUT, so you will need to redirect both to file... This can be done (for bash like shells) by appending &>mem.log to the full valgrind command (for csh like shells use >& mem.log).

Revision 902014-01-23 - BenjaminCouturier

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 93 to 93
 NB3: valgrind needs to be in the path for this option to run. If it is not installed on the machine, "SetupProject Application valgrind" is necessary.
Changed:
<
<
> gaudirun.py --profilerName=valgrindmemchec --profilerExtraOptions="-v __leak-check=yes __leak-check=full __show-reachable=yes __suppressions=$ROOTSYS/etc/valgrind-root.supp __suppressions=$STDOPTS/valgrind-python.supp __suppressions=$STDOPTS/Gaudi.supp" options.py
>
>
gaudirun.py --profilerName=valgrindmemcheck --profilerExtraOptions="-v __leak-check=yes __leak-check=full __show-reachable=yes __suppressions=$ROOTSYS/etc/valgrind-root.supp __suppressions=$STDOPTS/valgrind-python.supp __suppressions=$STDOPTS/Gaudi.supp" options.py
 

Memory Tests

Revision 892013-12-06 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 124 to 124
  This will produce a large output file containing detailed information on any memory problems. You can either simply read this file directly, or, if you prefer, use the alleyoop application to help interpret the errors.
Added:
>
>
If possible use a debug build (see below) as that will provide more information. This can be done by running

 > LbLogin -c $CMTDEB

before any SetupProject calls.

 

Code Profiling

A valgrind tool called "callgrind" also exists which does some detailed code profiling. For valgrind versions 3.1.1 and earlier callgrind is not included in the main valgrind package; it must be installed separately. As of valgrind 3.2.0, callgrind was integrated into the mainstream valgrind package. Full documentation of this tool is available in section 6 of the valgrind user guide.

Line: 252 to 260
 

Debug and Optimised Libraries

Added:
>
>
To build and run in debug mode, run

 > LbLogin -c $CMTDEB

Before any SetupProject calls.

 Utilities like callgrind, Google PerfTools and gdb can work on both optimised ( CMTCONFIG) and un-optimised ( CMTDEB) builds, although when running on optimised builds bear in mind you will not get all possible information. For instance inline functions, by their nature, are optimised away (those which are actually inlined by the compiler) and thus are not seen by the utility. Similarly you will not get annotated source code listings.

This does not happen in un-optimised (debug) builds, where all information is available, but of course you must then bear in mind that the two builds are different, and for some studies like profiling (code timing) the information you get will be different. In these cases a good approach is to first try running the normal optimised build. If you find you want finer grained information than that this method provides, then try the debug build.

Revision 882013-11-12 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 24 to 24
  It can be easily fixed, by doing as it says ! The file is coregrind/m_aspacemgr/aspacemgr-linux.c in the valgrind build directory (or coregrind/m_aspacemgr/aspacemgr.c in older releases). I found increasing the value from 2000 to 25000 seems to do the trick in all cases I've found so far. I also increased VG_N_SEGNAMES to 5000 from 1000.
Added:
>
>
Update : As per valgrind 3.9.0, this step is no longer required, as the default values have been increased to more reasonable settings.
 

Usage at CERN

Using Valgrind from LCG Externals

Revision 872013-10-02 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 102 to 102
  In additional, there are some warnings which are well known about (for instance from the Gaudi framework or even thrid party libraries, like ROOT, STL or boost) and it is best to suppress these, otherwise the amount of output to deal with is huge. This can be done with --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp.
Added:
>
>
To generate in the output log template suppression blocks for new warnings, add the option --gen-suppressions=all to the command line arguments.
 DaVinci has the specific suppression file $DAVINCIROOT/job/DaVinci.supp, so the final full command for DaVinci is then :-
Changed:
<
<
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp --suppressions=$DAVINCIROOT/job/DaVinci.supp python `which gaudirun.py` options.py
>
>
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --gen-suppressions=all --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp --suppressions=$DAVINCIROOT/job/DaVinci.supp python `which gaudirun.py` options.py
 

For Brunel, there is an additional suppressions file you can apply $BRUNELROOT/job/Brunel.supp giving :-

Changed:
<
<
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --suppressions=$BRUNELROOT/job/Brunel.supp --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp python `which gaudirun.py` options.py
>
>
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --gen-suppressions=all --suppressions=$BRUNELROOT/job/Brunel.supp --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp python `which gaudirun.py` options.py
 
Deleted:
<
<
To generate in the output log template suppression blocks for new warnings, add the option --gen-suppressions=all to the command line arguments.
 The above commands will send the output to the terminal. It is best to redirect this to file. One additional complication is the valgrind output is sent to STDERR, not STDOUT, so you will need to redirect both to file... This can be done (for bash like shells) by appending &>mem.log to the full valgrind command (for csh like shells use >& mem.log).

Another useful trick is to first redirect STDERR to STDOUT, then use tee to send the output to file and to the terminal. For bash this is 2>&1 | tee profile.log. For (t)csh this is | & tee -a profile.log.

Revision 862013-10-01 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 105 to 105
 DaVinci has the specific suppression file $DAVINCIROOT/job/DaVinci.supp, so the final full command for DaVinci is then :-
Changed:
<
<
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp python --suppressions=$DAVINCIROOT/job/DaVinci.supp `which gaudirun.py` options.py
>
>
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp --suppressions=$DAVINCIROOT/job/DaVinci.supp python `which gaudirun.py` options.py
 

For Brunel, there is an additional suppressions file you can apply $BRUNELROOT/job/Brunel.supp giving :-

Revision 852013-10-01 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 102 to 102
  In addition, there are some warnings which are well known (for instance from the Gaudi framework or even third party libraries, like ROOT, STL or boost) and it is best to suppress these, otherwise the amount of output to deal with is huge. This can be done with --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp.
Changed:
<
<
So the final full command is then :-
>
>
DaVinci has the specific suppression file $DAVINCIROOT/job/DaVinci.supp, so the final full command for DaVinci is then :-
 
Changed:
<
<
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp python `which gaudirun.py` options.py
>
>
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp python --suppressions=$DAVINCIROOT/job/DaVinci.supp `which gaudirun.py` options.py
 
Changed:
<
<
For Brunel, there is an additional suppressions file you can apply $BRUNELROOT/job/Brunel.supp giving
>
>
For Brunel, there is an additional suppressions file you can apply $BRUNELROOT/job/Brunel.supp giving :-
 
 > valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --suppressions=$BRUNELROOT/job/Brunel.supp --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp python `which gaudirun.py` options.py
Changed:
<
<
The above command will send the output to the terminal. It is best to redirect this to file. One additional complication is the valgrind output is sent to STDERR, not STDOUT, so you will need to redirect both to file... This can be done (for bash like shells) by appending &>mem.log to the full valgrind command (for csh like shells use >& mem.log).
>
>
To generate in the output log template suppression blocks for new warnings, add the option --gen-suppressions=all to the command line arguments.

The above commands will send the output to the terminal. It is best to redirect this to file. One additional complication is the valgrind output is sent to STDERR, not STDOUT, so you will need to redirect both to file... This can be done (for bash like shells) by appending &>mem.log to the full valgrind command (for csh like shells use >& mem.log).

  Another useful trick is to first redirect STDERR to STDOUT, then use tee to send the output to file and to the terminal. For bash this is 2>&1 | tee profile.log. For (t)csh this is | & tee -a profile.log.

Revision 842013-08-12 - BenjaminCouturier

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 26 to 26
 

Usage at CERN

Added:
>
>

Using Valgrind from LCG Externals

Valgrind ships with the LCG externals on AFS at CERN, and SetupProject allows adding Valgrind to the environment of your application, e.g. running:

SetupProject DaVinci valgrind

will set up the environment for DaVinci, adding valgrind to the PATH/LD_LIBRARY_PATH.

Alternative installation

 The version of valgrind installed by default on lxplus has problems running with applications that require reasonably large amounts of memory due to bugs in its internal memory model. It also cannot run on the normal lxplus nodes due to the virtual memory limit in place - e.g. you will see
Line: 49 to 61
 

Working with gaudirun.py

Changed:
<
<
Unfortunately, valgrind cannot be used directly with gaudirun.py. There are two solutions, first you can use gaudirun.py to generate "old style" Job Options:
>
>
Unfortunately, valgrind cannot be used directly with gaudirun.py. There are three solutions, first you can use gaudirun.py to generate "old style" Job Options:
 
 > gaudirun.py -n -v -o options.opts options.py
Line: 71 to 83
  This second method will be used in the examples below.
Added:
>
>
Since Gaudi v23r7, a new way to invoke profilers has been introduced in Gaudi: the new --profilerName option allows to directly call valgrind (or igprof) on the Gaudi process. It is also possible to pass options directly to valgrind, with --profileExtraOptions.

NB1: The profiler name is valgrind+tool (e.g. "valgrindmemcheck" or "valgrindmassif"), see th gaudirun.py help for more details. NB2: In the profileExtraOptions "--" have to be replaced by "__" to avoid issues with the otpion parser. NB3: valgrind needs to be in the path for this option to run. If it is not installed on the machine, "SetupProject Application valgrind" is necessary.

 > gaudirun.py --profilerName=valgrindmemchec --profilerExtraOptions="-v __leak-check=yes  __leak-check=full __show-reachable=yes __suppressions=$ROOTSYS/etc/valgrind-root.supp __suppressions=$STDOPTS/valgrind-python.supp __suppressions=$STDOPTS/Gaudi.supp" options.py
 

Memory Tests

The main use of valgrind is to perform memory tests of your code for things like memory corruptions, memory leaks and uninitialised variables, using the memcheck valgrind tool. Full documentation of this tool is available in section 4 of the valgrind user guide.

Revision 832013-06-26 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 185 to 185
  > LbLogin -c $CMTDEB
Changed:
<
<
On institute setups it may be necessary to use the LHCb software from AFS since the debug libraries are not on CVMFS.
source /afs/cern.ch/lhcb/software/releases/LBSCRIPTS/prod/InstallArea/scripts/LbLogin.sh -c $CMTDEB

Before running SetupProject.

>
>
before SetupProject etc.
 

Scripting GDB with python

Revision 822013-06-10 - StefanLohn

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 6 to 6
 
Added:
>
>

LHCb dedicated Profiling and Regression test service

LHCbPR is LHCb's own project to track performance issues in the general framework and the projects. Its aim is to improve the refactoring process during 2013-2015 and help developers to write performant code. Development of LHCbPR is ongoing.

 

The "valgrind, callgrind and kcachegrind" Utilities

Valgrind is a general purpose utility for analyzing software. It contains various "tools" that perform tasks such as memory allocation checking, heap analysis and code profiling.

Line: 41 to 45
  > source /afs/cern.ch/lhcb/group/rich/vol4/jonrob/scripts/new-valgrind.sh
Changed:
<
<
(for csh or bash like shells respectively) which currently provides the latest versions of valgrind and callgrind, with a private patch applied to allow valgrind to properly run with the LHC LCG applications.
>
>
(for csh or bash like shells respectively) which currently provides the latest versions of valgrind and callgrind, with a private patch applied to allow valgrind to properly run with the LHC LCG applications.
 

Working with gaudirun.py

Line: 72 to 75
  The main use of valgrind is to perform memory tests of your code for things like memory corruptions, memory leaks and uninitialised variables, using the memcheck valgrind tool. Full documentation of this tool is available in section 4 of the valgrind user guide.
Changed:
<
<
It is useful to add a few additional options, to improve the quality of the output. The options =-v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --track-origins=yes = increase the amount of information valgrind supplies.
>
>
It is useful to add a few additional options, to improve the quality of the output. The options =-v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --track-origins=yes = increase the amount of information valgrind supplies.
 In addition, there are some warnings which are well known (for instance from the Gaudi framework or even third party libraries, like ROOT, STL or boost) and it is best to suppress these, otherwise the amount of output to deal with is huge. This can be done with --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp.
Line: 89 to 91
  > valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --suppressions=$BRUNELROOT/job/Brunel.supp --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp python `which gaudirun.py` options.py
Changed:
<
<
The above command will send the output to the terminal. It is best to redirect this to file. One additional complication is the valgrind output is sent to STDERR, not STDOUT, so you will need to redirect both to file... This can be done (for bash like shells) by appending &>mem.log to the full valgrind command (for csh like shells use >& mem.log).
>
>
The above command will send the output to the terminal. It is best to redirect this to file. One additional complication is the valgrind output is sent to STDERR, not STDOUT, so you will need to redirect both to file... This can be done (for bash like shells) by appending &>mem.log to the full valgrind command (for csh like shells use >& mem.log).
 
Changed:
<
<
Another useful trick is to first redirect STDERR to STDOUT, then use tee to send the output to file and to the terminal. For bash this is 2>&1 | tee profile.log. For (t)csh this is | & tee -a profile.log.
>
>
Another useful trick is to first redirect STDERR to STDOUT, then use tee to send the output to file and to the terminal. For bash this is 2>&1 | tee profile.log. For (t)csh this is | & tee -a profile.log.
  This will produce a large output file containing detailed information on any memory problems. You can either simply read this file directly, or, if you prefer, use the alleyoop application to help interpret the errors.
Line: 141 to 143
  kcachegrind is now available on lxplus - see here for details. Note, sourcing the above setup file (/afs/cern.ch/lhcb/group/rich/vol4/jonrob/scripts/new-valgrind.csh) adds the appropriate path for kcachegrind to your PATH.
Changed:
<
<
Using kcachegrind takes some getting used to. One of the first things to do is add your code location (e.g. ~/cmtuser) to the list of known locations (settings -> configure kcachegrind). The kcachegrind web page contains more hints on getting started.
>
>
Using kcachegrind takes some getting used to. One of the first things to do is add your code location (e.g. ~/cmtuser) to the list of known locations (settings -> configure kcachegrind). The kcachegrind web page contains more hints on getting started.
 

Memory Usage Monitoring

Line: 244 to 248
 
Changed:
<
<

ChrisRJones - 27 Feb 2006 MarcoCattaneo - 11 Mar 2013
>
>

ChrisRJones - 27 Feb 2006 MarcoCattaneo - 11 Mar 2013
 
META FILEATTACHMENT attachment="Atlas_UsingValgrind.pdf" attr="h" comment="Snapshot of Atlas valgrind TWiKi, taken on 10th March 2013" date="1363024926" name="Atlas_UsingValgrind.pdf" path="Atlas_UsingValgrind.pdf" size="172163" user="cattanem" version="1"
META TOPICMOVED by="ChrisRJones" date="1161806930" from="LHCb.CodeProfiling" to="LHCb.CodeAnalysisTools"

Revision 812013-04-24 - MarcoCattaneo

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 233 to 233
  Also, much more information is available if the "-g" option is used. Our un-optimised debug (CMTDEB) builds have this but not in our optimised (CMTCONFIG) builds. If you want to use this with your optimised builds, this can be done by checking out a private version of GaudiPolicy, update the optimisation flags in the requirements file to include "-g" as well as "-O2" and rebuild the libraries you wish to profile. Note this will increase the size of the binaries so if you have many libraries and a small AFS user quota you might need to increase it (email Joel Closier) or use another system without the strict quotas applied on AFS (like home institute systems).
Added:
>
>

Performance and Regression testing service

The Performance and Regression testing service can be accessed at https://lhcb-pr.web.cern.ch/lhcb-pr/. An introduction to its design and capabilities was given in this talk by Emmanouil Kiagias at the Core Software meeting on 24th April 2013
 

Further reading

Revision 802013-03-11 - MarcoCattaneo

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 163 to 163
 

Understanding the output

Changed:
<
<
The Atlas valgrind TWiki contains hints on interpreting the valgrind output: https://twiki.cern.ch/twiki/bin/view/Atlas/UsingValgrind
>
>
The Atlas valgrind TWiki contains hints on interpreting the valgrind output. If you have access to the Atlas TWiKis, go to https://twiki.cern.ch/twiki/bin/view/Atlas/UsingValgrind. If not, this link is a snapshot of the Atlas valgrind TWiKi, taken on 10th March 2013
 

Debugging gaudirun.py on Linux with gdb

Line: 244 to 244
 
ChrisRJones - 27 Feb 2006
Changed:
<
<
MarcoCattaneo - 18 Jun 2009
>
>
MarcoCattaneo - 11 Mar 2013
 
Added:
>
>
META FILEATTACHMENT attachment="Atlas_UsingValgrind.pdf" attr="h" comment="Snapshot of Atlas valgrind TWiKi, taken on 10th March 2013" date="1363024926" name="Atlas_UsingValgrind.pdf" path="Atlas_UsingValgrind.pdf" size="172163" user="cattanem" version="1"
 
META TOPICMOVED by="ChrisRJones" date="1161806930" from="LHCb.CodeProfiling" to="LHCb.CodeAnalysisTools"

Revision 792013-01-05 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 19 to 19
 --15591:0:aspacem Increase it and rebuild. Exiting now.
Changed:
<
<
It can be easily fixed, by doing as it says ! The file is coregrind/m_aspacemgr/aspacemgr-linux.c in the valgrind build directory (or coregrind/m_aspacemgr/aspacemgr.c in older releases). I found increasing the value from 2000 to 15000 seems to do the trick in all cases I've found so far.
>
>
It can be easily fixed, by doing as it says ! The file is coregrind/m_aspacemgr/aspacemgr-linux.c in the valgrind build directory (or coregrind/m_aspacemgr/aspacemgr.c in older releases). I found increasing the value from 2000 to 25000 seems to do the trick in all cases I've found so far. I also increased VG_N_SEGNAMES to 5000 from 1000.
 

Usage at CERN

Revision 782012-11-11 - AlbertBursche

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 183 to 183
  > LbLogin -c $CMTDEB
Added:
>
>
On institute setups it may be necessary to use the LHCb software from AFS since the debug libraries are not on CVMFS.
source /afs/cern.ch/lhcb/software/releases/LBSCRIPTS/prod/InstallArea/scripts/LbLogin.sh -c $CMTDEB
 Before running SetupProject.

Scripting GDB with python

Revision 772012-09-28 - BenjaminCouturier

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 217 to 217
 

VTune Intel Profiler

A Gaudi auditor has been provided by Sascha Mazurov to interface to the VTune Intel profiler. See IntelProfiler, IntelProfilerExample and Video tutorial on profiler installation in Gaudi, running and analyzing it from command line (without GUI)
Added:
>
>
See https://twiki.cern.ch/twiki/bin/view/Openlab/IntelTools
 

Debug and Optimised Libraries

Utilities like callgrind, Google PerfTools and gdb can work on both optimised (CMTCONFIG) and un-optimised (CMTDEB) builds, although when running on optimised builds bear in mind you will not get all possible information. For instance inline functions, by their nature, are optimised away (those which are actually inlined by the compiler) and thus are not seen by the utility. Similarly you will not get annotated source code listings.

Revision 762012-07-18 - ChristianElsasser

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 91 to 91
  The above command will send the output to the terminal. It is best to redirect this to file. One additional complication is the valgrind output is sent to STDERR, not STDOUT, so you will need to redirect both to file... This can be done (for bash like shells) by appending &>mem.log to the full valgrind command (for csh like shells use >& mem.log).
Changed:
<
<
Another useful trick is to first redirect STDERR to STDOUT, then use tee to send the output to file and to the terminal. For bash this is 2>&1 | tee profile.log. For (t)csh (insert command from (t)csh expert...).
>
>
Another useful trick is to first redirect STDERR to STDOUT, then use tee to send the output to file and to the terminal. For bash this is 2>&1 | tee profile.log. For (t)csh this is | & tee -a profile.log.
  This will produce a large output file containing detailed information on any memory problems. You can either simply read this file directly, or, if you prefer, use the alleyoop application to help interpret the errors.

Revision 752012-04-24 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 177 to 177
  When it crashes, type where to get a traceback at that point.
Changed:
<
<
If possible use a debug build (see below) as that will provide more information.
>
>
If possible use a debug build (see below) as that will provide more information. This can be done by running

 > LbLogin -c $CMTDEB

Before running SetupProject.

 

Scripting GDB with python

With recent version of GDB, it is possible to script the behavior of the debugger. This is quite handy when a lot of repetitive tasks is requested in the debugger itself, for example when debugging multi-threaded applications. Since GDB 7.0 (which is present on lxplus), it is distributed with python modules interacting directly with the GDB internals. One way to check that GDB has the support for this is to do:

Revision 742012-01-26 - MarcoCattaneo

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 203 to 203
 You can use the emacs toolbar to set break in lines, unset them and issue debugger commands, or you can pass them as command lines at the (gdb) prompt. In which case here are a couple of useful short-cuts:
  • (gdb) Ctrl- up-arrow/down-arrow allows to navigate through the commands you already typed

Changed:
<
<

Google PerTools

>
>

Google PerfTools

  Google provides a set of performance tools. For details on usage within LHCb see here.
Added:
>
>

VTune Intel Profiler

A Gaudi auditor has been provided by Sascha Mazurov to interface to the VTune Intel profiler. See IntelProfiler, IntelProfilerExample and Video tutorial on profiler installation in Gaudi, running and analyzing it from command line (without GUI)
 

Debug and Optimised Libraries

Changed:
<
<
Utilities like callgrind, Google PerTools and gdb can work on both optimised (CMTCONFIG) and un-optimised (CMTDEB) builds, although when running on optimised builds bear in mind you will not get all possible information. For instance inline functions, by their nature, are optimised away (those which are actually inlined by the compiler) and thus are not seen by the utility. Similarly you will not get annotated source code listings.
>
>
Utilities like callgrind, Google PerfTools and gdb can work on both optimised (CMTCONFIG) and un-optimised (CMTDEB) builds, although when running on optimised builds bear in mind you will not get all possible information. For instance inline functions, by their nature, are optimised away (those which are actually inlined by the compiler) and thus are not seen by the utility. Similarly you will not get annotated source code listings.
  This does not happen in un-optimised (debug) builds, where all information is available, but of course you must then bear in mind that the two builds are different, and for some studies like profiling (code timing) the information you get will be different. In these cases a good approach is to first try running the normal optimised build. If you find you want finer grained information than that this method provides, then try the debug build.

Revision 732012-01-06 - MarcoCattaneo

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 219 to 219
 
  • Stefan Nies' talk on tools for code optimisation at 37th software week, 2009-06-18
  • Talks given at CHEP2009 and CHEP2010 giving hints on strategies and tools for optimising code:
Changed:
<
<
>
>
 

Revision 722011-09-01 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 72 to 72
  The main use of valgrind is to perform memory tests of your code for things like memory corruptions, memory leaks and uninitialised variables, using the memcheck valgrind tool. Full documentation of this tool is available in section 4 of the valgrind user guide.
Changed:
<
<
It is useful to add a few additional options, to improve the quality of the output. The options -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full increase the amount of information valgrind supplies.
>
>
It is useful to add a few additional options, to improve the quality of the output. The options =-v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --track-origins=yes = increase the amount of information valgrind supplies.
  In addition, there are some warnings which are well known (for instance from the Gaudi framework or even third party libraries, like ROOT, STL or boost) and it is best to suppress these, otherwise the amount of output to deal with is huge. This can be done with --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp.

So the final full command is then :-

Changed:
<
<
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp python `which gaudirun.py` options.py
>
>
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp python `which gaudirun.py` options.py
 

For Brunel, there is an additional suppressions file you can apply $BRUNELROOT/job/Brunel.supp giving

Changed:
<
<
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --suppressions=$BRUNELROOT/job/Brunel.supp --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp python `which gaudirun.py` options.py
>
>
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --track-origins=yes --suppressions=$BRUNELROOT/job/Brunel.supp --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp python `which gaudirun.py` options.py
 

The above command will send the output to the terminal. It is best to redirect this to file. One additional complication is the valgrind output is sent to STDERR, not STDOUT, so you will need to redirect both to file... This can be done (for bash like shells) by appending &>mem.log to the full valgrind command (for csh like shells use >& mem.log).

Revision 712011-09-01 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 82 to 82
  > valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp python `which gaudirun.py` options.py
Added:
>
>
For Brunel, there is an additional suppressions file you can apply $BRUNELROOT/job/Brunel.supp giving

 > valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --suppressions=$BRUNELROOT/job/Brunel.supp --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp python `which gaudirun.py` options.py
 The above command will send the output to the terminal. It is best to redirect this to file. One additional complication is the valgrind output is sent to STDERR, not STDOUT, so you will need to redirect both to file... This can be done (for bash like shells) by appending &>mem.log to the full valgrind command (for csh like shells use >& mem.log).

Another useful trick is to first redirect STDERR to STDOUT, then use tee to send the output to file and to the terminal. For bash this is 2>&1 | tee profile.log. For (t)csh (insert command from (t)csh expert...).

Revision 702011-05-26 - HubertDegaudenzi

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 171 to 171
 When it crashes, type where to get a traceback at that point.

If possible use a debug build (see below) as that will provide more information.

Added:
>
>

Scripting GDB with python

With recent versions of GDB, it is possible to script the behavior of the debugger. This is quite handy when a lot of repetitive tasks are requested in the debugger itself, for example when debugging multi-threaded applications. Since GDB 7.0 (which is present on lxplus), GDB is distributed with python modules that interact directly with the GDB internals. One way to check that GDB has the support for this is to do:

(gdb) python print 23
23
at the GDB prompt. This must not fail.

More information can be found at the PythonGdb page of the GDB Wiki and in the GDB documentation on how to use it.
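As a minimal sketch of what such a script might look like (the file and log names are just examples), the following can be loaded with gdb -x to run the job and save a traceback if it stops on a crash:

# bt_on_crash.py -- load with: gdb -x bt_on_crash.py --args python `which gaudirun.py` options.py
import gdb

gdb.execute("set pagination off")   # avoid interactive paging
gdb.execute("run")                  # start the application
try:
    bt = gdb.execute("bt", to_string=True)   # capture the traceback as a string
    with open("crash_traceback.log", "w") as f:
        f.write(bt)
except gdb.error:
    pass   # no stack, i.e. the program finished normally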

 

GDB in Emacs

Revision 692011-04-15 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 74 to 74
  It is useful to add a few additional options, to improve the quality of the output. The options -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full increase the amount of information valgrind supplies.
Changed:
<
<
In additional, there are some warnings which are well known about (for instance from the Gaudi framework or even thrid party libraries, like STL or boost) and it is best to suppress these, otherwise the amount of output to deal with is huge. This can be done with --suppressions=$STDOPTS/Gaudi.supp.
>
>
In additional, there are some warnings which are well known about (for instance from the Gaudi framework or even thrid party libraries, like ROOT, STL or boost) and it is best to suppress these, otherwise the amount of output to deal with is huge. This can be done with --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp.
  So the final full command is then :-
Changed:
<
<
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --suppressions=$STDOPTS/Gaudi.supp python `which gaudirun.py` options.py
>
>
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$STDOPTS/valgrind-python.supp --suppressions=$STDOPTS/Gaudi.supp python `which gaudirun.py` options.py
 

The above command will send the output to the terminal. It is best to redirect this to file. One additional complication is the valgrind output is sent to STDERR, not STDOUT, so you will need to redirect both to file... This can be done (for bash like shells) by appending &>mem.log to the full valgrind command (for csh like shells use >& mem.log).

Revision 682010-12-21 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 40 to 40
 
 > source /afs/cern.ch/lhcb/group/rich/vol4/jonrob/scripts/new-valgrind.sh
Deleted:
<
<
for csh or bash like shells respectively.
 
Changed:
<
<
which currently provides the latest versions of valgrind and callgrind, with a private patch applied to increase VG_N_SEGMENTS to 10000.
>
>
(for csh or bash like shells respectively) which currently provides the latest versions of valgrind and callgrind, with a private patch applied to allow valgrind to properly run with the LHC LCG applications.
 

Working with gaudirun.py

Line: 170 to 170
  When it crashes, type where to get a traceback at that point.
Changed:
<
<
If possible use a debug build (see above) as that will provide more information.
>
>
If possible use a debug build (see below) as that will provide more information.
 

GDB in Emacs

Line: 192 to 192
 

Debug and Optimised Libraries

Changed:
<
<
Utilities like callgrind, Google PerTools and gdb can work on both optimised (CMTCONFIG) and un-optimised (CMTDEB) builds, although when running on optimised builds bear in mind you will not get all possible information. For instance inline functions, by their nature, are optimised away (those which are actually inlined by the compiler) and thus are not seen by the utility. Similarly you will not get annotated source code listings. This does not happen in un-optimised (debug) builds, where all information is available, but of course you must then bear in mind that the two builds are different, and for some studies like profiling, the timing information you get will be different. In these cases a good approach is to first try running the normal optimised build. If you find you want finer grained information than that this method provides, then try the debug build.
>
>
Utilities like callgrind, Google PerTools and gdb can work on both optimised (CMTCONFIG) and un-optimised (CMTDEB) builds, although when running on optimised builds bear in mind you will not get all possible information. For instance inline functions, by their nature, are optimised away (those which are actually inlined by the compiler) and thus are not seen by the utility. Similarly you will not get annotated source code listings.
 
Changed:
<
<
Also, much more information is available if the "-g" option is used. Our un-optimised debug (CMTDEB) builds have this but not in our optimised (CMTCONFIG) builds. If you want to use this with your optimised builds, this can be done by checking out a private version of GaudiPolicy, update the optimisation flags in the requirements file to include "-g" as well as "-O2" and rebuild the libraries you wish to profile. Note this will increase the size of the binaries so if you have many libraries and a small AFS user quota you might need to increase it (email Joel Closier).
>
>
This does not happen in un-optimised (debug) builds, where all information is available, but of course you must then bear in mind that the two builds are different, and for some studies like profiling (code timing) the information you get will be different. In these cases a good approach is to first try running the normal optimised build. If you find you want finer grained information than that this method provides, then try the debug build.

Also, much more information is available if the "-g" option is used. Our un-optimised debug (CMTDEB) builds have this but not in our optimised (CMTCONFIG) builds. If you want to use this with your optimised builds, this can be done by checking out a private version of GaudiPolicy, update the optimisation flags in the requirements file to include "-g" as well as "-O2" and rebuild the libraries you wish to profile. Note this will increase the size of the binaries so if you have many libraries and a small AFS user quota you might need to increase it (email Joel Closier) or use another system without the strict quotas applied on AFS (like home institute systems).

 

Further reading

Revision 672010-12-21 - MarcoCattaneo

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 200 to 200
 

Further reading

Changed:
<
<
  • Atlas TWiKi page on optimizing code
  • Talks given at CHEP2009 giving hints on strategies and tools for optimising code:
>
>
  • Talks given at CHEP2009 and CHEP2010 giving hints on strategies and tools for optimising code:
 

Revision 662010-12-20 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 154 to 154
  Also note, to get the most out of this tool (as with all valgrind tools) it is best to run the debug software builds.
Deleted:
<
<

Debug and Optimised Libraries

callgrind can work on both optimised (CMTCONFIG) and un-optimised (CMTDEB) builds, although when running on optimised builds bear in mind you will not get all possible information. For instance inline functions, by their nature, are optimised away (those which are actually inlined by the compiler) and thus are not seen by callgrind and will not appear in the profiling. Similarly you will not get annotated source code listings. This does not happen in un-optimised builds, where all information is available, but of course you must then bear in mind the timing information you get will be different. A good approach is to first try running the normal optimised build. If you find you want finer grained information than that this method provides, then try the debug build.

Also, much more information is available if the "-g" option is used. Our un-optimised debug (CMTDEB) builds have this but not in our optimised (CMTCONFIG) builds. If you want to use this with your optimised builds, this can be done by checking out a private version of GaudiPolicy, update the optimisation flags in the requirements file to include "-g" as well as "-O2" and rebuild the libraries you wish to profile. Note this will increase the size of the binaries so if you have many libraries and a small AFS user quota you might need to increase it (email Joel Closier).

 

Understanding the output

Deleted:
<
<
The Atlas valgrind TWiki contains hints on interpreting the valgrind output: https://twiki.cern.ch/twiki/bin/view/Atlas/UsingValgrind
 
Added:
>
>
The Atlas valgrind TWiki contains hints on interpreting the valgrind output: https://twiki.cern.ch/twiki/bin/view/Atlas/UsingValgrind
 

Debugging gaudirun.py on Linux with gdb

Line: 197 to 190
  Google provides a set of performance tools. For details on usage within LHCb see here.
Added:
>
>

Debug and Optimised Libraries

Utilities like callgrind, Google PerTools and gdb can work on both optimised (CMTCONFIG) and un-optimised (CMTDEB) builds, although when running on optimised builds bear in mind you will not get all possible information. For instance inline functions, by their nature, are optimised away (those which are actually inlined by the compiler) and thus are not seen by the utility. Similarly you will not get annotated source code listings. This does not happen in un-optimised (debug) builds, where all information is available, but of course you must then bear in mind that the two builds are different, and for some studies like profiling, the timing information you get will be different. In these cases a good approach is to first try running the normal optimised build. If you find you want finer grained information than that this method provides, then try the debug build.

Also, much more information is available if the "-g" option is used. Our un-optimised debug (CMTDEB) builds have this but not in our optimised (CMTCONFIG) builds. If you want to use this with your optimised builds, this can be done by checking out a private version of GaudiPolicy, update the optimisation flags in the requirements file to include "-g" as well as "-O2" and rebuild the libraries you wish to profile. Note this will increase the size of the binaries so if you have many libraries and a small AFS user quota you might need to increase it (email Joel Closier).

 

Further reading

Added:
>
>
 

Revision 652010-11-20 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 146 to 146
  > valgrind --tool=massif python `which gaudirun.py` options.py
Changed:
<
<
Note that command line options exist to control the level of memory usage monitoring that is applied, whether the stack is monitored as well as the heap (by default not). See the user guide for more details.
>
>
Note that command line options exist to control the level of memory usage monitoring that is applied, whether the stack is monitored as well as the heap (by default not). See the user guide for more details. Useful additional default options are, for example

 > valgrind --tool=massif -v --max-snapshots=1000 python `which gaudirun.py` options.py
  Also note, to get the most out of this tool (as with all valgrind tools) it is best to run the debug software builds.

Revision 642010-11-12 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 171 to 171
  and then to run the application then simply type run at the gdb command line. ( The option --args tells gdb to interpret any additional options after the executable name as arguments to that application, instead of the default which is to try and interpret them as core files...)
Added:
>
>
When it crashes, type where to get a traceback at that point.
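A few other commands that are typically useful at that point (the variable name is just an example):

 (gdb) where              # or 'bt', print the traceback
 (gdb) frame 3            # select frame number 3 from the traceback
 (gdb) print myVariable   # inspect a variable visible in that frame
 (gdb) list               # show the source around the current position (debug builds only)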

If possible use a debug build (see above) as that will provide more information.

GDB in Emacs

 An alternative method is to use Emacs to start a debug session.
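As a sketch, the session is typically started with M-x gdb inside Emacs, giving it the same command line as on the shell (the exact prompt text depends on the Emacs version):

 M-x gdb
 gdb --args python `which gaudirun.py` options.py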

Revision 632010-11-05 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 31 to 31
 valgrind: is there a hard virtual memory limit set?
Changed:
<
<
These bugs have been fixed in the more recent versions than that installed. Unfortunately such a version is not available by default, so I had to install it privately. To access this version run :-
>
>
These bugs have been fixed in the more recent versions than that installed. Unfortunately such a version is not available by default. A patched version is available in the LCG AFS area, which the following scripts provide access to :-
 
 > source /afs/cern.ch/lhcb/group/rich/vol4/jonrob/scripts/new-valgrind.csh

Revision 622010-11-05 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 70 to 70
 

Memory Tests

Changed:
<
<
The main use of valgrind is to perform memory tests of your code for things like memory corruptions, memory leaks and uninitialised variables, using the memcheck valgrind tool.
>
>
The main use of valgrind is to perform memory tests of your code for things like memory corruptions, memory leaks and uninitialised variables, using the memcheck valgrind tool. Full documentation of this tool is available in section 4 of the valgrind user guide.
  It is useful to add a few additional options, to improve the quality of the output. The options -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full increase the amount of information valgrind supplies.
Line: 90 to 90
 

Code Profiling

Changed:
<
<
A valgrind tool called "callgrind" also exists which does some detailed code profiling. For valgrind versions 3.1.1 and earlier callgrind is not included in the main valgrind package, it must be installed seperately. As of valgrind 3.2.0, callgrind was integrated into the mainstream valgrind package.
>
>
A valgrind tool called "callgrind" also exists which does some detailed code profiling. For valgrind versions 3.1.1 and earlier callgrind is not included in the main valgrind package; it must be installed separately. As of valgrind 3.2.0, callgrind was integrated into the mainstream valgrind package. Full documentation of this tool is available in section 6 of the valgrind user guide.
  In addition, a nice GUI is available to view the output of this tool, called "kcachegrind" (kcachegrind can view the output of cachegrind, but despite its confusing name it is actually primarily designed for callgrind).
Line: 136 to 136
  Using kcachegrind takes some getting used to. One of the first things to do is add your code location (e.g. ~/cmtuser) to the list of known locations (settings -> configure kcachegrind). The kcachegrind web page contains more hints on getting started.
Added:
>
>

Memory Usage Monitoring

The valgrind tool called "massif" also exists, which does some detailed memory usage monitoring. Full documentation of this tool is available in section 9 of the valgrind user guide.

Simple usage is just

 > valgrind --tool=massif python `which gaudirun.py` options.py

Note that command line options exist to control the level of memory usage monitoring that is applied, e.g. whether the stack is monitored as well as the heap (by default it is not). See the user guide for more details.

Also note, to get the most out of this tool (as with all valgrind tools) it is best to run the debug software builds.
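For example, stack monitoring can be switched on explicitly (just a sketch; see the massif section of the user guide for the complete option list):

 > valgrind --tool=massif --stacks=yes python `which gaudirun.py` options.py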

 

Debug and Optimised Libraries

callgrind can work on both optimised (CMTCONFIG) and un-optimised (CMTDEB) builds, although when running on optimised builds bear in mind you will not get all possible information. For instance inline functions, by their nature, are optimised away (those which are actually inlined by the compiler) and thus are not seen by callgrind and will not appear in the profiling. Similarly you will not get annotated source code listings.

Revision 612010-09-09 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 82 to 82
  > valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --suppressions=$STDOPTS/Gaudi.supp python `which gaudirun.py` options.py
Changed:
<
<
The above command will send the output to the terminal. It is best to redirect this to file. One additional complication is the valgrind output is sent to STDERR, not STDOUT, so you will need to redirect both to file... This can be done (for bash like shells) by appending &>mem.log to the full valgrind command (for csh like shells use >& mem.log).
>
>
The above command will send the output to the terminal. It is best to redirect this to file. One additional complication is the valgrind output is sent to STDERR, not STDOUT, so you will need to redirect both to file... This can be done (for bash like shells) by appending &>mem.log to the full valgrind command (for csh like shells use >& mem.log).

Another useful trick is to first redirect STDERR to STDOUT, then use tee to send the output to file and to the terminal. For bash this is 2>&1 | tee profile.log. For (t)csh (insert command from (t)csh expert...).
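A minimal sketch of the two variants (the memcheck options are abbreviated to [options] for readability; for (t)csh the usual equivalent is the |& operator):

 > valgrind --tool=memcheck [options] python `which gaudirun.py` options.py 2>&1 | tee mem.log     # bash like shells
 > valgrind --tool=memcheck [options] python `which gaudirun.py` options.py |& tee mem.log         # (t)csh like shells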

  This will produce a large output file containing detailed information on any memory problems. You can either simply read this file directly, or, if you prefer, use the alleyoop application to help interpret the errors.

Revision 602010-07-09 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 70 to 70
 

Memory Tests

Changed:
<
<
The main use of valgrind is perform a memory tests of your code for things like memory corruptions, memory leaks and uninitialised variables, using the memcheck valgrind tool.
>
>
The main use of valgrind is to perform memory tests of your code for things like memory corruptions, memory leaks and uninitialised variables, using the memcheck valgrind tool.
  It is useful to add a few additional options, to improve the quality of the output. The options -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full increase the amount of information valgrind supplies.
Line: 104 to 104
  The options --dump-instr=yes --trace-jump=yes are also useful as they provide more information.
Changed:
<
<
One very useful option is "--dump-before" which can be used for the creation of an output file before calling particular methods. Using this with for instance BooleInit::execute allows the creation of one dump per event, which can then be read in individually. I.e. a full command line could be
>
>
One very useful option is "--dump-before" which can be used for the creation of an output file before calling particular methods. Using this with for instance BooleInit::execute allows the creation of one dump per event, which can then be read in individually. I.e. a full command line could be, for Boole
 
 > valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="BooleInit::execute()" python `which gaudirun.py` options.py
Changed:
<
<
for Boole, for Brunel
>
>
or for Brunel
 
 > valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="RecInit::execute()" python `which gaudirun.py` options.py
Changed:
<
<
or for DaVinci
>
>
and finally for DaVinci
 
 > valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="DaVinciInit::execute()" python `which gaudirun.py` options.py

Revision 592010-02-10 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 72 to 72
  The main use of valgrind is perform a memory tests of your code for things like memory corruptions, memory leaks and uninitialised variables, using the memcheck valgrind tool.
Changed:
<
<
It is useful to add a few additional options, to improve the quality of the output. The options -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full increase the amount of information valgrind supplies.
>
>
It is useful to add a few additional options, to improve the quality of the output. The options -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full increase the amount of information valgrind supplies.
  In addition, there are some warnings which are well known (for instance from the Gaudi framework or even third party libraries, like STL or boost) and it is best to suppress these, otherwise the amount of output to deal with is huge. This can be done with --suppressions=$STDOPTS/Gaudi.supp.

So the final full command is then :-

Changed:
<
<
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full --show-reachable=yes --suppressions=$STDOPTS/Gaudi.supp python `which gaudirun.py` options.py
>
>
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=50 --leak-check=full --show-reachable=yes --suppressions=$STDOPTS/Gaudi.supp python `which gaudirun.py` options.py
 

The above command will send the output to the terminal. It is best to redirect this to file. One additional complication is the valgrind output is sent to STDERR, not STDOUT, so you will need to redirect both to file... This can be done (for bash like shells) by appending &>mem.log to the full valgrind command (for csh like shells use >& mem.log).
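If the standard suppression file does not cover everything you want to hide, valgrind can also print ready-made suppression blocks for the errors it reports; these can be pasted into a private file and passed via an additional --suppressions argument. A sketch (not an official LHCb recipe):

 > valgrind --tool=memcheck --gen-suppressions=all --suppressions=$STDOPTS/Gaudi.supp python `which gaudirun.py` options.py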

Revision 582009-10-11 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 167 to 167
 You can use the emacs toolbar to set break in lines, unset them and issue debugger commands, or you can pass them as command lines at the (gdb) prompt. In which case here are a couple of useful short-cuts:
  • (gdb) Ctrl- up-arrow/down-arrow allows to navigate through the commands you already typed

Added:
>
>

Google PerfTools

 
Changed:
<
<

google-perftools

google-perftools
>
>
Google provides a set of performance tools. For details on usage within LHCb see here.
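As a very rough illustration of what CPU profiling with these tools can look like when run by hand (library locations and file names are purely illustrative; see the linked page for the supported LHCb recipe):

 > LD_PRELOAD=/path/to/libprofiler.so CPUPROFILE=gaudi.prof gaudirun.py options.py
 > pprof --text `which python` gaudi.prof | head -40      # list the hottest functions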
 

Further reading

Revision 572009-10-06 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 42 to 42
  for csh or bash like shells respectively.
Changed:
<
<
which currently provides the latest versions of valgrind and callgrind, with a private patch applied to increase VG_N_SEGMENTS to 10000, which seems enough for the LHCb applications I have tried (If you find otherwise, please let me know).
>
>
which currently provides the latest versions of valgrind and callgrind, with a private patch applied to increase VG_N_SEGMENTS to 10000.
 

Working with gaudirun.py

Line: 52 to 52
  > gaudirun.py -n -v -o options.opts options.py
Changed:
<
<
Depending on your options and the version of gaudirun.py used to generate them, options.opts may contain somelines. If this happens simply edit the file by hand and remove these lines.
>
>
Depending on your options and the version of gaudirun.py used to generate them, options.opts may contain some lines which are not valid... If this happens simply edit the file by hand and remove these lines.
  Then invoke valgrind with something like :-

Revision 562009-10-02 - unknown

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 167 to 167
 You can use the emacs toolbar to set break in lines, unset them and issue debugger commands, or you can pass them as command lines at the (gdb) prompt. In which case here are a couple of useful short-cuts:
  • (gdb) Ctrl- up-arrow/down-arrow allows to navigate through the commands you already typed

Added:
>
>

google-perftools

google-perftools
 

Further reading

Revision 552009-09-15 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 49 to 49
 Unfortunately, valgrind cannot be used directly with gaudirun.py. There are two solutions, first you can use gaudirun.py to generate "old style" Job Options:
Changed:
<
<
> gaudirun.py options.py -n -v --old-opts > options.opts
>
>
> gaudirun.py -n -v -o options.opts options.py
 
Changed:
<
<
Depending on your options and the version of gaudirun.py used to generate them, options.opts may contain some invalid comment lines. If this happens simply edit the file by hand and remove these lines.
>
>
Depending on your options and the version of gaudirun.py used to generate them, options.opts may contain somelines. If this happens simply edit the file by hand and remove these lines.
  Then invoke valgrind with something like :-

Revision 542009-09-15 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 139 to 139
 callgrind can work on both optimised (CMTCONFIG) and un-optimised (CMTDEB) builds, although when running on optimised builds bear in mind you will not get all possible information. For instance inline functions, by their nature, are optimised away (those which are actually inlined by the compiler) and thus are not seen by callgrind and will not appear in the profiling. Similarly you will not get annotated source code listings. This does not happen in un-optimised builds, where all information is available, but of course you must then bear in mind the timing information you get will be different. A good approach is to first try running the normal optimised build. If you find you want finer grained information than that this method provides, then try the debug build.
Changed:
<
<
Also, much more information is available if the "-g" option is used. Our un-optimised debug (CMTDEB) builds have this but not in our optimised (CMTCONFIG) builds. If you want to use this with your optimsed builds, this can be done by checking out a private version of GaudiPolicy, update the optimisation flags in the requirements file to include "-g" as well as "-O2" and rebuild the libraries you wish to profile. Note this will increase the size of the binaries so if you have many libraries and a small AFS user quota you might need to increase it (email Joel Closier).
>
>
Also, much more information is available if the "-g" option is used. Our un-optimised debug (CMTDEB) builds have this but not in our optimised (CMTCONFIG) builds. If you want to use this with your optimised builds, this can be done by checking out a private version of GaudiPolicy, update the optimisation flags in the requirements file to include "-g" as well as "-O2" and rebuild the libraries you wish to profile. Note this will increase the size of the binaries so if you have many libraries and a small AFS user quota you might need to increase it (email Joel Closier).
 

Understanding the output

The Atlas valgrind TWiki contains hints on interpreting the valgrind output: https://twiki.cern.ch/twiki/bin/view/Atlas/UsingValgrind
Changed:
<
<

Debugging gaudirun.py on Linux

We recommend that you use Emacs to start a debug session. The following works:
>
>

Debugging gaudirun.py on Linux with gdb

gaudirun.py applications can be run through the gdb debugger using a similar trick to the one used with valgrind: call python directly and pass gaudirun.py as an argument. Just type

 > gdb --args python `which gaudirun.py` yourOptions.py 

and then, to run the application, simply type run at the gdb command line. (The option --args tells gdb to interpret any additional options after the executable name as arguments to that application, instead of the default, which is to try to interpret them as core files...)

An alternative method is to use Emacs to start a debug session.

 
 > emacs
 M-X gdb  (or Esc-X on many keyboards)

Revision 532009-09-14 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 44 to 44
  which currently provides the latest versions of valgrind and callgrind, with a private patch applied to increase VG_N_SEGMENTS to 10000, which seems enough for the LHCb applications I have tried (If you find otherwise, please let me know).
Changed:
<
<

Memory Tests

The main use of valgrind is perform a memory tests of your code for things like memory corruptions, memory leaks and uninitialised variables, using the memcheck valgrind tool.

>
>

Working with gaudirun.py

 
Changed:
<
<
Unfortunately, valgrind cannot be used directly with gaudirun.py, so you should first use gaudirun.py to generate "old style" Job Options:
>
>
Unfortunately, valgrind cannot be used directly with gaudirun.py. There are two solutions, first you can use gaudirun.py to generate "old style" Job Options:
 
 > gaudirun.py options.py -n -v --old-opts > options.opts
Line: 62 to 60
  > valgrind --tool=memcheck Gaudi.exe options.opts
Added:
>
>
Second, and probably better since it avoids the additional step of creating old-style deprecated options (which does not always work), you can run valgrind directly on the python executable and pass the full path to gaudirun.py as an argument.

 > valgrind --tool=memcheck python `which gaudirun.py` options.py

This second method will be used in the examples below.

Memory Tests

The main use of valgrind is perform a memory tests of your code for things like memory corruptions, memory leaks and uninitialised variables, using the memcheck valgrind tool.

 It is useful to add a few additional options, to improve the quality of the output. The options -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full increase the amount of information valgrind supplies.

In addition, there are some warnings which are well known (for instance from the Gaudi framework or even third party libraries, like STL or boost) and it is best to suppress these, otherwise the amount of output to deal with is huge. This can be done with --suppressions=$STDOPTS/Gaudi.supp.

Line: 69 to 79
 So the final full command is then :-
Changed:
<
<
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full --show-reachable=yes --suppressions=$STDOPTS/Gaudi.supp Gaudi.exe options.opt
>
>
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full --show-reachable=yes --suppressions=$STDOPTS/Gaudi.supp python `which gaudirun.py` options.py
 

The above command will send the output to the terminal. It is best to redirect this to file. One additional complication is the valgrind output is sent to STDERR, not STDOUT, so you will need to redirect both to file... This can be done (for bash like shells) by appending &>mem.log to the full valgrind command (for csh like shells use >& mem.log).

Line: 97 to 107
 One very useful option is "--dump-before" which can be used for the creation of an output file before calling particular methods. Using this with for instance BooleInit::execute allows the creation of one dump per event, which can then be read in individually. I.e. a full command line could be
Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="BooleInit::execute()" Gaudi.exe options.opts
>
>
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="BooleInit::execute()" python `which gaudirun.py` options.py
 

for Boole, for Brunel

Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="RecInit::execute()" Gaudi.exe options.opts
>
>
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="RecInit::execute()" python `which gaudirun.py` options.py
 

or for DaVinci

Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="DaVinciInit::execute()" Gaudi.exe options.opts
>
>
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="DaVinciInit::execute()" python `which gaudirun.py` options.py
 

Alternatively, it is sometimes more useful to average the profiling information over several events in order to get a better overall picture. This can be done using the following options, which will produce one dump for the initialize() phase, one for execute() and a third for finalize().

Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-after="ApplicationMgr::start()" --dump-before="ApplicationMgr::stop()" Gaudi.exe options.opts
>
>
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-after="ApplicationMgr::start()" --dump-before="ApplicationMgr::stop()" python `which gaudirun.py` options.py
 

Also note, valgrind generally requires a large amount of memory and CPU, and thus you may run into problems with the CPU time and virtual memory size limits in place on the general purpose lxplus nodes. If this happens, try using a private afs box (e.g. pclhcbXX.cern.ch) if such a machine is available to you. If not, you could try running on one of the lxbuildXXX machines (XXX = 035 or 042, although we should avoid all running at the same time on the same node ...).

Revision 522009-09-12 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 97 to 97
 One very useful option is "--dump-before" which can be used for the creation of an output file before calling particular methods. Using this with for instance BooleInit::execute allows the creation of one dump per event, which can then be read in individually. I.e. a full command line could be
Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="BooleInit*execute*" Gaudi.exe options.opts
>
>
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="BooleInit::execute()" Gaudi.exe options.opts
 

for Boole, for Brunel

Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="RecInit*execute*" Gaudi.exe options.opts
>
>
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="RecInit::execute()" Gaudi.exe options.opts
 

or for DaVinci

Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="DaVinciInit*execute*" Gaudi.exe options.opts
>
>
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="DaVinciInit::execute()" Gaudi.exe options.opts
 
Deleted:
<
<
( Note the *'s are needed with the --dump-before method, I think to cope with name mangling under -O2 )
 Alternatively, it is sometimes more useful to average the profiling information over several events in order to get a better overall picture. This can be done using the following options, which will produce one dump for the initialize() phase, one for execute() and a third for finalize().
Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-after="ApplicationMgr*initialize*" --dump-before="ApplicationMgr*finalize*" Gaudi.exe options.opts
>
>
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-after="ApplicationMgr::start()" --dump-before="ApplicationMgr::stop()" Gaudi.exe options.opts
 
Deleted:
<
<
 Also note, valgrind generally requires a large amount of memory and CPU, and thus you may run into problems with the CPU time and virtual memory size limits in place on the general purpose lxplus nodes. If this happens, try using a private afs box (e.g. pclhcbXX.cern.ch) if such a machine is available to you. If not, you could try running on one of the lxbuildXXX machines (XXX = 035 or 042, although we should avoid all running at the same time on the same node ...).

kcachegrind is now available on lxplus - see here for details. Note, sourcing the above setup file (/afs/cern.ch/lhcb/group/rich/vol4/jonrob/scripts/new-valgrind.csh) adds the appropriate path for kcachegrind to your PATH.

Revision 512009-09-12 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 114 to 114
  ( Note the *'s are needed with the --dump-before method, I think to cope with name mangling under -O2 )
Added:
>
>
Alternatively, it is sometimes more useful to average the profiling information over several events in order to get a better overall picture. This can be done using the following options, which will produce one dump for the initialize() phase, one for execute() and a third for finalize().

 > valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-after="ApplicationMgr*initialize*" --dump-before="ApplicationMgr*finalize*" Gaudi.exe options.opts 
 Also note, valgrind generally requires a large amount of memory and CPU, and thus you may run into problems with the CPU time and virtual memory size limits in place on the general purpose lxplus nodes. If this happens, try using a private afs box (e.g. pclhcbXX.cern.ch) if such a machine is available to you. If not, you could try running on one of the lxbuildXXX machines (XXX = 035 or 042, although we should avoid all running at the same time on the same node ...).

kcachegrind is now available on lxplus - see here for details. Note, sourcing the above setup file (/afs/cern.ch/lhcb/group/rich/vol4/jonrob/scripts/new-valgrind.csh) adds the appropriate path for kcachegrind to your PATH.

Revision 502009-08-12 - unknown

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 100 to 100
  > valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="BooleInit*execute*" Gaudi.exe options.opts
Changed:
<
<
for Boole, or for Brunel
>
>
for Boole, for Brunel
 
 > valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="RecInit*execute*" Gaudi.exe options.opts
Added:
>
>
or for DaVinci

 > valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="DaVinciInit*execute*" Gaudi.exe options.opts
 ( Note the *'s are needed with the --dump-before method, I think to cope with name mangling under -O2 )

Also note, valgrind generally requires a large amount of memory and CPU, and thus you may run into problems with the CPU time and virtual memory size limits in place on the general purpose lxplus nodes. If this happens, try using a private afs box (e.g. pclhcbXX.cern.ch) if such a machine is available to you. If not, you could try running on one of the lxbuildXXX machines (XXX = 035 or 042, although we should avoid all running at the same time on the same node ...).

Revision 492009-08-12 - unknown

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 19 to 19
 --15591:0:aspacem Increase it and rebuild. Exiting now.
Changed:
<
<
It can be easily fixed, by doing as it says ! The file is coregrind/m_aspacemgr/aspacemgr-linux.c in the valgrind build directory (or coregrind/m_aspacemgr/aspacemgr.c in older releases). I found increasing the value from 2000 to 10000 seems to do the trick.
>
>
It can be easily fixed, by doing as it says ! The file is coregrind/m_aspacemgr/aspacemgr-linux.c in the valgrind build directory (or coregrind/m_aspacemgr/aspacemgr.c in older releases). I found increasing the value from 2000 to 15000 seems to do the trick in all cases I've found so far.
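A rough sketch of the patch-and-rebuild step (the macro name and its default value vary between valgrind releases, so check the source you are actually building; the sed pattern is only illustrative):

 > sed -i 's/#define VG_N_SEGMENTS .*/#define VG_N_SEGMENTS 15000/' coregrind/m_aspacemgr/aspacemgr-linux.c
 > ./configure --prefix=$HOME/valgrind && make && make install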
 

Usage at CERN

Revision 482009-08-11 - unknown

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 19 to 19
 --15591:0:aspacem Increase it and rebuild. Exiting now.
Changed:
<
<
It can be easily fixed, by doing as it says ! The file is coregrind/m_aspacemgr/aspacemgr-linux.c in the valgrind build directory and I found increasing the value from 2000 to 10000 seems to do the trick.
>
>
It can be easily fixed, by doing as it says ! The file is coregrind/m_aspacemgr/aspacemgr-linux.c in the valgrind build directory (or coregrind/m_aspacemgr/aspacemgr.c in older releases). I found increasing the value from 2000 to 10000 seems to do the trick.
 

Usage at CERN

Revision 472009-07-24 - unknown

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 19 to 19
 --15591:0:aspacem Increase it and rebuild. Exiting now.
Changed:
<
<
It can be easily fixed, by doing as it says ! The file is coregrind/m_aspacemgr/aspacemgr.c in the valgrind build directory and I found increasing the value from 2000 to 10000 seems to do the trick.
>
>
It can be easily fixed, by doing as it says ! The file is coregrind/m_aspacemgr/aspacemgr-linux.c in the valgrind build directory and I found increasing the value from 2000 to 10000 seems to do the trick.
 

Usage at CERN

Revision 462009-06-18 - MarcoCattaneo

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 138 to 138
 
  • (gdb) Ctrl- up-arrow/down-arrow allows to navigate through the commands you already typed

Further reading

Changed:
<
<
The following talks given at CHEP2009 give hints on strategies and tools for optimising code:
>
>
 
Changed:
<
<
Similar talks were given at CERN on 2009-04-16 at a MultiCore R&D meeting dedicated to performance monitoring
>
>
 
ChrisRJones - 27 Feb 2006
Added:
>
>
MarcoCattaneo - 18 Jun 2009
 
META TOPICMOVED by="ChrisRJones" date="1161806930" from="LHCb.CodeProfiling" to="LHCb.CodeAnalysisTools"

Revision 452009-04-16 - MarcoCattaneo

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 142 to 142
 
Added:
>
>
Similar talks were given at CERN on 2009-04-16 at a MultiCore R&D meeting dedicated to performance monitoring
 
ChrisRJones - 27 Feb 2006

Revision 442009-03-30 - MarcoCattaneo

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 137 to 137
 You can use the emacs toolbar to set break in lines, unset them and issue debugger commands, or you can pass them as command lines at the (gdb) prompt. In which case here are a couple of useful short-cuts:
  • (gdb) Ctrl- up-arrow/down-arrow allows to navigate through the commands you already typed

Added:
>
>

Further reading

The following talks given at CHEP2009 give hints on strategies and tools for optimising code:
 
ChrisRJones - 27 Feb 2006

Revision 432009-03-02 - unknown

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 59 to 59
 Then invoke valgrind with something like :-
Changed:
<
<
> valgrind --tool=memcheck Gaudi.exe options.opt
>
>
> valgrind --tool=memcheck Gaudi.exe options.opts
 

It is useful to add a few additional options, to improve the quality of the output. The options -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full increase the amount of information valgrind supplies.

Revision 422009-02-09 - MarcoCattaneo

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 48 to 48
  The main use of valgrind is perform a memory tests of your code for things like memory corruptions, memory leaks and uninitialised variables, using the memcheck valgrind tool.
Changed:
<
<
Simply run something like :-
>
>
Unfortunately, valgrind cannot be used directly with gaudirun.py, so you should first use gaudirun.py to generate "old style" Job Options:
 
Changed:
<
<
> valgrind --tool=memcheck Gaudi.exe opts.opt
>
>
> gaudirun.py options.py -n -v --old-opts > options.opts

Depending on your options and the version of gaudirun.py used to generate them, options.opts may contain some invalid comment lines. If this happens simply edit the file by hand and remove these lines.

Then invoke valgrind with something like :-

 > valgrind --tool=memcheck Gaudi.exe options.opt
 

It is useful to add a few additional options, to improve the quality of the output. The options -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full increase the amount of information valgrind supplies.

Line: 61 to 69
 So the final full command is then :-
Changed:
<
<
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full --show-reachable=yes --suppressions=$STDOPTS/Gaudi.supp Gaudi.exe opts.opt
>
>
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full --show-reachable=yes --suppressions=$STDOPTS/Gaudi.supp Gaudi.exe options.opt
 

The above command will send the output to the terminal. It is best to redirect this to file. One additional complication is the valgrind output is sent to STDERR, not STDOUT, so you will need to redirect both to file... This can be done (for bash like shells) by appending &>mem.log to the full valgrind command (for csh like shells use >& mem.log).

Line: 89 to 97
 One very useful option is "--dump-before" which can be used for the creation of an output file before calling particular methods. Using this with for instance BooleInit::execute allows the creation of one dump per event, which can then be read in individually. I.e. a full command line could be
Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="BooleInit*execute*" Gaudi.exe opts.opt
>
>
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="BooleInit*execute*" Gaudi.exe options.opts
 

for Boole, or for Brunel

Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="RecInit*execute*" Gaudi.exe opts.opt
>
>
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="RecInit*execute*" Gaudi.exe options.opts
 

( Note the *'s are needed with the --dump-before method, I think to cope with name mangling under -O2 )

Line: 116 to 124
 

Understanding the output

The Atlas valgrind TWiki contains hints on interpreting the valgrind output: https://twiki.cern.ch/twiki/bin/view/Atlas/UsingValgrind
Deleted:
<
<

Usage with gaudirun.py

Unfortunately valgrind cannot be used directly with gaudirun.py. The easiest work around is to first use gaudirun.py to generate old style Job Options and then to use the generated as above. E.g.

 > gaudirun.py options.py -n -v --old-opts > oldstyle.opts

Note the above might, depending on your options and the version of gaudi you are using place some invalid comment lines in the generate file. If this happens simply edit the file by hand and remove these lines.

 

Debugging gaudirun.py on Linux

We recommend that you use Emacs to start a debug session. The following works:

Revision 412009-01-30 - GloriaCorti

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 132 to 132
  > emacs M-X gdb (or Esc-X on many keyboards) gdb python
Changed:
<
<
(gdb) `which gaudirun.py` yourOptions.py
>
>
(gdb) run `which gaudirun.py` yourOptions.py
 

You can use the emacs toolbar to set break in lines, unset them and issue debugger commands, or you can pass them as command lines at the (gdb) prompt. In which case here are a couple of useful short-cuts:

  • (gdb) Ctrl- up-arrow/down-arrow allows to navigate through the commands you already typed
Deleted:
<
<
  • (gdb) Ctrl-X space sets a break point in the line where the pointer is in the buffer
 


Revision 402009-01-30 - GloriaCorti

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 132 to 132
  > emacs M-X gdb (or Esc-X on many keyboards) gdb python
Changed:
<
<
gdb `which gaudirun.py` yourOptions.py
>
>
(gdb) `which gaudirun.py` yourOptions.py
 
Added:
>
>
You can use the emacs toolbar to set break in lines, unset them and issue debugger commands, or you can pass them as command lines at the (gdb) prompt. In which case here are a couple of useful short-cuts:
  • (gdb) Ctrl- up-arrow/down-arrow allows to navigate through the commands you already typed
  • (gdb) Ctrl-X space sets a break point in the line where the pointer is in the buffer
 
ChrisRJones - 27 Feb 2006

Revision 392009-01-28 - MarcoCattaneo

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 51 to 51
 Simply run something like :-
Changed:
<
<
> valgrind --tool=memcheck Brunel.exe opts.opt
>
>
> valgrind --tool=memcheck Gaudi.exe opts.opt
 
Deleted:
<
<
for Brunel, for example.
 It is useful to add a few additional options, to improve the quality of the output. The options -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full increase the amount of information valgrind supplies.

In addition, there are some warnings which are well known (for instance from the Gaudi framework or even third party libraries, like STL or boost) and it is best to suppress these, otherwise the amount of output to deal with is huge. This can be done with --suppressions=$STDOPTS/Gaudi.supp.

Line: 63 to 61
 So the final full command is then :-
Changed:
<
<
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full --show-reachable=yes --suppressions=$STDOPTS/Gaudi.supp Brunel.exe opts.opt
>
>
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full --show-reachable=yes --suppressions=$STDOPTS/Gaudi.supp Gaudi.exe opts.opt
 

The above command will send the output to the terminal. It is best to redirect this to file. One additional complication is the valgrind output is sent to STDERR, not STDOUT, so you will need to redirect both to file... This can be done (for bash like shells) by appending &>mem.log to the full valgrind command (for csh like shells use >& mem.log).

Line: 91 to 89
 One very useful option is "--dump-before" which can be used for the creation of an output file before calling particular methods. Using this with for instance BooleInit::execute allows the creation of one dump per event, which can then be read in individually. I.e. a full command line could be
Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="BooleInit*execute*" Boole.exe opts.opt
>
>
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="BooleInit*execute*" Gaudi.exe opts.opt
 

for Boole, or for Brunel

Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="RecInit*execute*" Brunel.exe opts.opt
>
>
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="RecInit*execute*" Gaudi.exe opts.opt
 

( Note the *'s are needed with the --dump-before method, I think to cope with name mangling under -O2 )

Line: 137 to 135
  gdb `which gaudirun.py` yourOptions.py
Deleted:
<
<

Data-base Timeout Problems

The latest versions of the LHCb applications are using the new database system (COOL etc.). If when profiling you see message like

LHCBCOND.TimeOu...   INFO Disconnect from database
DDDB.TimeOutChe...   INFO Disconnect from database

Then you have run into timeout problems with the data-base server, due to the much slower run times. To fix this add the options

DDDB.ConnectionTimeOut = 0;
LHCBCOND.ConnectionTimeOut = 0;

 
ChrisRJones - 27 Feb 2006

Revision 382009-01-21 - MarcoCattaneo

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 118 to 118
 

Understanding the output

The Atlas valgrind TWiki contains hints on interpreting the valgrind output: https://twiki.cern.ch/twiki/bin/view/Atlas/UsingValgrind
Changed:
<
<

Usage with Gaudi Python Configurables

>
>

Usage with gaudirun.py

 
Changed:
<
<
The LHCb software is switching from Job Objects to Python Configurables. Unfortunately valgrind cannot be used directly with gaudirun.py. The easiest work around is to first use gaudirun.py to generate old style Job Options and then to use the generated as above. E.g.
>
>
Unfortunately valgrind cannot be used directly with gaudirun.py. The easiest work around is to first use gaudirun.py to generate old style Job Options and then to use the generated as above. E.g.
 
 > gaudirun.py options.py -n -v --old-opts > oldstyle.opts
Line: 128 to 128
  Note the above might, depending on your options and the version of gaudi you are using place some invalid comment lines in the generate file. If this happens simply edit the file by hand and remove these lines.
Added:
>
>

Debugging gaudirun.py on Linux

We recommend that you use Emacs to start a debug session. The following works:
 > emacs
 M-X gdb  (or Esc-X on many keyboards)
 gdb python
 gdb `which gaudirun.py` yourOptions.py 
 

Data-base Timeout Problems

The latest versions of the LHCb applications are using the new database system (COOL etc.). If when profiling you see message like

Revision 372008-11-26 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 118 to 118
 

Understanding the output

The Atlas valgrind TWiki contains hints on interpreting the valgrind output: https://twiki.cern.ch/twiki/bin/view/Atlas/UsingValgrind
Added:
>
>

Usage with Gaudi Python Configurables

 
Added:
>
>
The LHCb software is switching from Job Objects to Python Configurables. Unfortunately valgrind cannot be used directly with gaudirun.py. The easiest work around is to first use gaudirun.py to generate old style Job Options and then to use the generated as above. E.g.

 > gaudirun.py options.py -n -v --old-opts > oldstyle.opts

Note the above might, depending on your options and the version of gaudi you are using place some invalid comment lines in the generate file. If this happens simply edit the file by hand and remove these lines.

 

Data-base Timeout Problems

Revision 362008-09-02 - MarcoCattaneo

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 115 to 115
  Also, much more information is available if the "-g" option is used. Our un-optimised debug (CMTDEB) builds have this but not in our optimised (CMTCONFIG) builds. If you want to use this with your optimsed builds, this can be done by checking out a private version of GaudiPolicy, update the optimisation flags in the requirements file to include "-g" as well as "-O2" and rebuild the libraries you wish to profile. Note this will increase the size of the binaries so if you have many libraries and a small AFS user quota you might need to increase it (email Joel Closier).
Added:
>
>

Understanding the output

The Atlas valgrind TWiki contains hints on interpreting the valgrind output: https://twiki.cern.ch/twiki/bin/view/Atlas/UsingValgrind

 

Data-base Timeout Problems

The latest versions of the LHCb applications are using the new database system (COOL etc.). If when profiling you see message like

Revision 352008-04-06 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 56 to 56
  for Brunel, for example.
Changed:
<
<
I find it useful to add a few additional options, to improve the quality of the output. The options -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full increase the amount of information valgrind supplies.
>
>
It is useful to add a few additional options, to improve the quality of the output. The options -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full increase the amount of information valgrind supplies.
 
Changed:
<
<
In additional, there are some warnings which are well known about (for instance from the Gaudi framework or even thrid party libraries, like STL or boost) and it is best to suppress these, otherwise the amount of output to deal with is huge. This can be done with --suppressions=$STDOPTS/Gaudi.supp, so the final full command line I tend to use is :-
>
>
In addition, there are some warnings which are well known (for instance from the Gaudi framework or even third party libraries, like STL or boost) and it is best to suppress these, otherwise the amount of output to deal with is huge. This can be done with --suppressions=$STDOPTS/Gaudi.supp.

So the final full command is then :-

 
 > valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full --show-reachable=yes --suppressions=$STDOPTS/Gaudi.supp Brunel.exe opts.opt
Line: 84 to 86
  and where application is for example "Boole.exe options-file". This will produce an output file of the form callgrind.xxxxx, which can be read in by kcachegrind. More information on available command line options are given on the web page, or by running "callgrind --help or valgrind --help".
Changed:
<
<
Personally I find the options "--dump-instr=yes --trace-jump=yes" useful as they provide more information.
>
>
The options --dump-instr=yes --trace-jump=yes are also useful as they provide more information.
  One very useful option is "--dump-before" which can be used for the creation of an output file before calling particular methods. Using this with for instance BooleInit::execute allows the creation of one dump per event, which can then be read in individually. I.e. a full command line could be
Line: 100 to 102
  ( Note the *'s are needed with the --dump-before method, I think to cope with name mangling under -O2 )
Changed:
<
<
Also note, valgrind generally requires a large amount of memory and CPU, and thus you may run into problems with the CPU time and virtual memory size limits in place on the general purpose lxplus nodes. If this happens, I suggest using a private afs box (e.g. pclhcbXX.cern.ch) if such a machine is available to you. If not, you could try running on one of the lxbuildXXX machines (XXX = 035 or 042, although we should avoid all running at the same time on the same node ...).
>
>
Also note, valgrind generally requires a large amount of memory and CPU, and thus you may run into problems with the CPU time and virtual memory size limits in place on the general purpose lxplus nodes. If this happens, try using a private afs box (e.g. pclhcbXX.cern.ch) if such a machine is available to you. If not, you could try running on one of the lxbuildXXX machines (XXX = 035 or 042, although we should avoid all running at the same time on the same node ...).
  kcachegrind is now available on lxplus - see here for details. Note, sourcing the above setup file (/afs/cern.ch/lhcb/group/rich/vol4/jonrob/scripts/new-valgrind.csh) adds the appropriate path for kcachegrind to your PATH.
Changed:
<
<
Using kcachegrind takes some getting used to. One of the first things I recommend you do is add your code location (e.g. ~/cmtuser) to the list of known locations (settings -> configure kcachegrind). The kcachegrind web page contains more hints on getting started.
>
>
Using kcachegrind takes some getting used to. One of the first things to do is add your code location (e.g. ~/cmtuser) to the list of known locations (settings -> configure kcachegrind). The kcachegrind web page contains more hints on getting started.
 

Debug and Optimised Libraries

callgrind can work on both optimised (CMTCONFIG) and un-optimised (CMTDEB) builds, although when running on optimised builds bear in mind you will not get all possible information. For instance inline functions, by their nature, are optimised away (those which are actually inlined by the compiler) and thus are not seen by callgrind and will not appear in the profiling. Similarly you will not get annotated source code listings.

Changed:
<
<
This does not happen in un-optimised builds, where all information is available, but of course you must then bear in mind the timing information you get will be different. The best approach I have found is to first try running the normal optimised build. If you find you want finer grained information than that this method provides, then try the debug build.
>
>
This does not happen in un-optimised builds, where all information is available, but of course you must then bear in mind the timing information you get will be different. A good approach is to first try running the normal optimised build. If you find you want finer grained information than that this method provides, then try the debug build.
  Also, much more information is available if the "-g" option is used. Our un-optimised debug (CMTDEB) builds have this but not in our optimised (CMTCONFIG) builds. If you want to use this with your optimsed builds, this can be done by checking out a private version of GaudiPolicy, update the optimisation flags in the requirements file to include "-g" as well as "-O2" and rebuild the libraries you wish to profile. Note this will increase the size of the binaries so if you have many libraries and a small AFS user quota you might need to increase it (email Joel Closier).

Revision 342008-03-05 - SposS

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 100 to 100
  ( Note the *'s are needed with the --dump-before method, I think to cope with name mangling under -O2 )
Changed:
<
<
Also note, valgrind generally requires a large amount of memory and CPU, and thus you may run into problems with the CPU time and virtual memory size limits in place on the general purpose lxplus nodes. If this happens, I suggest using a private afs box (e.g. pclhcbXX.cern.ch) if such a machine is available to you. If not, you could try running on one of the lxbuildXXX machines (XXX = 035 or ?, although we should avoid all running at the same time on the same node ...).
>
>
Also note, valgrind generally requires a large amount of memory and CPU, and thus you may run into problems with the CPU time and virtual memory size limits in place on the general purpose lxplus nodes. If this happens, I suggest using a private afs box (e.g. pclhcbXX.cern.ch) if such a machine is available to you. If not, you could try running on one of the lxbuildXXX machines (XXX = 035 or 042, although we should avoid all running at the same time on the same node ...).
  kcachegrind is now available on lxplus - see here for details. Note, sourcing the above setup file (/afs/cern.ch/lhcb/group/rich/vol4/jonrob/scripts/new-valgrind.csh) adds the appropriate path for kcachegrind to your PATH.

Revision 332008-03-05 - SposS

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 100 to 100
  ( Note the *'s are needed with the --dump-before method, I think to cope with name mangling under -O2 )
Changed:
<
<
Also note, valgrind generally requires a large amount of memory and CPU, and thus you may run into problems with the CPU time and virtual memory size limits in place on the general purpose lxplus nodes. If this happens, I suggest using a private afs box (e.g. pclhcbXX.cern.ch) if such a machine is available to you. If not, you could try running on one of the lxbuildXXX machines (although we should avoid all running at the same time on the same node ...).
>
>
Also note, valgrind generally requires a large amount of memory and CPU, and thus you may run into problems with the CPU time and virtual memory size limits in place on the general purpose lxplus nodes. If this happens, I suggest using a private afs box (e.g. pclhcbXX.cern.ch) if such a machine is available to you. If not, you could try running on one of the lxbuildXXX machines (XXX = 035 or ?, although we should avoid all running at the same time on the same node ...).
  kcachegrind is now available on lxplus - see here for details. Note, sourcing the above setup file (/afs/cern.ch/lhcb/group/rich/vol4/jonrob/scripts/new-valgrind.csh) adds the appropriate path for kcachegrind to your PATH.

Revision 322007-12-13 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 8 to 8
 

The "valgrind, callgrind and kcachegrind" Utilities

Changed:
<
<
Valgrind is a general purpose utility for analyzing software. It contains various "tools" that perform tasks such as memory allocation checking, heap analysis and code profiling.
>
>
Valgrind is a general purpose utility for analyzing software. It contains various "tools" that perform tasks such as memory allocation checking, heap analysis and code profiling.
 

Building valgrind and callgrind

Revision 312007-12-12 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 10 to 10
  Valgrind is a general purpose utility for analyzing software. It contains various "tools" that perform tasks such as memory allocation checking, heap analysis and code profiling.
Deleted:
<
<
A tool called "callgrind" exists which does some detailed code profiling. For valgrind versions 3.1.1 and earlier callgrind is not included in the main valgrind package, it must be installed seperately. As of valgrind 3.2.0, callgrind was integrated into the mainstream valgrind package.

In addition, a nice GUI is available to view the output of this tool, called "kcachegrind" (kcachegrind can view the output of cachegrind, but despite its confusing name it is actually primarily designed for callgrind).

See here for more details on callgrind and kcachegrind.

 

Building valgrind and callgrind

valgrind and callgrind can be built easily using the tar files available from the respective web pages (see above). However, it appears that the size of our LHCb applications is larger than can be handled by the default values coded into valgrind. This is seen by the error

Line: 50 to 44
  which currently provides the latest versions of valgrind and callgrind, with a private patch applied to increase VG_N_SEGMENTS to 10000, which seems enough for the LHCb applications I have tried (If you find otherwise, please let me know).
Added:
>
>

Memory Tests

The main use of valgrind is perform a memory tests of your code for things like memory corruptions, memory leaks and uninitialised variables, using the memcheck valgrind tool.

Simply run something like :-

 > valgrind --tool=memcheck Brunel.exe opts.opt

for Brunel, for example.

I find it useful to add a few additional options, to improve the quality of the output. The options -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full increase the amount of information valgrind supplies.

In additional, there are some warnings which are well known about (for instance from the Gaudi framework or even thrid party libraries, like STL or boost) and it is best to suppress these, otherwise the amount of output to deal with is huge. This can be done with --suppressions=$STDOPTS/Gaudi.supp, so the final full command line I tend to use is :-

 > valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full --show-reachable=yes --suppressions=$STDOPTS/Gaudi.supp Brunel.exe opts.opt

The above command will send the output to the terminal. It is best to redirect this to file. One additional complication is the valgrind output is sent to STDERR, not STDOUT, so you will need to redirect both to file... This can be done (for bash like shells) by appending &>mem.log to the full valgrind command (for csh like shells use >& mem.log).

This will produce a large output file contain detailed information on any memory problems. You can either simply read this file directly, or if you prefer use the alleyoop application to help interpret the errors.

 

Code Profiling

Added:
>
>
A valgrind tool called "callgrind" also exists which does some detailed code profiling. For valgrind versions 3.1.1 and earlier callgrind is not included in the main valgrind package, it must be installed seperately. As of valgrind 3.2.0, callgrind was integrated into the mainstream valgrind package.

In addition, a nice GUI is available to view the output of this tool, called "kcachegrind" (kcachegrind can view the output of cachegrind, but despite its confusing name it is actually primarily designed for callgrind).

See here for more details on callgrind and kcachegrind.

 In the simple case, usage is just
Line: 82 to 106
  Using kcachegrind takes some getting used to. One of the first things I recommend you do is add your code location (e.g. ~/cmtuser) to the list of known locations (settings -> configure kcachegrind). The kcachegrind web page contains more hints on getting started.
Deleted:
<
<

Memory Tests

valgrind can also be used to perform a memory test of your code for things like memory corruptions, memory leaks and uninitialised variables. Simply run something like :-

 > valgrind --tool=memcheck Brunel.exe opts.opt

for Brunel, for example. I find it useful to add a few additional options, to improve the quality of the output. The options -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full increase the amount of information valgrind supplies.

In additional, there are some warnings which are well known about (for instance from the Gaudi framework or even thrid party libraries, like STL or boost) and it is best to suppress these, otherwise the amount of output to deal with is huge. This can be done with --suppressions=$STDOPTS/Gaudi.supp, so the final full command line I tend to use is :-

 > valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full --show-reachable=yes --suppressions=$STDOPTS/Gaudi.supp Brunel.exe opts.opt

The above command will send the output to the terminal. It is best to redirect this to file. One additional complication is the valgrind output is sent to STDERR, not STDOUT, so you will need to redirect both to file... This can be done (for bash like shells) by appending &>mem.log to the full valgrind command (for csh like shells use >& mem.log).

This will produce a large output file contain detailed information on any memory problems. You can either simply read this file directly, or if you prefer use the alleyoop application to help interpret the errors.

 

Debug and Optimised Libraries

callgrind can work on both optimised (CMTCONFIG) and un-optimised (CMTDEB) builds, although when running on optimised builds bear in mind you will not get all possible information. For instance inline functions, by their nature, are optimised away (those which are actually inlined by the compiler) and thus are not seen by callgrind and will not appear in the profiling. Similarly you will not get annotated source code listings.

Revision 302007-12-07 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 92 to 92
  for Brunel, for example. I find it useful to add a few additional options, to improve the quality of the output. The options -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full increase the amount of information valgrind supplies.
Changed:
<
<
In additional, there are some warnings with are well known about (for instance from the Gaudi framework or even thrid party libraries, like STL or boost) and it is best to suppress these, otherwise the amount of output to deal with is huge. This can be done with --suppressions=$STDOPTS/Gaudi.supp, so the final full command line is :-
>
>
In additional, there are some warnings which are well known about (for instance from the Gaudi framework or even thrid party libraries, like STL or boost) and it is best to suppress these, otherwise the amount of output to deal with is huge. This can be done with --suppressions=$STDOPTS/Gaudi.supp, so the final full command line I tend to use is :-
 
 > valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full --show-reachable=yes --suppressions=$STDOPTS/Gaudi.supp Brunel.exe opts.opt

Revision 292007-12-06 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 98 to 98
  > valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full --show-reachable=yes --suppressions=$STDOPTS/Gaudi.supp Brunel.exe opts.opt
Changed:
<
<
The above command will send the output to the terminal. It is best to redirect this to file. One additional complication is the valgrind output is sent to STDERR, not STDOUT, so what I do is first redirect STDERR to STDOUT, then direct STDOUT to file... This can be done (for bash like shells) by appending &>mem.log to the full valgrind command (for csh like shells use >& mem.log).
>
>
The above command will send the output to the terminal. It is best to redirect this to file. One additional complication is the valgrind output is sent to STDERR, not STDOUT, so you will need to redirect both to file... This can be done (for bash like shells) by appending &>mem.log to the full valgrind command (for csh like shells use >& mem.log).
  This will produce a large output file contain detailed information on any memory problems. You can either simply read this file directly, or if you prefer use the alleyoop application to help interpret the errors.
Line: 109 to 109
  Also, much more information is available if the "-g" option is used. Our un-optimised debug (CMTDEB) builds have this but not in our optimised (CMTCONFIG) builds. If you want to use this with your optimsed builds, this can be done by checking out a private version of GaudiPolicy, update the optimisation flags in the requirements file to include "-g" as well as "-O2" and rebuild the libraries you wish to profile. Note this will increase the size of the binaries so if you have many libraries and a small AFS user quota you might need to increase it (email Joel Closier).
Changed:
<
<

Data-base Timeout Problems

>
>

Data-base Timeout Problems

  The latest versions of the LHCb applications are using the new database system (COOL etc.). If when profiling you see message like

Revision 282007-12-05 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 90 to 90
  > valgrind --tool=memcheck Brunel.exe opts.opt
Changed:
<
<
for Brunel, for example.
>
>
for Brunel, for example. I find it useful to add a few additional options, to improve the quality of the output. The options -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full increase the amount of information valgrind supplies.
 
Changed:
<
<
However, I find it useful to add a few additional options, to improve the quality of the output. The options -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full increase the amount of information valgrind supplies. In additional, there are some warnings with are well known about, and it is best to suppress these (otherwise the log file can be huge). This can be done with --suppressions=$STDOPTS/Gaudi.supp, so the final full command line is :-
>
>
In additional, there are some warnings with are well known about (for instance from the Gaudi framework or even thrid party libraries, like STL or boost) and it is best to suppress these, otherwise the amount of output to deal with is huge. This can be done with --suppressions=$STDOPTS/Gaudi.supp, so the final full command line is :-
 
 > valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full --show-reachable=yes --suppressions=$STDOPTS/Gaudi.supp Brunel.exe opts.opt

Revision 272007-12-05 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 84 to 84
 

Memory Tests

Changed:
<
<
valgrind can also be used to perform a memory test of your code for things like memory corruptions, memory leaks and uninitialised variables. Simply run :-
>
>
valgrind can also be used to perform a memory test of your code for things like memory corruptions, memory leaks and uninitialised variables. Simply run something like :-
 
Changed:
<
<
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full --show-reachable=yes Brunel.exe opts.opt
>
>
> valgrind --tool=memcheck Brunel.exe opts.opt
 

for Brunel, for example.

Changed:
<
<
The above command will send a lot of output to the terminal. It is best to redirect this to file. One additional complication is the valgrind output is sent to STDERR, not STDOUT, so what I do is first redirect STDERR to STDOUT, then direct STDOUT to file... This can be done (for bash like shells) with
>
>
However, I find it useful to add a few additional options, to improve the quality of the output. The options -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full increase the amount of information valgrind supplies. In additional, there are some warnings with are well known about, and it is best to suppress these (otherwise the log file can be huge). This can be done with --suppressions=$STDOPTS/Gaudi.supp, so the final full command line is :-
 
Changed:
<
<
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full --show-reachable=yes Brunel.exe opts.opt &>mem.log or on csh like shells
 > valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full --show-reachable=yes Brunel.exe opts.opt >& mem.log
>
>
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full --show-reachable=yes --suppressions=$STDOPTS/Gaudi.supp Brunel.exe opts.opt
 
Added:
>
>
The above command will send the output to the terminal. It is best to redirect this to file. One additional complication is the valgrind output is sent to STDERR, not STDOUT, so what I do is first redirect STDERR to STDOUT, then direct STDOUT to file... This can be done (for bash like shells) by appending &>mem.log to the full valgrind command (for csh like shells use >& mem.log).
 This will produce a large output file contain detailed information on any memory problems. You can either simply read this file directly, or if you prefer use the alleyoop application to help interpret the errors.

Debug and Optimised Libraries

Revision 262007-12-05 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 42 to 42
 
 > source /afs/cern.ch/lhcb/group/rich/vol4/jonrob/scripts/new-valgrind.csh
Added:
>
>
or
 > source /afs/cern.ch/lhcb/group/rich/vol4/jonrob/scripts/new-valgrind.sh
for csh or bash like shells respectively.
  which currently provides the latest versions of valgrind and callgrind, with a private patch applied to increase VG_N_SEGMENTS to 10000, which seems enough for the LHCb applications I have tried (If you find otherwise, please let me know).
Line: 87 to 92
  for Brunel, for example.
Added:
>
>
The above command will send a lot of output to the terminal. It is best to redirect this to file. One additional complication is the valgrind output is sent to STDERR, not STDOUT, so what I do is first redirect STDERR to STDOUT, then direct STDOUT to file... This can be done (for bash like shells) with

 > valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full --show-reachable=yes Brunel.exe opts.opt &>mem.log
or on csh like shells
 > valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full --show-reachable=yes Brunel.exe opts.opt >& mem.log

This will produce a large output file contain detailed information on any memory problems. You can either simply read this file directly, or if you prefer use the alleyoop application to help interpret the errors.

 

Debug and Optimised Libraries

callgrind can work on both optimised (CMTCONFIG) and un-optimised (CMTDEB) builds, although when running on optimised builds bear in mind you will not get all possible information. For instance inline functions, by their nature, are optimised away (those which are actually inlined by the compiler) and thus are not seen by callgrind and will not appear in the profiling. Similarly you will not get annotated source code listings.

Revision 242007-09-28 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 82 to 82
 valgrind can also be used to perform a memory test of your code for things like memory corruptions, memory leaks and uninitialised variables. Simply run :-
Changed:
<
<
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=9999 Brunel.exe opts.opt
>
>
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=9999 --leak-check=full --show-reachable=yes Brunel.exe opts.opt
 

for Brunel, for example.

Revision 232007-06-12 - unknown

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 82 to 82
 valgrind can also be used to perform a memory test of your code for things like memory corruptions, memory leaks and uninitialised variables. Simply run :-
Changed:
<
<
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes Brunel.exe opts.opt
>
>
> valgrind --tool=memcheck -v --error-limit=no --leak-check=yes --num-callers=9999 Brunel.exe opts.opt
 

for Brunel, for example.

Revision 222007-03-24 - unknown

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 92 to 92
 callgrind can work on both optimised (CMTCONFIG) and un-optimised (CMTDEB) builds, although when running on optimised builds bear in mind you will not get all possible information. For instance inline functions, by their nature, are optimised away (those which are actually inlined by the compiler) and thus are not seen by callgrind and will not appear in the profiling. Similarly you will not get annotated source code listings. This does not happen in un-optimised builds, where all information is available, but of course you must then bear in mind the timing information you get will be different. The best approach I have found is to first try running the normal optimised build. If you find you want finer grained information than that this method provides, then try the debug build.
Changed:
<
<
Also, much more information is available if the "-g" option is used. Our un-optimised debug (CMTDEB) builds have this but not in our optimised (CMTCONFIG) builds. If you want to use this with your optimsed builds, this can be done by checking out a private version of GaudiPolicy, update the optimisation flags in the requirements file to include "-g" as well as "-O2" and rebuild the libraries you wish to profile. Note this will increase the size of the binaries so if you have many libraries and a small AFSLoginProblems quota you might need to increase it (email Joel Closier).
>
>
Also, much more information is available if the "-g" option is used. Our un-optimised debug (CMTDEB) builds have this but not in our optimised (CMTCONFIG) builds. If you want to use this with your optimsed builds, this can be done by checking out a private version of GaudiPolicy, update the optimisation flags in the requirements file to include "-g" as well as "-O2" and rebuild the libraries you wish to profile. Note this will increase the size of the binaries so if you have many libraries and a small AFS user quota you might need to increase it (email Joel Closier).

Data-base Timeout Problems

The latest versions of the LHCb applications are using the new database system (COOL etc.). If when profiling you see message like

LHCBCOND.TimeOu...   INFO Disconnect from database
DDDB.TimeOutChe...   INFO Disconnect from database

Then you have run into timeout problems with the data-base server, due to the much slower run times. To fix this add the options

DDDB.ConnectionTimeOut = 0;
LHCBCOND.ConnectionTimeOut = 0;
 
ChrisRJones - 27 Feb 2006

Revision 212007-03-23 - unknown

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 60 to 60
 One very useful option is "--dump-before" which can be used for the creation of an output file before calling particular methods. Using this with for instance BooleInit::execute allows the creation of one dump per event, which can then be read in individually. I.e. a full command line could be
Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before=BooleInit*execute*" Boole.exe opts.opt
>
>
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="BooleInit*execute*" Boole.exe opts.opt
 

for Boole, or for Brunel

Revision 202006-11-16 - unknown

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 92 to 92
 callgrind can work on both optimised (CMTCONFIG) and un-optimised (CMTDEB) builds, although when running on optimised builds bear in mind you will not get all possible information. For instance inline functions, by their nature, are optimised away (those which are actually inlined by the compiler) and thus are not seen by callgrind and will not appear in the profiling. Similarly you will not get annotated source code listings. This does not happen in un-optimised builds, where all information is available, but of course you must then bear in mind the timing information you get will be different. The best approach I have found is to first try running the normal optimised build. If you find you want finer grained information than that this method provides, then try the debug build.
Changed:
<
<
Also, much more information is available if the "-g" option is used. Our un-optimised debug (CMTDEB) builds have this but not in our optimised (CMTCONFIG) builds. If you want to use this with your optimsed builds, this can be done by checking out a private version of GaudiPolicy, update the optimisation flags in the requirements file to include "-g" as well as "-O2" and rebuild the libraries you wish to profile. Note this will increase the size of the binaries so if you have many libraries and a small AFS quota you might need to increase it (email Joel Closier).
>
>
Also, much more information is available if the "-g" option is used. Our un-optimised debug (CMTDEB) builds have this but not in our optimised (CMTCONFIG) builds. If you want to use this with your optimsed builds, this can be done by checking out a private version of GaudiPolicy, update the optimisation flags in the requirements file to include "-g" as well as "-O2" and rebuild the libraries you wish to profile. Note this will increase the size of the binaries so if you have many libraries and a small AFSLoginProblems quota you might need to increase it (email Joel Closier).
 
ChrisRJones - 27 Feb 2006

Revision 192006-10-25 - unknown

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Analysis Tools

Line: 96 to 96
 
ChrisRJones - 27 Feb 2006 \ No newline at end of file
Added:
>
>
META TOPICMOVED by="ChrisRJones" date="1161806930" from="LHCb.CodeProfiling" to="LHCb.CodeAnalysisTools"

Revision 182006-10-10 - unknown

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"
Changed:
<
<

Code Profiling and Analysis

>
>

Code Analysis Tools

  This page contain a guide to using various code profiling and analysis tools on the LHCb software. Please feel free to any information to this page.
Line: 45 to 45
  which currently provides the latest versions of valgrind and callgrind, with a private patch applied to increase VG_N_SEGMENTS to 10000, which seems enough for the LHCb applications I have tried (If you find otherwise, please let me know).
Added:
>
>

Code Profiling

 In the simple case, usage is just
Line: 75 to 77
  Using kcachegrind takes some getting used to. One of the first things I recommend you do is add your code location (e.g. ~/cmtuser) to the list of known locations (settings -> configure kcachegrind). The kcachegrind web page contains more hints on getting started.
Added:
>
>

Memory Tests

valgrind can also be used to perform a memory test of your code for things like memory corruptions, memory leaks and uninitialised variables. Simply run :-

 > valgrind --tool=memcheck -v --error-limit=no --leak-check=yes  Brunel.exe opts.opt

for Brunel, for example.

 

Debug and Optimised Libraries

callgrind can work on both optimised (CMTCONFIG) and un-optimised (CMTDEB) builds, although when running on optimised builds bear in mind you will not get all possible information. For instance inline functions, by their nature, are optimised away (those which are actually inlined by the compiler) and thus are not seen by callgrind and will not appear in the profiling. Similarly you will not get annotated source code listings.

Revision 172006-08-02 - unknown

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Profiling and Analysis

Line: 69 to 69
  ( Note the *'s are needed with the --dump-before method, I think to cope with name mangling under -O2 )
Changed:
<
<
Also note, valgrind generally requires a large amount of memory and CPU, and thus you may run into problems with the CPU time and virtual memory size limits in place on the general purpose lxplus nodes. If this happens, I suggest using a private afs box (e.g. pclhcbXX.cern.ch) if such a machine is available to you. If not, you could try running on lxbuild020 (although we should avoid all running at the same time...).
>
>
Also note, valgrind generally requires a large amount of memory and CPU, and thus you may run into problems with the CPU time and virtual memory size limits in place on the general purpose lxplus nodes. If this happens, I suggest using a private afs box (e.g. pclhcbXX.cern.ch) if such a machine is available to you. If not, you could try running on one of the lxbuildXXX machines (although we should avoid all running at the same time on the same node ...).
 
Changed:
<
<
kcachegrind is now available on lxplus - see here for details. Note, sourcing the above setup file (/afs/cern.ch/lhcb/group/rich/vol4/jonrob/scripts/new-valgrind.csh) adds the appropriate path for kcachegrind to your path.
>
>
kcachegrind is now available on lxplus - see here for details. Note, sourcing the above setup file (/afs/cern.ch/lhcb/group/rich/vol4/jonrob/scripts/new-valgrind.csh) adds the appropriate path for kcachegrind to your PATH.
  Using kcachegrind takes some getting used to. One of the first things I recommend you do is add your code location (e.g. ~/cmtuser) to the list of known locations (settings -> configure kcachegrind). The kcachegrind web page contains more hints on getting started.

Debug and Optimised Libraries

callgrind can work on both optimised (CMTCONFIG) and un-optimised (CMTDEB) builds, although when running on optimised builds bear in mind you will not get all possible information. For instance inline functions, by their nature, are optimised away (those which are actually inlined by the compiler) and thus are not seen by callgrind and will not appear in the profiling. Similarly you will not get annotated source code listings.

Changed:
<
<
This does not happen in un-optimised builds, where all information is available, but of course you must then bear in mind the timing information you get will be different. The best approach I have found is to first try running the normal optimised build. If you find you want finer grained information that this provides, then try the debug build.
>
>
This does not happen in un-optimised builds, where all information is available, but of course you must then bear in mind the timing information you get will be different. The best approach I have found is to first try running the normal optimised build. If you find you want finer grained information than that this method provides, then try the debug build.
  Also, much more information is available if the "-g" option is used. Our un-optimised debug (CMTDEB) builds have this but not in our optimised (CMTCONFIG) builds. If you want to use this with your optimsed builds, this can be done by checking out a private version of GaudiPolicy, update the optimisation flags in the requirements file to include "-g" as well as "-O2" and rebuild the libraries you wish to profile. Note this will increase the size of the binaries so if you have many libraries and a small AFS quota you might need to increase it (email Joel Closier).

Revision 162006-06-22 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Profiling and Analysis

Line: 58 to 58
 One very useful option is "--dump-before" which can be used for the creation of an output file before calling particular methods. Using this with for instance BooleInit::execute allows the creation of one dump per event, which can then be read in individually. I.e. a full command line could be
Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="BooleInit*execute*" Boole.exe opts.opt
>
>
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before=BooleInit*execute*" Boole.exe opts.opt

for Boole, or for Brunel

 > valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="RecInit*execute*" Brunel.exe opts.opt
 

( Note the *'s are needed with the --dump-before method, I think to cope with name mangling under -O2 )

Revision 152006-06-09 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Profiling and Analysis

Line: 58 to 58
 One very useful option is "--dump-before" which can be used for the creation of an output file before calling particular methods. Using this with for instance BooleInit::execute allows the creation of one dump per event, which can then be read in individually. I.e. a full command line could be
Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="*BooleInit*execute*" Boole.exe opts.opt
>
>
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="BooleInit*execute*" Boole.exe opts.opt
 

( Note the *'s are needed with the --dump-before method, I think to cope with name mangling under -O2 )

Revision 142006-06-08 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Profiling and Analysis

Line: 10 to 10
  Valgrind is a general purpose utility for analyzing software. It contains various "tools" that perform tasks such as memory allocation checking, heap analysis and code profiling.
Changed:
<
<
The core valgrind package contains a tool called "cachegrind" which does some code profiling. However, a much more complete tool is available as a third party add-on, called "callgrind". In addition, a nice GUI is available to view the output of this tool, called "kcachegrind" (kcachegrind can view the output of cachegrind, but despite its confusing name it is actually primarily designed for callgrind). See here for more details on callgrind and kcachegrind.
>
>
A tool called "callgrind" exists which does some detailed code profiling. For valgrind versions 3.1.1 and earlier callgrind is not included in the main valgrind package, it must be installed seperately. As of valgrind 3.2.0, callgrind was integrated into the mainstream valgrind package.

In addition, a nice GUI is available to view the output of this tool, called "kcachegrind" (kcachegrind can view the output of cachegrind, but despite its confusing name it is actually primarily designed for callgrind).

See here for more details on callgrind and kcachegrind.

 

Building valgrind and callgrind

Line: 39 to 43
  > source /afs/cern.ch/lhcb/group/rich/vol4/jonrob/scripts/new-valgrind.csh
Changed:
<
<
which currently provides the versions valgrind (3.1.1) and callgrind (0.10.1), with a private patch applied to increase VG_N_SEGMENTS to 10000, which seems enough for the LHCb applications I have tried (If you find otherwise, please let me know).
>
>
which currently provides the latest versions of valgrind and callgrind, with a private patch applied to increase VG_N_SEGMENTS to 10000, which seems enough for the LHCb applications I have tried (If you find otherwise, please let me know).
  In the simple case, usage is just
Line: 47 to 51
  > valgrind --tool=callgrind application
Deleted:
<
<
or via the equivalent shortcut

 > callgrind application
 and where application is for example "Boole.exe options-file". This will produce an output file of the form callgrind.xxxxx, which can be read in by kcachegrind. More information on available command line options are given on the web page, or by running "callgrind --help or valgrind --help".

Personally I find the options "--dump-instr=yes --trace-jump=yes" useful as they provide more information.

Line: 60 to 58
 One very useful option is "--dump-before" which can be used for the creation of an output file before calling particular methods. Using this with for instance BooleInit::execute allows the creation of one dump per event, which can then be read in individually. I.e. a full command line could be
Changed:
<
<
> callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="*BooleInit*execute*" Boole.exe opts.opt
>
>
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="*BooleInit*execute*" Boole.exe opts.opt
 

( Note the *'s are needed with the --dump-before method, I think to cope with name mangling under -O2 )

Revision 132006-06-01 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Profiling and Analysis

Line: 33 to 33
 valgrind: is there a hard virtual memory limit set?
Changed:
<
<
These bugs have been fixed in the more recent versions than that installed. Unfortunately such a version is not available by default, so I have installed privately. To access this version run :-
>
>
These bugs have been fixed in the more recent versions than that installed. Unfortunately such a version is not available by default, so I had to install it privately. To access this version run :-
 
 > source /afs/cern.ch/lhcb/group/rich/vol4/jonrob/scripts/new-valgrind.csh
Line: 67 to 67
  Also note, valgrind generally requires a large amount of memory and CPU, and thus you may run into problems with the CPU time and virtual memory size limits in place on the general purpose lxplus nodes. If this happens, I suggest using a private afs box (e.g. pclhcbXX.cern.ch) if such a machine is available to you. If not, you could try running on lxbuild020 (although we should avoid all running at the same time...).
Changed:
<
<
kcachegrind is now available on lxplus - see here for details. Note, sourcing the above setup file adds the appropriate path for kcachegrind to your path.
>
>
kcachegrind is now available on lxplus - see here for details. Note, sourcing the above setup file (/afs/cern.ch/lhcb/group/rich/vol4/jonrob/scripts/new-valgrind.csh) adds the appropriate path for kcachegrind to your path.
  Using kcachegrind takes some getting used to. One of the first things I recommend you do is add your code location (e.g. ~/cmtuser) to the list of known locations (settings -> configure kcachegrind). The kcachegrind web page contains more hints on getting started.

Revision 122006-05-18 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Profiling and Analysis

Line: 41 to 41
  which currently provides the versions valgrind (3.1.1) and callgrind (0.10.1), with a private patch applied to increase VG_N_SEGMENTS to 10000, which seems enough for the LHCb applications I have tried (If you find otherwise, please let me know).
Changed:
<
<
In the simple case, usage is simply
>
>
In the simple case, usage is just
 
 > valgrind --tool=callgrind application

Revision 112006-03-24 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Profiling and Analysis

Line: 74 to 74
 

Debug and Optimised Libraries

callgrind can work on both optimised (CMTCONFIG) and un-optimised (CMTDEB) builds, although when running on optimised builds bear in mind you will not get all possible information. For instance inline functions, by their nature, are optimised away (those which are actually inlined by the compiler) and thus are not seen by callgrind and will not appear in the profiling. Similarly you will not get annotated source code listings.

Changed:
<
<
This does not happen in un-optimised builds, where all information is available, but of course you must then bear in mind the timing information you get will be different. The best approach I have found is to try both and compare the two sets of results.
>
>
This does not happen in un-optimised builds, where all information is available, but of course you must then bear in mind the timing information you get will be different. The best approach I have found is to first try running the normal optimised build. If you find you want finer grained information that this provides, then try the debug build.
  Also, much more information is available if the "-g" option is used. Our un-optimised debug (CMTDEB) builds have this but not in our optimised (CMTCONFIG) builds. If you want to use this with your optimsed builds, this can be done by checking out a private version of GaudiPolicy, update the optimisation flags in the requirements file to include "-g" as well as "-O2" and rebuild the libraries you wish to profile. Note this will increase the size of the binaries so if you have many libraries and a small AFS quota you might need to increase it (email Joel Closier).

Revision 102006-03-23 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Profiling and Analysis

Line: 25 to 25
 

Usage at CERN

Changed:
<
<
The version of valgrind installed by default on lxplus has problems running with applications that require reasonably large amounts of memory due to bugs in its internal memory model. These bugs have been fixed in the most recent version. Run :-
>
>
The version of valgrind installed by default on lxplus has problems running with applications that require reasonably large amounts of memory due to bugs in its internal memory model. It also cannot run on the normal lxplus nodes due to the virtual memory limit in place - e.g. you will see

[lxplus066] ~ > valgrind --version
valgrind: mmap(0x8bf5000, -1488932864) failed during startup.
valgrind: is there a hard virtual memory limit set?

These bugs have been fixed in the more recent versions than that installed. Unfortunately such a version is not available by default, so I have installed privately. To access this version run :-

 
 > source /afs/cern.ch/lhcb/group/rich/vol4/jonrob/scripts/new-valgrind.csh
Changed:
<
<
which currently provides the versions valgrind (3.1.1) and callgrind (0.10.1).
>
>
which currently provides the versions valgrind (3.1.1) and callgrind (0.10.1), with a private patch applied to increase VG_N_SEGMENTS to 10000, which seems enough for the LHCb applications I have tried (If you find otherwise, please let me know).
  In the simple case, usage is simply
Line: 49 to 57
  Personally I find the options "--dump-instr=yes --trace-jump=yes" useful as they provide more information.
Changed:
<
<
One very useful option is "--dump-before" which can be used for the creation of an output file before calling an particular methods. Using this with for instance BooleInit::execute allows the creation of one dump per event, which can then be read in individually. I.e. a full command line could be
>
>
One very useful option is "--dump-before" which can be used for the creation of an output file before calling particular methods. Using this with for instance BooleInit::execute allows the creation of one dump per event, which can then be read in individually. I.e. a full command line could be
 
 > callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="*BooleInit*execute*" Boole.exe opts.opt

Revision 92006-03-22 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Profiling and Analysis

Line: 12 to 12
  The core valgrind package contains a tool called "cachegrind" which does some code profiling. However, a much more complete tool is available as a third party add-on, called "callgrind". In addition, a nice GUI is available to view the output of this tool, called "kcachegrind" (kcachegrind can view the output of cachegrind, but despite its confusing name it is actually primarily designed for callgrind). See here for more details on callgrind and kcachegrind.
Added:
>
>

Building valgrind and callgrind

valgrind and callgrind can be built easily using the tar files available from the respective web pages (see above). However, it appears that the size of our LHCb applications is larger than can be handled by the default values coded into valgrind. This is seen by the error

--15591:0:aspacem  Valgrind: FATAL: VG_N_SEGMENTS is too low.
--15591:0:aspacem    Increase it and rebuild.  Exiting now.

It can be easily fixed, by doing as it says ! The file is coregrind/m_aspacemgr/aspacemgr.c in the valgrind build directory and I found increasing the value from 2000 to 10000 seems to do the trick.

 

Usage at CERN

The version of valgrind installed by default on lxplus has problems running with applications that require reasonably large amounts of memory due to bugs in its internal memory model. These bugs have been fixed in the most recent version. Run :-

Line: 20 to 31
  > source /afs/cern.ch/lhcb/group/rich/vol4/jonrob/scripts/new-valgrind.csh
Changed:
<
<
which currently provides the versions valgrind (3.1.0) and callgrind (0.10.1).
>
>
which currently provides the versions valgrind (3.1.1) and callgrind (0.10.1).
  In the simple case, usage is simply
Line: 28 to 39
  > valgrind --tool=callgrind application
Changed:
<
<
where application is for example "Boole.exe options-file". This will produce an output file of the form callgrind.xxxxx, which can be read in by kcachegrind. More information on available command line options are given on the web page, or by running "callgrind --help or valgrind --help".
>
>
or via the equivalent shortcut

 > callgrind application

and where application is for example "Boole.exe options-file". This will produce an output file of the form callgrind.xxxxx, which can be read in by kcachegrind. More information on available command line options are given on the web page, or by running "callgrind --help or valgrind --help".

  Personally I find the options "--dump-instr=yes --trace-jump=yes" useful as they provide more information.

One very useful option is "--dump-before" which can be used for the creation of an output file before calling an particular methods. Using this with for instance BooleInit::execute allows the creation of one dump per event, which can then be read in individually. I.e. a full command line could be

Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes -dump-before="*BooleInit*execute*" application
>
>
> callgrind -v --dump-instr=yes --trace-jump=yes --dump-before="*BooleInit*execute*" Boole.exe opts.opt
 

( Note the *'s are needed with the --dump-before method, I think to cope with name mangling under -O2 )

Revision 82006-03-22 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Profiling and Analysis

Line: 20 to 20
  > source /afs/cern.ch/lhcb/group/rich/vol4/jonrob/scripts/new-valgrind.csh
Changed:
<
<
which provides the latest valgrind (3.1.0) and callgrind (0.10.1).
>
>
which currently provides the versions valgrind (3.1.0) and callgrind (0.10.1).
  In the simple case, usage is simply
Line: 48 to 48
 

Debug and Optimised Libraries

Changed:
<
<
callgrind can work on both optimised (CMTCONFIG) and un-optimised (CMTDEB) builds, although when running on optimised builds bear in mind you will not get all possible information. For instance inline functions, by their nature, are optimised away and thus are not seen by callgrind and will not appear in the profiling. Similarly you will not get annotated source code listings.
>
>
callgrind can work on both optimised (CMTCONFIG) and un-optimised (CMTDEB) builds, although when running on optimised builds bear in mind you will not get all possible information. For instance inline functions, by their nature, are optimised away (those which are actually inlined by the compiler) and thus are not seen by callgrind and will not appear in the profiling. Similarly you will not get annotated source code listings.
 This does not happen in un-optimised builds, where all information is available, but of course you must then bear in mind the timing information you get will be different. The best approach I have found is to try both and compare the two sets of results.

Also, much more information is available if the "-g" option is used. Our un-optimised debug (CMTDEB) builds have this but not in our optimised (CMTCONFIG) builds. If you want to use this with your optimsed builds, this can be done by checking out a private version of GaudiPolicy, update the optimisation flags in the requirements file to include "-g" as well as "-O2" and rebuild the libraries you wish to profile. Note this will increase the size of the binaries so if you have many libraries and a small AFS quota you might need to increase it (email Joel Closier).

Revision 72006-03-01 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Profiling and Analysis

Line: 38 to 38
  > valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes -dump-before="*BooleInit*execute*" application
Changed:
<
<
( Note the *'s are needed with the --dump-before method, I think to copy with name mangling under -O2 )
>
>
( Note the *'s are needed with the --dump-before method, I think to cope with name mangling under -O2 )
  Also note, valgrind generally requires a large amount of memory and CPU, and thus you may run into problems with the CPU time and virtual memory size limits in place on the general purpose lxplus nodes. If this happens, I suggest using a private afs box (e.g. pclhcbXX.cern.ch) if such a machine is available to you. If not, you could try running on lxbuild020 (although we should avoid all running at the same time...).
Line: 48 to 48
 

Debug and Optimised Libraries

Changed:
<
<
callgrind can work on both optimised and un-optimised builds, although when running on optimised builds bear in mind some things, such as inline functions, can confusion the profiler a little. This does not happen in un-optimised builds, but of course you must then bear in mind the timing information you get will be slightly different. ( That said, if you find something dominates the CPU time in one build, it probably will as well in the other. )
>
>
callgrind can work on both optimised (CMTCONFIG) and un-optimised (CMTDEB) builds, although when running on optimised builds bear in mind you will not get all possible information. For instance inline functions, by their nature, are optimised away and thus are not seen by callgrind and will not appear in the profiling. Similarly you will not get annotated source code listings. This does not happen in un-optimised builds, where all information is available, but of course you must then bear in mind the timing information you get will be different. The best approach I have found is to try both and compare the two sets of results.
  Also, much more information is available if the "-g" option is used. Our un-optimised debug (CMTDEB) builds have this but not in our optimised (CMTCONFIG) builds. If you want to use this with your optimsed builds, this can be done by checking out a private version of GaudiPolicy, update the optimisation flags in the requirements file to include "-g" as well as "-O2" and rebuild the libraries you wish to profile. Note this will increase the size of the binaries so if you have many libraries and a small AFS quota you might need to increase it (email Joel Closier).

Revision 62006-02-28 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Profiling and Analysis

Line: 22 to 22
  which provides the latest valgrind (3.1.0) and callgrind (0.10.1).
Changed:
<
<
Usage is simple,
>
>
In the simple case, usage is simply
 
Changed:
<
<
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes application
>
>
> valgrind --tool=callgrind application
 
Changed:
<
<
where application is for example "Boole.exe options-file". This will produce an output file of the form callgrind.xxxxx, which can be read in by kcachegrind. More information on available command line options are given on the web page, or by running "callgrind --help or valgrind --help". I have added the options "--dump-instr=yes --trace-jump=yes" as I find them useful.
>
>
where application is for example "Boole.exe options-file". This will produce an output file of the form callgrind.xxxxx, which can be read in by kcachegrind. More information on available command line options are given on the web page, or by running "callgrind --help or valgrind --help".

Personally I find the options "--dump-instr=yes --trace-jump=yes" useful as they provide more information.

One very useful option is "--dump-before" which can be used for the creation of an output file before calling an particular methods. Using this with for instance BooleInit::execute allows the creation of one dump per event, which can then be read in individually. I.e. a full command line could be

 > valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes -dump-before="*BooleInit*execute*" application

( Note the *'s are needed with the --dump-before method, I think to copy with name mangling under -O2 )

  Also note, valgrind generally requires a large amount of memory and CPU, and thus you may run into problems with the CPU time and virtual memory size limits in place on the general purpose lxplus nodes. If this happens, I suggest using a private afs box (e.g. pclhcbXX.cern.ch) if such a machine is available to you. If not, you could try running on lxbuild020 (although we should avoid all running at the same time...).
Line: 37 to 48
 

Debug and Optimised Libraries

Changed:
<
<
callgrind can work on both optimised and un-optimed builds, although when running on optimised builds bear in mind some things, such as inline functions, can confusion the profiler a little. This does not happen in un-optimised builds, but of course you must then bear in mind the timing information you get will be slightly different. ( That said, if you find something dominates the CPU time in one build, it probably will aswell in the other. )
>
>
callgrind can work on both optimised and un-optimised builds, although when running on optimised builds bear in mind some things, such as inline functions, can confusion the profiler a little. This does not happen in un-optimised builds, but of course you must then bear in mind the timing information you get will be slightly different. ( That said, if you find something dominates the CPU time in one build, it probably will as well in the other. )
  Also, much more information is available if the "-g" option is used. Our un-optimised debug (CMTDEB) builds have this but not in our optimised (CMTCONFIG) builds. If you want to use this with your optimsed builds, this can be done by checking out a private version of GaudiPolicy, update the optimisation flags in the requirements file to include "-g" as well as "-O2" and rebuild the libraries you wish to profile. Note this will increase the size of the binaries so if you have many libraries and a small AFS quota you might need to increase it (email Joel Closier).
Deleted:
<
<

Known Problems

In older releases of callgrind, there was an option -dump-before=XXX::YYY with which you could request that callgrind dumps an output file whenever a given method XXX::YYY is called. Using this with for example BooleInit::execute allowed for a single dump per event, which was very useful. Unfortunately this option no longer seems to work in the latest release, although it is still listed as avaliable. Maybe someone else can get this working ?

 
ChrisRJones - 27 Feb 2006

Revision 52006-02-28 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Profiling and Analysis

Line: 31 to 31
  Also note, valgrind generally requires a large amount of memory and CPU, and thus you may run into problems with the CPU time and virtual memory size limits in place on the general purpose lxplus nodes. If this happens, I suggest using a private afs box (e.g. pclhcbXX.cern.ch) if such a machine is available to you. If not, you could try running on lxbuild020 (although we should avoid all running at the same time...).
Changed:
<
<
kcachegrind is now available on lxplus - see here for details.
>
>
kcachegrind is now available on lxplus - see here for details. Note, sourcing the above setup file adds the appropriate path for kcachegrind to your path.
  Using kcachegrind takes some getting used to. One of the first things I recommend you do is add your code location (e.g. ~/cmtuser) to the list of known locations (settings -> configure kcachegrind). The kcachegrind web page contains more hints on getting started.

Revision 42006-02-28 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Profiling and Analysis

Line: 6 to 6
 
Changed:
<
<

valgrind, callgrind and kcachegrind

>
>

The "valgrind, callgrind and kcachegrind" Utilities

  Valgrind is a general purpose utility for analyzing software. It contains various "tools" that perform tasks such as memory allocation checking, heap analysis and code profiling.
Line: 24 to 24
  Usage is simple,
Changed:
<
<
> valgrind --tool=callgrind -v application
>
>
> valgrind --tool=callgrind -v --dump-instr=yes --trace-jump=yes application
 
Changed:
<
<
where application is for example "Boole.exe options-file". This will produce an output file of the form callgrind.xxxxx, which can be read in by kcachegrind. More information on available command line options are given on the web page, or by running "callgrind --help or valgrind --help".
>
>
where application is for example "Boole.exe options-file". This will produce an output file of the form callgrind.xxxxx, which can be read in by kcachegrind. More information on available command line options are given on the web page, or by running "callgrind --help or valgrind --help". I have added the options "--dump-instr=yes --trace-jump=yes" as I find them useful.
  Also note, valgrind generally requires a large amount of memory and CPU, and thus you may run into problems with the CPU time and virtual memory size limits in place on the general purpose lxplus nodes. If this happens, I suggest using a private afs box (e.g. pclhcbXX.cern.ch) if such a machine is available to you. If not, you could try running on lxbuild020 (although we should avoid all running at the same time...).
Line: 39 to 39
  callgrind can work on both optimised and un-optimed builds, although when running on optimised builds bear in mind some things, such as inline functions, can confusion the profiler a little. This does not happen in un-optimised builds, but of course you must then bear in mind the timing information you get will be slightly different. ( That said, if you find something dominates the CPU time in one build, it probably will aswell in the other. )
Changed:
<
<
Also, much more information is available if the "-g" option is used. Our CMTDEB builds have this but not in our optimised (CMTCONFIG) builds. If you want to use this with your optimsed builds, this can be done by checking out a private version of GaudiPolicy, update the optimisation flags in the requirements file to include "-g" as well as "-O2" and rebuild the libraries you wish to profile. Note this will increase the size of the binaries so if you have many libraries and a small AFS quota you might need to increase it (email Joel Closier).
>
>
Also, much more information is available if the "-g" option is used. Our un-optimised debug (CMTDEB) builds have this but not in our optimised (CMTCONFIG) builds. If you want to use this with your optimsed builds, this can be done by checking out a private version of GaudiPolicy, update the optimisation flags in the requirements file to include "-g" as well as "-O2" and rebuild the libraries you wish to profile. Note this will increase the size of the binaries so if you have many libraries and a small AFS quota you might need to increase it (email Joel Closier).
 

Known Problems

Changed:
<
<
In older releases of callgrind, there was an option -dump-before=XXX::YYY with which you could request that callgrind dumps an output file whenever a given method XXX::YYY is called. Using this with for example BooleInit::execute allowed for a single dump per event, which was very useful. Unfortunately this option no longer seems to work in the latest release, although it is still listed as avaliable. Maybe someone else can get this working ?
>
>
In older releases of callgrind, there was an option -dump-before=XXX::YYY with which you could request that callgrind dumps an output file whenever a given method XXX::YYY is called. Using this with for example BooleInit::execute allowed for a single dump per event, which was very useful. Unfortunately this option no longer seems to work in the latest release, although it is still listed as avaliable. Maybe someone else can get this working ?
 
ChrisRJones - 27 Feb 2006

Revision 32006-02-28 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Profiling and Analysis

Changed:
<
<
This page contain a guide to using various code profiling and analysis tools on the LHCb software.
>
>
This page contain a guide to using various code profiling and analysis tools on the LHCb software. Please feel free to any information to this page.
 
Line: 10 to 10
  Valgrind is a general purpose utility for analyzing software. It contains various "tools" that perform tasks such as memory allocation checking, heap analysis and code profiling.
Changed:
<
<
The core valgrind package contains a tool called "cachegrind" which does some code profiling. However, a much more complete tool is available as a third party add-on, called "callgrind". In addition, a nice GUI is available to view the output of this tool, called "kcachegrind". See here for more details.
>
>
The core valgrind package contains a tool called "cachegrind" which does some code profiling. However, a much more complete tool is available as a third party add-on, called "callgrind". In addition, a nice GUI is available to view the output of this tool, called "kcachegrind" (kcachegrind can view the output of cachegrind, but despite its confusing name it is actually primarily designed for callgrind). See here for more details on callgrind and kcachegrind.
 

Usage at CERN

Line: 33 to 33
  kcachegrind is now available on lxplus - see here for details.
Changed:
<
<
Using kcachegrind takes some getting used to. One of the first things I recommend you do is add your code location (e.g. ~/cmtuser) to the list of known locations (settings -> configure kcachegrind). The kcachegrind web page contains more hints on getting started.
>
>
Using kcachegrind takes some getting used to. One of the first things I recommend you do is add your code location (e.g. ~/cmtuser) to the list of known locations (settings -> configure kcachegrind). The kcachegrind web page contains more hints on getting started.
 

Debug and Optimised Libraries

Revision 22006-02-27 - ChristopherRJones

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Code Profiling and Analysis

Line: 37 to 37
 

Debug and Optimised Libraries

Changed:
<
<
callgrind can work on both optimised and un-optimed builds, although when running on optimised builds bear in mind some things, such as inline functions, can confusion the profiler a little. This does not happen in un-optimised builds, but of course you must then bear in mind the timing information you get will be slightly different.
>
>
callgrind can work on both optimised and un-optimed builds, although when running on optimised builds bear in mind some things, such as inline functions, can confusion the profiler a little. This does not happen in un-optimised builds, but of course you must then bear in mind the timing information you get will be slightly different. ( That said, if you find something dominates the CPU time in one build, it probably will aswell in the other. )
 
Changed:
<
<
Also, much more information is available if the "-g" option is used. Our CMTDEB builds have this but not in our optimised (CMTCONFIG) builds. If you want to use this with your optimsed builds, this can be done by checking out a private version of GaudiPolicy and updating the optimisation flags to include "-g" as well as "-O2".
>
>
Also, much more information is available if the "-g" option is used. Our CMTDEB builds have this but not in our optimised (CMTCONFIG) builds. If you want to use this with your optimsed builds, this can be done by checking out a private version of GaudiPolicy, update the optimisation flags in the requirements file to include "-g" as well as "-O2" and rebuild the libraries you wish to profile. Note this will increase the size of the binaries so if you have many libraries and a small AFS quota you might need to increase it (email Joel Closier).
 

Known Problems

Changed:
<
<
In older releases of callgrind, there was an option -dump-before=XXX::YYY with which you could request that callgrind dumps an output file whenever a given method XXX::YYY is called. Using this with for example BooleInit::execute allowed for a single dump per event, which was very useful. Unfortunately this option no longer seems to work in the latest release.
>
>
In older releases of callgrind, there was an option -dump-before=XXX::YYY with which you could request that callgrind dumps an output file whenever a given method XXX::YYY is called. Using this with for example BooleInit::execute allowed for a single dump per event, which was very useful. Unfortunately this option no longer seems to work in the latest release, although it is still listed as avaliable. Maybe someone else can get this working ?
 
ChrisRJones - 27 Feb 2006

Revision 12006-02-27 - ChristopherRJones

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="LHCbComputing"

Code Profiling and Analysis

This page contain a guide to using various code profiling and analysis tools on the LHCb software.

valgrind, callgrind and kcachegrind

Valgrind is a general purpose utility for analyzing software. It contains various "tools" that perform tasks such as memory allocation checking, heap analysis and code profiling.

The core valgrind package contains a tool called "cachegrind" which does some code profiling. However, a much more complete tool is available as a third party add-on, called "callgrind". In addition, a nice GUI is available to view the output of this tool, called "kcachegrind". See here for more details.

Usage at CERN

The version of valgrind installed by default on lxplus has problems running with applications that require reasonably large amounts of memory due to bugs in its internal memory model. These bugs have been fixed in the most recent version. Run :-

 > source /afs/cern.ch/lhcb/group/rich/vol4/jonrob/scripts/new-valgrind.csh

which provides the latest valgrind (3.1.0) and callgrind (0.10.1).

Usage is simple,

 > valgrind --tool=callgrind -v application

where application is for example "Boole.exe options-file". This will produce an output file of the form callgrind.xxxxx, which can be read in by kcachegrind. More information on available command line options are given on the web page, or by running "callgrind --help or valgrind --help".

Also note, valgrind generally requires a large amount of memory and CPU, and thus you may run into problems with the CPU time and virtual memory size limits in place on the general purpose lxplus nodes. If this happens, I suggest using a private afs box (e.g. pclhcbXX.cern.ch) if such a machine is available to you. If not, you could try running on lxbuild020 (although we should avoid all running at the same time...).

kcachegrind is now available on lxplus - see here for details.

Using kcachegrind takes some getting used to. One of the first things I recommend you do is add your code location (e.g. ~/cmtuser) to the list of known locations (settings -> configure kcachegrind). The kcachegrind web page contains more hints on getting started.

Debug and Optimised Libraries

callgrind can work on both optimised and un-optimed builds, although when running on optimised builds bear in mind some things, such as inline functions, can confusion the profiler a little. This does not happen in un-optimised builds, but of course you must then bear in mind the timing information you get will be slightly different.

Also, much more information is available if the "-g" option is used. Our CMTDEB builds have this but not in our optimised (CMTCONFIG) builds. If you want to use this with your optimsed builds, this can be done by checking out a private version of GaudiPolicy and updating the optimisation flags to include "-g" as well as "-O2".

Known Problems

In older releases of callgrind, there was an option -dump-before=XXX::YYY with which you could request that callgrind dumps an output file whenever a given method XXX::YYY is called. Using this with for example BooleInit::execute allowed for a single dump per event, which was very useful. Unfortunately this option no longer seems to work in the latest release.


ChrisRJones - 27 Feb 2006
 