10.3 Optimizing your Code
Optimising your code is simple. The process is as follows:
- Determine a good moment for effective optimisation.
- Establish a baseline reference you know to be correct. You cannot conclude that your changes were an improvement until you can verify the results remain correct. This obviously implies you need a realistic test suite.
- Establish the goals you are trying to achieve in the light of most significant performance factors, so you know when you are done. Assign priority and amount of effort that can be afforded to reach the goals.
- Measure performance of representative tasks. Make sure the tests cover a sufficient part of the phase space of real life usage. The measurements need to be repeatable and unambiguous.
- Analyse the benchmark results for performance bottlenecks and memory-related problems.
- Review the data structures and algorithms used. If deep code changes are required, make these corrections first and repeat until the order-of-magnitude issues have been addressed.
- Address remaining hot spots. Measure again until satisfied.
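The baseline-and-measurement steps above can be sketched in a few lines of C++. This is only an illustrative harness (the `sumOfSquares_*` names are hypothetical stand-ins for a baseline and a candidate implementation of the same task): verify correctness against the baseline first, and only then compare timings.

```cpp
#include <chrono>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical baseline and candidate implementations of the same task.
std::uint64_t sumOfSquares_v1(const std::vector<int>& v) {
    std::uint64_t sum = 0;
    for (int x : v) sum += static_cast<std::uint64_t>(x) * x;
    return sum;
}

std::uint64_t sumOfSquares_v2(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), std::uint64_t{0},
        [](std::uint64_t s, int x) { return s + static_cast<std::uint64_t>(x) * x; });
}

// Time one call of f on the given input, in microseconds.
template <typename F>
long long timeOnce(F f, const std::vector<int>& input) {
    auto t0 = std::chrono::steady_clock::now();
    volatile std::uint64_t sink = f(input);  // volatile: keep the call from being optimised away
    (void)sink;
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}
```

In practice one would first assert that both versions agree on representative inputs (the baseline), then time each version repeatedly on realistic data and compare the distributions, not single runs.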
It is very common that only one issue can be addressed at a time. One frequently runs into a single "dragon head" that needs to be chopped off before anything else can even be seen; once it is gone, other heads pop up and can be addressed in turn. Many such issues are "low-hanging fruit," requiring very modest effort to identify.
We have collected common performance issues you might wish to review.
When to begin optimizing your code
Performance measurement and optimisation can begin when the code is correct and essentially complete. Until then, effort is better spent on reaching these two goals. It is an excellent idea to work iteratively: first producing a complete but limited system, testing, validating and benchmarking it, and using the experience to build a better system. This allows performance to be addressed early in a healthy manner. Premature and uninformed optimisation rarely yields the desired gains and frequently reduces performance, at the high price of distorting the programme design unnecessarily and other losses in maintainability.
For the optimisation effort to yield measurable improvements on representative tasks, one must know whether and where effort is needed. Well-informed optimisation depends on sufficiently realistic, detailed and accurate measurements; developers are notoriously bad at guessing where optimisation is necessary. The optimisation effort should be proportional to its value. In particular, substantial further development is likely to change the performance characteristics, so early efforts should be duly limited.
Factors contributing to performance
A number of design and programming factors affect performance,
roughly in the following order of significance:
- It seems obvious perhaps, but one needs to understand the problem the programme is solving. Usually there is vast latitude in how any one problem can be solved, and it is important to know how much of that latitude can be used and has already been tried.
- Algorithmic and data structure design changes have orders of magnitude more performance impact than tuning a specific algorithm.
- Mapping concepts well to an implementation is very important. Specifically watch for over-reliance on strings, recomputing values or lookups unnecessarily, excessive sorting and copying large objects.
- Every class should have a single clear mission, and there needs to be a clear object lifetime and ownership policy. Apart from the design aspects that otherwise require these, it is next to impossible to optimise anything that serves numerous purposes or has ambiguous ownership.
- Simple, easily understood linear code is preferable. Hiding simple linear logic in hundreds of seemingly unrelated small code fragments, or in an impenetrable thicket of indirection layers, is just as bad as the more traditional monstrous algorithm of several thousand lines of complex branching and looping patterns.
- Memory usage patterns and performance correlate well. High-performance code should not allocate memory only to release it soon afterwards. Use the least amount of memory required by the algorithm and aim for memory locality. Modern CPUs take a big hit from poor memory access patterns, and the problems take an expert to analyse. Exceeding CPU cache limits, frequent pointer dereferences and pointer walks around memory compound into a high price.
- Most "clever tricks" make things worse, not better. Only use tricks where they have clearly measurable significant impact.
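To illustrate the point that algorithmic changes dwarf local tuning, here is a minimal, self-contained sketch (the function names are hypothetical, not from any CMSSW code): counting duplicate pairs with an all-pairs comparison is O(n^2), while sorting a copy first and counting equal runs is O(n log n). For large inputs the second version wins by orders of magnitude, no matter how well the first one is micro-tuned.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// O(n^2): compares every pair; dominates runtime for large n.
long countDuplicatesQuadratic(const std::vector<int>& v) {
    long dups = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        for (std::size_t j = i + 1; j < v.size(); ++j)
            if (v[i] == v[j]) ++dups;
    return dups;
}

// O(n log n): sort a copy once, then count pairs inside each run of equal values.
long countDuplicatesSorted(std::vector<int> v) {  // deliberate copy; caller's data untouched
    std::sort(v.begin(), v.end());
    long dups = 0;
    for (std::size_t i = 0; i + 1 < v.size();) {
        std::size_t j = i;
        while (j + 1 < v.size() && v[j + 1] == v[i]) ++j;
        std::size_t run = j - i + 1;                      // length of the equal run
        dups += static_cast<long>(run * (run - 1) / 2);   // number of pairs within the run
        i = j + 1;
    }
    return dups;
}
```

Both functions return the same count; only the growth rate of the runtime differs, which is precisely the kind of gain no amount of low-level tuning of the quadratic loop can recover.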
Measuring Performance
Under construction
Identifying bottlenecks
Valgrind/kcachegrind can be very helpful for identifying bottlenecks.
Here is a recipe for using them:
- add something like this to the cfg file:
service = ProfilerService {
untracked int32 firstEvent = 2
untracked int32 lastEvent = 51
untracked vstring paths = { "p1"}
}
valgrind --tool=callgrind --combine-dumps=yes --instr-atstart=no --dump-instr=yes --separate-recs=1 cmsRun ****.cfg
- use kcachegrind to interpret the valgrind output, which is in the file callgrind.out.<PID>, ending with the PID of the job. You may need to add something to your path, similar to this example:
setenv PATH /afs/cern.ch/cms/bin/i386_linux2:${PATH}
kcachegrind callgrind.out.xxxxx
If the results indicate that no part of the programme is a bottleneck, but the code is too slow, this usually indicates there is an overall structural problem.
Identifying memory performance issues
valgrind --tool=memcheck --leak-check=full cmsRun run.py > & vglog.out &
- Or, suppressing the leaks from ROOT:
valgrind --tool=memcheck --leak-check=full --suppressions=${ROOTSYS}/etc/valgrind-root.supp cmsRun run.py > & vglog.out &
Making performance improvements
Under construction
Some very short practical hints
- prefer fixed data structures, reusable for each event, to dynamic push_backs per event. But of course this works only if you know a maximum size.
- avoid (of course) unnecessary new/delete
- avoid unnecessary operator= usage
- avoid unnecessary creation of temporary objects on the stack
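The first hint above can be sketched as follows. This is an illustrative example only (`HitBuffer` and `kMaxHits` are hypothetical names, not CMSSW types): the workspace is allocated once and reused for every event, so the per-event cost is an O(1) reset rather than repeated allocation and `push_back` growth. As noted above, this works only when a safe maximum size is known.

```cpp
#include <array>
#include <cstddef>

// Hypothetical per-event workspace: allocated once, reused for every event.
constexpr std::size_t kMaxHits = 1024;  // assumed known maximum size

struct HitBuffer {
    std::array<float, kMaxHits> energy{};  // fixed storage, no per-event new/delete
    std::size_t n = 0;                     // number of valid entries this event

    void clear() { n = 0; }                // O(1) reset; the storage is retained
    bool add(float e) {
        if (n >= kMaxHits) return false;   // guard against overflowing the fixed buffer
        energy[n++] = e;
        return true;
    }
};
```

A typical event loop would call `clear()` at the start of each event and `add()` per hit; the same storage serves all events, giving the allocator no work to do in the hot path.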
Optimal programming of EventSetup issues
The following code example, called once per event, is fast, but still slower than needed:
edm::ESHandle<EcalTPGFineGrainEBGroup> theEcalTPGFineGrainEBGroup_handle;
setup.get<EcalTPGFineGrainEBGroupRcd>().get(theEcalTPGFineGrainEBGroup_handle);
const EcalTPGFineGrainEBGroup* ecaltpgFgEBGroup = theEcalTPGFineGrainEBGroup_handle.product();
You can make the retrieval conditional, depending on whether the record's cacheIdentifier() has changed from one event to the next (note that the fetched product must be kept in a member variable, not a local, so it remains available on the events where the cache is unchanged):
if (setup.get<EcalTPGFineGrainEBGroupRcd>().cacheIdentifier() != theCacheIDThatWasKept_) {
  theCacheIDThatWasKept_ = setup.get<EcalTPGFineGrainEBGroupRcd>().cacheIdentifier();
  setup.get<EcalTPGFineGrainEBGroupRcd>().get(theEcalTPGFineGrainEBGroup_handle);
  ecaltpgFgEBGroup_ = theEcalTPGFineGrainEBGroup_handle.product();  // member, reused until the cache changes
}
This is particularly worthwhile when several EventSetupRecords have to be retrieved.
Review status
Responsible:
PeterElmer
Last reviewed by:
UrsulaBerthon 26 Feb 2008