Vectorize Source-Code

Brief introduction to Vectorization


Vectorization is the parallel processing of multiple data elements. In Flynn's taxonomy it belongs to the SIMD (Single Instruction, Multiple Data) class of concurrency. The idea is to take multiple data elements of the same type and perform one operation on all of them at once. The hardware realization of SIMD has many names; for us, SSE and AVX matter. The most widespread version, available in all x86-64 architectures, is SSE4.

Using SIMD operations is not easy unless one programs in assembler or in the so-called intrinsics. Still, intrinsics give a good picture of which operations the CPU supports. SIMD operations can be used either by relying on the compiler's capability for auto-vectorization, or by explicit vectorization. The latter can be done with intrinsics, or more comfortably with Vc or OpenMP 4.0. The use of OpenMP in production code still has to be discussed, but Vc is a library shipped with the latest ROOT releases.

To get real improvements, beyond 5%, it seems that explicit vectorization has to be used. The alternative is auto-vectorization, which gives a 2-5% speedup on unoptimized Gaudi source-code. Note that, in between these two, you can also rewrite your code so that the compiler can auto-vectorize it more easily. We refer to this technique as implicit vectorization.

Use Auto-Vectorization

Real vectorization (by hand) is hard to use. Several compiler-specific instructions, compiler macros and even new languages have been invented to improve the gains from vectorization. BUT we are not supposed to use them, since source-code for GAUDI must rely on plain C++ and should in general not be compiler specific. Instead, GCC, Clang and ICC all support auto-vectorization. Most recent approaches rely on the ability of compilers to vectorize your code. STILL, always assume your compiler to be as stupid as you can barely imagine.

Available vectorized and vectorization libraries

Math operations are often already available in specific vectorized math libraries, such as:

* VDT - Supports general math operation.

* Eigen Library - Supports linear algebra operations.

Vectorized libraries are interesting not just for using already vectorized code, but also for getting ideas about what can be vectorized and how. If you want additional support to vectorize C++ source-code more easily, use:

* Vc: portable, zero-overhead SIMD library for C++. Description: "[...] Vc is a free software library to ease explicit vectorization of C++ code. [...]"

* OpenMP: portable, directive-based parallelization standard. Since version 4.0 it includes SIMD constructs (e.g. #pragma omp simd) for explicit vectorization of loops.

Profile vectorized source-code

... to do ...

How to start

Not every vectorization you introduce will make your source-code faster. Hence, before you start, please do the following:

  1. Pick a use-case (a function/algorithm/project/application) that all developers in your group agree on as representative for your vectorization project.
  2. Measure the ordinary (unvectorized) performance of your code BEFORE you make serious changes due to vectorization.
  3. Measure the performance using only explicit auto-vectorization flags: not -O3 (which enables further optimizations beyond vectorization), but -O2 plus the vectorization flags described below.
In most cases you should gain between 2-5% performance. A 2% difference is almost unobservable due to runtime indeterminism, so you have to make your results statistically reliable. TIP: Do measurements on the same host if you want to compare them. If the first run lasts longer than the others, it was probably loading shared libraries into memory, which does not happen again after the first run; don't take it into account. Next:

  1. First try to use vectorized math libraries. We still need to evaluate this topic, so every experience and contribution is important.
  2. Produce a vectorization report (using -ftree-vectorizer-verbose=7 or 9) and try to understand the issues for the loops with the most iterations.
  3. Try to start with simple changes, if possible, before you rewrite the entire framework, and RETHINK your data structures. TIP: Structs of Arrays instead of Arrays of Structs (CHEP '13, slide 10 with pros and cons).



Auto-vectorization in a usable state became available with GCC 4.6 and has improved with each new compiler version; try to use the most recent one that works for your project. With the -O3 optimization flag, auto-vectorization is enabled in GCC 4.6, 4.7, 4.8 and 4.9. The current standard optimization flag is -O2; to make vectorization work at all, you need at least -O1.

Typical example to switch on vectorization: g++ -O2 -ftree-vectorize -ftree-vectorizer-verbose=5 -ffast-math <cpp-file>

Compiler Flags

For details on the optimization options, see the GCC documentation for GCC 4.6, GCC 4.7, GCC 4.8 and GCC 4.9.

Flag Description
-ftree-vectorize To switch on auto-vectorization, just add -ftree-vectorize as a compile flag.
-ftree-vectorizer-verbose=n To obtain information about the success of vectorization, add -ftree-vectorizer-verbose=n, with n from 1 to 9. The higher the level, the more details you get if vectorization failed. (DEPRECATED with GCC 4.9)
-fopt-info-vec Shows a summary of vectorization successes. (With GCC 4.9, replaces the vectorization report.)
-fopt-info-vec-missed Shows the reasons for unsuccessful vectorization. (With GCC 4.9, replaces the vectorization report.)
-ffast-math Allows the vectorizer to change the order of computations, which leads to more efficient code. Warning: "[...] it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications for math functions [...]"
-fassociative-math Does what we expect from -ffast-math, reordering of operations, but is less dangerous than the full -ffast-math. It must be used together with -fno-trapping-math and -fno-signed-zeros.
-funsafe-loop-optimizations If unsigned int loop counters are used, the compiler assumes that they do not overflow.
-ftree-loop-if-convert-stores More aggressive if-conversion: tries to if-convert conditional jumps containing memory writes. This transformation can be unsafe for multi-threaded programs.

Add Flags for Gaudi Project

Each project has a requirements file in its cmt folder, where you can add compiler flags to enable vectorization only for the corresponding libraries.

Add to file: cmt/requirements

macro_append cppflags " -ftree-vectorize -march=native -ftree-vectorizer-verbose=7 -ffast-math "
macro_append fflags " -ftree-vectorize -march=native -ftree-vectorizer-verbose=7 -ffast-math "
macro_append cflags " -ftree-vectorize -march=native -ftree-vectorizer-verbose=7 -ffast-math "

To reduce side effects, decompose "-ffast-math" and use only the necessary flags, e.g.

macro_append cflags " -ftree-vectorize -march=native -ftree-vectorizer-verbose=7 -fassociative-math -fno-trapping-math -fno-signed-zeros "

Using Fast-Math (-ffast-math)

Fast-Math is enabled by setting -ffast-math in GCC. It speeds up math operations, or at least allows the compiler additional optimizations, but it violates IEEE 754 floating-point compliance. In practice it delivers correct results, but results can differ slightly at the bit level. With auto-vectorization it enables certain SIMD optimizations and leads to further vectorized loops, as observed in practice. Whether to switch it on or not still has to be evaluated; it cannot be assumed to be a harmless optimization flag.

Fast-Math is a collection of several optimization flags, for GCC it would be: -fno-math-errno, -funsafe-math-optimizations, -ffinite-math-only, -fno-rounding-math, -fno-signaling-nans and -fcx-limited-range.

Please investigate the meaning of all these optimization flags with respect to the correctness of your source-code, or use only the most important ones whose impact you are able to take into account. The necessary minimum would be: -fassociative-math -fno-trapping-math -fno-signed-zeros.

If you do not want to investigate the impact of -ffast-math on your optimization, please switch it off, also during prototyping, development and research.


For results from following these best practices, and for persistent issues, covering both artificial benchmarks and real use cases, see Results of Source-Code Vectorization.



Further Readings (now it becomes interesting)
  • Structs of Array instead of Arrays of Structs (CHEP '13)
-- StefanLohn - 10 Dec 2013
Topic revision: r8 - 2014-05-28 - StefanLohn