Auto-vectorize trigonometric and transcendental functions

It is very difficult, essentially impossible, to find high-quality open-source implementations of trigonometric and transcendental functions that are vectorized or easy to vectorize. My goal was to provide an implementation of the functions most commonly used in HEP that GCC can auto-vectorize. I was inspired by the work described in http://gruntthepeon.free.fr/ssemath/, which successfully vectorizes some of the well-known Cephes float functions by implementing them with SSE intrinsics. Following a similar technique, I started from the original Cephes source code and modified all conditional code to make it digestible to the GCC vectorization engine. The result is quite satisfactory in terms of both speed and accuracy. Usability is maximal, as the functions are drop-in replacements for either the original Cephes or the libc functions. They do not depend on any underlying vector engine and rely fully on GCC to generate optimized vector code. The benchmarks below were produced with a modified version of the sse_mathfun_test program provided with the SSE code; the cephes:: functions are the new C++ functions auto-vectorized by GCC.
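To illustrate what "modifying conditional code" can look like, here is a minimal sketch. It is not taken from cephes.h: the function names and the polynomial coefficients are hypothetical and chosen only for illustration. The idea is to replace early returns and if/else blocks by code that evaluates both alternatives and selects one with a conditional expression, a form that a vectorizer can usually map to a vector compare plus blend. Whether a given compiler version actually vectorizes either form depends on the version and the flags used.

```cpp
// Hypothetical piecewise function written with explicit control flow.
// Data-dependent branches like this can prevent the loop that calls
// the function from being vectorized.
inline float piecewise_branchy(float x) {
    if (x > 0.5f) {
        return 1.0f + x * (2.0f + x * 3.0f);
    }
    return x * (1.0f - 0.5f * x);
}

// Same function with the branch removed: both polynomials are always
// evaluated and the result is picked with a conditional expression.
inline float piecewise_select(float x) {
    const float a = 1.0f + x * (2.0f + x * 3.0f);
    const float b = x * (1.0f - 0.5f * x);
    return (x > 0.5f) ? a : b;
}

// A plain loop over arrays: this is the kind of code the
// auto-vectorizer is expected to turn into vector instructions.
void piecewise_array(const float* in, float* out, int n) {
    for (int i = 0; i < n; ++i) {
        out[i] = piecewise_select(in[i]);
    }
}
```

The price of the branch-free form is that both polynomials are computed for every element, which pays off only when the loop is in fact vectorized.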

On my Mac, Intel Core 2 Duo T9900, 3.05 GHz

benching                 sinf .. ->   19.6 millions of vector evaluations/second ->  39 cycles/value on a 3050MHz computer
benching                 cosf .. ->   19.0 millions of vector evaluations/second ->  40 cycles/value on a 3050MHz computer
benching         sincos (x87) .. ->    5.7 millions of vector evaluations/second -> 133 cycles/value on a 3050MHz computer
benching                 expf .. ->   17.8 millions of vector evaluations/second ->  43 cycles/value on a 3050MHz computer
benching                 logf .. ->   18.9 millions of vector evaluations/second ->  40 cycles/value on a 3050MHz computer
benching                 ln16 .. ->   48.7 millions of vector evaluations/second ->  16 cycles/value on a 3050MHz computer
benching               atan2f .. ->   19.2 millions of vector evaluations/second ->  40 cycles/value on a 3050MHz computer
benching                atan2 .. ->    4.1 millions of vector evaluations/second -> 184 cycles/value on a 3050MHz computer
benching      cephes::sincosf .. ->   25.6 millions of vector evaluations/second ->  30 cycles/value on a 3050MHz computer
benching       cephes::atan2f .. ->   28.2 millions of vector evaluations/second ->  27 cycles/value on a 3050MHz computer
benching         cephes::expf .. ->   37.1 millions of vector evaluations/second ->  21 cycles/value on a 3050MHz computer
benching         cephes::logf .. ->   36.5 millions of vector evaluations/second ->  21 cycles/value on a 3050MHz computer
benching                 sinl .. ->    6.0 millions of vector evaluations/second -> 127 cycles/value on a 3050MHz computer
benching                 cosl .. ->    6.0 millions of vector evaluations/second -> 125 cycles/value on a 3050MHz computer
benching                 expl .. ->    4.3 millions of vector evaluations/second -> 174 cycles/value on a 3050MHz computer
benching                 logl .. ->   10.5 millions of vector evaluations/second ->  72 cycles/value on a 3050MHz computer
benching          cephes_sinf .. ->   13.5 millions of vector evaluations/second ->  56 cycles/value on a 3050MHz computer
benching          cephes_cosf .. ->   16.0 millions of vector evaluations/second ->  48 cycles/value on a 3050MHz computer
benching          cephes_expf .. ->   11.2 millions of vector evaluations/second ->  68 cycles/value on a 3050MHz computer
benching          cephes_logf .. ->   10.7 millions of vector evaluations/second ->  71 cycles/value on a 3050MHz computer
benching               sin_ps .. ->   34.6 millions of vector evaluations/second ->  22 cycles/value on a 3050MHz computer
benching               cos_ps .. ->   34.0 millions of vector evaluations/second ->  22 cycles/value on a 3050MHz computer
benching            sincos_ps .. ->   29.9 millions of vector evaluations/second ->  25 cycles/value on a 3050MHz computer
benching               exp_ps .. ->   27.3 millions of vector evaluations/second ->  28 cycles/value on a 3050MHz computer
benching               log_ps .. ->   29.1 millions of vector evaluations/second ->  26 cycles/value on a 3050MHz computer

On a standard CERN box Intel(R) Xeon(R) CPU L5520 @ 2.27GHz

benching                 sinf .. ->    6.8 millions of vector evaluations/second ->  83 cycles/value on a 2270MHz computer
benching                 cosf .. ->    7.2 millions of vector evaluations/second ->  79 cycles/value on a 2270MHz computer
benching         sincos (x87) .. ->    4.6 millions of vector evaluations/second -> 123 cycles/value on a 2270MHz computer
benching                 expf .. ->    1.3 millions of vector evaluations/second -> 426 cycles/value on a 2270MHz computer
benching                 logf .. ->    7.7 millions of vector evaluations/second ->  74 cycles/value on a 2270MHz computer
benching                 ln16 .. ->   45.5 millions of vector evaluations/second ->  12 cycles/value on a 2270MHz computer
benching               atan2f .. ->    8.0 millions of vector evaluations/second ->  71 cycles/value on a 2270MHz computer
benching                atan2 .. ->    4.3 millions of vector evaluations/second -> 130 cycles/value on a 2270MHz computer
benching      cephes::sincosf .. ->   19.6 millions of vector evaluations/second ->  29 cycles/value on a 2270MHz computer
benching       cephes::atan2f .. ->   24.1 millions of vector evaluations/second ->  24 cycles/value on a 2270MHz computer
benching         cephes::expf .. ->   31.6 millions of vector evaluations/second ->  18 cycles/value on a 2270MHz computer
benching         cephes::logf .. ->   31.1 millions of vector evaluations/second ->  18 cycles/value on a 2270MHz computer
benching                 sinl .. ->    5.5 millions of vector evaluations/second -> 103 cycles/value on a 2270MHz computer
benching                 cosl .. ->    5.4 millions of vector evaluations/second -> 104 cycles/value on a 2270MHz computer
benching                 expl .. ->    3.5 millions of vector evaluations/second -> 162 cycles/value on a 2270MHz computer
benching                 logl .. ->    4.9 millions of vector evaluations/second -> 116 cycles/value on a 2270MHz computer
benching          cephes_sinf .. ->   15.9 millions of vector evaluations/second ->  36 cycles/value on a 2270MHz computer
benching          cephes_cosf .. ->   15.9 millions of vector evaluations/second ->  36 cycles/value on a 2270MHz computer
benching          cephes_expf .. ->    5.4 millions of vector evaluations/second -> 104 cycles/value on a 2270MHz computer
benching          cephes_logf .. ->    9.8 millions of vector evaluations/second ->  58 cycles/value on a 2270MHz computer
benching               sin_ps .. ->   30.6 millions of vector evaluations/second ->  19 cycles/value on a 2270MHz computer
benching               cos_ps .. ->   30.6 millions of vector evaluations/second ->  19 cycles/value on a 2270MHz computer
benching            sincos_ps .. ->   26.7 millions of vector evaluations/second ->  21 cycles/value on a 2270MHz computer
benching               exp_ps .. ->   22.5 millions of vector evaluations/second ->  25 cycles/value on a 2270MHz computer
benching               log_ps .. ->   24.8 millions of vector evaluations/second ->  23 cycles/value on a 2270MHz computer

On my slc6 desktop, Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz (code compiled on the Nehalem (NHL) box above)

benching                 sinf .. ->   15.1 millions of vector evaluations/second ->  56 cycles/value on a 3400MHz computer
benching                 cosf .. ->   15.6 millions of vector evaluations/second ->  54 cycles/value on a 3400MHz computer
benching         sincos (x87) .. ->    6.7 millions of vector evaluations/second -> 126 cycles/value on a 3400MHz computer
benching                 expf .. ->    1.3 millions of vector evaluations/second -> 654 cycles/value on a 3400MHz computer
benching                 logf .. ->   14.3 millions of vector evaluations/second ->  59 cycles/value on a 3400MHz computer
benching                 ln16 .. ->   76.6 millions of vector evaluations/second ->  11 cycles/value on a 3400MHz computer
benching               atan2f .. ->   14.6 millions of vector evaluations/second ->  58 cycles/value on a 3400MHz computer
benching                atan2 .. ->    7.6 millions of vector evaluations/second -> 112 cycles/value on a 3400MHz computer
benching      cephes::sincosf .. ->   38.9 millions of vector evaluations/second ->  22 cycles/value on a 3400MHz computer
benching       cephes::atan2f .. ->   40.6 millions of vector evaluations/second ->  21 cycles/value on a 3400MHz computer
benching         cephes::expf .. ->   54.8 millions of vector evaluations/second ->  16 cycles/value on a 3400MHz computer
benching         cephes::logf .. ->   50.1 millions of vector evaluations/second ->  17 cycles/value on a 3400MHz computer
benching                 sinl .. ->    7.2 millions of vector evaluations/second -> 118 cycles/value on a 3400MHz computer
benching                 cosl .. ->    7.4 millions of vector evaluations/second -> 114 cycles/value on a 3400MHz computer
benching                 expl .. ->    5.1 millions of vector evaluations/second -> 165 cycles/value on a 3400MHz computer
benching                 logl .. ->    7.4 millions of vector evaluations/second -> 113 cycles/value on a 3400MHz computer
benching          cephes_sinf .. ->   26.2 millions of vector evaluations/second ->  32 cycles/value on a 3400MHz computer
benching          cephes_cosf .. ->   26.3 millions of vector evaluations/second ->  32 cycles/value on a 3400MHz computer
benching          cephes_expf .. ->   10.0 millions of vector evaluations/second ->  85 cycles/value on a 3400MHz computer
benching          cephes_logf .. ->   17.6 millions of vector evaluations/second ->  48 cycles/value on a 3400MHz computer
benching               sin_ps .. ->   44.2 millions of vector evaluations/second ->  19 cycles/value on a 3400MHz computer
benching               cos_ps .. ->   44.3 millions of vector evaluations/second ->  19 cycles/value on a 3400MHz computer
benching            sincos_ps .. ->   38.7 millions of vector evaluations/second ->  22 cycles/value on a 3400MHz computer
benching               exp_ps .. ->   30.7 millions of vector evaluations/second ->  28 cycles/value on a 3400MHz computer
benching               log_ps .. ->   34.9 millions of vector evaluations/second ->  24 cycles/value on a 3400MHz computer

On my slc6 desktop, Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz (code compiled on the same machine with SSE4 as above)

benching                 sinf .. ->   14.3 millions of vector evaluations/second ->  59 cycles/value on a 3400MHz computer
benching                 cosf .. ->   15.3 millions of vector evaluations/second ->  55 cycles/value on a 3400MHz computer
benching         sincos (x87) .. ->    6.7 millions of vector evaluations/second -> 127 cycles/value on a 3400MHz computer
benching                 expf .. ->    1.3 millions of vector evaluations/second -> 654 cycles/value on a 3400MHz computer
benching                 logf .. ->   15.0 millions of vector evaluations/second ->  57 cycles/value on a 3400MHz computer
benching                 ln16 .. ->   76.3 millions of vector evaluations/second ->  11 cycles/value on a 3400MHz computer
benching               atan2f .. ->   15.0 millions of vector evaluations/second ->  57 cycles/value on a 3400MHz computer
benching                atan2 .. ->    8.1 millions of vector evaluations/second -> 105 cycles/value on a 3400MHz computer
benching      cephes::sincosf .. ->   38.8 millions of vector evaluations/second ->  22 cycles/value on a 3400MHz computer
benching       cephes::atan2f .. ->   40.5 millions of vector evaluations/second ->  21 cycles/value on a 3400MHz computer
benching         cephes::expf .. ->   54.9 millions of vector evaluations/second ->  15 cycles/value on a 3400MHz computer
benching         cephes::logf .. ->   50.1 millions of vector evaluations/second ->  17 cycles/value on a 3400MHz computer
benching                 sinl .. ->    6.9 millions of vector evaluations/second -> 123 cycles/value on a 3400MHz computer
benching                 cosl .. ->    7.2 millions of vector evaluations/second -> 118 cycles/value on a 3400MHz computer
benching                 expl .. ->    5.3 millions of vector evaluations/second -> 160 cycles/value on a 3400MHz computer
benching                 logl .. ->    7.6 millions of vector evaluations/second -> 112 cycles/value on a 3400MHz computer
benching          cephes_sinf .. ->   26.1 millions of vector evaluations/second ->  33 cycles/value on a 3400MHz computer
benching          cephes_cosf .. ->   26.4 millions of vector evaluations/second ->  32 cycles/value on a 3400MHz computer
benching          cephes_expf .. ->    9.8 millions of vector evaluations/second ->  87 cycles/value on a 3400MHz computer
benching          cephes_logf .. ->   16.6 millions of vector evaluations/second ->  51 cycles/value on a 3400MHz computer
benching               sin_ps .. ->   44.1 millions of vector evaluations/second ->  19 cycles/value on a 3400MHz computer
benching               cos_ps .. ->   44.3 millions of vector evaluations/second ->  19 cycles/value on a 3400MHz computer
benching            sincos_ps .. ->   38.7 millions of vector evaluations/second ->  22 cycles/value on a 3400MHz computer
benching               exp_ps .. ->   30.7 millions of vector evaluations/second ->  28 cycles/value on a 3400MHz computer
benching               log_ps .. ->   34.9 millions of vector evaluations/second ->  24 cycles/value on a 3400MHz computer

On my slc6 desktop, Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz (code compiled on the same machine with AVX)

benching                 sinf .. ->   14.3 millions of vector evaluations/second ->  59 cycles/value on a 3400MHz computer
benching                 cosf .. ->   15.2 millions of vector evaluations/second ->  56 cycles/value on a 3400MHz computer
benching         sincos (x87) .. ->    6.8 millions of vector evaluations/second -> 125 cycles/value on a 3400MHz computer
benching                 expf .. ->    1.2 millions of vector evaluations/second -> 660 cycles/value on a 3400MHz computer
benching                 logf .. ->   15.3 millions of vector evaluations/second ->  56 cycles/value on a 3400MHz computer
benching                 ln16 .. ->   79.0 millions of vector evaluations/second ->  11 cycles/value on a 3400MHz computer
benching               atan2f .. ->   15.4 millions of vector evaluations/second ->  55 cycles/value on a 3400MHz computer
benching                atan2 .. ->    8.1 millions of vector evaluations/second -> 105 cycles/value on a 3400MHz computer
benching      cephes::sincosf .. ->   44.7 millions of vector evaluations/second ->  19 cycles/value on a 3400MHz computer
benching       cephes::atan2f .. ->   32.2 millions of vector evaluations/second ->  26 cycles/value on a 3400MHz computer
benching         cephes::expf .. ->   59.6 millions of vector evaluations/second ->  14 cycles/value on a 3400MHz computer
benching         cephes::logf .. ->   52.7 millions of vector evaluations/second ->  16 cycles/value on a 3400MHz computer
benching                 sinl .. ->    7.0 millions of vector evaluations/second -> 121 cycles/value on a 3400MHz computer
benching                 cosl .. ->    6.9 millions of vector evaluations/second -> 123 cycles/value on a 3400MHz computer
benching                 expl .. ->    5.2 millions of vector evaluations/second -> 163 cycles/value on a 3400MHz computer
benching                 logl .. ->    7.5 millions of vector evaluations/second -> 113 cycles/value on a 3400MHz computer
benching          cephes_sinf .. ->   26.1 millions of vector evaluations/second ->  33 cycles/value on a 3400MHz computer
benching          cephes_cosf .. ->   27.8 millions of vector evaluations/second ->  31 cycles/value on a 3400MHz computer
benching          cephes_expf .. ->    9.8 millions of vector evaluations/second ->  87 cycles/value on a 3400MHz computer
benching          cephes_logf .. ->   17.2 millions of vector evaluations/second ->  49 cycles/value on a 3400MHz computer
benching               sin_ps .. ->   44.6 millions of vector evaluations/second ->  19 cycles/value on a 3400MHz computer
benching               cos_ps .. ->   44.7 millions of vector evaluations/second ->  19 cycles/value on a 3400MHz computer
benching            sincos_ps .. ->   39.3 millions of vector evaluations/second ->  22 cycles/value on a 3400MHz computer
benching               exp_ps .. ->   30.9 millions of vector evaluations/second ->  27 cycles/value on a 3400MHz computer
benching               log_ps .. ->   33.8 millions of vector evaluations/second ->  25 cycles/value on a 3400MHz computer

As above, compiled with just -O2 to check scalar performance

benching                 sinf .. ->   14.4 millions of vector evaluations/second ->  59 cycles/value on a 3400MHz computer
benching                 cosf .. ->   14.4 millions of vector evaluations/second ->  59 cycles/value on a 3400MHz computer
benching         sincos (x87) .. ->    6.2 millions of vector evaluations/second -> 137 cycles/value on a 3400MHz computer
benching                 expf .. ->    1.3 millions of vector evaluations/second -> 650 cycles/value on a 3400MHz computer
benching                 logf .. ->   16.2 millions of vector evaluations/second ->  52 cycles/value on a 3400MHz computer
benching                 ln16 .. ->   45.4 millions of vector evaluations/second ->  19 cycles/value on a 3400MHz computer
benching               atan2f .. ->   16.2 millions of vector evaluations/second ->  52 cycles/value on a 3400MHz computer
benching                atan2 .. ->    5.4 millions of vector evaluations/second -> 156 cycles/value on a 3400MHz computer
benching      cephes::sincosf .. ->   17.6 millions of vector evaluations/second ->  48 cycles/value on a 3400MHz computer
benching       cephes::atan2f .. ->   29.1 millions of vector evaluations/second ->  29 cycles/value on a 3400MHz computer
benching         cephes::expf .. ->   20.1 millions of vector evaluations/second ->  42 cycles/value on a 3400MHz computer
benching         cephes::logf .. ->   22.1 millions of vector evaluations/second ->  38 cycles/value on a 3400MHz computer
benching                 sinl .. ->   14.4 millions of vector evaluations/second ->  59 cycles/value on a 3400MHz computer
benching                 cosl .. ->   14.4 millions of vector evaluations/second ->  59 cycles/value on a 3400MHz computer
benching                 expl .. ->    5.6 millions of vector evaluations/second -> 151 cycles/value on a 3400MHz computer
benching                 logl .. ->   16.1 millions of vector evaluations/second ->  52 cycles/value on a 3400MHz computer
benching          cephes_sinf .. ->   22.5 millions of vector evaluations/second ->  38 cycles/value on a 3400MHz computer
benching          cephes_cosf .. ->   22.9 millions of vector evaluations/second ->  37 cycles/value on a 3400MHz computer
benching          cephes_expf .. ->    9.1 millions of vector evaluations/second ->  93 cycles/value on a 3400MHz computer
benching          cephes_logf .. ->   16.9 millions of vector evaluations/second ->  50 cycles/value on a 3400MHz computer
benching               sin_ps .. ->   44.4 millions of vector evaluations/second ->  19 cycles/value on a 3400MHz computer
benching               cos_ps .. ->   44.8 millions of vector evaluations/second ->  19 cycles/value on a 3400MHz computer
benching            sincos_ps .. ->   39.2 millions of vector evaluations/second ->  22 cycles/value on a 3400MHz computer
benching               exp_ps .. ->   30.9 millions of vector evaluations/second ->  27 cycles/value on a 3400MHz computer
benching               log_ps .. ->   35.0 millions of vector evaluations/second ->  24 cycles/value on a 3400MHz computer

The code contains several integer operations that cannot exploit the wider vectors of AVX. I'm not able to explain the poor performance of cephes::sincosf on NHL, as the very same binary performs as expected on SB.
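As an illustration of the kind of integer operation meant here, the sketch below builds 2^n as a float by writing the biased exponent directly into the IEEE-754 bit pattern, a step that typically appears at the end of an exp-like function. It is a hypothetical example, not code copied from cephes.h, and the name pow2i is invented for it. The add and the shift are integer operations: first-generation AVX offers 256-bit floating-point arithmetic, while 256-bit integer instructions only arrived with AVX2, so on an AVX-only CPU such steps stay on 128-bit vectors.

```cpp
#include <cstdint>
#include <cstring>

// Returns 2^n as a single-precision float.
// Valid only for -126 <= n <= 127 (normal numbers); no range check
// is done, to keep the function free of branches.
inline float pow2i(int n) {
    // Bias the exponent by 127 and move it to bits 23..30.
    const std::uint32_t bits = static_cast<std::uint32_t>(n + 127) << 23;
    float f;
    std::memcpy(&f, &bits, sizeof f);  // reinterpret the bits as a float
    return f;
}
```

Here pow2i(3) gives 8.0f and pow2i(-1) gives 0.5f.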

The attached tar file, vecfunBench.tgz, contains all the source code needed to run the benchmarks:

  • build: a script to build the two test programs
  • cephes.h: the C++ vectorizable code
  • icsiLog.h: a fast, limited-accuracy log
  • sse_mathfun.h: the original SSE2 code
  • sse_mathfun_test.c: the modified benchmark program
  • testCephes.cpp: a specific test program for the vectorizable functions

The code needs to be compiled with GCC 4.7.0 snapshot 20110702 or newer.

-- VincenzoInnocente - 12-Jul-2011

Topic revision: r3 - 2011-08-09 - VincenzoInnocente
 