Compilers optimization for SMatrix usage

SMatrix is a highly optimized C++ library for "small" matrix manipulation. Originally developed for the Hera-B experiment is now part of Root-Math package. It is known to overperform other implementations such as TMatrix and CLHEP-Matrix.

Its is extensively used in CMS in particular in fitting code such as track Kalman filter and constrained kinematic fits.

SMatrix has been already the subject of performance investigations w.r.t. different compiler optimization strategies. See for instance Open Lab report and the Workshop with Intel.

Here we try to look in details to few performance-critical Matrix operations that shows up as hot-spots in CMS reconstruction.

Similarity

The Similarity operation between a generic matrix and a symmetric (actually positive-defined) matrix is a very typical operation when covariance matrices are involved. It can be decomposed in a transposition and two matrix multiplications. The result is a symmetric matrix as well. in SMatrix, symmetric matrices are stored using a lower-triangular representation which saves space and improve performances in several operations that act on each element sequencially. This representation seems on the other hand to prevent some compiler optimization in case of multiplications due to the non sequential access pattern to the elements.

We have compared the performances of four different implementations:

  • (root) the original Root one: both input and output are symmetric matrices in lower-triangular representation
  • (sym) having in input the symmetric matrix represented as a full-square matrix and in output a lower-triangular one
  • (std) having in both input and input square matrices
  • (loop) a brute force loop implementation B(i,j) = U(i,k)*A(k,l)*U(j,l)

code can be found in hte CMSSW cvs repository as well as in /afs/cern.ch/user/i/innocent/w1/Similarity, in the latter together with all the executable ready to run.

we have tested gcc 4.6 and icc 12 compilers with various options on three different architectures: Intel Core 2 Duo @3.06 GHz (my MacBook), Intel Core i7-2600K CPU @ 3.40GHz (my workstation), Intel Xeon L5520 (Nehalem) @ 2.27GHz (standard CERN node) All results are expressed in ticks obtained using rdtsc instruction. in case of gcc 4.6 O2v stands for "-O2 -ftree-vectorize", O2f for -O2 -fast math and ul ofr -funroll-loops. in case of icc nv stands for -no-vec.

Intel Core 2 Duo @3.06 GHz
    3x3 5x5 5x15 15x15
compiler code size root sym std loop root sym std loop root sym std loop root sym std loop
gcc 4.6. -O2 46.9 358 156 164 495 557 431 547 2241 6439 3174 3317 17220 22501 12692 13617 136743
gcc 4.6. -O2f 47.2 182 164 180 477 541 421 517 2250 3224 2270 2416 17158 12455 9335 10218 135901
gcc 4.6. -Ofast 63.7 128 134 141 131 318 416 479 693 3182 2249 2355 6431 11986 9441 10078 49532
gcc 4.6. -Ofast ul 100.6 119 134 136 137 329 428 446 654 3110 2174 2274 6239 11629 8585 9518 49651

Intel Xeon L5520 (Nehalem) @ 2.27GHz
    3x3 5x5 5x15 15x15
compiler code size root sym std loop root sym std loop root sym std loop root sym std loop
gcc 4.6. -O2 43.1 165 140 151 378 524 410 458 1781 4191 2583 2664 14097 15077 10379 10969 112606
gcc 4.6. -O2v 44.3 159 127 141 344 508 391 451 1763 4159 2546 2654 14175 14856 10027 10917 112752
gcc 4.6. -O2f 43.3 158 137 147 409 523 424 458 1959 2577 1793 1932 13880 9660 7200 8018 111207
gcc 4.6. -O2vf 43.8 154 127 145 368 510 416 464 1887 2591 1774 1907 13910 9514 6997 8199 111390
gcc 4.6. -O3 74.1 112 108 136 144 356 353 439 761 4132 2514 2648 8883 14915 9977 10951 70756
gcc 4.6. -Ofast 70.0 127 123 139 140 363 359 450 513 2530 1711 1898 4746 9465 6951 9053 37833
gcc 4.6. -Ofast ul 108.7 117 112 132 125 362 356 440 513 2518 1706 1830 4556 9248 6704 7332 37089
icc 12 -O2 nv 237.0 174 172 157 111 487 480 448 515 2413 2708 3547 6752 10298 10588 14700 39199
icc 12 -O2 181.4 203 197 179 111 506 504 515 521 3023 2773 3582 8681 12091 11446 12639 68379

Intel Core i7-2600K @ 3.40GHz running sse code
    3x3 5x5 5x15 15x15
compiler code size root sym std loop root sym std loop root sym std loop root sym std loop
gcc 4.6. -O2 43.1 131 129 133 266 425 346 383 1648 3258 1941 2010 11840 11393 7586 7764 93685
gcc 4.6. -O2v 44.3 142 113 137 324 384 331 394 1434 3212 1896 2012 11775 11396 7394 7809 94098
gcc 4.6. -O2f 43.3 125 120 128 257 402 352 391 1639 2075 1708 1808 11735 7725 6702 6995 93257
gcc 4.6. -O2vf 43.8 137 106 123 307 404 336 401 1508 1995 1708 1807 11718 7436 6567 7125 101895
gcc 4.6. -O3 74.0 96 99 113 130 287 272 376 749 3232 1884 2030 8787 11350 7305 7825 70375
gcc 4.6. -Ofast 70.0 116 116 132 138 299 271 377 491 2062 1667 1787 4747 7531 6552 7180 37713
gcc 4.6. -Ofast ul 108.7 117 108 136 127 271 267 348 479 1949 1677 1750 4681 7276 6400 6889 37295
icc 12 -O2 nv 237.0 132 131 111 81 373 354 368 439 2667 2663 3401 6313 8403 8735 10387 34587
icc 12 -O2 181.4 146 147 133 84 399 403 409 452 2102 1925 2607 7285 8555 7897 9195 57647

Intel Core i7-2600K @ 3.40GHz running avx code
    3x3 5x5 5x15 15x15
compiler code size root sym std loop root sym std loop root sym std loop root sym std loop
gcc 4.6. -O2 43.1 134 122 128 264 392 318 376 1419 3473 2481 2565 11801 12563 9693 9939 93615
gcc 4.6. -O2v 46.6 138 119 137 327 369 299 370 1497 3400 2439 2558 11640 12150 9414 9960 93746
gcc 4.6. -O2f 43.3 119 115 119 248 413 348 384 1469 2049 1494 1761 11782 7482 5818 6503 101356
gcc 4.6. -O2vf 43.5 128 101 121 302 357 314 375 1429 1986 1458 1587 11783 7317 5542 6806 93257
gcc 4.6. -O3 74.0 115 103 123 144 250 244 379 645 3470 2452 2569 9099 12258 9407 9994 68395
gcc 4.6. -Ofast 66.4 112 97 116 114 259 248 378 458 1967 1415 1548 4224 7147 5500 6102 32718
gcc 4.6. -Ofast ul 102.8 106 97 123 124 248 239 319 468 1948 1431 1648 4476 7081 5513 6128 35301
icc 12 -O2 nv 220.6 149 137 119 93 360 355 361 473 2186 2283 3036 6471 7583 8038 10183 35438
icc 12 -O2 178.4 136 139 151 105 443 424 445 517 2458 2131 2712 7327 9503 7663 9503 56308

-- VincenzoInnocente - 25-Mar-2011

Edit | Attach | Watch | Print version | History: r4 < r3 < r2 < r1 | Backlinks | Raw View | WYSIWYG | More topic actions
Topic revision: r4 - 2011-03-26 - VincenzoInnocente
 
    • Cern Search Icon Cern Search
    • TWiki Search Icon TWiki Search
    • Google Search Icon Google Search

    LCG All webs login

This site is powered by the TWiki collaboration platform Powered by PerlCopyright & 2008-2020 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback