Haswell vs SandyBridge

Haswell is the new "tock" architecture by Intel. In this report we compare the top workstation models for Haswell and SandyBridge, the i7-4770K vs the i7-2600 (comparison on the Intel site).

For comprehensive reviews please refer to

We shall note that on the i7-4770K the Hardware (Restricted) Transactional Memory instructions are not available (they are on the i7-4770, without the K), so we will not be able to test those.
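
For reference, this is the kind of code (using the GCC RTM intrinsics from immintrin.h, compiled with -mrtm) that therefore cannot be exercised on this machine; the counter and the atomic fallback below are purely illustrative:

  #include <immintrin.h>  // _xbegin/_xend intrinsics, requires -mrtm

  int counter = 0;

  // Minimal Restricted Transactional Memory pattern: try to run the update
  // as a hardware transaction, fall back to an atomic update if it aborts.
  void transactional_increment() {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
      ++counter;   // executed transactionally
      _xend();
    } else {
      // in a real program this would be a lock-based fallback path
      __sync_fetch_and_add(&counter, 1);
    }
  }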

A previous comparison of performance of SandyBridge vs Nehalem can be found here.

Executive Summary

Under the caveat that all these results have been obtained on a single machine and not verified on any other (in particular not on other similar models), we can summarize:

  • Running identical binary code, Haswell seems to be about 25% faster than SandyBridge
  • HyperThreading seems to behave as on SandyBridge (no more, no less)
  • It seems that on Haswell AVX code, in contrast to SSE code, does not reach the maximal turbo step
  • The benefit of AVX2 instructions (as generated by the GCC compiler) is the expected one as far as vectorization of integer code and FMA are concerned. More work will be required to optimize the use of gather instructions (see the sketch after this list).
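
To illustrate the last point, this is the kind of indirect-access loop where AVX2 gather instructions come into play. It is a minimal sketch, not code from any of the benchmarks below, and whether GCC actually emits gathers for it depends on the compiler version and flags:

  #include <cstddef>

  // Indirect loads through an index array: with -march=core-avx2 and -Ofast
  // the vectorizer can in principle use AVX2 gather instructions here.
  double sum_indexed(const double* values, const int* index, std::size_t n) {
    double s = 0.;
    for (std::size_t i = 0; i < n; ++i)
      s += values[index[i]];
    return s;
  }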

i7-4770 installation

The machine was delivered with Windows 8, which installed and worked at CERN with no particular problem. It was then reinstalled with SLC6, which does not fully support the hardware. In particular the standard SLC6 installation crashes during the final CERN configuration (it may well be that SLC6 does not fully support the new graphics card); we stopped before the final step and finished the configuration on the command line through ssh. It has been up and running since then.

SLC6 does not support Haswell yet. In particular, perf has not been updated to cover all of its performance counters, and neither gcc nor the rest of the toolchain (the assembler in particular) supports AVX2 instructions. We therefore use either the CMS installation (gcc 4.8.0, GNU Binutils 2.23.1) or our own (gcc 4.9 trunk revision 200570, GNU Binutils 2.22).

scimark2

These results are based on the original C code of scimark2, slightly modified to ensure that the compiler does not optimize away dead code.
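
The actual modification is not shown on this page; one common way to achieve this (assumed here only as an illustration) is to funnel the benchmark result through a volatile sink, for example:

  // Storing the result through a volatile object forces the compiler to
  // keep the computation, so the benchmark kernel cannot be removed as
  // dead code.
  static volatile double sink;

  void consume(double result) { sink = result; }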

More details about the performance of scimark2 for different compilers and compiler options on SandyBridge and Nehalem can be found here.

It has been compiled (mostly for historical reasons) with "c++ -pthread -v -Wl,-v scimark2.cpp *.c -std=gnu++0x -march=ARCH -Wall -Ofast -fvisibility-inlines-hidden -flto --param vect-max-version-for-alias-checks=100 -funsafe-loop-optimizations -ftree-loop-distribution -ftree-loop-if-convert-stores -fipa-pta -Wunsafe-loop-optimizations -fvisibility=hidden -DHIDDEN -lrt"

where ARCH is

  • corei7
  • corei7-avx
  • native (equivalent to -march=corei7-avx -mavx2 -mfma) (caveat: native was broken in gcc for ivy-bridge and haswell at the time of the test: -march=core-avx2 gives better results for some benchmarks)

prefetch means adding -fprefetch-loop-arrays.
avx2-128 means adding -mprefer-avx128.
fma-128 means just -mavx -mfma -mprefer-avx128 (see the sketch below).
HT means running scimark2 while 4 cmsRun jobs (AVX2) were already running, pinned to CPUs 0-4
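
For illustration, this is the kind of multiply-add loop where -mfma (and hence the avx2 and fma-128 builds) allows GCC to emit fused multiply-add instructions, while -mprefer-avx128 restricts the vectorized loop to 128-bit registers. It is a minimal sketch, not scimark2 code:

  #include <cstddef>

  // a*x + y: -mfma lets the compiler contract this into vfmadd instructions;
  // with -mprefer-avx128 the vectorized loop uses xmm instead of ymm registers.
  void axpy(double a, const double* x, const double* y, double* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
      out[i] = a * x[i] + y[i];
  }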

| config | march | clock | small: FFT SOR MC SMM LU | large: FFT SOR MC SMM LU | very large: FFT SOR MC SMM LU |
| SB | corei7 | 3.79 | 1498.96 1255.18 725.17 2140.90 3672.92 | 635.94 1139.67 725.20 1860.96 3714.12 | |
| SB | avx | 3.78 | 1543.74 1251.13 710.14 2175.27 3841.90 | 633.53 1137.49 710.04 1874.89 4191.39 | 227.36 1121.32 710.06 1685.14 2173.26 |
| HW | corei7 | 3.89 | 1912.57 1305.10 889.76 2532.16 4088.19 | 624.26 1175.58 890.34 2537.39 4878.07 | 317.87 1162.20 889.79 1894.30 2247.59 |
| HW | avx | 3.75 | 1741.74 1298.98 879.92 2504.80 4305.30 | 637.92 1174.61 880.58 2497.68 4940.50 | 318.96 1162.40 879.94 1929.78 2235.62 |
| HW | avx2 | 3.75 | 1936.25 1574.53 896.18 2049.92 4555.03 | 648.36 1385.21 896.84 1566.07 5066.75 | 321.89 1367.76 896.86 1365.10 2228.53 |
| HW | avx2 prefetch | 3.72 | 1926.21 2187.02 896.16 2050.15 4475.29 | 652.74 1939.28 896.10 1527.66 4965.81 | 319.07 1912.88 896.04 1413.40 2239.55 |
| HW gcc4.9 | avx2 | 3.73 | 1878.89 2399.19 942.31 2716.11 4737.13 | 648.90 2135.30 941.67 1740.51 5135.15 | |
| HW gcc4.9 | corei7 | 3.89 | 1932.80 2084.32 922.74 2509.70 4630.13 | 641.86 1878.10 922.02 2583.79 4919.05 | |
| HW gcc4.9 | avx2-128 | 3.83/3.73 | 2092.12 2398.18 941.57 2696.06 4189.29 | 657.36 2135.21 941.61 1291.18 4604.66 | |
| HW gcc4.9 | fma-128 | 3.83/3.80 | 2126.92 2403.87 932.91 2964.31 4250.64 | 660.86 2135.96 932.89 2538.18 4591.93 | |
| HW gcc4.9 | avx2-O2 | 3.89 | 2206.25 1576.54 673.65 2964.43 3411.80 | 660.27 1385.08 673.64 2510.23 3753.41 | |
| HW gcc4.9 | corei7-O2 | 3.89 | 1929.41 1306.24 670.81 2509.88 3180.35 | 646.41 1175.06 670.65 2564.10 3704.00 | |
| HW gcc4.9 | avx2 pgo | 3.73 | 1920.75 2412.54 957.41 3047.93 5560.08 | 653.99 2134.92 954.34 2319.38 5179.12 | |
| HW gcc4.9 | corei7 pgo | 3.89 | 1919.95 2094.01 947.23 2989.55 6006.23 | 647.11 1876.27 947.22 2596.41 5317.52 | |
| HW fix freq | corei7 | 3.495 | 1548.97 1171.72 797.64 2273.28 3666.48 | 593.24 1054.84 797.65 2378.34 4243.39 | |
| HW fix freq | avx | 3.495 | 1565.83 1166.16 789.70 2249.12 4347.40 | 596.87 1054.82 789.60 2341.08 4718.40 | |
| HW fix freq | avx2 | 3.495 | 1728.96 1413.47 800.70 1846.05 4564.97 | 599.21 1243.95 800.71 1573.33 4987.86 | |
| HW HT | corei7 | 3.491 | 991.46 1044.30 432.50 1298.79 2077.25 | 572.02 978.45 430.50 803.21 2245.00 | |
| HW HT | avx2 | 3.491 | 1152.91 1225.49 431.93 980.33 2805.42 | 587.55 1149.35 434.14 834.04 3100.83 | |
| HW HT | avx2 prefetch | 3.491 | 1142.67 1792.08 433.13 976.11 2812.73 | 582.97 1668.14 426.62 819.81 3162.60 | |

It should be noted that for "very large" the average clock for "HW avx(2)" is 3.5 GHz, while it stays at 3.89 GHz for the corei7 code.

ParallelPdf (VIFIT)

ParallelPdf is a RooFit-like multi-threaded application that evaluates the gradient of a log-likelihood for a given number of events. It can run either evaluating all pdfs each time or reading the values of a pdf from a cache if its parameters did not change. For anything but a small number of events it is essentially dominated by memory (and cache!) access. The numerical computation is dominated by the evaluation of transcendental functions in double precision, each of which implies at least one division and a few conditions (besides the usual polynomials).
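
As a toy illustration of the "cache" mode (not the actual ParallelPdf code; all names below are hypothetical), the per-event values of a pdf are recomputed only when the parameters of that pdf have changed since the last evaluation:

  #include <cstddef>
  #include <vector>

  // Toy pdf cache: recompute the per-event values only if the parameters
  // changed, otherwise return the cached ones.
  struct CachedPdf {
    std::vector<double> params;      // current parameter values
    std::vector<double> lastParams;  // parameters used for the cached values
    std::vector<double> values;      // cached per-event pdf values

    template <typename Pdf>
    const std::vector<double>& evaluate(Pdf const& pdf, std::vector<double> const& events) {
      if (values.empty() || params != lastParams) {
        values.resize(events.size());
        for (std::size_t i = 0; i < events.size(); ++i)
          values[i] = pdf(events[i], params);   // full evaluation
        lastParams = params;
      }
      return values;                            // cache hit: no recomputation
    }
  };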

| arch | n. events | iterations | cache | clock | time |
| sse4.2 O2 | 1K | 40K | eval | 3.68 | 70 |
| sse4.2 O2 | 1K | 40K | cache | 3.68 | 6.7 |
| sse4.2 O2 | 200K | 200 | eval | 3.68 | 57 |
| sse4.2 O2 | 200K | 200 | cache | 3.68 | 5.2 |
| sse4.2 vect | 1K | 40K | eval | 3.68 | 37 |
| sse4.2 vect | 1K | 40K | cache | 3.67 | 4.1 |
| sse4.2 vect | 200K | 200 | eval | 3.68 | 27 |
| sse4.2 vect | 200K | 200 | cache | 3.67 | 2.8 |
| avx2 O2 | 1K | 40K | eval | 3.48 | 59 |
| avx2 O2 | 1K | 40K | cache | 3.48 | 6.0 |
| avx2 O2 | 200K | 200 | eval | 3.48 | 47 |
| avx2 O2 | 200K | 200 | cache | 3.48 | 4.4 |
| avx2 vect | 1K | 40K | eval | 3.48 | 28 |
| avx2 vect | 1K | 40K | cache | 3.47 | 3.3 |
| avx2 vect | 200K | 200 | eval | 3.48 | 16 |
| avx2 vect | 200K | 200 | cache | 3.47 | 2.2 |

CMSSW Reco

We have used CMSSW_6_2_0_pre7 compiled with gcc480. We have recompiled most of the numerically sensitive code as follows:

  • sse: original CMS release, i.e. -O2 -ftree-vectorize -msse3
  • avx: recompiled with -Ofast -mrecip -march=corei7-avx (see the sketch below)
  • avx2: recompiled with -Ofast -mrecip -march=native (on Haswell)
  • avx2-128: recompiled with -Ofast -mrecip -march=native -mprefer-avx128 (on Haswell)
  • avx2 prefetch: recompiled with -Ofast -mrecip -march=native -fprefetch-loop-arrays (on Haswell)

Externals (in particular fastjet and root) have not been recompiled.
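
As an aside on the -mrecip option used above: it lets the compiler replace single-precision divisions and square roots with a hardware reciprocal approximation plus a Newton-Raphson refinement step. A minimal sketch of the kind of loop affected (not CMSSW code):

  #include <cstddef>

  // With -Ofast -mrecip GCC may compile this division using RCPPS-style
  // reciprocal approximations instead of full-precision divisions.
  void reciprocal(const float* w, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
      out[i] = 1.0f / w[i];
  }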

A similar comparison of INTEL SandyBridge vs AMD Bulldozer can be found here.

single instance

Here we run the standard reco step for 300 high pile-up events from a high-pt jet sample taken in a late 2012 run. I/O is local.

| config | clock | time | task-clock | cycles | instructions | stalled-cycles-per-insn | stalled-cycles-frontend | stalled-cycles-backend | cache-misses | cache-references | branch-misses | L1-dcache-misses | L1-icache-load-misses | dTLB-load-misses | iTLB-load-misses |
| SB sse | 3.77 | 1588 | 1556 | 5863976 | 7169199 | 0.43 | 3113898 | 1788612 | 4385 | 36484 | 18085 | 76276 | 86317 | 5378 | 1193 |
| HW sse | 3.85 | 1233 | 1221 | 4704728 | 7170645 | 0 | 0 | 0 | 1714 | 29466 | 17060 | | | | |
| | 3.480 | 1373 | 1368 | 4759365 | 7157811 | 0 | 0 | 0 | 1918 | 34718 | 17161 | | | | |
| HW avx | 3.62 | 1259 | 1225 | 4430837 | 6349838 | 0 | 0 | 0 | 1651 | 27607 | 16669 | | | | |
| HW avx2 | 3.58 | 1253 | 1236 | 4427514 | 6212071 | 0 | 0 | 0 | 1638 | 27730 | 16737 | | | | |
| HW avx2 prefetch | 3.58 | 1263 | 1245 | 4461519 | 6220002 | 0 | 0 | 0 | 1638 | 30673 | 16813 | | | | |
| | 3.480 | 1290 | 1267 | 4410608 | 6217452 | 0 | 0 | 0 | 1713 | 27222 | 16636 | | | | |

igprof details reco

Detailed results for each producer can be obtained using igprof and are presented on this page.

CMSSW multi job

We have run the very same job in parallel, 4 and 8 instances at the same time. The jobs are fully identical and read the very same file, so some synchronization cannot be excluded. Results are reported for a single job: the best and the worst, i.e. the fastest and the slowest of the N jobs in terms of real time. For reference we report again the result for a single job.

| config | N jobs | job | clock | time | task-clock | cycles | instructions | cache-misses | cache-references | branch-misses |
| HW avx2 prefetch | 1 | | 3.58 | 1263 | 1245 | 4461519 | 6220002 | 1638 | 30673 | 16813 |
| HW avx2 prefetch | 4 | best | 3.49 | 1317 | 1305 | 4558777 | 6224284 | 2747 | 27740 | 16649 |
| HW avx2 prefetch | 4 | worst | 3.49 | 1323 | 1307 | 4569511 | 6218870 | 2744 | 27951 | 16688 |
| HW avx2 prefetch | 8 | best | 3.47 | 2107 | 2077 | 7207153 | 6209046 | 3965 | 53833 | 18959 |
| HW avx2 prefetch | 8 | worst | 3.47 | 2142 | 2117 | 7351724 | 6222229 | 4423 | 54493 | 19036 |

CMSSW Reco multichildren

Again the same code, this time running multi-children jobs on 1600 minimum-bias events from an early 2012 run (for historical reasons). Events are processed in round-robin batches of 10 events each. The timing includes the overhead of the scalar component, which is as high as 60 seconds (see the discussion here).

  • 4x4 means four children on the whole machine
  • 8x4 means eight children on the whole machine
  • 4x2 means four children on two cpus (taskset -c 0,1,4,5)

| config | run | clock | time | task-clock | cycles | instructions | stalled-cycles-per-insn | stalled-cycles-frontend | stalled-cycles-backend | cache-misses | cache-references | branch-misses | L1-dcache-misses | L1-icache-load-misses | dTLB-load-misses | iTLB-load-misses |
| SB sse | 4x4 | 3.49 | 970 | 3609 | 12583573 | 14451372 | 0.48 | 7004787 | 4185678 | 17767 | 87234 | 42296 | 195898 | 157806 | 11000 | 2019 |
| SB sse | 4x2 | 3.65 | 1374 | 5309 | 19346974 | 14438122 | 0.98 | 14096350 | 8602680 | 18256 | 152449 | 49762 | 291977 | 231166 | 16244 | 6314 |
| SB sse | 8x4 | 3.47 | 824 | 6050 | 21033048 | 15009005 | 1.03 | 15503209 | 9795102 | 27637 | 163413 | 50349 | 308031 | 230258 | 17611 | 7154 |
| HW sse | 4x4 | 3.68 | 790 | 2867 | 10539718 | 14406288 | 0 | 0 | 0 | 8260 | 69123 | 40641 | | | | |
| HW sse | 4x2 | 3.85 | 1074 | 4076 | 15674278 | 14398121 | 0 | 0 | 0 | 8005 | 119533 | 44342 | | | | |
| HW sse | 8x4 | 3.67 | 615 | 4564 | 16747618 | 14554123 | 0 | 0 | 0 | 11123 | 122425 | 44721 | | | | |
| HW sse | | 3.66 | 626 | 4564 | 16735022 | 14553224 | 0 | 0 | 0 | 11248 | 120911 | 44260 | | | | |
| HW sse | 7x4 | 3.67 | 638 | 4209 | 15451440 | 14504683 | 0 | 0 | 0 | 10356 | 112762 | 43952 | | | | |
| HW avx | 4x4 | 3.49 | 804 | 2863 | 9990704 | 13249792 | 0 | 0 | 0 | 7614 | 65468 | 40103 | | | | |
| HW avx | 4x2 | 3.50 | 1127 | 4261 | 14894653 | 13277334 | 0 | 0 | 0 | 7771 | 114828 | 43705 | | | | |
| HW avx | 8x4 | 3.47 | 633 | 4694 | 16305783 | 13424699 | 0 | 0 | 0 | 11225 | 117131 | 43787 | | | | |
| HW avx2 | 4x4 | 3.48 | 752 | 2860 | 9964146 | 13003814 | 0 | 0 | 0 | 7748 | 62965 | 39979 | | | | |
| HW avx2 | 4x2 | 3.48 | 1111 | 4234 | 14745904 | 13016519 | 0 | 0 | 0 | 7709 | 114158 | 43590 | | | | |
| HW avx2 | 8x4 | 3.47 | 648 | 4653 | 16163627 | 13171569 | 0 | 0 | 0 | 11172 | 118144 | 44321 | | | | |
| HW avx2 prefetch | 8x4 | 3.47 | 631 | 4597 | 15962785 | 13193125 | 0 | 0 | 0 | 11474 | 116580 | 43752 | | | | |
| HW avx2 prefetch | 7x4 | 3.48 | 640 | 4204 | 14611804 | 13123042 | 0 | 0 | 0 | 10403 | 109286 | 43147 | | | | |