SandyBridge vs Bulldozer performance for CMS reco

Executive summary

On SandyBridge, Hyper-Threading increases CMS-reco throughput by 26% when running 8 processes instead of 4 (or, if preferred, one observes a per-process slowdown of 58%). On Bulldozer, running 8 processes instead of 4 increases the throughput by 63% (corresponding to a per-process slowdown of 23%).
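The two ways of quoting the same effect are related by simple arithmetic: doubling the number of processes with a throughput gain g means each process runs slower by a factor 2/(1+g). A minimal sketch in Python, using the numbers quoted above:

# Relation between throughput gain and per-process slowdown when doubling
# the number of processes (numbers taken from the summary above).
def per_process_slowdown(throughput_gain):
    # with gain g, each of the doubled processes delivers (1+g)/2 of its
    # previous rate, i.e. each event takes 2/(1+g) times longer
    return 2.0 / (1.0 + throughput_gain) - 1.0

print(per_process_slowdown(0.26))   # SandyBridge: ~0.59, the ~58% slowdown quoted above
print(per_process_slowdown(0.63))   # Bulldozer:   ~0.23, the ~23% slowdown quoted above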

Introduction

In the coming years the main CPU architectures used in HEP will be Intel SandyBridge and AMD Bulldozer and their evolutions. Here we compare the performance of multi-process CMS reconstruction on the first versions of those chips, i.e.

  • Intel® Core(™) i7-2600K CPU @ 3.40GHz, acquired in January 2011
  • AMD FX(™)-8120 Eight-Core Processor @ 3.10GHz, acquired in January 2012
For a general description of these chips and general performance evaluations I refer the reader to the excellent documentation available on the web. Here we just recall that the SandyBridge has four cores, each running two "hyperthreads", while the Bulldozer has four modules, each "divided" into two cores. Two processes/threads running on the same core (SB) or module (BD) therefore share common low-level resources; processes/threads running anywhere on the same chip share more and more common high-level resources. The details of course vary depending on the actual chip architecture.
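Since the benchmarks below pin processes with taskset, it matters which logical CPU numbers share a core (SB) or a module (BD). A quick way to check is to read the sibling lists the Linux kernel exports in sysfs; this is only a sketch, and on Bulldozer the exact reporting of module siblings depends on the kernel version:

# Print which logical CPUs share low-level resources (a core on SB; on BD the
# kernel may or may not report module siblings here, depending on its version).
import glob

for path in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list")):
    cpu = path.split("/")[5]                      # e.g. "cpu0"
    with open(path) as f:
        print(cpu, "shares a core/module with CPUs", f.read().strip())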

CMS multiprocess Reconstruction

The CMS framework is able to process events in parallel by spawning child processes after a common initialization phase in which most of the read-only components (conditions) are read in and built. Each child process then runs autonomously, sampling events from the input file in a round-robin fashion and writing its own log and ROOT output files.
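The scheme (initialize once, then fork workers that each take every N-th event and write their own output) can be illustrated with a small self-contained Python sketch; this is only an illustration of the pattern, not the actual CMSSW implementation:

# Toy illustration of the "initialize once, then fork" scheme described above:
# the parent performs the expensive read-only setup, each child processes every
# N-th event in a round-robin fashion and writes its own log file.
import os

N_CHILDREN = 4
EVENTS = list(range(800))            # stand-in for the events of the input file

setup = {"conditions": "loaded"}     # expensive, shared, read-only initialization

children = []
for i in range(N_CHILDREN):
    pid = os.fork()
    if pid == 0:                                   # child number i
        with open("child_%d.log" % i, "w") as log:
            log.write("child %d using %s conditions\n" % (i, setup["conditions"]))
            for ev in EVENTS[i::N_CHILDREN]:       # round-robin sampling
                log.write("processed event %d\n" % ev)
        os._exit(0)
    children.append(pid)

for pid in children:                               # parent waits for all children
    os.waitpid(pid, 0)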

The benchmark is the so-called step2 pass running RAW2DIGI, L1Reco, RECO, ENDJOB. It processes 800 "Minimum Bias" events in RAW format from Run 190705, taken in early 2012, and produces RECO and AOD files. I used the official slc6_amd64_gcc462 build of CMSSW_6_0_0_pre11.

Benchmark procedure

I ran cmsRun with either 2, 4 or 8 children, selecting the CPUs to run on with taskset, and used time and perf to measure performance. A detailed description of perf can be found, for instance, at http://perf.wiki.kernel.org/index.php/Tutorial . For example

taskset -c 0,1 ~/bin/perf stat cmsRun step2_Run2012.py >& perfmulti01.out &

runs a 2-child job (the number of children is selected by editing the config file) on CPUs 0 and 1, measuring performance with the version of perf currently installed on SLC6 (/usr/libexec/perf.3.0.9-rt26.46.el6rt.x86_64). On SandyBridge the output looks like this:
cat perfmulti01.out
29-Aug-2012 11:20:30 CEST  Initiating request to open file file:Run2012A_MinimumBias_RAW_v1_000_190_705_C65A6A58-B781-E111-9D06-003048678110.root
29-Aug-2012 11:20:30 CEST  Successfully opened file file:Run2012A_MinimumBias_RAW_v1_000_190_705_C65A6A58-B781-E111-9D06-003048678110.root
%MSG-s ForkingEventSetupPreFetching:  AfterBeginJob 29-Aug-2012 11:20:34 CEST  BeforeEvents
 prefetching for run 190705
%MSG
%MSG-e BeamHaloPropagator:  AfterBeginJob 29-Aug-2012 11:20:45 CEST BeforeEvents
composite propagator set with inconsistent direction components
EndCap propagator is: 1
 to be set on: 2
%MSG
%MSG-s ForkingEventSetupPreFetching:  AfterBeginJob 29-Aug-2012 11:20:48 CEST  BeforeEvents
  done prefetching
%MSG
29-Aug-2012 11:20:48 CEST  Closed file file:Run2012A_MinimumBias_RAW_v1_000_190_705_C65A6A58-B781-E111-9D06-003048678110.root

 Performance counter stats for 'cmsRun step2_2_Run2012.py':

    1887008.530723 task-clock                #    1.955 CPUs utilized          
            210192 context-switches          #    0.000 M/sec                  
                83 CPU-migrations            #    0.000 M/sec                  
          22986333 page-faults               #    0.012 M/sec                  
     6936207997892 cycles                    #    3.676 GHz                    
     3880456074608 stalled-cycles-frontend   #   55.94% frontend cycles idle   
     2251329027408 stalled-cycles-backend    #   32.46% backend  cycles idle   
     8006610371702 instructions              #    1.15  insns per cycle        
                                             #    0.48  stalled cycles per insn
     1265108456093 branches                  #  670.431 M/sec                  
       22208806434 branch-misses             #    1.76% of all branches        

     965.382202255 seconds time elapsed

Running the same job under time I get
cat multi01.out
29-Aug-2012 10:51:18 CEST  Initiating request to open file file:Run2012A_MinimumBias_RAW_v1_000_190_705_C65A6A58-B781-E111-9D06-003048678110.root
29-Aug-2012 10:51:18 CEST  Successfully opened file file:Run2012A_MinimumBias_RAW_v1_000_190_705_C65A6A58-B781-E111-9D06-003048678110.root
%MSG-s ForkingEventSetupPreFetching:  AfterBeginJob 29-Aug-2012 10:51:22 CEST  BeforeEvents
 prefetching for run 190705
%MSG
%MSG-e BeamHaloPropagator:  AfterBeginJob 29-Aug-2012 10:51:33 CEST BeforeEvents
composite propagator set with inconsistent direction components
EndCap propagator is: 1
 to be set on: 2
%MSG
%MSG-s ForkingEventSetupPreFetching:  AfterBeginJob 29-Aug-2012 10:51:36 CEST  BeforeEvents
  done prefetching
%MSG
29-Aug-2012 10:51:36 CEST  Closed file file:Run2012A_MinimumBias_RAW_v1_000_190_705_C65A6A58-B781-E111-9D06-003048678110.root
1824.70user 51.16system 15:57.74elapsed 195%CPU (0avgtext+0avgdata 6117776maxresident)k
203264inputs+1269816outputs (40302major+18797340minor)pagefaults 0swaps
In addition, the CMS framework is equipped with internal timers that report the average CPU and elapsed time per event for each child. For the case above (two processes on different cores) I get
grep "Avg event" *.log
redirectout_15125_0.log: - Avg event:   2.23423
redirectout_15125_0.log: - Avg event:   2.12753
redirectout_15125_1.log: - Avg event:   2.47873
redirectout_15125_1.log: - Avg event:   2.36136
The first pair (elapsed and CPU time) refers to the first child, the second pair to the second child.
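A small Python helper can collect these per-child averages from the redirected logs (the file-name pattern and the "Avg event" wording are taken from the example above and may differ in other releases):

# Collect the per-child "Avg event" timings (elapsed, cpu) from the child logs.
import glob, re

for logfile in sorted(glob.glob("redirectout_*_*.log")):
    with open(logfile) as f:
        values = [float(m.group(1)) for m in re.finditer(r"Avg event:\s+([0-9.]+)", f.read())]
    if len(values) >= 2:
        print("%s  elapsed/event = %.2f s   cpu/event = %.2f s" % (logfile, values[0], values[1]))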

Estimating the "Sequential overhead"

The attentive reader will have noticed that the parent process spends several tens of seconds before spawning the processing children. This time does not depend on the number of children and should be subtracted from the total job time in order to get a comparison valid for a large number of events. One way to estimate it is to look at ps -o time,etime once the children are running: on SandyBridge I get 23 seconds CPU and about 30 elapsed, while on Bulldozer it is 38 CPU and about 45 elapsed. Another way is to run a single child process for one event. Doing so I get on SandyBridge
date ; taskset -c 0,2 ~/bin/perf stat cmsRun step2_0_Run2012.py
Wed Aug 29 13:43:06 CEST 2012
29-Aug-2012 13:43:19 CEST  Initiating request to open file file:Run2012A_MinimumBias_RAW_v1_000_190_705_C65A6A58-B781-E111-9D06-003048678110.root
29-Aug-2012 13:43:19 CEST  Successfully opened file file:Run2012A_MinimumBias_RAW_v1_000_190_705_C65A6A58-B781-E111-9D06-003048678110.root
%MSG-s ForkingEventSetupPreFetching:  AfterBeginJob 29-Aug-2012 13:43:23 CEST  BeforeEvents
 prefetching for run 190705
%MSG
%MSG-e BeamHaloPropagator:  AfterBeginJob 29-Aug-2012 13:43:40 CEST BeforeEvents
composite propagator set with inconsistent direction components
EndCap propagator is: 1
 to be set on: 2
%MSG
%MSG-s ForkingEventSetupPreFetching:  AfterBeginJob 29-Aug-2012 13:43:42 CEST  BeforeEvents
  done prefetching
%MSG
29-Aug-2012 13:43:42 CEST  Closed file file:Run2012A_MinimumBias_RAW_v1_000_190_705_C65A6A58-B781-E111-9D06-003048678110.root

 Performance counter stats for 'cmsRun step2_0_Run2012.py':

      38782.936021 task-clock                #    0.760 CPUs utilized          
             20083 context-switches          #    0.001 M/sec                  
                14 CPU-migrations            #    0.000 M/sec                  
           3147466 page-faults               #    0.081 M/sec                  
      139105593922 cycles                    #    3.587 GHz                    
       74452938659 stalled-cycles-frontend   #   53.52% frontend cycles idle   
       45131537368 stalled-cycles-backend    #   32.44% backend  cycles idle   
      160094178396 instructions              #    1.15  insns per cycle        
                                             #    0.47  stalled cycles per insn
       32727626816 branches                  #  843.867 M/sec                  
         495360613 branch-misses             #    1.51% of all branches        

      51.058750622 seconds time elapsed
and on Bulldozer
date ; taskset -c 0,2 ~/bin/perf stat cmsRun step2_0_Run2012.py
Wed Aug 29 13:43:08 CEST 2012
29-Aug-2012 13:43:25 CEST  Initiating request to open file file:Run2012A_MinimumBias_RAW_v1_000_190_705_C65A6A58-B781-E111-9D06-003048678110.root
29-Aug-2012 13:43:25 CEST  Successfully opened file file:Run2012A_MinimumBias_RAW_v1_000_190_705_C65A6A58-B781-E111-9D06-003048678110.root
%MSG-s ForkingEventSetupPreFetching:  AfterBeginJob 29-Aug-2012 13:43:31 CEST  BeforeEvents
 prefetching for run 190705
%MSG
%MSG-e BeamHaloPropagator:  AfterBeginJob 29-Aug-2012 13:43:53 CEST BeforeEvents
composite propagator set with inconsistent direction components
EndCap propagator is: 1
 to be set on: 2
%MSG
%MSG-s ForkingEventSetupPreFetching:  AfterBeginJob 29-Aug-2012 13:43:58 CEST  BeforeEvents
  done prefetching
%MSG
29-Aug-2012 13:43:58 CEST  Closed file file:Run2012A_MinimumBias_RAW_v1_000_190_705_C65A6A58-B781-E111-9D06-003048678110.root

 Performance counter stats for 'cmsRun step2_0_Run2012.py':

      67870.413387 task-clock                #    0.819 CPUs utilized          
             14513 context-switches          #    0.000 M/sec                  
                11 CPU-migrations            #    0.000 M/sec                  
           2117963 page-faults               #    0.031 M/sec                  
      257778465120 cycles                    #    3.798 GHz                     [83.33%]
       27995836514 stalled-cycles-frontend   #   10.86% frontend cycles idle    [83.32%]
      136798686818 stalled-cycles-backend    #   53.07% backend  cycles idle    [33.35%]
      156936035383 instructions              #    0.61  insns per cycle        
                                             #    0.87  stalled cycles per insn [50.01%]
       32193434926 branches                  #  474.337 M/sec                   [66.69%]
         761427550 branch-misses             #    2.37% of all branches         [83.34%]

      82.913481353 seconds time elapsed
which seems to indicate that additional overhead exists in shutting down the application, as confirmed by the time needed to process the single event in the child process: 0.55 s CPU and 33 s elapsed on SB, 0.76 s CPU and 59 s elapsed on BD.


Performance Metrics

As performance metrics I therefore propose to use
  • Total elapsed time corrected for the "sequential overhead", normalized to the number of children (TIME)
  • As above, normalized to the CPU clock (NTIME)
  • Average CPU time per event (evtime)
  • Instructions per cycle (ipc)
  • Stalled cycles per instruction (sci)

In case of perfect scaling they should all be constant for a given workload.
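The exact normalization of the first two metrics is not spelled out above; below is a minimal sketch, assuming TIME is the overhead-corrected elapsed time expressed per event and per child slot (so that it stays constant under perfect scaling) and NTIME folds in the measured clock:

# Sketch of the first two metrics under the interpretation described above.
# elapsed_s  : total wall time of the job
# overhead_s : sequential (startup/shutdown) overhead, see previous section
def time_metric(elapsed_s, overhead_s, n_children, n_events=800):
    # overhead-corrected wall time per event and per child slot;
    # constant if adding children scales perfectly
    return (elapsed_s - overhead_s) * n_children / n_events

def ntime_metric(elapsed_s, overhead_s, n_children, clock_ghz, n_events=800):
    # same, expressed in Gcycles by folding in the measured clock,
    # so runs at different (turbo) frequencies can be compared
    return time_metric(elapsed_s, overhead_s, n_children, n_events) * clock_ghz

# Example with the 2-child SandyBridge run shown earlier
# (965 s elapsed, ~30 s sequential overhead, 3.68 GHz reported by perf):
print(time_metric(965.4, 30.0, 2))          # ~2.3 s per event per child slot
print(ntime_metric(965.4, 30.0, 2, 3.68))   # ~8.6 Gcycles per event per child slot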

| arch | clock (GHz) | cores/modules | children | TIME | NTIME | evtime | ipc | sci |
| SB | 3.68 | 2 | 2 | | | 2.2 | 1.15 | 0.48 |
| SB | 3.76 | 1 | 2 | | | 3.4 | 0.75 | 0.97 |
| SB | 3.50 | 4 | 4 | | | 2.4 | 1.13 | 0.51 |
| SB | 3.67 | 2 | 4 | | | 3.5 | 0.74 | 0.99 |
| SB | 3.50 | 4 | 8 | | | 3.8 | 0.72 | 1.03 |
| BD | 3.98 | 2 | 2 | | | 3.3 | 0.72 | 0.60 |
| BD | 3.97 | 1 | 2 | | | 3.8 | 0.62 | 0.31 |
| BD | 3.30 | 4 | 4 | | | 4.0 | 0.71 | 0.62 |
| BD | 3.94 | 2 | 4 | | | 4.0 | 0.60 | 0.33 |
| BD | 3.15 | 4 | 8 | | | 4.9 | 0.59 | 0.33 |

The decrease in stall rate on Bulldozer is almost entirely due to the backend. My interpretation is that, when running a single process per module, the backend is essentially "starving": running two processes gives it more instructions to work on, with a net decrease in ipc of only 15%, compared with the 35% decrease in the SB case. SB is essentially unable to interleave instructions from the two processes in a way that decreases the total number of stalled cycles, which instead remain essentially at the same level, most probably because a port stalled on a load/store blocks both processes.

Detailed analysis of performance counters

Here I used "perf" as

exec "/usr/libexec/perf.3.0.9-rt26.46.el6rt.x86_64" stat \
-e task-clock,cycles,instructions,\
stalled-cycles-frontend,stalled-cycles-backend,\
cache-misses,branch-misses,L1-dcache-load-misses,\
L1-dcache-store-misses,L1-icache-load-misses,dTLB-load-misses,iTLB-load-misses \
${1+"$@"}
to perform a detailed study of the CPU behavior under CMS-reco load. Notice that "L1-dcache-store-misses" is not measured on SB (and most probably irrelevant anyhow).

The output is parsed with

grep "task-clock" -A 14 $f | sed 's/\#//' | sed 's/not counted/not-counted/' |\
 awk '{ if($1>1000) {printf(" %10.f|", ($1>10000000 ? $1/1000000 : $1/1000)) }else {printf("%10.2f|", $1)} }' | xargs | sed 's/|/|  /g'

Counters are in millions; task-clock and elapsed time are in seconds.
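A similar extraction can also be done with a short Python script instead of the grep/awk one-liner; this is a sketch that relies on the perf stat output format shown above, and field spacing or formatting may differ between perf versions:

# Parse a "perf stat" report into {counter: value}, scaling counters to
# millions and task-clock / elapsed time to seconds, like the awk one-liner.
import re, sys

def parse_perf(path):
    out = {}
    with open(path) as f:
        for line in f:
            if "seconds time elapsed" in line:
                out["time-elapsed"] = float(line.split()[0])
                continue
            m = re.match(r"\s*([\d.,]+)\s+([A-Za-z0-9_-]+)", line)
            if not m:
                continue                           # skip log lines and comments
            value = float(m.group(1).replace(",", ""))
            name = m.group(2)
            out[name] = value / 1000.0 if name == "task-clock" else value / 1e6
    return out

if __name__ == "__main__":
    for name, value in sorted(parse_perf(sys.argv[1]).items()):
        print("%-28s %14.2f" % (name, value))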

single process

Here I ran two identical jobs (reco of 200 events) in parallel, either on different cores/modules or on the same one. Below is reported the result for just one of the two (I checked that they are essentially identical).

| arch | procs/cores(modules) | task-clock | cycles | instructions | stalled-cycles-per-insn | stalled-cycles-frontend | stalled-cycles-backend | cache-misses | branch-misses | L1-dcache-load-misses | L1-dcache-store-misses | L1-icache-load-misses | dTLB-load-misses | iTLB-load-misses | time-elapsed |
| SB | 2/2 | 492 | 1798934 | 2078797 | 0.48 | 1005483 | 583272 | 2385 | 5814 | 23363 | 6431 | 22792 | 1400 | 438 | 500.40 |
| SB | 2/1 | 746 | 2785766 | 2078692 | 0.98 | 2027939 | 1227737 | 2437 | 7148 | 36068 | 9040 | 33842 | 2317 | 1157 | 751.35 |
| BD | 2/2 | 764 | 3017140 | 2101293 | 0.64 | 471632 | 1346266 | 28723 | 11720 | 34708 | 0 | 28792 | 457 | 63 | 773.13 |
| BD | 2/1 | 860 | 3423557 | 2085876 | 0.31 | 650275 | 464821 | 42891 | 14522 | 33683 | 0 | 42918 | 740 | 399 | 864.67 |

full machine

Here I ran the CMS-reco multi-child job for 800 events, either with 4 children on 4 different cores/modules or with 8 children on all 8 CPUs of the machine. As all numbers are absolute, to compare with the single-process runs above one needs to multiply the previous values by 4.

| arch | children/cores(modules) | task-clock | cycles | instructions | stalled-cycles-per-insn | stalled-cycles-frontend | stalled-cycles-backend | cache-misses | branch-misses | L1-dcache-load-misses | L1-dcache-store-misses | L1-icache-load-misses | dTLB-load-misses | iTLB-load-misses | time-elapsed |
| SB | 4/4 | 2055 | 7166785 | 8116383 | 0.50 | 4073222 | 2393313 | 11416 | 21885 | 91241 | 25537 | 87564 | 5462 | 1492 | 544.40 |
| SB | 8/4 | 3288 | 11442783 | 8250653 | 1.02 | 8434951 | 5265410 | 14372 | 27568 | 142934 | 37013 | 133143 | 9296 | 4877 | 452.14 |
| BD | 4/4 | 3448 | 11398653 | 8098368 | 0.62 | 1787857 | 4996042 | 111604 | 45349 | 130577 | 0 | 111654 | 1400 | 169 | 910.91 |
| BD | 8/4 | 4460 | 14067864 | 8236707 | 0.33 | 2721664 | 2238305 | 166525 | 57121 | 133299 | 0 | 166731 | 2432 | 1928 | 611.84 |


Analysis

The number of instructions in each set had better be the same, as I am running the very same job! Cycles and clock ticks give us the relative efficiency already measured in the previous paragraph. The most striking feature is the relatively low number of cache misses on SB (each costs about 350 cycles) compared to BD, where the number is "suspiciously" similar to the L1-icache misses, leading me to conclude that Bulldozer suffers from severe instruction starvation (as already noticed by others). On the other hand, TLB pressure seems to be higher on SB (the iTLB seems to be statically partitioned between the hardware threads when HT is enabled, while the rest is allocated dynamically, so one may have to boot without HT to see whether the iTLB pressure decreases). The rest looks pretty obvious given the two architectures: on SB everything is shared, while on BD the L1 D-caches and the integer units are decoupled; SB starves on D-cache misses, BD on waiting for instructions! We make little use of the floating-point unit (measured in the past to be in the 10-15% range).

It is interesting to note that AMD will address exactly these issues in the third generation of Bulldozer chips, due in 2014 (http://www.anandtech.com/show/6201/amd-details-its-3rd-gen-steamroller-architecture).

no output: memory analysis, multiple single-process jobs vs multi-child

The following is for SandyBridge only.

|  | single | multi |
| Active(anon) | 5872120 kB | 2399204 kB |
| Inactive(anon) | 1065344 kB | 39748 kB |
| AnonPages | 6912140 kB | 2419336 kB |
| AnonHugePages | 1525760 kB | 413696 kB |
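The memory figures above come from the standard Linux /proc/meminfo interface; a minimal sketch for sampling the relevant fields while the jobs are running (the field names are the standard kernel ones, the selection simply matches the table above):

# Sample the anonymous-memory fields of /proc/meminfo shown in the table above.
FIELDS = ("Active(anon)", "Inactive(anon)", "AnonPages", "AnonHugePages")

def meminfo(fields=FIELDS):
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            if key in fields:
                values[key] = rest.strip()         # e.g. "5872120 kB"
    return values

snapshot = meminfo()
for key in FIELDS:
    print("%-16s %s" % (key, snapshot.get(key, "n/a")))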

|  | task-clock | cycles | instructions | stalled-cycles-per-insn | stalled-cycles-frontend | stalled-cycles-backend | cache-misses | cache-references | branch-misses | L1-dcache-misses | L1-icache-load-misses | dTLB-load-misses | iTLB-load-misses | time-elapsed |
| single 200 | 5638 | 19584883 | 13995170 | 1.03 | 14394127 | 9067717 | 8818 | 154531 | 47830 | 241930 | 262786 | 19575 | 10348 | 730.48 |
| multi 1600 | 5234 | 18210861 | 13111027 | 1.02 | 13340268 | 8373038 | 7742 | 143226 | 43647 | 219683 | 258773 | 16334 | 9500 | 690.68 |
|  | 5207 | 18114739 | 13110860 | 1.01 | 13246366 | 8290412 | 7735 | 142141 | 43873 | 219205 | 257417 | 15771 | 9324 | 699.90 |
| single 1600 | 5273 | 18334423 | 13026761 | 1.04 | 13502440 | 8516123 | 8029 | 147536 | 43671 | 227222 | 259905 | 19887 | 9689 | 5305.45 |
|  | 5240 | 18220553 | 13027828 | 1.03 | 13384249 | 8418033 | 8036 | 143419 | 43724 | 223377 | 260260 | 16306 | 8887 | 5287.40 |
|  | 5285 | 18381371 | 13027120 | 1.04 | 13540150 | 8565511 | 8125 | 147161 | 44107 | 227511 | 262432 | 18087 | 10592 | 5317.26 |

comparing different compilers and options

|  | task-clock | cycles | instructions | stalled-cycles-per-insn | stalled-cycles-frontend | stalled-cycles-backend | cache-misses | cache-references | branch-misses | L1-dcache-misses | L1-icache-load-misses | dTLB-load-misses | iTLB-load-misses | x87 | SS | SP | DS | DP | 256 | 256P |
| slc6 462 | 5013 | 17435263 | 12419534 | 1.03 | 12800547 | 8078582 | 7594 | 132957 | 42191 | 210612 | 241597 | 16074 | 8990 | 398201 | 247576 | 77284 | 1257518 | 96563 | 13 | 624 |
| slc5 462 | 4977 | 17306811 | 12425119 | 1.02 | 12670033 | 7973021 | 7694 | 133480 | 41683 | 207433 | 240949 | 15204 | 8055 | 397779 | 247619 | 77183 | 1258356 | 96454 | 19 | 883 |
| slc5 472 | 4919 | 17105662 | 12421884 | 1.00 | 12478254 | 7820428 | 7789 | 126959 | 41453 | 202070 | 236944 | 14524 | 7608 | 379601 | 238271 | 73574 | 1241375 | 101738 | 18 | 940 |
| slc5 472 lto | 4900 | 17035601 | 12419520 | 1.00 | 12435639 | 7834053 | 7703 | 125759 | 40789 | 199904 | 231315 | 14795 | 9411 | 378142 | 239337 | 73740 | 1254869 | 99563 | 17 | 946 |
|  | 4887 | 16989412 | 12421106 | 1.00 | 12397479 | 7789944 | 7683 | 125100 | 40527 | 199571 | 230826 | 14712 | 8295 | 378316 | 239323 | 73744 | 1254680 | 99740 | 16 | 1046 |
| slc5 472 fast | 4819 | 16756850 | 11505375 | 1.08 | 12396676 | 7941921 | 7781 | 130973 | 41878 | 196354 | 261759 | 15204 | 9115 | 374328 | 237974 | 85956 | 1220503 | 109780 | 18 | 597 |
|  | 4827 | 16782211 | 11505588 | 1.08 | 12427121 | 7955621 | 7763 | 133268 | 41643 | 196495 | 259248 | 15476 | 8128 | 375173 | 237960 | 86088 | 1220854 | 110795 | 18 | 1034 |

This new set of measurements compares gcc 4.7.2 with gcc 4.8.0, using CMSSW_6_2_0_pre2 on 400 events of high-multiplicity jets taken in Run2012C (single-child job).

|  | task-clock | cycles | instructions | stalled-cycles-per-insn | stalled-cycles-frontend | stalled-cycles-backend | cache-misses | cache-references | branch-misses | L1-dcache-misses | L1-icache-load-misses | dTLB-load-misses | iTLB-load-misses | x87 | SS | SP | DS | DP | 256 | 256P |
| slc5 472 | 1993 | 7515264 | 9235019 | 0.44 | 4052753 | 2305475 | 5561 | 57469 | 23300 | 110323 | 102935 | 12349 | 1836 | 183590 | 316173 | 65413 | 703837 | 65630 | 7 | 283 |
| slc5 480 | 1944 | 7338662 | 9123443 | 0.43 | 3900763 | 2207627 | 5317 | 52862 | 23007 | 106406 | 92307 | 11042 | 1541 | 182995 | 314487 | 67226 | 698295 | 65062 | 7 | 484 |
| slc5 472 pre4+ | 1944 | 7341880 | 8980153 | 0.44 | 3923068 | 2282037 | 5385 | 46503 | 22503 | 86560 | 99821 | 6036 | 1379 | 186541 | 315257 | 61998 | 705811 | 64776 | 7 | 353 |
| slc6 480 pre4 | 1890 | 7137408 | 8871046 | 0.42 | 3743611 | 2158462 | 5347 | 42273 | 22297 | 85676 | 90606 | 5961 | 1259 | 185756 | 315392 | 67095 | 696821 | 64926 | 6 | 473 |
| slc6 480 avx lto fast pre5 | 1803 | 6816062 | 7911598 | 0.47 | 3701933 | 2141586 | 5255 | 43475 | 21396 | 83783 | 100365 | 5510 | 1221 | 174638 | 388996 | 55077 | 682369 | 39113 | 203 | 15115 |

-- VincenzoInnocente - 29-Aug-2012
