---+ Experimenting with Huge Pages

---++ 23/03/2017

---+++ Building biglibs with huge pages in CMSSW _failed!_

Just adding the libhugetlbfs flags and rebuilding the biglib

<verbatim>
[innocent@vinavx3 CMSSW_9_1_0_pre1]$ diff config/BuildFile.xml $CMSSW_RELEASE_BASE/config/BuildFile.xml
40c40
< <flags BIGOBJ_CXXFLAGS="-Wl,--exclude-libs,ALL -B /usr/share/libhugetlbfs -Wl,--hugetlbfs-align"/>
---
> <flags BIGOBJ_CXXFLAGS="-Wl,--exclude-libs,ALL"/>
</verbatim>

produced, at run time,

<verbatim>
libhugetlbfs: WARNING: Layout problem with segments 0 and 1: Segments would overlap
</verbatim>

and no evidence of any use of huge pages...

---++ 16/03/2017

---+++ !HugePages strikes again

Following a very interesting seminar by Marco Guerri I investigated !HugePages for code again, and discovered that life is now a bit easier with new kernels and distributions (see https://github.com/libhugetlbfs/libhugetlbfs/blob/master/HOWTO):
   * _libhugetlbfs_ is now packaged in the distributions, for instance in CC7 (under /usr/share/libhugetlbfs)
   * it is easier to reserve huge pages
   * one still has to mount hugetlbfs

The configuration as root becomes

<verbatim>
echo 1024 > /proc/sys/vm/nr_overcommit_hugepages
mkdir -p /mnt/hugetlbfs
mount -t hugetlbfs none /mnt/hugetlbfs
chmod 777 /mnt/hugetlbfs/
</verbatim>

One links with the libhugetlbfs linker scripts

<verbatim>
c++ -Wall -B /usr/share/libhugetlbfs -Wl,--hugetlbfs-align ...
</verbatim>

and at runtime one needs

<verbatim>
setenv HUGETLB_ELFMAP RW
</verbatim>

I have used Marco's benchmark https://gitlab.cern.ch/snippets/216 to test it on my workstation

<verbatim>
icache.py > icache.cpp
c++ -O0 -Wall icache.cpp -o icacheStandard
c++ -O0 -Wall icache.cpp -B /usr/share/libhugetlbfs -o icacheHuge -Wl,--hugetlbfs-align
setenv HUGETLB_ELFMAP RW
</verbatim>

and observed

<verbatim>
perf stat -e cycles -e iTLB-load-misses -e iTLB-loads ./icacheStandard

 Performance counter stats for './icacheStandard':

     73122850326  cycles
         1041337  iTLB-load-misses     #    0.17% of all iTLB cache hits
       599872659  iTLB-loads

    18.719051666 seconds time elapsed

perf stat -e cycles -e iTLB-load-misses -e iTLB-loads ./icacheHuge

 Performance counter stats for './icacheHuge':

     63403839270  cycles
          755673  iTLB-load-misses     # 50479.16% of all iTLB cache hits
            1497  iTLB-loads

    16.218030985 seconds time elapsed
</verbatim>

(Caveat: =iTLB-load-misses= may be misleading: it actually counts =itlb_misses.miss_causes_a_walk=, i.e. the sTLB misses, while =iTLB-loads= counts the iTLB misses that actually hit the sTLB.)

While *icacheHuge* was running I looked into _meminfo_ and actually saw the huge pages in use

<verbatim>
cat /proc/meminfo | grep Huge
AnonHugePages:    104448 kB
HugePages_Total:       5
HugePages_Free:        1
HugePages_Rsvd:        1
HugePages_Surp:        5
Hugepagesize:       2048 kB
</verbatim>
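(Aside: for a quick cross-check of the meminfo accounting one does not need the custom linker scripts at all; since kernel 2.6.32 an anonymous mapping can request huge pages explicitly with =mmap(MAP_HUGETLB)=. A minimal sketch, assuming 2MB pages have been reserved as above; the size and the pause are arbitrary.)

<verbatim>
// Minimal sketch: map four 2 MB huge pages explicitly, then pause so that
// "grep Huge /proc/meminfo" can be inspected from another shell.
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>
#include <cstring>

int main() {
  const std::size_t len = 4 * (2UL << 20);  // four 2 MB pages
  void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (p == MAP_FAILED) { std::perror("mmap"); return 1; }
  std::memset(p, 1, len);  // touch the pages so they are actually faulted in
  std::getchar();          // check /proc/meminfo now
  munmap(p, len);
  return 0;
}
</verbatim>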
---+++ effect on Geant4

I have measured the number of iTLB-loads in CMS simulation and found it to be 0.1% of the cycles.
Assuming (from the results above) that each iTLB-load costs about 15 cycles, we can predict a speed gain of about 1.5% from using huge pages to allocate text segments.

---+++ how to verify the status of THP

<verbatim>
egrep 'trans|thp' /proc/vmstat
</verbatim>

---++ 06/04/2010

---+++ !HugePages reloaded

Following the publication of a set of new articles on lwn
   * http://lwn.net/Articles/374424/ (part 1, with links to the others)
   * http://lwn.net/Articles/379748/ (part 5)
I tested again the use of hugepages in cmssw, using pfmon to measure the amount of DTLB misses.
I simply allocated the whole HEAP on hugepages, thanks to the new library described in the articles above.

Results for single and multiple (5) processes:

   * no hugepages
<verbatim>
pfmon command line: pfmon --with-header --aggregate-results --follow-all -e UNHALTED_CORE_CYCLES,DTLB_MISSES:ANY cmsRun recoqcd_RAW2DIGI_RECO_MC_one.py
 1646660047412 UNHALTED_CORE_CYCLES
    1979891584 DTLB_MISSES:ANY
621.742u 3.730s 10:38.23 98.0%  0+0k 0+0io 22512pf+0w

pfmon command line: pfmon --with-header --aggregate-results --follow-all -e UNHALTED_CORE_CYCLES,DTLB_MISSES:ANY cmsRun recoqcd_RAW2DIGI_RECO_MC_multi.py
 1740125704760 UNHALTED_CORE_CYCLES
    2042917165 DTLB_MISSES:ANY
654.580u 7.739s 3:18.09 334.3%  0+0k 0+0io 24619pf+0w
</verbatim>
   * hugepages
<verbatim>
pfmon command line: pfmon --with-header --aggregate-results --follow-all -e UNHALTED_CORE_CYCLES,DTLB_MISSES:ANY /users/innocent/usrlocal/bin/hugectl --library-use-path --heap cmsRun recoqcd_RAW2DIGI_RECO_MC_one.py
 1628420054827 UNHALTED_CORE_CYCLES
    1522338644 DTLB_MISSES:ANY
615.098u 3.509s 10:30.17 98.1%  0+0k 0+0io 23553pf+0w

pfmon command line: pfmon --with-header --aggregate-results --follow-all -e UNHALTED_CORE_CYCLES,DTLB_MISSES:ANY /users/innocent//usrlocal/bin/hugectl --library-use-path --heap cmsRun recoqcd_RAW2DIGI_RECO_MC_multi.py
 1720013658083 UNHALTED_CORE_CYCLES
    1342258581 DTLB_MISSES:ANY
647.690u 7.351s 3:11.93 341.2%  0+0k 0+0io 24344pf+0w
</verbatim>

The observation is that the *DTLB_MISSES*, at a cost of about 10 cycles each, account for a bit more than 1% of the cycles. Using hugepages for the heap reduces the *DTLB_MISSES* by more than 30%, which remains negligible as far as CMSSW is concerned.

One shall also note that hugepages are *NOT* accounted for in the standard memory reports

<verbatim>
10823 innocent pfmon --with-header --aggre     0   772.0K   805.0K     1.3M
 9164 innocent -tcsh                           0     1.7M     1.9M     2.7M
 8953 innocent -tcsh                           0     2.8M     3.0M     3.8M
10974 innocent /usr/bin/python /afs/cern.c     0     7.2M     7.4M     8.3M
10824 innocent cmsRun recoqcd_RAW2DIGI_REC     0   286.8M   287.0M   288.1M
-------------------------------------------------------------------------------
   64 1                                        0   299.1M   300.1M   304.2M

[pcphsft50] ~ $ ~/w1/smem-0.9/smem -t -k
  PID User     Command                      Swap      USS      PSS      RSS
10977 innocent pfmon --with-header --aggre     0     1.1M     1.1M     1.7M
 9164 innocent -tcsh                           0     1.7M     1.9M     2.7M
 8953 innocent -tcsh                           0     2.8M     3.0M     3.8M
10991 innocent /usr/bin/python /afs/cern.c     0     7.5M     7.7M     8.6M
10978 innocent cmsRun recoqcd_RAW2DIGI_REC     0     4.3M    40.8M   226.8M
10984 innocent cmsRun recoqcd_RAW2DIGI_REC     0    63.0M   101.9M   296.8M
10981 innocent cmsRun recoqcd_RAW2DIGI_REC     0    63.2M   101.9M   296.7M
10985 innocent cmsRun recoqcd_RAW2DIGI_REC     0    63.2M   101.9M   296.7M
10983 innocent cmsRun recoqcd_RAW2DIGI_REC     0    63.2M   102.0M   297.0M
10982 innocent cmsRun recoqcd_RAW2DIGI_REC     0    63.2M   102.1M   297.1M
-------------------------------------------------------------------------------
   69 1                                        0   333.1M   564.2M     1.7G

%MSG-w MemoryCheck:  PostProcessPath 03-Apr-2010 17:30:55 CEST Run: 1 Event: 60
MemoryCheck: event : VSIZE 1004.96 0 RSS 226.828 0
 HEAP-ARENA [ SIZE-BYTES 980017152 N-UNUSED-CHUNKS 101 TOP-FREE-BYTES 3917264 ]
 HEAP-MAPPED [ SIZE-BYTES 0 N-CHUNKS 0 ]
 HEAP-USED-BYTES 957265664 HEAP-UNUSED-BYTES 22751488
</verbatim>
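(The =hugectl --heap= wrapper used above puts the whole malloc heap on huge pages. libhugetlbfs also exposes an allocation API with which individual large buffers can be placed on huge pages from inside the code; a minimal sketch, assuming the libhugetlbfs development headers are installed, to be linked with =-lhugetlbfs=.)

<verbatim>
// Minimal sketch of the libhugetlbfs allocation API
// (assumes the libhugetlbfs development package; link with -lhugetlbfs).
extern "C" {
#include <hugetlbfs.h>
}
#include <cstdio>

int main() {
  // GHR_DEFAULT lets the library fall back to small pages
  // when no huge page is available.
  void* buf = get_hugepage_region(64UL << 20, GHR_DEFAULT);  // 64 MB
  if (!buf) { std::perror("get_hugepage_region"); return 1; }
  // ... use buf for a large, hot data structure ...
  free_hugepage_region(buf);
  return 0;
}
</verbatim>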
---++ 23/07/2008

---+++ new performance test using 219

I ran reco 219 on 200 bbar events as produced by !RelVal.

Standard:

<verbatim>
 1732729539271 UNHALTED_CORE_CYCLES
 1539713454028 INSTRUCTIONS_RETIRED
  230642073370 BRANCH_INSTRUCTIONS_RETIRED
    5670263475 MISPREDICTED_BRANCH_RETIRED
  603715307797 INST_RETIRED:LOADS
             0 SIMD_COMP_INST_RETIRED:PACKED_SINGLE:SCALAR_SINGLE:PACKED_DOUBLE:SCALAR_DOUBLE
  316752798881 INST_RETIRED:STORES
  144606669202 X87_OPS_RETIRED:ANY
  629460043234 RESOURCE_STALLS:ANY
    2522977571 BUS_TRANS_ANY:ALL_AGENTS
    2737708714 BUS_DRDY_CLOCKS:ALL_AGENTS
      69208177 BUS_BNR_DRV:ALL_AGENTS
   36272303280 LAST_LEVEL_CACHE_REFERENCES
     427617760 LAST_LEVEL_CACHE_MISSES
  217670175552 CPU_CLK_UNHALTED:BUS
-----------------------------------------------------------------------
Ratios:
CPI: 1.1254
load instructions %: 39.210%
store instructions %: 20.572%
load and store instructions %: 59.782%
resource stalls % (of cycles): 36.328%
branch instructions %: 14.980%
% of branch instr. mispredicted: 2.458%
% of l2 loads missed: 1.179%
bus utilization %: 2.318%
data bus utilization %: 1.258%
bus not ready %: 0.064%
comp. SIMD inst. ('new FP') %: 0.000%
comp. x87 instr. ('old FP') %: 9.392%
</verbatim>

With !TCMALLOC:

<verbatim>
 1584570646894 UNHALTED_CORE_CYCLES
 1509316585174 INSTRUCTIONS_RETIRED
  224658865844 BRANCH_INSTRUCTIONS_RETIRED
    5025722897 MISPREDICTED_BRANCH_RETIRED
  597442500267 INST_RETIRED:LOADS
             0 SIMD_COMP_INST_RETIRED:PACKED_SINGLE:SCALAR_SINGLE:PACKED_DOUBLE:SCALAR_DOUBLE
  306462564773 INST_RETIRED:STORES
  144577985459 X87_OPS_RETIRED:ANY
  527521246432 RESOURCE_STALLS:ANY
    1961824853 BUS_TRANS_ANY:ALL_AGENTS
    2126966626 BUS_DRDY_CLOCKS:ALL_AGENTS
      67851327 BUS_BNR_DRV:ALL_AGENTS
   33442404368 LAST_LEVEL_CACHE_REFERENCES
     355141679 LAST_LEVEL_CACHE_MISSES
  199109646642 CPU_CLK_UNHALTED:BUS
-----------------------------------------------------------------------
Ratios:
CPI: 1.0499
load instructions %: 39.584%
store instructions %: 20.305%
load and store instructions %: 59.888%
resource stalls % (of cycles): 33.291%
branch instructions %: 14.885%
% of branch instr. mispredicted: 2.237%
% of l2 loads missed: 1.062%
bus utilization %: 1.971%
data bus utilization %: 1.068%
bus not ready %: 0.068%
comp. SIMD inst. ('new FP') %: 0.000%
comp. x87 instr. ('old FP') %: 9.579%
</verbatim>
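(For the record, the "Ratios" blocks above are plain quotients of the raw counters; a small sketch of the arithmetic, with the values of the standard 219 run, a subset shown.)

<verbatim>
// How the "Ratios" above derive from the raw pfmon counters
// (values from the standard 219 run).
#include <cstdio>

int main() {
  const double cycles   = 1732729539271.;  // UNHALTED_CORE_CYCLES
  const double instr    = 1539713454028.;  // INSTRUCTIONS_RETIRED
  const double loads    =  603715307797.;  // INST_RETIRED:LOADS
  const double stores   =  316752798881.;  // INST_RETIRED:STORES
  const double stalls   =  629460043234.;  // RESOURCE_STALLS:ANY
  const double branches =  230642073370.;  // BRANCH_INSTRUCTIONS_RETIRED
  const double mispred  =    5670263475.;  // MISPREDICTED_BRANCH_RETIRED

  std::printf("CPI: %.4f\n", cycles / instr);                                         // 1.1254
  std::printf("load instructions %%: %.3f\n", 100. * loads / instr);                  // 39.210
  std::printf("store instructions %%: %.3f\n", 100. * stores / instr);                // 20.572
  std::printf("resource stalls %% (of cycles): %.3f\n", 100. * stalls / cycles);      // 36.328
  std::printf("%% of branch instr. mispredicted: %.3f\n", 100. * mispred / branches); // 2.458
  return 0;
}
</verbatim>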
---++ 23/07/2008

---+++ new performance test using 210 (pre9)

I ran reco 210_pre9 on 200 bbar events as produced by !RelVal.
I ran the standard code, with the patch to =ROOT::Cintex::Allocate_code= applied (serving pages out of a 2MB block pool), then the same using tcmalloc, and finally using hugepages

<verbatim>
cat /proc/meminfo | grep -i huge
HugePages_Total:  1000
HugePages_Free:    678
HugePages_Rsvd:      1
HugePages_Surp:      0
Hugepagesize:     2048 kB
</verbatim>

I even ran 8 =cmsRun= at the same time. The use of _hugepages_ does not have any effect on performance; reco scales up to 8 cpus (on lxbuild114) w/o problems.

---++ 22/07/2008

---+++ running with 210

<verbatim>
cmsDriver.py BJets_Pt_50_120_cfi -s GEN -n 100
cmsDriver.py BJets_Pt_50_120_cfi -s SIM -n 100 --filein file:BJets_Pt_50_120_cfi_GEN.root
</verbatim>

I did not manage to get =--customize= to work: I copied Gabriele's file locally, ran =cmsDriver= with =--no_exec= and used pico to mix the two... In the end I followed Sharham's instructions and copied a sim file from dbs

<verbatim>
rfcp /castor/cern.ch/cms/store/relval/2008/7/20/RelVal-RelValBJets_Pt_50_120-1216579576/RelValBJets_Pt_50_120/GEN-SIM-DIGI-RAW-HLTDEBUG/CMSSW_2_1_0_pre9-RelVal-1216579576-STARTUP_V4-unmerged/0000/00A1B29A-4457-DD11-AF37-000423D9939C.root .
</verbatim>

<verbatim>
cmsDriver.py BJets_Pt_50_120_cfi -s GEN,SIM,DIGI,L1,DIGI2RAW -n 100 --conditions FrontierConditions_GlobalTag,STARTUP_V4::All --datatier 'GEN-SIM-DIGI-RAW' --eventcontent FEVTDEBUGHLT
cp $CMSSW_RELEASE_BASE/src/Configuration/Examples/python/RecoExample_cfg.py .
cmsRun RecoExample_cfg.py
</verbatim>

---++ 16/07/2008

---+++ testing memory churn in multi-thread

Six threads on lxbuild114 (8 cores); a sketch of the allocation pattern being measured follows the table.

| *method* | *malloc* || *tcmalloc* ||
| | *Real time* | *cpu/real* | *Real time* | *cpu/real* |
| naive | 81 | 534% | 300 | 138% |
| boost-pool | 35 | 462% | 36 | 504% |
| large chunks | 34 | 477% | 36 | 506% |
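(The benchmark source is not reproduced here; purely as an illustration, the _naive_ method amounts to several threads doing small, short-lived allocations straight from the allocator, the pattern that stresses the allocator's multi-threaded behaviour. A hypothetical sketch, not the actual test code.)

<verbatim>
// Hypothetical sketch of the "naive" churn pattern: six threads
// allocating and freeing small objects straight from the allocator.
// Build with: c++ -O2 -pthread churn.cpp
#include <cstddef>
#include <thread>
#include <vector>

static void churn(std::size_t iterations) {
  for (std::size_t i = 0; i < iterations; ++i) {
    // small, short-lived allocation: stresses the allocator's fast path
    // and any locking it does under concurrency
    std::vector<double>* v = new std::vector<double>(16 + i % 64);
    (*v)[0] = static_cast<double>(i);
    delete v;
  }
}

int main() {
  std::vector<std::thread> pool;
  for (int t = 0; t < 6; ++t) pool.emplace_back(churn, 10 * 1000 * 1000);
  for (auto& th : pool) th.join();
  return 0;
}
</verbatim>

The _boost-pool_ and _large chunks_ variants avoid going back to the allocator for every object, which presumably is why they behave similarly under both mallocs.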
---++ 15/07/2008

---+++ summary of measurements of cms reco

| *time per event [s]* | *malloc* | *tcmalloc* |
| reco modules | 2.8 | 2.5 |
| output module | 0.25 | 0.24 |

| *counter* | *malloc* | *tcmalloc* |
| UNHALTED_CORE_CYCLES | 970293569059 | 903705695000 |
| INSTRUCTIONS_RETIRED | 943404623915 | 931115219416 |
| BRANCH_INSTRUCTIONS_RETIRED | 153400155587 | 151185416169 |
| MISPREDICTED_BRANCH_RETIRED | 4283583203 | 4032950096 |
| INST_RETIRED:LOADS | 366093204853 | 363136780290 |
| SIMD_COMP_INST_RETIRED | 0 | 0 |
| INST_RETIRED:STORES | 198866522599 | 194064751706 |
| X87_OPS_RETIRED:ANY | 57650513756 | 57567303752 |
| RESOURCE_STALLS:ANY | 305133549321 | 259105502466 |
| BUS_TRANS_ANY:ALL_AGENTS | 1103247481 | 965820152 |
| BUS_DRDY_CLOCKS:ALL_AGENTS | 1164574768 | 1004396040 |
| BUS_BNR_DRV:ALL_AGENTS | 365249031 | 346899990 |
| LAST_LEVEL_CACHE_REFERENCES | 21200587333 | 20041091763 |
| LAST_LEVEL_CACHE_MISSES | 187148195 | 168698808 |
| CPU_CLK_UNHALTED:BUS | 138684654138 | 129149087560 |

| *ratio* | *malloc* | *tcmalloc* |
| CPI | 1.0285 | 0.9706 |
| load instructions % | 38.806% | 39.000% |
| store instructions % | 21.080% | 20.842% |
| load and store instructions % | 59.885% | 59.842% |
| resource stalls % (of cycles) | 31.448% | 28.671% |
| branch instructions % | 16.260% | 16.237% |
| % of branch instr. mispredicted | 2.792% | 2.668% |
| % of l2 loads missed | 0.883% | 0.842% |
| bus utilization % | 1.591% | 1.496% |
| data bus utilization % | 0.840% | 0.778% |
| bus not ready % | 0.527% | 0.537% |
| comp. SIMD inst. ('new FP') % | 0.000% | 0.000% |
| comp. x87 instr. ('old FP') % | 6.111% | 6.183% |

---++ 07/07/2008

---+++ allocating huge pages on 32 bit

tcmalloc fails with this message:

<verbatim>
Check failed: (fstatfs(hugetlb_fd, &sfs)) != -1: Bad address
</verbatim>

#HugePages
---++ 06/07/2008 (sunday) _it's raining!_

---+++ allocating huge pages

*Fasten your seatbelt!* Following Giulio's recipe I tried to convince tcmalloc to use [[http://www.mjmwired.net/kernel/Documentation/vm/hugetlbpage.txt][hugetlbpages]]. I used lxbuild066 (standard slc4) and pcphsft50 (new kernel).

Setting up:

<verbatim>
sudo /bin/tcsh -f -c "echo 1000 > /proc/sys/vm/nr_hugepages"
sudo /bin/tcsh -f -c "mkdir -p /mnt/hugepage;mount -t hugetlbfs -o uid=1039,gid=1399,mode=0775 none /mnt/hugepage"
setenv TCMALLOC_MEMFS_MALLOC_PATH /mnt/hugepage/
</verbatim>

Caveat: the lxbuild machines have little contiguous memory to allocate hugetlb, so one gets only 12 pages (and as few as 6 on lxbuild065)

<verbatim>
cat /proc/meminfo | grep -i huge
HugePages_Total:    12
HugePages_Free:     12
Hugepagesize:     2048 kB
</verbatim>

This is enough for a small test with scimark2:

<verbatim>
unsetenv TCMALLOC_MEMFS_MALLOC_PATH
[lxbuild066] ~/public/multicore > ~/w1/scimark2/scimark2 -large > scimark2_tc&
[1] 2257
[lxbuild066] ~/public/multicore > cat /proc/meminfo | grep -i huge
HugePages_Total:    12
HugePages_Free:     12
Hugepagesize:     2048 kB
[lxbuild066] ~/public/multicore > cat /proc/meminfo | grep -i huge
HugePages_Total:    12
HugePages_Free:     12
Hugepagesize:     2048 kB
[lxbuild066] ~/public/multicore >
[1]    Done          ~/w1/scimark2/scimark2 -large > scimark2_tc
[lxbuild066] ~/public/multicore > cat scimark2_tc
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:          473.27
FFT             Mflops:    55.27    (N=1048576)
SOR             Mflops:   846.92    (1000 x 1000)
MonteCarlo:     Mflops:   191.06
Sparse matmult  Mflops:   455.11    (N=100000, nz=1000000)
LU              Mflops:   818.00    (M=1000, N=1000)
--------------------------------------------------
----------------------------------------------------
setenv TCMALLOC_MEMFS_MALLOC_PATH /mnt/hugepage/testTC
[lxbuild066] ~/public/multicore > ~/w1/scimark2/scimark2 -large > scimark2_tc_hugetlb &
[1] 3585
[lxbuild066] ~/public/multicore > cat /proc/meminfo | grep -i huge
HugePages_Total:    12
HugePages_Free:      4
Hugepagesize:     2048 kB
[lxbuild066] ~/public/multicore > cat /proc/meminfo | grep -i huge
HugePages_Total:    12
HugePages_Free:      4
Hugepagesize:     2048 kB
[1]    Done          ~/w1/scimark2/scimark2 -large > scimark2_tc_hugetlb
[lxbuild066] ~/public/multicore > cat scimark2_tc_hugetlb
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:          452.73
FFT             Mflops:    34.39    (N=1048576)
SOR             Mflops:   849.27    (1000 x 1000)
MonteCarlo:     Mflops:   191.06
Sparse matmult  Mflops:   485.31    (N=100000, nz=1000000)
LU              Mflops:   703.61    (M=1000, N=1000)
</verbatim>

So it works, but it is slower! Btw, scimark2 is faster with standard malloc (mainly because of the =MonteCarlo= kernel, which does not stress memory much)

<verbatim>
unsetenv LD_PRELOAD $TCMALLOC_ROOT/lib/libtcmalloc_minimal.so
[lxbuild066] ~/public/multicore > ~/w1/scimark2/scimark2 -large
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:          522.15
FFT             Mflops:    55.42    (N=1048576)
SOR             Mflops:   851.63    (1000 x 1000)
MonteCarlo:     Mflops:   403.66
Sparse matmult  Mflops:   471.89    (N=100000, nz=1000000)
LU              Mflops:   828.16    (M=1000, N=1000)
</verbatim>
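(Since checking the pool by hand gets repetitive, here is a small convenience sketch doing programmatically what the =grep= above does.)

<verbatim>
// Convenience sketch: print the huge-page counters from /proc/meminfo,
// the analogue of "cat /proc/meminfo | grep -i huge" used throughout this page.
#include <fstream>
#include <iostream>
#include <string>

int main() {
  std::ifstream meminfo("/proc/meminfo");
  std::string line;
  while (std::getline(meminfo, line))
    if (line.find("Huge") != std::string::npos) std::cout << line << '\n';
  return 0;
}
</verbatim>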
---+++ summary of the measurement below

Using *tcmalloc*, cms reconstruction for b-jets is about 10% faster. A system analysis shows that the number of _retired instructions_ is essentially the same: most of the saving comes from fewer _resource stalls_ and somewhat fewer _mispredicted branches_.

---+++ measuring cms reconstruction

*These instructions are valid for CMSSW 2_0_9.*

Run the preparatory sequence =cmsDriver.py B_JETS -s STEP -n 100= with =STEP= set in turn to =GEN=, =SIM=, =DIGI=; then run =~/pyModules/pfmon_deluxe.py cmsDriver.py B_JETS -s RECO -n 100=

Results:

<verbatim>
TimeReport ---------- Event  Summary ---[sec]----
TimeReport CPU/event = 3.046750 Real/event = 3.059505
TimeReport ---------- Path   Summary ---[sec]----
TimeReport             per event          per path-run
TimeReport        CPU       Real        CPU       Real Name
TimeReport   2.798615   2.811101   2.798615   2.811101 reconstruction_step
TimeReport -------End-Path   Summary ---[sec]----
TimeReport             per event       per endpath-run
TimeReport        CPU       Real        CPU       Real Name
TimeReport   0.247656   0.247859   0.247656   0.247859 outpath

  970293569059 UNHALTED_CORE_CYCLES
  943404623915 INSTRUCTIONS_RETIRED
  153400155587 BRANCH_INSTRUCTIONS_RETIRED
    4283583203 MISPREDICTED_BRANCH_RETIRED
  366093204853 INST_RETIRED:LOADS
             0 SIMD_COMP_INST_RETIRED:PACKED_SINGLE:SCALAR_SINGLE:PACKED_DOUBLE:SCALAR_DOUBLE
  198866522599 INST_RETIRED:STORES
   57650513756 X87_OPS_RETIRED:ANY
  305133549321 RESOURCE_STALLS:ANY
    1103247481 BUS_TRANS_ANY:ALL_AGENTS
    1164574768 BUS_DRDY_CLOCKS:ALL_AGENTS
     365249031 BUS_BNR_DRV:ALL_AGENTS
   21200587333 LAST_LEVEL_CACHE_REFERENCES
     187148195 LAST_LEVEL_CACHE_MISSES
  138684654138 CPU_CLK_UNHALTED:BUS
-----------------------------------------------------------------------
Ratios:
CPI: 1.0285
load instructions %: 38.806%
store instructions %: 21.080%
load and store instructions %: 59.885%
resource stalls % (of cycles): 31.448%
branch instructions %: 16.260%
% of branch instr. mispredicted: 2.792%
% of l2 loads missed: 0.883%
bus utilization %: 1.591%
data bus utilization %: 0.840%
bus not ready %: 0.527%
comp. SIMD inst. ('new FP') %: 0.000%
comp. x87 instr. ('old FP') %: 6.111%
</verbatim>

Repeated with !TCMALLOC:

<verbatim>
TimeReport ---------- Event  Summary ---[sec]----
TimeReport CPU/event = 2.776694 Real/event = 2.781829
TimeReport ---------- Path   Summary ---[sec]----
TimeReport             per event          per path-run
TimeReport        CPU       Real        CPU       Real Name
TimeReport   2.536359   2.541528   2.536359   2.541528 reconstruction_step
TimeReport -------End-Path   Summary ---[sec]----
TimeReport             per event       per endpath-run
TimeReport        CPU       Real        CPU       Real Name
TimeReport   0.240255   0.240235   0.240255   0.240235 outpath

  903705695000 UNHALTED_CORE_CYCLES
  931115219416 INSTRUCTIONS_RETIRED
  151185416169 BRANCH_INSTRUCTIONS_RETIRED
    4032950096 MISPREDICTED_BRANCH_RETIRED
  363136780290 INST_RETIRED:LOADS
             0 SIMD_COMP_INST_RETIRED:PACKED_SINGLE:SCALAR_SINGLE:PACKED_DOUBLE:SCALAR_DOUBLE
  194064751706 INST_RETIRED:STORES
   57567303752 X87_OPS_RETIRED:ANY
  259105502466 RESOURCE_STALLS:ANY
     965820152 BUS_TRANS_ANY:ALL_AGENTS
    1004396040 BUS_DRDY_CLOCKS:ALL_AGENTS
     346899990 BUS_BNR_DRV:ALL_AGENTS
   20041091763 LAST_LEVEL_CACHE_REFERENCES
     168698808 LAST_LEVEL_CACHE_MISSES
  129149087560 CPU_CLK_UNHALTED:BUS
-----------------------------------------------------------------------
Ratios:
CPI: 0.9706
load instructions %: 39.000%
store instructions %: 20.842%
load and store instructions %: 59.842%
resource stalls % (of cycles): 28.671%
branch instructions %: 16.237%
% of branch instr. mispredicted: 2.668%
% of l2 loads missed: 0.842%
bus utilization %: 1.496%
data bus utilization %: 0.778%
bus not ready %: 0.537%
comp. SIMD inst. ('new FP') %: 0.000%
comp. x87 instr. ('old FP') %: 6.183%
</verbatim>
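(A quick cross-check of the "about 10% faster" statement above, recomputed from the two TimeReports and cycle counts; the per-event CPU gain comes out slightly below 10%.)

<verbatim>
// Cross-check of the tcmalloc gain from the numbers above.
#include <cstdio>

int main() {
  const double cpuMalloc = 3.046750, cpuTcmalloc = 2.776694;            // CPU/event [s]
  const double cycMalloc = 970293569059., cycTcmalloc = 903705695000.;  // UNHALTED_CORE_CYCLES

  std::printf("CPU/event gain: %.1f%%\n", 100. * (1. - cpuTcmalloc / cpuMalloc));  // ~8.9%
  std::printf("cycle gain:     %.1f%%\n", 100. * (1. - cycTcmalloc / cycMalloc));  // ~6.9%
  return 0;
}
</verbatim>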
---++ 04/07/2008

#TcMalloc
---+++ TCMalloc

I've installed [[http://goog-perftools.sourceforge.net/doc/tcmalloc.html][TCMalloc]] (using Giulio's recipe) as

<verbatim>
setenv TCMALLOC_ROOT /afs/cern.ch/user/i/innocent/w1/tcmalloc
mkdir -p $TCMALLOC_ROOT
cd $TCMALLOC_ROOT
wget http://google-perftools.googlecode.com/files/google-perftools-0.97.tar.gz
tar xzvf google-perftools-0.97.tar.gz
cd google-perftools-0.97
./configure --prefix $TCMALLOC_ROOT --enable-frame-pointers
make
make install
</verbatim>

To use it, just

<verbatim>
setenv TCMALLOC_ROOT /afs/cern.ch/user/i/innocent/w1/tcmalloc
setenv LD_PRELOAD $TCMALLOC_ROOT/lib/libtcmalloc_minimal.so
</verbatim>

For 32 bit the best is to initialize a !CMSSW environment first and then

<verbatim>
setenv TCMALLOC_ROOT /afs/cern.ch/user/i/innocent/w1/tcmalloc32
mkdir -p $TCMALLOC_ROOT
cd $TCMALLOC_ROOT
wget http://google-perftools.googlecode.com/files/google-perftools-0.97.tar.gz
tar xzvf google-perftools-0.97.tar.gz
cd google-perftools-0.97
linux32 ./configure --prefix $TCMALLOC_ROOT --enable-frame-pointers
linux32 make
linux32 make install
</verbatim>

=ldd= should look like this

<verbatim>
ldd /afs/cern.ch/user/i/innocent/w1/tcmalloc32/lib/libtcmalloc_minimal.so
        linux-gate.so.1 =>  (0xf7f8d000)
        libstdc++.so.6 => /afs/cern.ch/cms/sw/slc4_ia32_gcc345/external/gcc/3.4.5-cms/lib/libstdc++.so.6 (0xf7e91000)
        libm.so.6 => /lib/tls/libm.so.6 (0xf7e48000)
        libc.so.6 => /lib/tls/libc.so.6 (0xf7d1c000)
        libgcc_s.so.1 => /afs/cern.ch/cms/sw/slc4_ia32_gcc345/external/gcc/3.4.5-cms/lib/libgcc_s.so.1 (0xf7d12000)
        /lib/ld-linux.so.2 (0x008bc000)
</verbatim>

Once preloaded, one will get =ERROR: ld.so: object '/afs/cern.ch/user/i/innocent/w1/tcmalloc32/lib/libtcmalloc_minimal.so' from LD_PRELOAD cannot be preloaded: ignored.= for every possible shell command; just ignore it...

-- Main.VincenzoInnocente - 02-Mar-2011