Experimenting with Huge Pages
23/03/2017
Building biglibs with huge pages in CMSSW
failed!
Just adding the flags and rebuilding the biglib:
[innocent@vinavx3 CMSSW_9_1_0_pre1]$ diff config/BuildFile.xml $CMSSW_RELEASE_BASE/config/BuildFile.xml
40c40
< <flags BIGOBJ_CXXFLAGS="-Wl,--exclude-libs,ALL -B /usr/share/libhugetlbfs -Wl,--hugetlbfs-align"/>
---
> <flags BIGOBJ_CXXFLAGS="-Wl,--exclude-libs,ALL"/>
at run time I got
libhugetlbfs: WARNING: Layout problem with segments 0 and 1:
Segments would overlap
and no evidence of use of huge pages...
16/03/2017
HugePages strikes again
Following a very interesting seminar by Marco Guerri I investigated again HugePages for code and discovered that life is now a bit easier with new kernels and distributions...
see
https://github.com/libhugetlbfs/libhugetlbfs/blob/master/HOWTO
- libhugetlbfs is now in the distributions: CC7 for instance (under /usr/share/libhugetlbfs)
- it is easier to reserve huge pages
- one still has to mount hugetlbfs
configuration as root becomes
echo 1024 > /proc/sys/vm/nr_overcommit_hugepages
mkdir -p /mnt/hugetlbfs
mount -t hugetlbfs none /mnt/hugetlbfs
chmod 777 /mnt/hugetlbfs/
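For reference, the mount point is what explicit huge-page mappings are served from. A minimal sketch (my own illustration, with a hypothetical file name /mnt/hugetlbfs/demo; not part of the original test) of mapping one 2 MB huge page through it:
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstring>
#include <cstdio>

int main() {
  const size_t len = 2 * 1024 * 1024;   // one 2 MB huge page
  int fd = open("/mnt/hugetlbfs/demo", O_CREAT | O_RDWR, 0644);
  if (fd < 0) { std::perror("open"); return 1; }
  // hugetlbfs backs the mapping with huge pages; no ftruncate is needed
  void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (p == MAP_FAILED) { std::perror("mmap"); return 1; }
  std::memset(p, 0, len);               // touch it so the page is really faulted in
  std::printf("huge page mapped at %p\n", p);
  munmap(p, len);
  close(fd);
  unlink("/mnt/hugetlbfs/demo");
  return 0;
}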
linking with the libhugetlbfs loader:
c++ -Wall -B /usr/share/libhugetlbfs -Wl,--hugetlbfs-align ...
at runtime one needs
setenv HUGETLB_ELFMAP RW
I have used Marco's benchmark
https://gitlab.cern.ch/snippets/216
to test it on my workstation
icache.py > icache.cpp
c++ -O0 -Wall icache.cpp -o icacheStandard
c++ -O0 -Wall icache.cpp -B /usr/share/libhugetlbfs -o icacheHuge -Wl,--hugetlbfs-align
setenv HUGETLB_ELFMAP RW
and observed
perf stat -e cycles -e iTLB-load-misses -e iTLB-loads ./icacheStandard
Performance counter stats for './icacheStandard':
73122850326 cycles
1041337 iTLB-load-misses # 0.17% of all iTLB cache hits
599872659 iTLB-loads
18.719051666 seconds time elapsed
perf stat -e cycles -e iTLB-load-misses -e iTLB-loads ./icacheHuge
Performance counter stats for './icacheHuge':
63403839270 cycles
755673 iTLB-load-misses # 50479.16% of all iTLB cache hits
1497 iTLB-loads
16.218030985 seconds time elapsed
(caveat: iTLB-load-misses may be misleading: it actually counts "itlb_misses.miss_causes_a_walk", i.e. the sTLB misses, while iTLB-loads counts the iTLB misses that actually hit the sTLB)
While icacheHuge was running I looked into meminfo and actually saw the huge pages in use:
cat /proc/meminfo | grep Huge
AnonHugePages: 104448 kB
HugePages_Total: 5
HugePages_Free: 1
HugePages_Rsvd: 1
HugePages_Surp: 5
Hugepagesize: 2048 kB
effect on Geant4
I have measured the number of iTLB-loads in the CMS simulation and found it to be 0.1% of the cycles.
Assuming (from the results above) that each iTLB-load costs about 15 cycles, we can predict a speed gain of about 1.5% (0.1% of cycles x 15 cycles per load) from using huge pages to allocate the text segments.
how to verify the status of THP
egrep 'trans|thp' /proc/vmstat
06/04/2010
HugePages reloaded
Following the publication of a set of new articles on lwn
http://lwn.net/Articles/374424/
(part 1, with links to the others)
http://lwn.net/Articles/379748/
(part 5)
I tested again the use of hugepages in cmssw, using pfmon to measure the amount of DTLB misses.
I just allocate the whole HEAP on hugepages thanks to the new library described in the articles above
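As an illustration of what backing the heap with huge pages buys, here is my own sketch using plain mmap with MAP_HUGETLB (not what hugectl/libhugetlbfs do internally): back a large anonymous buffer with 2 MB pages and watch HugePages_Free drop in /proc/meminfo.
#include <sys/mman.h>
#include <cstring>
#include <cstdio>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000   // x86 Linux value, in case the headers predate it
#endif

int main() {
  const size_t len = 64 * 1024 * 1024;  // 64 MB = 32 huge pages of 2 MB
  void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (p == MAP_FAILED) { std::perror("mmap(MAP_HUGETLB)"); return 1; }
  std::memset(p, 0, len);               // touch it: HugePages_Free drops in /proc/meminfo
  std::printf("64 MB backed by huge pages at %p\n", p);
  munmap(p, len);
  return 0;
}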
results for single and multiple (5) processes:
pfmon command line: pfmon --with-header --aggregate-results --follow-all -e UNHALTED_CORE_CYCLES,DTLB_MISSES:ANY
cmsRun recoqcd_RAW2DIGI_RECO_MC_one.py
1646660047412 UNHALTED_CORE_CYCLES
1979891584 DTLB_MISSES:ANY
621.742u 3.730s 10:38.23 98.0% 0+0k 0+0io 22512pf+0w
pfmon command line: pfmon --with-header --aggregate-results --follow-all -e UNHALTED_CORE_CYCLES,DTLB_MISSES:ANY
cmsRun recoqcd_RAW2DIGI_RECO_MC_multi.py
1740125704760 UNHALTED_CORE_CYCLES
2042917165 DTLB_MISSES:ANY
654.580u 7.739s 3:18.09 334.3% 0+0k 0+0io 24619pf+0w
pfmon command line: pfmon --with-header --aggregate-results --follow-all -e UNHALTED_CORE_CYCLES,DTLB_MISSES:ANY
/users/innocent/usrlocal/bin/hugectl --library-use-path --heap cmsRun recoqcd_RAW2DIGI_RECO_MC_one.py
1628420054827 UNHALTED_CORE_CYCLES
1522338644 DTLB_MISSES:ANY
615.098u 3.509s 10:30.17 98.1% 0+0k 0+0io 23553pf+0w
pfmon command line: pfmon --with-header --aggregate-results --follow-all -e UNHALTED_CORE_CYCLES,DTLB_MISSES:ANY
/users/innocent//usrlocal/bin/hugectl --library-use-path --heap cmsRun recoqcd_RAW2DIGI_RECO_MC_multi.py
1720013658083 UNHALTED_CORE_CYCLES
1342258581 DTLB_MISSES:ANY
647.690u 7.351s 3:11.93 341.2% 0+0k 0+0io 24344pf+0w
The observation is that DTLB_MISSES account for a bit more than 1% of the cycles (each DTLB miss costs about 10 cycles, and ~2e9 misses x 10 cycles is ~1.2% of the ~1.65e12 cycles above).
Using hugepages for the heap reduces the DTLB_MISSES by more than 30%, which is still negligible as far as CMSSW is concerned.
One shall also note that hugepages are NOT accounted in the standard memory reports:
PID User Command Swap USS PSS RSS
10823 innocent pfmon --with-header --aggre 0 772.0K 805.0K 1.3M
9164 innocent -tcsh 0 1.7M 1.9M 2.7M
8953 innocent -tcsh 0 2.8M 3.0M 3.8M
10974 innocent /usr/bin/python /afs/cern.c 0 7.2M 7.4M 8.3M
10824 innocent cmsRun recoqcd_RAW2DIGI_REC 0 286.8M 287.0M 288.1M
-------------------------------------------------------------------------------
64 1 0 299.1M 300.1M 304.2M
[pcphsft50] ~ $ ~/w1/smem-0.9/smem -t -k
PID User Command Swap USS PSS RSS
10977 innocent pfmon --with-header --aggre 0 1.1M 1.1M 1.7M
9164 innocent -tcsh 0 1.7M 1.9M 2.7M
8953 innocent -tcsh 0 2.8M 3.0M 3.8M
10991 innocent /usr/bin/python /afs/cern.c 0 7.5M 7.7M 8.6M
10978 innocent cmsRun recoqcd_RAW2DIGI_REC 0 4.3M 40.8M 226.8M
10984 innocent cmsRun recoqcd_RAW2DIGI_REC 0 63.0M 101.9M 296.8M
10981 innocent cmsRun recoqcd_RAW2DIGI_REC 0 63.2M 101.9M 296.7M
10985 innocent cmsRun recoqcd_RAW2DIGI_REC 0 63.2M 101.9M 296.7M
10983 innocent cmsRun recoqcd_RAW2DIGI_REC 0 63.2M 102.0M 297.0M
10982 innocent cmsRun recoqcd_RAW2DIGI_REC 0 63.2M 102.1M 297.1M
-------------------------------------------------------------------------------
69 1 0 333.1M 564.2M 1.7G
%MSG-w MemoryCheck: PostProcessPath 03-Apr-2010 17:30:55 CEST Run: 1 Event: 60
MemoryCheck: event : VSIZE 1004.96 0 RSS 226.828 0 HEAP-ARENA [ SIZE-BYTES 980017152 N-UNUSED-CHUNKS 101 TOP-FREE-BYTES 3917264 ] HEAP-MAPPED [ SIZE-BYTES 0 N-CHUNKS 0 ] HEAP-USED-BYTES 957265664 HEAP-UNUSED-BYTES 22751488
23/07/2008
new performance test using 219
I ran reco 219 on 200 bbar events as produced by RelVal.
standard
1732729539271 UNHALTED_CORE_CYCLES
1539713454028 INSTRUCTIONS_RETIRED
230642073370 BRANCH_INSTRUCTIONS_RETIRED
5670263475 MISPREDICTED_BRANCH_RETIRED
603715307797 INST_RETIRED:LOADS
0 SIMD_COMP_INST_RETIRED:PACKED_SINGLE:SCALAR_SINGLE:PACKED_DOUBLE:SCALAR_DOUBLE
316752798881 INST_RETIRED:STORES
144606669202 X87_OPS_RETIRED:ANY
629460043234 RESOURCE_STALLS:ANY
2522977571 BUS_TRANS_ANY:ALL_AGENTS
2737708714 BUS_DRDY_CLOCKS:ALL_AGENTS
69208177 BUS_BNR_DRV:ALL_AGENTS
36272303280 LAST_LEVEL_CACHE_REFERENCES
427617760 LAST_LEVEL_CACHE_MISSES
217670175552 CPU_CLK_UNHALTED:BUS
-----------------------------------------------------------------------
Ratios:
CPI: 1.1254
load instructions %: 39.210%
store instructions %: 20.572%
load and store instructions %: 59.782%
resource stalls % (of cycles): 36.328%
branch instructions %: 14.980%
% of branch instr. mispredicted: 2.458%
% of l2 loads missed: 1.179%
bus utilization %: 2.318%
data bus utilization %: 1.258%
bus not ready %: 0.064%
comp. SIMD inst. ('new FP') %: 0.000%
comp. x87 instr. ('old FP') %: 9.392%
with TCMALLOC
1584570646894 UNHALTED_CORE_CYCLES
1509316585174 INSTRUCTIONS_RETIRED
224658865844 BRANCH_INSTRUCTIONS_RETIRED
5025722897 MISPREDICTED_BRANCH_RETIRED
597442500267 INST_RETIRED:LOADS
0 SIMD_COMP_INST_RETIRED:PACKED_SINGLE:SCALAR_SINGLE:PACKED_DOUBLE:SCALAR_DOUBLE
306462564773 INST_RETIRED:STORES
144577985459 X87_OPS_RETIRED:ANY
527521246432 RESOURCE_STALLS:ANY
1961824853 BUS_TRANS_ANY:ALL_AGENTS
2126966626 BUS_DRDY_CLOCKS:ALL_AGENTS
67851327 BUS_BNR_DRV:ALL_AGENTS
33442404368 LAST_LEVEL_CACHE_REFERENCES
355141679 LAST_LEVEL_CACHE_MISSES
199109646642 CPU_CLK_UNHALTED:BUS
-----------------------------------------------------------------------
Ratios:
CPI: 1.0499
load instructions %: 39.584%
store instructions %: 20.305%
load and store instructions %: 59.888%
resource stalls % (of cycles): 33.291%
branch instructions %: 14.885%
% of branch instr. mispredicted: 2.237%
% of l2 loads missed: 1.062%
bus utilization %: 1.971%
data bus utilization %: 1.068%
bus not ready %: 0.068%
comp. SIMD inst. ('new FP') %: 0.000%
comp. x87 instr. ('old FP') %: 9.579%
23/07/2008
new performance test using 210 (pre9)
I ran reco 210_pre9 on 200 bbar events as produced by RelVal.
I ran the standard code, with the patch to ROOT::Cintex::Allocate_code applied so that it serves pages out of a 2 MB block pool, then the same using tcmalloc, and finally using hugepages:
cat /proc/meminfo | grep -i huge
HugePages_Total: 1000
HugePages_Free: 678
HugePages_Rsvd: 1
HugePages_Surp: 0
Hugepagesize: 2048 kB
I even ran 8 cmsRun processes at the same time.
The use of hugepages does not have any effect on performance; reco scales up to 8 CPUs (on lxbuild114) without problems.
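As an aside on the ROOT::Cintex::Allocate_code patch mentioned above: the idea of serving small executable stubs out of one pre-mapped 2 MB block can be sketched as follows. This is my own illustration with a hypothetical CodePool class, not the actual patch; it only shows the bump-allocation idea, so that each generated stub does not consume a whole private page.
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

namespace {
const std::size_t kPoolSize = 2 * 1024 * 1024;  // one 2 MB block of code memory

class CodePool {
public:
  CodePool()
      : base_(static_cast<char*>(mmap(nullptr, kPoolSize,
                                      PROT_READ | PROT_WRITE | PROT_EXEC,
                                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0))),
        used_(0) {}
  // hand out small executable chunks from the single big mapping
  void* allocate(std::size_t n) {
    n = (n + 15) & ~static_cast<std::size_t>(15);  // keep 16-byte alignment
    if (base_ == MAP_FAILED || used_ + n > kPoolSize) return nullptr;
    void* p = base_ + used_;
    used_ += n;
    return p;
  }
private:
  char* base_;
  std::size_t used_;
};
}  // namespace

int main() {
  CodePool pool;
  void* stub = pool.allocate(64);  // e.g. one small generated trampoline
  std::printf("stub at %p (many stubs served from one 2 MB block)\n", stub);
}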
22/07/2008
running with 210
cmsDriver.py BJets_Pt_50_120_cfi -s GEN -n 100
cmsDriver.py BJets_Pt_50_120_cfi -s SIM -n 100 --filein file:BJets_Pt_50_120_cfi_GEN.root
did not manage to get --customize to work;
copied Gabriele's file locally, ran cmsDriver with --no_exec, and used pico to mix the two...
in the end I used Sharham's instructions and copied a sim-file from dbs:
rfcp /castor/cern.ch/cms/store/relval/2008/7/20/RelVal-RelValBJets_Pt_50_120-1216579576/RelValBJets_Pt_50_120/GEN-SIM-DIGI-RAW-HLTDEBUG/CMSSW_2_1_0_pre9-RelVal-1216579576-STARTUP_V4-unmerged/0000/00A1B29A-4457-DD11-AF37-000423D9939C.root .
cmsDriver.py BJets_Pt_50_120_cfi -s GEN,SIM,DIGI,L1,DIGI2RAW -n 100 --conditions FrontierConditions_GlobalTag,STARTUP_V4::All --datatier 'GEN-SIM-DIGI-RAW' --eventcontent FEVTDEBUGHLT
cp $CMSSW_RELEASE_BASE/src/Configuration/Examples/python/RecoExample_cfg.py .
cmsRun RecoExample_cfg.py
16/07/2008
testing memory churn in multi-thread
six threads on lxbuild114 (8 core)
| method | Real time [s] (malloc) | cpu/real (malloc) | Real time [s] (tcmalloc) | cpu/real (tcmalloc) |
| naive | 81 | 534% | 300 | 138% |
| boost-pool | 35 | 462% | 36 | 504% |
| large chunks | 34 | 477% | 36 | 506% |
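To make explicit what the "naive" method above does, here is a minimal sketch (my own illustration, not the original benchmark) of an allocation-churn loop run by several threads; the pool-based variants replace the new/delete pairs with per-thread pools or larger chunks.
// compile e.g. with: c++ -O2 -pthread churn.cpp -o churn
#include <thread>
#include <vector>
#include <cstdio>

static void churn(int iterations) {
  std::vector<double*> v;
  v.reserve(1000);
  for (int i = 0; i < iterations; ++i) {
    for (int j = 0; j < 1000; ++j) v.push_back(new double[8]);  // many small allocations
    for (double* p : v) delete[] p;                             // ...and frees
    v.clear();
  }
}

int main() {
  const int nthreads = 6;                 // six threads as in the test above
  std::vector<std::thread> pool;
  for (int i = 0; i < nthreads; ++i) pool.emplace_back(churn, 20000);
  for (auto& t : pool) t.join();
  std::puts("done");
}
Run as-is it exercises the standard malloc; preloading libtcmalloc_minimal.so (as in the 04/07/2008 entry below) compares tcmalloc on the same churn.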
15/07/2008
summary of measurement of cms reco
| time per event [s] | malloc | tcmalloc |
| reco modules | 2.8 | 2.5 |
| output module | 0.25 | 0.24 |

| counter | malloc | tcmalloc |
| UNHALTED_CORE_CYCLES | 970293569059 | 903705695000 |
| INSTRUCTIONS_RETIRED | 943404623915 | 931115219416 |
| BRANCH_INSTRUCTIONS_RETIRED | 153400155587 | 151185416169 |
| MISPREDICTED_BRANCH_RETIRED | 4283583203 | 4032950096 |
| INST_RETIRED:LOADS | 366093204853 | 363136780290 |
| SIMD_COMP_INST_RETIRED | 0 | 0 |
| INST_RETIRED:STORES | 198866522599 | 194064751706 |
| X87_OPS_RETIRED:ANY | 57650513756 | 57567303752 |
| RESOURCE_STALLS:ANY | 305133549321 | 259105502466 |
| BUS_TRANS_ANY:ALL_AGENTS | 1103247481 | 965820152 |
| BUS_DRDY_CLOCKS:ALL_AGENTS | 1164574768 | 1004396040 |
| BUS_BNR_DRV:ALL_AGENTS | 365249031 | 346899990 |
| LAST_LEVEL_CACHE_REFERENCES | 21200587333 | 20041091763 |
| LAST_LEVEL_CACHE_MISSES | 187148195 | 168698808 |
| CPU_CLK_UNHALTED:BUS | 138684654138 | 129149087560 |

| ratio | malloc | tcmalloc |
| CPI | 1.0285 | 0.9706 |
| load instructions % | 38.806% | 39.000% |
| store instructions % | 21.080% | 20.842% |
| load and store instructions % | 59.885% | 59.842% |
| resource stalls % (of cycles) | 31.448% | 28.671% |
| branch instructions % | 16.260% | 16.237% |
| % of branch instr. mispredicted | 2.792% | 2.668% |
| % of l2 loads missed | 0.883% | 0.842% |
| bus utilization % | 1.591% | 1.496% |
| data bus utilization % | 0.840% | 0.778% |
| bus not ready % | 0.527% | 0.537% |
| comp. SIMD inst. ('new FP') % | 0.000% | 0.000% |
| comp. x87 instr. ('old FP') % | 6.111% | 6.183% |
07/07/2008
allocating huge pages on 32 bit
tcmalloc fails with this message:
Check failed: (fstatfs(hugetlb_fd, &sfs)) != -1: Bad address
06/07/2008 (sunday)
it's raining!
allocating huge pages
fasten your seatbelt
Following Giulio's recipe I tried to convince tcmalloc to use hugetlb pages.
I used lxbuild066 (standard slc4) and pcphsft50 (new kernel)
setting up
sudo /bin/tcsh -f -c "echo 1000 > /proc/sys/vm/nr_hugepages"
sudo /bin/tcsh -f -c "mkdir -p /mnt/hugepage;mount -t hugetlbfs -o uid=1039,gid=1399,mode=0775 none /mnt/hugepage"
setenv TCMALLOC_MEMFS_MALLOC_PATH /mnt/hugepage/
caveat: lxbuild machines have little contiguous memory for allocating hugetlb pages, so one gets only 12 pages (even only 6 on lxbuild065)
cat /proc/meminfo | grep -i huge
HugePages_Total: 12
HugePages_Free: 12
Hugepagesize: 2048 kB
this is enough for a small test with scimark2:
unsetenv TCMALLOC_MEMFS_MALLOC_PATH
[lxbuild066] ~/public/multicore > ~/w1/scimark2/scimark2 -large > scimark2_tc&
[1] 2257
[lxbuild066] ~/public/multicore > cat /proc/meminfo | grep -i huge
HugePages_Total: 12
HugePages_Free: 12
Hugepagesize: 2048 kB
[lxbuild066] ~/public/multicore > cat /proc/meminfo | grep -i huge
HugePages_Total: 12
HugePages_Free: 12
Hugepagesize: 2048 kB
[lxbuild066] ~/public/multicore >
[1] Done ~/w1/scimark2/scimark2 -large > scimark2_tc
[lxbuild066] ~/public/multicore > cat scimark2_tc
** **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov) **
** **
Using 2.00 seconds min time per kenel.
Composite Score: 473.27
FFT Mflops: 55.27 (N=1048576)
SOR Mflops: 846.92 (1000 x 1000)
MonteCarlo: Mflops: 191.06
Sparse matmult Mflops: 455.11 (N=100000, nz=1000000)
LU Mflops: 818.00 (M=1000, N=1000)
--------------------------------------------------
----------------------------------------------------
setenv TCMALLOC_MEMFS_MALLOC_PATH /mnt/hugepage/testTC
[lxbuild066] ~/public/multicore > ~/w1/scimark2/scimark2 -large > scimark2_tc_hugetlb &
[1] 3585
cat /proc/meminfo | grep -i huge
HugePages_Total: 12
HugePages_Free: 4
Hugepagesize: 2048 kB
[lxbuild066] ~/public/multicore > cat /proc/meminfo | grep -i huge
HugePages_Total: 12
HugePages_Free: 4
Hugepagesize: 2048 kB
[1] Done ~/w1/scimark2/scimark2 -large > scimark2_tc_hugetlb
[lxbuild066] ~/public/multicore > cat scimark2_tc_hugetlb
** **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov) **
** **
Using 2.00 seconds min time per kenel.
Composite Score: 452.73
FFT Mflops: 34.39 (N=1048576)
SOR Mflops: 849.27 (1000 x 1000)
MonteCarlo: Mflops: 191.06
Sparse matmult Mflops: 485.31 (N=100000, nz=1000000)
LU Mflops: 703.61 (M=1000, N=1000)
So it works, but it is slower!
By the way, scimark2 is overall faster with standard malloc (mainly because of the MonteCarlo kernel, which does not stress memory much):
unsetenv LD_PRELOAD
[lxbuild066] ~/public/multicore > ~/w1/scimark2/scimark2 -large
** **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov) **
** **
Using 2.00 seconds min time per kenel.
Composite Score: 522.15
FFT Mflops: 55.42 (N=1048576)
SOR Mflops: 851.63 (1000 x 1000)
MonteCarlo: Mflops: 403.66
Sparse matmult Mflops: 471.89 (N=100000, nz=1000000)
LU Mflops: 828.16 (M=1000, N=1000)
summary of the measurements below
Using tcmalloc, cms reconstruction for b-jets is about 10% faster.
A system analysis shows that the number of retired instructions is essentially the same.
Most of the saving comes from fewer resource stalls and somewhat fewer mispredicted branches.
measuring cms reconstruction
these instructions are valid for CMSSW 2_0_9
run the preparatory sequence
cmsDriver.py B_JETS -s STEP -n 100
with STEP = GEN, SIM, DIGI
then run
~/pyModules/pfmon_deluxe.py cmsDriver.py B_JETS -s RECO -n 100
results
TimeReport ---------- Event Summary ---[sec]----
TimeReport CPU/event = 3.046750 Real/event = 3.059505
TimeReport ---------- Path Summary ---[sec]----
TimeReport per event per path-run
TimeReport CPU Real CPU Real Name
TimeReport 2.798615 2.811101 2.798615 2.811101 reconstruction_step
TimeReport -------End-Path Summary ---[sec]----
TimeReport per event per endpath-run
TimeReport CPU Real CPU Real Name
TimeReport 0.247656 0.247859 0.247656 0.247859 outpath
970293569059 UNHALTED_CORE_CYCLES
943404623915 INSTRUCTIONS_RETIRED
153400155587 BRANCH_INSTRUCTIONS_RETIRED
4283583203 MISPREDICTED_BRANCH_RETIRED
366093204853 INST_RETIRED:LOADS
0 SIMD_COMP_INST_RETIRED:PACKED_SINGLE:SCALAR_SINGLE:PACKED_DOUBLE:SCALAR_DOUBLE
198866522599 INST_RETIRED:STORES
57650513756 X87_OPS_RETIRED:ANY
305133549321 RESOURCE_STALLS:ANY
1103247481 BUS_TRANS_ANY:ALL_AGENTS
1164574768 BUS_DRDY_CLOCKS:ALL_AGENTS
365249031 BUS_BNR_DRV:ALL_AGENTS
21200587333 LAST_LEVEL_CACHE_REFERENCES
187148195 LAST_LEVEL_CACHE_MISSES
138684654138 CPU_CLK_UNHALTED:BUS
-----------------------------------------------------------------------
Ratios:
CPI: 1.0285
load instructions %: 38.806%
store instructions %: 21.080%
load and store instructions %: 59.885%
resource stalls % (of cycles): 31.448%
branch instructions %: 16.260%
% of branch instr. mispredicted: 2.792%
% of l2 loads missed: 0.883%
bus utilization %: 1.591%
data bus utilization %: 0.840%
bus not ready %: 0.527%
comp. SIMD inst. ('new FP') %: 0.000%
comp. x87 instr. ('old FP') %: 6.111%
repeated with TCMALLOC
TimeReport ---------- Event Summary ---[sec]----
TimeReport CPU/event = 2.776694 Real/event = 2.781829
TimeReport ---------- Path Summary ---[sec]----
TimeReport per event per path-run
TimeReport CPU Real CPU Real Name
TimeReport 2.536359 2.541528 2.536359 2.541528 reconstruction_step
TimeReport -------End-Path Summary ---[sec]----
TimeReport per event per endpath-run
TimeReport CPU Real CPU Real Name
TimeReport 0.240255 0.240235 0.240255 0.240235 outpath
903705695000 UNHALTED_CORE_CYCLES
931115219416 INSTRUCTIONS_RETIRED
151185416169 BRANCH_INSTRUCTIONS_RETIRED
4032950096 MISPREDICTED_BRANCH_RETIRED
363136780290 INST_RETIRED:LOADS
0 SIMD_COMP_INST_RETIRED:PACKED_SINGLE:SCALAR_SINGLE:PACKED_DOUBLE:SCALAR_DOUBLE
194064751706 INST_RETIRED:STORES
57567303752 X87_OPS_RETIRED:ANY
259105502466 RESOURCE_STALLS:ANY
965820152 BUS_TRANS_ANY:ALL_AGENTS
1004396040 BUS_DRDY_CLOCKS:ALL_AGENTS
346899990 BUS_BNR_DRV:ALL_AGENTS
20041091763 LAST_LEVEL_CACHE_REFERENCES
168698808 LAST_LEVEL_CACHE_MISSES
129149087560 CPU_CLK_UNHALTED:BUS
-----------------------------------------------------------------------
Ratios:
CPI: 0.9706
load instructions %: 39.000%
store instructions %: 20.842%
load and store instructions %: 59.842%
resource stalls % (of cycles): 28.671%
branch instructions %: 16.237%
% of branch instr. mispredicted: 2.668%
% of l2 loads missed: 0.842%
bus utilization %: 1.496%
data bus utilization %: 0.778%
bus not ready %: 0.537%
comp. SIMD inst. ('new FP') %: 0.000%
comp. x87 instr. ('old FP') %: 6.183%
04/07/2008
TCMalloc
I've installed TCMalloc (using Giulio's recipe) as:
setenv TCMALLOC_ROOT /afs/cern.ch/user/i/innocent/w1/tcmalloc
mkdir -p $TCMALLOC_ROOT
cd $TCMALLOC_ROOT
wget http://google-perftools.googlecode.com/files/google-perftools-0.97.tar.gz
tar xzvf google-perftools-0.97.tar.gz
cd google-perftools-0.97
./configure --prefix $TCMALLOC_ROOT --enable-frame-pointers
make
make install
to use it just
setenv TCMALLOC_ROOT /afs/cern.ch/user/i/innocent/w1/tcmalloc
setenv LD_PRELOAD $TCMALLOC_ROOT/lib/libtcmalloc_minimal.so
for 32 bit the best is to initialize a CMSSW environment first and then:
setenv TCMALLOC_ROOT /afs/cern.ch/user/i/innocent/w1/tcmalloc32
mkdir -p $TCMALLOC_ROOT
cd $TCMALLOC_ROOT
wget http://google-perftools.googlecode.com/files/google-perftools-0.97.tar.gz
tar xzvf google-perftools-0.97.tar.gz
cd google-perftools-0.97
linux32 ./configure --prefix $TCMALLOC_ROOT --enable-frame-pointers
linux32 make
linux32 make install
ldd
should look like this
ldd /afs/cern.ch/user/i/innocent/w1/tcmalloc32/lib/libtcmalloc_minimal.so
linux-gate.so.1 => (0xf7f8d000)
libstdc++.so.6 => /afs/cern.ch/cms/sw/slc4_ia32_gcc345/external/gcc/3.4.5-cms/lib/libstdc++.so.6 (0xf7e91000)
libm.so.6 => /lib/tls/libm.so.6 (0xf7e48000)
libc.so.6 => /lib/tls/libc.so.6 (0xf7d1c000)
libgcc_s.so.1 => /afs/cern.ch/cms/sw/slc4_ia32_gcc345/external/gcc/3.4.5-cms/lib/libgcc_s.so.1 (0xf7d12000)
/lib/ld-linux.so.2 (0x008bc000)
Once preloaded one will get
ERROR: ld.so: object '/afs/cern.ch/user/i/innocent/w1/tcmalloc32/lib/libtcmalloc_minimal.so' from LD_PRELOAD cannot be preloaded: ignored.
for every possible shell command; just ignore it...
--
VincenzoInnocente - 02-Mar-2011