Reducing memory footprint using jemalloc
Facebook's malloc is now released as an open-source product: jemalloc. It features efficient memory management for multi-threaded applications and aggressive memory reclamation.
This last feature is of interest to us, as the memory footprint of a HEP event-processing application may vary greatly from one event to the next.
I tested jemalloc in the reconstruction of high pile-up events in CMS (using release CMSSW_5_0_0_pre7). I used the version linked in cmsJERun, which is included in the CMSSW distribution.
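For reference, the standard way to enable jemalloc for an existing binary without relinking is to preload the library; a minimal sketch follows (the library path and the configuration name are assumptions, and cmsJERun presumably does something equivalent internally):

#!/bin/tcsh
# Sketch: run cmsRun with jemalloc preloaded in place of the system malloc.
# The library path is an assumption; adjust it to your installation.
setenv LD_PRELOAD /usr/lib64/libjemalloc.so
# reco_cfg.py stands for whatever configuration you actually run.
cmsRun reco_cfg.py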
Single process
The following plot shows the values of VSS and RSS measured after each event, for several runs of the very same application (reconstruction of 250 events), reading either from EOS or from a local file, and using either standard malloc or jemalloc.
The plot is obtained by parsing the log files with

foreach log (*_250*.out)
  grep "MemoryCheck: e" $log | awk '{print $5 ", " $8}' > `basename $log .out`.csv
end
and merging the resulting files with

pr -m -t -s,\ ...

before importing into Excel.
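For concreteness, a plausible complete form of the merge, mirroring the command used in the multi-process section below (the CSV file names are hypothetical, standing for the output of the foreach loop above):

# file names are illustrative
pr -m -t -s,\ reco_std_250.csv reco_je_250.csv > single_vss_rss.csv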
* VSS and RSS for 250 events:
* VSS and RSS for 250 events (alternative take):
One can observe, first, the steps in both virtual and resident memory while reading from EOS (with xrootd) using standard malloc. These steps are absent when using jemalloc, or simply when reading from a local file.
The other clear feature is how effectively jemalloc reclaims memory after each large event. VSS is not reduced, though, and remains at the same level as with standard malloc when xrootd is avoided.
Multi process
The effect of a lower memory footprint is even more evident in a multi-process environment where, obviously, different processes will, at any given time, use a very different amount of RSS depending on the event in hand.
We therefore expect the total RSS (and, in the case of shared memory, the PSS) to be much lower with jemalloc than with standard malloc.
The following two plots show the evolution of the total PSS and RSS, sampled every 30 seconds, for four and eight children processing 500 and 1000 events respectively, for both standard malloc and jemalloc.
Memory has been sampled using smem with the following script:

#!/bin/tcsh
# Append the totals line of smem to the file passed as $1, every 30 seconds.
touch $1
while ( 1 )
  ~/w1/smem-0.9/smem -t |& tail -1 >> $1
  sleep 30
end
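A typical invocation (the script name here is illustrative) runs it in the background alongside the job being monitored:

./sample_mem.csh sample_je8.log &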
The samples are then converted to CSV with the usual awk + pr:

cat sample_je8.log | awk '{ print $5 ", " $6}' > sample_je8.csv
cat sample_std8.log | awk '{ print $5 ", " $6}' > sample_std8.csv
pr -m -t -s,\ sample_std8.csv sample_je8.csv > multi8_pss_rss.csv
* Total PSS and RSS for 4 children, 500 events, sampled every 30 seconds:
* Total PSS and RSS for 8 children, 1000 events, sampled every 30 seconds:
Particularly in the case of eight children, we note how with jemalloc the event-by-event variations in memory footprint are well amortized, reducing even the peak effective memory use to 8 GB, while with standard malloc the value of 10 GB is rapidly reached and maintained.
--
VincenzoInnocente - 13-Dec-2011