Reducing memory footprint using jemalloc

What is by now known as the "Facebook malloc" is also released as an open-source product: jemalloc. It features efficient memory management for multi-threaded applications and aggressive memory reclamation. This last feature is of interest for us, as the memory footprint of a HEP event-processing application may vary greatly from one event to the other. I tested jemalloc in the reconstruction of high pile-up events in CMS (using release CMSSW_5_0_0_pre7). I used the version linked in cmsJERun, which is included in the CMSSW distribution.
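
For an application not already linked against it, the same allocator can usually be enabled at run time by preloading the jemalloc shared library. A minimal sketch (the library path and the configuration file name below are placeholders, not the actual CMSSW setup):

# preload jemalloc so that it replaces the standard malloc for this job;
# the path to libjemalloc.so and the configuration name are placeholders
setenv LD_PRELOAD /path/to/libjemalloc.so
cmsRun my_highpu_reco_cfg.py
unsetenv LD_PRELOAD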

Single process

The following plots show the value of VSS and RSS measured after each event for various runs of the very same application (reconstruction of 250 events), reading either from EOS or from a local file, and using either standard malloc or jemalloc.

The plots are obtained by parsing the log files with

foreach log (*_250*.out)
  # extract the VSS and RSS fields from the MemoryCheck lines of each log
  grep "MemoryCheck: e" $log | awk '{print $5 ", " $8 }' > `basename $log .out`.csv
end

and merging the files using pr -m -t -s,\  ... before importing into Excel.
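
For reference, the merge step looks roughly like this (the csv file names are placeholders for the actual per-run files):

# merge the per-run csv files side by side, comma-separated
pr -m -t -s,\  std_local_250.csv std_eos_250.csv je_local_250.csv je_eos_250.csv > HPu_all.csv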

* VSS and RSS for 250 events:
  HPu_all.png

* VSS and RSS for 250 events (alternative take):
  HPu_ori_all.png

One can observe, first, the steps in both virtual and resident memory while reading from EOS (with xrootd) using standard malloc. These steps are absent when using jemalloc or simply reading from a local file. The other clear feature is how effectively jemalloc reclaims memory after each large event. VSS is not reduced, though, and remains at the same level as the one used by standard malloc when xrootd is avoided.

Multi process

The effect of a lower memory footprint is even more evident in a multi-process environment where, obviously, different processes will, at any given time, use very different amounts of RSS depending on the event at hand. We therefore expect the total RSS (and, in case of shared memory, the PSS) to be much lower with jemalloc than with standard malloc.
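
As a quick cross-check of the effective total, the PSS of a set of children can also be summed directly from /proc; a minimal sketch (not the script used for the plots below), taking the child PIDs as arguments:

# sum the PSS of the given child PIDs from /proc/<pid>/smaps;
# PSS charges shared pages proportionally, so the sum is the effective total
set total = 0
foreach pid ($argv)
  set pss = `grep '^Pss:' /proc/$pid/smaps | awk '{ kb += $2 } END { print kb+0 }'`
  echo "$pid $pss kB"
  @ total += $pss
end
echo "total PSS: $total kB"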

The following two plots show the evolution of the total PSS and RSS, sampled every 30 seconds, for four and eight children processing 500 and 1000 events respectively, for both standard malloc and jemalloc. Memory has been sampled using smem:

# append the totals line of smem to the file given as first argument,
# sampling every 30 seconds
touch $1
while (1)
  ~/w1/smem-0.9/smem -t |& tail -1 >> $1
  sleep 30
end
and the usual awk + pr:

cat sample_je8.log | awk '{ print $5 ", " $6}' > sample_je8.csv
cat sample_std8.log | awk '{ print $5 ", " $6}' > sample_std8.csv
pr -m -t -s,\  sample_std8.csv sample_je8.csv > multi8_pss_rss.csv

* Total PSS and RSS for 4 children, 500 events, sampled every 30 seconds:
  multi4_pss_rss.png

* Total PSS and RSS for 8 children, 1000 events, sampled every 30 seconds:
  multi8_pss_rss.png

Particularly in the case of eight children, we note how, using jemalloc, the event-by-event variations in memory footprint are well amortized, reducing even the peak effective memory use to 8 GB, while with standard malloc the value of 10 GB is rapidly reached and maintained.

-- VincenzoInnocente - 13-Dec-2011

Topic attachments

Attachment          Size     Date        Who                Comment
HPu_ori_all.png     405.4 K  2011-12-13  VincenzoInnocente  VSS and RSS for 250 events (alternative take)
multi4_pss_rss.png  399.8 K  2011-12-13  VincenzoInnocente  total PSS and RSS for 4 children, 500 events, sampled every 30 seconds
multi8_pss_rss.png  381.5 K  2011-12-13  VincenzoInnocente  total PSS and RSS for 8 children, 1000 events, sampled every 30 seconds
HPu_all.png         314.5 K  2011-12-13  VincenzoInnocente  VSS and RSS for 250 events
