Reducing memory footprint using jemalloc

What is by now known as the "Facebook malloc" is also released as an open-source product: jemalloc. It features efficient memory management for multi-threaded applications and aggressive reclamation of unused memory. This last feature is of interest for us, as the memory footprint of a HEP event-processing application may vary greatly from one event to the other. I tested jemalloc on the reconstruction of high pile-up events in CMS (using release CMSSW_5_0_0_pre7). I used the version linked in cmsJERun, which is included in the CMSSW distribution. The results show that, when running on a multicore machine, jemalloc efficiently amortizes the cost of the peak allocations of each job, making the total RSS in use at any moment much smaller than the sum of the peak memory usages.
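For the tests below jemalloc is picked up through cmsJERun. As an illustration of the general mechanism only (a minimal sketch; the library path and the configuration file name are hypothetical), the same effect can be obtained on any Linux box by preloading the library:

# preload jemalloc for a single job (library path and config name are illustrative)
setenv LD_PRELOAD /usr/lib64/libjemalloc.so
cmsRun reco_highpu_cfg.py >& reco_je.log
unsetenv LD_PRELOAD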

Single process

The following plot shows the values of VSS and RSS measured after each event for various runs of the very same application (reconstruction of 250 events), reading either from EOS or from a local file and using either standard malloc or jemalloc.

The plot is obtained by parsing the log files with

foreach log (*_250*.out)
  grep "MemoryCheck: e" $log | awk '{print $5 ", " $8 }' > `basename $log .out`.csv
end

and merging the files using pr -m -t -s,\  ... before importing into Excel.
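For illustration, assuming the four runs produce CSV files with the (hypothetical) names below, the merge could look like:

pr -m -t -s,\  local_std_250.csv local_je_250.csv eos_std_250.csv eos_je_250.csv > HPu_all.csv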

VSS and RSS for 250 events:
HPu_all.png

VSS and RSS for 250 events (alternative take):
HPu_ori_all.png

One can observe, first of all, the steps in both virtual and resident memory when reading from EOS (with xrootd) using standard malloc. These steps are absent when using jemalloc or when simply reading from a local file. The other clear feature is how effectively jemalloc reclaims memory after each large event. VSS, though, is not reduced and remains at the same level as the one used by standard malloc when xrootd is avoided.

Multi-process

The effect of a lower memory footprint is even more evident in a multi-process environment where, obviously, different processes will, at any given time, use a very different RSS depending on the event at hand. We should therefore expect the total RSS (and, in case of shared memory, the PSS) to be much lower with jemalloc than with standard malloc.

The following two plots show the evolution of the total PSS and RSS, sampled every 30 seconds, for four and eight children processing 500 and 1000 events respectively, for standard malloc and jemalloc. Memory has been sampled using smem

touch $1
while (1)
  ~/w1/smem-0.9/smem -t |& tail -1 >> $1
  sleep 30
end
and the usual awk + pr

cat sample_je8.log | awk '{ print $5 ", " $6}' > sample_je8.csv
cat sample_std8.log | awk '{ print $5 ", " $6}' > sample_std8.csv
pr -m -t -s,\  sample_std8.csv sample_je8.csv > multi8_pss_rss.csv
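For completeness, the sampling loop above can be saved as a small csh script (called here sample_mem, an arbitrary name) and started in the background before launching the multi-process job:

# start the sampler writing to sample_je8.log, then launch the 8-children job
csh sample_mem sample_je8.log &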

Total PSS and RSS for 4 children 500 events sampled each 30 seconds:
multi4_pss_rss.png

Total PSS and RSS for 8 children 1000 events sampled each 30 seconds:
multi8_pss_rss.png

Particularly in the case of eight children we note how, using jemalloc, the event-by-event variations in memory footprint are well amortized, keeping even the peak effective memory usage below 8 GB, while with standard malloc the value of 10 GB is rapidly reached and maintained.
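The peak values quoted above can be read off directly from the sampled CSV files; a small sketch, assuming (as suggested by the file names above) that the first column is the total PSS and the second the total RSS, in the units reported by smem:

awk -F', ' '$1>p {p=$1} $2>r {r=$2} END {print "peak PSS:", p, " peak RSS:", r}' sample_je8.csv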

Long run: 10K events

The plot below shows the memory evolution (RSS and VSS) for four jobs on the full file with 10K events; from bottom to top (in RSS) they are listed below. As a reminder, we run at CERN with kernel 2.6.18-274.12.1.el5.

  • reco only using jemalloc reading from local file
  • same reading from EOS
  • reco+alca+dqm using jemalloc reading from local file
  • same using standard malloc

All the plots show a structure, most probably due to the event content. As usual, the RSS from jemalloc follows very closely the real memory needs of the job. Its VSS instead grows quite fast, faster than with standard malloc, in particular for reco+alca+dqm. Reading from EOS shows an RSS behaviour similar to reading from a local file, while the VSS is systematically larger by about 100 MB.

While it is still quite possible that the growth of the VSS with jemalloc is just an artifact of the OS, it could also be an indication of some sort of memory leak.
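One possible way to follow this up (a sketch, assuming a jemalloc version that honours the MALLOC_CONF environment variable; the configuration name is illustrative) is to let the allocator print its internal statistics at exit, which show how much of the mapped address space is actually active:

# dump jemalloc statistics when the job exits
setenv MALLOC_CONF "stats_print:true"
cmsRun reco_alca_dqm_cfg.py >& reco_je_stats.log
unsetenv MALLOC_CONF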

Reco (jemalloc, reading either from EOS or from a local file) and reco+alca+dqm with either jemalloc or standard malloc:
full_all.png

-- VincenzoInnocente - 13-Dec-2011

Topic attachments
  HPu_all.png          314.5 K   2011-12-13   VincenzoInnocente   VSS and RSS for 250 events
  HPu_ori_all.png      405.4 K   2011-12-13   VincenzoInnocente   VSS and RSS for 250 events (alternative take)
  full_all.png         515.5 K   2011-12-23   VincenzoInnocente   reco (jemalloc reading either from EOS or from local file) and reco+alca+dqm with either jemalloc or standard malloc
  multi4_pss_rss.png   399.8 K   2011-12-13   VincenzoInnocente   total PSS and RSS for 4 children 500 events sampled each 30 seconds
  multi8_pss_rss.png   381.5 K   2011-12-13   VincenzoInnocente   total PSS and RSS for 8 children 1000 events sampled each 30 seconds