Proposal for a common Performance Measurement procedure

Introduction

This is a workbook for defining a common procedure for measuring site/hardware performance for CMS. This can be important in the near future for:

  • Evaluating new hardware before purchase
  • Identifying bottlenecks in the infrastructure/settings

Having a common tool will allow comparisons between sites on a common ground.

Tests Description

General considerations

The idea is to use CMSSW as a baseline for these tests, using different settings/configurations. We can roughly divide a site into three main components:

  1. CPU
  2. Storage
  3. Network

(ok, it is much more complex than this, but this is to get a general picture). The tests should then try to probe each single component in a decoupled way. Translating this into CMSSW jobs, a proposal could be:

  1. Running a cmsRun executable reading files from a local disk (both serially and in parallel). Giacinto: [It would be interesting to relate job performance to the number of concurrent jobs running on the same local disk. I have done this in the past and it did not seem to scale well... so I am interested in measuring this better]
  2. Running the same cmsRun executable on files placed in different pools/fileservers
  3. Running the cmsRun executable with different network loads (changing the TFileAdaptor settings)

Tests 2 and 3 are not really decoupled: they can perhaps be complemented with a pure network test (scp transfers from the fileserver's local disk to the WN's local disk, or something more elegant like iperf). A pure network test will also help to compare theoretical with actual network performance.
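
A minimal sketch of such a pure network test could be the following (the hostname and paths are placeholders to be adapted to the site; iperf must first be started in server mode on the fileserver):

#!/bin/bash
# Pure network test sketch. "fileserver" and the paths are placeholders.
# Prerequisite: "iperf -s" running on the fileserver.

# 1) Memory-to-memory throughput with iperf (no disks involved):
#    60 s measurement, intermediate reports every 10 s
iperf -c fileserver -t 60 -i 10

# 2) Disk-to-disk throughput cross-check with scp
TESTFILE=/data/testfile_2GB    # a large file on the fileserver's local disk
time scp fileserver:${TESTFILE} /scratch/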

CMSSW jobs

There is no single type of CMSSW job. Some jobs read little data from files (which causes seeking etc.), some much more. Some, e.g. MC generation, do not read data at all. All these behaviours can be further modified by particular CMSSW settings for reading/caching files through the TFileAdaptor interface, as described here. The pairing of different CMSSW cfgs and TFileAdaptor settings (and bugs, as described here and measured here) can trigger very different behaviours, and a CPU-bound job can become IO-bound. Three simple CMSSW cfg files are therefore proposed:

  • A simple JetAnalyzer (JPE). This reads only three variables from a jet branch and produces three histograms. It reads little data, so it is more storage-bound than network-bound (in principle). The cfg has been taken from here
  • The standard PAT-tuple creator (with AOD settings). This reads much more data than the JPE job (~1 MB/evt vs ~10 kB/evt). Even so, the network load does not seem heavy (2-4 MB/s) with different settings (see here for an example). In any case, here it can be seen how IO-bound the job is, e.g. by looking at the different performances of the two FSs. This behaviour can be changed with different TFileAdaptor settings. As it also performs several complex operations (e.g. the computation of isolation deposits), it should be CPU-bound as well (to be checked by reading local files). A job for all seasons.
  • A good CPU-bound job could be a GEN-SIM process created with the cmsDriver script (to be tested)

WARNING: at least for PAT jobs, a drop * should be put in the output commands, so that the measurement is not spoiled by writing performance. Writing can be re-enabled once the rest is understood.
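
A sketch of how the drop could be applied, by appending an override to the job's Python configuration (the cfg file name PAT_AOD.py and the output module label out are assumptions; adapt them to the actual cfg):

#!/bin/bash
# Disable event writing in an existing cfg by dropping all branches.
# "PAT_AOD.py" and the output module label "out" are assumptions.
cat >> PAT_AOD.py <<'EOF'
process.out.outputCommands = cms.untracked.vstring('drop *')
EOF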

Data formats

To complicate the situation further, CMSSW jobs also depend on the type of data actually read (RECO or AOD), as can be seen here. Jobs reading RECO data take longer to run, both because a RECO file contains fewer events than an AOD file of the same size and because each event carries much more information (the two factors are correlated). Different datasets can also lead to different results... so, different tests may be necessary:

  1. Each site will use the same dataset in both its RECO and AOD versions (actually, RECO alone can be used, also because the necessary files are not that many). This gives a common ground for tests. This step should also be divided into:
    1. Using files hosted on a single chosen pool/FS (e.g., one set from one pool and one from another)
    2. If needed, extending the test to files distributed among pools. In this way we can test both single-FS performance and site-wide performance
  2. Each site chooses a dataset to be tested. This will show how sensitive and general the tests are.

Running the test

Each of the tests described below can be run in two flavours: serial and parallel. In serial mode, only one job at a time runs on the testing node: this allows the CPU/IO performance of the WN+SE+NET system to be probed. In parallel mode, a more realistic situation can be reproduced by running the maximum number of jobs allowed on the WN (usually one per core): this will show the site performance in one of the realistic worst-case scenarios (multiple jobs accessing the same files on the same node), and also whether the system scales.

A possible test set can be:

  1. Run the GEN cfg. This should rely only on CPU performance
  2. Choose sets of files, each hosted on a single FS (the number of sets depends on the number of pools you want to test)
  3. Copy those files to the local disk of the WN and run a CMSSW executable that reads them (see the dccp sketch after this list). This settles a baseline for CMSSW job CPU performance.
  4. Run both the JPE and PAT cfgs on each set, at least using the standard and the lazy-download+cache=20 TFileAdaptor settings (a configuration sketch is given below). At present, application-only+cache is affected by a bug, so even though it can generate an interesting network load it is better not to rely on it.
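
For step 3, on a dCache site the files can be staged to the WN local disk with dccp; a sketch (the dcap door and the /pnfs path are placeholders for the site's actual ones):

#!/bin/bash
# Stage the test files onto the WN local disk. The dcap door
# (t3se01.psi.ch:22125) and the /pnfs path are placeholders.
for F in file1.root file2.root; do
    dccp dcap://t3se01.psi.ch:22125/pnfs/psi.ch/cms/store/test/${F} /scratch/
done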

First run these steps in serial mode, then in parallel mode. Obviously, run the tests on an unloaded machine.
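
As for the TFileAdaptor settings of step 4, a variant cfg can be produced by appending an AdaptorConfig service override to a copy of the baseline cfg; a sketch, with parameter names taken from the TFileAdaptor documentation (to be checked against the CMSSW release in use):

#!/bin/bash
# Build a "lazy-download" variant of the baseline cfg by overriding the
# TFileAdaptor service. File names and parameter values are assumptions;
# check them against the TFileAdaptor documentation for your release.
cp JPE_1.py JPE_LAZY_1.py
cat >> JPE_LAZY_1.py <<'EOF'
process.AdaptorConfig = cms.Service("AdaptorConfig",
    cacheHint = cms.untracked.string("lazy-download"),
    readHint  = cms.untracked.string("auto-detect"))
EOF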

An Example: Running at T3_CH_PSI

(please also have a look here)

At PSI we are running these tests on a new blade:

Characteristics:
  Name: t3wn29
  Model: Sun Blade X6270
  CPU: 8 × Intel(R) Xeon(R) CPU X5560 @ 2.80GHz
  RAM: 16 GB
  OS: SL5.3

and a dCache SE v1.9.2.5. Two FSs are being tested: fs01 and fs05 (incidentally, thanks to these tests we discovered that they have different reading performance; we are trying to understand why).

CMSSW_3_3_6 has been used (on an NFS area, compiled on an SL4 node), and the jobs are run through this script:

#!/bin/bash

CFG=JPE_RHAUTO_CHLD_CACHE20            # label of the cfg/TFileAdaptor variant under test
DIR=T3_CH_PSI-MultiTest-${CFG}

# One output directory per file set (two sets, to avoid caching effects)
for N in `seq 1 2`; do mkdir ${DIR}_${N}; done

# 20 iterations per set; each run is timestamped and bracketed by the eth0
# byte counters, to estimate the network traffic of the job
for i in `seq 1 20`; do
    for N in `seq 1 2`; do
        ( (echo "Network start " `date +%s && /sbin/ifconfig eth0 | grep bytes` ) > ${DIR}_${N}/${CFG}_eth0-${N}_${i}.txt ) && \
            ( /usr/bin/time cmsRun -j ${DIR}_${N}/${CFG}-${N}_${i}.xml ${CFG}_${N}.py ) &> ${DIR}_${N}/${CFG}-${N}_${i}.stdout && \
            ( (echo "Network stop " `date +%s && /sbin/ifconfig eth0 | grep bytes` ) >> ${DIR}_${N}/${CFG}_eth0-${N}_${i}.txt );
        # Append the network counters to the job stdout, so that CPT finds everything in one file
        cat ${DIR}_${N}/${CFG}_eth0-${N}_${i}.txt >> ${DIR}_${N}/${CFG}-${N}_${i}.stdout
        sleep 60;
    done
done

Explaining the script:

  • For each set, it creates a directory where the outputs are stored. To keep it simple, both stdout and stderr are redirected to the same file
  • The (echo "Network start " `date +%s && /sbin/ifconfig eth0 | grep bytes` ) > ${DIR}_${N}/${CFG}_eth0-${N}_${i}.txt ) lines are a crude method for estimating how much data is transferred through the network. It works only for a single job running on an unloaded machine. Even though it is crude, it gives only a ~6% overestimate (see here); a small helper sketch for extracting the numbers follows this list
  • Using at least two sets of files avoids caching effects, so it is recommended
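
A minimal helper sketch for extracting the received megabytes from one of those eth0 files (it assumes the classic ifconfig output format, "RX bytes:NNN (...) TX bytes:NNN (...)", as captured by the script above):

#!/bin/bash
# Usage: ./rx_delta.sh <eth0 logfile>
# Computes the MB received by eth0 between the "Network start" and
# "Network stop" lines written by the test script.
FILE=$1
START=`grep "Network start" ${FILE} | sed 's/.*RX bytes:\([0-9]*\).*/\1/'`
STOP=`grep "Network stop" ${FILE} | sed 's/.*RX bytes:\([0-9]*\).*/\1/'`
echo "scale=2; (${STOP} - ${START}) / 1048576" | bc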

The resulting files can be analyzed through the CMSSW Performance Toolkit scripts (CPT): https://twiki.cern.ch/twiki/bin/view/Main/LST2PerfToolkit

Analyzing the data

Using CPT, it is quite easy to get an overview of the performance (NB: the information gathered through ifconfig is not yet analyzed). Three scripts are available:

  • a python script that retrieves interesting statistics from the job output, saving them as ROOT histograms (PerfToolkit/cpt_getJobInfo.py)
  • a python script that plots the selected quantities and produces a twiki-compliant summary table (PerfToolkit/cpt_getStats.py)
  • a python script containing utilities used by cpt_getJobInfo.py and cpt_getStats.py (PerfToolkit/cpt_utilities.py)

So, for example, the information from the previous tests can be saved in ROOT files with the command:

$ cpt_getJobInfo.py --type=CMSSW T3_CH_PSI-MultiTest-JPE_1
OutFile: T3_CH_PSI-MultiTest-JPE_1.root
Analyzing T3_CH_PSI-MultiTest-JPE_1

Then, tables and plots can be produced with:

$ cpt_getStats.py --binwidth-time=10 T3_CH_PSI-MultiTest-JPE_*.root

This will output a table and some plots (the script is configurable), e.g.:

| Quantity | T3_CH_PSI-MultiTest-JPE_1 | T3_CH_PSI-MultiTest-JPE_2 |
| Success | 100.0% (20 / 20) | 100.0% (20 / 20) |
| ExeTime | 86.28 +- 32.95 | 197.56 +- 36.02 |
| UserTime | 17.18 +- 0.12 | 16.62 +- 0.49 |
| CpuPercentage | 24.40 +- 9.12 | 8.85 +- 1.85 |
| SysTime | 1.48 +- 0.07 | 1.33 +- 0.03 |
| OPEN | T3_CH_PSI-MultiTest-JPE_1 | T3_CH_PSI-MultiTest-JPE_2 |
| dcap-open-total-megabytes | 0.00 +- 0.00 | 0.00 +- 0.00 |
| dcap-open-total-msecs | 2877.59 +- 140.41 | 3053.50 +- 86.78 |
| dcap-open-num-operations | 8.00 +- 0.00 | 8.00 +- 0.00 |
| dcap-open-num-successful-operations | 8.00 +- 0.00 | 8.00 +- 0.00 |
| READ | T3_CH_PSI-MultiTest-JPE_1 | T3_CH_PSI-MultiTest-JPE_2 |
| tstoragefile-read-actual-total-megabytes | 25.71 +- 0.00 | 25.63 +- 0.00 |
| tstoragefile-read-total-megabytes | 25.71 +- 0.00 | 25.63 +- 0.00 |
| dcap-read-total-megabytes | 25.71 +- 0.00 | 25.63 +- 0.00 |
| tstoragefile-read-actual-total-msecs | 62148.64 +- 32852.00 | 168376.25 +- 35914.76 |
| tstoragefile-read-total-msecs | 62156.70 +- 32851.95 | 168384.45 +- 35915.03 |
| dcap-read-total-msecs | 62138.72 +- 32852.00 | 168366.30 +- 35914.59 |
| tstoragefile-read-actual-num-operations | 5760.00 +- 0.00 | 5745.00 +- 0.00 |
| tstoragefile-read-actual-num-successful-operations | 5760.00 +- 0.00 | 5745.00 +- 0.00 |
| tstoragefile-read-num-operations | 5760.00 +- 0.00 | 5745.00 +- 0.00 |
| dcap-read-num-operations | 5760.00 +- 0.00 | 5745.00 +- 0.00 |
| tstoragefile-read-num-successful-operations | 5760.00 +- 0.00 | 5745.00 +- 0.00 |
| dcap-read-num-successful-operations | 5760.00 +- 0.00 | 5745.00 +- 0.00 |
| READV | T3_CH_PSI-MultiTest-JPE_1 | T3_CH_PSI-MultiTest-JPE_2 |
| dcap-readv-total-megabytes | 0.00 +- 0.00 | 0.00 +- 0.00 |
| dcap-readv-total-msecs | 0.00 +- 0.00 | 0.00 +- 0.00 |
| dcap-readv-num-operations | 0.00 +- 0.00 | 0.00 +- 0.00 |
| dcap-readv-num-successful-operations | 0.00 +- 0.00 | 0.00 +- 0.00 |
| SEEK | T3_CH_PSI-MultiTest-JPE_1 | T3_CH_PSI-MultiTest-JPE_2 |
| tstoragefile-seek-total-megabytes | 0.00 +- 0.00 | 0.00 +- 0.00 |
| tstoragefile-seek-total-msecs | 19.42 +- 0.19 | 19.56 +- 0.16 |
| tstoragefile-seek-num-operations | 5760.00 +- 0.00 | 5745.00 +- 0.00 |
| tstoragefile-seek-num-successful-operations | 5760.00 +- 0.00 | 5745.00 +- 0.00 |

Overview plot: T3_CH_PSI-MultiTest-JPE_2-Overview.jpg

Examples

HT vs no HT

As an example, a comparison between an Intel CPU with and without Hyper-Threading (HT) enabled is presented here.

Test machine:

  Name: t3wn29
  Model: Sun Blade X6270
  CPU: 8 × Intel(R) Xeon(R) CPU X5560 @ 2.80GHz
  RAM: 16 GB

The test is made by running N concurrent CMSSW jobs (1 <= N <= 16) reading from a set of files stored on a local disk. The cfg chosen for this test is PAT.

A simple script can be used to perform this test:

#!/bin/bash

CFG=PAT1_local
DIR=Site.T3_CH_PSI-Setting.ParallelTest-Label.HT-Cfg.${CFG}-Set.

# One output directory per concurrency level N
for N in `seq 1 16`; do mkdir ${DIR}_${N}; done

# Two passes; for each N, launch N concurrent jobs in the background
for i in `seq 1 2`; do
    for N in `seq 1 16`; do
        for n in `seq 1 ${N}`; do
            echo $n $N $i;
            ( (echo "Network start " `date +%s && /sbin/ifconfig eth0 | grep bytes` ) > ${DIR}_${N}/${CFG}_eth0-${N}_${n}_${i}.txt ) && \
                ( /usr/bin/time cmsRun -j ${DIR}_${N}/${CFG}-${N}_${n}_${i}.xml ${CFG}.py ) &> ${DIR}_${N}/${CFG}-${N}_${n}_${i}.stdout && \
                ( (echo "Network stop " `date +%s && /sbin/ifconfig eth0 | grep bytes` ) >> ${DIR}_${N}/${CFG}_eth0-${N}_${n}_${i}.txt ) && \
                cat ${DIR}_${N}/${CFG}_eth0-${N}_${n}_${i}.txt >> ${DIR}_${N}/${CFG}-${N}_${n}_${i}.stdout &
        done
        # Crude synchronization: assume all N jobs finish within 20 minutes
        sleep 1200;
    done
done

Here, each set is run twice. This script must be run once for each configuration (HT enabled/disabled). With this script, all the juicy information about the CMSSW jobs is stored in separate directories, one for each number of concurrent jobs. The directory names follow the template KEY.LABEL-KEY.LABEL, so that they can be conveniently processed by the CMSSW Performance Toolkit.

Once the jobs have finished, you can process their outputs with the CPT scripts. First of all, create the ROOT files containing the information:

$ cpt_getJobInfo.py --type=CMSSW NameOfDir

This can also be done in a loop:

$ for i in `ls --color=none | grep T3`; do cpt_getJobInfo.py --type=CMSSW $i; done

Now to create the plots from the ROOT files, e.g. for the sample with HT disabled:

$ cpt_getStats.py Site*noHT*.root

This will output a table containing the requested information and some plots. Various configuration variables are available in the script:

  • filter: quantities to be considered and output in the table
  • negFilter: same as above, but excluding the matching quantities
  • plotFilter: quantities to be plotted
  • summaryPlots: quantities to be plotted in the summary plot (a multipad canvas) and as bar histograms

Furthermore, the information contained in the directory name as KEY.LABEL pairs can be used to configure the legends and the names of the output files, e.g.:

  • PNG_NAME_FORMAT= ['Site',"Cfg","Setting","Label"]
  • legendComposition = ["Set"]

Plots can be saved in PNG format using the --save-png option, while a ROOT file containing all the bar histograms can be created using the --save-root option. This is what we want to do now, so:

$ cpt_getStats.py Site*noHT*.root --save-root
$ cpt_getStats.py Site*.HT*.root --save-root

Two ROOT files are created: T3_CH_PSI-PAT1_local-ParallelTest-HT.root and T3_CH_PSI-PAT1_local-ParallelTest-noHT.root. The histograms can then be compared and plotted using e.g. a ROOT macro like:

void cfrPlots(){
  bool printPng = false;        // set to true to also save each canvas as PNG
  gROOT->SetStyle("Plain");     // PAW-like style
  gStyle->SetFrameBorderMode(0);
  gROOT->ForceStyle();

  // Input files and their legend labels
  TFile *files[10];
  int TOTFILES=0;
  files[TOTFILES++] = TFile::Open("T3_CH_PSI-PAT1_local-ParallelTest-HT.root");
  files[TOTFILES++] = TFile::Open("T3_CH_PSI-PAT1_local-ParallelTest-noHT.root");
  string legend[10];
  legend[0] = "HT";
  legend[1] = "no HT";
  
  int TOTHIST=0;
  string histoNames[100];

  TFile *out = TFile::Open("HTvsNoHT.root","RECREATE");
  TH1F * histos[100][100];
  
  // Collect the histograms from each file (both files contain the same keys;
  // note: TOTHIST must not be re-declared inside this loop, otherwise the
  // plotting loop below would see it as 0)
  for(int f=0;f<TOTFILES;f++){
    TOTHIST=0;
    TIter next( files[f]->GetListOfKeys());
    TKey *key;
    while ((key=(TKey*)next()))  histoNames[TOTHIST++] = key->GetName();
    
    for(int h=0;h<TOTHIST;h++){
      histos[f][h] = (TH1F*) files[f]->FindObjectAny(histoNames[h].c_str()); 
      histos[f][h]->SetLineColor(1+f);
      histos[f][h]->SetLineWidth(2);
    }
  }
  
  // One canvas per quantity, overlaying the HT and noHT histograms
  TCanvas * canvas[100];  TLegend * legends[100];
  for(int h=0;h<TOTHIST;h++){
    canvas[h] = new TCanvas(histoNames[h].c_str(), histoNames[h].c_str());
    legends[h] = new TLegend(0.1,0.2,0.7,0.4); 
    canvas[h]->cd();
    
    for(int f=0;f<TOTFILES;f++) legends[h]->AddEntry(histos[f][h], legend[f].c_str(),"l" );
    histos[0][h]->Draw("");
    for(int f=1;f<TOTFILES;f++) histos[f][h]->Draw("sames");
    legends[h]->SetFillColor(kWhite);
    legends[h]->SetBorderSize(0);
    legends[h]->Draw();
    out->cd();
    canvas[h]->Write();
    
    if(printPng) canvas[h]->SaveAs((histoNames[h]+".png").c_str());
  }
  out->Write();
  out->Close();
}
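
Assuming the macro is saved as cfrPlots.C in the directory containing the two ROOT files, it can be run in batch mode with:

$ root -l -b -q cfrPlots.C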

Resulting plots: ExeTime.jpg, UserTime.jpg, tstoragefile-read-total-msecs.jpg, CpuPercentage.jpg, Ifconfig_MB.jpg

A few things can be seen:

  • The overall execution time does not change significantly between the two configurations until N~11
  • Performance drops when running more than 1 job per core (bottleneck in reading files from a standard HD?)
  • Timing behaves differently with HT and without HT:
    • noHT seems to spend more time reading files, while its UserTime remains constant. The opposite happens for HT... is this due to the pipelining enabled?
  • The rise in time for N=1 may be due to the loading of condition constants etc. (the machine was rebooted before each test). This could also explain the network activity for the first job.

NB: you can retrieve the PAT configuration file and the CPT scripts here: http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/UserCode/leo/Utilities/PerfToolkit/

HT vs no HT improved

Warning!!!: the Y axes still have to be optimized; right now they plot the stated quantity per 30 s interval, so e.g. the data transferred from disk is in bytes per 30 s.

Resulting plots: CpuPercentage.jpg, ExeTime.jpg, tstoragefile-read-total-msecs.jpg, UserTime.jpg, plus the dstat-derived plots for HT and noHT (CPU_User, CPU_Wait, DISK_Read, MEM_Used).
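
The dstat data shown above were gathered on the WN while the jobs were running; an invocation sketch (the exact option set used is an assumption, but the 30 s interval matches the binning mentioned in the warning):

#!/bin/bash
# Record CPU (-c), disk (-d), network (-n) and memory (-m) statistics,
# with timestamps (-t), every 30 s, into a CSV file for later plotting.
dstat -t -c -d -n -m --output dstat-PAT-ParallelTest.csv 30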

Random Ideas

(this part is dedicated to ideas on how to set up and run these tests)

Leo: We could start by running a first set of tests on a common dataset (/QCD_Pt80/Summer09-MC_31X_V3-v1/GEN-SIM-RECO, ~1 TB). We do not actually need to transfer the whole dataset

Leo: It could also be a good idea to put the ROOT files/directories in a common space, in order to cross-check the results (any idea on where?)

-- LeonardoSala - 26-Feb-2010

Topic attachments:
  • CpuPercentage.jpg (14.1 K, 2010-03-09)
  • ExeTime.jpg (13.5 K, 2010-03-09)
  • Ifconfig_MB.jpg (13.7 K, 2010-03-04)
  • T3_CH_PSI-MultiTest-JPE_2-Overview.jpg (30.5 K, 2010-02-26)
  • T3_CH_PSI-PAT-ParallelTest-HT-dstat-CPU_User.jpg (19.6 K, 2010-03-09)
  • T3_CH_PSI-PAT-ParallelTest-HT-dstat-CPU_Wait.jpg (22.6 K, 2010-03-09)
  • T3_CH_PSI-PAT-ParallelTest-HT-dstat-DISK_Read.jpg (22.9 K, 2010-03-09)
  • T3_CH_PSI-PAT-ParallelTest-HT-dstat-MEM_Used.jpg (19.7 K, 2010-03-09)
  • T3_CH_PSI-PAT-ParallelTest-noHT-dstat-CPU_User.jpg (20.3 K, 2010-03-09)
  • T3_CH_PSI-PAT-ParallelTest-noHT-dstat-CPU_Wait.jpg (22.4 K, 2010-03-09)
  • T3_CH_PSI-PAT-ParallelTest-noHT-dstat-DISK_Read.jpg (23.0 K, 2010-03-09)
  • T3_CH_PSI-PAT-ParallelTest-noHT-dstat-MEM_Used.jpg (20.4 K, 2010-03-09)
  • UserTime.jpg (14.2 K, 2010-03-09)
  • tstoragefile-read-total-msecs.jpg (15.3 K, 2010-03-09)