Sideband Subtraction

In an effort to move the functionality of CMS.TagAndProbe from the full framework to FWLite, a generic tool has been developed in order to provide a unified library of relavent statistical methods. The principal statistical method required for Tag-and-Probe is Sideband Subtraction. For this reason, the name of the class for these methods is SideBandSubtract. SideBandSubtract provides functionality to do Sideband Subtraction quickly and easily without imposing conditions on the User. All the knobs that can be turned in Sideband Subtraction are turned by the user because the user owns all of the pertinent objects that SideBandSubtract has.

Furthermore, SideBandSubtract lets you interact with both input and output objects.

Obtaining and Running

SideBandSubtraction is available from the PhysicsTools/Utilities package, and can be obtained in the following way:
scram p CMSSW CMSSW_3_4_1
cd CMSSW_3_4_1/src/
cmsenv
cvs co -r SBS PhysicsTools/Utilities
scram b
If you plan on running from FWLite only, you can skip building with scram, and simple interactively compile the code with ACLiC. Example code can still be accessed from my UserCode area:
cvs co UserCode/DavidBjergaard
You will also check out the UserCode version of SideBandSubtract which is not being officially maintained anymore!

Notes

  • SideBandSubtract has no dependence on FWLite, FWCore, or anything CMSSW related.
  • The example code does depend on FWLite.
  • In order to run the examples two things must be done:
    • UserCode/!DavidBjergaard/bin/fwlite_zmm.C must be edited to point to an input data set. If you use the jpsi_example.C, then UserCode/!DavidBjergaard/bin/fwlite_jpsi.C must be edited.
    • zplots and jpsi_plots must exist, if they don't you run the risk of over-writing the output plots of each example.

jpsi_example.C and zmm_example.C are wrapper macros which handle the mundane business of loading and compiling the libraries and then calling the required functions. They assume that a zplots and jpsi_plots directory exist and when the code is done running, it moves all the produced plots to these folders and over writes any previous plots in them. If you want to save your plots move them outside of this folder, or rename them!

Pedagogical step by step explanation of the side-band subtraction method

Before diving into a detailed description of the interface, please take a moment to read the step by step procedure of a side-band subtraction. Experienced users may want to skip this section, and simply read the interface notes.

The sideband subtraction usually involves at least two variables. In the case of the mass resonances such as Z to mu mu or J/Psi to mu mu, the separation variable used is the invariant mass plot of reconstructed candidates. Any number of other interest variables can be defined. For the purposes of this discussion, let's stick to mass and Pt (transverse momentum). Pt is what we want to measure (interest variable), and mass is what will separate pure background from signal by using the sidebands (intervals away from the Gaussian peak). The mass is called the separation variable because it allows us to separate signal+background areas from background areas. What we are really doing is:

  1. Make a plot of Pt with a cut applied to the separation variable (mass) in order to select pure Background (B) events (from the sidebands). This is (dN/dPt)(B) for the sidebands.
  2. Make another plot of Pt with a cut applied the the separation variable (mass) to select Signal (S) and Background (B). This is a mixture of S+B, so the resulting distribution is (dN/dPt)(S) + (1-f)*(dN/dPt)(B), where f is the signal fraction, f = S/(S+B) in the signal region.
  3. Now we construct a model for dN/dmass(B), from which we can estimate how much more background is in the sidebands than in the signal region. We integrate the sidebands, integrate the signal region, and divide the second by the first. This is multiplicative factor MF, or Signal-To-Sideband ratio.
  4. We take the histogram dN/dPt(B) for the sidebands, scale it by MF (which is usually less than one), and this gives an estimate of dN/dPt(B) for the signal region.
  5. We subtract the estimated dN/dPt(B) for the signal region from the total dN/dPt for the signal region. The resulting distribution is an estimate of dN/dPt(S), which is what we want.

This is most easily demonstrated using visuals. First, we examine our example using functions that are easy to work with in terms of visualizing results. The Model PDF used to generate the mass spectrum is a gaussian and constant background.

https://twiki.cern.ch/twiki/bin/viewfile/CMS/PhysicsToolsDevTagAndProbe?rev=1;filename=fitted_mass_example.gif

In the figure, the side-bands are demonstrated by the dark blue lines and the signal region is represented by the red lines. It is clear from the plot that the area under the red region of the mass spectrum includes signal events and background events. The mass spectrum will serve as the separation variable, allowing us to select areas of the distribution which are pure background (step 1) and areas which are signal and background (step 2).

Since the mass spectrum is used to estimate the signal-to-sideband ratio, it is inappropriate to also perform subtraction on it. Therefore, a number of interest variables must be defined to actually do the subtraction on. In our simple example we limit ourselves to a transverse momentum distribution (Pt), however multiple interest variables can be defined and subtracted in tandem with the SideBandSubtract object. The figure below shows the Pt distribution with no cuts applied. It was generated with two overlapping gaussians in order to more dramatically demonstrate the side-band subtraction method. The left peak is the set of signal events, while the right peak is the set of background events.

https://twiki.cern.ch/twiki/bin/viewfile/CMS/PhysicsToolsDevTagAndProbe?rev=1;filename=pt_no_cut.gif

Our goal in doing the side-band subtraction is to estimate the number of background events in each bin, and then subtracting that number of events from each bin. One effective way to reduce background events is to apply a cut to the mass peak, thus selecting events that are "signal + background." (step 2 from above) The resulting plot below demonstrates that the right gaussian peak is dramatically reduced, but not completely eliminated.

https://twiki.cern.ch/twiki/bin/viewfile/CMS/PhysicsToolsDevTagAndProbe?rev=1;filename=raw_ptSignalHist.gif

The next step is to fit the separation variable with a PDF that models the signal and background functional forms. Once this is done, then a signal-to-sideband ratio can be estimated by integrating the Model PDF through the signal region (this gives signal + background yield), and then integrating the background pdf through the side-band regions (giving the background yield) (step 3). Then all that's left is to subtract the side-band histogram from the signal histogram weighting each bin by the signal-to-sideband ratio. This results in the plot below

https://twiki.cern.ch/twiki/bin/viewfile/CMS/PhysicsToolsDevTagAndProbe?rev=1;filename=sbs_ptSignalHist.gif

Due to per bin fluctuations, some of the bins have negative weights, this is perfectly normal, and it is up to the user on how to treat this in his/her specific situation. Generally, if you need to truncate negative bins, you are probably not doing something completely kosher.

Interface Explanation

For ease of use, the public portion of the header file for SideBandSubtract is shown below with a brief description of each function and what to use it for.

 public:
  SideBandSubtract(RooAbsPdf *model_shape, RooAbsPdf *bkg_shape, RooDataSet* data, RooRealVar* sep_var, std::vector<TH1F*> base, bool verb);
  ~SideBandSubtract();
  void addSignalRegion(float min, float max);
  void addSideBandRegion(float min, float max);
  int doGlobalFit();
  int doSubtraction(RooRealVar* variable,Double_t stsratio,Int_t index); //stsratio -> signal to sideband ratio                                              
  void printResults(std::string prefix="");
  void saveResults(std::string outname);
  RooFitResult* getFitResult();
  std::vector<TH1F> getRawHistos();
  std::vector<TH1F> getSBSHistos();
  Double_t getSTSRatio(); //returns signal-to-sideband ratio                                                                                                 

NOTE: The interface to SideBandSubtract involves a lot of pointers to objects that the user owns. This was done intentionally in order that the user have control over all of the objects that go into and out of the object. This means that the user is more responsible for making sure that stuff doesn't get accidentally deleted before SideBandSubtract is done with the objects.

All of these objects map very directly into objects held by SideBandSubtract.

addSignalRegion(float min, float max)

This is fairly self-explanatory, it defines a region which SideBandSubtract will treat as signal. Most of the time the separation variable will only have one well-defined signal region but there are cases where multiple ones may be used (such as the Lb -> Lc* pi+ pi- resonances) Therefore, you can define multiple signal regions and SideBandSubtract shouldn't choke.

The only caveat in using addSignalRegion (as well as addSideBandRegion) is that neither check to make sure that the regions are dis-joint. This is a responsibility of the user, but these redundancy checks may be added in the future. In either case, the warning still holds: make sure you're regions are well-defined and dis-joint.

addSideBandRegion(float min, float max)

This function behaves exactly the same as addSignalRegion, it just defines the Side Bands of the Separation Variable. It is common to only have sidebands on either side of the signal, but again there are cases where you may want 3+ sideband regions depending on how the analysis is set up.

It is completely possible (and appropriate) to define dis-joint regions which are neither signal or background. This is done by simply not including a interval of the distribution in either signal or sideband categories.

doGlobalFit()

This function does all the heavy lifting. It fits the ModelPDF to the data and does the Sideband Subtraction. It also fills RawHistos and SBSHistos vectors with TH1F histograms that are clones of the base histograms handed to it.

doSubtraction(RooRealVar* variable,Double_t stsratio, Int_t index)

This function has been made public in the case that the end user would like to redo a subtraction without re-fitting the ModelPDF to data. The variable is any variable in the dataset which is not the separation variable, and stsratio is the Signal-To-Sideband Ratio calculated from the model and background PDFs, index is the element of the array which corresponds to the base histogram for the variable there are no checks to make sure the name matches, and the base histogram is used to define the binning. doGlobalFit() does check to make sure that the index of BaseHistos matches the variable to insure that the names are preserved properly across the subtraction.

The only use case where this function is useful, is where the Signal-To-Sideband ratio was calculated by some other means.

printResults()

This function spools through all the collections of results (histograms and fit results) and prints them.

It prints and .eps and .gif of each plot before (designated Raw_) and after subtraction (designated SBS_) It also appends the date to the filename so that the work done on day won't be over-written the next. If you need further granularity in days, it is important to copy the results somewhere safe.

printResults() also prints a fitted plot of the separation variable as well as an ascii text file of the fit results (fit_results.txt)

saveResults(string outname)

While printResults() isn't intelligent enough to prevent over-writing from run to run on the same day, saveResults() is.

saveResults() creates a root file named outname (ie "mySideBandResults_ZmumuAnalysis.root") It is also possible to pass outname as a path to the file.

saveResults() creates the file if it doesn't exists, or opens it to append new results if it does. This way, the user can run SideBandSubtract multiple times with different settings and maintain one file of all the results. The contents of the root file are a series of folders named "run#" where '#' is a count of how many times this file has been used to hold results.

This functionality was implemented for use cases which need to run multiple times and all the information from each run is needed. This is the case in CMS.TagAndProbe during the creation of efficiency plots where each bin of the efficiency variable must be fitted as a slice using SideBandSubtract. Even using 10 corse bins means that SideBandSubtract must run 10 times, producing 2*N_vars plots. Maintaining these results on disk is difficult and inflexible which is why the root file was used instead. Recent versions of the code allow for bin, by bin naming so that individual results can be saved for a particular analysis on disk.

saveResults() also saves some RooFit objects, which aren't easily accessible using the standard TBrowser approach to navigating a root file. Therefore, please see !RooFit Tricks for an explanation of how to retrieve you're results after saveResults() have run.

getFitResult()

Returns the RooFitResult pointer attained from doGlobalFit() for the user's printing and review

getRawHistos()

This returns a vector of histograms which have been cloned from the BaseHistos fed to the constructor. These histograms represent the data before the subtraction has been performed. This object has been exposed to the user in case printResults() and saveResults() fail to satisfy the needs of the user's analysis. In this way, one or all of the raw histograms can be saved to a root file in a larger analysis

getSBSHistos()

This functions exactly as getRawHistos() except that it maintains the subtracted histograms

Nota Bene You must supply a base histogram for each interest variable in the SideBandSubtract object in the form of a vector of empty histogram pointers, the name of the histogram must match the name of the corresponding RooRealVar, using GetName() and SetName() methods here is highly recommended. This was done so that the user would have full control over binning as well as what the plots looked like depending on where they were going to end up (ie a talk, a paper etc)

If the RooPlots produced by SideBandSubtract are unsatisfactory, it is the user's responsibility to make ones which are more fitting since the user owns the original RooRealVars that go into the dataset.

getSTSRatio()

Returns the signal-to-sideband ratio, useful for debugging fits to make sure that everything looks OK. This is printed to stdout when the fit finishes, but sometimes this can be lost in a garble of other text, so the user can retrieve the ratio after everything finishes.

Objects of SideBandSubtract

The following is pedagogical introduction to the various components that go into a sideband subtraction, and they're corresponding objects in the SideBandSubtraction object.

The objects required for Sideband Subtraction are:

  • Data Set (RooDataSet* Data)
  • Separation Variable (RooRealVar* SeparationVariable)
  • Well Defined Signal and Sideband regions (set by addSignalRegion and addSidebandRegion functions respectively)
  • Background Probability Density Function (PDF) (RooAbsPdf* BackgroundPDF)
  • Total PDF (RooAbsPdf* ModelPDF)

Data Set

This is a RooDataSet which holds all of the data involved in the user's analysis. SideBandSubtract has no knowledge of variables which this data set holds with the exception of the Separation Variable. This is not to say that SideBandSubtract doesn't interact with these variables. It is unclear which plots are the most useful, and so a more is better approach is taken to which plots are produced. The user also has access to the individual plots and can choose to only print all the plots, or only those of interest. (see above)

Separation Variable

This is the variable which is sensitive to signal and background. It is chosen such that there are well-defined Signal and Sideband regions. In the case of J/Psi to mu mu, the Separation Variable is the dimuon mass. The same is true of Z to mu mu. This is the variable which is used to define the Sideband and signal regions. SideBandSubtract expects that this is a RooRealVar*

Signal and Sideband regions

These are regions which define the difference between signal and sideband. They are defined with max and min values using the addSignalRegion(float min, float max) and addSideBandRegion(float min, float max) methods. As of this writing, there is no check to make sure that the user doesn't overlap these definitions. Future version of this code will hopefully include these redundancy checks, but it is hoped that the end user is intelligent about where the regions are defined.

Background PDF

This describes the background of the Separation Variable. It is used to find the Signal-to-Sideband ratio which is ultimately used to do the subtraction. SideBandSubtract expects a pointer to RooAbsPdf

Misc. Requirements

While not essential to the operation of a Sideband Subtraction, SideBandSubtract requires somethings in order to function properly. Since no assumptions about what kind of analysis the user wants to do are made, SideBandSubtract is pretty dumb. For this reason, the user must specify a set of empty histograms corresponding to each interest variable for it to define the style of all the produced plots. This allows the user to define his/her own binning. The user also must specify whether or not to have verbose output. NOTE It is absolutely imperative that the names of the interest variable Histograms match the names of the RooRealVars that are actually used in the subtraction. If this doesn't happen, no subtraction will occur, and it will look like nothing happened! Better error handling is planned for future releases.

Using the interface

This section will walk you through building the required pieces in order to make a SideBandSubtract object and do the subtraction. Impatient users should be able to get what is going on with these lines:
  TH1F SubtractedResult("SubtractedResult","Z",100,0,150);
  SideBandSubtract sbs(&Model,&Background,&ZData,&ZMass,basehistos,true);
  sbs.addSideBandRegion(20,60);
  sbs.addSignalRegion(60,110);
  sbs.addSideBandRegion(110,140);
  sbs.doGlobalFit();
  sbs.printResults();
  sbs.saveResults("zmumu_output_results.root");

For the Z to mu mu example, first we take a file (in this case a PAT-tuple):

 //Load data and declare helper variables (counters etc)
  TFile  *file = new TFile("~/CMSSW/DATA/PAT_ZMuMuRelVal_25K.root");
  Event ev(file);
Then we define a set of observables and make a dataset from them:
  //Define our variables to play with (Observables)                                                                                                          
  RooRealVar ZMass("ZMass","Z Mass",0,150,"GeV/c^2");
  RooRealVar ZPt("ZPt","Z p_T",0,300,"GeV/c");
  RooRealVar muon1Pt("muon1Pt","1st #mu p_T",0,300,"GeV/c");
  RooRealVar muon2Pt("muon2Pt","2nd #mu p_T",0,300,"GeV/c");
  RooArgSet varSet(ZMass,ZPt,muon1Pt,muon2Pt);
  RooDataSet ZData("ZData","Z candidate data",varSet);
  //These are base histos with custom binning. 
  //*NOTE* The _names_ of the objects must match exactly! 
  TH1F ZMassHist("ZMass","Z Mass",100,0,150);
  TH1F ZPtHist("ZPt","Z p_T",100,0,300);
  TH1F muon1PtHist("muon1Pt","1st #mu p_T",100,0,300);
  TH1F muon2PtHist("muon2Pt","2nd #mu p_T",100,0,300);
  vector<TH1F*> basehistos(0);
  basehistos.push_back(&muon1PtHist);
  basehistos.push_back(&muon2PtHist);
  basehistos.push_back(&ZMassHist);
  basehistos.push_back(&ZPtHist);
Next we build a signal, background and model PDF
  //Build a PDF to do sideband subtraction
  //signal
  //mean 3.1 sigma 0.1
  RooRealVar ZMean("ZMean","Mean of Z Mass peak",80,100,"GeV/c^2");
  RooRealVar ZWidth("ZWidth","Width of Z peak",0.0,20,"GeV/c^2");
  RooRealVar ZSigma("ZSigma","Sigma of Z peak",0.0,2,"GeV/c^2");
  ZSigma.setVal(1);
  ZMean.setVal(91.1875);
  ZWidth.setVal(5.6444);
  RooVoigtian Signal("Signal", "Z Signal Peak",ZMass, ZMean, ZWidth, ZSigma);
  //background
  RooRealVar a0("a0","a0",-50,50,"GeV/c^2");
  a0.setVal(26.934);
  RooRealVar a1("a1","a1",-10,10,"GeV/c^2");
  a1.setVal(.545585);
  RooRealVar a2("a2","a2",-10,10,"GeV/c^2");
  a2.setVal(-.0123);
  RooRealVar a3("a3","a3",-10,10,"GeV/c^2");
  a3.setVal(0.0000507846);

  RooPolynomial Background("Background","Z background",ZMass,RooArgList(a0,a1,a2,a3));
  RooRealVar sig_frac("sig_frac","Signal Fraction",.91,0,1);
  RooRealVar bkg_frac("bkg_frac","Background Fraction",0.29,0,1);
  RooAddPdf Model("Model", "Model Z distribution",RooArgList(Signal,Background),RooArgList(sig_frac,bkg_frac));
  
Normally, the background should be exponentially falling. If there is a tight P_t cut, we kill the exponential background, and so its more appropriate to fit with a polynomial background, otherwise we should fit with an exponential+erfc.

Next we loop over the data, calculate the observables and put them into the dataset. Then we define the subtraction object and call doGlobalFit();

for(event loop)
   {
     //selection cuts
   }  
  SideBandSubtract sbs(&Model,&Background,&ZData,&ZMass,basehistos,true);
  sbs.addSideBandRegion(20,60);
  sbs.addSignalRegion(60,110);
  sbs.addSideBandRegion(110,140);
  sbs.doGlobalFit();

This example has been written in FWLite, however it should be made clear that SideBandSubtract is insensitive to how the RooDataSet is made, and so as long as the user fills a data set, makes the required pdfs and adds appropriate regions, a Sideband Subtraction can be performed.

-- DavidBjergaard - 19-Jan-2010 -- SudhirMalik - 19-Jan-2010

Edit | Attach | Watch | Print version | History: r2 < r1 | Backlinks | Raw View | WYSIWYG | More topic actions
Topic revision: r2 - 2010-01-19 - unknown
 
    • Cern Search Icon Cern Search
    • TWiki Search Icon TWiki Search
    • Google Search Icon Google Search

    CMSPublic All webs login

This site is powered by the TWiki collaboration platform Powered by PerlCopyright & 2008-2019 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback