The TauAnalysis/BgEstimationTools Package

software administrators: MichalBluj, ChristianVeelken

Introduction

The TauAnalysis/BgEstimationTools package is dedicated to data-driven methods for estimating the different types of backgrounds contributing to the CMS ElectroWeak analyses of Z --> mu + tau-jet, Z --> e + tau-jet, Z --> e + mu and W --> tau-jet + nu, and to the analysis of H --> tau tau in vector boson fusion. The package is part of the TauAnalysis subsystem.

Currently, three different methods for data-driven background estimation are foreseen to be implemented in the TauAnalysis software:

  • Template method
  • "Generalized" Matrix method
  • Fake-rate technique

Please see the section Information sources for details about these methods.

Template method

Workflow

The workflow of the template method for data-driven background estimation proceeds in three stages:

The purpose of the first stage is to select (from the very loosely skimmed AOD/RECO samples stored on CASTOR) an AOD/RECO event sample containing O(1000) events of each background process in a phase-space similar to that of the final event selection (number of events expected in data of 200 pb^-1 integrated luminosity).

In the second stage, "plain" ROOT trees are produced for the events contained in the loosely selected event sample. Only a subset of the event information is stored in these "plain" ROOT trees, reducing the space needed to store an event to O(100) bytes. Producing the plain ROOT trees from the AOD/RECO samples preselected in the first stage (rather than from the very loosely skimmed AOD/RECO samples stored on CASTOR) has the practical advantage that the "plain" ROOT trees can easily be reproduced for the relatively small number of events contained in the loosely selected event sample, in case new variables get added to the "plain" ROOT tree.

In the third stage, the plain ROOT trees are used to fill "template" histograms of e.g. the visible invariant mass of the tau decay products (Z→μ+τ-jet, Z→e+τ-jet, Z→e+μ channels) or the transverse mass of the tau-jet + MET system (W→τ-jet+ν channel) for all signal and background processes. The template histograms are then fitted to the distribution observed in (pseudo)data after the final event selection criteria have been applied. The contribution of each signal and background process to the final event sample is given by the normalization of its template histogram, as determined by the fit.
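
The idea of the third stage can be illustrated with a minimal sketch. It is not the fit used in the package: a plain least-squares fit of two unit-normalized template shapes stands in for the actual fit, and the function name fit_two_templates as well as all bin contents are made up for illustration.

```python
def fit_two_templates(data, tmpl_a, tmpl_b):
    """Least-squares fit of data ~ a * tmpl_a + b * tmpl_b; returns (a, b)."""
    if not (len(data) == len(tmpl_a) == len(tmpl_b)):
        raise ValueError("all histograms need the same binning")
    # Entries of the 2x2 normal equations.
    s_aa = sum(x * x for x in tmpl_a)
    s_bb = sum(y * y for y in tmpl_b)
    s_ab = sum(x * y for x, y in zip(tmpl_a, tmpl_b))
    s_ad = sum(x * d for x, d in zip(tmpl_a, data))
    s_bd = sum(y * d for y, d in zip(tmpl_b, data))
    det = s_aa * s_bb - s_ab * s_ab
    if det == 0:
        raise ValueError("template shapes are linearly dependent")
    a = (s_ad * s_bb - s_bd * s_ab) / det
    b = (s_bd * s_aa - s_ad * s_ab) / det
    return a, b


# Hypothetical template shapes, each normalized to unit area.
signal_shape = [0.1, 0.4, 0.4, 0.1]
background_shape = [0.4, 0.3, 0.2, 0.1]
# Pseudo-data built from 300 signal and 700 background events.
pseudo_data = [300 * s + 700 * b for s, b in zip(signal_shape, background_shape)]

n_signal, n_background = fit_two_templates(pseudo_data, signal_shape, background_shape)
print(round(n_signal), round(n_background))
```

Because the templates are normalized to unit area, the fitted coefficients can be read directly as event yields of the two processes.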

Skimming of AOD/RECO events

The skimming of a loosely selected AOD/RECO event sample containing O(1000) events of each background process in a phase-space similar to that of the final event selection is implemented in the top-level configuration file skimZto.._cfg.py in the TauAnalysis/BgEstimationTools package (here and in the following, the two dots ".." are placeholders for one of the analysis channels implemented in the TauAnalysis subsystem so far).

The filtering of events is performed by the module MultiBoolEventSelFlagSelector contained in the TauAnalysis/Skimming package. The module takes as input the boolean "EventSelection" flags contained in the TauAnalysis-specific PATTuple, which is produced before the filter module is run:

[Figure: Template method - AOD/RECO level skimming]

Production of "plain" ROOT trees

The production of "plain" ROOT trees is implemented in the top-level configuration file TauAnalysis/BgEstimationTools/test/prodNtupleZto.._cfg.py .

The ROOT tree is filled by the module ObjValNtupleProducer. The module inherits from EDAnalyzer in order to access the event level information stored in the PATTuple and performs the creation and filling of the branches of the ROOT tree. For computing the content stored in each branch, the ObjValNtupleProducer module holds a collection of plugins derived from a common base-class, ObjValExtractorBase. The functionality of these plugins is to extract a floating point number (of double precision) from an edm::Event passed as function argument to the operator() method declared in the ObjValExtractorBase base-class (and implemented in derived classes).

So far, the following plugin classes have been implemented in the TauAnalysis/BgEstimationTools package:

  • ConstObjValExtractor
    The ConstObjValExtractor class does not actually extract any information from the edm::Event. It is a "dummy", which just stores in the branch a constant number (defined via a value configuration parameter) for each event. Such constant values are useful e.g. to store luminosity dependent event weights in the plain ROOT tree, in case Monte Carlo samples of different simulated luminosities are chained together.
  • MultiObjValExtractor
    The MultiObjValExtractor is a bit exceptional in the sense that it is a plugin (inheriting from the ObjValExtractorBase base-class) which itself holds a collection of plugins derived from ObjValExtractorBase. The MultiObjValExtractor class is used internally by other classes of the TauAnalysis/BgEstimationTools package only. As the MultiObjValExtractor is not used for the production of ROOT trees, it is not described in more detail here.
  • NumObjValExtractor
    The NumObjValExtractor class extracts the number of entries in a collection (specified via a src configuration parameter), e.g. the number of muons (passing certain selection criteria) in an event.
  • PATLeptonIpExtractor
    The PATLeptonIpExtractor plugin extracts the transverse impact parameter with respect to the primary event vertex. The implementation of the PATLeptonIpExtractor class is based on TauAnalysis/RecoTools/plugins/PATLeptonIpSelector, the main difference being the adaptation to the interface of the ObjValExtractorBase base-class.
  • PATMuonAntiPionExtractor
    The PATMuonAntiPionExtractor plugin extracts the compatibility of energy deposits in the electromagnetic and hadronic calorimeters and hit patterns in the muon system with the signal expected for a "real" muon. The implementation of the PATMuonAntiPionExtractor class is based on TauAnalysis/RecoTools/plugins/PATMuonAntiPionSelector, the main difference again being the adaptation to the interface of the ObjValExtractorBase base-class.
  • StringObjValExtractor
    The StringObjValExtractor class extracts information from objects stored in collections, by calling member functions of those objects. The implementation of the StringObjValExtractor is based on the physics cut and expression parser. By default, the StringObjValExtractor class calls the member function specified by the value configuration parameter for the first object contained in the collection of objects specified by the src configuration parameter.

For an example of how to use these plugins to fill a plain ROOT tree, please see TauAnalysis/BgEstimationTools/test/prodNtupleZtoMuTau_cfg.py .
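
The plugin design described above (a common base-class whose operator() maps an event to one floating point number, and a producer holding one plugin per branch) can be mimicked in a few lines of Python. This is only an illustrative sketch and not the C++ implementation: the class names ExtractorBase, ConstExtractor, NumExtractor and FirstObjectExtractor, the Muon class, the dictionary used as "event" and the default value returned for empty collections are all assumptions.

```python
from abc import ABC, abstractmethod


class ExtractorBase(ABC):
    """Stand-in for ObjValExtractorBase: maps one event to one float."""

    @abstractmethod
    def __call__(self, event):
        ...


class ConstExtractor(ExtractorBase):
    """Returns the same configured value for every event."""

    def __init__(self, value):
        self.value = float(value)

    def __call__(self, event):
        return self.value


class NumExtractor(ExtractorBase):
    """Returns the number of entries in the collection named by src."""

    def __init__(self, src):
        self.src = src

    def __call__(self, event):
        return float(len(event[self.src]))


class FirstObjectExtractor(ExtractorBase):
    """Calls the member function `value` on the first object in `src`."""

    def __init__(self, src, value, default=-1.0):
        self.src, self.value, self.default = src, value, default

    def __call__(self, event):
        objects = event[self.src]
        if not objects:
            # Assumption: the behaviour for empty collections is not
            # described in the text; a default value is returned here.
            return self.default
        return float(getattr(objects[0], self.value)())


class Muon:
    """Hypothetical object type with a single member function."""

    def __init__(self, pt):
        self._pt = pt

    def pt(self):
        return self._pt


def fill_row(branches, event):
    """Evaluates every extractor once; returns {branch name: value}."""
    return {name: extractor(event) for name, extractor in branches.items()}


branches = {
    "eventWeight": ConstExtractor(0.5),
    "numMuons": NumExtractor("muons"),
    "muonPt": FirstObjectExtractor("muons", "pt"),
}
event = {"muons": [Muon(25.0), Muon(12.0)]}
row = fill_row(branches, event)
```

Keeping every branch behind the same one-number interface means that a new variable only requires a new plugin entry in the configuration, not a change to the code that fills the tree.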

As the plugins extracting the floating point values filled into branches of the plain ROOT tree use event information stored in the PATTuple as input, while the output of the first (skimming) stage is stored in AOD/RECO format, the PATTuple needs to be reproduced in the second stage, prior to execution of the ObjValNtupleProducer module:

[Figure: Template method - production of plain ROOT tree]

It is also possible to refine the event selection again at this stage, but this feature is not yet used (or needed).

The ROOT tree produced by the ObjValNtupleProducer module gets saved automatically (no need for a dedicated module) via TFileService at the end of the event processing.

Fit

The plain ROOT tree produced in the second stage is used as input for the template fit performed in the third (and final) stage of the workflow.

The third stage is defined in the top-level configuration file TauAnalysis/BgEstimationTools/test/fitTemplateZto.._cfg.py . The top-level configuration file consists of three sections:

[Figure: Template method - fit]

All signal and background processes are processed by the same top-level configuration file and in the same cmsRun job. In order to avoid repetition of configuration parameters for the different processes, configuration parameter "templates" are defined in TauAnalysis/BgEstimationTools/python/templateHistDefinitions_cfi.py and imported into the top-level configuration file.

Note: In the "standard" workflow of the template method, the histograms are filled from the information stored in the plain ROOT trees produced in the second stage of the workflow. As an optional alternative, the histograms used as templates in the fit can be produced by any EDAnalyzer that saves its histograms in the format of DQM MonitorElements (e.g. also the GenericAnalyzer module used to fill the histograms after the final event selection).

In the following, the three sections of the fitTemplateZto.._cfg.py top-level configuration file are described in more detail:

Production of template histograms

The booking and filling of template histograms is performed by the module TemplateHistProducer in the "standard" workflow of the template method. Default values for most configuration parameters of that module, including the binning options, are already defined in templateHistDefinitions_cfi.py imported by the top-level configuration file.

The only configuration parameters which need to be defined in fitTemplateZto.._cfg.py , separately for each process, are:

  • fileNames
    The fileNames parameter defines the names of the .root files containing the plain ROOT trees. The .root files are defined in configuration files bgEstNtupleDefinitionsZto.._cfi.py of the TauAnalysis/BgEstimationTools package, so that it is sufficient to e.g. set
         fileNames = fileNames_Ztautau
         
    (see the configuration file bgEstNtupleDefinitionsZtoMuTau_cfi.py of the Z --> mu + tau-jet channel for an example.)
  • treeSelection
    The treeSelection configuration parameter defines a selection performed on the entries of the plain ROOT tree before the template histogram gets filled. The selection needs to be passed in string format, referring to the names of branches in the plain ROOT tree, and is used internally by the TemplateHistProducer module to call the CopyTree() method of the TTree.

Note: The treeSelection parameter needs to implement the cuts that separate the signal/background process whose template histogram is to be produced by the TemplateHistProducer module from all other signal/background processes, i.e. cuts with as high an efficiency as possible for that single process and as low an efficiency as possible for all other signal and background processes. The aim is to obtain more than 1000 events (expected in data of 200 pb^-1 integrated luminosity) together with a purity above 90% for the process under study.

  • meName
    The meName parameter specifies the name of the DQM MonitorElement representing the template histogram produced by the TemplateHistProducer module, e.g.
         meName = cms.string("fitTemplateZtoMuTau/Ztautau_pure/diTauMvis12")
         

Note: In the fitTemplateZto.._cfg.py configuration file, actually two variants of template histograms get produced for each signal/background process: "pure" template histograms which ignore shape distortions due to contributions of signal/background processes other than the one under study (and can be estimated by Monte Carlo only) and "real" template shapes which include these impurities (and can be determined from "real" (pseudo)data).
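
The targets quoted in the note on the treeSelection parameter (more than 1000 events and a purity above 90% for the process under study) amount to simple arithmetic on the event yields passing the selection. The following sketch shows that check; the function name selection_quality, the process labels and all yields are hypothetical and not taken from the package.

```python
def selection_quality(expected_yields, target):
    """Returns (yield of the target process, purity of the selected sample)."""
    total = sum(expected_yields.values())
    if total <= 0:
        raise ValueError("no events pass the selection")
    n_target = expected_yields[target]
    return n_target, n_target / total


# Hypothetical event yields after a treeSelection aimed at W + jets,
# scaled to 200 pb^-1; the numbers are made up for illustration.
yields = {"Ztautau": 20.0, "Zmumu": 35.0, "WplusJets": 1450.0, "QCD": 60.0}
n_events, purity = selection_quality(yields, "WplusJets")
meets_targets = n_events > 1000 and purity > 0.90
```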

Plotting of template histograms

The plotting of the template histograms is performed by the module DQMHistPlotter, which is also used to plot the histograms filled after all event selection criteria have been applied. Default values for most configuration parameters, in particular those specifying drawing options, are again defined in the templateHistDefinitions_cfi.py configuration file.

The parameters that need to be defined in fitTemplateZto.._cfg.py , separately for each process, are:

  • plots.dqmMonitorElements
    The plots.dqmMonitorElements configuration parameter specifies the name of the DQM MonitorElement representing the template histogram to be plotted. It needs to match the value of the meName parameter used to configure the TemplateHistProducer module, except for the process label (e.g. 'Ztautau_pure', 'Ztautau_real'), which needs to be replaced by the special keyword "#PROCESSDIR#".
  • plots.processes
    The plots.processes configuration parameter specifies the name of the signal/background process whose template histogram is to be plotted. Its value needs to match the process label defined in the meName parameter used to configure the TemplateHistProducer module.
  • title
    The title parameter specifies a label printed above the template histogram in the graphics file produced by the DQMHistPlotter module.

In addition, the list of all processes needs to be passed to the processes, drawOptionSets.default and drawJobs configuration parameters of the DQMHistPlotter module. Please see fitTemplateZtoMuTau_cfg.py (top-level configuration file of the Z --> mu + tau-jet channel) as an example.
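
The relation between the meName of the TemplateHistProducer module and the "#PROCESSDIR#" keyword used for plotting can be pictured as a string substitution. How the DQMHistPlotter module resolves the keyword internally is not described here, so the sketch below simply assumes a plain replacement by the process label; the function name expand_me_names is hypothetical.

```python
PROCESS_KEYWORD = "#PROCESSDIR#"


def expand_me_names(me_name_pattern, processes):
    """Replaces the keyword by each process label; returns {process: meName}."""
    if PROCESS_KEYWORD not in me_name_pattern:
        raise ValueError("pattern does not contain " + PROCESS_KEYWORD)
    return {p: me_name_pattern.replace(PROCESS_KEYWORD, p) for p in processes}


pattern = "fitTemplateZtoMuTau/#PROCESSDIR#/diTauMvis12"
me_names = expand_me_names(pattern, ["Ztautau_pure", "Ztautau_real"])
```

With this convention, one pattern covers the template histograms of all processes, provided the process labels match those used in the meName parameters.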

More details about the configuration parameters of the DQMHistPlotter module can be found at the following link.

Fitting of template histograms

The contributions of all signal and background processes to the final event sample are determined in the third stage of the workflow by the normalization factors obtained by fitting the template (shape) histograms to the distribution observed in (pseudo)data.

The fit is performed by the module TemplateBgEstFit contained in the TauAnalysis/BgEstimationTools package.

Configuration parameters needed by the module are:

  • processes
    the names of all signal and background processes to be included in the fit, together with the meNames of the DQM MonitorElements holding the template histograms to be fitted and the drawOptions to be used for control plots of the fit results, in the format
           Ztautau = cms.PSet(
                meName = cms.string("fitTemplateZtoMuTau/Ztautau_pure/diTauMvis12"),
                drawOptions = drawOption_Ztautau
           )
          
    where the definitions of drawOptions to be used for the individual signal/background processes are imported from TauAnalysis/DQMTools/python/plotterStyleDefinitions_cfi.py .
  • data
    the meName of the DQM MonitorElement holding the distribution observed in (pseudo)data.
  • fit
    the range of the (pseudo)data distribution to be fitted, specified by the configuration parameters xMin and xMax, and the name (and title) of the observable whose distribution is fitted (visible invariant mass of the tau decay products in case of the Z→μ+τ-jet, Z→e+τ-jet, Z→e+μ channels; transverse mass of the tau-jet + MET system in case of the W→τ-jet+ν channel).
  • output
    the name of the graphics file containing control plots of the fit results.

Please see again fitTemplateZtoMuTau_cfg.py (top-level configuration file of the Z --> mu + tau-jet channel) as an example.

The implementation of the TemplateBgEstFit module is based on the RooFit toolkit.

"Generalized" Matrix method

Workflow

The workflow of the "generalized" matrix method for data-driven background estimation proceeds in three stages:

The first two stages of the workflow of the "generalized" matrix method are identical to those of the template method. The purpose of the first stage is to select (from the very loosely skimmed AOD/RECO samples stored on CASTOR) an AOD/RECO event sample containing O(1000) events of each background process in a phase-space similar to that of the final event selection (number of events expected in data of 200 pb^-1 integrated luminosity). In the second stage, "plain" ROOT trees are produced for the events passing the selection of the first stage.

In the workflow of the "generalized" matrix method, the plain ROOT trees are then used to fit (pseudo)data events observed in different regions of the phase-space with models of individual signal and background processes, in order to determine the contribution of all signal and background processes in each region.

The basic idea of the fit is to use d independent variables to construct a d-dimensional phase-space and to divide the phase-space into 2^d regions, defined by d thresholds x_i^threshold such that:

[equation: region]

where the sum extends over the d variables that span the phase-space,

[equation: dimValue]

and

[equation: bin]

Provided that the d variables are indeed independent, the contribution N_i^j of signal/background process j to region i can then be modeled by the product

N_i^j = N_j^total * prod_{k=1}^{d} p_k^j(i),   with p_k^j(i) = P_k^j if region i lies below the threshold in the k-th variable and p_k^j(i) = 1 - P_k^j otherwise,

where N_j^total represents the normalization of signal/background process j and P_k^j the probability for events of signal/background process j to have values of the k-th variable x_k below the threshold x_k^threshold. The normalization factors N_j^total as well as all probabilities P_k^j are determined for all signal and background processes j in the fitting stage of the "generalized" matrix method.
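
The factorization can be made concrete with a short sketch that computes the expected number of events of a single process in each of the 2^d regions from its normalization and its per-variable probabilities to lie below threshold. The function name expected_region_yields, the labelling of regions by tuples of flags and all numbers are assumptions made for illustration; the sketch is not the fit itself.

```python
from itertools import product


def expected_region_yields(n_total, p_below):
    """Expected events per region for one process with independent variables.

    p_below[k] is the probability that variable k lies below its threshold.
    Regions are keyed by tuples of flags: 0 = below, 1 = above the threshold.
    """
    yields = {}
    for flags in product((0, 1), repeat=len(p_below)):
        fraction = 1.0
        for flag, p in zip(flags, p_below):
            fraction *= p if flag == 0 else 1.0 - p
        yields[flags] = n_total * fraction
    return yields


# Hypothetical process: 1000 events, d = 3 variables.
region_yields = expected_region_yields(1000.0, [0.9, 0.5, 0.8])
```

For d = 3 this yields 8 regions whose expected contents add up to the normalization, so that a process is described by 1 + d parameters instead of 2^d independent region contents.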

As an example, consider a 3 dimensional phase-space spanned by the 3 variables:

  • muon isolation
  • charge of mu + tau-jet system
  • transverse mass (or alternatively either the acoplanarity angle dPhi between muon and tau-jet or the isolation of the tau-jet)
This phase-space is illustrated in the figure below:

[Figure: generalized Matrix method]

Skimming of AOD/RECO events

See Skimming of AOD/RECO events in section Template method for details.

Production of "plain" ROOT trees

See Production of "plain" ROOT trees in section Template method for details.

Fit

The contributions of all signal and background processes in the different regions of the phase-space are determined in the third stage of the workflow by a single fit, in which the contributions N_i^j of the individual signal/background processes, modeled by the product introduced in section Workflow above, are fitted to the number of events observed in (pseudo)data in the different regions. The fit is implemented in the module TauAnalysis/BgEstimationTools/plugins/GenMatrixBgEstFit , which takes as input the plain ROOT trees produced in the second stage.

All configuration parameters needed by the GenMatrixBgEstFit module are defined in the top-level configuration file fitGenMatrixZto.._cfg.py :

  • processes
    the names of all signal and background processes to be included in the fit, together with the drawOptions to be used for control plots of the fit results, in the format
           Ztautau = cms.PSet(
                fileNames = fileNames_Ztautau,
                drawOptions = drawOption_Ztautau
           )
          
    where the names of the .root files containing the plain ROOT trees are defined in configuration files bgEstNtupleDefinitionsZto.._cfi.py of the TauAnalysis/BgEstimationTools package (see section Production of template histograms) and the definitions of drawOptions to be used for the individual signal/background processes are imported from TauAnalysis/DQMTools/python/plotterStyleDefinitions_cfi.py (as in section Fitting of template histograms).
  • data
    the name of the .root files containing the plain ROOT trees of the (pseudo)data, again as defined in configuration files bgEstNtupleDefinitionsZto.._cfi.py of the TauAnalysis/BgEstimationTools package.
  • treeName
    the name of the plain ROOT tree within each .root file.
  • treeSelection
    the selection performed on the entries of the plain ROOT tree that events need to pass in order to be considered in the fit. (see section Production of template histograms).
  • branches
    the names of the d variables spanning the d-dimensional phase-space to be fitted. For each variable, the name of the branch in the plain ROOT tree holding the variable, the range to be fitted (specified by two configuration parameters xMin and xMax) and the value of the threshold x_i^threshold (specified by the boundaries configuration parameter) need to be defined. All elements of the branches parameter are defined in a separate configuration file, TauAnalysis/BgEstimationTools/python/bgEstBinGridZto.._cfi.py (see the configuration file TauAnalysis/BgEstimationTools/python/bgEstBinGridZtoMuTau_cfi.py of the Z --> mu + tau-jet channel as an example).
  • output
    the name of the ASCII file (defined by the configuration parameter scaleFactors.fileName) into which results of the fit get written and the name of the graphics file containing control plots of the fit results. The configuration parameter scaleFactors.signalRegion defines the region in phase-space in which the dominant signal contribution is expected. The region is specified by a d-dimensional vector representing an (arbitrary) point within that region. The scaleFactors.signalRegion configuration parameter has no effect on the fit. Its sole purpose is to make the interpretation of the fit results easier.
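
Specifying a region by an arbitrary point inside it, as done for scaleFactors.signalRegion, works because every point falls on a definite side of each of the d thresholds. The sketch below shows this mapping under two assumptions that are not taken from the package: a value exactly at the threshold is counted as "not below", and the flags are packed into a single integer by a binary encoding. The function names and all numbers are hypothetical.

```python
def region_flags(point, thresholds):
    """Per-variable flags of the region containing point: 0 = below, 1 = not below."""
    if len(point) != len(thresholds):
        raise ValueError("point and thresholds need the same dimension d")
    return tuple(0 if x < t else 1 for x, t in zip(point, thresholds))


def region_index(flags):
    """Packs the flags into one integer in [0, 2^d); the encoding is an assumption."""
    return sum(flag << k for k, flag in enumerate(flags))


# Hypothetical thresholds for three variables (d = 3).
thresholds = [0.1, 0.5, 40.0]
signal_point = [0.05, 0.0, 20.0]  # any point inside the intended region works
flags = region_flags(signal_point, thresholds)
```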

The implementation of the GenMatrixBgEstFit module is based on the RooFit toolkit.

Prior to the execution of the fit, the level of correlations between the d variables spanning the phase-space to be fitted is estimated. The module ObjValCorrelationAnalyzer, contained in the TauAnalysis/BgEstimationTools package, computes the linear correlation between any pair of variables.

Note: As detailed in the following Wikipedia article, the absence of linear correlations between variables is not sufficient to guarantee that all variables are indeed independent: there can still exist more complicated relationships between the variables, which require more elaborate techniques, e.g. a test of the total correlation, to be found. The test for linear correlations implemented in the ObjValCorrelationAnalyzer module is nevertheless a good starting point. Tests for linear correlations are also used by other CMS EWK analyses, e.g. in the analysis of W --> mu nu (see pages 28-30).
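
Both the test for linear correlations and its limitation can be seen in a small sketch. It computes the Pearson correlation coefficient for every pair of branches and applies it to a case in which one variable is fully determined by the other, yet the linear correlation vanishes. This is not the ObjValCorrelationAnalyzer implementation; the function names pearson and correlation_table and the branch contents are made up for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Linear (Pearson) correlation coefficient of two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


def correlation_table(branches):
    """Correlation coefficient for every pair of branches, keyed by name pair."""
    return {
        (a, b): pearson(branches[a], branches[b])
        for a, b in combinations(sorted(branches), 2)
    }


# Hypothetical branch contents: "y" is fully determined by "x" (y = x^2),
# yet their linear correlation vanishes.
branches = {"x": [-2.0, -1.0, 0.0, 1.0, 2.0], "y": [4.0, 1.0, 0.0, 1.0, 4.0]}
table = correlation_table(branches)
```

A vanishing coefficient in such a table therefore only rules out a linear relationship between two variables, not a dependence in general.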

Configuration parameters of the ObjValCorrelationAnalyzer module are:

  • processName
    the name of the signal/background process for which the correlation between variables is to be tested; used as label when printing the table of correlation coefficients.
  • fileNames
    the names of the .root files containing the plain ROOT trees for the specified process, as defined in configuration files bgEstNtupleDefinitionsZto.._cfi.py of the TauAnalysis/BgEstimationTools package (see section Production of template histograms).
  • treeName
    the name of the plain ROOT tree within each .root file.
  • branches
    the names of the branches holding the variables whose pairwise correlations are to be analyzed.

The sequence of ObjValCorrelationAnalyzer and GenMatrixBgEstFit modules is illustrated in the following figure:

[Figure: generalized Matrix method - fit]

All signal and background processes are processed by the same fitGenMatrixZto.._cfg.py top-level configuration file and in the same cmsRun job.

Fake-rate technique

Information sources

Review status

Responsible: ChristianVeelken

Reviewer/Editor and Date (copy from screen) Comments
Last reviewed by: -- KatiLassilaPerini - 17 Oct 2006 created template page
Last reviewed by: -- ChristianVeelken - 01 Jul 2009 created initial version
