General logic of the Skim Framework:

The framework for skimming the data and producing the final (latinos) flat trees is organized in three logically distinct steps. The first step, commonly called the initial skim, takes care of selecting interesting events, but it also implements data pruning and data boosting. The second, intermediate step, which we identify as event building, combines information from the different collections of leptons, jets, tracks, etc. into a single SkimEvent object. The product of this step can be saved on disk in an EDM file or kept transiently in memory to be accessed by the next step of the skimming procedure. The third and final step, which we call final trees production, accesses the ancillary methods of the SkimEvent class to calculate inter-collection observables such as projectedMET, transverseMass, etc. and dumps into a flat ROOT tree all the information needed to run the rest of the analysis.

In the following, the three steps are analyzed in more detail.

Framework Installation:

To get a fresh install do:

   cmsrel CMSSW_X_Y_Z[_patchK]
   cd CMSSW_X_Y_Z[_patchK]/src
   cmsenv
   cvs co -rTAG -d WWAnalysis UserCode/Mangano/WWAnalysis
   eval `cat WWAnalysis/Misc/pleaseEvalMeOnInstall`
   scram b -j8

where CMSSW_X_Y_Z[_patchK] and TAG have to be chosen according to the Tags google doc in the latino collection (you might need to ask for access to it).

Initial Skim:

The main goal of this step is to reduce the size of the EDM files from the DATA and MC productions through data skimming and data pruning. It also runs those algorithms that require data collections that are available only in the full AOD files and that cannot be accessed later in the analysis workflow. Everything is defined in the single configuration file latinosYieldSkim.py, which includes and uses several modules defined in the WWAnalysis/SkimStep/python folder.

  • data skim: a pair of leptons is looked for in the collections of muons, gsfElectrons and taus, without applying any requirement on charge, isolation or ID of the leptons. If there is no pair of such loose leptons, the event is usable neither for the base di-lepton analysis nor for the same-sign tests, independently of which ID or isolation requirements are eventually applied: in this case the event is skimmed out and not propagated to the rest of the analysis workflow. The initial di-lepton collection is defined by the allDiLep python module defined here.

  • data pruning: to reduce the size on disk of the skim output, data pruning is implemented by dropping some large collections of AOD products, like the Tracker or Calorimeter clusters, the collection of tracks, etc. The exact list of the collections which are kept or dropped is defined by the configuration of process.out.outputCommands at the end of the latinosYieldSkim file. Some collections are not dropped completely, but are replaced by smaller collections which are lighter on disk: this is the case, for example, for the collection of generator particles, which is replaced by the smaller prunedGen collection containing only descendants of the Higgs decay or particles from the summary block of the generator list.

  • data boosting: finally, data boosting is done by calculating those observables that require the availability of the full AOD data content (like track-based isolation, track-corrected MET, etc.) and storing the new quantities in the EDM files so that they can be used later in the analysis workflow. Most of these are per-lepton quantities, and therefore it is convenient to store them as userFloats embedded directly into the lepton objects. The PatMuonBooster and PatElectronBooster classes take care of converting the muon and electron collections into collections of patMuons and patElectrons, respectively, and they embed in each object the information about several types of isolation, impact parameter observables and other variables to be used for ID. For the pat electrons, the output of the electron MVA selection (2011-style) is also saved. Each of these quantities can be retrieved from the corresponding lepton with a simple call like lepton.userFloat("name of the variable"), as in the short sketch after this list. Other examples of observables which are not per-lepton, but that are calculated during the skim step, are the rho density and the collection of PV-associated PF candidates (see process.reducedPFCands and the corresponding plugins). The PatJetBooster embeds a few extra variables in the jets as well.
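A minimal FWLite sketch of how such an embedded variable can be read back from a skim file (the file name, the module label and the userFloat name below are purely illustrative, not the actual ones used in the skim):

    import ROOT
    from DataFormats.FWLite import Events, Handle

    events = Events("latinosYieldSkim.root")        # assumed output file of the skim step
    muons  = Handle("std::vector<pat::Muon>")

    for event in events:
        event.getByLabel("boostedMuons", muons)     # assumed label of the PatMuonBooster output
        for mu in muons.product():
            # any variable embedded by the booster can be read back by name
            print(mu.pt(), mu.userFloat("nameOfTheVariable"))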

To produce a latino skim from AOD, latinosYieldSkim.py has to be edited to replace (a sketch of the edited lines is shown after this list):

  • RMMEMC by True when running on MC, or False when running on data
  • RMMEGlobalTag by your actual GT
  • The first RMMEFN by the AOD input (or leave it to crab if not running interactively)
  • The second RMMEFN by your desired output file name
before submitting via crab or running it locally.
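
A hedged sketch of what the edited lines amount to (the attribute names below are the usual CMSSW configuration ones; the actual lines in latinosYieldSkim.py may look slightly different, and all values are examples):

    # illustrative values only; these lines assume the usual cms/process config objects
    isMC = True                                                                # RMMEMC: True for MC, False for data
    process.GlobalTag.globaltag = 'START42_V13::All'                           # RMMEGlobalTag: your actual GT
    process.source.fileNames = cms.untracked.vstring('file:input_AOD.root')    # first RMMEFN: the AOD input
    process.out.fileName = cms.untracked.string('latinosYieldSkim.root')       # second RMMEFN: the output file name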

N.B.: If you want to run the next steps on crab, the output files need to be registered in the cms_dbs_ph_analysis_02 DBS instance, and the new sample DBS name should be added to stepOneDatasets_cff.py and to scaleFactors_cff.py. See below for more details about the format of these two files.

Event Building:

The purpose of this step is to combine the information from the separate collections of objects in a convenient way, so that inter-collection observables like projectedMet, lepton-jet cleaning, transverseMass, etc. can be retrieved quickly and easily by the user. The SkimEvent class is meant to be used to build one or more event hypotheses per event, depending on the number of lepton pairs in the reconstructed event. The class stores internally the references to the leptons and jets in their original collections and implements a long list of ancillary methods that can be used to calculate several observables: for the complete list and the exact definitions, it is easier to refer directly to the class defined here. Originally this step was run stand-alone and the collection of SkimEvents was saved on disk in an EDM file together with most of the collections produced by the skim step (on which the SkimEvent depends through the references to leptons, jets, vertices, etc.). However, because of the large overlap with the output of the skim step itself (wasted disk space) and because the production of the event hypotheses is relatively fast, the event building is now run simultaneously with the final tree production and the collection of SkimEvents is added to the edm Event only as a transient product which is used to create the final trees.

The original configuration file which was used to run the event building standalone is here. One important part of this configuration file (which is also used by the new configuration described in the next section) is the inclusion of the wwElectrons and wwMuons python modules: these configuration files completely define both the muon and the electron identification and isolation selections. Exploring these files, one can find the exact definition of all the different flavors of isolation that we used, the impact parameter selections, and also the loose-muon and loose-electron definitions used for the W+jets studies. The convenient feature of this setup (and of the embedding of the key variables within the pat muon and electron objects) is that all the lepton selections can be defined by strings in these two python files, without having to recompile any C++ class.
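
A hedged illustration of this string-based pattern (the module label, the cut values and the userFloat name are made up for the example; the real definitions live in wwMuons/wwElectrons):

    import FWCore.ParameterSet.Config as cms

    # a string-cut selector: tightening or loosening the selection only means
    # editing this string, no C++ recompilation is needed
    exampleWWMuons = cms.EDFilter("PATMuonSelector",
        src = cms.InputTag("boostedMuons"),
        cut = cms.string("pt > 20 && abs(eta) < 2.4 && userFloat('relIso') < 0.15"),
    )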

Final Trees production:

During this step, the SkimEvent collection is read from the edm Event and all the relevant observables are saved in the latinos tree. This is a flat tree, storing only simple integer and float variables per event hypothesis. This means that in principle there can be more than one entry per event in the tree, if more than two leptons are reconstructed in the event. However, in its default configuration, the tree producer uses only those leptons passing the full lepton selection and only those SkimEvents satisfying the extra-lepton veto: the former greatly reduces the chance of having multiple lepton pairs, while the latter basically reduces it to zero.

The list of variables that are added to the tree is defined in the configuration file WWAnalysis/AnalysisStep/python/step3_cff.py. Each line like:

    jettche1 = cms.string("leadingJetBtag(0,'trackCountingHighEffBJetTags',0,5.0)"),

defines the name of the variable that is added to the tree (in this example jettche1) and specifies the call to the SkimEvent method that computes the variable itself (in this example the method SkimEvent::leadingJetBtag(0,'trackCountingHighEffBJetTags',0,5.0) is called, i.e. the result of the trackCountingHighEffBJetTags discriminator for the leading jet in the event).
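
Adding a new branch follows the same pattern: one extra line with a branch name and a string to be evaluated on the SkimEvent. For instance, a branch for the sub-leading jet would presumably look like the following (this assumes the first argument of leadingJetBtag is the jet index; check the SkimEvent class for the exact signature):

    jettche2 = cms.string("leadingJetBtag(1,'trackCountingHighEffBJetTags',0,5.0)"),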

As anticipated in the previous section, the current setup of the workflow reads the output of the skim step, produces the transient collection of SkimEvents and returns the final latinos tree in one go. To prepare the python configuration file to be used for this operation, there is a dedicated python script, makeStep3.py. It can be run with:

python makeStep3.py -2 -c string 

where:

  • the option -2 means that you also want to run step2 (i.e. the event building), which is necessary to make the trees from the skim output files
  • the option -c means that you want to have the crab cfg file prepared
  • "string" identify the set of datasets you want to run on. It has to correspond to one of the block defined in the file in WWAnalysis/AnalysisStep/python/scaleFactors_cff.py. For examle you can use "dataSamples" for running over all the datasamples; "allBackgroundSamples" for running over all the MC background samples; or define a new set of datasets that you like. You can exclude one sample from the pre-defined list of dataset just commenting the line with a "#". Right now the ids listed in scaleFactors_cff.py match the correct dataset ids as specified in the file python/stepOneDatasets_cff.py. However, if you want to run on a newly produced dataset, you have to edit both files according to this convention. Every time you edit the file in python folder, you also have to recompile the folder. N.B.: Real data should be defined without a scale factor as this feature is used in the main parser script to distinguish data from MC.

At the bottom of makeStep3.py there are some lines related to the preparation of the crab.cfg template; you may have to change something there for your needs.

N.B.: After you have submitted all the crab jobs, you will get the ROOT trees in a format which is slightly different from the final latinos format. The ROOT files will contain 4 directories with a tree for each channel (ee, mm, etc.). Also, the name of the tree will be "probe_tree" instead of "latino". You can convert any of these files into the standard format using the script test/step3/convertAllTrees.sh.

The default input collections of muons and electrons can be changed, so that the event hypotheses saved in the trees are built not only from the leptons passing the full lepton selection, but from some looser definition of leptons instead. These input collections are controlled by editing the muon, ele and softmu strings in WWAnalysis/AnalysisStep/test/step3/step3.py.
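
Purely as an illustration (the right-hand-side module labels are assumptions, not the actual defaults), the relevant strings in step3.py look conceptually like:

    muon   = "wwMuons"       # e.g. fully selected muons (label assumed)
    ele    = "wwElectrons"   # e.g. fully selected electrons (label assumed)
    softmu = "wwMuons"       # soft muons (label assumed); point these to looser collections if needed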

Finally, in case one wants to have a latinos tree with same-sign lepton pairs (for example for the same-sign W+jets closure tests), it is necessary to edit the following line of the WWAnalysis/AnalysisStep/python/step3_cff.py configuration file:

    cut = cms.string("q(0)*q(1) < 0 && !isSTA(0) && !isSTA(1) && "+
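
A minimal sketch of the edit, assuming one simply inverts the charge-product requirement and leaves the rest of the selection string unchanged:

    cut = cms.string("q(0)*q(1) > 0 && !isSTA(0) && !isSTA(1) && "+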

Another important feature of the final tree production step is that it reads from the configuration files the cross-sections of the different processes, the number of simulated events, the PU and pt-reweighting values, etc., and it saves them into the tree to facilitate the proper event reweighting during the rest of the analysis. Some of these weights are described in detail in the next sections.
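
A hedged sketch of how such per-event weights are typically combined downstream when filling histograms from a latino tree (the file and branch names below are illustrative; the actual weight branches are listed in the following sections):

    import ROOT

    f    = ROOT.TFile.Open("latino_sample.root")       # illustrative file name
    tree = f.Get("latino")
    hist = ROOT.TH1F("mll", "dilepton mass", 50, 0., 200.)
    for ev in tree:
        # for MC: combine the stored normalization and correction weights
        weight = ev.baseW * ev.puW * ev.effW * ev.trigW   # branch names assumed; data is unweighted
        hist.Fill(ev.mll, weight)                          # "mll" branch assumed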

Pile-up Weighting:

This is also the step where the PU weights are placed in the tree. The pile-up distributions for MC and data are stored in WWAnalysis/Misc/Scales:

  • pu2011AB.root - The full 2011 data PU distribution
  • pu2011A.root - The 2011 Run A data PU distribution
  • pu2011B.root - The 2011 Run B data PU distribution
  • s4MCPileUp.root - The S4 MC PU distribution
  • s6MCPileUp.root - The S6 MC PU distribution

These are then converted to python arrays via the script WWAnalysis/Misc/scripts/createPileUpVector.py. New pileup distributions have to be added to this script, so that the proper distributions are included in the python config file WWAnalysis/Misc/python/certifiedPileUp_cfi.py. This file is then included by WWAnalysis/AnalysisStep/python/pileupReweighting_cfi.py, where a reweighting vector is created (see e.g. s62011AB, which re-weights the S6 MC distribution to the full 2011 run distribution). These vectors are then passed to the re-weighting C++ EDProducer CombinedWeightProducer, which automatically adds the re-weight to the tree. For example, search for puWeightS6AB in both WWAnalysis/AnalysisStep/python/pileupReweighting_cfi.py and WWAnalysis/AnalysisStep/test/step3/step3.py.
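
Conceptually the reweighting vector is just the bin-by-bin ratio of the normalized data and MC pile-up distributions. A hedged sketch (the histogram name inside the ROOT files is an assumption):

    import ROOT

    fdata = ROOT.TFile.Open("pu2011AB.root")     # data PU distribution (from Misc/Scales)
    fmc   = ROOT.TFile.Open("s6MCPileUp.root")   # S6 MC PU distribution
    hdata = fdata.Get("pileup")                  # histogram name assumed
    hmc   = fmc.Get("pileup")
    hdata.Scale(1.0 / hdata.Integral())
    hmc.Scale(1.0 / hmc.Integral())

    # one weight per number-of-interactions bin; this is the kind of vector
    # that ends up in pileupReweighting_cfi.py
    weights = [hdata.GetBinContent(i) / max(hmc.GetBinContent(i), 1e-9)
               for i in range(1, hmc.GetNbinsX() + 1)]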

Pile-up weighting 2012 (addition of variables to the trees):

The main code is downloadable from:

    
    cvs co -d HWWAnalysis UserCode/Thea/HWW2L2Nu/HWWAnalysis/
    cd HWWAnalysis/ShapeAnalysis
    source test/env.sh

The gardener.py module that adds the pile-up weight is:

   puadder

For example:

    gardener.py puadder \
        latino_085_WgammaToLNuG_INPUTTREE.root \
        latino_085_WgammaToLNuG_OUTPUTTREE.root \
        --mc=PileupMC_60bin_S10.root \
        --data=RunAB_60bin_69400pb.root \
        --HistName=pileup \
        --branch=puWABtrue \
        --kind=trpu

where "mc" is a root file with a histogram with pu distributions from MC (see here for true distribution), "data" is the distribution from data,"HistName" is the name of the histogram in the previous files, "branch" is the name to be given to the new branch and "kind" is the kind of pu reweighting to be used (trpu = true pile-up, itpu = observed pu distribution).

To process all the files in a folder do:

    gardener.py puadder -r INPUTFOLDER OUTPUTFOLDER \
        ...

and the other options as before.

Step4: addition of variables to the trees:

There are several variables that can be added to the trees.

The main code is downloadable from:

    
    cvs co -d HWWAnalysis UserCode/Thea/HWW2L2Nu/HWWAnalysis/
    cd HWWAnalysis/ShapeAnalysis
    source test/env.sh

The additional variables are:

    effW, the lepton efficiency scale factor
    trigW, the trigger scale factor
    puW, pu weight (nominal and up/down scale) 

    xyShift 
    dymvaVar 

    mT2 variable
    the unboosted variables (in "R frame")
    chess variable, for 2D analysis
    susyVar (MR using jets)  
    WobblyBinVar, for 1D analysis with variable binning width 
    likelihoodQGVar, addition of quark gluon likelihood in the tree (*)

The code clones a step3 tree (or a list of trees) and adds the new variables.

To use the code, the basic instruction is:

    gardener.py  NAMEVAR  input.root  output.root  

or, to execute on a folder

    gardener.py  NAMEVAR  -r inputFolder  outputFolder

The available NAMEVAR values are defined in gardener.py, whose implementations are in the tree folder.

For 2D analysis, have a look at these slides.

(*) For the quark-gluon likelihood calculation, the corresponding CMSSW module has to be downloaded and a patch has to be applied. Just do:

   cd HWWAnalysis/
   cvs co -d QuarkGluonTagger -r HEAD UserCode/pandolf/QuarkGluonTagger
   cp ShapeAnalysis/data/BuildFile.xml_patch                QuarkGluonTagger/BuildFile.xml
   cp ShapeAnalysis/data/classes.h_patch                    QuarkGluonTagger/src/classes.h
   cp ShapeAnalysis/data/classes_def.xml_patch              QuarkGluonTagger/src/classes_def.xml
   cp ShapeAnalysis/data/QGLikelihoodCalculator.cc_patch    QuarkGluonTagger/src/QGLikelihoodCalculator.cc

   cp QuarkGluonTagger/data/QGTaggerConfig_nCharged_AK5PF.txt      ShapeAnalysis/data/QGTaggerConfig_nCharged_AK5PF.txt
   cp QuarkGluonTagger/data/QGTaggerConfig_nNeutral_AK5PF.txt      ShapeAnalysis/data/QGTaggerConfig_nNeutral_AK5PF.txt
   cp QuarkGluonTagger/data/QGTaggerConfig_ptD_AK5PF.txt           ShapeAnalysis/data/QGTaggerConfig_ptD_AK5PF.txt

   scramv1 b -j 8

   cd ShapeAnalysis
   source test/env.sh
 
 
 

Lineshape re-weighting:

This step takes into account the complex-pole scheme in the POWHEG samples, reweighting the mH spectrum so that the resonance has the expected lineshape.

To install:

 cd CMSSW_X_Y_Z/src
 cvs co -d MMozer UserCode/MMozer/powhegweight
 scramv1 b -j 8  

To use it (the Higgs mass has to be specified for each sample):

    gardener -m <mass> -p <process> fileIn fileOut

where <mass> is the Higgs mass (120, 125, ...) and <process> is either qqH (VBF production) or ggH (gluon fusion). For example:

    gardener -m 125 -p ggH ggH125.root newggH125.root
 

-- AndreaMassironi - 18 Dec 2013
