PAT Examples: Top Quark Analysis (September 2009)



This page was created for the PAT Tutorial June 2009 and updated for the PAT Tutorial September 2009. You can find the corresponding talks on the Indico agenda (June 2009, September 2009).

This tutorial will show you examples of usage for the Top Quark Analysis Framework (TQAF). For this purpose, the semi-leptonic ttbar channel with muons was chosen.

In the following you will learn:

  • how to produce an object of the TtSemiLeptonicEvent class that contains six different event hypotheses plus generator information
  • how to access the candidates from the hypotheses, the generator particles and meta information from the TtSemiLeptonicEvent
  • how to perform a training for the multivariate analysis (MVA) that is used to find the correct jet-parton association
  • how to change the algorithm that is used for the generator particle based jet-parton matching

How to get the code

First check out the latest tags of PAT and TQAF for CMSSW_3_1_2 as described in the TQAF recipes. Then check out a specific version of the TQAF Examples package for this tutorial:

cmsrel CMSSW_3_1_2
cd CMSSW_3_1_2/src
cvs co -r V06-02-05 AnalysisDataFormats/TopObjects
cvs co -r V06-02-05 TopQuarkAnalysis
cvs co -r PatTutorial_September2009 TopQuarkAnalysis/Examples

How to run the code

Info This section introduces the most important files and commands for this tutorial. They will be described in detail in the subsequent section, in which you will also be asked again to run the code but that time with higher statistics (as checked out from CVS, the jobs in the test directory will only run on 100 events).

Don't forget to compile the code (maybe on more than one core, e.g. with four processes simultaneously, to make it faster):

scram b -j4

You can now run a first test:

cmsRun TopQuarkAnalysis/Examples/test/

You will get a file named analyzeTopHypothesis.root which you can inspect using a root macro that is provided in the Examples package:

root -l
.x TopQuarkAnalysis/Examples/test/analyzeTopHypotheses.C+

Tip, idea The macro does not only display all histograms on your screen, it also writes them into a postscript file. So if you feel that the drawing of the canvases with the histograms takes too much time, you can significantly speed things up by running root in batch mode and look at the results with a tool like ghostview afterwards:

root -l -b -q TopQuarkAnalysis/Examples/test/analyzeTopHypotheses.C+

This is also useful to compare different results since you could run the HypothesisAnalzer and the root macro, simply rename the postscript file that was produced, re-configure the HypothesisAnalyzer, run the whole machinery again and then compare the new with the old postscript file.

In order to perform a new training for the MVA based event hypothesis, first do:

cmsRun TopQuarkAnalysis/Examples/test/

You will get a file named train_save.root which you pass to the mvaTreeTrainer:

mvaTreeTrainer --xslt TopQuarkAnalysis/TopJetCombination/data/TtSemiLepJetCombMVATrainer.xml TopQuarkAnalysis/TopJetCombination/data/TtSemiLepJetComb.mva train_save.root

You should then run the HypothesisAnalyzer again.

Find out more about the details

The HypothesisAnalzer and the JetCombMVATrainTreeSaver run on (different) input files that are part of a PAT tuple created from a Pythia ttbar sample (already with a standard event selection for the semileptonic ttbar channel with a muon).


If you have a look into the cff file, you can see how the selectedLayer1 objects and clones of the standard PAT filters are used for a very quick but realistic, PAG specific event selection.

import FWCore.ParameterSet.Config as cms

# require exactly one isolated muon above 20 GeV in the central region

from CMS.PhysicsTools.PatAlgos.selectionLayer1.muonSelector_cfi import *
leadingMuons = selectedLayer1Muons.clone()
leadingMuons.src = 'selectedLayer1Muons'
leadingMuons.cut = 'pt > 20. & abs(eta) < 2.1 & (trackIso+caloIso)/pt < 0.1'

from CMS.PhysicsTools.PatAlgos.selectionLayer1.muonCountFilter_cfi import *
countLeadingMuons = countLayer1Muons.clone()
countLeadingMuons.src = 'leadingMuons'
countLeadingMuons.minNumber = 1
countLeadingMuons.maxNumber = 1

# require at least four jets above 30 GeV in the central region

from CMS.PhysicsTools.PatAlgos.selectionLayer1.jetSelector_cfi  import *
leadingJets = selectedLayer1Jets.clone()
leadingJets.src = 'selectedLayer1Jets'
leadingJets.cut = 'pt > 30. & abs(eta) < 2.4'

from CMS.PhysicsTools.PatAlgos.selectionLayer1.jetCountFilter_cfi import *
countLeadingJets = countLayer1Jets.clone()
countLeadingJets.src = 'leadingJets'
countLeadingJets.minNumber = 4

ttSemiLepEvtSelection = cms.Sequence(leadingMuons *
                                     countLeadingMuons *
                                     leadingJets *

The applied cuts are:

  • exactly one isolated muon above 20 GeV in the central detector region and
  • at least four jets above 30 GeV in the central detector region

If you look at the file


you see three sequences in the path:

process.path = cms.Path(process.makeGenEvt *
                        process.makeTtSemiLepEvent *

The sequences makeGenEvt, makeTtSemiLepEvent and analyzeHypotheses will be described in the following.

The TtGenEvent

The sequence makeGenEvt is used to produce the TtGenEvent, a container class that keeps the genParticles of the ttbar decay chain and additionally provides some meta information obtained from the Monte Carlo generator information for each event. This sequence is defined in:


The TtSemiLeptonicEvent

The sequence makeTtSemiLepEvt produces the TtSemiLeptonicEvent, a container class that provides a set of different event hypotheses for semileptonic ttbar events plus some meta information. The main differences between the hypotheses consist in the algorithms used to derive the jet-parton association within the events. The makeTtSemiLepEvt sequence is defined in:


If you uncomment the line

process.ttSemiLepEvent.verbosity = 1

and run the, as described above, you will get a printout like the following for each event:

 TtGenEvent says: Semi-leptonic TtBar, Muon Channel
 Number of available event hypothesis classes: 6
 - JetLepComb: LightP LightQ  HadB   LepB  Lepton
 * JetLepComb:   1      2      3      0      0
 * JetLepComb:   1      2      3      0      0
 * JetLepComb:   1      2      3      0      0
 * JetLepComb:   1      2      0      3      0
 * Sum(DeltaR) : 1.92811
 * Sum(DeltaPt): 281.127
 * JetLepComb:   0      2      1      3      0
 * Method  : ProcLikelihood
 * Discrim.: 0.770072
 * JetLepComb:   1      2      0      3      0
 * Chi^2      : 0.0588122
 * Prob(Chi^2): 0.971022

You can see from this that all 6 existing algorithms for event hypotheses in the TtSemiLeptonicEvent were run and you can read from the tabular printout which jet was assigned to which parton, where the indices refer to the selectedLayer1Jets from PAT which were used as input for the algorithms. And you can see which of the selectedLayer1Muons was chosen as the lepton for the event hypothesis. In addition, you already find some meta information in this standard printout like the summed deltaR of the GenMatch, the MVA discriminator value and the final chisquare of the kinematic fit.

The HypothesisAnalyzer

Finally, the sequence analyzeHypotheses invokes the HypothesisAnalyzer. If you have a look at the file


you see that actually three different clones of the corresponding module are run with this sequence.

import FWCore.ParameterSet.Config as cms

# make simple analysis plots for a comparison
# between a simple algorithmic, a gen match and
# an MVA discriminator based event hypothesis

# initialize analyzers
from TopQuarkAnalysis.Examples.HypothesisAnalyzer_cfi import *
analyzeGenMatch      = analyzeHypothesis.clone()
analyzeMaxSumPtWMass = analyzeHypothesis.clone()
analyzeMVADisc       = analyzeHypothesis.clone()

# configure analyzers
analyzeGenMatch.hypoClassKey      = "ttSemiLepHypGenMatch:Key"
analyzeMaxSumPtWMass.hypoClassKey = "ttSemiLepHypMaxSumPtWMass:Key"
analyzeMVADisc.hypoClassKey       = "ttSemiLepHypMVADisc:Key"

# define sequence
analyzeHypotheses = cms.Sequence(analyzeGenMatch *
                                 analyzeMaxSumPtWMass *

Let's go through the HypothesisAnalyzer now. The relevant files are:


You can see in the source file how the TtSemiLeptonicEvent is read as a product from the edm::Event and made available to the analyzer.

edm::Handle<TtSemiLeptonicEvent> semiLepEvt;
event.getByLabel(semiLepEvt_, semiLepEvt);

The member semiLepEvent_ just holds the corresponding InputTag which is "ttSemiLepEvent" as defined in the cfi file.

A second product is read from the event: A simple integer which is then converted into the type of TtSemiLeptonicEvent::HypoClassKey.

edm::Handle<int> hypoClassKeyHandle;
event.getByLabel(hypoClassKey_, hypoClassKeyHandle);
TtSemiLeptonicEvent::HypoClassKey& hypoClassKey = (TtSemiLeptonicEvent::HypoClassKey&) *hypoClassKeyHandle;

The HypoClassKey enumerator is defined in the TtSemiLeptonicEvent and used for accessing the hypothesis classes. The only difference between the three clones of the HypothesisAnalyzer consists in the InputTags that are used for the hypoClassKey_. As can be seen from the, the InputTags are "ttSemiLepHypGenMatch:Key", "ttSemiLepHypMaxSumPtWMass:Key" and "ttSemiLepHypMVADisc:Key". This means, the same analyzer is run three times to study first the GenMatch hypothesis, then the hypothesis named MaxSumPtWMass and finally the MVA discriminator based hypothesis.

After having read in the hypoClassKey, it is checked whether the corresponding hypothesis is available and valid in this event.

if( !semiLepEvt->isHypoValid(hypoClassKey) ){
  edm::LogInfo("HypothesisAnalyzer") << "Hypothesis not valid for this event";

A hypothesis is not available if the module that produces the hypothesis was not run. Since in the standard sequence all 6 hypotheses are produced before the TtSemiLeptonicEvent is constructed, they all typically should be available.

A hypothesis is not valid if it is not available or if, for example, there is no appropriate lepton candidate in the event or if there are not enough jets. And the GenMatch hypothesis is not valid if the event was not generated in the semileptonic channel. So, depending on the event selection, you might get quite some events with hypotheses that are not valid.

The candidates for the hadronic W boson and the hadronic top quark are accessed from the hypotheses in the analyzer using the following two lines:

const reco::Candidate* hadTop = semiLepEvt->hadronicDecayTop(hypoClassKey);
const reco::Candidate* hadW   = semiLepEvt->hadronicDecayW  (hypoClassKey);

Similar getter functions could be used to access the other particles of the ttbar decay chain:

  • hadronicDecayTop(const HypoKey& key)
  • hadronicDecayB(const HypoKey& key)
  • hadronicDecayW(const HypoKey& key)
  • hadronicDecayQuark(const HypoKey& key)
  • hadronicDecayQuarkBar(const HypoKey& key)
  • leptonicDecayTop(const HypoKey& key)
  • leptonicDecayB(const HypoKey& key)
  • leptonicDecayW(const HypoKey& key)
  • singleNeutrino(const HypoKey& key)
  • singleLepton(const HypoKey& key)

In the HypothesisAnalyzer just the very basic properties pt(), eta() and mass() of the candidates for the hadronic W and the hadronic top are taken and filled into histograms.

In the next step, the corresponding generator particles are obtained:

const reco::Candidate* genHadTop = semiLepEvt->hadronicDecayTop();
const reco::Candidate* genHadW   = semiLepEvt->hadronicDecayW();

They are used in the analyzer to calculate the so called pulls in pT, pseudorapidity and mass, i.e. the difference between the reconstructed and the generated values divided by the generated value, which are also filled into histograms.

In the last step of the analyzer, two quality criteria of hypotheses are read from the TtSemiLeptonicEvent and used to fill histograms: The summed deltaR between the jets and the partons of the GenMatch via semiLepEvt->genMatchSumDR() and the discriminator of the MVA based hypothesis via semiLepEvt->mvaDisc().

The HypothesisAnalyzer as checked out from CVS will only run on 100 events. Please set the maxEvents parameter in the to a number like 1000 in order get enough statistics for all histograms.

If you then run the analyzer and afterwards inspect the results with the analyzeTopHypotheses.C root macro, you will see three canvases. The distributions for the hadronic W can be found on the first canvas and those for the hadronic top on the second. On the third canvas the deltaR of the GenMatch and the discriminator of the MVA based hypothesis are shown.

The MVATrainer

You will see that a significant fraction of the events has MVA discriminator values below 0.05. This is mainly due to the fact that the file with the MVA training data that is stored in CVS is just a dummy version which was obtained on a small number of events and without a realistic event selection. The idea is that every user performs an MVA training on higher statistics and with an event selection that corresponds to the one applied in his/her analysis. To do so, please set the maxEvents in the to 2000, run the TrainTreeSaver and then the mvaTreeTrainer as described above.

The TrainTreeSaver produces a root tree that is saved in the file train_save.root and that contains for every selected event those variables that were calculated for the JetCombMVA. The mvaTreeTrainer then runs on this tree and is able to perform the remaining steps of the training very fast. If you enter mvaTreeTrainer without any arguments, you get a short explanation of its syntax:

mvaTreeTrainer <train.xml> <output.mva> <data.root>
Recognized parameters:
        -l / --load             Load existing training data.
        -s / --no-save          Don't save training data.
        -m / --no-monitoring    Don't write monitoring plots.
        -w / --no-weights       Ignore __WEIGHT__ branches.
        -x / --xslt             Use MVATrainer XSLT parsing.

In our case the full command for the mvaTreeTrainer is:

mvaTreeTrainer --xslt TopQuarkAnalysis/TopJetCombination/data/TtSemiLepJetCombMVATrainer.xml TopQuarkAnalysis/TopJetCombination/data/TtSemiLepJetComb.mva train_save.root

You can have a look into the TtSemiLepJetCombMVATrainer.xml, which is the main steering file for the training. You will see that the long list of all variables that are stored in the tree is included from a file called TtSemiLepJetCombMVATrainer_vars.xml. The variables are then passed through a series of so called processors in the main xml file. One of the processors is called ProcLikelihood and combines variables into a likelihood. As you can see, six variables are currently used here:

  • deltaRHadTopLepTop : the deltaR between the hadronic top and the leptonic top
  • deltaRHadQHadQBar : the deltaR between the two light quarks from the hadronic W
  • deltaRLepBLepton : the deltaR between the leptonic b quark and the lepton
  • bTagHadBTrkCntHighEff : the TrackCountingHighEfficiency b-tag on the hadronic b jet
  • bTagLepBTrkCntHighEff : the TrackCountingHighEfficiency b-tag on the leptonic b jet
  • bTagSumHadQHadQBarTrkCntHighEff : the summed TrackCountingHighEfficiency b-tags on the two light quarks

TtSemiLepJetComb.mva is the file to which the final result of the training will be written and which will be read by the TtSemiLepJetCombMVAComputer plugin used to build the MVADisc hypothesis.

The mvaTreeTrainer not only produces the mva file which is the needed input for the MVA based hypothesis, it also provides a file called train_monitoring which can be inspected using the ViewMonitoring macro of the MVATrainer package:

root -l $CMSSW_RELEASE_BASE/src/CMS.PhysicsTools/MVATrainer/test/ViewMonitoring.C

You will get a menu from which you can select "Input Variables" -> "norm" to see the input distributions of the six variables:


Or "ProcMatrix" -> "rot" to get a correlation matrix.


Please note that some of the plots shown by the ViewMonitoring macro, especially those under "Discriminator & Performance", are not the correct distributions for the JetCombMVA. This is due to the fact that ViewMonitoring was designed for the standard case in which you typically have one hypothesis per event (or per object), which is either signal or background. In the special case of the JetCombMVA you have always one true hypothesis per event plus several other hypotheses from the combinatorial background. And you always choose that hypothesis that yields the largest discriminator value in the event. In order to study the performance of the JetCombMVA, you need to find out in which events the "best" hypothesis was the true one and in which the "best" was a fake. This can be done with a root macro that you can find in the TQAF Examples package:

root -l TopQuarkAnalysis/Examples/test/analyzeJetCombMVAPerformance.C+

You will get plots like these:



After having performed the MVA training, please run the HypothesisAnalyzer again using and then the analyzeTopHypotheses.C macro. On the third canvas you will see that in most events the summed deltaR between the partons and the jets in the GenMatch hypothesis is smaller than 0.5 but that you also have quite some events with significantly larger values. To increase the purity of the GenMatch, you could use an outlier rejection and/or a different algorithm for the jet-parton matching. By including the following line in your config file, you choose the unambiguousOnly algorithm which only accepts unambiguous jet-parton pairs within cones of 0.3 in deltaR:

process.ttSemiLepJetPartonMatch.algorithm = "unambiguousOnly"

This increases the purity of the GenMatch but at the same time it reduces the efficiency, i.e. you will get less events with the GenMatch hypothesis. If you run the HypothesisAnalyzer again, you will finally get plots like these:

canvasHadW.png canvasHadTop.png canvasQuali.png

How to get more information

  • The main TQAF TWiki page is the SWGuideTQAF.
  • Documentation of the classes TopGenEvent and TtSemiLeptonicEvent can be found in the SWGuideTQAFClasses.
  • Some more information on the MVA based jet-parton association as well as on the algorithms for the GenEvent based jet-parton matching can be found in the SWGuideTQAFLayer2.
  • The main documentation of the CMS.PhysicsTools MVA package is in the SWGuideMVAFramework.

Review status

Reviewer/Editor and Date (copy from screen) Comments
HolgerEnderle - 03 Sep 2009 Updated for the PAT Tutorial September 2009
SebastianNaumann - 19 Jun 2009 Completed first revision for the PAT Tutorial June 2009
RogerWolf - 13 May 2009 Created the template page

Responsible: SebastianNaumann
Last reviewed by: most recent reviewer and date

Edit | Attach | Watch | Print version | History: r3 < r2 < r1 | Backlinks | Raw View | WYSIWYG | More topic actions
Topic revision: r3 - 2010-03-11 - SebastianNaumann
    • Cern Search Icon Cern Search
    • TWiki Search Icon TWiki Search
    • Google Search Icon Google Search

    CMSPublic All webs login

This site is powered by the TWiki collaboration platform Powered by PerlCopyright & 2008-2019 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback