LPC Hands-on Advanced Tutorial Session (HATS) on B-Tagging

Prepared by: Alexander Schmidt and Nitish Dhingra

Link to indico agenda.

Part I: Accessing b-tagging information

Accessing b-tag information in PAT

A git repository has been setup which contains the necessary information to access b-tagging information in PAT framework and also without PAT i.e. using RECO data-tier. Login to cmslpc and setup the environment:

source /uscmst1/prod/sw/cms/cshrc prod  

Then follow the instructions given below:

mkdir LPCBTaggingHATS
cd LPCBTaggingHATS
cmsrel CMSSW_5_3_13
cd CMSSW_5_3_13/src
git clone https://github.com/NitishDhingra/bTaggingHATS-LPC.git
source bTaggingHATS-LPC/Setup.sh

The file RunPATAndAnalyzer_cfg.py under BTagAnalyzer/PATBTagAnalyzer/test is used to make PATtuples and run a simple EDAnalyzer over these PATtuples to fill histograms corresponding to various b-tagging discriminators. To save timings we are running interactively over a subset of files (stored at cmslpc-eos area) from the 53X RelVal TTbar dataset. The GEN-SIM-RECO root file is input in RunPATAndAnalyzer_cfg.py:

process.source = cms.Source("PoolSource",
                fileNames = cms.untracked.vstring(

Have a look at the code PATBTagAnalyzer.cc under BTagAnalyzer/PATBTagAnalyzer/src. Look at the section where various histograms are defined:

// Define Histograms 
hBJetDiscrByTrackCountingHighEff = fs->make<TH1F>("BJetDiscrByTrackCountingHighEff", "BJetDiscrByTrackCountingHighEff", 400, -20, 20);
hBJetDiscrByTrackCountingHighPur = fs->make<TH1F>("BJetDiscrByTrackCountingHighPur", "BJetDiscrByTrackCountingHighPur", 400, -20, 20);
hBJetDiscrBySimpleSecondaryVertexHighEff = fs->make<TH1F>("BJetDiscrBySimpleSecondaryVertexHighEff", "BJetDiscrBySimpleSecondaryVertexHighEff", 400, -20, 20);
hBJetDiscrByCombinedSecondaryVertexHighEff = fs->make<TH1F>("BJetDiscrByCombinedSecondaryVertexHighEff", "BJetDiscrByCombinedSecondaryVertexHighEff", 400, -20, 20);
hBJetDiscrByCombinedSecondaryVertexMVA = fs->make<TH1F>("BJetDiscrByCombinedSecondaryVertexMVA", "BJetDiscrByCombinedSecondaryVertexMVA", 400, -20, 20);                

and, in particular, the section where b-tagging related information is accessed from patJets and histograms are filled:

using namespace edm;
         edm::Handle< pat::JetCollection > _patJets;
         iEvent.getByLabel(_RecoJetSource, _patJets); 
 for ( pat::JetCollection::const_iterator patJet = _patJets->begin(); patJet != _patJets->end(); ++patJet ) {
                 // Only look at jets that pass jet pt and eta cuts
                 if(patJet->pt() < _RecoJetPtMinCut || fabs(patJet->eta()) > _RecoJetEtaMaxCut) continue;

Run the cfg file RunPATAndAnalyzer_cfg.py as:

cmsRun RunPATAndAnalyzer_cfg.py

This will create the PATtuple file patTuple_addBTagging.root and output histogram ROOT file analysis.root. Scan through various histograms contained in analysis.root:

root -l analysis.root 
root [0] 
Attaching file analysis.root as _file0...
root [1] new TBrowser

Accessing b-tag information without PAT

The b-tag information that is stored in the RECO data tier is a simple association map between a reco::Jet and a float variable. The float variable is the b-tag discriminator value. The jet-float association is stored in a collection and can be accessed with:

edm::Handle<reco::JetTagCollection> tagHandle;
iEvent.getByLabel("combinedSecondaryVertexBJetTags", tagHandle);
const reco::JetTagCollection & tagColl = *(tagHandle.product());
std::cout << "Found " << tagColl.size() << " B candidates in collection " << std::endl;

You can simply add the above four lines to the end of the PATBTagAnalyzer::analyze function from the previous section. You can also replace combinedSecondaryVertexBJetTags with any other b-tagging algorithm (see above).

The size of this collection is the same as the jet collection that has been used as input to the b-tagging sequence. This means that each jet in the jet collection has exactly one matching entry in the JetTagCollection.

Then loop over the collection and print the discriminator values

for (JetTagCollection::const_iterator tagI = tagColl.begin();
             tagI != tagColl.end(); ++tagI) {
          std::cout<<" jet pt = " << tagI->first->pt() << "  btag discriminator value = " << tagI->second << std::endl;

As a next step we want to access some properties of jets which are relevant for calculating b-tagging discriminators, for example secondary vertices. These are stored in the data format "SecondaryVertexTagInfos". We can access these tag infos with

edm::Handle<reco::SecondaryVertexTagInfoCollection> tagInfosHandle;
iEvent.getByLabel("secondaryVertexTagInfos", tagInfosHandle);
const reco::SecondaryVertexTagInfoCollection & tagInfoColl = *(tagInfosHandle.product());

and then loop over them with

for(reco::SecondaryVertexTagInfoCollection::const_iterator iter = tagInfoColl.begin(); iter != tagInfoColl.end(); ++iter) {
     // if there are reconstructed vertices in this jet
          if(iter->nVertices() >=1 ) {
              std::cout<<"found secondary vertex with a flight distance of " << iter->flightDistance(0).value() << " cm"<< std::endl;

which will print out the flight distance of each secondary vertex that is found in jets.

Part II: Make simple b-tag performance plots

In the following, we will first calculate the b-tagging efficiency and fake rate by simply counting how many b-jets are b-tagged and how many non b-jets are b-tagged. For simplicity, we can do this calculation at the three operating points loose, medium and tight. You can look up the thresholds for cutting on the discriminator at here.

We first need to find out for simulated events, which jets are b-jets and which jets are non b-jets. For this purpose we can use the standard jet flavour tool that is running by default when using PAT. The jet flavour is accessed with patJet->partonFlavour() method. We count the number of true b-jets, NBJets and the ones which pass CSV medium tagger, NBJetsPassingCSVM.

if(fabs(patJet->partonFlavour()) == 5) { hTrueBJetPt->Fill(patJet->pt()); hTrueBJetEta->Fill(patJet->eta()); NBJets++;
                        if(patJet->bDiscriminator("combinedSecondaryVertexBJetTags") > 0.679) {

Similarly we do the same for non b-jets (NnonBJets and NnonBJetsPassingCSVM).

else if ((fabs(patJet->partonFlavour()) == 1) || (fabs(patJet->partonFlavour()) == 2) || (fabs(patJet->partonFlavour()) == 3) || (fabs(patJet->partonFlavour()) == 21)) {
                        if(patJet->bDiscriminator("combinedSecondaryVertexBJetTags") > 0.679) {

The division of corresponding numbers is done in endJob() function to calculate the b-tagging efficiency and mistag rate from simple counting.

Now let us compute btagging efficiency and light jet mistag rates for CSV tagger as a function of jet pt and jet eta for medium working point. Look at the corresponding part of code in PATBTagAnalyzer.cc:

if(fabs(patJet->partonFlavour()) == 5) { hTrueBJetPt->Fill(patJet->pt()); hTrueBJetEta->Fill(patJet->eta()); NBJets++;
                        if(patJet->bDiscriminator("combinedSecondaryVertexBJetTags") > 0.679) {
                        } else if(patJet->bDiscriminator("combinedSecondaryVertexBJetTags") > 0.898) {
                } else if ((fabs(patJet->partonFlavour()) == 1) || (fabs(patJet->partonFlavour()) == 2) || (fabs(patJet->partonFlavour()) == 3) || (fabs(patJet->partonFlavour()) == 21)) {
                        if(patJet->bDiscriminator("combinedSecondaryVertexBJetTags") > 0.679) {
                        } else if(patJet->bDiscriminator("combinedSecondaryVertexBJetTags") > 0.898) {

For this we first identify jets with jet flavour method and then separate them out as a b-jet or a light-jet depending upon their jet flavour value i.e. 5 (b) or 1 (d), 2 (u), 3 (s), 21 (gluon). Once we separate them out we fill corresponding histograms hTrueBJetPt, hTrueBJetEta and hNonBJetPt, hNonBJetEta before applying any btagger. This gives us the denominators for calculating btagging efficiency and light jet mistag rates. These jets are further required to have CSV discriminator value > 0.679 (0.898) for medium (tight) working point and the corresponding histograms hTrueBJetPtPassingCSVM, hTrueBJetEtaPassingCSVM and hNonBJetPtPassingCSVM, hNonBJetEtaPassingCSVM are filled. These histograms serve as numerators for our btagging efficiency and light jet mistag rate calculations. A simple macro EffAndMistagPlotter.C has been provided with the tutorial which simply divides these histograms and plots efficiency and mistag rate as a function of jet pt. One can run the macro as:

root -l EffAndMistagPlotter.C

You should get results similar to:

More... Close bTagEff_CSVM_v1.png NonbJetMisTag_CSVM.png

Exercise 1

Repeat the exercise for CSV tight working point. How do the curves compare between medium and tight working points?

Spoiler - You should get results similar to:

More... Close bTagEff_CSVT.png NonbJetMisTag_CSVT.png

Exercise 2

What is the b-tagging efficiency and mistag rate from simple jet counting method for TCHP? Also, plot the b-tagging efficiency and mistag rate as a function of jet pt for TCHP.

Spoiler - You should get results similar to:

More... Close B-tagging Efficiency (by simple counting) : 41.3032% Mistag rate (by simple counting) : 0.210924%

More... Close bTagEff_TCHP.png NonbJetMisTag_TCHP.png

Part III: Apply data/MC scale factors

The Monte Carlo simulation does not perfectly reproduce the real b-tagging performance which has been measured in data. Therefore, scale factors to correct the simulation to data have been derived. These scale factors are available on the main b-tagging POG twiki (link). Usually, the scale factors depend on the jet transverse momentum and pseudo rapidity. For example, the CSV b-tagging algorithm at medium working point has the following functional description of the b-tag efficiency scale factor:

SFb = (0.939158+(0.000158694*x))+(-2.53962e-07*(x*x));
where x represents the jet pt. It can be seen that the scale factor is close to unity and has a moderate dependence on the momentum. The scale factor uncertainties are provided as well in bins of pt and they are of the size of a couple of %.

When applying these scale factors to simulated events one has to consider that an event may contain more than one b-jet. In addition, non b-jets (light flavour) may be wrongly mistagged as b-jets. Scale factors are available also for the mistag rate, but they are of course not identical to the b-tag efficiency scale factors. As a result, we have to combine several different scale factors per simulated event, one for each jet, which depend on the jet flavour, the jet momentum and the b-tagging algorithm.

A variety of methods have been invented to apply these scale factors to simulated events. The details are documented here.

Probably the easiest method is the "Jet-by-jet updating of the b-tagging status" in which b-tagged jets are flagged as "not b-tagged" based on a random number. For example, if the b-tag efficiency scale factor is 0.95, then 5% of all b-tagged b-jets are flagged as not b-tagged (the b-tags are simply thrown away to match the expected efficiency in data). This works for light flavour mistags as well. However, as the mistag scale factor can be larger than one, we have to change the status from not b-tagged to b-tagged (and we need the mistag efficiency as well to decide how many light jets should receive this change).

This "jet by jet" method has certain disadvantages. For example, if properties of b-tagged jets are used in an analysis, such as secondary vertices, they may be invalid for non-btagged jets which receive an upgrade to b-tagged, simply because a vertex may not be reconstructed. Another issue is that events which don't pass selection criteria may suddenly be selected after changing the b-tag status, so that they have to be added back. These disadvantages are avoided by the methods applying event weights as discussed below.

The goal of event reweighting methods is to predict the correct event yield in data by only changing the weight of the selected MC events, i.e., MC events that did not pass the selection do not need to be added back. This way there will be no problems with undefined variables in the later steps of the analysis. This method is described in more detail here. It is based on the probability to observe a given b-tag configuration in MC and data. Some example code for how to apply this method is linked above as well (direct link here).


Add the provided example code for b-tag event weights to our PAT analysis in Section I. Make the necessary adjustments (adapt the code and fill the missing pieces where indicated). Calculate the b-tag event weights. The efficiencies that are needed as input can be taken from section II.

Part IV: Advanced topics

Configure your own b-tagging sequence

Now we create and configure our own btagging sequence step by step. This will result in a very simple, but working configuration file. The final file is attached to this twiki, all components are described in the following. First include some standard headers:

import FWCore.ParameterSet.Config as cms

process = cms.Process("validation")

# load the full reconstraction configuration, to make sure we're getting all needed dependencies

# for the conditions
from Configuration.AlCa.autoCond import autoCond 
process.GlobalTag.globaltag = autoCond['startup']

Now add the jet-track association producer (JTA) which does the association between jets and tracks:

process.MyAk5PFJetTracksAssociatorAtVertex = cms.EDProducer("JetTracksAssociatorAtVertex",
   jets = cms.InputTag("ak5PFJets")
The next step is to produce the tagInfo objects that are used by all b-tagging algorithms. For this we clone the existing tagInfo configuration and change the JTA input to take the one from above:
process.MyImpactParameterPFTagInfos = process.impactParameterTagInfos.clone(
  jetTracks = "MyAk5PFJetTracksAssociatorAtVertex"

Now we add a b-tagging algorithm, in this case the track-counting algorithm, which takes the tagInfo from above as input:

process.MyTrackCountingHighEffBJetTags = process.trackCountingHighEffBJetTags.clone(
  tagInfos = cms.VInputTag(cms.InputTag("MyImpactParameterPFTagInfos"))

Finally we add a the path that actually runs the above defined modules in sequence:

process.plots = cms.Path(
  process.MyAk5PFJetTracksAssociatorAtVertex *
  process.MyImpactParameterPFTagInfos *

Finally, we need to add a dataset to read. You can use the same as above.

The resulting config file is fully functionable and running. If you add an OutputModule and an EndPath statement, you will see in the output file your newly defined objects.

The resulting configuration file can be downloaded here.


Add the secondaryVertexTagInfos and the simple secondary vertex b-tagging algorithm (simpleSecondaryVertexHighEffBJetTags) to this configuration by yourself. You can find their definitions in
(link to code browser) and in
(link to code browser)

Make sure you use the MyImpactParameterPFTagInfos that we defined above as input to the secondaryVertexTagInfos.

The solution can be downloaded here.

Performance plots with DQM validation tools

We now add the standard DQM validation tools to our config file which produce the usual mistag vs. efficiency curves and many other plots. As first step, we need to associate the true MC flavor with jets. This is done with three modules: the PartonSelector, the JetPartonMatcher and with the JetFlavourIdentifier from the PhysicsTools defined in the package


The additional configuration to include is:

We modify the JetPartonMatcher to use the same jets as we used in the b-tagging algorithms:
process.AK5PFbyRef = process.AK5byRef.clone(
  jets = "ak5PFJets"
The next step, the flavour identification can be done with different methods, we use the "algorithmic" definition and give the above parton matcher as input.
process.AK5PFbyValAlgo = process.AK5byValAlgo.clone(
  srcByReference = "AK5PFbyRef"
We then also need to add these three modules to the path:
process.myPartons *
process.AK5PFbyRef *

The calculation of b-tagging efficiencies and much more is done in the BTagPerformanceAnalyzerMC from the package Validation/RecoB. It is configured in

Again, we clone this configuration and modify a few parameters. In particular, we only want the analysis of the impact parameters and the track-counting b-tagger. So we specify the name of our algorithm and the tag info and add the following to our config file:
process.MybTagValidation = process.bTagValidation.clone(
  jetMCSrc = 'AK5PFbyValAlgo',
  tagConfig = cms.VPSet(
            type = cms.string('TrackIP'),
            label = cms.InputTag("MyImpactParameterPFTagInfos"),
            folder = cms.string("IPTag")
            label = cms.InputTag("MyTrackCountingHighEffBJetTags"),
            folder = cms.string("TCHE")
As the validation is embedded in the DQM environment, we also need to configure a few DQM parameters to make sure the histograms from the validation are saved correctly (further explanations of DQM are beyond the scope of this tutorial):
process.dqmEnv.subSystemFolder = 'BTAG'
process.dqmSaver.producer = 'DQM'
process.dqmSaver.workflow = '/POG/BTAG/BJET'
process.dqmSaver.convention = 'Offline'
process.dqmSaver.saveByRun = cms.untracked.int32(-1)
process.dqmSaver.saveAtJobEnd =cms.untracked.bool(True) 
process.dqmSaver.forceRunNumber = cms.untracked.int32(1)

Finally adding the validation and dqmSaver to the path as well:

process.MybTagValidation *

This configuration is now ready to run. The resulting output file with histograms will appear in the same directory under the name DQM_V0001_R000000001__POG__BTAG__BJET.root. Open the file with the ROOT browser and navigate to the folder DQMData/Run 1/Btag/Run summary/. Here you will find a large number of subfolders which can be grouped into three different types: IPTag_*, MisidForBEff_TCHE_* and TCHE_*. These three folders are replicated several times for various pt and eta bins. The bins in eta and pt are defined in the common block of the BTagPerformanceAnalyzerMC, namely in the file


The suffix *_GLOBAL contains inclusive plots over the full pt and eta range.

  • IPTag_*: these folders contain the validation of the impactParameterPFTagInfos, namely distributions of track quantities such as impact parameter value, errors and significances, track multiplicity, distance to jet axis and many more.
  • MisidForBEff_TCHE_*: these folders contain the (mis)-tag rate versus jet pt for a fixed b-tag efficiency of 0.5.
  • TCHE_*: these folders contain specific performance curves for the TCHE algorithm, most importantly the FlavEffBsBEff_* performance curves.


As in the previous exercise, add the secondaryVertexTagInfos and the simple secondary vertex b-tagging algorithm (simpleSecondaryVertexHighEffBJetTags) to the validation plots yourself.

The solution can be downloaded here.

Development of a new track counting b-tagging algorithm

We now want to create an improved track counting b-tagging algorithm which uses a pt-dependent "shrinking" cone for track- to jet association. The idea is that high-pt jets are more collimated than low-pt jets, so that we can reduce the dR cone for associating tracks. A completely uneducated guess would be to shrink the cone linearly to from 0.5 to 0.1 between 100GeV and 1TeV, but this is not the point of our tutorial, so you can invent your own function for this.

We don't want to touch the JetTracksAssociator itself, as the JTA is used for all kinds of algorithms which we do not want to change, so the right place to implement this is directly in the b-tagging algorithm.

Looking into the configuration of the track counting "high-efficiency" algorithm in

(link to code browser)

we see that it has only two configuration parameters, the well known "tagInfo" from above and a "jetTagComputer". The producer is the JetTagProducer which is implemented in

(link to code browser)

We should take a moment to understand this JetTagProducer.cc in some detail as it plays a central role in ALL b-tagging algorithms. The "setup" method in lines 72 to 94 simply retrieves the jetTagComputer from the event setup. The real work is done in the following "produce" method. In line 127 a loop over all input tagInfo types (can be more than one, e.g. for secondary vertex algorithms) is started, while line 131 loops over the tagInfos of each type. There is always a direct correspondence between a jet and a tagInfo which means that each tagInfo belongs to exactly one jet. This is resolved in lines 133 to 137 where the stl::map jetToTagInfos is filled for each jet with a vector of tagInfos. The reference to jet is used as key in this map. Later in lines 151 to 159 we loop over this map and pass the vector containing the tagInfos to the jetTagComputer (via a helper class). The b-tagging discriminator is then calculated by the jetTagComputer and associated to the jet. The result is a product of type JetTagCollection which associates a floating point discriminator to each jet. So we understand that the real math is done in the jetTagComputer which is therefore the right place to put our own improvements. (If the above was too complicated at first sight, we can still continue, as we don't need to change anything here).

Coming back to our configuration of the track counting algorithm in trackCountingHighEffBJetTags_cfi.py (see above) we understand that the used jetTagComputer is labelled trackCounting3D2nd. We find it's configuration in the file

(link to code browser)

and its implementation in the file

(link to code browser)

The discriminator() function in line 41 of the TrackCountingComputer simply returns the IP significance of the n-th track where the tracks are sorted by IP significance and n is given as a configuration parameter. In line 38 the significances are sorted by the orderedSignificances function where also additional cuts are applied. The loop over all impact parameters in the tagInfo is done in line 55 and line 57 applies the cuts on the distance to jet axis, decay length and track quality. An additional cut on the dR between track and jet is applied in line 62, however this cut is currently deactivated (the parameter is set to -1 by default). This is the place where we will add our new functionality.

Now we could simply modify the existing TrackCountingComputer.h file, add our new code and that's it. This is how it will be done in a future version of this module. However, to demonstrate how we can add a completely new algorithm, we will duplicate it, give it a new name and run it in parallel to the existing one in the following exercise.

Exercise 1

Duplicate the TrackCountingComputer.h and rename the file to something meaningful, such as ShrinkingTrackCountingComputer.h. In the new file, replace all occurrences of "TrackCountingComputer" with "ShrinkingTrackCountingComputer" (make sure this also replaced in the first two lines after #ifndef).

The same has to be done with the

In this file we also need to rename the TrackCountingESProducer to ShrinkingTrackCountingESProducer which implies that we need to add the ShrinkingTrackCountingESProducer as well to the file
in order to let the event setup know about the new ESProducer (in modules.cc duplicate lines 16 and 17 and rename to Shrinking...)

Finally, we need to tell the JetTagProducer that it should pick up our new ShrinkingTrackCountingComputer, so we add the configuration of our new algorithm to the


Hint: we also need to include the new computer in


Check the performance of the new algorithm with the validation tools as in the second chapter of this tutorial. Has the performance changed?

The modified files can be downloaded (as tar.gz package) here.

Exercise 2

The names of the track counting computer configuration files
indicate that the 3-dimensional Impact Parameter is used and that the 2nd or 3rd track in the jet (sorted by IP significance) is used as discriminator. The second track performs better in the high-efficiency region, while the 3rd track is better in the high-purity region. Now try what happens when you use the first or the fourth track instead. Does it give even better results in the very-high-efficiency or very-high-purity regions?

-- AlexanderSchmidt - 28 Apr 2014

Topic attachments
I Attachment History Action Size Date Who Comment
PNGpng NonbJetMisTag_CSVM.png r1 manage 12.1 K 2014-05-11 - 03:00 NitishDhingra  
PNGpng NonbJetMisTag_CSVT.png r1 manage 11.6 K 2014-05-11 - 03:00 NitishDhingra  
PNGpng NonbJetMisTag_TCHP.png r1 manage 12.4 K 2014-05-11 - 03:00 NitishDhingra  
Texttxt OwnSequence.py.txt r2 r1 manage 1.7 K 2014-05-12 - 18:35 AlexanderSchmidt  
Texttxt OwnSequence2.py.txt r2 r1 manage 2.2 K 2014-05-12 - 18:35 AlexanderSchmidt  
Texttxt OwnSequence3.py.txt r2 r1 manage 3.7 K 2014-05-12 - 18:36 AlexanderSchmidt  
Unknown file formatgz ShrinkingTrackCounting.tar.gz r1 manage 5.0 K 2014-05-11 - 00:12 AlexanderSchmidt  
PNGpng bTagEff_CSVM.png r1 manage 12.4 K 2014-05-11 - 03:00 NitishDhingra  
PNGpng bTagEff_CSVM_v1.png r1 manage 12.4 K 2014-05-11 - 03:03 NitishDhingra  
PNGpng bTagEff_CSVT.png r1 manage 12.0 K 2014-05-11 - 03:00 NitishDhingra  
PNGpng bTagEff_TCHP.png r1 manage 11.3 K 2014-05-11 - 03:00 NitishDhingra  
Edit | Attach | Watch | Print version | History: r22 < r21 < r20 < r19 < r18 | Backlinks | Raw View | WYSIWYG | More topic actions
Topic revision: r22 - 2014-05-12 - AlexanderSchmidt
    • Cern Search Icon Cern Search
    • TWiki Search Icon TWiki Search
    • Google Search Icon Google Search

    CMSPublic All webs login

This site is powered by the TWiki collaboration platform Powered by PerlCopyright & 2008-2019 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback