Hadron based origin identification of heavy flavour jets at generator level

Contacts: Nazar Bartosik, Johannes Hauk

Introduction

This page describes the CMSSW plugin GenHFHadronMatcher, which is an EDProducer, and associated tools.

The purpose of the GenHFHadronMatcher tool is to identify heavy flavour (bottom, charm) jets at generator level and to find the originating partons to which they correspond. This is especially important in events that have many heavy flavour jets in the final state originating from different particles, e.g. ttH(bb), ttZ(bb), tt+bb, tt+cc. The identification of such jets is based on hadrons, which is the definition preferred by theorists for comparison with measurements.

The main features of the tool are:

  • works in AOD and miniAOD data formats with minor configuration adjustments
  • finds all jets in the event that contain b/c hadrons (using ghost hadron injection by the official JetFlavour tool)
    • all hadrons stored in the EDM input files are used: no kinematic selection is applied, and miniAOD currently stores all of them
  • for each b/c hadron finds a unique b/c quark and its mother (can be t, H, Z, g, proton, ...)
  • works in Pythia6, Pythia8, Herwig without any adjustments (relies only on pdgId of the particles, no status scheme dependence)
    • detects flavour oscillations of b hadrons (as stored in Herwig)
    • detects and skips infinite particle loops
  • for each b/c hadron associated to a jet, leptonic decays are identified: leptons are stored with links to the corresponding hadrons
  • in the special case of tt+X: for each b/c hadron a flag is stored indicating whether the hadron has been radiated before or after the top quark decay
  • for c hadrons a flag is stored indicating whether the c hadron originates from a b hadron (to properly identify prompt c jets)
  • more details about its working principle can be found in the corresponding talk, in the analysis note AN-2014/093 (needs iCMS login), and in the PhD thesis of Nazar Bartosik, CERN-THESIS-2015-120.

Installation instructions

The GenHFHadronMatcher plugin is a part of the sub-package PhysicsTools/JetMCAlgos and is by default available and fully functional in CMSSW from CMSSW_7_6_X onwards.

A test analyzer also exists in PhysicsTools/JetMCAlgos/test, which greatly simplifies understanding of the output and readability of the sample code.

Earlier CMSSW versions (5X starting from CMSSW_5_3_24, and 7X starting from CMSSW_7_2_2_patch2) also contained the tool but needed additional workarounds, which were described in previous versions of this TWiki page, up to revision r19.

Configuration example for AOD/miniAOD

A simple test setup is provided in the file matchGenHFHadrons.py under the PhysicsTools/JetMCAlgos/test folder. It runs the tool over 1000 ttbar events and produces 2 root files: matchGenHFHadrons_out.root (raw dump of the producer output) and matchGenHFHadrons_trees.root (producer output processed by the matchGenHFHadrons.cc analyzer, with the relevant information stored in trees). The default input root files should be accessible from lxplus machines, but the input test root files might need to be updated in the config.

The test configuration is documented with comments on each block and should be self-explanatory. It is run by executing:

  • cd PhysicsTools/JetMCAlgos/test/
  • cmsRun matchGenHFHadrons.py (runs on predefined AOD files)
  • cmsRun matchGenHFHadrons.py runOnAOD=False (runs on predefined miniAOD files)

There are two differences between AOD and miniAOD. One is the name of the genParticle collection used for ghost hadron injection and jet origin identification, which is genParticles in AOD and prunedGenParticles in miniAOD. The other is the genJet collection: genJets need to be created in AOD, while in miniAOD they already exist as slimmedGenJets. Note that all B/C hadrons and their generator chains are stored in prunedGenParticles, but not all particles used in the jet clustering are stored: some in the very-forward region are excluded. As a consequence, some constituents of the slimmedGenJets point to objects that are not stored. This mainly affects jets in the very-forward region with eta above 5.
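The two differences can be summarised in a small helper. This is only a plain-Python sketch of the switch, not an excerpt from matchGenHFHadrons.py: the function name, the dictionary keys and the AOD genJet label ak4GenJetsCustom are assumptions made for illustration, while genParticles, prunedGenParticles and slimmedGenJets are the names given above.

```python
def input_collections(run_on_aod):
    """Return the generator-level inputs for AOD (True) or miniAOD (False)."""
    if run_on_aod:
        return {
            "genParticles": "genParticles",
            # Assumed label: in AOD the genJets are clustered in the job itself,
            # so the name is whatever the configuration gives to that module.
            "genJets": "ak4GenJetsCustom",
            "clusterGenJets": True,
        }
    return {
        "genParticles": "prunedGenParticles",
        "genJets": "slimmedGenJets",
        # genJets already exist in miniAOD, no clustering step is needed.
        "clusterGenJets": False,
    }


aod = input_collections(run_on_aod=True)
miniaod = input_collections(run_on_aod=False)
print(aod["genParticles"], miniaod["genParticles"])
```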

Having genJets created, the chain for the hadron matching consists of 3 steps:

The first step is to identify all final B/C hadrons, i.e. the last ones in the chain, which then decay to lighter flavours. This is done by process.selectedHadronsAndPartons, which is the plugin HadronAndPartonSelector. No selection is applied to the hadrons, all are considered.

The next step is to cluster the hadrons from the previous step as ghosts (energy scaled down by 10^{-18}) into the genJets. The ghost hadron injection is done by process.genJetFlavourInfos, which is the plugin JetFlavourClustering. It must be configured with exactly the same jet parameters as the jets into which the ghosts are injected (i.e. the parameters used for the creation of the slimmedGenJets stored in miniAOD, or in AOD the parameters used in the explicit clustering of jets). The tool re-clusters each jet including the collection selectedHadronsAndPartons (the ghosts), and then matches the re-clustered jets kinematically to the original jet collection. This causes an issue in miniAOD, although the effects on any analysis are small. Full details of the issue are given at the end of this section.
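The reason for the tiny scale factor can be illustrated numerically: a ghost carries the direction of the hadron but so little momentum that adding it to a jet does not change the jet four-momentum in double precision. The following toy sketch uses made-up four-momenta and a plain four-vector sum. It does not run any jet algorithm, and all names in it are illustrative.

```python
import math

GHOST_SCALE = 1e-18  # scale factor quoted in the text


def add(p, q):
    """Component-wise sum of two four-momenta (px, py, pz, E)."""
    return tuple(a + b for a, b in zip(p, q))


def pt(p):
    """Transverse momentum of a four-momentum."""
    return math.hypot(p[0], p[1])


# Toy jet constituents and one toy b hadron, in GeV (values are made up).
constituents = [(40.0, 10.0, 5.0, 41.6), (25.0, -5.0, 2.0, 25.6)]
hadron = (30.0, 4.0, 3.0, 30.9)

# The ghost keeps the hadron direction but has negligible momentum.
ghost = tuple(GHOST_SCALE * component for component in hadron)

jet = (0.0, 0.0, 0.0, 0.0)
for constituent in constituents:
    jet = add(jet, constituent)

jet_with_ghost = add(jet, ghost)
print(pt(jet), pt(jet_with_ghost))
```

In this toy example the two printed values are identical, which is why the ghosts can be clustered together with the real constituents without altering the jets.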

Having obtained the association of the hadrons to the jets via the ghost clustering, the hadron and jet origin identification is done by the plugin GenHFHadronMatcher. The origin of each B/C hadron is identified by going up the decay chain. Two instances of the GenHFHadronMatcher are imported, matchGenBHadron and matchGenCHadron, which independently perform the identification of B and C hadrons, respectively. These are predefined versions of a generic matchGenHFHadron producer defined in PhysicsTools/JetMCAlgos/python/GenHFHadronMatcher_cfi.py. The raw dump of the tool is written to the matchGenHFHadrons_out.root file.

The EDAnalyzer which is also executed in this configuration, and the corresponding output file matchGenHFHadrons_trees.root, is discussed later in Example 2.

As mentioned before, some constituents of the genJets are not stored in miniAOD, especially in the very-forward region, and the pointer to such a constituent is invalid. This required a protection in the plugin JetFlavourClustering, so that it does not crash on miniAOD when it loops over all jet constituents for the re-clustering and a particle used in the original clustering is not stored. The protection simply skips the missing constituents and prints an error message. Mainly jets in the very-forward region with eta above 5 are affected, so the effects on any analysis are small. A second effect, in addition to the missing constituents, comes from the reduced precision with which the constituents are stored in miniAOD. Due to these effects, the JetFlavourClustering prints several error messages when running, and the results differ slightly between AOD and miniAOD. The error messages are:

%MSG-e MissingJetConstituent:  JetFlavourClustering:genJetFlavourInfos  27-Jan-2017 11:31:32 CET Run: 1 Event: 702414
Jet constituent required for jet reclustering is missing. Reclustered jets are not guaranteed to reproduce the original jets!
%MSG

%MSG-e JetPtMismatch:  JetFlavourClustering:genJetFlavourInfos  27-Jan-2017 11:31:32 CET Run: 1 Event: 702414
The reclustered and original jet 7 have different Pt's (9.86161 vs 13.1001 GeV, respectively).
Please check that the jet algorithm and jet size match those used for the original jet collection and also make sure the original jets are uncorrected. In addition, make sure you are not using CaloJets which are presently not supported.

In extremely rare instances the mismatch could be caused by a difference in the machine precision in which case make sure the original jet collection is produced and reclustering is performed in the same job.
%MSG

Issues related to this problem are the following:

  • Small effects on the analysis are introduced and cannot be avoided, but their influence has been shown to be small
  • Many error messages are printed. They can mostly be ignored, but they clutter the printouts
  • Results are not perfectly reproducible on miniAOD

The latter point is of special relevance, as the GenHFHadronMatcher is by now also used in MC production, for filtering sub-processes out of inclusive processes. One example in use is the filtering of processes containing additional heavy flavour jets, tt+HF, from the inclusive ttbar process, according to the process separation used in the ttH(bb) analysis, with the filter plugin ttHFGenFilter as configured in PhysicsTools/JetMCAlgos/python/ttHFGenFilter_cfi.py. This cannot be perfectly reproduced on miniAOD in the analysis later, so the categorisation of the tt+xx sub-processes is not fully identical between generation and analysis, and the usage of such specific MC samples in the analysis is not straightforward.

The perfect solution would be to run the genJetFlavourInfos already in the miniAOD production and to store its output in miniAOD, analogous to what is already done for recoJets (ghost B/C hadron clustering is used there for the identification of heavy-flavour recoJets at truth level). This would mean moving lines 86-100 of the described example config matchGenHFHadrons.py into the miniAOD production. This approach would solve all issues stated above, and would in addition simplify the configuration and reduce the run-time of the GenHFHadronMatcher.
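Since the error messages quoted above are expected on miniAOD, it can be convenient to tally them per category instead of reading them one by one. The helper below is a minimal sketch that assumes the message header format shown above. The function name is hypothetical and the sample log is a shortened copy of the quoted messages.

```python
import re
from collections import Counter

# Header of an error-severity message issued by JetFlavourClustering,
# e.g. "%MSG-e JetPtMismatch:  JetFlavourClustering:genJetFlavourInfos ...".
MSG_HEADER = re.compile(r"^%MSG-e\s+(\w+):\s+JetFlavourClustering:")


def count_clustering_errors(log_lines):
    """Count JetFlavourClustering error messages per message category."""
    counts = Counter()
    for line in log_lines:
        match = MSG_HEADER.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample_log = """\
%MSG-e MissingJetConstituent:  JetFlavourClustering:genJetFlavourInfos  27-Jan-2017 11:31:32 CET Run: 1 Event: 702414
Jet constituent required for jet reclustering is missing. Reclustered jets are not guaranteed to reproduce the original jets!
%MSG
%MSG-e JetPtMismatch:  JetFlavourClustering:genJetFlavourInfos  27-Jan-2017 11:31:32 CET Run: 1 Event: 702414
The reclustered and original jet 7 have different Pt's (9.86161 vs 13.1001 GeV, respectively).
%MSG
""".splitlines()

counts = count_clustering_errors(sample_log)
for category, number in sorted(counts.items()):
    print(category, number)
```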

Output of the tool

When opening the matchGenHFHadrons_out.root file, which is the raw dump of the GenHFHadronMatcher and is produced by the example configuration matchGenHFHadrons.py described in the previous section, one can find a number of branches created by the tool:

  • genBHadPlusMothers: vector of Candidate objects representing information about b hadrons, all their ancestor particles up to protons, and leptons from leptonic decays of the hadrons. Lorentz Vectors, and pdgId can be extracted for any particle in the list.
  • genBHadPlusMothersIndices: for each particle in genBHadPlusMothers except leptons, a vector of integers is stored that holds the indices of its mother particles, which are also stored in the genBHadPlusMothers collection. Usually the vectors have size 1, except for strings, clusters or some other rare cases that have more than one mother.
  • genBHadIndex: for each b hadron an integer is stored representing an index of the hadron in the genBHadPlusMothers collection.
  • genBHadFlavour: for each b hadron an integer is stored with absolute value representing flavour of a mother of the corresponding b quark and sign representing the sign of the hadron flavour. For example:
    • 6 - a b hadron from a t -> b (or flavour oscillation tbar -> bbar -> b hadron)
    • -6 - a b hadron from tbar -> bbar (or flavour oscillation t -> b -> bbar hadron)
    • 25 - a b hadron from H -> b (or flavour oscillation H -> bbar -> b hadron)
    • -25 - a b hadron from H -> bbar (or flavour oscillation H -> b -> bbar hadron)
    • 23 - a b hadron from Z -> b (or flavour oscillation Z -> bbar -> b hadron)
    • -23 - a b hadron from Z -> bbar (or flavour oscillation Z -> b -> bbar hadron)
    • 0 - origin of the hadron could not be identified
  • genBHadFromTopWeakDecay: for each b hadron an integer:
    • 1 - b hadron is from a b quark radiated after top quark weak decay
    • 0 - b hadron is from a b quark radiated before top quark weak decay
    • 2 - origin of the b hadron could not be identified
  • genBHadJetIndex: for each b hadron an integer representing an index of a jet to which the hadron has been associated by the JetFlavour tool
    • -1 - if option onlyJetClusteredHadrons: True was set and the hadron has not been associated to any jet
  • genBHadLeptonIndex: for each lepton from leptonic decays of the processed b hadrons an integer representing an index of the lepton in the genBHadPlusMothers collection
  • genBHadLeptonHadronIndex: for each lepton from leptonic decays of the processed b hadrons an integer representing an index of the corresponding b hadron from the genBHadIndex collection
  • genBHadLeptonViaTau: for each lepton from leptonic decays of the processed b hadrons an integer:
    • 1 - the lepton comes from the b hadron decay: B -> tau -> e/mu
    • 0 - the lepton comes from the b hadron decay: B -> e/mu
    • -1 - neither of the two (something went wrong)
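The encoding of genBHadFlavour and genBHadFromTopWeakDecay listed above can be decoded with a few lines. This is a sketch under the conventions stated in the list. The function name and the lookup tables are illustrative, and the gluon and proton entries rely only on the standard pdgId values 21 and 2212.

```python
# Absolute value of genBHadFlavour: pdgId of the mother of the b quark.
ORIGIN_NAMES = {6: "top quark", 23: "Z boson", 25: "Higgs boson",
                21: "gluon", 2212: "proton"}

# Meaning of the genBHadFromTopWeakDecay values.
TOP_DECAY_FLAG = {0: "before top weak decay",
                  1: "after top weak decay",
                  2: "unidentified"}


def describe_b_hadron(flavour, from_top_weak_decay):
    """Return (origin, sign of the hadron flavour, top decay flag meaning)."""
    if flavour == 0:
        origin = "unidentified"
    else:
        origin = ORIGIN_NAMES.get(abs(flavour), "pdgId %d" % abs(flavour))
    sign = (flavour > 0) - (flavour < 0)  # +1, -1, or 0 if unidentified
    return origin, sign, TOP_DECAY_FLAG[from_top_weak_decay]


print(describe_b_hadron(-6, 1))
print(describe_b_hadron(25, 0))
```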

Exactly the same set of information is produced when analysing c hadrons, for which names of the branches change from genBHad to genCHad. For c hadrons an additional branch is important which doesn't make sense for b hadrons:

  • genCHadBHadronId: for each c hadron an integer representing the index of an ancestor b hadron in the genCHadPlusMothers collection
    • -1 - has no b hadrons among mothers -> is a prompt c hadron
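Selecting prompt c hadrons then amounts to keeping the entries with value -1. A minimal sketch with made-up branch content (the function name is hypothetical):

```python
def prompt_c_hadron_positions(gen_c_had_b_hadron_id):
    """Positions of c hadrons that have no b hadron among their ancestors."""
    return [position
            for position, b_hadron_index in enumerate(gen_c_had_b_hadron_id)
            if b_hadron_index == -1]


# Toy content: the second c hadron stems from the b hadron stored at index 3.
genCHadBHadronId = [-1, 3, -1]
print(prompt_c_hadron_positions(genCHadBHadronId))
```

The returned positions can be used directly in the other per-hadron genCHad branches, since these share the same ordering.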

All branches for hadrons (genXHadIndex, genXHadFlavour, genXHadFromTopWeakDecay, genXHadJetIndex, and for c hadrons genCHadBHadronId) have the same size and ordering. The element with index e.g. 1 represents the same hadron in all the branches.

All branches for leptons (genXHadLeptonIndex, genXHadLeptonHadronIndex, genXHadLeptonViaTau) have the same size and ordering. The element with index e.g. 1 represents the same lepton in all the branches.
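Because of this common ordering, the branches can be treated as parallel arrays. The sketch below uses made-up content for one toy event, stored in plain Python lists, to group hadrons by jet and leptons by hadron. The two helper functions are illustrative and not part of the tool.

```python
from collections import defaultdict

# Toy content of the per-hadron and per-lepton branches (values are made up).
genBHadJetIndex = [0, 0, 2, -1]          # one entry per b hadron
genBHadLeptonHadronIndex = [0, 2, 2]     # one entry per lepton


def hadrons_per_jet(jet_index):
    """Map jet index -> positions of the hadrons associated to that jet."""
    jets = defaultdict(list)
    for hadron_position, jet in enumerate(jet_index):
        if jet >= 0:  # -1 means the hadron is not clustered to any jet
            jets[jet].append(hadron_position)
    return dict(jets)


def leptons_per_hadron(lepton_hadron_index):
    """Map hadron position -> positions of the leptons from its decay."""
    hadrons = defaultdict(list)
    for lepton_position, hadron in enumerate(lepton_hadron_index):
        hadrons[hadron].append(lepton_position)
    return dict(hadrons)


print(hadrons_per_jet(genBHadJetIndex))
print(leptons_per_hadron(genBHadLeptonHadronIndex))
```

Jets with two or more entries in the first map are those containing several b hadrons, which is the information used by the event categorisation examples later on this page.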

Configurable parameters

The matchGenHFHadron producer has several configurable options as defined in the PhysicsTools/JetMCAlgos/python/GenHFHadronMatcher_cfi.py:

  • genParticles: input tag of the genParticles collection that contains hadrons to be analysed, and all their ancestors up to initial partons, and daughters. In AOD it is genParticles. In miniAOD it is prunedGenParticles.
  • jetFlavourInfos: input tag of the output produced by the JetFlavourClustering plugin, which holds for each jet an additional collection of partons, hadrons and leptons associated to the jet by the clustering algorithm (the default configuration of selectedHadronsAndPartons injects only hadrons as ghosts, and nothing else is needed here).
  • flavour: flavour of jets that should be analysed. Can be 5 (b jets) or 4 (c jets).
  • onlyJetClusteredHadrons: whether only those hadrons that are clustered to a jet should be analysed. About 3% of hadrons have low energy and are far from the closest jets, so that they are not clustered to any jet. If such hadrons need to be analysed as well, the flag should be set to False, which can slightly reduce the performance.
  • noBBbarResonances: whether resonances of b/c quarks should not be treated as b hadrons and should be skipped. Resonances correspond to hadrons that have pdgId: X55x or X44x, where X means any sequence of digits or no digits and x means any single digit.
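The pdgId pattern used by noBBbarResonances can be written as a short check on the tens and hundreds digits. The following sketch implements the pattern exactly as stated above (X55x for flavour 5, X44x for flavour 4). The function name is hypothetical and the code is not copied from the plugin.

```python
def matches_resonance_pattern(pdg_id, flavour):
    """True if |pdgId| has the form X55x (flavour=5) or X44x (flavour=4)."""
    # Drop the last digit, then keep the two digits in front of it.
    tens_and_hundreds = (abs(pdg_id) // 10) % 100
    return tens_and_hundreds == 11 * flavour


# Well-known pdgIds: Upsilon(1S) = 553, J/psi = 443, B0 = 511, D0 = 421.
for pdg_id, flavour in [(553, 5), (443, 4), (511, 5), (421, 4)]:
    print(pdg_id, matches_resonance_pattern(pdg_id, flavour))
```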

Event categorization example 1

This example describes process definitions of tt+xx processes, where xx are different jet flavour topologies.

It is based on a dedicated EDProducer which reads the output of the GenHFHadronMatcher and directly returns a per-event ID, so the classification of the tt+xx processes is already fully defined. The plugin is in CMSSW in the sub-package TopQuarkAnalysis/TopTools, and is called GenTtbarCategorizer, with the steering parameters as given in GenTtbarCategorizer_cfi.py.

A test EDAnalyzer TestGenTtbarCategories, showing how to use the tool, exists in TopQuarkAnalysis/TopTools/test, with the configuration given in testGenTtbarCategories.py. The test configuration produces a file called genTtbarId.root, and is executed by

  • cd TopQuarkAnalysis/TopTools/test/
  • cmsRun testGenTtbarCategories.py (runs on predefined AOD files)
  • cmsRun testGenTtbarCategories.py runOnAOD=False (runs on predefined miniAOD files)

The producer and analyzer are fully functional in CMSSW from CMSSW_7_6_X onwards, and a preliminary version was attached for downloading with instructions to earlier revisions of this TWiki up to revision r19.

The producer returns a single int, the event ID, consisting of 5 digits. The meaning of the IDs is explained at the beginning of the producer file, in the class description of GenTtbarCategorizer.cc. The example analyzer reads the IDs from the producer and fills them into histograms.

This example is very simple because:

  • One does not need to understand the underlying algorithm, the Producer returns directly a per-event ID of the process.
  • All additional jets are treated identically, contrary to the next example.

The producer is used in all ttH(bb) analyses at 13 TeV to classify the different tt+xx background processes.

The procedure is the following:

  • define kinematic criteria for considered jets: pt, |eta| (configurable steering parameters of the producer)
  • create lists of jets that contain b or c hadrons to define b and c jets, and identify the hadron origin. Exclude c hadrons which come from a b hadron decay
  • identify b jets containing b hadrons from t->Wb: can not be additional jets, but are called b jets from top, and are excluded from the list of additional jets identified in the following steps
  • identify b jets containing b hadrons coming from the W decay of the t->Wb: can not be additional jets, but are called b jets from W, and are excluded from the list of additional jets identified in the following steps
  • identify all other b jets containing b hadrons: these are additional b jets
  • identify c jets containing c hadrons coming from the W decay of the t->Wb: can not be additional jets, but are called c jets from W, and are excluded from the list of additional jets identified in the following steps
  • identify all other c jets containing c hadrons: these are additional c jets
  • all remaining jets are additional light flavour jets

Each event enters one of the following categories in the given order, i.e. it enters the first category where it fulfills the criterion. The categorisation scheme used is:

  • xxx51: tt+b (one additional b jet containing a single b hadron)
  • xxx52: tt+2b (one additional b jet containing at least 2 b hadrons)
  • xxx53, xxx54, xxx55: tt+bb (at least two additional b jets, independent of the number of b hadrons in each)
  • xxx41, xxx42, xxx43, xxx44, xxx45: tt+cc (at least one additional c jet, independent of the number of c hadrons in each)
  • xxx00: tt+LF (no additional b/c jets)

The three digits xxx at the beginning of the ID are not relevant in this classification scheme, each x can take values 0, 1, 2 depending on the number of b jets from top, of b jets from W (of the top decay) and of c jets from W (of the top decay).
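On the analysis side, the scheme above reduces to looking at the last two digits of the ID. A minimal sketch, in which the function name and the example IDs are made up and the leading three digits are deliberately ignored:

```python
def ttbar_category(event_id):
    """Map the last two digits of the event ID to the tt+xx category name."""
    code = event_id % 100
    if code == 51:
        return "tt+b"
    if code == 52:
        return "tt+2b"
    if 53 <= code <= 55:
        return "tt+bb"
    if 41 <= code <= 45:
        return "tt+cc"
    if code == 0:
        return "tt+LF"
    raise ValueError("unexpected category code: %d" % code)


for event_id in (20051, 20054, 21142, 20000):
    print(event_id, ttbar_category(event_id))
```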

Event categorization example 2

This example describes process definitions of tt+xx processes, where xx are different jet flavour topologies. The scheme is similar but not identical to the one from example 1.

It is based on the test analyzer matchGenHFHadrons, which comes with the GenHFHadronMatcher and is run as described above. The analyzer contains an example of how to access and interpret the output of the GenHFHadronMatcher. It produces 2 trees:

  • tree_events (event-wise values: number of hadrons, b/c jets, eventId based on number and origin of heavy flavour jets, etc.)
  • tree_bHadrons (bHadron-wise values: flavour, N leptons from the hadron decay, hadron/jet pT ratio, etc.)
The tree_bHadrons is more technical and is mostly for checking the detailed performance. The tree_events is of general interest and shows how to categorize tt+xx.

This example is much more sophisticated than the first one, for the following reasons:

  • Much more information about the processes is accessed and stored in trees in the Analyzer.
  • The separation of tt+xx processes distinguishes between additional jets radiated before and after the top weak decay.

This categorisation scheme is used in the tt+bb dilepton differential cross section measurements of properties of additional b jets at 8 TeV, TOP-12-041.

The procedure is the following:

  • define kinematic criteria for additional jets: pt, |eta|
  • identify b jets containing b hadrons from t->Wb (regardless of their kinematics): can not be additional jets, but are called b jets from top, and are excluded from the list of additional jets identified in the following steps
  • create lists of remaining jets that contain b or c hadrons not from t->Wb, but originate from before or after top decay
  • Skip b and c hadrons coming from the W decay of the t->Wb. NB: This classification scheme was developed for dileptonic top decays, where this case of hadronic W decays never occurs. Instead of excluding these hadrons, jets containing these should ideally also be identified and excluded from the list of additional jets (as is done in Event categorization example 1 above)
  • Skip c hadrons that stem from the decay of a b hadron
  • count number of corresponding hadrons in each jet (jets from overlapping hadrons can have higher b-tag output)
  • jets are divided into additional (from hadrons before top decay) and pseudoadditional (from hadrons after top decay)
  • event is categorized based on number of additional/pseudoadditional b/c jets and number of hadrons in each of them

Each event enters one of the following categories in the given order, i.e. it enters the first category where it fulfills the criterion. The categorisation scheme used is:

  • x51: tt+b (one additional b jet containing a single b hadron)
  • x52: tt+2b (one additional b jet containing at least 2 b hadrons)
  • x53: tt+bb (two additional b jets: each b jet containing a single b hadron)
  • x54: tt+b2b (two additional b jets: one b jet containing a single b hadron, one at least 2 b hadrons)
  • x55: tt+2b2b (two additional b jets: each b jet containing at least 2 b hadrons)
  • x56: tt+B (one or more pseudo-additional b jets)
  • x41: tt+c (one additional c jet containing a single c hadron)
  • x42: tt+2c (one additional c jet containing at least 2 c hadrons)
  • x43: tt+cc (two additional c jets: each c jet containing a single c hadron)
  • x44: tt+c2c (two additional c jets: one c jet containing a single c hadron, one at least 2 c hadrons)
  • x45: tt+2c2c (two additional c jets: each c jet containing at least 2 c hadrons)
  • x46: tt+C (one or more pseudo-additional c jets)
  • x00: tt+jj (no additional or pseudo-additional b/c jets)

The first digit x at the beginning of the ID is not relevant in this classification scheme, x can take values 0, 1, 2 depending on the number of b jets from top. In case of more than 2 additional b(c) jets, the event is classified according to the 2 leading b(c) jets. The processes were studied individually as listed here in TOP-12-041, but several were merged in the final analysis.
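The ordering of the categories can be made explicit in code. The sketch below is one reading of the list above and is not taken from the matchGenHFHadrons analyzer. It assumes that the additional jets are given as lists of hadron multiplicities ordered from leading to sub-leading jet, and all names are hypothetical.

```python
def flavour_code(hadrons_per_additional_jet, n_pseudo_additional, base):
    """Return base+1..base+6 for one flavour, or None if no category applies."""
    # Only the two leading additional jets are considered.
    jets = hadrons_per_additional_jet[:2]
    if len(jets) == 1:
        return base + (1 if jets[0] == 1 else 2)
    if len(jets) == 2:
        n_multi_hadron_jets = sum(1 for n_hadrons in jets if n_hadrons >= 2)
        return base + 3 + n_multi_hadron_jets
    if n_pseudo_additional > 0:
        return base + 6
    return None


def additional_jet_code(b_jets, n_pseudo_b, c_jets, n_pseudo_c):
    """Last two digits of the event ID; b categories take precedence over c."""
    code = flavour_code(b_jets, n_pseudo_b, 50)
    if code is None:
        code = flavour_code(c_jets, n_pseudo_c, 40)
    return 0 if code is None else code


# Two additional b jets, the leading one containing two b hadrons.
print(additional_jet_code([2, 1], 0, [], 0))
```

Checking the b categories before the c categories reproduces the rule that an event enters the first matching category in the listed order.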

Underlying identification method

Identification is performed in 3 steps. First, all b (c) hadrons are found in the event. Then, for each hadron a single corresponding b (c) quark is found by scanning the particle chain up from the hadron towards the initial state particles. The mother of this quark is treated as the origin of the hadron, which can be a top quark, Higgs boson, Z boson, gluon, etc. The final step is to associate each hadron to a jet. This is done by the JetFlavour tool, which uses the jet clustering algorithm to inject ghost hadrons into the jets and does the matching by looking for those hadrons among the jet constituents.
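The second step can be sketched on the structure of the genBHadPlusMothers and genBHadPlusMothersIndices branches described earlier. The function below is a strongly simplified illustration and not the algorithm of the plugin: it follows one mother per step, preferring a quark of the requested flavour, stops at the first ancestor of that quark with a different pdgId, and protects against particle loops. It returns the pdgId of that ancestor and does not reproduce the sign convention of genXHadFlavour. All names and the toy event are made up.

```python
def find_origin(hadron_position, pdg_ids, mother_indices, quark_flavour=5):
    """Return the pdgId of the mother of the heavy quark, or 0 if not found."""
    visited = set()
    current = hadron_position
    seen_quark = False
    while current not in visited:  # protection against infinite loops
        visited.add(current)
        mothers = mother_indices[current]
        if not mothers:
            return 0  # reached the start of the chain without success
        # Prefer a mother that is a quark of the requested flavour.
        mother = next((m for m in mothers
                       if abs(pdg_ids[m]) == quark_flavour), mothers[0])
        if abs(pdg_ids[mother]) == quark_flavour:
            seen_quark = True
        elif seen_quark:
            return pdg_ids[mother]
        current = mother
    return 0  # a particle loop was detected


# Toy chain: proton -> t -> b -> string -> B- (pdgId -521).
pdg_ids = [2212, 6, 5, -2, 92, -521]
mother_indices = [[], [0], [1], [0], [3, 2], [4]]
print(find_origin(5, pdg_ids, mother_indices))
```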

More details about the identification method, estimated performance and comparison to simple deltaR matching can be found in the AN-14-093.

Topic revision: r30 - 2017-01-30 - JohannesHauk