PAT Exercise 06: Embedding of extra information in PAT

Objectives

  • Learn what the problem of internal references is within the EDM.
  • Learn how PAT solves this situation.
  • Learn how to embed extra information into a pat::Candidate.

Introduction

In this exercise you will learn how to use embedding in PAT. The introduction explains the concept of embedding. In the following sections an example of embedding in PAT is explained in detail. Finally there are exercises on how to use embedding in PAT in a physics analysis.

Objects in the Event Data Model (EDM) contain references to information related to them. CaloJets, for example, contain references to the CaloTowers they were produced from. This saves disk space compared with storing the CaloTowers themselves within the jets. However, it can create a confusing web of cross references throughout the whole EventContent, which significantly complicates reducing the EventContent to what is really needed for a specific analysis. In our example, when the collection of CaloTowers (which makes up a significant fraction of the EventContent!) is dropped, the references within the collection of CaloJets become invalid.

As the pat::Tuple is an analysis data format (to be compared to the user ntuple), reducing the EventContent to what is really necessary in a user's analysis is one of its key elements. In our example, while producing a pat::Tuple, one may want to retain the CaloTowers from which the CaloJets were produced for later use, while dropping all the others. To allow for this, PAT has introduced the concept of embedding.

In releases up to CMSSW_3_6_X, the referenced objects (e.g. CaloTowers) were hard-copied into the pat::Objects (e.g. pat::Jets), depending on the user's configuration. When the corresponding member function of the pat::Jet is called in a later analysis, it checks internally whether the CaloTowers have been embedded and returns the corresponding references accordingly. If the collection of CaloTowers is still part of the EventContent, the original references are used. In this way, access to the referenced information is completely transparent to the user in later analysis stages.

From CMSSW_3_8_X onwards, the implementation of embedding CaloTowers into the pat::Jets has changed to optimize the performance of access to pat::Jets. The CaloTowers are now stored in a separate collection that contains only the CaloTowers clustered into pat::Jets (and is therefore significantly smaller than the collection of all CaloTowers). As before, access to the referenced information is completely transparent to the user in later analysis stages.

In our example, you can safely drop the standard collection of CaloTowers, thereby reducing the size of the pat::Tuple, and at the same time refer to the embedded CaloTowers stored in a separate collection.

Note: If you dropped the CaloTower collection from the EventContent and did not embed the CaloTowers into the pat::Jets, calling the corresponding member function will cause an edm::Exception.
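
The three cases described above (towers embedded, towers only referenced but still in the event, towers neither embedded nor kept) can be pictured with a small toy model in plain Python. This is only a sketch of the idea, loosely following the hard-copy scheme of the older releases; it is not the PAT implementation. The classes ToyEvent and ToyJet, the method towers() and the use of LookupError as a stand-in for edm::Exception are all hypothetical names chosen for illustration.

```python
# Toy model of embedding (hypothetical classes, not the CMSSW API).


class ToyEvent:
    """Stand-in for the event content: collection label -> list of objects."""

    def __init__(self, collections):
        self.collections = dict(collections)


class ToyJet:
    """Stand-in for a pat::Jet that references towers and may embed copies."""

    def __init__(self, tower_refs, embedded_towers=None):
        self.tower_refs = list(tower_refs)      # (collection label, index) pairs
        self.embedded_towers = embedded_towers  # copies of the towers, or None

    def towers(self, event):
        # 1) Embedded copies are used whenever they exist.
        if self.embedded_towers is not None:
            return list(self.embedded_towers)
        # 2) Otherwise follow the references into the event content.
        result = []
        for label, index in self.tower_refs:
            if label not in event.collections:
                # 3) Neither embedded nor kept: the reference is invalid.
                raise LookupError("collection '%s' was dropped" % label)
            result.append(event.collections[label][index])
        return result


full_event = ToyEvent({"towerMaker": ["t0", "t1", "t2", "t3"]})
slim_event = ToyEvent({})  # the tower collection has been dropped

refs = [("towerMaker", 1), ("towerMaker", 3)]
plain_jet = ToyJet(refs)
embedded_jet = ToyJet(refs, embedded_towers=["t1", "t3"])

print(plain_jet.towers(full_event))     # references resolve
print(embedded_jet.towers(slim_event))  # embedded copies survive the drop
try:
    plain_jet.towers(slim_event)
except LookupError as err:
    print("access failed: %s" % err)
```

The caller uses the same towers() call in all three cases, which is the sense in which the access is transparent to the user.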

In the following you will learn how to configure embedding in order to customise your event content more efficiently.

Setting up the environment

We assume that you are logged in on lxplus and are in your work directory. If not, you can follow the instructions given here.

mkdir exercise08a
cd exercise08a
cmsrel CMSSW_7_4_1_patch4
cd CMSSW_7_4_1_patch4/src 
cmsenv
addpkg PhysicsTools/PatAlgos V08-07-31-01
addpkg FWCore/GuiBrowsers V00-00-56
scram b -j4

If you are running remotely (via ssh -Y) use edmConfigEditorSSH instead of edmConfigEditor in the following.

Note that you need a reasonably good network connection to make use of the graphical tools via ssh -Y. If your connection is not sufficient, you can do this exercise using interactive python and text editors only.

PAT default embedding

Let's first find out what is embedded in PAT objects by default. Here we show as an example the pat::Jets, but in principle this recipe holds for any kind of PAT object. Inspect the configuration using the ConfigEditor:

edmConfigEditor PhysicsTools/PatAlgos/test/patTuple_standard_cfg.py 

Then search and select the module "patJets" (e.g. using Edit->Find) and browse its parameters. The result is the following:

embedding-defaults.png

embedCaloTowers = True
embedGenPartonMatch = True
embedGenJetMatch = True
embedPFCandidates = True

Alternatively, you can find this information in the configuration file in cvs: jetProducer_cfi.py. You can also look at the module using interactive python:

python -i PhysicsTools/PatAlgos/test/patTuple_standard_cfg.py
>>> process.patJets

As you can see, by default the CaloTowers are embedded into the pat::Jets. The matched generated parton and GenJet are embedded as well. In the case of PFJets, the PFCandidates are also embedded. For the following examples you may keep this configuration open in the ConfigEditor.

Embed extra information into PAT objects

Let's try to access the CaloTowers in the default case of embedding. First create a pat::Tuple using the standard configuration:

cmsRun PhysicsTools/PatAlgos/test/patTuple_standard_cfg.py 

Print the event content and the size of each branch:

edmEventSize -v patTuple.root

patJets_cleanPatJets__PAT. 21036.6 4104.06
CaloTowers_selectedPatJets_caloTowers_PAT. 15050.1 5910.26

Note: The first number of the output is the uncompressed size per event, while the second number is the size per event at the zip compression level currently used within the EDM. The second number is the one to check in order to see how much disk space will physically be needed per event to store this collection.

With the PAT default, cleanPatJets and a collection of the calo towers clustered into the pat::Jets are kept in the event. Together they make up about 10 kB per event for the default MC sample (ttbar). Note: To arrive at this number, add the values in the last column for the two collections. These are given in bytes, so divide the sum by 1024 to obtain the size in kB per event.
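
The arithmetic can be checked with a few lines of plain Python. The sketch below assumes that each relevant output line consists of three whitespace-separated fields (branch name, uncompressed bytes, compressed bytes), as in the excerpt above; the helper name parse_sizes is made up for this example and is not part of any CMSSW tool.

```python
# Sum the compressed sizes of selected branches and convert bytes to kB.
# The two lines are copied from the edmEventSize output shown above.
LINES = """\
patJets_cleanPatJets__PAT. 21036.6 4104.06
CaloTowers_selectedPatJets_caloTowers_PAT. 15050.1 5910.26
"""


def parse_sizes(text):
    """Return {branch name: (uncompressed bytes, compressed bytes)}."""
    sizes = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank lines and anything that is not a size row
        name, uncompressed, compressed = fields
        try:
            sizes[name.rstrip(".")] = (float(uncompressed), float(compressed))
        except ValueError:
            continue  # skip header-like lines whose fields are not numbers
    return sizes


sizes = parse_sizes(LINES)
compressed_bytes = sum(compressed for _, compressed in sizes.values())
compressed_kb = compressed_bytes / 1024.0
print("%.2f kB per event" % compressed_kb)
```

To use it on a real file, the output of edmEventSize -v could be pasted into LINES in place of the two example rows.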

Now check out the PAT example analyzers:

addpkg PhysicsTools/PatExamples V00-05-29

Edit PhysicsTools/PatExamples/plugins/PatBasicAnalyzer.cc. Uncomment the code that plots the number of CaloTowers per pat::Jet:

// uncomment the following line to fill the jetTowers_ histogram
jetTowers_.Fill(jet->getCaloConstituents().size());

// uncomment the following line to book the jetTowers_ histogram
jetTowers_ = fs->make<TH1F>("jetTowers", "towers per jet", 90, 0, 90);

Compile and run. Note: You might have to adapt the name of the input file in analyzePatBasics_cfg.py before starting cmsRun.

scram b -j4
rehash
cmsRun PhysicsTools/PatExamples/test/analyzePatBasics_cfg.py

root -l analyzePatBasics.root
root [0] TBrowser b
  • double-click on analyzePatBasics.root in ROOT files on the left sidebar;
  • double-click on analyzeBasicPat;1;
  • double-click on jetTowers;1.

towers.png

PAT objects without embedding and without keeping extra branches of the event

Let's see what happens when we switch off the embedding of calo towers into the pat::Jets. To do so, open the file patTuple_standard_cfg.py in the ConfigEditor again. To modify it, click on "Edit using ConfigEditor". Search for and select the module "patJets" and set:

embedCaloTowers = False

Save as "user_noembed_cfg.py" resulting in the following configuration:

### Generated by ConfigEditor ###
import sys
import os.path
sys.path.append(os.path.abspath(os.path.expandvars(os.path.join('$CMSSW_BASE','src/PhysicsTools/PatAlgos/test'))))
sys.path.append(os.path.abspath(os.path.expandvars(os.path.join('$CMSSW_RELEASE_BASE','src/PhysicsTools/PatAlgos/test'))))
### --------------------------- ###

from patTuple_standard_cfg import *

### Generated by ConfigEditor ###
if hasattr(process,'resetHistory'): process.resetHistory()
### --------------------------- ###
process.patJets.embedCaloTowers=False

Run and measure the event content sizes.

cmsRun user_noembed_cfg.py 
edmEventSize -v patTuple.root

patJets_cleanPatJets__PAT. 15480.6 3599.76
CaloTowers_selectedPatJets_caloTowers_PAT. 426.82 289.85

As you can see, the compressed size of the pat::Jets has decreased to about 3.5 kB per event without embedded CaloTowers, compared with about 10 kB per event for the pat::Jets plus their embedded CaloTowers. The CaloTowers_selectedPatJets_caloTowers_PAT collection is effectively empty now and can be dropped from the event content to save a further 0.3 kB per event. Note: As before, take the numbers from the last column and divide them by 1024 to arrive at these estimates.

Now execute our EDAnalyzer for the CaloTowers again:

cmsRun PhysicsTools/PatExamples/test/analyzePatBasics_cfg.py

The code will crash! The reason is that the CaloTowers are neither embedded in the pat::Jets nor kept in the event content, so the attempt to access them fails.

Keep extra branches of the event

Another option to make the CaloTowers accessible from the pat::Jets is to keep them in the event content. To do so, open the file patTuple_standard_cfg.py in the ConfigEditor again. To modify it, click on "Edit using ConfigEditor". Search for and select the module "out". Edit the parameter outputCommands and add the following entry. You may click on the "pencil" symbol to open an editor window.

'keep CaloTowers*_towerMaker_*_*'

Save this configuration as "user_keep_cfg.py". It will look like this:

### Generated by ConfigEditor ###
import sys
import os.path
sys.path.append(os.path.abspath(os.path.expandvars(os.path.join('$CMSSW_BASE','src/PhysicsTools/PatAlgos/test'))))
sys.path.append(os.path.abspath(os.path.expandvars(os.path.join('$CMSSW_RELEASE_BASE','src/PhysicsTools/PatAlgos/test'))))
### --------------------------- ###

from patTuple_standard_cfg import *

### Generated by ConfigEditor ###
if hasattr(process,'resetHistory'): process.resetHistory()
### --------------------------- ###
process.out.outputCommands = cms.untracked.vstring('drop *',     
'keep *_cleanPatPhotons*_*_*',     
'keep *_cleanPatElectrons*_*_*',     
'keep *_cleanPatMuons*_*_*',     
'keep *_cleanPatTaus*_*_*',     
'keep *_cleanPatJets*_*_*',     
'keep *_patMETs*_*_*',     
'keep *_cleanPatHemispheres*_*_*',     
'keep *_cleanPatPFParticles*_*_*',     
'keep *_cleanPatTrackCands*_*_*',     
'keep CaloTowers*_towerMaker_*_*')
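
To get a feeling for which branches such a list of output commands selects, the following plain-Python sketch mimics the selection with glob matching from the standard library. It is a toy model under stated assumptions, not the framework's own matching code: commands are applied in order and the last matching command decides, a branch that matches no command is kept, and the function name is_kept is made up. The command list is a shortened version of the one above, and the third branch name only serves as an example of a branch that matches no keep statement.

```python
from fnmatch import fnmatchcase


def is_kept(branch, output_commands):
    """Apply 'keep'/'drop' commands in order; the last matching one wins."""
    kept = True  # assumption of this sketch: nothing is dropped by default
    for command in output_commands:
        action, pattern = command.split(None, 1)
        if fnmatchcase(branch, pattern):
            kept = (action == "keep")
    return kept


commands = [
    "drop *",
    "keep *_cleanPatJets*_*_*",
    "keep CaloTowers*_towerMaker_*_*",
]
branches = [
    "patJets_cleanPatJets__PAT",
    "CaloTowersSorted_towerMaker__RECO",
    "recoTracks_generalTracks__RECO",  # illustrative branch matching no keep
]
for branch in branches:
    state = "kept" if is_kept(branch, commands) else "dropped"
    print("%s -> %s" % (branch, state))
```

The pattern fields follow the branch naming visible in the edmEventSize output: object type, module label, instance label and process name, separated by underscores.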

Run and show the event content and sizes:

cmsRun user_keep_cfg.py 
edmEventSize -v patTuple.root

CaloTowersSorted_towerMaker__RECO. 94692.2 18715.4
patJets_cleanPatJets__PAT. 21036.6 4104.06

As you can see, the complete set of calo towers is now kept in the event, at an additional cost of about 18 kB per event. Note: You can read this number from the last column of the first row (18715.4 bytes, divided by 1024). This is disk space in addition to the jet collection, which takes roughly 4 kB per event.

Note: The pat::Jet collection plus the whole CaloTower collection (about 22 kB per event) is bigger than the pat::Jet collection with embedded calo towers (about 10 kB per event), since only towers clustered into the jets are stored in that case.
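
For a side-by-side view, the compressed sizes quoted in the three edmEventSize outputs of this section can be totalled in a short Python sketch. The numbers are copied from those outputs; the labels and the helper total_kb are made up for this illustration.

```python
# Compressed sizes in bytes per event, copied from the outputs in this section.
CONFIGURATIONS = [
    ("default (towers embedded)", [4104.06, 5910.26]),
    ("embedCaloTowers = False", [3599.76, 289.85]),
    ("full CaloTower collection kept", [4104.06, 18715.4]),
]


def total_kb(sizes_in_bytes):
    """Add up the per-branch sizes and convert bytes to kB."""
    return sum(sizes_in_bytes) / 1024.0


totals = {}
for label, sizes_in_bytes in CONFIGURATIONS:
    totals[label] = total_kb(sizes_in_bytes)
    print("%-32s %6.2f kB per event" % (label, totals[label]))
```

Only the first and the last configuration allow the analyzer to access the towers; the middle one is the smallest but loses that access.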

Now execute our EDAnalyzer for the CaloTowers again:

cmsRun PhysicsTools/PatExamples/test/analyzePatBasics_cfg.py
root -l analyzePatBasics.root
root [0] TBrowser b
  • double-click on analyzePatBasics.root in ROOT files on the left sidebar;
  • double-click on analyzeBasicPat;1;
  • double-click on jetTowers;1.

This results in the following plot:

towers.png

Keeping the CaloTowers in the event content allows you to access them from the pat::Jets. Note: All CaloTowers are kept, including those that were not clustered into any jet. This option may therefore not be the best for saving disk space if you are only interested in the CaloTowers clustered into jets. In that case, consider the PAT default option of embedding.

Note: If you are using an FWLiteAnalyzer, keeping the collection does not always work. You should embed instead.

Exercises

Before leaving this page try to do the following exercises:

Exercise 6 a): Disable all additional information for pat::Jets and compare their size to reco::Jets.
Question: Can you reduce the pat::Jet size to the original reco::CaloJet size?
Question: What is the size of the pat::Jet collection after removing embedding and additional information (e.g. addGenJetMatch=False, ...)?
You may proceed as follows:

Switch off all information added to the pat::Jet:

process.patJets.addTagInfos = False
process.patJets.addJetCharge = False
process.patJets.addGenJetMatch = False
process.patJets.embedGenJetMatch = False
process.patJets.addAssociatedTracks = False
process.patJets.addDiscriminators = False
process.patJets.embedGenPartonMatch = False
process.patJets.addGenPartonMatch = False
process.patJets.embedPFCandidates = False
process.patJets.addJetCorrFactors = False
process.patJets.addBTagInfo = False
process.patJets.embedCaloTowers = False
process.patJets.addJetID = False

Add reco::CaloJets to the output event content:

'keep recoCaloJets_ak5CaloJets_*_*'

Calculate event size:

cmsRun user_no_embedding.py
edmEventSize -v patTuple.root

You will find the solution here:

The result is:
patJets_cleanPatJets__PAT. 8276.94 2086.14
CaloTowers_selectedPatJets_caloTowers_PAT. 426.82 298.95
recoGenJets_selectedPatJets_genJets_PAT. 311.15 217.34
recoBasedTagInfosOwned_selectedPatJets_tagInfos_PAT. 118.76 82.92
recoCaloJets_ak5CaloJets__RECO. 3706.14 951.01
By switching off all additional information stored in the pat::Jet, its size is reduced to about 2.0 kB per event, which is still roughly twice the size of the original reco::CaloJets (about 0.9 kB per event).

Exercise 6 b): Embed tracks (in the tracker, not in the muon system) into the pat::Muons and compare for each muon the pt of the track and the muon itself.
Question: What fraction of muons have a track in the tracker?
You may proceed as follows:

Embed the tracks into the muons:

process.patMuons.embedTrack = True
Create a patTuple.root.

Edit PhysicsTools/PatExamples/plugins/PatBasicAnalyzer.cc and add a plot with the ratio of track pT and muon pT for each muon that has a track:

// ... inside analyze(...) ...
  for(edm::View<pat::Muon>::const_iterator muon=muons->begin(); muon!=muons->end(); ++muon){
     // fill 0 for muons without a tracker track, otherwise track pt / muon pt
     histContainer_["muonstrackptfraction"]->Fill(muon->track().isNull() ? 0 : muon->track()->pt()/muon->pt());
  }
// ... inside beginJob() ...
  histContainer_["muonstrackptfraction"]=fs->make<TH1F>("muonstrackptfraction", "muon track pt fraction", 101, 0, 2);
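
The quantity filled into this histogram, and how it relates to the question above, can be illustrated without CMSSW in a few lines of Python. The (track pt, muon pt) pairs below are invented for the example, None stands for a muon without a tracker track, and the function name track_pt_fraction is hypothetical.

```python
def track_pt_fraction(track_pt, muon_pt):
    """Mirror the fill expression: 0 if there is no track, else the pt ratio."""
    if track_pt is None:
        return 0.0
    return track_pt / muon_pt


# Invented (track pt, muon pt) pairs in GeV; None means no tracker track.
muons = [(20.0, 20.0), (None, 15.0), (31.5, 30.0), (None, 8.0)]

fractions = [track_pt_fraction(track_pt, muon_pt) for track_pt, muon_pt in muons]
with_track = sum(1 for track_pt, _ in muons if track_pt is not None)
share_with_track = with_track / float(len(muons))

print(fractions)
print("share of muons with a tracker track: %.2f" % share_with_track)
```

Because muons without a track are filled at 0, the entries in the bin at 0 relative to all entries give one minus the fraction asked for in the question.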

Compile and run:

scram b -j4
cmsRun PhysicsTools/PatExamples/test/analyzePatBasics_cfg.py

Look at the plot:

root -l analyzePatBasics.root
root [0] TBrowser b

  • double-click on analyzePatBasics.root in ROOT files on the left sidebar;
  • double-click on analyzeBasicPat;1;
  • double-click on muonstrackptfraction;1.

You will find the solution here:

In CMSSW>=4_1_X you will see something like this:

muontrack.png

The plot can be explained as follows. Two different kinds of muons are visible:

  • Muons built only from tracks in the tracker peak at 1
  • Muons built only from tracks in the muon system peak at 0

In CMSSW_3_8_X you will see this:

muon-track-pt-fraction-2.png

From this plot you can see that the muon momentum is equal to the momentum of the tracker track that it contains.

In CMSSW_3_6_X you would see the following:

muon-track-pt-fraction.png

The plot can be explained as follows. Three different kinds of muons are visible:

  • Muons built only from tracks in the tracker peak at 1
  • Muons built only from tracks in the muon system peak at 0
  • Muons built from combined tracks in the tracker and the muon system show up elsewhere

Note: In case of problems, don't hesitate to contact the SWGuidePAT#Support. Having successfully finished Exercise 6, you might want to proceed to Exercise 7 of the SWGuidePAT to learn more about the PAT support for object disambiguation across different object collections. For an overview, you can go back to the WorkBookPATTutorial entry page.

Review status

Reviewer/Editor and Date (copy from screen) Comments
RogerWolf - 18 June 2010 Final revision and synch of layouts

Responsible: AndreasHinzmann

Topic revision: r60 - 2011-12-07 - FelixHoehle
 