PAT Exercise 06: Embedding of extra information in PAT
Objectives
- Learn what the problem of internal references is within the EDM.
- Learn how PAT solves this situation.
- Learn how to embed extra information into a pat::Candidate.
Introduction
In this exercise you will learn how to use embedding in PAT. The introduction explains the concept of embedding. In the following sections an example of embedding in PAT is explained in detail. Finally there are exercises on how to use embedding in PAT in a physics analysis.
All objects in the Event Data Model (EDM) contain references to information related to them. CaloJets, for example, contain references to the CaloTowers they were produced from. This saves disk space compared to storing copies of the CaloTowers within the jets. However, it can create a confusing web of cross references throughout the whole EventContent, which significantly complicates reducing the EventContent to what is really needed for a specific analysis. In our example, when the collection of CaloTowers (which makes up a significant part of the EventContent!) is dropped, the references within the collection of CaloJets become invalid.
As the pat::Tuple is an analysis data format (comparable to a user ntuple), reducing the EventContent to what is really necessary for a user's analysis is one of its key features. In our example, while producing a pat::Tuple one may want to retain the CaloTowers from which the CaloJets were produced for later use, while dropping all the others. To allow for this, PAT has introduced the concept of embedding.
In the implementation of CMSSW_3_6_X and earlier, the referenced objects (e.g. CaloTowers) were hard-copied into the pat::Objects (e.g. pat::Jets), depending on the configuration of the user. Calling the corresponding member function of the pat::Jet in a later analysis checks internally whether the CaloTowers have been embedded or not and returns the corresponding references accordingly. If the collection of CaloTowers is still part of the EventContent, the references are used. In this way, access to the referenced information is completely transparent to the user in later analysis stages.
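The lookup described above can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not PAT code: `ToyJet`, `MissingProductError`, the dictionary used as the event, and the tower energies are all hypothetical, and the order of the two checks (embedded copies first, references second) is an assumption made for the illustration.

```python
class MissingProductError(Exception):
    """Hypothetical stand-in for the error raised when a reference cannot be resolved."""


class ToyJet:
    def __init__(self, tower_indices, embedded_towers=None):
        self.tower_indices = tower_indices      # references into the event's tower collection
        self.embedded_towers = embedded_towers  # hard copies, or None if nothing was embedded

    def constituents(self, event):
        # Assumed order: use embedded copies if present, otherwise follow the references.
        if self.embedded_towers is not None:
            return list(self.embedded_towers)
        towers = event.get("towers")
        if towers is None:
            raise MissingProductError("towers neither embedded nor kept in the event")
        return [towers[i] for i in self.tower_indices]


full_event = {"towers": [1.5, 0.7, 3.2, 0.1]}  # hypothetical tower energies
slim_event = {}                                # tower collection was dropped

plain_jet = ToyJet([0, 2])
embedded_jet = ToyJet([0, 2], embedded_towers=[1.5, 3.2])

print(plain_jet.constituents(full_event))     # resolved through references
print(embedded_jet.constituents(slim_event))  # resolved through embedded copies
```

In this toy model the caller uses the same `constituents` call in both cases, which is the sense in which the access is transparent. Calling `plain_jet.constituents(slim_event)` raises the error, since the towers are neither embedded nor kept.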
From CMSSW_3_8_X onwards, the implementation of embedding CaloTowers into the pat::Jets has changed to optimize the performance of the access to pat::Jets. The CaloTowers are now stored in a separate collection containing all the CaloTowers that were clustered into pat::Jets (which is significantly smaller than the collection of all CaloTowers). As before, access to the referenced information is completely transparent to the user in later analysis stages.
In our example you can safely drop the standard collection of CaloTowers, thereby reducing the size of the pat::Tuple, and at the same time refer to the embedded CaloTowers stored in a separate collection.
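The idea of a separate, smaller collection can be sketched in plain Python as follows. The function `slim_towers`, the tower energies and the jet index lists are hypothetical; the sketch only shows why keeping the towers used by jets, plus remapped references, preserves the jet content while shrinking the collection. It does not reproduce the actual PAT implementation.

```python
def slim_towers(all_towers, jets):
    """Keep only towers referenced by at least one jet and remap the jet indices."""
    used = sorted({index for jet in jets for index in jet})
    new_index = {old: new for new, old in enumerate(used)}
    compact = [all_towers[index] for index in used]
    remapped = [[new_index[index] for index in jet] for jet in jets]
    return compact, remapped


all_towers = [1.5, 0.7, 3.2, 0.1, 2.4, 0.3]  # hypothetical tower energies
jets = [[0, 2], [4]]                         # each jet lists the towers it was clustered from

compact, remapped = slim_towers(all_towers, jets)
print(len(all_towers), "towers before,", len(compact), "after")

# Each jet still resolves to the same towers through the compact collection.
for old_jet, new_jet in zip(jets, remapped):
    assert [all_towers[i] for i in old_jet] == [compact[i] for i in new_jet]
```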
Note: If you dropped the CaloTower collection from the EventContent and did not embed the CaloTowers into the pat::Jets, calling the corresponding member function will still cause an edm::Exception.
You will learn in the following how to configure embedding to more efficiently customise your event content.
Setting up of the environment
We assume that you are logged in on lxplus and are in your work directory. If not, you can follow the instructions given here.
mkdir exercise08a
cd exercise08a
cmsrel CMSSW_7_4_1_patch4
cd CMSSW_7_4_1_patch4/src
cmsenv
addpkg PhysicsTools/PatAlgos V08-07-31-01
addpkg FWCore/GuiBrowsers V00-00-56
scram b -j4
If you are running remotely (via ssh -Y), use edmConfigEditorSSH instead of edmConfigEditor in the following.
Note that you need a reasonably good network connection to make use of the graphical tools via ssh -Y. If you don't have a sufficient connection, you may proceed doing this exercise using interactive python and text editors only.
PAT default embedding
Let's first find out what is embedded in PAT objects by default. Here we show the pat::Jets as an example, but in principle this recipe holds for any kind of PAT object.
Inspect the configuration using the ConfigEditor:
edmConfigEditor PhysicsTools/PatAlgos/test/patTuple_standard_cfg.py
Then search and select the module "patJets" (e.g. using Edit->Find) and browse its parameters. The result is the following:
embedCaloTowers = True
embedGenPartonMatch = True
embedGenJetMatch = True
embedPFCandidates = True
Alternatively you can find this information in the configuration file in cvs:
jetProducer_cfi.py
You can look at the module by using interactive python as well:
python -i PhysicsTools/PatAlgos/test/patTuple_standard_cfg.py
>>>process.patJets
As you can see, by default the CaloTowers are embedded into the pat::Jets. The matched generated parton and GenJet are embedded as well. In the case of PFJets, the PFJetCandidates are also embedded. For the following examples you may keep this configuration open in the ConfigEditor.
Embed extra information into PAT objects
Let's try to access the CaloTowers in the default case of embedding. First create a pat::Tuple using the standard configuration:
cmsRun PhysicsTools/PatAlgos/test/patTuple_standard_cfg.py
Print the event content and the size of each branch:
edmEventSize -v patTuple.root
patJets_cleanPatJets__PAT. 21036.6 4104.06
CaloTowers_selectedPatJets_caloTowers_PAT. 15050.1 5910.26
Note: The first number of the output indicates the uncompressed event size, while the second number shows the event size at the current zip compression level used within the EDM. The second number is the important one for checking how much disc space will physically be needed per event to store this collection.
Using the PAT default, cleanPatJets and a collection of calo towers clustered inside the pat::Jets are kept in the event. Together they make up about 10kB per event for the default MC sample (ttbar).
Note: To arrive at this number, add the numbers in the last column for the two components. These are given in bytes, so you have to divide them by 1024 to arrive at the corresponding numbers in kB per event.
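The arithmetic in the note above can be scripted. The following is a minimal sketch that assumes each relevant line of the edmEventSize output has exactly the three whitespace-separated fields shown in the excerpt (branch name with a trailing dot, uncompressed bytes, compressed bytes). The helper name `compressed_kb` is hypothetical.

```python
def compressed_kb(lines, wanted):
    """Sum the last column (compressed bytes) of the wanted branches and convert to kB."""
    total_bytes = 0.0
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip headers or anything that does not look like a branch line
        branch, _uncompressed, compressed = fields
        if branch.rstrip(".") in wanted:
            total_bytes += float(compressed)
    return total_bytes / 1024.0


# The two lines quoted above from the default pat::Tuple.
output = [
    "patJets_cleanPatJets__PAT. 21036.6 4104.06",
    "CaloTowers_selectedPatJets_caloTowers_PAT. 15050.1 5910.26",
]
wanted = {"patJets_cleanPatJets__PAT", "CaloTowers_selectedPatJets_caloTowers_PAT"}

size = compressed_kb(output, wanted)
print("%.1f kB per event" % size)  # prints "9.8 kB per event", i.e. about 10kB
```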
Now check out the PAT example analyzers:
addpkg PhysicsTools/PatExamples V00-05-29
Edit PhysicsTools/PatExamples/plugins/PatBasicAnalyzer.cc. Uncomment the code that plots the number of CaloTowers per pat::Jet:
// uncomment the following line to fill the jetTowers_ histogram
jetTowers_->Fill(jet->getCaloConstituents().size());
// uncomment the following line to book the jetTowers_ histogram
jetTowers_ = fs->make<TH1F>("jetTowers", "towers per jet", 90, 0, 90);
Compile and run.
Note: You might have to adapt the name of the input file in analyzePatBasics_cfg.py before starting cmsRun.
scram b -j4
rehash
cmsRun PhysicsTools/PatExamples/test/analyzePatBasics_cfg.py
root -l analyzePatBasics.root
root [0] TBrowser b
- double-click on analyzePatBasics.root in ROOT files on the left sidebar;
- double-click on analyzeBasicPat;1;
- double-click on jetTowers;1.
PAT objects without embedding and without keeping extra branches of the event
Let's see what happens when we switch off the embedding of calo towers into the pat::Jets. To do so, open the file patTuple_standard_cfg.py in the ConfigEditor again. In order to modify it, click on Edit using ConfigEditor. Search and select the module "patJets" and set:
embedCaloTowers = False
Save as "user_noembed_cfg.py" resulting in the following configuration:
### Generated by ConfigEditor ###
import sys
import os.path
sys.path.append(os.path.abspath(os.path.expandvars(os.path.join('$CMSSW_BASE','src/PhysicsTools/PatAlgos/test'))))
sys.path.append(os.path.abspath(os.path.expandvars(os.path.join('$CMSSW_RELEASE_BASE','src/PhysicsTools/PatAlgos/test'))))
### --------------------------- ###
from patTuple_standard_cfg import *
### Generated by ConfigEditor ###
if hasattr(process,'resetHistory'): process.resetHistory()
### --------------------------- ###
process.patJets.embedCaloTowers=False
Run and measure the event content sizes.
cmsRun user_noembed_cfg.py
edmEventSize -v patTuple.root
patJets_cleanPatJets__PAT. 15480.6 3599.76
CaloTowers_selectedPatJets_caloTowers_PAT. 426.82 289.85
As you can see, the size of the pat::Jets with calo towers has decreased from 10kB per event with embedding of CaloTowers to about 3.8kB per event without embedded CaloTowers. The CaloTowers_selectedPatJets_caloTowers_PAT collection is actually empty now and can be dropped from the event content to save another 0.3kB per event.
Note: Again, add up the numbers in the last column to arrive at this estimate and divide the sum by 1024.
Now execute our EDAnalyzer for the CaloTowers again:
cmsRun PhysicsTools/PatExamples/test/analyzePatBasics_cfg.py
The code will crash! The reason is that the CaloTowers are neither embedded in the pat::Jets nor kept in the event content. Trying to access them anyway leads to the crash.
Keep extra branches of the event
Another option to make the CaloTowers accessible from the pat::Jets is to keep them in the event content. To do so, open the file patTuple_standard_cfg.py in the ConfigEditor again. In order to modify it, click on Edit using ConfigEditor. Search and select the module out. Edit the parameter outputCommands and add the following entry. You may click on the "pencil" symbol to open an editor window.
'keep CaloTowers*_towerMaker_*_*'
Save this configuration as "user_keep_cfg.py". It will look like this:
### Generated by ConfigEditor ###
import sys
import os.path
sys.path.append(os.path.abspath(os.path.expandvars(os.path.join('$CMSSW_BASE','src/PhysicsTools/PatAlgos/test'))))
sys.path.append(os.path.abspath(os.path.expandvars(os.path.join('$CMSSW_RELEASE_BASE','src/PhysicsTools/PatAlgos/test'))))
### --------------------------- ###
from patTuple_standard_cfg import *
### Generated by ConfigEditor ###
if hasattr(process,'resetHistory'): process.resetHistory()
### --------------------------- ###
process.out.outputCommands = cms.untracked.vstring('drop *',
'keep *_cleanPatPhotons*_*_*',
'keep *_cleanPatElectrons*_*_*',
'keep *_cleanPatMuons*_*_*',
'keep *_cleanPatTaus*_*_*',
'keep *_cleanPatJets*_*_*',
'keep *_patMETs*_*_*',
'keep *_cleanPatHemispheres*_*_*',
'keep *_cleanPatPFParticles*_*_*',
'keep *_cleanPatTrackCands*_*_*',
'keep CaloTowers*_towerMaker_*_*')
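To see which branches a list of keep/drop commands like the one above would select, the matching can be imitated in plain Python. This is a rough sketch and not the EDM implementation. It assumes that branch names follow the type_module_instance_process pattern seen in the edmEventSize output, that `*` acts as a glob-style wildcard, and that the last matching command decides. The helper `is_kept`, its default for branches that match no command, and the third branch name in the example are hypothetical.

```python
from fnmatch import fnmatchcase


def is_kept(branch, commands):
    """Apply keep/drop commands in order; the last matching command decides."""
    kept = False  # assumption for this sketch: a branch matching no command is not kept
    for command in commands:
        action, pattern = command.split(None, 1)
        if fnmatchcase(branch, pattern):
            kept = (action == "keep")
    return kept


commands = [
    "drop *",
    "keep *_cleanPatJets*_*_*",
    "keep CaloTowers*_towerMaker_*_*",
]

for branch in ["CaloTowersSorted_towerMaker__RECO",
               "patJets_cleanPatJets__PAT",
               "recoTracks_generalTracks__RECO"]:  # last name is a hypothetical example
    print(branch, "->", "kept" if is_kept(branch, commands) else "dropped")
```

Because `drop *` comes first, only branches picked up again by a later keep command survive, which is why the new entry is appended at the end of the list.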
Run and show the event content and sizes:
cmsRun user_keep_cfg.py
edmEventSize -v patTuple.root
CaloTowersSorted_towerMaker__RECO. 94692.2 18715.4
patJets_cleanPatJets__PAT. 21036.6 4104.06
As you can see, the complete set of calo towers is now kept in the event, adding about 18kB per event.
Note: You can see this number (in bytes) in the last column of the first row. This is disc space in addition to the size of the jet collection of roughly 4kB per event.
Note: The pat::Jet collection plus the whole CaloTower collection (about 22kB per event) is bigger than the pat::Jet collection with embedded calo towers (10kB per event), since only towers clustered in the jets are stored in that case.
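The three configurations tried in this section can be compared side by side. The sketch below takes the compressed byte counts quoted in the edmEventSize excerpts of this section and applies the same rule as before (sum the last column, divide by 1024). The option labels are informal names chosen for this illustration.

```python
BYTES_PER_KB = 1024.0

# Compressed sizes in bytes (last column of the edmEventSize excerpts in this section).
compressed_bytes = {
    "embedding (PAT default)": [4104.06, 5910.26],     # patJets + towers clustered in jets
    "no embedding, nothing kept": [3599.76, 289.85],   # patJets + empty tower collection
    "all CaloTowers kept": [18715.4, 4104.06],         # full tower collection + patJets
}

sizes_kb = {name: sum(values) / BYTES_PER_KB for name, values in compressed_bytes.items()}

# Print the options from smallest to largest footprint.
for name, kb in sorted(sizes_kb.items(), key=lambda item: item[1]):
    print("%-30s %5.1f kB per event" % (name, kb))
```

Keep in mind that the smallest option is the one in which the analyzer crashed, because the towers were no longer accessible from the jets.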
Now execute our EDAnalyzer for the CaloTowers again:
cmsRun PhysicsTools/PatExamples/test/analyzePatBasics_cfg.py
root -l analyzePatBasics.root
root [0] TBrowser b
- double-click on analyzePatBasics.root in ROOT files on the left sidebar;
- double-click on analyzeBasicPat;1;
- double-click on jetTowers;1.
This produces the jetTowers histogram with the number of towers per jet.
Keeping the CaloTowers in the event content allows you to access them from the pat::Jets.
Note: All CaloTowers, including those that were not clustered into any jet, are kept. Therefore this option may not be the best for saving disk space if you are only interested in CaloTowers clustered in jets. In that case consider the PAT default option of embedding.
Note: If you are using an FWLiteAnalyzer, keeping the collection does not always work. You should embed instead.
Exercises
Before leaving this page try to do the following exercises:
Exercise 6 a): Disable all additional information for pat::Jets and compare their size to reco::Jets.

Can you reduce the pat::Jet size to the original reco::CaloJet size?
What is the size of the pat::Jet collection after removing embedding and additional information (e.g. addGenJetMatch=False, ...)?
You may proceed as follows:
Switch off all information added to the pat::Jet:
process.patJets.addTagInfos = False
process.patJets.addJetCharge = False
process.patJets.addGenJetMatch = False
process.patJets.embedGenJetMatch = False
process.patJets.addAssociatedTracks = False
process.patJets.addDiscriminators = False
process.patJets.embedGenPartonMatch = False
process.patJets.addGenPartonMatch = False
process.patJets.embedPFCandidates = False
process.patJets.addJetCorrFactors = False
process.patJets.addBTagInfo = False
process.patJets.embedCaloTowers = False
process.patJets.addJetID = False
Add reco::CaloJets to the output event content:
'keep recoCaloJets_ak5CaloJets_*_*'
Calculate event size:
cmsRun user_no_embedding.py
edmEventSize -v patTuple.root
Solution:
The result is:
patJets_cleanPatJets__PAT. 8276.94 2086.14
CaloTowers_selectedPatJets_caloTowers_PAT. 426.82 298.95
recoGenJets_selectedPatJets_genJets_PAT. 311.15 217.34
recoBasedTagInfosOwned_selectedPatJets_tagInfos_PAT. 118.76 82.92
recoCaloJets_ak5CaloJets__RECO. 3706.14 951.01
By switching off all additional information stored in the pat::Jet, its size is reduced to about 2.0kB per event, which comes close to, but is still about twice, the size of the original reco::CaloJets of roughly 1kB per event.
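The list of flags switched off in this exercise can also be set in a loop, which makes it harder to forget one. The sketch below runs outside CMSSW by using a `SimpleNamespace` as a hypothetical stand-in for `process.patJets`. The helper `disable_all` is likewise hypothetical, and whether the same loop behaves identically on the real module is an assumption to verify in your release.

```python
from types import SimpleNamespace

# Parameter names as listed in the exercise above.
FLAGS_TO_DISABLE = [
    "addTagInfos", "addJetCharge", "addGenJetMatch", "embedGenJetMatch",
    "addAssociatedTracks", "addDiscriminators", "embedGenPartonMatch",
    "addGenPartonMatch", "embedPFCandidates", "addJetCorrFactors",
    "addBTagInfo", "embedCaloTowers", "addJetID",
]


def disable_all(module, flags=FLAGS_TO_DISABLE):
    """Set every listed flag to False, refusing names the module does not have."""
    for name in flags:
        if not hasattr(module, name):
            raise AttributeError("unknown parameter: " + name)
        setattr(module, name, False)


# Stand-in for process.patJets with every flag initially switched on.
pat_jets = SimpleNamespace(**{name: True for name in FLAGS_TO_DISABLE})
disable_all(pat_jets)
print(all(getattr(pat_jets, name) is False for name in FLAGS_TO_DISABLE))
```

The `hasattr` check is a deliberate choice: a misspelled flag name stops the loop with an error instead of silently adding a new attribute.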
Exercise 6 b): Embed tracks (in the tracker, not in the muon system) into the pat::Muons and compare for each muon the pt of the track and the muon itself.
What fraction of muons have a track in the tracker?
You may proceed as follows:
Embed the tracks into the muons:
process.patMuons.embedTrack = True
Create a patTuple.root.
Edit PhysicsTools/PatExamples/plugins/PatBasicAnalyzer.cc and add a plot with the ratio of track pT and muon pT for each muon that has a track:
// ..... inside analyze(...) .....
for(edm::View<pat::Muon>::const_iterator muon=muons->begin(); muon!=muons->end(); ++muon){
  // fill 0 for muons without a tracker track, otherwise the ratio of track pt to muon pt
  histContainer_["muonstrackptfraction"]->Fill(muon->track().isNull() ? 0 : muon->track()->pt()/muon->pt());
}
// ..... inside beginJob() .....
histContainer_["muonstrackptfraction"]=fs->make<TH1F>("muonstrackptfraction", "muon track pt fraction", 101, 0, 2);
Compile and run:
scram b -j4
cmsRun PhysicsTools/PatExamples/test/analyzePatBasics_cfg.py
Look at the plot:
root -l analyzePatBasics.root
root [0] TBrowser b
- double-click on analyzePatBasics.root in ROOT files on the left sidebar;
- double-click on analyzeBasicPat;1;
- double-click on muonstrackptfraction;1.
Solution:
In CMSSW>=4_1_X you will see something like this:
The plot can be explained as follows. Two different kinds of muons are visible:
- Muons built only from tracks in the tracker peak at 1
- Muons built only from tracks in the muon system peak at 0
In CMSSW_3_8_X you will see this:
From this plot you can see that the muon momentum is equal to the momentum of the tracker track that it contains.
In CMSSW_3_6_X you would see the following:
The plot can be explained as follows. Three different kinds of muons are visible:
- Muons built only from tracks in the tracker peak at 1
- Muons built only from tracks in the muon system peak at 0
- Muons built from tracks combined in the tracker and muon system show up elsewhere
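The reading of the histogram described above, and the question of what fraction of muons have a tracker track, can be mimicked with a few lines of Python. The ratio values below are made up for illustration, and the `classify` helper and its tolerance are hypothetical. Only the convention that a value of 0 marks a muon without a tracker track comes from the analyzer code in this exercise.

```python
def classify(ratio, tolerance=0.01):
    """Sort a track-pt / muon-pt ratio into the categories discussed above."""
    if ratio == 0:
        return "no tracker track"        # the analyzer fills 0 when track() is null
    if abs(ratio - 1.0) <= tolerance:
        return "tracker-track momentum"  # muon pt equals the pt of its tracker track
    return "combined"                    # tracker and muon system combined


ratios = [0.0, 1.0, 1.0, 0.97, 1.0, 0.0, 1.04, 1.0]  # hypothetical values

counts = {}
for ratio in ratios:
    label = classify(ratio)
    counts[label] = counts.get(label, 0) + 1

with_track = sum(1 for ratio in ratios if ratio != 0) / float(len(ratios))
print(counts)
print("fraction of muons with a tracker track: %.2f" % with_track)
```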
Note: In case of problems don't hesitate to contact SWGuidePAT#Support. Having successfully finished Exercise 6, you might want to proceed to Exercise 7 of the SWGuidePAT to learn more about the PAT support of object disambiguation across different object collections. For an overview you can go back to the WorkBookPATTutorial entry page.