PAT Event Size Management

Introduction

The PAT makes it easy to skim events, both by filtering out events and by trimming down the event content, and thereby to reduce the size of the samples an analysis runs on. This page gives an overview of how to configure the PAT to save only what a particular analysis needs.

Please note that this page has been written with CMSSW version 2_1_X in mind. All the ways described here to reduce the event size can also be used in previous versions, but depending on the version the per-event overhead from EDM metadata can be sizeable and make detailed tweaking superfluous (see also the section on event size estimates). In versions 2_2_X and 3_0_X the metadata overhead should no longer be a problem.

Controlling the size of PAT object collections

The most straightforward way to reduce the event size is to apply cuts on the PAT objects. After the PAT Layer 1 object production, a layer of selectors is already provided that lets you make selections on the produced objects. The config files for this step are all included from here.

The selection of objects in these configuration files is done using string-based selectors. By default no selector cuts any objects away, so you effectively work with exactly the same objects as in the RECO/AOD. To introduce selections in a job making Layer 1 objects, it suffices to add the corresponding "replaces" to your config, for example:

   process.selectedLayer1Jets.cut = 'et > 30. & abs(eta) < 2.5'

Especially in the case of jets, such preselection cuts can reduce your event size considerably, since typically several tens of jets are present on the standard RECO/AOD.
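
When several collections need cuts, the same assignment can be repeated per selector module. The following is a minimal sketch, assuming each selector exposes a `cut` string as in the line above. The helper name `apply_cuts` is hypothetical, the module label `selectedLayer1Muons` is an assumption to be checked against the selector config files of your release, and the cut values are placeholders rather than recommendations.

```python
def apply_cuts(process, cuts):
    """Set the `cut` string on each named selector module of `process`.

    `cuts` maps a module label to a string-based selection. A label that is
    missing on `process` raises AttributeError, so typos are not silently
    ignored.
    """
    for label, cut in cuts.items():
        selector = getattr(process, label)  # raises if the label is wrong
        selector.cut = cut


# Placeholder cut values; only selectedLayer1Jets is named in the text above,
# the muon selector label is an assumption.
example_cuts = {
    "selectedLayer1Jets": "et > 30. & abs(eta) < 2.5",
    "selectedLayer1Muons": "pt > 10.",
}
```

In a config file this would be called as `apply_cuts(process, example_cuts)` once the PAT sequences have been loaded into `process`.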

Controlling the size of the actual PAT objects

As a second step it is possible to reduce the size of the PAT objects themselves. These objects encapsulate RECO/AOD content, which defines the minimal size and cannot be slimmed down further without redefining the data formats. The extra information the PAT adds to these objects in the Layer-1 production step can however be trimmed down, depending on which information the analysis actually needs to keep.

All the configuration flags that can be used to switch on and off PAT event content are found in the cfi python config files here. Some examples of things that can be switched off:

  • MC match: addGenMatch, addGenPartonMatch, addGenJetMatch, addGenMET, ... are switches you can set to False if you don't want to store the matching generated particle
  • trigger matching: addTrigMatch controls the storage of matching trigger objects
  • object efficiencies: addEfficiencies is not used yet, so it is fine to leave it set to False
  • resolutions: addResolutions lets you switch off storing of MC resolutions
  • isolation: the isolation and isoDeposits PSets let you add respectively simple isolation values or full-blown IsoDeposit objects from which isolation can be recomputed with different settings. Depending on your analysis you can leave either of the PSets empty. Not storing the IsoDeposits will have the biggest effect.
  • other specific space-saving switches: addElectronID, addPhotonID, addJetCorrFactors, addBTagInfo (all b-tag info), addDiscriminators (store only discriminator values), addTagInfoRefs (store references to the objects needed to recompute the b-tagging), addAssociatedTracks, addJetCharge, getJetMCFlavour.
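
Since not every flag exists for every object type, it can help to switch flags off only where they are defined. The following is a minimal sketch under that assumption. The helper `switch_off` is hypothetical, and it relies only on the module exposing each flag as a plain attribute.

```python
def switch_off(module, flags):
    """Set each flag in `flags` to False on `module` if the module has it.

    Returns the list of flags that were found and switched off, so that
    flags not defined for this object type can be spotted.
    """
    changed = []
    for flag in flags:
        if hasattr(module, flag):
            setattr(module, flag, False)
            changed.append(flag)
    return changed


# Flags taken from the list above that relate to MC and trigger matching.
MC_AND_TRIGGER_FLAGS = ["addGenMatch", "addGenJetMatch", "addTrigMatch"]
```

A call such as `switch_off(process.allLayer1Jets, MC_AND_TRIGGER_FLAGS)` would then disable the matching content for jets. The producer label `allLayer1Jets` is an assumption here, so check the cfi files for the labels used in your release.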

Embedding related content inside PAT objects

The RECO/AOD objects contain several Refs giving access to related information, pointing to parts of collections in other branches of the file. Sometimes considerable gains in sample size can be obtained by retaining only the referenced objects and dropping the overall collection being referred to. CaloJets, for example, contain Refs to the CaloTowers they were constructed from. It can be beneficial to keep only the CaloTowers belonging to your selected jets, rather than the full collection.

In the PAT there is a feature called "embedding" which provides the machinery to implement the above-mentioned use case in a way transparent to the user. Information referred to by PAT objects can be embedded inside the object, while keeping exactly the same interface to the user. When accessing embedded CaloTowers associated to a jet, exactly the same methods can be used as before, with the technical difference that the returned edm::Ref is a so-called transient Ref.

The advantage is clear: by embedding only the necessary information, the redundant information can be dropped from the event. The configuration parameters to use this feature are all defined in the *cfi.py files here: embedTrack, embedCaloTowers, embedSuperCluster, embedCombinedMuon,...

One should take care however: using this feature is useless without proper tuning of the "drop" statements in the PAT Layer-1 production. Also, when saving several jet collections, it may be beneficial to keep the full CaloTower collection instead of several copies of the same embedded CaloTowers. Ideal settings are analysis-specific, and some trial and error may be needed.
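
The trade-off between embedding and keeping the full collection can be illustrated with simple arithmetic. The numbers below are hypothetical placeholders, not measurements, and the sketch ignores ROOT compression and the size of the Refs themselves.

```python
def embedded_cost_bytes(n_collections, towers_per_collection, bytes_per_tower):
    """Size per event when every jet collection embeds its own tower copies."""
    return n_collections * towers_per_collection * bytes_per_tower


def full_collection_cost_bytes(n_towers_total, bytes_per_tower):
    """Size per event when the full CaloTower collection is kept once."""
    return n_towers_total * bytes_per_tower


# Hypothetical numbers for illustration only.
BYTES_PER_TOWER = 50
TOWERS_TOTAL = 400
TOWERS_IN_SELECTED_JETS = 150

full = full_collection_cost_bytes(TOWERS_TOTAL, BYTES_PER_TOWER)
for n in (1, 2, 3):
    embedded = embedded_cost_bytes(n, TOWERS_IN_SELECTED_JETS, BYTES_PER_TOWER)
    choice = "embed" if embedded < full else "keep full collection"
    print(n, "jet collection(s):", embedded, "vs", full, "bytes ->", choice)
```

With these made-up inputs, embedding is smaller for one or two jet collections, while for three collections the single full collection is smaller. The crossover point for a real analysis has to be measured.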

Pruning the generated particles

The genParticles branch contains the full particle listing as produced by the MC generator. This list is large (of order 20 kB) and can be reduced by pruning unwanted details from it. For this the tool genParticlePruner has been written (not PAT specific), which you can find here.
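
The idea of pruning can be illustrated with a toy example in plain Python. This is not the interface of genParticlePruner. The selection criterion, the mini event record and the helper names are all hypothetical, and mother/daughter links are ignored.

```python
def prune(particles, keep):
    """Return only the particles for which the predicate `keep` is true."""
    return [p for p in particles if keep(p)]


def is_hard_process_or_lepton(p):
    """Hypothetical criterion: status-3 entries plus charged leptons."""
    return p["status"] == 3 or abs(p["pdgId"]) in (11, 13, 15)


# Made-up mini event record.
event = [
    {"pdgId": 6, "status": 3},     # top quark, status 3: kept
    {"pdgId": -6, "status": 3},    # anti-top quark, status 3: kept
    {"pdgId": 13, "status": 1},    # final-state muon: kept
    {"pdgId": 211, "status": 1},   # final-state pion: dropped
    {"pdgId": 22, "status": 1},    # final-state photon: dropped
]

pruned = prune(event, is_hard_process_or_lepton)
print(len(event), "->", len(pruned), "particles")
```

Which particles are worth keeping depends entirely on the analysis, so the criterion above should be read only as a placeholder.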

Controlling the amount of EDM metadata that is saved in the files

In CMSSW 2_1_X or before, EDM files have an overhead of 10-20 kB/event coming from metadata.

Starting from 2_2_X, options to control this metadata size are provided; in particular, for 2_2_X releases, you can try adding this parameter to any PoolOutputModule to see if it reduces the overall file size.

     dropMetaDataForDroppedData = cms.untracked.bool(True)

In CMSSW_3_0_X the metadata information should already be much smaller without turning on this feature.

The parameters for the PoolOutputModule are documented at SWGuideEDMParametersForModules.
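
As a configuration fragment, the dropMetaDataForDroppedData parameter shown above could sit in an output module as sketched below. This assumes a CMSSW 2_2_X environment. The process name, the module label `out`, the file name and the keep/drop patterns are illustrative placeholders, not the PAT defaults.

```python
import FWCore.ParameterSet.Config as cms

process = cms.Process("PAT")  # placeholder process name

process.out = cms.OutputModule(
    "PoolOutputModule",
    fileName=cms.untracked.string("patTuple.root"),  # placeholder file name
    # Placeholder event content: drop everything, keep one PAT collection.
    outputCommands=cms.untracked.vstring(
        "drop *",
        "keep *_selectedLayer1Jets_*_*",
    ),
    # Ask the output module to also drop metadata of dropped branches.
    dropMetaDataForDroppedData=cms.untracked.bool(True),
)

process.outpath = cms.EndPath(process.out)
```

Comparing the file size with and without the parameter on the same input is the simplest way to see whether it helps for a given release.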

PAT event size estimates with a few typical settings

The size of branches has been estimated with a few typical PAT settings:

  • minimal: all PAT additions on top of standard RECO objects are switched off. This is not really a useful setting for physics! It shows what the PAT data structures in their empty form contribute on top of the RECO objects.
  • default: this is the PAT default sequence. Be warned that this default is not meant as a fixed endpoint on which you must do your analysis. These are just reasonable settings from which users can start tweaking to their own needs.
  • allembed: with this setting all external information, like calotowers, superclusters, tracks, genmatches etc, that can be embedded inside the PAT objects, actually are, making them "heavy" but also independent for analysis.

The links attached to the above cases give the breakdown of the branch sizes and all the leaf sizes for the PAT event content in these scenarios. Also given is the current information on the provenance. These results were obtained on 2_1_9 RelValTTbar, with the 10/10/2008 head of the PAT on top of 2_1_9.

*Caveat*: the obtained results are not the last word on event size, but should rather be taken as a moving target. Especially for the metadata, work has been ongoing to bring down the size drastically. In versions 2_2_X and 3_0_X this should not be an issue anymore.

Measuring the event size

There are two tools you can use to measure the disk size of your PAT-tuples beyond the basic ls -lh shell command:

  • edmEventSize, a generic CMSSW tool that gives you a summary of the event size used by each object collection in an EDM file
  • diskSize.pl, a PAT script written by Giovanni Petrucciani that can produce a more detailed analysis of the object collections (e.g. computing the average number of items per event, the size per item, and the size of each data member of the object). You can find an updated version of this script here (the .txt extension has been added by the TWiki admin). Note that this version of the script requires ROOT 6.

Due to ROOT compression, the event size depends on the number of events in a file. It should therefore be measured only on sufficiently large samples (at least one thousand events, the more the better).
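
For a first rough number before using either tool, the average size per event can be computed from the file size on disk. The following is a minimal sketch; the helper names are hypothetical, and the number of events has to be known from elsewhere. The default threshold of 1000 events follows the recommendation above.

```python
import os


def kb_per_event(file_size_bytes, n_events, min_events=1000):
    """Average on-disk size per event in kB (1 kB = 1000 bytes here).

    Raises ValueError for samples smaller than `min_events`, since the
    result is then not representative because of compression effects.
    """
    if n_events < min_events:
        raise ValueError(
            "need at least %d events, got %d" % (min_events, n_events)
        )
    return file_size_bytes / 1000.0 / n_events


def kb_per_event_from_file(path, n_events):
    """Same as kb_per_event, reading the file size from disk."""
    return kb_per_event(os.path.getsize(path), n_events)
```

For example, `kb_per_event(50000000, 2000)` gives 25.0, i.e. a 50 MB file with 2000 events corresponds to 25 kB per event. This number includes all per-file overhead, so it does not replace the per-collection breakdown of the tools above.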

Since the tool edmEventSize is documented elsewhere, here we will describe only how to use diskSize.pl.

In order to run the script you need to be working from your CMSSW area, and you must have FWLite loading automatically when starting ROOT (instructions). You can then use it as

   diskSize.pl  somefile.root > somefile.html

In order to make the html file readable, you should also grab these three additional files from this TWiki page, and save them together with the html in your favorite public_html folder.

An example output can be found here.

Review status

Reviewer/Editor and Date Comments
StevenLowette - 19 Sep 2008 created page
GiovanniPetrucciani - 17 Dec 2008 adding tools and provenance

Responsible: Steven Lowette
Last reviewed by: None

Topic attachments
  • blue-dot.gif (GIF, 0.8 K, 2016-03-21 16:42, RogerWolf)
  • diskSize.pl.txt (text, 11.7 K, 2016-03-22 18:12, RogerWolf)
  • patsize.css (CSS, 0.3 K, 2016-03-21 17:04, RogerWolf)
  • red-dot.gif (GIF, 0.8 K, 2016-03-21 16:42, RogerWolf)
Topic revision: r11 - 2016-03-22 - RogerWolf
 