Heppy : a mini framework for HEP event processing in python

Prerequisites

You should be familiar with python to follow this tutorial. I strongly advise you to carefully follow the python tutorial if you have not done so yet. It will take a few hours now, but will save you many days in the future.

Why python? In short:

  • fast learning curve: python is one of the easiest languages to learn
  • high productivity: coding in python is about 10 times faster than in C++
  • high flexibility: code can be easily reused, refactored, extended.
  • dynamic typing (similar to C++ template features, without the pain in the neck): if you do an analysis for e.g. the muon channel, it is going to work for the electron channel with only minor modifications related to lepton identification. If your analysis reads a certain kind of particle-like objects, it will probably work on other kinds of particle-like objects.
  • very large and easy-to-use standard library

A short description of the analysis system

Design principles

The goal of the ntuplizer system is to produce a flat tree for each of the datasets (also called "components") used in the analysis. Any operation requiring a manual loop over the events can be done while producing the flat tree, so that the resulting trees can be used with simple TTree.Draw or TTree.Project commands.

For example, the ntuplizer allows you to:

  • read events from an EDM root file, e.g. mini AODs.
  • create python physics objects wrapping the C++ objects from the EDM root file. These objects have the exact same interface as the C++ objects, and can be extended with more information. For example, you could write your own muon ID function for your python Muon object, or add attributes to your python Muons along the processing flow, like the 4-momentum of the closest jet or the closest generated muon.
  • create new python objects, e.g. a VBF object to hold VBF quantities.
  • compute event-by-event weights
  • select a trigger path, and match to the corresponding trigger objects
  • define and write simple flat trees
It is up to you to define what you want to do, possibly re-using existing code from other analyses or writing your own.

An analysis typically consists of several tens of samples, or "components": data samples, standard model backgrounds, and signals. The ntuplizer is built in such a way that it takes one command to either:

  • run interactively on a single component
  • run several processes in parallel on your multiprocessor machine
  • run hundreds of processes as separate jobs on LSF, the CERN batch cluster.

If you decide to run several processes, you can split a single component into as many chunks as there are input ROOT files for this component. For example, you could run in parallel:

  • 6 chunks from the DYJet component, using 6 processors of your local machine, assuming you have more than 6 input DYJet ROOT files.
  • 200 chunks from the DYJet component, 300 from your 5 data components altogether, and 300 jobs from all the remaining components (e.g. di-boson, TTJets, ...) on LSF.

The ntuplizer is based on python, pyroot, and FWLite. The analysis could be a simple python macro based on these tools. Instead, it was decided to keep the design of typical full frameworks for high-energy physics (e.g. CMS, ATLAS, FCC), and to implement it in python. This design boils down to:

  • a python configuration system, similar to the one we use in HEP full frameworks like CMSSW.
  • a Looper which gives access to the EDM events and runs a sequence of analyzers on each event.
  • a common python event, created at the beginning of the processing of each EDM event, and read/modified by the analyzers.
The python event allows you to build the information you want into your event, and allows the analyzers to communicate. At the end of the processing of a given EDM event, information from the python event can be filled into a flat tree using a specific kind of analyzer, such as the ZJetsTreeAnalyzer used later in this tutorial.
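For illustration, a minimal analyzer reading and extending the python event could look like the following sketch (the class name and event attributes are purely illustrative; the base class is the one in analyzer.py of PhysicsTools/HeppyCore):

from PhysicsTools.HeppyCore.framework.analyzer import Analyzer

class LeadingJetAnalyzer(Analyzer):
    '''Illustrative analyzer: attach the leading jet to the python event.'''
    def process(self, event):
        # read what a previous analyzer stored on the event ...
        jets = getattr(event, 'jets', [])
        # ... and attach new information for downstream analyzers
        event.leading_jet = jets[0] if jets else None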

Package list

PhysicsTools/HeppyCore is the core package. It is also usable outside of CMSSW, and it contains the following python packages:

  • framework : Core modules: python configuration system, the looper, the python event, etc.
  • analyzers : Very simple generic analyzers.
  • statistics : Modules for counting and averaging, histogramming, tree production.
  • utils : Miscellaneous utilities, like deltaR matching tools.
It also contains a scripts directory.

PhysicsTools/Heppy is the CMS-specific package; it contains the following packages:

  • analyzers : CMS analyzers
    • core : Core analyzers, usable in any analysis (e.g. JSONAnalyzer)
    • objects : physics object analyzers (e.g. JetAnalyzer)
    • eventtopology : analyzers computing global event topology variables
    • examples : example analyzers, to be used in tutorials.
  • physicsobjects : python physics objects for CMS.
  • physicsutils : utilities for physics.
  • utils : other utilities

Most of the code is documented with docstrings. To get more information on a given object do, for example:

python
from PhysicsTools.HeppyCore.framework.looper import Looper
help(Looper)

Installation

Instructions

These instructions are based on CMSSW_8_0_11. If you want to use another release, see the next section.

Log in to lxplus on SLC6.

Move to a directory in your AFS account, for example:

mkdir HeppyTutorial 
cd HeppyTutorial

Set up a local CMSSW area:

scram project CMSSW CMSSW_8_0_11
cd CMSSW_8_0_11/src
cmsenv
git cms-init

Install the heppy packages:

git cms-merge-topic cbernet:heppy_8_0_11

Compile:

scram b -j 8

Available Git Branches for the various CMSSW releases

Release              Corresponding heppy branch    Comments about the CMSSW release
CMSSW_8_0_11         cbernet:heppy_8_0_11          For 2016 data, production and development version of Heppy
CMSSW_7_6_3_patch2   cbernet:heppy_7_6_3_patch2    For 2015 data, old production branch

If you need heppy for another version of CMSSW, just ask Colin.

Exercises

TIP All exercises are done in the following directory:

cd PhysicsTools/Heppy/test

1- Understanding the configuration file

Have a detailed look at the configuration file, simple_example_cfg.py.

Load it in python:

ipython
from simple_example_cfg import *

Get info on one of the analyzers:

print all_jets

Get help on this object:

help(all_jets)

TIP All objects created in this cfg file are just configuration objects. These configuration objects will be passed to the actual analyzers, which contain your analysis code.

TIP In the future, when you use this event processing system in your analysis, it can save time to make sure that all ingredients (components, analyzers) are defined correctly by loading your configuration in python before even trying to run.

2- Finding existing analysis code

Open simple_example_cfg.py. The configuration fragments for the analyzers look like:

all_jets = cfg.Analyzer(
    SimpleJetAnalyzer,
    'all_jets',
    njets = 4,
    filter_func = lambda x : True
    )
The first argument is a class object coming from SimpleJetAnalyzer.py. The framework will use this class object to create an instance of this class.

The second argument, 'all_jets', is an instance label. This argument is optional, and can be used in case several analyzers of the same class are requested. Here, there is only one instance of the SimpleJetAnalyzer class, and this argument could be omitted.

The third argument is a simple integer, and the last one a function object. This function object will be used to select jets in SimpleJetAnalyzer.py.
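To see how such parameters are used, note that a heppy analyzer reads its configuration through self.cfg_ana. The following is only a sketch with a hypothetical class name and event attributes; the actual code in SimpleJetAnalyzer.py may differ:

from PhysicsTools.HeppyCore.framework.analyzer import Analyzer

class MyJetSelector(Analyzer):
    '''Hypothetical analyzer showing how cfg.Analyzer parameters are accessed.'''
    def process(self, event):
        # parameters defined in the cfg.Analyzer block appear as self.cfg_ana.<name>
        selected = [jet for jet in event.jets if self.cfg_ana.filter_func(jet)]
        event.my_jets = selected[:self.cfg_ana.njets]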

Have a look at the SimpleJetAnalyzer.py module, and study the code.

Then study the code of the base Analyzer class for CMS in Analyzer.py.

Finally, check out the top Analyzer base class in analyzer.py. This class is not specific to CMS and is used in other experiments.

3- Running interactively on one component

Run:

heppy Out simple_example_cfg.py -N 5000

You should see a healthy printout ending with:

number of events processed: 5000

TIP In this particular case, processing speed is limited by disk access.

In the Out output directory, you will find a component directory, test_component. Investigate the contents of this component directory and of all the directories within.

Fire up root (here we choose to use ipython + pyroot), and check the main output tree:

ipython 
from ROOT import TFile 
f = TFile("Out/test_component/PhysicsTools.Heppy.analyzers.examples.ZJetsTreeAnalyzer.ZJetsTreeAnalyzer_5/tree.root")
f.ls()
t = f.Get('tree')
t.Print()
t.Draw('jet1_pt')

4- Multiprocessing on a single machine

Edit simple_example_cfg.py and set:

multi_thread = True

Follow the code to understand what this is doing:

if multi_thread:
    inputSample.splitFactor = len(inputSample.files)
For a component, splitFactor determines the number of processes that will be used to process it. Here, we use one process per input file.

As usual, load the configuration script in python, and print the config object. Check the value of splitFactor in the printout.
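For example:

ipython
from simple_example_cfg import *
print inputSample.splitFactor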

Then run again:

heppy Multi simple_example_cfg.py

heppy should now process the input files in parallel, in two processes. In some cases, eos serves some of the input files with a delay, and you may get the impression that the input files are processed sequentially.

In the Multi output directory, you will find several chunks. Each of these chunks corresponds to one of the processes you have run. We're going to add everything up:

cd Multi
heppy_check.py * 
heppy_hadd.py .

The first command checks that all chunks terminated correctly. The second command adds up the root files (with hadd), the cut-flow counters, and the averages. The result is put in Multi/test_component/.

TIP To do multiprocessing, you can also define several components corresponding to the samples you need to process. Each of these components can have its own split factor.
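For example, a sketch of several components with their own split factors (the component names and file lists are placeholders, and the cfg import is the one conventionally used in heppy configuration files):

import PhysicsTools.HeppyCore.framework.config as cfg

# hypothetical components: replace names and file lists with your own samples
dy_jets = cfg.Component('DYJets', files=['dyjets_1.root', 'dyjets_2.root'])
dy_jets.splitFactor = len(dy_jets.files)   # one process per input file

data_B = cfg.Component('data_B', files=['data_B_1.root'])
data_B.splitFactor = 1                     # a single process for this component

# the list of components you would then pass to your configuration
selectedComponents = [dy_jets, data_B]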

TIP Check the number of processors on your machine (cat /proc/cpuinfo), and define the number of processes accordingly.

TIP When debugging your code, make sure to use only one process.

5- Multiprocessing on LSF

Make sure you are logged in on an SLC6 lxplus machine, and not on another SLC6 machine without access to the CERN batch system.

Run the following command to start your jobs on LSF:

heppy_batch.py -o Batch simple_example_cfg.py -b 'bsub -q 1nh < ./batchScript.sh'

Control your jobs in the usual way. When they are done, do:

cd Batch
heppy_check.py * 
heppy_hadd.py .

6- Making your own analyzer

The event processing workflow of simple_example_cfg.py has one issue: the muons from Z decay are considered as jets...

The goal of this section is to create an analyzer from scratch and to insert it in the sequence to remove the jets corresponding to the two muons making the best Z candidate.

The analyzer will do the following:

  • get from the event the di-muons produced by the dimuons analyzer
  • get from the event the selected jets produced by the sel_jets analyzer
  • create a list event.sel_jets_nomu, and put in this list the selected jets that are far enough from the muons in (eta, phi) space.

6.1 Create the configuration

A basic choice for the configuration of the analyzer could be:

from PhysicsTools.Heppy.analyzers.examples.DeltaRCleaner import DeltaRCleaner
sel_jets_nomu = cfg.Analyzer(
    DeltaRCleaner,
    'sel_jets_nomu',
    dR=0.5
    )

Put this code in simple_example_cfg.py.

Do not forget to insert sel_jets_nomu in the sequence, after dimuons and sel_jets, so that DeltaRCleaner can access the products of these analyzers when it runs. A sketch of the updated sequence is shown below.
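A possible way to write it (the other entries are placeholders for the analyzers already defined in simple_example_cfg.py; adapt the list to what is actually there):

sequence = cfg.Sequence([
    # ... the analyzers that come before, e.g. the ones producing muons and jets ...
    dimuons,
    sel_jets,
    sel_jets_nomu,   # our new analyzer: needs the products of dimuons and sel_jets
    # ... the tree-filling analyzer comes after ...
    ])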

6.2 Create the analyzer code

Copy SimpleJetAnalyzer.py to DeltaRCleaner.py in the same directory, and make sure to replace all occurrences of the string SimpleJetAnalyzer by DeltaRCleaner.

Implement the following in DeltaRCleaner.process:

  • get from the event the di-muons produced by the dimuons analyzer
  • get from the event the selected jets produced by the sel_jets analyzer
  • create a list event.sel_jets_nomu, and put in this list the selected jets that are far enough from the muons in (eta, phi) space. You may find the matchObjectCollection function of the deltar.py module useful. A minimal sketch of the cleaning logic is given after this list.
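As a starting point, here is a minimal sketch of the cleaning logic, written as a standalone helper. It assumes that the deltaR function of deltar.py takes (eta1, phi1, eta2, phi2) and that jets and muons expose eta() and phi(); how you retrieve the two Z muons and the selected jets from the event depends on what the dimuons and sel_jets analyzers actually store there:

from PhysicsTools.HeppyCore.utils.deltar import deltaR

def clean_jets(jets, muons, drmin):
    '''Return the jets that are farther than drmin from every muon in (eta, phi) space.'''
    return [jet for jet in jets
            if all(deltaR(jet.eta(), jet.phi(), mu.eta(), mu.phi()) > drmin
                   for mu in muons)]

In DeltaRCleaner.process, you would then fill event.sel_jets_nomu with something like clean_jets(selected_jets, z_muons, self.cfg_ana.dR).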

Run and look at the event printout to make sure that the collection event.sel_jets_nomu has been created and filled with the jets that do not correspond to the muons.

6.3 Check the results

Modify ZJetsTreeAnalyzer.py to read jets from event.sel_jets_nomu instead of event.jets.

Run and use the root file to check that the jet pT distribution is what you expect now that you have removed the muons from the collection of jets.

-- ColinBernet - 18 July 2016
