pyframe

A light-weight python framework for analyzing ROOT ntuples in ATLAS.

News

Hey Will and Alex! Note that I've put together a new SVN area for future versions of pyframe. I think it would be really nice if we could clean up and merge our separate developments this summer and organize a new common version for reading xAOD.

Introduction

What is this framework?

pyframe is a light-weight python framework for analyzing ROOT ntuples in ATLAS. It allows you to read virtually any flat ntuple in python, quickly, in a minimal framework that can scale in complexity to jobs with several algorithms and tools.

Its design goals are:

1. No EDM

Eliminate the need for the user to maintain classes describing the entire content of the data. Variables are accessed dynamically, on demand. Of course execution will halt if you try to access a variable that doesn't exist, but you aren't responsible for knowing more about the data than you ask of it. No MakeClass. No ReaderMaker.
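
For a sense of what this means in practice, plain PyROOT already resolves branch names dynamically at access time, and pyframe builds on this. A minimal sketch (assuming a flat ntuple file test.root with a tree named physics and an int branch el_n, like the test ntuple used later in the Hello World section):

import ROOT

chain = ROOT.TChain('physics')
chain.Add('test.root')
for i in xrange(chain.GetEntries()):
    chain.GetEntry(i)
    n_el = chain.el_n  # branch looked up by name on access; no generated class needed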

2. Flat data

Data formats in active experiments are subject to change. pyframe makes no assumptions about the kind of data you want to analyze except:

  1. Your trees are flat, meaning that they are not filled with rich, user-defined types. The branches are basic types (int, float, ...), or arrays or std::vectors of basic types, or std::vector<std::vector<T> >.
  2. Groups of variables that represent the attributes of a common class of objects (of which there can be multiple per event) should have a common prefix in their branch names and be filled in arrays or vectors of consistent length. That is, if you have variables that describe multiple electrons per entry in the ntuple, they should be something like
int el_n;
std::vector<float> el_eta;
std::vector<float> el_phi;
std::vector<float> el_pt;

pyframe was designed with the use case of reading ATLAS D3PDs in mind, but there is no reason you can't read any flat ntuple with common branch prefixes.

3. But treat objects like objects

Flat data has the advantage that it can be easily inspected, and one doesn't have to maintain a class as its versions evolve in order to read the data. But for data representing things like electrons, the analyzer should be able to objectify the data, thinking of the electrons as individual objects, each having a pt, eta, etc. The objects should be able to be collected, filtered, and sorted.

This goal is realized by the VarProxy class in pyframe.core. A VarProxy internally holds a reference to the tree being read, a string prefix, and an integer index. Its __getattribute__ is overridden such that if you have a VarProxy instance p with prefix='el_' and index=1, then accessing

p.pt

makes the VarProxy prepend its prefix to the variable name and append its index, retrieving

tree.el_pt[1]
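
A minimal sketch of the idea (illustrative only, not the actual pyframe.core implementation; it uses __getattr__ for brevity where the real class overrides __getattribute__):

class SketchVarProxy(object):
    def __init__(self, tree, prefix, index):
        self.tree = tree
        self.prefix = prefix
        self.index = index
    def __getattr__(self, name):
        ## e.g. name='pt' with prefix='el_', index=1  ->  tree.el_pt[1]
        return getattr(self.tree, self.prefix + name)[self.index]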

You can build a list of VarProxies for all the electrons in your tree with the build_var_proxies function:

electrons = pyframe.core.build_var_proxies(chain, chain.el_n, prefix='el_')

Now you can pt-sort the electrons

electrons.sort(key=lambda x: x.pt, reverse=True)

and treat each one as if it were an instance of a class with members pt, eta, etc.:

el = electrons[0]
el.pt

4. Blackboard pattern for data sharing among algorithms

A pyframe job executes a list of algorithms in order, for each event in the data. Like StoreGate in ATLAS's Athena framework (built on Gaudi, which is also used by LHCb), pyframe allows algorithms to share data with each other by storing any event-level derived data in the store dictionary, from which any subsequent algorithm can later retrieve it.
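
For example (an illustrative sketch; the algorithm names here are hypothetical, but the store attribute is the same one used by the real algorithms shown later):

import pyframe

class Producer(pyframe.core.Algorithm):
    def execute(self, weight):
        pyframe.core.Algorithm.execute(self, weight)
        ## write event-level derived data to the blackboard
        self.store['dilepton_mass'] = 91.2

class Consumer(pyframe.core.Algorithm):
    def execute(self, weight):
        pyframe.core.Algorithm.execute(self, weight)
        ## any algorithm scheduled later in the job can read it back
        mass = self.store['dilepton_mass']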

5. Easy addition of user data

In the course of data analysis, one should be able to calculate new object-level derived quantities and associate them to the object. Python's ability to dynamically set any object's attributes makes attaching new variables to objects trivial:

p.meaning_of_life = 42

6. Be fast, enough

Being written in python, pyframe's design favors the programmer's time and sanity over the CPU. But of course your analysis needs to be quick enough to sample the data effectively. pyframe has been designed with event processing rates (per 2-3 GHz core) of 100 Hz considered sufficient and 500 Hz preferred. Processing lots of data should be done in parallel. pyframe has been designed for easy parallelization, using Python's standard multiprocessing module or by submitting jobs to a batch system.
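
The pattern is roughly the following (an illustrative sketch, not pyframe's actual code; the -p option shown later in the Hello World section wires this up for you): run the same event loop over disjoint subsets of the input files, one subset per worker process.

import multiprocessing

def process_files(input_files):
    ## ... build a TChain from input_files and run an EventLoop here ...
    return len(input_files)

if __name__ == '__main__':
    files = ['f1.root', 'f2.root', 'f3.root', 'f4.root']
    pool = multiprocessing.Pool(processes=2)
    results = pool.map(process_files, [files[0::2], files[1::2]])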

Installation

pyframe is intended to minimize dependencies. All you really need to use the framework is Python, gcc, and ROOT consistently setup. Since these are common pieces of software in HEP computing environments, setting up pyframe probably only involves you checking out some code with SVN and making sure some environment variables are properly set.

(Recommended) requirements

  • python 2.X, X ≥ 6
  • gcc 4.3
  • ROOT ≥ 5.26
    • ROOT built with python enabled (check with root -config)
    • and against the same python version
  • svn

To set up ROOT, make sure you have set the ROOTSYS environment variable, and then set

export PATH=$ROOTSYS/bin:$PATH
export LD_LIBRARY_PATH=$ROOTSYS/lib:$LD_LIBRARY_PATH
export PYTHONPATH=$ROOTSYS/pyroot:$ROOTSYS/lib:$PYTHONPATH

Please do not setup the ATLAS offline framework, athena, or anything else that could pollute your environment variables.

My environment on our Tier-3 cluster at Penn looks like this:

[bash]  which python
/opt/ATLASLocalRootBase/x86_64/python/2.6.5-x86_64-slc5-gcc43/sw/lcg/external/Python/2.6.5/x86_64-slc5-gcc43-opt/bin/python
[bash]  which gcc
/opt/ATLASLocalRootBase/x86_64/Gcc/gcc432_x86_64_slc5/slc5/gcc43/bin/gcc
[bash]  which root
alias root='root -l'
        /opt/ATLASLocalRootBase/x86_64/root/5.28.00c-slc5-gcc4.3/bin/root
[bash]  root -config
ROOT ./configure options:
... PYTHONDIR=/afs/cern.ch/sw/lcg/external/Python/2.6.5/x86_64-slc5-gcc43-opt --enable-python ...

The important thing to note is that Python and ROOT were built with a consistent gcc version, 4.3, and that Python was enabled when ROOT was configured and built. As a first test that you have a good environment, please make sure you can run python and import ROOT:

[bash]  python
Python 2.6.5 (r265:79063, Jun 29 2010, 16:03:43) 
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Loading pythonstartup.py
>>> import ROOT

Checking out the code

pyframe is part of a larger set of software tools maintained mainly by RyanReece and AlexanderTuna of the University of Pennsylvania for tau-related analyses in ATLAS. The most convenient way to check out and set up pyframe is with a release of PennTau.

You can browse the code online with Trac.

First, it helps to have an environment variable pointing to the repository, and to kinit so that you don't have to re-enter your password for each checkout.

export SVNPENN=svn+ssh://reece@svn.cern.ch/reps/penn
kinit reece@CERN.CH

To check out a release of PennTau for Z' 2012 analysis, do

svn co $SVNPENN/PennTau/releases/PennTau-ZprimeTauTau2012-00-00-00 PennTau-ZprimeTauTau2012-00-00-00

This simply checks out a directory for you to work in, with check_out.sh and setup.sh scripts for getting the dependent packages. Do that by:

cd PennTau-ZprimeTauTau2012-00-00-00/
. check_out.sh

Then setup:

. setup.sh

The setup script sets some environment variables for using pyframe, editing your PYTHONPATH and PATH. The first time you set up, you will get a warning that the RootCore setup script does not exist, but you are about to check that out next.

Building the external dependencies

pyframe is written entirely in python, but some of its algorithms make use of external packages to read a GRL or do pile-up reweighting, etc. The external packages are built with a common package called RootCore. You need to check out RootCore by doing

cd pyframe/ext
source co_RootCore.sh

This sets up RootCore for you. Running

./RootCore/scripts/build.sh packages.txt
will check out and build the dependent packages listed in packages.txt. Please watch that each package checks out ok. Then make sure everything built ok by asking it to compile again:

./RootCore/scripts/compile.sh 

And that's it. No more compiling. Now every time you want to use pyframe in a new environment, you simply have to source the setup script:

cd PennTau-ZprimeTauTau2012-00-00-00/
. setup.sh

Running hello world

There are working examples in the pyframe/test/ directory, running over a small ntuple: pyframe/test/test.root.

The job.min.py file is a close-to-minimal top-level job file for running the helloworld algorithm in pyframe. Run it with a line like:

./job.min.py

The job.more.py file demonstrates more of pyframe's tools. Note the use of VarProxies in the HelloWorld2 algorithm. Run it with a line like:

./job.more.py

The job.more_config.py file demonstrates the utility of the config module for passing configuration through command-line options and/or configuration files. Note that the main() function is actually handled by the config module. Also, note that one of the configuration options is to use the multiprocessing standard module for parallelizing your job on several cores.

Run it with a line like:

./job.more_config.py input.py

To use multiprocessing to run on two cores:

./job.more_config.py -p 2 input.py

More extensive jobs, using many more tools to do a ditau cut flow, are under development; see job.py for the top-level file.

Module organization

pyframe is organized into a few core modules, plus a module for each of the object types used to analyze ATLAS data (egamma, muon, tau, jet, met).

algs.py
Some general purpose algorithms, like LooperAlg which loops over a specified list of objects in the store dictionary and runs a configurable function on each of them.
config.py
Logic for parsing command line arguments. Includes the main function that should drive any job.
core.py
Defines the classes for the core functionality of pyframe: EventLoop, Algorithm, TreeProxy, and VarProxy.
egamma.py
Defines selector classes and helper functions for electrons (photons to be implemented).
filters.py
Some general purpose filter algorithms for skipping events.
grl.py
Implements a class for filtering events based on a Good Run List.
input_trans.py
A module for defining input transform classes for possibly manipulating the list of input files, for example, adding a root:// prefix for xrootd filesystems.
jet.py
Defines selector classes and helper functions for jets.
mc.py
Defines the MCEventWeight algorithm and is a module for implementing Monte Carlo truth helper functions.
met.py
Defines the MET object and recommendations.
muon.py
Defines selector classes and helper functions for muons.
p4calc.py
Module for four-vector arithmetic.
selectors.py
Defines the base class for all selectors and the SelectorAlg for using selectors to make lists of selected objects in the store dictionary.
tau.py
Defines selector classes and helper functions for taus.
trig.py
Defines selector classes and helper functions for triggers (nothing currently).
vxp.py
Defines selector classes and helper functions for vertices.

Writing your own job

Writing a pyframe job simply involves writing a top-level analyze function, which gets any configuration it needs from a config dictionary passed as its only argument. The main function in the pyframe.config module parses the command line arguments, including any optional python files passed on the command line, to set up the config dictionary, and then calls analyze(config). A skeleton job.py file should look something like

#!/usr/bin/env python
"""
Write a docstring describing your job.
"""

## ROOT
import ROOT
ROOT.gROOT.SetBatch(True)
 
## pyframe
import pyframe

## your modules
import helloworld

#_____________________________________________________________________________
def analyze(config):
    ## build the chain
    chain = ROOT.TChain('physics')
    for fn in config['input_files']:
        chain.Add(fn)

    ## configure the event loop
    loop = pyframe.core.EventLoop('pyframe_hello_world', version=config['version'])
    loop += helloworld.HelloWorld()

    ## run the job
    loop.run(chain, 0, config['max_events'])

#______________________________________________________________________________
if __name__ == '__main__':
    pyframe.config.main(analyze)

In the simple example above, the analyze function builds a TChain, defines an EventLoop, and then runs the loop. Note that it expects the config dictionary to contain at least the keys input_files, version, and max_events, but you can add whatever additional configuration your analyze function needs.

You should make this file executable (chmod +x job.py). Note that the main execution function is actually handled by the pyframe.config module. It parses the command line arguments, including any .py files passed as arguments. Additional .py files can be used to store modifications to the config dictionary, like pyframe/test/input.py, used by the example call ./job.more_config.py input.py shown in the Hello World section above.
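
Such a configuration file might contain something like the following (an illustrative sketch only; see pyframe/test/input.py for the real example; this assumes the config module executes the file with the config dictionary in scope):

## sketch of a job configuration file, passed as: ./job.more_config.py input.py
config['input_files'] = ['test.root']
config['max_events'] = 1000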

Writing your own Algorithms

Writing pyframe algorithms is a lot like writing Athena/Gaudi algorithms. They can have initialize, finalize, and execute methods as needed. initialize is called at the start of the analysis, finalize at the end, and execute is called once for every event.

For example, consider the HelloWorld2 algorithm in pyframe/test/helloworld.py:

class HelloWorld2(pyframe.core.Algorithm):
    #__________________________________________________________________________
    def __init__(self,
            key = 'selected_electrons',
            name = 'HelloWorld2',
            ):
        pyframe.core.Algorithm.__init__(self, name)
        self.key = key

    #__________________________________________________________________________
    def initialize(self):
        pyframe.core.Algorithm.initialize(self)
        log.info('This algorithm reads data via VarProxy instances.')

    #__________________________________________________________________________
    @pyframe.core.save_runtime
    def execute(self, weight):
        pyframe.core.Algorithm.execute(self, weight)
        selected = self.store[self.key]
        n = len(selected)
        log.info('len(%s) = %s' % (self.key, n))
        log.info('tlv.Pt() = %s' % [ p.tlv.Pt() for p in selected ])
        log.info('tlv.Eta() = %s' % [ p.tlv.Eta() for p in selected ])

You should implement an __init__ method, which is the python analog of a constructor. Note that python does not implicitly call the corresponding base class method when a method is overridden in a derived class, even __init__, so you should call the base class methods explicitly when appropriate.

The @pyframe.core.save_runtime decorator records the runtime of this algorithm's execute method. pyframe prints the runtime statistics to a log file at the end of the run. If your algorithm has an execute method, you should put that line just before its definition.

Complementary tools

  • tree_trimmer.py - pyroot tool for skimming/slimming flat ntuples
  • metaroot - pyroot plotting module
  • root2html.py - python script for dumping a detailed html page of root plots

-- RyanReece - 17-Jun-2011
