CMS Data Analysis School at CERN 2020: $\mathrm{t}\bar{\mathrm{t}}$ cross section measurement at $\sqrt{s} = 5.02$ TeV -- Exercise

Facilitators

Introduction

This exercise will guide you through a measurement of the top-quark cross-section in final state with one or two leptons using the 2017 5TeV dataset, and physics analysis in CMS in general, using the standard NanoAOD format.

The following colour scheme will be used:

  • Commands that you should run in the shell will be shown in a gray box, e.g.:
    ls -ltr
  • Code snippets to be looked at (e.g, as examples) or edited will be shown in a pink box, e.g.:
     std::cout << "This is a test" << std::endl; 
  • Output such as screen printouts will be shown in a green box, e.g.:
    This is a test

orange notes are to-do items for the facilitators while preparing the exercise - so these should be gone by the time we start smile

Getting set up

As for the pre-exercises and short exercises, we will use lxplus:

ssh -Y USERNAME@lxplus.cern.ch

bamboo

For most of the analysis we will use the bamboo framework, which only needs a recent version of ROOT (it's based on RDataFrame) and python3. We created a virtual environment with everything you will need (following this recipe). You can pick it up with

source /cvmfs/sft.cern.ch/lcg/views/LCG_98python3/x86_64-centos7-gcc9-opt/setup.sh
source ~davidp/public/CMSDAS2020CERN/bamboovenv/bin/activate

create "final" virtualenv (latest platform and bamboo) in the cmsdas area (Pieter)

... the most likely reason is that you did not start from a clean shell (the LCG software distribution brings its own gcc, python etc., similarly to CMSSW, but mixing several of these is generally a bad idea).

Please check your ~/.bashrc and try again in a new shell. If that doesn't help, please ask the facilitators.

In case you are using (t)csh instead of bash or zsh, you should use the csh version of the scripts above instead (setup.csh and activate.csh).

It is recommended to use lxplus, at least for this part — not so much for bamboo, but for the input files that you also need to copy to your laptop or institute storage, and for accessing lxbatch, which is useful for the single lepton channel.

To install bamboo there are three options: using the LCG software distribution from CVMFS and a virtual environment (that's what's done above), using conda and pip, or from a docker image. All three are described in the installation guide.

git repository for analysis code

Next, clone the skeleton repository. To be able to exchange code within your team, and with other teams, it's a good idea to create a personal fork on Github (with the "fork" button on the top right), and add it as a remote.

git clone -o skeleton https://github.com/.../....git
cd ...
git remote add origin git@github.com:GITHUBUSERNAME/....git

You can test that git and the repository are correctly set up with git fetch origin. It will not fetch anything at this point, but it will check that you can push to your fork from lxplus (you should have gone through the necessary steps in the pre-exercises; if something does not work (anymore) you can find all the necessary information there). If you want you can already add your colleagues' forks

git remote add THEIRNAME https://github.com/THEIRGITHUBUSERNAME/....git

fill in the ... when the skeleton repo is there (Pieter)

We will have a closer look at the framework in a minute, but you can already produce a few plots with the following command (this will print some information about the files and samples being processed, and take a few minutes):

bambooRun -m tt5TeV.py:HelloWorld dilepton.yml -o ../test_out_1

The output directory ../test_out_1 is chosen to avoid unintentionally committing the outputs; you can also make an output directory and add it to the .gitignore file, as you prefer.

combine

One team will work on the statistical analysis, with the combine tool. The setup for this is a bit different, but should be familiar, since based on CMSSW; that means that you should do this in a fresh shell (screen or tmux can be really convenient in such cases, just keep in mind that you'll need to type `kinit` and give your password once a day to keep it working).

add (move) installation instructions (Pietro)

Measurement of the $\mathrm{t}\bar{\mathrm{t}}$ production cross section

this part can be mostly copied from before, with some bamboo hints for A, B, and C added

Useful links

copy, update, and add a few more (NanoAOD doc, for instance)

A brief introduction to the bamboo framework

All frameworks — by definition — impose some constraints on what you can do, and how you can do that. That is done for good reasons, usually, but it makes it a little bit unfair to impose a specific one for a project of just a few days, since you have very little time to get acquainted to it. On the other hand, we do have enough time to build up a fairly complete analysis, and run into the limitations of ad-hoc code.

This introduction is an attempt to quickly get you up to speed with the mental model and assumptions in bamboo, but it will certainly miss some points that may be important, so feel free to ask and point those out. After the exercise you can open an issue (also for missing or unclear documentation) in the bamboo repository on gitlab (or discuss in the mattermost channel). There is also lengthier HTML documentation; for the exercises you will probably use the helper function list and API reference most.

Principles

bamboo is based on ROOT::RDataFrame. If you have not done the ROOT short exercise please have a look, especially at the second part, it is a very good introduction.

To summarise, the key point of RDataFrame is that you do not directly write the code that will loop over the events, read branches etc., but declare a computation graph based on the structure of the tree (the branch names and their types), and everything is computed efficiently (in a single loop over the inputs, if possible, and in (JIT-)compiled C++) afterwards, when retrieving the results. An important consequence of this is that the code you write needs to work for all possible events, e.g. if you want to make a histogram with the leading jet $p_\mathrm{T}$, the Histo1D node that does that should be under a Filter node that asks for at least one jet, otherwise the first event without jets will cause a segmentation fault.

The goal of bamboo is to make it easy to build a full analysis with RDataFrame. For the most common case of producing stack plots comparing the sum of MC samples to data, that means:

  1. since you will want to build (almost) the same RDataFrame graph for the different samples, you should define a list of samples (minimally a unique name and list of files, but it's useful to add more information, indicating e.g. if it is an MC sample or not, if it should be grouped with others in the plots) in a YAML configuration file, and define a module with a method that produces the RDataFrame graph for a sample.
    This is a strong constraint, but it also brings an important benefit: the code needed for splitting the work over batch jobs (one or several per sample) and combining the results is fully separated from the analysis-specific part, so to use lxbatch (or a different batch system) you only need to pass one command-line switch.
    Now it's also clear why the typical way to run something with bamboo is
    bambooRun -m tt5TeV.py:DileptonControlPlots dilepton.yml -o ../test_out_1
    dilepton.yml is the YAML configuration file with sample definitions, and tt5TeV.py a python module with a class DileptonControlPlots that has a definePlots method to build the RDataFrame graph.
    Feel free to launch this, it should not take more than a few minutes and give you a few plots in the ../test_out_1 directory.
  2. to make plots, you need an RDataFrame with Filter nodes (to apply cuts), Histo1D nodes (to fill histograms), and Define nodes (to make things efficient, by reusing intermediate results instead of calculating them again). In bamboo, you instead define Selection and Plot objects, which correspond almost one-to-one to Filter and Histo1D nodes, respectively, but add some additional information. Define nodes are in most cases inserted automatically behind the scenes.
    A Selection corresponds to a set of cuts (Filter node), and also holds an event weight. It is constructed by adding cuts and/or weight factors to another Selection (starting from the trivial one, with all events in the input and unit event weight), by calling the refine method of the parent selection. This is a rather natural way to think about an analysis, once one gets used to it: cuts define selection stages, or selection regions, and the corrections that are multiplied with the event weight depend on the cuts applied so far.
    A Plot is then the combination of a Selection with an x-axis variable, a binning, and layout options (two-dimensional plots are also supported, the only difference is that they have two variables and two binnings).
    So essentially, the definePlots method takes the root (trivial) Selection, and returns a list of Plot objects — defining other Selection objects as needed.
  3. in RDataFrame you can pass C++ callables (from C++), code strings (from C++ and python), and numba-compiled python methods. In bamboo, there is a python object representation of the tree (you can think of it as a generic event), which puts the structure back by combining branches with the same prefix, adding some helpers, etc. You can get attributes and call functions on this tree view to build expressions, e.g. you can use leadMu = tree.Muon[0] and fill leadMu.pt instead of Muon_pt[0], or (tree.Muon[0].p4+tree.Muon[1].p4).M() instead of something long enough to need a helper function. These expressions will automatically be converted into code strings (inserting Define nodes if needed) when needed for RDataFrame, if used in a Selection or Plot. Since the expressions are python objects, you can save intermediate results and reuse them to make your code simpler.
    This may seem a bit strange at first, but it's just a more pythonic view of the NanoAOD branches. The easiest way to get an idea of how this can be used is by trying. The following command will give you an IPython shell, the event view is called tree here:
    bambooRun -i -m tt5TeV.py:DileptonControlPlots dilepton.yml
    # try tree.<TAB>https://cmsdas.github.io/root-short-exercise/
    # or j = tree.Jet[0], and then j.<TAB>

An example in detail

Let's start from the example command above:

bambooRun -m tt5TeV.py:DileptonControlPlots dilepton.yml -o ../test_out_1
this shows the main inputs: a YAML configuration file, with a list of samples to process, options for the plots, and some other pieces of informations (e.g. the integrated luminosity in the data files), and a python class (which inherits from a base class, making it a 'bamboo analysis module' that bambooRun can call certain methods on). It will proceed in two steps:
  • first all samples (defined in the YAML configuration file) are "processed": in the case of plots this means that the histograms needed by the Plot objects defined in the module are filled (it's also possible to do other things, e.g. output reduced trees for training a multivariate classifier, but we don't need that here)
  • then a "postprocessing" step is run, combining the different samples into plots. The plotIt tool (code, documentation) is used for this, a standalone C++ executable that does just this. It needs the ROOT files and another YAML file, but bamboo can generate the latter (it is written to plots.yml in the output directory, but you probably won't need it)
Since the first step usually is by far the most time-consuming, bamboo makes it possible to split the two: bambooRun --onlypost (with otherwise the same arguments) will run just the second step (very convenient if all you need to do is change some colour or axis title). The first step can also be run in an distributed way on HTCondor or slurm, since it's always only looking at one sample at a time (samples can also be split, and the results merged). This is done by adding the --distributed=driver option.

... bamboo needs to know a few things about it (the code is generic, such that it can be used also on other HTCondor and slurm clusters). The easiest way is by saving this file as ~/.config/bamboorc. You can add more HTCondor options (job flavour, maximum CPU time and memory etc.) under the corresponding section.

The histograms for each sample will be written to /results/.root. It is important to know that for MC no scaling with the cross-section and integrated luminosity is done: these, together with the number of generated events, are passed to plotIt to do the normalisation. This has some advantages (there is one place where the scaling is done, and you can change the cross-section in the plots without rerunning), but may be unexpected, or need a bit of extra care when making the datacards.

Please have a look at dilepton.yml, the YAML configuration file. Most of it should be clear, or easy to guess, now.

Now we can have a closer look at the python code of the module we just ran. We can for now ignore the first part of the file, before the DileptonControlPlots class definition: it defines a base class to deal with the input samples (they are very similar to NanoAODv4 with some processing, but there are a few differences that need to be taken into account).

Let's start with the "hello world" example (a dimuon invariant mass plot):

    1class HelloWorld(Nano5TeVHistoModule):
    2    def definePlots(self, t, noSel, sample=None, sampleCfg=None):
    3        from bamboo.plots import Plot, CutFlowReport, SummedPlot
    4        from bamboo.plots import EquidistantBinning as EqB
    5        from bamboo import treefunctions as op
    6
    7        plots = []
    8
    9        muons = op.select(t.Muon, lambda mu : mu.pt > 20.)
   10        twoMuSel = noSel.refine("twoMuons", cut=[ op.rng_len(muons) > 1 ])
   11        plots.append(Plot.make1D("dimu_M",
   12            op.invariant_mass(muons[0].p4, muons[1].p4), twoMuSel, EqB(100, 20., 120.),
   13            title="Dimuon invariant mass", plotopts={"show-overflow":False}))
   14
   15        return plots

It's quite short, so let's go through line by line: first, we declare a class, our module, and inherit from the base class (if you go up a few lines, you see that it combines everything for histograms with the adjustments for our samples). Since it's a module to fill histograms, there's the definePlots which returns a list of Plot objects, as explained above; this will be called once for each sample (or once per job, if the sample is split over multiple batch jobs); the event loop will run in (mostly JIT-compiled) C++, that's what makes this fast. The arguments to definePlots are self, a reference to the module itself, t, a view of an event, noSel, the base selection, sample, the name of the sample, and sampleCfg, a dictionary with the fields defined in the YAML analysis config for this sample.

Most of the imports speak for themselves: a CutFlowReport will print the numbers of events passing different selections, and a SummedPlot simply adds up the histograms from different plots (e.g. to make a combined jet pt plot from those for the dimuon and dielectron categories). EquidistantBinning holds an axis binning (number of bins, minimum, and maximum); VariableBinning also exists. bamboo.treefunctions defines a collection of helper functions, that can be used on objects retrieved from the event view, e.g. op.invariant_mass(tree.Muon[0], tree.Muon[1]) — the full list is available here (this is necessary because we're working with placeholder objects instead of the actual values; these functions record the operations instead of immediately performing them).

Now things get more interesting: first we define an empty list that we can add things to (and return, eventually); this example only defines one plot, but a realistic analysis quickly grows to hundreds of plots that are produced in one go.

muons = op.select(t.Muon, lambda mu : mu.pt > 20.)
makes a list of muons that pass a selection. Adding more cuts is easy, by replacing mu.pt > 20 with op.AND(mu.pt > 20, ...), and muons can be used in the same way as t.Muon. This syntax may seem a bit strange at first, but it's very powerful (as an exercise, think about how you could select jets that are not within a certain angular distance from any electron or muon — the answer is also in the bamboo documentation).

The next line also shows something that was mentioned above: it defines a Selection of the events with two muons.

twoMuSel = noSel.refine("twoMuons", cut=[ op.rng_len(muons) > 1 ])
and the next line defines a Plot with the dimuon invariant mass (which can safely be calculated for events with two muons):
plots.append(Plot.make1D("dimu_M",
            op.invariant_mass(muons[0].p4, muons[1].p4), twoMuSel, EqB(100, 20., 120.),
            title="Dimuon invariant mass", plotopts={"show-overflow":False}))

A few more ingredients you may need

finish writing this smile (Pieter); tasks A, B and C (or the section that defines them) should link here; add github/gitlab links

Edit | Attach | Watch | Print version | History: r11 < r10 < r9 < r8 < r7 | Backlinks | Raw View | Raw edit | More topic actions...
Topic revision: r8 - 2020-09-02 - PieterDavid
 
    • Cern Search Icon Cern Search
    • TWiki Search Icon TWiki Search
    • Google Search Icon Google Search

    Main All webs login

This site is powered by the TWiki collaboration platform Powered by PerlCopyright &© 2008-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
or Ideas, requests, problems regarding TWiki? use Discourse or Send feedback