ls -ltr
std::cout << "This is a test" << std::endl;
This is a test
orange notes are to-do items for the facilitators while preparing the exercise - so these should be gone by the time we start
ssh -Y USERNAME@lxplus.cern.ch
source /cvmfs/sft.cern.ch/lcg/views/LCG_98python3/x86_64-centos7-gcc9-opt/setup.sh
source ~davidp/public/CMSDAS2020CERN/bamboovenv/bin/activate
create "final" virtualenv (latest platform and bamboo) in the cmsdas area (Pieter)
~/.bashrc
and try again in a new shell. If that doesn't help, please ask the facilitators.
In case you are using (t)csh instead of bash or zsh, you should use the csh version of the scripts above instead (setup.csh and activate.csh).
git clone -o skeleton https://github.com/.../....git
cd ...
git remote add origin git@github.com:GITHUBUSERNAME/....git
You can test that git and the repository are correctly set up with
git fetch origin
It will not fetch anything at this point, but it will check that you can push to your fork from lxplus (you should have gone through the necessary steps in the pre-exercises; if something does not work (anymore) you can find all the necessary information there).
If you want, you can already add your colleagues' forks:
git remote add THEIRNAME https://github.com/THEIRGITHUBUSERNAME/....git
fill in the ... when the skeleton repo is there (Pieter)
We will have a closer look at the framework in a minute, but you can already produce a few plots with the following command (this will print some information about the files and samples being processed, and take a few minutes):
bambooRun -m tt5TeV.py:HelloWorld dilepton.yml -o ../test_out_1
The output directory ../test_out_1 is chosen to avoid unintentionally committing the outputs; you can also make an output directory and add it to the .gitignore file, as you prefer.
add (move) installation instructions (Pietro)
this part can be mostly copied from before, with some bamboo hints for A, B, and C added
copy, update, and add a few more (NanoAOD doc, for instance)
ROOT::RDataFrame
.
If you have not done the ROOT short exercise (https://cmsdas.github.io/root-short-exercise/)
Histo1D
node that does that should be under a Filter
node that asks for at least one jet, otherwise the first event without jets will cause a segmentation fault.
The goal of bamboo is to make it easy to build a full analysis with RDataFrame.
For the most common case of producing stack plots comparing the sum of MC samples to data, that means:
bambooRun -m tt5TeV.py:HelloWorld dilepton.yml -o ../test_out_1
dilepton.yml is the YAML configuration file with sample definitions, and tt5TeV.py a python module with a class HelloWorld that has a definePlots method to build the RDataFrame graph. The outputs are written to the ../test_out_1 directory.
With plain RDataFrame, an analysis is built from Filter nodes (to apply cuts), Histo1D nodes (to fill histograms), and Define nodes (to make things efficient, by reusing intermediate results instead of calculating them again). In bamboo, you instead define Selection and Plot objects, which correspond almost one-to-one to Filter and Histo1D nodes, respectively, but add some additional information. Define nodes are in most cases inserted automatically behind the scenes.
A Selection corresponds to a set of cuts (Filter node), and also holds an event weight. It is constructed by adding cuts and/or weight factors to another Selection (starting from the trivial one, with all events in the input and unit event weight), by calling the refine method of the parent selection. This is a rather natural way to think about an analysis, once one gets used to it: cuts define selection stages, or selection regions, and the corrections that are multiplied with the event weight depend on the cuts applied so far.
A Plot is then the combination of a Selection with an x-axis variable, a binning, and layout options (two-dimensional plots are also supported, the only difference is that they have two variables and two binnings).
The definePlots method takes the root (trivial) Selection, and returns a list of Plot objects, defining other Selection objects as needed.
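To make the idea of refining selections concrete, here is a toy model in plain Python. It is not bamboo code: ToySelection and the event dictionary are made up for this illustration, and the real Selection builds RDataFrame nodes instead of evaluating events one by one. It only shows how each refined selection inherits the cuts and weight factors of its parent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySelection:
    """Hypothetical stand-in for a selection: a name, cuts, and weight factors."""
    name: str
    cuts: tuple = ()
    weights: tuple = ()

    def refine(self, name, cut=(), weight=()):
        # The child keeps everything from the parent and adds its own cuts/weights
        return ToySelection(name, self.cuts + tuple(cut), self.weights + tuple(weight))

    def passes(self, event):
        # An event is selected only if it passes all accumulated cuts
        return all(c(event) for c in self.cuts)

    def weight(self, event):
        # The event weight is the product of all accumulated factors (1 if none)
        w = 1.0
        for factor in self.weights:
            w *= factor(event)
        return w


# Trivial root selection: all events, unit weight
noSel = ToySelection("noSel")
twoMu = noSel.refine("twoMuons", cut=[lambda ev: ev["nMuon"] >= 2])
twoMuOneJet = twoMu.refine(
    "twoMuonsOneJet",
    cut=[lambda ev: ev["nJet"] >= 1],
    weight=[lambda ev: ev["muonSF"]],  # made-up correction factor
)

event = {"nMuon": 2, "nJet": 0, "muonSF": 0.98}
print(twoMu.passes(event), twoMuOneJet.passes(event))
```

The toy event above passes the dimuon stage but not the stage that also requires a jet, which is the "selection stages" picture described in the text.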
You can, for instance, define leadMu = tree.Muon[0] and fill leadMu.pt instead of Muon_pt[0], or (tree.Muon[0].p4+tree.Muon[1].p4).M() instead of something long enough to need a helper function. These expressions are automatically converted into code strings for RDataFrame (inserting Define nodes if needed) when they are used in a Selection or Plot. Since the expressions are python objects, you can save intermediate results and reuse them to make your code simpler.
To explore what is available in tree, you can start an interactive session:
bambooRun -i -m tt5TeV.py:HelloWorld dilepton.yml
# try tree.<TAB>
# or j = tree.Jet[0], and then j.<TAB>
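The way python expressions can turn into code strings can be sketched with a few lines of operator overloading. ToyExpr and ToyMuon below are hypothetical classes written for this illustration; bamboo's actual expression classes are more elaborate, so take this only as a picture of the mechanism.

```python
class ToyExpr:
    """Hypothetical expression node: stores a C++-like code string, not a value."""

    def __init__(self, code):
        self.code = code

    def _binary(self, op, other):
        # Plain python numbers are written out literally, expressions contribute their code
        rhs = other.code if isinstance(other, ToyExpr) else repr(other)
        return ToyExpr(f"({self.code} {op} {rhs})")

    def __add__(self, other):
        return self._binary("+", other)

    def __mul__(self, other):
        return self._binary("*", other)

    def __gt__(self, other):
        return self._binary(">", other)


class ToyMuon:
    """Hypothetical view of one muon: attributes map to branch elements."""

    def __init__(self, index):
        self.pt = ToyExpr(f"Muon_pt[{index}]")
        self.eta = ToyExpr(f"Muon_eta[{index}]")


leadMu = ToyMuon(0)
subMu = ToyMuon(1)
sumPt = leadMu.pt + subMu.pt  # nothing is computed here, only recorded
cut = sumPt > 50              # intermediate results can be reused freely
print(cut.code)               # prints ((Muon_pt[0] + Muon_pt[1]) > 50)
```

No event data is touched when these lines run; the string is only produced when something asks for it, which is why intermediate python variables cost nothing.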
bambooRun -m tt5TeV.py:HelloWorld dilepton.yml -o ../test_out_1
This shows the main inputs: a YAML configuration file, with a list of samples to process, options for the plots, and some other pieces of information (e.g. the integrated luminosity in the data files), and a python class (which inherits from a base class, making it a 'bamboo analysis module' that bambooRun can call certain methods on).
It will proceed in two steps:
Plot objects defined in the module are filled (it's also possible to do other things, e.g. output reduced trees for training a multivariate classifier, but we don't need that here)
plots.yml
in the output directory, but you probably won't need it)
bambooRun --onlypost
(with otherwise the same arguments) will run just the second step (very convenient if all you need to do is change some colour or axis title).
The first step can also be run in a distributed way on HTCondor or Slurm, since it always looks at only one sample at a time (samples can also be split, and the results merged). This is done by adding the --distributed=driver option.
~/.config/bamboorc
. You can add more HTCondor options (job flavour, maximum CPU time and memory etc.) under the corresponding section.
It is important to know that for MC no scaling with the cross-section and integrated luminosity is done: these, together with the number of generated events, are passed to plotIt to do the normalisation. This has some advantages (there is one place where the scaling is done, and you can change the cross-section in the plots without rerunning), but may be unexpected, or need a bit of extra care when making the datacards.
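If you need the normalisation yourself (for the datacards, for instance), the usual scale factor is the cross-section times the integrated luminosity, divided by the number of generated events. The sketch below assumes that standard formula; the function name and all numbers are made up, and for samples with generator weights the sum of generated weights would take the place of the event count.

```python
def mc_scale_factor(cross_section_pb, luminosity_invpb, n_generated):
    """Factor that scales an unnormalised MC histogram to the expected yield.

    Assumes the cross-section is in pb and the luminosity in pb^-1,
    so that their product is a number of events.
    """
    if n_generated <= 0:
        raise ValueError("number of generated events must be positive")
    return cross_section_pb * luminosity_invpb / n_generated


# Made-up numbers, for illustration only
scale = mc_scale_factor(cross_section_pb=70.0, luminosity_invpb=300.0, n_generated=1_000_000)
raw_count = 5000  # unscaled MC events passing some selection
expected_yield = raw_count * scale
print(f"scale = {scale:.4f}, expected yield = {expected_yield:.1f}")
```

Because the histograms are stored unscaled, changing the cross-section only changes this factor, not the histograms themselves.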
Please have a look at dilepton.yml
, the YAML configuration file. Most of it should be clear, or easy to guess, now.
Now we can have a closer look at the python code of the module we just ran. We can for now ignore the first part of the file, before the HelloWorld
class definition: it defines a base class to deal with the input samples (they are very similar to NanoAODv4 with some processing, but there are a few differences that need to be taken into account).
Let's start with the "hello world" example (a dimuon invariant mass plot):
class HelloWorld(Nano5TeVHistoModule):
    def definePlots(self, t, noSel, sample=None, sampleCfg=None):
        from bamboo.plots import Plot, CutFlowReport, SummedPlot
        from bamboo.plots import EquidistantBinning as EqB
        from bamboo import treefunctions as op

        plots = []

        muons = op.select(t.Muon, lambda mu : mu.pt > 20.)
        twoMuSel = noSel.refine("twoMuons", cut=[ op.rng_len(muons) > 1 ])
        plots.append(Plot.make1D("dimu_M",
            op.invariant_mass(muons[0].p4, muons[1].p4), twoMuSel, EqB(100, 20., 120.),
            title="Dimuon invariant mass", plotopts={"show-overflow":False}))

        return plots
definePlots
which returns a list of Plot
objects, as explained above; this will be called once for each sample (or once per job, if the sample is split over multiple batch jobs); the event loop will run in (mostly JIT-compiled) C++, that's what makes this fast.
The arguments to definePlots are: self, a reference to the module itself; t, a view of an event; noSel, the base selection; sample, the name of the sample; and sampleCfg, a dictionary with the fields defined in the YAML analysis config for this sample.
Most of the imports speak for themselves: a CutFlowReport
will print the numbers of events passing different selections, and a SummedPlot
simply adds up the histograms from different plots (e.g. to make a combined jet pt plot from those for the dimuon and dielectron categories).
EquidistantBinning
holds an axis binning (number of bins, minimum, and maximum); VariableBinning
also exists.
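As a reminder of what the three numbers of an equidistant binning mean, the following plain-Python sketch computes the bin edges and looks up the bin of a value. The helper names are hypothetical and not part of bamboo; the same lookup also works for a list of variable-width edges.

```python
from bisect import bisect_right


def equidistant_edges(n_bins, x_min, x_max):
    """Return the n_bins + 1 edges of an equidistant binning."""
    width = (x_max - x_min) / n_bins
    return [x_min + i * width for i in range(n_bins + 1)]


def find_bin(edges, x):
    """Return the 0-based bin index of x, or None for under- or overflow."""
    if x < edges[0] or x >= edges[-1]:
        return None
    return bisect_right(edges, x) - 1


# Same numbers as EqB(100, 20., 120.): 100 bins of width 1 between 20 and 120
edges = equidistant_edges(100, 20., 120.)
print(len(edges), edges[0], edges[-1])

# A variable binning is just an explicit list of edges
variable_edges = [20., 50., 80., 120.]
print(find_bin(edges, 91.2), find_bin(variable_edges, 91.2))
```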
bamboo.treefunctions defines a collection of helper functions that can be used on objects retrieved from the event view, e.g. op.invariant_mass(tree.Muon[0].p4, tree.Muon[1].p4); the full list is available in the bamboo documentation.
muons = op.select(t.Muon, lambda mu : mu.pt > 20.)
makes a list of muons that pass a selection. Adding more cuts is easy, by replacing mu.pt > 20 with op.AND(mu.pt > 20, ...), and muons can be used in the same way as t.Muon.
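If the lambda syntax is unfamiliar, it may help to see the same pattern on ordinary python objects. The snippet below is only an eager analogy with made-up muons stored as dictionaries and a hypothetical select helper; in bamboo the predicate is applied to symbolic expressions, which is why conditions are combined with op.AND there rather than with python's and (which cannot be overloaded).

```python
def select(collection, predicate):
    """Eager stand-in for a selection on a collection: keep objects passing the predicate."""
    return [obj for obj in collection if predicate(obj)]


# Made-up muons of one event
muons_in_event = [
    {"pt": 35.2, "eta": 0.4},
    {"pt": 12.0, "eta": -1.1},
    {"pt": 27.5, "eta": 2.6},
]

# One cut, as in the example above
muons = select(muons_in_event, lambda mu: mu["pt"] > 20.)

# Two cuts combined; with real values python's "and" is fine
central_muons = select(muons_in_event, lambda mu: mu["pt"] > 20. and abs(mu["eta"]) < 2.4)

print(len(muons), len(central_muons))
```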
This syntax may seem a bit strange at first, but it's very powerful (as an exercise, think about how you could select jets that are not within a certain angular distance from any electron or muon; the answer is also in the bamboo documentation).
The following line defines a Selection of the events with two muons:
twoMuSel = noSel.refine("twoMuons", cut=[ op.rng_len(muons) > 1 ])
and the next line defines a Plot with the dimuon invariant mass (which can safely be calculated for events with two muons):
plots.append(Plot.make1D("dimu_M", op.invariant_mass(muons[0].p4, muons[1].p4), twoMuSel, EqB(100, 20., 120.), title="Dimuon invariant mass", plotopts={"show-overflow":False}))
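For reference, the quantity being plotted can be worked out by hand from the muon kinematics. The sketch below is plain Python with hypothetical helper names, not what bamboo runs internally: it builds four-momenta from pt, eta, phi and mass, and evaluates M² = (ΣE)² − |Σp|² in natural units (GeV).

```python
import math

MUON_MASS = 0.10566  # GeV


def p4_from_pt_eta_phi_m(pt, eta, phi, mass):
    """Return (E, px, py, pz) for the given transverse momentum, pseudorapidity, azimuth, and mass."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(px**2 + py**2 + pz**2 + mass**2)
    return (energy, px, py, pz)


def invariant_mass(*p4s):
    """Invariant mass of the sum of the given four-momenta."""
    energy = sum(p[0] for p in p4s)
    px = sum(p[1] for p in p4s)
    py = sum(p[2] for p in p4s)
    pz = sum(p[3] for p in p4s)
    m2 = energy**2 - px**2 - py**2 - pz**2
    return math.sqrt(max(m2, 0.0))  # guard against tiny negative values from rounding


# Two back-to-back central muons with pt = 45 GeV each: mass close to 90 GeV
mu1 = p4_from_pt_eta_phi_m(45., 0., 0., MUON_MASS)
mu2 = p4_from_pt_eta_phi_m(45., 0., math.pi, MUON_MASS)
print(f"{invariant_mass(mu1, mu2):.2f}")
```

The calculation needs two muons to exist, which is why the plot is attached to the two-muon selection rather than to the trivial one.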
finish writing this (Pieter); tasks A, B and C (or the section that defines them) should link here; add github/gitlab links