TWiki
>
Main Web
>
TWikiUsers
>
PieterDavid
>
Draft_tt5TeV_CMSDASCERN2020_piedavid
(revision 9) (raw view)
Edit
Attach
PDF
<style type="text/css" media="all"> pre { text-align: left; padding: 10px; margin-left: 20px; color: black; } pre.command { background-color: lightgrey; } pre.cfg { background-color: lightblue; } pre.code { background-color: lightpink; } pre.output { background-color: lightgreen; } p.todo { background-color: coral; } </style> <script type="text/x-mathjax-config">MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}});</script> <script type="text/javascript" src="/twiki/pub/TWiki/MathJax/mathjax-MathJax-727332c/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script> ---+ CMS Data Analysis School at CERN 2020: $\mathrm{t}\bar{\mathrm{t}}$ cross section measurement at $\sqrt{s} = 5.02$ TeV -- Exercise %TWISTY{mode="div" showlink="Show Contents" hidelink="Hide contents..." remember="on" showimgleft="%ICONURLPATH{toggleopen-small}%" hideimgleft="%ICONURLPATH{toggleclose-small}%"}% %TOC{title="Contents:"}% %ENDTWISTY% ---+ Facilitators * [[mailto:jrgonzal@cern.ch][Juan González]] (Universidad de Oviedo) * [[mailto:Andrea.Giammanco@cern.ch][Andrea Giammanco]] (Université catholique de Louvain) * [[mailto:Pietro.Vischia@cern.ch][Pietro Vischia]] (Université catholique de Louvain) * [[mailto:Pieter.David@cern.ch][Pieter David]] (Université catholique de Louvain) ---+ Introduction This exercise will guide you through a measurement of the top-quark cross-section in final state with one or two leptons using the 2017 5TeV dataset, and physics analysis in CMS in general, using the standard NanoAOD format. The following colour scheme will be used: * Commands that you should run in the shell will be shown in a gray box, e.g.: <pre class="command">ls -ltr</pre> * Code snippets to be looked at (e.g, as examples) or edited will be shown in a pink box, e.g.: <pre class="code"> std::cout << "This is a test" << std::endl; </pre> * Output such as screen printouts will be shown in a green box, e.g.: <pre class="output">This is a test</pre> <p class="todo">orange notes are to-do items for the facilitators while preparing the exercise - so these should be gone by the time we start :-) </p> ---+ Getting set up As for the pre-exercises and short exercises, we will use lxplus: <pre class="command"> ssh -Y USERNAME@lxplus.cern.ch </pre> ---++ bamboo For most of the analysis we will use the [[https://gitlab.cern.ch/cp3-cms/bamboo][bamboo]] framework, which only needs a recent version of ROOT (it's based on !RDataFrame) and python3. We created a [[https://docs.python.org/3/library/venv.html][virtual environment]] with everything you will need (following [[https://cp3.irmp.ucl.ac.be/~pdavid/bamboo/install.html][this recipe]]). You can pick it up with <pre class="command"> source /cvmfs/sft.cern.ch/lcg/views/LCG_98python3/x86_64-centos7-gcc9-opt/setup.sh source ~davidp/public/CMSDAS2020CERN/bamboovenv/bin/activate </pre> <p class="todo">create "final" virtualenv (latest platform and bamboo) in the cmsdas area (Pieter)</p> %TWISTY{mode="div" showlink="If this does not work" hidelink="If this does not work..." firststart="hide" showimgright="%ICONURLPATH{toggleopen-small}%" hideimgright="%ICONURLPATH{toggleclose-small}%"}% ... the most likely reason is that you did not start from a clean shell (the LCG software distribution brings its own gcc, python etc., similarly to CMSSW, but mixing several of these is generally a bad idea). Please check your =~/.bashrc= and try again in a new shell. If that doesn't help, please ask the facilitators. In case you are using (t)csh instead of bash or zsh, you should use the csh version of the scripts above instead (=setup.csh= and =activate.csh=). %ENDTWISTY% %TWISTY{mode="div" showlink="Other installation options" hidelink="Other installation options" firststart="hide" showimgright="%ICONURLPATH{toggleopen-small}%" hideimgright="%ICONURLPATH{toggleclose-small}%"}% It is recommended to use lxplus, at least for this part not so much for bamboo, but for the input files that you also need to copy to your laptop or institute storage, and for accessing lxbatch, which is useful for the single lepton channel. To install bamboo there are three options: using the LCG software distribution from CVMFS and a virtual environment (that's what's done above), using conda and pip, or from a docker image. All three are described in the [[https://cp3.irmp.ucl.ac.be/~pdavid/bamboo/install.html][installation guide]]. %ENDTWISTY% ---++ git repository for analysis code Next, clone the skeleton repository. To be able to exchange code within your team, and with other teams, it's a good idea to create a personal fork on Github (with the "fork" button on the top right), and add it as a remote. <pre class="command"> git clone -o skeleton https://github.com/.../....git cd ... git remote add origin git@github.com:GITHUBUSERNAME/....git </pre> You can test that git and the repository are correctly set up with =git fetch origin=. It will not fetch anything at this point, but it will check that you can push to your fork from lxplus (you should have gone through the necessary steps in the [[https://twiki.cern.ch/twiki/bin/viewauth/CMS/SWGuideCMSDASAtCERNPreExerciseFirstSet#Obtain_a_github_account][pre-exercises]]; if something does not work (anymore) you can find all the necessary information there). If you want you can already add your colleagues' forks <pre class="command"> git remote add THEIRNAME https://github.com/THEIRGITHUBUSERNAME/....git </pre> <p class="todo">fill in the ... when the skeleton repo is there (Pieter)</todo> We will have a closer look at the framework in a minute, but you can already produce a few plots with the following command (this will print some information about the files and samples being processed, and take a few minutes): <pre class="command"> bambooRun -m tt5TeV.py:HelloWorld dilepton.yml -o ../test_out_1 </pre> The output directory =../test_out_1= is chosen to avoid unintentionally committing the outputs; you can also make an =output= directory and add it to the =.gitignore= file, as you prefer. ---++ combine One team will work on the statistical analysis, with the [[http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/][combine]] tool. The setup for this is a bit different, but should be familiar, since based on CMSSW; that means that you should do this in a fresh shell (screen or tmux can be really convenient in such cases, just keep in mind that you'll need to type `kinit` and give your password once a day to keep it working). <p class="todo">add (move) installation instructions (Pietro)</p> ---+ Measurement of the $\mathrm{t}\bar{\mathrm{t}}$ production cross section <p class="todo">this part can be mostly copied from before, with some bamboo hints for A, B, and C added</p> ---+ Useful links <p class="todo">copy, update, and add a few more (NanoAOD doc, for instance)</p> ---+ A brief introduction to the bamboo framework All frameworks by definition impose some constraints on what you can do, and how you can do that. That is done for good reasons, usually, but it makes it a little bit unfair to impose a specific one for a project of just a few days, since you have very little time to get acquainted to it. On the other hand, we do have enough time to build up a fairly complete analysis, and run into the limitations of ad-hoc code. This introduction is an attempt to quickly get you up to speed with the mental model and assumptions in bamboo, but it will certainly miss some points that may be important, so feel free to ask and point those out. After the exercise you can open an issue (also for missing or unclear documentation) in the [[https://gitlab.cern.ch/cp3-cms/bamboo][bamboo repository]] on gitlab (or discuss in the [[https://mattermost.web.cern.ch/cms-exp/channels/bamboo][mattermost channel]]). There is also lengthier [[https://cp3.irmp.ucl.ac.be/~pdavid/bamboo/index.html][HTML documentation]]; for the exercises you will probably use the [[https://cp3.irmp.ucl.ac.be/~pdavid/bamboo/treefunctions.html][helper function]] list and [[https://cp3.irmp.ucl.ac.be/~pdavid/bamboo/apiref.html][API reference]] most. ---++ Principles bamboo is based on =[[https://root.cern/doc/master/classROOT_1_1RDataFrame.html][ROOT::RDataFrame]]=. If you have not done the [[https://cmsdas.github.io/root-short-exercise/][ROOT short exercise]] please have a look, especially at the second part, it is a very good introduction. To summarise, the key point of !RDataFrame is that you do not directly write the code that will loop over the events, read branches etc., but declare a computation graph based on the structure of the tree (the branch names and their types), and everything is computed efficiently (in a single loop over the inputs, if possible, and in (JIT-)compiled C++) afterwards, when retrieving the results. An important consequence of this is that the code you write needs to work for all possible events, e.g. if you want to make a histogram with the leading jet $p_\mathrm{T}$, the =Histo1D= node that does that should be under a =Filter= node that asks for at least one jet, otherwise the first event without jets will cause a segmentation fault. The goal of bamboo is to make it easy to build a full analysis with !RDataFrame. For the most common case of producing stack plots comparing the sum of MC samples to data, that means: 1. since you will want to build (almost) the same !RDataFrame graph for the different samples, you should define a list of samples (minimally a unique name and list of files, but it's useful to add more information, indicating e.g. if it is an MC sample or not, if it should be grouped with others in the plots) in a YAML configuration file, and define a module with a method that produces the !RDataFrame graph for a sample.<br/> This is a strong constraint, but it also brings an important benefit: the code needed for splitting the work over batch jobs (one or several per sample) and combining the results is fully separated from the analysis-specific part, so to use lxbatch (or a different batch system) you only need to pass one command-line switch.<br /> Now it's also clear why the typical way to run something with bamboo is <pre class="command">bambooRun -m tt5TeV.py:HelloWorld dilepton.yml -o ../test_out_1</pre> =dilepton.yml= is the YAML configuration file with sample definitions, and =tt5TeV.py= a python module with a class =HelloWorld= that has a =definePlots= method to build the !RDataFrame graph.<br /> Feel free to launch this, it should not take more than a few minutes and give you a few plots in the =../test_out_1= directory. 2. to make plots, you need an !RDataFrame with =Filter= nodes (to apply cuts), =Histo1D= nodes (to fill histograms), and =Define= nodes (to make things efficient, by reusing intermediate results instead of calculating them again). In bamboo, you instead define =[[https://cp3.irmp.ucl.ac.be/~pdavid/bamboo/apiref.html#bamboo.plots.Selection][Selection]]= and =[[https://cp3.irmp.ucl.ac.be/~pdavid/bamboo/apiref.html#bamboo.plots.Plot][Plot]]= objects, which correspond almost one-to-one to =Filter= and =Histo1D= nodes, respectively, but add some additional information. =Define= nodes are in most cases inserted automatically behind the scenes.<br /> A =[[https://cp3.irmp.ucl.ac.be/~pdavid/bamboo/apiref.html#bamboo.plots.Selection][Selection]]= corresponds to a set of cuts (=Filter= node), and also holds an event weight. It is constructed by adding cuts and/or weight factors to another =Selection= (starting from the trivial one, with all events in the input and unit event weight), by calling the =[[https://cp3.irmp.ucl.ac.be/~pdavid/bamboo/apiref.html#bamboo.plots.Selection.refine][refine]]= method of the parent selection. This is a rather natural way to think about an analysis, once one gets used to it: cuts define selection stages, or selection regions, and the corrections that are multiplied with the event weight depend on the cuts applied so far.<br /> A =[[https://cp3.irmp.ucl.ac.be/~pdavid/bamboo/apiref.html#bamboo.plots.Plot][Plot]]= is then the combination of a =Selection= with an x-axis variable, a binning, and layout options (two-dimensional plots are also supported, the only difference is that they have two variables and two binnings).<br /> So essentially, the =[[https://cp3.irmp.ucl.ac.be/~pdavid/bamboo/apiref.html#bamboo.analysismodules.HistogramsModule.definePlots][definePlots]]= method takes the root (trivial) =Selection=, and returns a list of =Plot= objects defining other =Selection= objects as needed. 3. in !RDataFrame you can pass C++ callables (from C++), code strings (from C++ and python), and numba-compiled python methods. In bamboo, there is a python object representation of the tree (you can think of it as a generic event), which puts the structure back by combining branches with the same prefix, adding some helpers, etc. You can get attributes and call functions on this tree view to build expressions, e.g. you can use =leadMu = tree.Muon[0]= and fill =leadMu.pt= instead of =Muon_pt[0]=, or =(tree.Muon[0].p4+tree.Muon[1].p4).M()= instead of something long enough to need a helper function. These expressions will automatically be converted into code strings (inserting =Define= nodes if needed) when needed for !RDataFrame, if used in a =Selection= or =Plot=. Since the expressions are python objects, you can save intermediate results and reuse them to make your code simpler.<br /> This may seem a bit strange at first, but it's just a more pythonic view of the !NanoAOD branches. The easiest way to get an idea of how this can be used is by trying. The following command will give you an [[http://ipython.org][IPython]] shell, the event view is called =tree= here: <pre class="command"> bambooRun -i -m tt5TeV.py:HelloWorld dilepton.yml # try tree.<TAB>https://cmsdas.github.io/root-short-exercise/ # or j = tree.Jet[0], and then j.<TAB></pre> ---++ An example in detail Let's start from the example command that you may already have launched above: <pre class="command"> bambooRun -m tt5TeV.py:HelloWorld dilepton.yml -o ../test_out_1 </pre> this shows the main inputs: a YAML configuration file, with a list of samples to process, options for the plots, and some other pieces of informations (e.g. the integrated luminosity in the data files), and a python class (which inherits from a base class, making it a 'bamboo analysis module' that =bambooRun= can call certain methods on). It will proceed in two steps: * first all samples (defined in the YAML configuration file) are "processed": in the case of plots this means that the histograms needed by the =Plot= objects defined in the module are filled (it's also possible to do other things, e.g. output reduced trees for training a multivariate classifier, but we don't need that here) * then a "postprocessing" step is run, combining the different samples into plots. The plotIt tool ([[https://github.com/cp3-llbb/plotIt][code]], [[http://cp3-llbb.github.io/plotit/][documentation]]) is used for this, a standalone C++ executable that does just this. It needs the ROOT files and another YAML file, but bamboo can generate the latter (it is written to =plots.yml= in the output directory, but you probably won't need it) Since the first step usually is by far the most time-consuming, bamboo makes it possible to split the two: =bambooRun --onlypost= (with otherwise the same arguments) will run just the second step (very convenient if all you need to do is change some colour or axis title). The first step can also be run in an distributed way on HTCondor or slurm, since it's always only looking at one sample at a time (samples can also be split, and the results merged). This is done by adding the =--distributed=driver= option. %TWISTY{mode="div" showlink="To use lxbatch" hidelink="To use lxbatch..." firststart="hide" showimgright="%ICONURLPATH{toggleopen-small}%" hideimgright="%ICONURLPATH{toggleclose-small}%"}% ... bamboo needs to know a few things about it (the code is generic, such that it can be used also on other HTCondor and slurm clusters). The easiest way is by saving [[https://gitlab.cern.ch/cp3-cms/bamboo/-/blob/master/examples/cern.ini][this file]] as =~/.config/bamboorc=. You can add more HTCondor options (job flavour, maximum CPU time and memory etc.) under the corresponding section. %ENDTWISTY% The histograms for each sample will be written to =<output>/results/<sample>.root=. It is important to know that for MC no scaling with the cross-section and integrated luminosity is done: these, together with the number of generated events, are passed to plotIt to do the normalisation. This has some advantages (there is one place where the scaling is done, and you can change the cross-section in the plots without rerunning), but may be unexpected, or need a bit of extra care when making the datacards. Please have a look at =dilepton.yml=, the YAML configuration file. Most of it should be clear, or easy to guess, now. Now we can have a closer look at the python code of the module we just ran. We can for now ignore the first part of the file, before the =HelloWorld= class definition: it defines a base class to deal with the input samples (they are very similar to !NanoAODv4 with some processing, but there are a few differences that need to be taken into account). Let's start with the "hello world" example (a dimuon invariant mass plot): <sticky> %CODE{ lang="python" num="on"}% class HelloWorld(Nano5TeVHistoModule): def definePlots(self, t, noSel, sample=None, sampleCfg=None): from bamboo.plots import Plot, CutFlowReport, SummedPlot from bamboo.plots import EquidistantBinning as EqB from bamboo import treefunctions as op plots = [] muons = op.select(t.Muon, lambda mu : mu.pt > 20.) twoMuSel = noSel.refine("twoMuons", cut=[ op.rng_len(muons) > 1 ]) plots.append(Plot.make1D("dimu_M", op.invariant_mass(muons[0].p4, muons[1].p4), twoMuSel, EqB(100, 20., 120.), title="Dimuon invariant mass", plotopts={"show-overflow":False})) return plots %ENDCODE% </sticky> It's quite short, so let's go through line by line: first, we declare a class, our module, and inherit from the base class (if you go up a few lines, you see that it combines everything for histograms with the adjustments for our samples). Since it's a module to fill histograms, there's the =definePlots= which returns a list of =[[https://cp3.irmp.ucl.ac.be/~pdavid/bamboo/apiref.html#bamboo.plots.Plot][Plot]]= objects, as explained above; this will be called once for each sample (or once per job, if the sample is split over multiple batch jobs); the event loop will run in (mostly JIT-compiled) C++, that's what makes this fast. The arguments to =definePlots= are =self=, a reference to the module itself, =t=, a view of an event, =noSel=, the base selection, =sample=, the name of the sample, and =sampleCfg=, a dictionary with the fields defined in the YAML analysis config for this sample. Most of the imports speak for themselves: a =CutFlowReport= will print the numbers of events passing different selections, and a =SummedPlot= simply adds up the histograms from different plots (e.g. to make a combined jet pt plot from those for the dimuon and dielectron categories). =EquidistantBinning= holds an axis binning (number of bins, minimum, and maximum); =VariableBinning= also exists. =bamboo.treefunctions= defines a collection of helper functions, that can be used on objects retrieved from the event view, e.g. =op.invariant_mass(tree.Muon[0], tree.Muon[1])= the full list is available [[https://cp3.irmp.ucl.ac.be/~pdavid/bamboo/treefunctions.html][here]] (this is necessary because we're working with placeholder objects instead of the actual values; these functions record the operations instead of immediately performing them). Now things get more interesting: first we define an empty list that we can add things to (and return, eventually); this example only defines one plot, but a realistic analysis quickly grows to hundreds of plots that are produced in one go. %CODE{ lang="python" num=9}% muons = op.select(t.Muon, lambda mu : mu.pt > 20.) %ENDCODE% makes a list of muons that pass a selection. Adding more cuts is easy, by replacing =mu.pt > 20= with =op.AND(mu.pt > 20, ...)=, and =muons= can be used in the same way as =t.Muon=. This syntax may seem a bit strange at first, but it's very powerful (as an exercise, think about how you could select jets that are not within a certain angular distance from any electron or muon the answer is also in the bamboo [[https://cp3.irmp.ucl.ac.be/~pdavid/bamboo/index.html][documentation]]). The next line also shows something that was mentioned above: it defines a =Selection= of the events with two muons. %CODE{ lang="python" num=10}% twoMuSel = noSel.refine("twoMuons", cut=[ op.rng_len(muons) > 1 ]) %ENDCODE% and the next line defines a =Plot= with the dimuon invariant mass (which can safely be calculated for events with two muons): <sticky> %CODE{ lang="python" num=11}% plots.append(Plot.make1D("dimu_M", op.invariant_mass(muons[0].p4, muons[1].p4), twoMuSel, EqB(100, 20., 120.), title="Dimuon invariant mass", plotopts={"show-overflow":False})) %ENDCODE% </sticky> ---++ A few more ingredients you may need <p class="todo">finish writing this :-) (Pieter); tasks A, B and C (or the section that defines them) should link here; add github/gitlab links</p>
Edit
|
Attach
|
Watch
|
P
rint version
|
H
istory
:
r11
<
r10
<
r9
<
r8
<
r7
|
B
acklinks
|
V
iew topic
|
Raw edit
|
More topic actions...
Topic revision: r9 - 2020-09-02
-
PieterDavid
Log In
Main
Home
Index
Search
User Search
Changes
Notifications
RSS Feed
Documentation
Support
Webs
Main
Main Archive
Plugins
Sandbox for tests
Public webs
Public webs
ABATBEA
ACPP
ADCgroup
AEGIS
AfricaMap
AgileInfrastructure
ALICE
AliceEbyE
AliceSPD
AliceSSD
AliceTOF
AliFemto
ALPHA
Altair
ArdaGrid
ASACUSA
AthenaFCalTBAna
Atlas
AtlasLBNL
AXIALPET
CAE
CALICE
CDS
CENF
CERNSearch
CLIC
Cloud
CloudServices
CMS
Controls
CTA
CvmFS
DB
DefaultWeb
DESgroup
DPHEP
DM-LHC
DSSGroup
EGEE
EgeePtf
ELFms
EMI
ETICS
FIOgroup
FlukaTeam
Frontier
Gaudi
GeneratorServices
GuidesInfo
HardwareLabs
HCC
HEPIX
ILCBDSColl
ILCTPC
IMWG
Inspire
IPv6
IT
ItCommTeam
ITCoord
ITdeptTechForum
ITDRP
ITGT
ITSDC
LAr
LCG
LCGAAWorkbook
Leade
LHCAccess
LHCAtHome
LHCb
LHCgas
LHCONE
LHCOPN
LinuxSupport
Main
Medipix
Messaging
MPGD
NA49
NA61
NA62
NTOF
Openlab
PDBService
Persistency
PESgroup
Plugins
PSAccess
PSBUpgrade
R2Eproject
RCTF
RD42
RFCond12
RFLowLevel
ROXIE
Sandbox
SocialActivities
SPI
SRMDev
SSM
Student
SuperComputing
Support
SwfCatalogue
TMVA
TOTEM
TWiki
UNOSAT
Virtualization
VOBox
WITCH
XTCA
Welcome Guest
Login
or
Register
Cern Search
TWiki Search
Google Search
Main
All webs
Copyright &© 2008-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
or Ideas, requests, problems regarding TWiki? use
Discourse
or
Send feedback