2020 CMSDAS CERN Long Exercise: Higgs decaying to WW

Physics context

Facilitators

Andrea Massironi (INFN and CERN), Arun Kumar (Rice University)

Miscellaneous Notes

The color scheme of the Exercise is as follows:
  • Commands will be embedded in a grey box:
    cmsRun testRun_cfg.py
  • Output and screen printouts will be embedded in a green box:
    Begin processing the 3506th record. Run 1, Event 1270974, LumiSection 6358 at 16-Dec-2015 21:45:08.969 CST 
  • Code will be embedded in a pink box:
    if ( pt < 25 ) continue; 

The whole exercise runs on the lxplus accounts at CERN.

To login on lxplus:

ssh -X username@lxplus.cern.ch

Getting familiar with the HWW analysis

This exercise concerns the search for the Higgs boson decaying into WW, which in turn disintegrate leptonically into two charged leptons and two neutrinos. The final state is characterized by two leptons with moderate transverse momentum (pT of the order of 30 GeV) and significant MET.

Introduction slides.

Reference papers:

Setting up the environment

In this exercise we use ntuples similar to those used for HIG-19-002 (H->WW differential measurement, full Run II). The analysis shown is based on the Run II 2018 data and MC. The analysis will be "blinded" in the final signal phase space (more details later).

These ntuples are nanoAOD samples, on top of which variables, scale factors, weights, ... are calculated and added (the so-called post-processing steps), as part of the HWW analysis framework. This framework, called "latino", consists of a set of tools to post-process and to analyse the ntuples. For this exercise we will use the latino framework only to analyse the ntuples: no post-processing will be needed (it is time- and storage-consuming work). If you want to use the latino framework/ntuples in the future, please write an email to latinos-hep@cern.ch (if you have issues writing to this e-group, just drop us an email).

Install CMSSW to get access to ROOT

Get the correct CMSSW release:

cmsrel CMSSW_10_6_4
cd CMSSW_10_6_4/src/
cmsenv

Remember to always go into the CMSSW src folder and run cmsenv when you log in to a remote machine, to load all the proper settings.

The nanoAOD files are flat ROOT ntuples, so in principle only ROOT is needed to inspect and analyse them. The locations of the ntuples for the analysis are the following:

(on lxplus)

# MC background
ls  /eos/user/c/cmsdas/long-exercises/hww/Autumn18_102X_nAODv6_Full2018v6/MCl1loose2018v6__MCCorr2018v6__l2loose__l2tightOR2018v6/
ls  /eos/cms/store/group/phys_higgs/cmshww/amassiro/HWWNano/Autumn18_102X_nAODv6_Full2018v6/MCl1loose2018v6__MCCorr2018v6__l2loose__l2tightOR2018v6/
# MC signal
ls  /eos/user/c/cmsdas/long-exercises/hww/Autumn18_102X_nAODv6_Full2018v6/MCl1loose2018v6__MCCorr2018v6__l2loose__l2tightOR2018v6/
ls  /eos/cms/store/group/phys_higgs/cmshww/amassiro/HWWNano/Autumn18_102X_nAODv6_Full2018v6/MCl1loose2018v6__MCCorr2018v6__l2loose__l2tightOR2018v6/
# Data
ls  /eos/user/c/cmsdas/long-exercises/hww/Run2018_102X_nAODv6_Full2018v6/DATAl1loose2018v6__l2loose__l2tightOR2018v6/
ls  /eos/cms/store/group/phys_higgs/cmshww/amassiro/HWWNano/Run2018_102X_nAODv6_Full2018v6/DATAl1loose2018v6__l2loose__l2tightOR2018v6/

The two locations are redundant (they contain the same files).

For data, we have different directories for the different 2018 eras (A, B, C, D).

See PPD presentation

Inside each directory you will find different root files. Their name will tell you the primary dataset the ntuple is derived from. In this analysis we use the following primary datasets (and corresponding un-prescaled triggers):

Dataset                   Run range   HLT path
SingleMuon                A-D         HLT_IsoMu24
DoubleMuon                A-D         HLT_Mu17_TrkIsoVVL_Mu8_TrkIsoVVL_DZ_Mass3p8
MuonEG                    A-D         HLT_Mu23_TrkIsoVVL_Ele12_CaloIdL_TrackIdL_IsoVL, HLT_Mu12_TrkIsoVVL_Ele23_CaloIdL_TrackIdL_IsoVL_DZ
DoubleEG / EGamma         A-D         HLT_Ele23_Ele12_CaloIdL_TrackIdL_IsoVL
SingleElectron / EGamma   A-D         HLT_Ele32_WPTight_Gsf

Warning: datasets are not exclusive! An event can enter different datasets (e.g. an event with two high-pT muons will be in both the SingleMuon and DoubleMuon datasets). Arbitration is needed at analysis level to avoid double counting events. Note that the SingleElectron and DoubleEG PDs were merged into a single EGamma PD for the 2018 data taking.

For the MC, each root file has a name describing the physics process that is simulated in that sample. If you want to see the pairing, see the file Autumn18_102X_nAODv6.py.

Ntuple content

Each root file contains a TTree called "Events". The trees have many branches, each corresponding to a single physics variable. They may be:
  • single floats, for example variables characterising the whole event;
  • vectors of variables, for example variables related to a particle type, such as the pT of the electrons. In these cases an integer defining the size of the vector is also present, for example "nLepton". The variables are named as Collection_variable (e.g. Lepton_pt[0]) and the indexing is such that the objects are pT ordered (Object_pt[0] > Object_pt[1] > Object_pt[2] > ...).

The general strategy is the following:

  • events from the data are required to pass the trigger selections described above (with arbitration described in the following)
  • in the Monte Carlo simulations (MC) the trigger selection is not applied; it is emulated by weighting events with coefficients that mimic the trigger efficiency. Weights are also used to correct any residual differences observed between data and MC. All the weights used have to be multiplied together to produce a total event weight.

More information about the nanoAOD trees can be found in the documentation in mc102X_doc.html

Some variables have been added in the aforementioned post-processing, for example the combined variable "invariant mass of the two leading pT leptons", mll. If you want to learn more and discover how the variables are built, check here (use the "search" field of github):

Exercises

Part 1. Introduction

Part 1. A Introduction on template analysis

What is a template analysis, and how does it differ from a parametric analysis? See the slides for an example.

To do:

  • get a signal ttree (e.g. /eos/user/c/cmsdas/long-exercises/hww/Autumn18_102X_nAODv6_Full2018v6/MCl1loose2018v6__MCCorr2018v6__l2loose__l2tightOR2018v6/nanoLatino_GluGluHToWWTo2L2NuPowheg_M125__part0.root)
  • calculate the number of events you expect in 59.7/fb, Nsig, after you apply the cut "2 leptons with pt > 20"
"(Lepton_pt[0]>20. && Lepton_pt[1]>20.)"
  • get a background ttree (e.g. /eos/user/c/cmsdas/long-exercises/hww/Autumn18_102X_nAODv6_Full2018v6/MCl1loose2018v6__MCCorr2018v6__l2loose__l2tightOR2018v6/nanoLatino_WW-LO__part0.root)
  • calculate the number of events you expect in 59.7/fb, Nbkg
  • from Nsig and Nbkg calculate the expected significance Nsig/sqrt(Nbkg)
  • Given a number of data events measured Ndata, how do you measure the signal cross section?

Suggestions:

  • cross section of the sample is stored in the ROOT tree:
Events->Scan("Xsec")
  • Number of events we have analysed
Runs->Scan("genEventSumw_")
  • Number of events we have selected: how do we get this value?
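
Putting the suggestions together, a minimal PyROOT sketch of the yield calculation could look like this (it assumes Xsec is a cross section in pb and uses genWeight to account for possible negative-weight events; both assumptions should be checked):

import ROOT

# a minimal sketch, assuming Xsec is in pb and genWeight carries the sign of negative-weight events
f = ROOT.TFile.Open("/eos/user/c/cmsdas/long-exercises/hww/Autumn18_102X_nAODv6_Full2018v6/"
                    "MCl1loose2018v6__MCCorr2018v6__l2loose__l2tightOR2018v6/"
                    "nanoLatino_GluGluHToWWTo2L2NuPowheg_M125__part0.root")
events = f.Get("Events")
runs   = f.Get("Runs")

events.GetEntry(0)
xsec = events.Xsec                            # per-sample cross section, stored in every event

sumw = sum(r.genEventSumw_ for r in runs)     # total sum of generator weights of the analysed events

cut = "(Lepton_pt[0]>20. && Lepton_pt[1]>20.)"
h = ROOT.TH1D("h", "", 1, 0., 2.)
events.Draw("1>>h", "genWeight*(%s)" % cut, "goff")   # weighted number of selected events
nsel = h.GetSumOfWeights()

lumi = 59.7 * 1000.                           # 59.7/fb expressed in pb^-1
nsig = lumi * xsec * nsel / sumw
print("Expected signal events in 59.7/fb:", nsig)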

Introduction to an analysis.

The list of cuts applied defines a phase space.

How to build a likelihood:

Part 1. B Introduction on HWW analysis

Signal events are characterised by the presence of a Higgs boson decaying into two W bosons, which in turn decay leptonically into a charged lepton and a neutrino each. Therefore, the final state consists of two charged leptons of opposite charge and missing energy. Being produced by the decay of a W boson, the charged leptons are isolated. Tau leptons are unstable and decay either leptonically (in which case we still have an isolated electron or muon) or hadronically, and are hence difficult to reconstruct. Therefore, the analysis looks at final states with electrons and muons only. The charged leptons may or may not have the same flavour, so three different final states exist:

  • two isolated muons of opposite charge and MET,
  • two isolated electrons of opposite charge and MET,
  • one isolated muon and one isolated electron of opposite charge and MET.

Backgrounds are due to all the other processes that may produce the same final state as the signal. Two categories of background exist: the irreducible background is due to all the physics processes that generate exactly the same topology as the signal, while the reducible background is due to processes that mimic the signal because of imperfections in the reconstruction. This may happen for several reasons, typically related to the detector behaviour. For example, hadronic jets may be wrongly identified as charged leptons, or real charged leptons or hadronic jets may escape identification, originating fake MET as if they were neutrinos.

The actual processes that contribute to the background change depending on whether the two charged leptons have the same flavour: if they do, all processes where a Z boson is produced together with fake MET contribute, which makes the channel with different-flavour leptons the most sensitive one. The major source of reducible background is the production of a W boson with jets, where one jet is mis-identified as a second charged lepton.

Part 1. C How to fill histograms (and build datacards)

How to improve an analysis?

  • so far we have only 1 phase space, i.e. 1 cut-and-count experiment
  • what if you have 2 cut-and-count experiments, coming from two different phase spaces (for example, same-flavour and different-flavour charged lepton pairs)? Calculate: Nsig_1, Nsig_2, Nbkg_1, Nbkg_2 (and Ndata_1, Ndata_2)
  • an easy way to store this information is a TH1F, where the Y-axis holds the number of expected (Nsig, Nbkg) or observed (Ndata) events and each bin on the X-axis corresponds to one phase space.
    • therefore, the binning of any histogram showing a relevant variable may be interpreted as a subdivision of the analysis phase space into sub-spaces

To do:

  • get a signal ttree
  • calculate the number of events you expect in 59.7/fb, Nsig_x, in two orthogonal phase spaces, and fill a histogram with them (one histogram with 2 bins), as in the sketch below
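
A minimal sketch of the histogram bookkeeping (the two yields are placeholders for the numbers you compute as in Part 1.A):

import ROOT

# placeholders: replace with the yields you compute for the two orthogonal phase spaces
nsig_sf = 0.0   # e.g. same-flavour lepton pair
nsig_df = 0.0   # e.g. different-flavour lepton pair

h = ROOT.TH1F("yields", ";phase space;expected events in 59.7/fb", 2, 0., 2.)
h.GetXaxis().SetBinLabel(1, "same flavour")
h.GetXaxis().SetBinLabel(2, "different flavour")
h.SetBinContent(1, nsig_sf)
h.SetBinContent(2, nsig_df)
h.SaveAs("signal_yields.root")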

Part 2. HWW setup

Part 2. A Install HWW setup

Install framework:

cmsrel CMSSW_10_6_4
cd CMSSW_10_6_4/src/
cmsenv

git clone --branch 13TeV git@github.com:latinos/setup.git LatinosSetup

source LatinosSetup/SetupShapeOnly.sh

scramv1 b -j 20

Do you have problems with the git clone? Check whether you have set up your GitHub account (SSH keys).

Part 2. B Trees content

In addition to the standard variables, several combined kinematic variables that are useful for the analysis are pre-computed in the ntuples. A set of those variables that are likely to be needed in this exercise is listed in the table below.

name     description
ptll     transverse momentum of the sum of the two leading lepton 4-momenta
mll      invariant mass of the sum of the two leading lepton 4-momenta
dphill   Δφ between the two leading leptons
drll     ΔR = sqrt(Δη² + Δφ²) between the two leading leptons
mth      transverse mass of the candidate made by the two leading leptons and the MET, mth = sqrt(2 ptll MET (1 - cos Δφ(ll, MET)))

ROOT tips: how can you discover the content of a tree?

ttree->MakeClass()

Then open the file with ".h" extension that has been created.
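
Besides MakeClass, a couple of quick PyROOT alternatives to inspect the content of a tree (file name used as an example):

import ROOT

f = ROOT.TFile.Open("nanoLatino_GluGluHToWWTo2L2NuPowheg_M125__part0.root")
t = f.Get("Events")

t.Print()                                        # list all branches with their types
t.Scan("mll:ptll:dphill:drll:mth", "", "", 5)    # dump a few variables for the first 5 entries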

Part 2. C List of signals

There are different production mechanisms for the Higgs boson, and different decays. This analysis concentrates on the ggF (gluon fusion, also shortened to ggH) and VBF (Vector Boson Fusion) production mechanisms, and on the WW decay channel. However, other production mechanisms and other decay modes can also pass our selections (for example a tau lepton decaying leptonically into a muon could be selected). Most production mechanisms and the most important decays are included in the list of samples.

Want to learn more about Higgs productions and decays?

Part 2. D List of backgrounds

Main backgrounds:

  • WW
  • top pair production
  • Drell-Yan (DY)
  • W + jets (a.k.a. non-prompt)
  • V + virtual photon (Vg*)

It's important to define control regions to check that our simulation describes the backgrounds well.

Part 2. E Make plots introduction

Install configurations:

 git clone git@github.com:amassiro/PlotsConfigurationsCMSDAS2020CERN.git

Then a small customization is needed:

cd LatinoAnalysis/Tools/python
cp userConfig_TEMPLATE.py userConfig.py
# edit the userConfig.py so the paths correspond to a directory where you have write access (this will be used to write log files)
cd ../../../
scram b

There are 6 main python files in each configuration. A configuration is the setup that defines a set of phase spaces, the variables to look at, the samples, ...

  • configuration.py
  • structure.py
  • cuts.py
  • samples.py
  • variables.py
  • nuisances.py

In addition there are two python configuration files, one for plotting and one to make life easier when defining cuts and weights.

  • plot.py
  • aliases.py

Let's look at them one by one.

configuration.py
Main configuration file; it essentially contains links to the other configuration files. It also contains the path for the output of mkShapesMulti.py (outputDir) and that of mkPlot.py (outputDirPlots). It contains the luminosity (lumi), a factor by which each MC sample is multiplied.
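
As a sketch, a configuration.py typically looks like the following (the values mirror the DY example whose printout is shown later on this page; the structureFile and outputDirPlots names are assumptions to be checked against the actual configuration):

# illustrative configuration.py (adjust paths and tag to your setup)
treeName       = 'Events'
tag            = 'DY2018_final'

samplesFile    = 'samples.py'
cutsFile       = 'cuts.py'
variablesFile  = 'variables.py'
nuisancesFile  = 'nuisances.py'
structureFile  = 'structure.py'
plotFile       = 'plot.py'

lumi           = 59.7               # every MC sample is multiplied by this luminosity (in /fb)
outputDir      = 'rootFile'         # output of mkShapesMulti.py
outputDirPlots = 'plotDY_test1'     # output of mkPlot.py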

samples.py
This file defines the samples, i.e. the root files the code should run on, and the grouping of files into processes. The first important piece of information in this file is the directory where the ntuples are located (the directory variable). This file, like the other configuration files, is executed by mkShapesMulti.py. mkShapesMulti.py defines in its code an empty dictionary called samples, which gets filled when mkShapesMulti.py loads and executes samples.py.

Each instance in the dictionary is expected to be formatted as follows:

samples['MYPROCESS'] = {
    'name'        : [],    # python list of strings containing the file names corresponding to the process "MYPROCESS"
    'weight'      : '1.',  # a string corresponding to the weight to be applied to every event
    'weights'     : [],    # OPTIONAL: a list of weights to be applied to each file specified in 'name' above. It is expected to have the same length as 'name'. Can be missing
    'isData'      : [],    # OPTIONAL: vector of '0' or '1' specifying whether each file is MC (0) or data (1). Default is MC. If 'isData': ['all'], all files are assumed to be data
    'FilesPerJob' : 2,     # OPTIONAL: when running in batch, how many files per job
}

Do you have problems in reading a complex python file? Try to use the code easyDescription.py

easyDescription.py   --inputFileSamples=samples.py   --outputFileSamples=samples_unrolled.py

Also notice that instead of typing each file name we use a function, getSampleFiles, which retrieves all files matching a given string (sometimes a single dataset is split into several files with names like latino_DYJetsToLL_M-50__partXXX.root; a call to getSampleFiles(directory, 'DYJetsToLL_M-50') will return all of them).
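
For example, a plausible samples.py entry (the weight expression is illustrative; the actual configurations define the appropriate one):

# illustrative entry: group all DYJetsToLL_M-50 parts into one process called 'DY'
samples['DY'] = {
    'name'        : getSampleFiles(directory, 'DYJetsToLL_M-50'),
    'weight'      : 'XSWeight*SFweight2l*LepWPCut*METFilter_MC',   # illustrative weight expression
    'FilesPerJob' : 2,
}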

cuts.py
This file defines the selections to be applied to all files in samples.py. It defines a variable called supercut, which is a selection that is always applied, and then defines several regions as entries of the cuts dictionary. Also in this case, similarly to the samples dictionary in samples.py, an empty cuts dictionary is defined in the code of mkShapesMulti.py.

A phase space region is then defined as:

cuts['REGION_NAME'] = 'string defining the cut'

For both the supercut and the string defining the cut of each region one should use C++ syntax (as one would do with TTree::Draw).

It is also possible to define categories (cuts in cuts). This helps in making the code for plotting faster.

See for example here.
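
A minimal sketch of such a cuts.py (the strings below are copied from the Zee/Zmm printout of mkShapesMulti.py shown later in this exercise, slightly shortened):

# selection applied to every region
supercut = 'mll>60 && Lepton_pt[0]>20 && Lepton_pt[1]>10 && abs(Lepton_eta[0])<2.5 && abs(Lepton_eta[1])<2.5'

# two phase-space regions
cuts['Zee'] = '(Lepton_pdgId[0]*Lepton_pdgId[1] == -11*11) && Lepton_pt[0]>25 && Lepton_pt[1]>13 && mll>60 && mll<120'
cuts['Zmm'] = '(Lepton_pdgId[0]*Lepton_pdgId[1] == -13*13) && mll>60 && mll<120'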

variables.py
This file defines the variables to plot. For each sample in samples.py and for each cut in cuts.py, all the variables in this file are plotted.

An empty variables dictionary is defined in the code of makeShapesMulti.py, which is filled in the variables.py file. The syntax is:

variables['VARIABLE']  = {
    'name'  : 'expression',      # variable expression as one would use in TTree::Draw. A 2D expression also works, e.g. var1:var2
    'range' : range,             # anything that a TH1 can digest can be put here:
                                 # a 3-valued tuple is interpreted as (nbins, xmin, xmax)
                                 # a 6-valued tuple is interpreted as (nbinsx, xmin, xmax, nbinsy, ymin, ymax)
                                 # a ([list]) is interpreted as a vector of bin edges
                                 # a ([list],[list],) is interpreted as a 2D vector of bin edges (mind the comma before the closing ")")
    'xaxis' : 'DR_{ll}',         # x axis title, human readable, what goes into h->GetXaxis()->SetTitle()
    'fold'  : NUMBER,            # 0 -> no underflow/overflow folding; 1 -> fold underflow into the first bin; 2 -> fold overflow into the last bin; 3 -> fold both
    'divideByBinWidth' : VALUE,  # OPTIONAL: whether to divide (1) or not (0) the bin content by the bin width (for variable-bin-size histograms). Default is 0
}
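
As a concrete illustration, entries matching the DY example printed later by mkShapesMulti.py could read (axis titles and fold settings are illustrative):

variables['mllpeak'] = {
    'name'  : 'mll',
    'range' : (20, 80, 100),
    'xaxis' : 'm_{ll} [GeV]',
    'fold'  : 0,
}

variables['ptll'] = {
    'name'  : 'ptll',
    'range' : (20, 0, 200),
    'xaxis' : 'p_{T}^{ll} [GeV]',
    'fold'  : 0,
}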

structure.py
This file is mostly needed when producing datacards with mkDatacards.py script described below. It mainly defines what is signal and what is data. Optionally it can instruct mkShapesMulti.py not to run a particular selection on a particular sample.

structure['NAME']  = { # NAME should match the names in samples.py
     'isSignal' : 0, # 0/1 with obvious meaning
     'isData'   : 0, # 0/1 with obvious meaning 
     'removeFromCuts' : []  #OPTIONAL list of cuts that should not be run for this sample (default is empty)
}

plot.py
This file defines the style of the plots. It is used by mkPlot.py. Similarly to the other configuration files, it is based on dictionaries that are created empty in mkPlot.py. There are two relevant dictionaries:

  • plot: has one entry per sample in samples.py, defines color, whether it is signal or data, whether the plot should be scaled or whether individual cuts should be scaled for that sample.
  • groupPlot: defines the actual content of each plot. It allows grouping samples together, if for example one wants to show ggWW and WW as a single contribution in the plots. It also defines a human-readable name for each group to go into the legend.

Note that we could define just "plot" and not "groupPlot", as shown here.

Each entry in the plot dictionary has the following syntax:

plot['NAME']  = { #same name as in samples.py  
    'color': 418,    # kGreen+2
    'isSignal' : VALUE, # 0 -> background: all samples with this flag set to 0 are plotted stacked. 
                                # 1 -> is signal: this gets plotted both stacked and superimposed
                                # 2 -> is signal: this gets plotted only superimposed, not stacked. 
    'isData'   : 0, #0/1 with obvious meaning. It is not a duplicate of structure.py. This is used to handle blinding. See below.
    'isBlind' : 0, # if set to 1, all samples with isData = 1 are not shown.
    'scale'    : 1.0, # OPTIONAL whether to scale the sample by a fixed amount
    'cuts'  : {       #OPTIONAL: whether to plot this sample only for specified cuts and applying the specified scale factor.
                       'cut name'      : scale value ,
     },
}

Each entry in the groupPlot dictionary has the following syntax:

groupPlot['GROUPNAME']  = {  
                  'nameHR' : "NAME THAT GOES IN THE LEGEND",
                  'isSignal' : 0, #overrides entries in plot
                  'color'    : 617,   #overrides entries in plot
                  'samples'  : [] # list of samples to group under the same name in the legend
}
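
For example, to show WW and ggWW as a single entry in the legend one could write (sample names and color index are illustrative):

# illustrative grouping: ggWW and WW drawn as a single 'WW' contribution
groupPlot['WW'] = {
    'nameHR'   : 'WW',
    'isSignal' : 0,
    'color'    : 851,      # illustrative ROOT color index
    'samples'  : ['WW', 'ggWW'],
}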

Finally plot.py defines the following set of additional variables:

legend['lumi'] = 'L = 42.0/fb'
legend['sqrt'] = '#sqrt{s} = 13 TeV'
with obvious meaning. These are only used for display on the plot (i.e. the luminosity value used to scale the MC is not this one, but the one in configuration.py).

Part 2. F Zmumu and Zee

First simple phase space: Z->mumu and Z->ee

cd PlotsConfigurationsCMSDAS2020CERN/ControlRegions/DY/

To create the histograms (running locally):

mkShapesMulti.py --pycfg=configuration.py --batchSplit=Samples,Files

The output will be something like the following:


--------------------------------------------------------------------------------------------------

   ___|   |                               \  |         |
 \___ \   __ \    _` |  __ \    _ \      |\/ |   _` |  |  /   _ \   __|
       |  | | |  (   |  |   |   __/      |   |  (   |    <    __/  |
 _____/  _| |_| \__,_|  .__/  \___|     _|  _| \__,_| _|\_\ \___| _|
                       _|

--------------------------------------------------------------------------------------------------

 loadOptDefaults::pycfg =  None
 - new default value: treeName = Events
 - new default value: plotFile = plot.py
 - new default value: samplesFile = samples.py
 - new default value: nuisancesFile = nuisances.py
 - new default value: cutsFile = cuts.py
 - new default value: variablesFile = variables.py
 - new default value: tag = DY2018_final
 - new default value: lumi = 59.7
 - new default value: outputDir = rootFile
 configuration file =  configuration.py
 treeName           =  Events
 lumi =                59.7
 inputDir =            ./data/
 outputDir =           rootFile
batchSplit:  ['Samples', 'Files']
~~~~~~~~~~~ Running mkShape in normal mode...
======================
==== makeNominals ====
======================
 supercut =  mll>60 && Lepton_pt[0]>20 && Lepton_pt[1]>10 && (nLepton>=2 && Alt$(Lepton_pt[2],0)<10) && abs(Lepton_eta[0])<2.5 && abs(Lepton_eta[1])<2.5
 outputFileName =  rootFile/plots_DY2018_final.root

  
cut =  Zee  ::  (Lepton_pdgId[0] * Lepton_pdgId[1] == -11*11)                    && Lepton_pt[0]>25 && Lepton_pt[1]>13                  && mll>60 && mll<120                
cut =  Zmm  ::  (Lepton_pdgId[0] * Lepton_pdgId[1] == -13*13)                    && mll>60 && mll<120                

  
    variable = nvtx :: PV_npvsGood
      range: (20, 0, 100)
    variable = mllpeak :: mll
      range: (20, 80, 100)
    variable = ptll :: ptll
      range: (20, 0, 200)

  

  

Is it taking a long time? Not finished yet?

Kill the program (ctrl+c) ... it would take ages to finish.

Now let's try using HTCondor to run in parallel: since you are doing many ttree->Draw calls (one for each variable, for each cut (phase space), and for each sample), you can run them all at the same time and then collect the results. The ROOT files and the number of events to be analysed (data and MC) are quite large, and the lxbatch (condor) cluster is here to help.

    mkShapesMulti.py --pycfg=configuration.py --doBatch=1 --batchSplit=Samples,Files --batchQueue=workday

Check if jobs are done by doing:

    condor_q

You get as output something like:

-- Schedd: bigbird15.cern.ch : <188.184.90.241:9618?... @ 09/18/20 12:00:30
OWNER    BATCH_NAME     SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
amassiro ID: 9607321   9/14 09:29     28      7      _     35 9607321.0-28
amassiro ID: 9607450   9/14 09:33     55     11      1     67 9607450.0-64

where you can see if any job is still running, idle, or done.

You can also check if the root files with the histograms are available:

ls rootFile/

Now add up the root files (literally merge them into a single file), but first move the old file out of the way (if any). You need to add the root files because each job on condor has created a root file with some of the histograms, and they all need to be merged together.

    mv rootFile/plots_DY2018_final.root    rootFile/plots_DY2018_final_localVersion.root
    mkShapesMulti.py --pycfg=configuration.py --doHadd=1 --batchSplit=Samples,Files

If this is too slow, try to hadd manually (TAG is the one defined in configuration.py):

    cd rootFileTAG
    hadd -j 5 -f plots_TAG_ALL.root plots_TAG_ALL_* 

Let's look at the plots. Once the histograms are ready in the root file, we can combine them into nice TCanvases. You can do it by hand ... or use some scripts already prepared. If you are interested in the actual code, have a look at mkPlot.py. To do so:

mkPlot.py --pycfg=configuration.py --inputFile=rootFile/plots_DY2018_final.root

and look at the folder:

ls plotDY_test1/

Questions:

  • why is the folder name "plotDY_test1" ?
  • looking at the plot, do you see any discrepancy data vs MC?

We have considered only DY as MC: are there other samples with 2 electrons or 2 muons in the final state?

We have run only on Run2018A so far (14.00/fb); let's scale up to the full 2018 dataset too.

    mkShapesMulti.py --pycfg=configuration_complete.py --doBatch=1 --batchSplit=Samples,Files --batchQueue=workday 

once the output is ready:

    mkShapesMulti.py --pycfg=configuration_complete.py --doHadd=1 --batchSplit=Samples,Files

and plot:

    mkPlot.py --pycfg=configuration_complete.py --inputFile=rootFile/plots_DY2018_final_complete_v6.root

Add new variables and plot.

... it's playtime.

Queue Flavours

The following table lists the queue types of HTCondor (link) and of LSF (link):

HTCondor flavour   max duration   LSF queue
espresso           20 min         8nm
microcentury       1 h            1nh
longlunch          2 h            8nh
workday            8 h            1nd
tomorrow           1 d            2nd
testmatch          3 d            1nw
nextweek           1 w            2nw

The longer the maximum duration, the lower your priority. Use the queues wisely.

Part 3. Scale factors and data driven backgrounds

What are scale factors, and how are they defined?

Event weights and flags

MC and data ntuples have several weights. MC weights are needed first and foremost to normalize the MC sample to the luminosity of the data. Also, event weights are computed to take into account the different scale factors that we use to improve the description of the data. In data we have flags for the different trigger bits (basically events that do not pass the trigger are weighted 0, those that pass are weighted 1). The following table lists the most important weights:

name (availability): description

XSWeight (MC only): weighting MC events by this factor normalizes the MC to 1/fb. Thus, to normalize to the data luminosity (59.7/fb in 2018) you have to weight MC events by XSWeight*59.7. Notice that XSWeight takes into account the effect of negative-weight events (sometimes present in NLO MC simulations), i.e. XSWeight can sometimes be negative.
puWeight (MC only): equalizes the pile-up profile in MC to that in data. Most of the time the simulation is produced before, or at least partly before, the data taking, so the PU profile in the MC is a guess of what will happen in data. This weight is the ratio of the PU profile in data to the guess used when producing the MC.
TriggerEffWeight_2l (MC only): the trigger efficiency (a function of the pT and η of the leptons). As mentioned above, we do not apply the trigger in MC (for a similar reason as for the PU profile: the trigger menu used in MC is a guess and cannot reflect exactly what is used in data); rather, we weight events directly with the trigger efficiency, which we measure for the soup of triggers that we use.
SFweight2l (MC only): the product puWeight*TriggerEffWeight_2l. Details in link
ttHMVA_SF_2l (MC only): the combined Id/Iso scale factor of the two leading leptons, for the specified electron and muon working points. See here for details.
LepWPCut (data and MC): tells you whether the two leading leptons pass the specified electron/muon working points. See here for details.
METFilter_MC (MC only): a flag that tells whether the event passes a series of filters devised by the JetMET POG to reject anomalous MET events (MC version)
METFilter_DATA (data only): a flag that tells whether the event passes a series of filters devised by the JetMET POG to reject anomalous MET events (data version)
Trigger_ElMu (data only): flag that is 1 for events passing the triggers chosen for the MuonEG dataset
Trigger_dblMu (data only): flag that is 1 for events passing the triggers chosen for the DoubleMuon dataset
Trigger_sngMu (data only): flag that is 1 for events passing the triggers chosen for the SingleMuon dataset
Trigger_dblEl (data only): flag that is 1 for events passing the triggers chosen for the DoubleEG dataset
Trigger_sngEl (data only): flag that is 1 for events passing the triggers chosen for the SingleElectron dataset
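
As an illustration of how these weights are used, one can draw a variable from an MC ntuple with a total event weight built from them (the file name and this particular combination are plausible examples; the exact recipe is defined in the samples.py of each configuration):

import ROOT

f = ROOT.TFile.Open("nanoLatino_WW-LO__part0.root")   # illustrative file name
t = f.Get("Events")

# plausible total MC weight: normalization to 59.7/fb times PU, trigger, lepton SFs and MET filter
mcweight = "XSWeight*59.7*SFweight2l*ttHMVA_SF_2l*LepWPCut*METFilter_MC"

h = ROOT.TH1F("hmll", ";m_{ll} [GeV];events / 59.7 fb^{-1}", 40, 0., 200.)
t.Draw("mll>>hmll", "%s*(Lepton_pt[0]>25 && Lepton_pt[1]>13)" % mcweight, "goff")
print("Weighted yield:", h.GetSumOfWeights())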

Part 3. A Scale factors and where they have been measured

Scale factors (SF) are corrections applied to MC samples to fix imperfections in the simulation. These corrections are derived in data in control regions, meaning in regions where the signal is not present.

The mis-modelling can originate from the hard scattering (theory uncertainty), from the simulation of the detector response to particles (Geant4), or from the evolution of the data-taking conditions in time (the MC has only one set of conditions), such as noise and radiation damage effects in the detectors.

The SF can be:

  • object based scale factors
  • event based scale factors

Object based SF are for example:

  • lepton identification and isolation efficiency: the identification criteria for leptons could be mis-simulated, so a scale factor is applied
  • jet related scale factors, such as b-tag efficiency mis-modelling

Event based SF are for example:

  • normalization of a sample, for example if a new NNLO cross section is available, or if a background normalization is found to be mis-modelled in a control region (for a background whose theoretical uncertainty is large)
  • trigger efficiency. The trigger could be mis-modelled. We measure the trigger efficiency "per leg" of the triggers considered (single-lepton and double-lepton) and combine the efficiencies to get the per-event one. We do not require the trigger in the simulation, but apply the efficiency directly to MC events

A typical phase space where SFs are measured is Z->mumu and Z->ee events, by means of the so-called tag-and-probe method. An additional exercise on the tag-and-probe method for electron identification scale factors is given at the end of this twiki.

Effect of the scale factors:

cd PlotsConfigurationsCMSDAS2020CERN/ControlRegions/DY/

Plot with and without the scale factors for:

  • trigger
  • lepton id/iso
  • pile up

Look at the lepton pt1 and pt2, eta, number of jets, number of vertices, ...

Usual steps:

mkShapesMulti.py --pycfg=configuration.py --doBatch=1 --batchSplit=Samples,Files --batchQueue=espresso
mkShapesMulti.py --pycfg=configuration.py --doHadd=1 --batchSplit=Samples,Files
mkPlot.py --pycfg=configuration.py --inputFile=rootFile/plots_DY2018_final_complete_v6.root

Part 3. B Data driven backgrounds

Non-prompt and same sign control region.

Questions:

  • what is the non-prompt contribution?
  • why have we chosen same sign to isolate "non-prompt" background?

cd PlotsConfigurationsCMSDAS2020CERN/ControlRegions/NonPrompt/

Run:

mkShapesMulti.py --pycfg=configuration.py --doBatch=1 --batchSplit=Samples,Files --batchQueue=workday 
mkShapesMulti.py --pycfg=configuration.py --doHadd=1 --batchSplit=Samples,Files
mkPlot.py --pycfg=configuration.py --inputFile=rootFile/plots_NonPrompt2018_v6.root 

Questions:

  • why do we have Vg* as background in the same sign?
  • can you identify a different phase space to isolate Vg*?

Understanding the configuration:

  • try easyDescription.py if things are not clear

Part 3. C Data driven normalization of backgrounds

Top control region

cd PlotsConfigurationsCMSDAS2020CERN/ControlRegions/Top/

Question:

  • how can we isolate the top background?

Run:

mkShapesMulti.py --pycfg=configuration.py --doBatch=1 --batchSplit=Samples,Files --batchQueue=workday 
mkShapesMulti.py --pycfg=configuration.py --doHadd=1 --batchSplit=Samples,Files
mkPlot.py --pycfg=configuration.py --inputFile=rootFile/plots_Top2018_v6.root 

Questions:

  • how much do we trust MC predictions? ("rateparam")
  • how can we use control regions to constrain background contamination estimation in signal phase space?
  • why are we applying the mth cut? Any guess?

DY->tautau: why is this a background?

Interplay between different analyses (e.g. H->tautau) and the importance of orthogonality of selections. This is where the coordination between conveners (HWW and Htautau) is essential.

Questions:

  • what about WW background?

Part 4. Nuisances

Objects, experimental and theory uncertainties

nuisances.py
Technical definition of nuisances.

There are several sources of systematic uncertainty in this analysis. They are modelled as nuisance parameters in the ML fit of the signal yield. Some of the systematic uncertainties affect only the yield of a given process. One uncertainty of this kind is the luminosity of the data, which is modelled in the ML fit with a log-normal prior with a sigma of 2.5%. Other nuisances change both the shape and the yield of a given template. For example, an uncertainty on the lepton momentum scale can completely change the way an event is classified, maybe changing the pT ordering of the leptons, changing the MET, etc. For these uncertainties mkShapesMulti.py can be instructed to produce an alternative shape for each sample, corresponding to a one-sigma variation of the relevant quantity.

The systematic uncertainties are made known to mkShapesMulti.py via the nuisances.py configuration file. It goes without saying that most of the work when designing an analysis goes into understanding the systematic uncertainties.

Each entry in the nuisances.py file has the following structure:

nuisances['lumi']  = {
    'name'    : 'lumi_13TeV',
    'samples' : {
        # the value of the nuisance per sample
        'ggH_hww' : '1.025',
    },
    'type'    : 'lnN',   # can also be "shape": in that case mkShapesMulti.py will produce the varied shapes according to one of the two possible kinds below. It can also be "rateParam", in which case a uniform prior is used.
    'kind'    : 'KIND',  # OPTIONAL, relevant only for type "shape". It can be "weight" -> use the specified weight to reweight events, or "tree" -> use the provided alternative trees
    'cuts'    : [],      # OPTIONAL list of cuts that are affected by this nuisance, as defined in cuts.py
}
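
As an illustration, a shape nuisance of kind "weight" could look like the following (the up/down weight expressions reuse the LHE scale weights mentioned in Part 4 below; the exact indices and the [up, down] list convention should be checked in the existing configurations):

# illustrative shape nuisance reweighting events with two extreme QCD scale choices
nuisances['QCDscale'] = {
    'name'    : 'QCDscale_ggH',
    'samples' : {
        'ggH_hww' : ['LHEScaleWeight[8]', 'LHEScaleWeight[0]'],   # [up, down] weight expressions (assumed convention)
    },
    'type'    : 'shape',
    'kind'    : 'weight',
}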

Technical implementation in combine: there are mainly two kinds of nuisances implemented in the datacard (the text file that defines the likelihood):

  • lnN: log-normal prior, defined manually in the datacard. "lnN 1.12" corresponds to about 12% uncertainty on the expected yield
  • shape: provide alternative histograms scaling up/down the corresponding nuisance. It is equivalent to many correlated lnN nuisances in different phase spaces (the bins of the histogram), but it is easier to define it with a histogram

Technical implementation in the framework: both kinds of the aforementioned nuisances can be defined in the following ways:

  • an explicit lnN value (see the "lumi" example above)
  • weights: reweight the events with a different weight in the tree and create new histograms (or transform them into a lnN)
  • alternative trees: as mentioned, in some cases we cannot simply reweight the events since the kinematics changes, and alternative trees are used to define the up/down variations.

Part 4. Theory Nuisances

Theory nuisances.

Open a root file of signal and do a tree->Draw with different weights for scale variation.

Nuisances:

  • scale choice ('LHEScaleWeight[8]', 'LHEScaleWeight[0]')
  • PDF uncertainty
  • PS and UE simulation (weights and alternative samples): 'PSWeight[0]', 'PSWeight[1]', 'PSWeight[2]', 'PSWeight[3]'
  • higher order corrections (electroweak)
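
A minimal PyROOT sketch of the tree->Draw comparison suggested above (file name and binning are illustrative; which LHEScaleWeight indices correspond to which muR/muF choice should be checked in the nanoAOD documentation):

import ROOT

f = ROOT.TFile.Open("nanoLatino_GluGluHToWWTo2L2NuPowheg_M125__part0.root")
t = f.Get("Events")

# nominal plus two extreme scale choices, drawn as alternative event weights
variations = {
    "nominal"    : "1.",
    "scale_A"    : "LHEScaleWeight[8]",
    "scale_B"    : "LHEScaleWeight[0]",
}

hists = {}
for name, w in variations.items():
    h = ROOT.TH1F("mll_" + name, ";m_{ll} [GeV];a.u.", 40, 0., 200.)
    t.Draw("mll>>" + h.GetName(), "XSWeight*%s" % w, "goff")
    hists[name] = h

for name, h in hists.items():
    print(name, "integral:", h.Integral())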

The trick of using higher order calculations and uncertainties, provided our MC is within the uncertainties.

Full example of ggH uncertainty and number of jets description. Scale variation could underestimate the uncertainty! Talk with your theory colleague, also to give feedback about what we (CMS) need!

Example of simple lnN from external calculation. Example of scale variation with weights. Example of "envelope" uncertainty (e.g. pdf uncertainty)

How these can be automatically implemented:

cd PlotsConfigurationsCMSDAS2020CERN/Nuisances/TheoryScale

Run:

mkShapesMulti.py --pycfg=configuration.py --doBatch=1 --batchSplit=Samples,Files --batchQueue=workday 
mkShapesMulti.py --pycfg=configuration.py --doHadd=1 --batchSplit=Samples,Files
mkPlot.py --pycfg=configuration.py --inputFile=rootFile/plots_TheoryNuisance.root
mkDatacards.py --pycfg=configuration.py --inputFile=rootFile/plots_TheoryNuisance.root

This is the first time we run mkDatacards.py!

So far, just run it to get some histograms that we need ... explanation later.

Check the effect of the nuisances on the distributions:

    cd ../../../LatinoAnalysis/ShapeAnalysis/test/draw

    python DrawNuisancesAll.py \
     --inputFile ../../../../PlotsConfigurationsCMSDAS2020CERN/Nuisances/TheoryScale/datacards/hww2l2v_13TeV_em_0j/ptll/shapes/histos.root  \
     --outputDirPlots ../../../../PlotsConfigurationsCMSDAS2020CERN/Nuisances/TheoryScale/df_nuisance  \
     --nuisancesFile ../../../../PlotsConfigurationsCMSDAS2020CERN/Nuisances/TheoryScale/nuisances.py  \
     --samplesFile   ../../../../PlotsConfigurationsCMSDAS2020CERN/Nuisances/TheoryScale/samples.py \
     --cutName hww2l2v_13TeV_em_0j

Only "shape" nuisances are reported in the plots.

Question:

  • why are the distributions like that?

Part 4. Experimental Nuisances

Use lepton scale as a prototype.

Open 3 root files of signal and do a tree->Draw with:

  • nominal
  • scale up
  • scale down

Important:

  • talk to POG colleagues
  • read POG twiki pages
  • work in POG and DPG

Example:

Don't be shy: participate in POG and DPG meetings, ask questions, send emails to HyperNews.

Part 4. A Lepton scale

How these can be automatically implemented:

cd PlotsConfigurationsCMSDAS2020CERN/Nuisances/LeptonScale

Example: electron scale uncertainty

Run:

mkShapesMulti.py --pycfg=configuration.py --doBatch=1 --batchSplit=Samples,Files --batchQueue=workday 
mkShapesMulti.py --pycfg=configuration.py --doHadd=1 --batchSplit=Samples,Files
mkPlot.py --pycfg=configuration.py --inputFile=rootFile/plots_LeptonScaleNuisances.root
mkDatacards.py --pycfg=configuration.py --inputFile=rootFile/plots_LeptonScaleNuisances.root

Check the effect of the nuisances on the distributions:

    cd ../../../LatinoAnalysis/ShapeAnalysis/test/draw

    python DrawNuisancesAll.py \
     --inputFile ../../../../PlotsConfigurationsCMSDAS2020CERN/Nuisances/LeptonScale/datacards/df/ptll/shapes/histos_df.root  \
     --outputDirPlots ../../../../PlotsConfigurationsCMSDAS2020CERN/Nuisances/LeptonScale/df_nuisance  \
     --nuisancesFile ../../../../PlotsConfigurationsCMSDAS2020CERN/Nuisances/LeptonScale/nuisances.py  \
     --samplesFile   ../../../../PlotsConfigurationsCMSDAS2020CERN/Nuisances/LeptonScale/samples.py \
     --cutName df

Question:

  • Roughly how large do you expect the impact of this nuisance to be?
  • Could we simplify the "datacard"? (See the option 'AsLnN': '1')

Part 4. A Jet energy scale

NB: this is a tricky business, and what follows is a (reasonable) approximation. Different sources of uncertainty contribute to the jet energy scale. In this case an "envelope" approach has been adopted, scaling the jet energy scale coherently up/down. This is what was done in Run 1 and for most of Run 2.

Nowadays, with the increased luminosity, we are sensitive to smaller effects, and all the sources affecting the jet energy scale are treated separately.

cd PlotsConfigurationsCMSDAS2020CERN/Nuisances/JetScale

Run:

mkShapesMulti.py --pycfg=configuration.py --doBatch=1 --batchSplit=Samples,Files --batchQueue=workday 
mkShapesMulti.py --pycfg=configuration.py --doHadd=1 --batchSplit=Samples,Files
mkPlot.py --pycfg=configuration.py --inputFile=rootFile/plots_JetScaleNuisances.root
mkDatacards.py --pycfg=configuration.py --inputFile=rootFile/plots_JetScaleNuisances.root

Check the effect of the nuisances on the distributions:

    cd ../../../LatinoAnalysis/ShapeAnalysis/test/draw

    python DrawNuisancesAll.py \
     --inputFile ../../../../PlotsConfigurationsCMSDAS2020CERN/Nuisances/JetScale/datacards/df/ptll/shapes/histos_df.root  \
     --outputDirPlots ../../../../PlotsConfigurationsCMSDAS2020CERN/Nuisances/JetScale/df_nuisance  \
     --nuisancesFile ../../../../PlotsConfigurationsCMSDAS2020CERN/Nuisances/JetScale/nuisances.py  \
     --samplesFile   ../../../../PlotsConfigurationsCMSDAS2020CERN/Nuisances/JetScale/samples.py \
     --cutName df

Question:

  • Roughly how large do you expect the impact of this nuisance to be?
  • Could we simplify the "datacard"? (See the option 'AsLnN': '1')
  • Which variables do you expect to be affected?
  • Try to plot the "DrawNuisancesAll" for different variables

Part 5. Some details and special treatment of backgrounds

General treatment of background: top, DYtautau, non-prompt, Vg* ...

Question:

  • How could you define a top control region?
  • How could you define a DY->tautau control region?
  • How could you define a Vg* control region?

Control regions are used in the final fit to "normalize the background contribution". Use of "rateparam".

Part 6. Make plots [continued]

Control regions: defined as single-bin histograms, just counting the number of expected and observed events.

Play with configurations and prepare control regions.

Steps towards a signal region

ggH

cd PlotsConfigurationsCMSDAS2020CERN/SignalRegion/ggH

aliases.py
Technical definition of aliases.

The file aliases.py defines some useful shortcuts in the definition of weights, cuts, variables ... It also helps in speeding up the creation of the histograms, since these variables get pre-computed.

If you want to learn more about a similar approach to data analysis, see the new ROOT RDataFrame developments.

In order to use the file aliases.py, it has to be declared in configuration.py:

# file with TTree aliases
aliasesFile = 'aliases.py'

Each entry in the aliases.py file has the following structure:

aliases['shortened_name'] = {
    'expr': 'mll >0 && mll < 4',
    'samples': 'VgS'  # Optional: you can define for which samples this alias is defined
}

Part 6. Make plots for ggH and VBF phase space

ggH and VBF phase space plots and datacards.

Run:

mkShapesMulti.py --pycfg=configuration.py --doBatch=1 --batchSplit=Samples,Files --batchQueue=workday 
mkShapesMulti.py --pycfg=configuration.py --doHadd=1 --batchSplit=Samples,Files
mkPlot.py --pycfg=configuration.py --inputFile=rootFile/plots_ggH2018_v6.root
mkDatacards.py --pycfg=configuration.py --inputFile=rootFile/plots_ggH2018_v6.root

Question:

  • This is the 0-jets phase space
  • Which is the main background?

Question:

  • let's check the additional nuisances
  • autoMCstat, a.k.a. bin-by-bin uncertainties related to the finite MC statistics. See the article, this other article, and the combine guide about the Barlow-Beeston approach to MC statistical (bin-by-bin) uncertainties.

More jets and VBF

cd PlotsConfigurationsCMSDAS2020CERN/SignalRegion/VBF

Run:

mkShapesMulti.py --pycfg=configuration.py --doBatch=1 --batchSplit=Samples,Files --batchQueue=workday 
mkShapesMulti.py --pycfg=configuration.py --doHadd=1 --batchSplit=Samples,Files
mkPlot.py --pycfg=configuration.py --inputFile=rootFile/plots_VBF2018_v6.root
mkDatacards.py --pycfg=configuration.py --inputFile=rootFile/plots_VBF2018_v6.root

Sidenote

  • there are other production mechanisms (VH, ttH)
  • dedicated analyses, e.g. 3 leptons, same sign, 4 leptons, ...

Question:

  • Which is the main background in the 2-jets phase space?
  • Can you plot separately ggH and VBF?
  • Which cuts could improve VBF purity (vs ggH) and efficiency (vs background)?
  • Check normalized distributions

mkPlot.py --pycfg=configuration.py --inputFile=rootFile/plots_VBF2018_v6.root  --plotNormalizedDistributions

Now everything together:

cd PlotsConfigurationsCMSDAS2020CERN/SignalRegion/All

Run:

mkShapesMulti.py --pycfg=configuration.py --doBatch=1 --batchSplit=Samples,Files --batchQueue=workday 
mkShapesMulti.py --pycfg=configuration.py --doHadd=1 --batchSplit=Samples,Files
mkPlot.py --pycfg=configuration.py --inputFile=rootFile/plots_Inclusive.root
mkDatacards.py --pycfg=configuration.py --inputFile=rootFile/plots_Inclusive.root

Question:

  • main backgrounds in each phase space
  • why did I not split the 2-jet phase space into "em" and "me"?

Part 7. Introduction on how an analysis gets approved in CMS

See: http://cms.web.cern.ch/content/how-does-cms-publish-analysis

Blinding:

Blind one single variable selectively in different phase spaces:

'blind': {
     'cut1': (100, 1000),    # min-max of blinded region: 100 = min, 1000 = max
     'cut2':  'full'
}

As mentioned earlier, if you want to blind a full phase space, you can define in plot.py the data to be blind:

plot['NAME']  = { #same name as in samples.py  
    'color': 418,    # kGreen+2
    'isSignal' : VALUE, # 0 -> background: all samples with this flag set to 0 are plotted stacked. 
                                # 1 -> is signal: this gets plotted both stacked and superimposed
                                # 2 -> is signal: this gets plotted only superimposed, not stacked. 
    'isData'   : 0, #0/1 with obvious meaning. It is not a duplicate of structure.py. This is used to handle blinding. See below.
    'isBlind' : 1, # if set to 1, all samples with isData = 1 are not shown.
    'scale'    : 1.0, # OPTIONAL whether to scale the sample by a fixed amount
    'cuts'  : {       #OPTIONAL: whether to plot this sample only for specified cuts and applying the specified scale factor.
                       'cut name'      : scale value ,
     },
}

Question:

  • why do we blind? Where do we blind?

Introduction slides.

Part 8. Combine

Combine tutorial: https://indico.cern.ch/event/859454/

Install "combine":

https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/tree/master

https://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/

Install it in an independent CMSSW release dedicated to combine:

cmsrel CMSSW_10_2_13
cd CMSSW_10_2_13/src/
cmsenv

git clone https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit.git HiggsAnalysis/CombinedLimit
cd HiggsAnalysis/CombinedLimit

Update to most recent recommended combine:

cd $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit
git fetch origin
git checkout v8.1.0
scramv1 b clean; scramv1 b

You will move between one working area and the other; the last one where you run

cmsenv

is the one that is used.

Suggestion: have a "combine" CMSSW release only for combine. You may use it for different analyses, but keep it up to date.

Creation of datacards

cd PlotsConfigurationsCMSDAS2020CERN/SignalRegion/All

What does combine do: likelihood.

Parameters of interest and "default" model.

Combine the datacards:

    combineCards.py SR0jem=datacards/hww2l2v_13TeV_em_0j/mll/datacard.txt \
                    SR0jme=datacards/hww2l2v_13TeV_me_0j/mll/datacard.txt \
                    SR2j=datacards/hww2l2v_13TeV_2j/mll/datacard.txt \
                    TOP0j=datacards/hww2l2v_13TeV_top_0j/events/datacard.txt \
                    TOP2j=datacards/hww2l2v_13TeV_top_2j/events/datacard.txt \
                    DY0j=datacards/hww2l2v_13TeV_dytt_0j/events/datacard.txt \
                    DY2j=datacards/hww2l2v_13TeV_dytt_2j/events/datacard.txt \
                    > combined.txt

and text2workspace

       text2workspace.py     combined.txt    -o combined.root

  • Fit

    combine -M FitDiagnostics   -t -1 --expectSignal=1 combined.root   &> logFit.txt
    combine -M AsymptoticLimits -t -1 --expectSignal=0 combined.root   &> logLimit.txt
    combine -M Significance     -t -1 --expectSignal=1 combined.root   &> logSignificance.txt

Questions:

  • FitDiagnostics, what is a "signal strength modifier"?
  • AsymptoticLimits, what is a "limit"?
  • Significance, what is a "significance"?

New tools:

  • Impact plots

Install "combine harvester" from http://cms-analysis.github.io/CombineHarvester/

    git clone https://github.com/cms-analysis/CombineHarvester.git CombineHarvester

Create impact plots:

    #do the initial fit
    combineTool.py -M Impacts -d combined.root -m 125 --doInitialFit -t -1 --expectSignal=1 -n nuis.125 
    
    # do the initial fit for rateParams separately
     
    combineTool.py -M Impacts -d combined.root -m 125 --doInitialFit -t -1 --expectSignal=1 \
                 --named CMS_hww_WWnorm0j,CMS_hww_Topnorm0j,CMS_hww_DYttnorm0j,CMS_hww_WWnorm2j,CMS_hww_Topnorm2j,CMS_hww_DYttnorm2j  \
                 --setParameterRanges CMS_hww_WWnorm0j=-2,4:CMS_hww_Topnorm0j=-2,4:CMS_hww_DYttnorm0j=-2,4:CMS_hww_WWnorm2j=-2,4:CMS_hww_Topnorm2j=-2,4:CMS_hww_DYttnorm2j=-2,4         \
                 -n rateParams.125
    
    
    # do the fits for each nuisance
    combineTool.py -M Impacts -d combined.root -m 125 --doFits -t -1 --expectSignal=1 --job-mode condor --task-name nuis -n nuis.125 
    
    # do the fit for each rateParam
    combineTool.py -M Impacts -d combined.root -m 125 --doFits -t -1 --expectSignal=1 --job-mode condor --task-name rateParams \
            --named CMS_hww_WWnorm0j,CMS_hww_Topnorm0j,CMS_hww_DYttnorm0j,CMS_hww_WWnorm2j,CMS_hww_Topnorm2j,CMS_hww_DYttnorm2j \
            --setParameterRanges CMS_hww_WWnorm0j=-2,4:CMS_hww_Topnorm0j=-2,4:CMS_hww_DYttnorm0j=-2,4:CMS_hww_WWnorm2j=-2,4:CMS_hww_Topnorm2j=-2,4:CMS_hww_DYttnorm2j=-2,4 \
            -n rateParams.125
    

Now plots:


    #collect job output
    combineTool.py -M Impacts -d combined.root -m 125 -t -1 --expectSignal=1 -o impacts.125.nuis.json -n nuis.125
    
    combineTool.py -M Impacts -d combined.root -m 125 -t -1 --expectSignal=1 --named CMS_hww_WWnorm0j,CMS_hww_Topnorm0j,CMS_hww_DYttnorm0j,CMS_hww_WWnorm2j,CMS_hww_Topnorm2j,CMS_hww_DYttnorm2j -o impacts.125.rateParams.json -n rateParams.125
    
    
    #combine the two jsons
    echo "{\"params\":" > impacts.125.json
    jq -s ".[0].params+.[1].params" impacts.125.nuis.json impacts.125.rateParams.json >> impacts.125.json 
    echo ",\"POIs\":" >> impacts.125.json
    jq -s ".[0].POIs" impacts.125.nuis.json impacts.125.rateParams.json >> impacts.125.json
    echo "}" >> impacts.125.json
    # make plots
    plotImpacts.py -i impacts.125.json -o impacts.125 
    

Questions:

  • What is the main nuisance?
  • How do we read the impact plots, namely the plot where the impact of each nuisance on the uncertainty of the signal strength modifier is reported?

Part 9. Results

Final plots:

  • Input distributions
  • Signal strength
  • Impact plots
  • Significance
  • different model: couplings (kV:kf) or muVBF:muGGH

    text2workspace.py -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel \
                  --PO 'map=.*/ggH_h*:muGGH[1,0.0,2.0]' \
                  --PO 'map=.*/qqH_h*:muVBF[1,0.0,2.0]' \
                  combined.txt -o combined.multidim.root

Perform the scan (via condor):

    combineTool.py -d   combined.multidim.root  -M MultiDimFit    \
               --algo=grid     --X-rtd OPTIMIZE_BOUNDS=0   \
               --setParameters  muGGH=1,muVBF=1 \
               -t -1   -n "mycondor"   \
               --points 100    --job-mode condor \
               --task-name condor-all    \
               --split-points 1 

And now plot:

              
    hadd higgs_2dScan.root   higgsCombinemycondor.POINTS.*.MultiDimFit.mH120.root
    root -l    higgs_2dScan.root     Draw2DImproved.cxx\(\"#mu_\{GGH\}\",\"#mu_\{VBF\}\",\"muGGH\",\"muVBF\",2,\"1\"\)

Questions:

  • likelihood scan and uncertainty on Parameters of Interest (POIs)
  • Is there any correlation between muGGF and muVBF?
  • how could you improve this result (i.e. make the ellipse smaller)?

Bonus:

  • Differential distributions
  • Complex models (e.g. couplings), Effective Field theory
  • Full Run II searches and combination among years
  • Full Run II searches and combination between channels (HWW, HZZ, Hgg, ...)

Part 10. Additional Tag and Probe Exercise

Tag and Probe is the most common method to estimate the efficiency of a selection. We are going to use 2018 data and MC for this study. Broadly, the steps involved are:

  • select events with two electrons of opposite sign;
  • require one of the electrons to pass the electron ID and the trigger, along with a pT cut that depends on the trigger used. This electron is called the `tag`. If we don't find such an electron in the event, we reject the event;
  • then we look for a second reconstructed electron in the event; there is no other selection requirement on this electron, which is the `probe`;
  • we form a tag-probe pair and require the invariant mass of the pair to be between 60 GeV and 120 GeV. This makes sure that we are selecting events coming from the decay of a Z boson. What we get after this is the `total probe collection`, which acts as the denominator of the efficiency;
  • after this we ask the probe to pass the selection whose efficiency we want to measure. For this exercise we want to measure the efficiency of the electron identification, so the selection is the electron ID;
  • the tag-and-probe pairs where the probe passes the selection form the numerator of the efficiency (see the sketch below).
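
A minimal PyROOT sketch of the efficiency bookkeeping (binning is illustrative):

import ROOT

# histograms of the probe pT, to be filled from the tag-and-probe pairs
h_total = ROOT.TH1F("h_total", ";probe p_{T} [GeV];probes", 10, 10., 110.)
h_pass  = ROOT.TH1F("h_pass",  ";probe p_{T} [GeV];passing probes", 10, 10., 110.)

# ... fill h_total for all pairs with 60 < m(tag, probe) < 120 GeV
# ... fill h_pass for the subset where the probe passes the electron ID

eff = ROOT.TEfficiency(h_pass, h_total)   # binomial (Clopper-Pearson) uncertainties by default
eff.SetTitle(";probe p_{T} [GeV];electron ID efficiency")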

Some more explanation of the method is given here https://twiki.cern.ch/twiki/bin/view/CMSPublic/TagAndProbe

Here is the package we are going to use : https://github.com/arunhep/NanoTnP/tree/CMSDAS2020

Various steps are explained in the following README file : https://github.com/arunhep/NanoTnP/blob/CMSDAS2020/README.md

-- Main.AndreaMassironi - 2020-08-26
