Under construction

tW Analysis Short Exercise, CMS Data Analysis School, Beijing

Aim of the exercise: Learn to identify variables that can be used to discriminate between processes and how they can be incorporated into a multi-variate analysis.

Duncan Leggat, IHEP


The color scheme of the exercises is as follows:

  • Commands will be embedded in a grey box:
    cmsRun test_cfg.py
  • Output and screen printouts will be embedded in a green box:
    Hello world! 
  • C++ code snippets will be embedded in a pink box:
    if ( p < 2.5 ) continue; 
  • Python code will be embedded in a light blue box:
    process.source = cms.Source("PoolSource", fileNames = cms.untracked.vstring("file:inputfile.root") )

Reference material


TMVA manual


List of public results for tW

Contains the first evidence and observation papers, as well as recent Run 2 measurements and combinations.

Single Top Physics

[Introduction slides link here]

The top quark is the most massive of the currently observed fundamental particles. Thanks to its high mass it boasts a number of rare properties that, when studied, offer a unique insight into the Standard Model.

  • Large couplings and a special place in electroweak symmetry breaking,
  • Decays before it can hadronise, meaning it can be used to study the properties of bare quarks.

Top quarks are usually produced in association with their anti-quark partner in strong interactions. These ttbar events are produced at a high flux at the LHC and are well studied.

Top quarks can also be produced singly via the electroweak force through interactions with a W boson. The three main production modes for these events are shown below: 1) t-channel (exchange of a W boson), 2) s-channel (W decay to a tb pair), 3) tW (associated production of a top quark with a W boson).


Single top events have a clean final state with relatively few backgrounds. They each contain a direct Wtb vertex in their LO Feynman diagrams and, as such, would be very sensitive to any BSM physics that impacts this.

Indeed there is a measured tension with the SM in single top physics - the top is measured to be less polarised than that predicted by theory (paper).

As they also act as backgrounds for many Higgs and unrelated BSM studies it is vitally important that all the single top quark channels are well understood and constrained.


tW is the associated production of a single top quark with a W boson.

tW is unique among the single top channels in that, at NLO, it is indistinguishable from ttbar. The resonant Feynman diagrams for the two processes are shown below:


This presents an intuitive problem when trying to define the processes separately, and in particular when attempting to simulate them. In the world of Monte Carlo two solutions to the above problem have been proposed:

  • Diagram removal (DR): removing the doubly resonant diagrams from the tW signal definition,
  • Diagram subtraction (DS): adding a gauge-invariant subtraction term to locally cancel the resonance.

In general we use the DR scheme and include the DS as a systematic uncertainty.

In this exercise we will learn how we can separate the two processes.

Exercise 1: Identifying discriminating variables

Firstly we need to log into the PKU cluster and set up our environment:

ssh -XY <username>@hepfarm02.phy.pku.edu.cn
cmsrel CMSSW_7_6_7
cd CMSSW_7_6_7/src
cmsenv

We will find the ntuples we will work on in [a directory?]/trees/

[currently in /home/cmsdas/leggat/tWShortExercise/trees/ but I don't know if this is public...]

There are ntuples available for the signal tW process, the dominant ttbar background and proton-proton collision data. Various selection requirements have already been applied to the events in these ntuples to improve the tW signal purity of the samples:

  • Passes primary vertex reconstruction,
  • Exactly 1 electron,
  • Exactly 1 muon,
  • Lepton pair must be of opposite sign,
  • Exactly one jet in the event that passes b-tagging requirements.

Note: we are using the emu channel only for this exercise. Dilepton final states also exist for the electron-electron and muon-muon channels, but they introduce additional Drell-Yan backgrounds that make them less sensitive than the emu channel.

The trees can be examined using a TBrowser:

[duncanleg@lxslc607] > root trees/tWDilepton.root
root [0] 
Attaching file trees/tWDilepton.root as _file0...
root [1] TBrowser t

You will see a tree with the name 'TNT/BOOM' that contains the selected events. Each event stores a large number of variables, from simple kinematic properties of the reconstructed particles to more complicated functions of the overall state of the event.

Comparing distributions of signal and background

Due to the similarities in the final states of tW and ttbar we end up selecting a large number of ttbar events no matter how we define our signal region. But we want to be able to distinguish our signal events to study them properly. Surely we can do better?

We can consider the physics of the two processes to come up with clues that can separate them. ttbar enters our signal region when one of the b-jets is not selected in the event selection step. This gives us a few windows to look into possible discriminating variables.

  • The additional jet from a ttbar interaction must still be there somewhere.

Possibly it was too low pt or too high eta to be selected, or it failed ID checks. If this is the case it may be in the event as a 'loose' jet - a jet with looser selection criteria applied to it. Examining the additional jets can yield some separation power.

  • If we are missing a jet, the kinematics of the remaining particles will be affected.

Before the interaction the colliding particles have essentially no transverse momentum (i.e. they are moving directly along the beampipe). This means that the vector sum of the transverse momenta of all particles in the system should remain zero after the interaction. If we lose a jet from the event, the magnitude of this summed momentum will be larger. Similarly we might expect differences in the missing transverse momentum originating from the lost jet.
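The momentum-balance argument can be illustrated with a toy event. This is only a sketch: the (pt, phi) values are made up and the helper names are hypothetical, not the code used to fill the tWVars_* branches.

```python
import math

def pt_sys(objects):
    """Magnitude of the vector sum of transverse momenta.

    objects is a list of (pt, phi) pairs, pt in GeV and phi in radians.
    """
    px = sum(pt * math.cos(phi) for pt, phi in objects)
    py = sum(pt * math.sin(phi) for pt, phi in objects)
    return math.hypot(px, py)

def ht(objects):
    """Scalar sum of the transverse momenta."""
    return sum(pt for pt, _ in objects)

# Hypothetical, perfectly balanced event with four objects.
full_event = [(50.0, 0.0), (50.0, math.pi),
              (30.0, math.pi / 2), (30.0, -math.pi / 2)]

# The same event with the last jet lost (e.g. it failed the selection).
lost_jet_event = full_event[:-1]

print(pt_sys(full_event), ht(full_event))
print(pt_sys(lost_jet_event), ht(lost_jet_event))
```

With all four objects the vector sum vanishes (up to rounding), whereas dropping the 30 GeV jet leaves an imbalance of 30 GeV and lowers the scalar sum.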

We have a convenient script to compare the distributions between the signal and background contained in /scripts:

python scripts/compareDistributions.py

In the preamble of this script you will find the variable

 variablesToCompare = [

which controls which variables in the ntuples we will compare.

The script will produce a set of images of the comparisons in the outFiles directory in your cwd. An example of the output distributions can be seen below:


If a distribution jumps around because of statistical fluctuations, such as the one above, there is another option in the preamble:

 rebinHists = 1

that can be used to rebin the distributions. Changing this to a larger integer will reduce the number of bins in the distribution, smoothing out the bumps we observe. For example, with this set to 4 the previous plot becomes:
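What rebinning does to the bin contents can be sketched in a few lines. This is an illustration only, assuming the histogram is available as a plain list of bin contents; the function below is hypothetical and is not the code in compareDistributions.py.

```python
def rebin(contents, n):
    """Merge every n adjacent bins by summing their contents."""
    if n < 1 or len(contents) % n != 0:
        raise ValueError("n must be >= 1 and divide the number of bins")
    return [sum(contents[i:i + n]) for i in range(0, len(contents), n)]

# Made-up bin contents with large bin-to-bin fluctuations.
bins = [3, 7, 2, 8, 4, 6, 1, 9]
rebinned = rebin(bins, 4)
print(rebinned)
```

Each merged bin holds more events, so its relative statistical uncertainty is smaller, at the price of coarser shape information.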


Although it can be obvious by eye whether the signal and background distributions match, it is useful to have a mathematical definition of separation in order to quantify the power of each variable. We define separation as:

\[ \langle S^2 \rangle = \frac{1}{2} \int \frac{\left( y_s(x) - y_b(x) \right)^2}{y_s(x) + y_b(x)} \, dx \]

where y_s and y_b are the signal and background PDFs, respectively.

If two distributions are identical they will give a separation of 0, whilst two distributions with no overlap give a separation of 1. The higher the value, the better. In our script, the separation is calculated in the method:

 def calculateSeparation(hist1,hist2) 

It is printed in the terminal when the comparisons are made.
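A binned version of the separation can be sketched as follows. It assumes the TMVA-style definition, half the sum over bins of (s - b)^2 / (s + b), with both histograms normalised to unit area. The function name and the list-based inputs are illustrative and do not reproduce calculateSeparation.

```python
def separation(sig, bkg):
    """Binned separation of two histograms given as lists of bin contents."""
    if len(sig) != len(bkg):
        raise ValueError("histograms must have the same binning")
    s_tot, b_tot = float(sum(sig)), float(sum(bkg))
    total = 0.0
    for s, b in zip(sig, bkg):
        s, b = s / s_tot, b / b_tot  # normalise to unit area
        if s + b > 0:                # skip bins that are empty in both
            total += (s - b) ** 2 / (s + b)
    return 0.5 * total

print(separation([1, 2, 3, 4], [1, 2, 3, 4]))  # identical shapes
print(separation([1, 1, 0, 0], [0, 0, 1, 1]))  # no overlap
print(separation([4, 3, 2, 1], [1, 2, 3, 4]))  # partial overlap
```

Because both inputs are normalised first, the result depends only on the shapes of the two distributions and not on the sample sizes.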

Now we will try to find some variables with good separating power between the tW signal and ttbar background.

Some suggestions, based on previous tW analyses:

| Variable name in tree | Description |
| tWVars_HT | The scalar sum of the transverse momenta of the leptons, jet and MET in the event |
| tWVars_Pt_sys | The vector sum of the transverse momenta of the leptons + jet + MET system |
| tWVars_nJet2030 | The number of jets in the event with a transverse momentum between 20 and 30 GeV. These are one form of 'loose' jets |
| Met_type1PF_pt | The missing transverse momentum of the event |
| tWVars_Pt_LeadJet | Transverse momentum of the leading jet |
| tWVars_Mass_AllJetsLeptonMET | Mass of the jet + leptons + MET system |
| tWVars_AllJetsLeptons_Centrality | The centrality (Pt/P) of the leptons + jets system |
| tWVars_ptSysOverHT | Pt_sys divided by HT |
| tWVars_Pt_Electron | Pt of the electron |
| tWVars_Pt_Muon | Pt of the muon |
| tWVars_Pt_Leptons | Pt of the muon + electron system |
| tWVars_Pt_AllJetsLeptons | Pt of the leptons plus the jet |
| tWVars_DeltaRBJetLeptons | The angular separation between the b-jet and the dilepton system |
| tWVars_DeltaPhiJetMET | Separation in phi between the selected jet and the missing transverse energy |

These variables are just starter suggestions. Try out some others and try to find variables with good separation power.

Once you have found some variables to use, it is time to move on.

Comparing data and simulation

The data ntuples are still under construction, please skip to exercise 2!

Before we can use the selected variables for further analysis we need to check that the simulation describes the data well.

To do this we have another script that will produce plots that stack the MC and compare it to the available data:

python scripts/dataMCComparisonPlot.py / 

Take some time to check the shape and normalisation for each chosen variable. If there is anything suspicious we should consider whether or not we can use the variable for the analysis.
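One simple way to quantify the agreement is the per-bin data/MC ratio. The sketch below assumes per-bin event counts are available as plain lists; the function name and the numbers are illustrative and are not taken from dataMCComparisonPlot.py. Only the Poisson uncertainty of the data is propagated.

```python
import math

def data_mc_ratio(data, mc):
    """Return (ratio, uncertainty) per bin; (None, None) where the MC is empty."""
    result = []
    for d, m in zip(data, mc):
        if m <= 0:
            result.append((None, None))
        else:
            result.append((d / float(m), math.sqrt(d) / float(m)))
    return result

# Made-up yields: observed data counts and the summed MC prediction.
data_counts = [100, 81, 0]
mc_counts = [110.0, 90.0, 0.0]
ratios = data_mc_ratio(data_counts, mc_counts)
print(ratios)
```

A ratio that is flat but offset from 1 points to a normalisation difference, whereas a slope or structure in the ratio points to a mismodelled shape.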

Exercise 2: Building a multivariate analysis

Now that we have selected some well described discriminating variables, we can put them together into a single multivariate analysis.

For this we will be using a Boosted Decision Tree (BDT) in the TMVA framework (see reference material). Other MVAs and frameworks exist (see extended reading), but they are outside the scope of this exercise.

Performing the training and checking the output

The code we will be using to construct the BDT is in /scripts/mvaScript.C (and the header mvaScript.h).

In the constructor you will find the definition below:

_varList = {"tWVars_E_Muon","tWVars_HT","tWVars_Pt_sys"}; 

This is the list of variables that the BDT will use for its training (and subsequent reading steps). Edit this list to contain the variables previously selected.

In order to run the training one must use the commands:

duncanleg@lxslc604 /publicfs/cms/user/duncanleg/cmsdas > root
root [0] .L scripts/mvaScript.C+    //Loads in the mva scripts and compiles it
Info in : creating shared library /publicfs/cms/user/duncanleg/cmsdas/./scripts/mvaScript_C.so
root [1] mvaScript t                      //Make the script object
(mvaScript &) @0x7ff98985f040
root [2] t.doTraining()                  //Carries out the training

TMVA will run the training and output useful information into the terminal. Assuming all has run correctly the output will finish with the lines:

Factory                  : Thank you for using TMVA!
                         : For citation information, please visit: http://tmva.sf.net/citeTMVA.html

Information such as the number of training and testing events, and the mean values of and correlations between the chosen variables, is displayed. Of particular interest are two tables that are produced during the running. The first ranks the input variables by separation power (as previously calculated):

                         : Ranking input variables (method unspecific)...
IdTransformation         : Ranking result (top variable is best ranked)
                         : --------------------------------------
                         : Rank : Variable      : Separation
                         : --------------------------------------
                         :    1 : tWVars_Pt_sys : 8.137e-02
                         :    2 : tWVars_E_Muon : 7.085e-02
                         : --------------------------------------

And by importance in the training:

                         : Ranking input variables (method specific)...
tW_ttbar_BDT             : Ranking result (top variable is best ranked)
                         : -----------------------------------------------
                         : Rank : Variable      : Variable Importance
                         : -----------------------------------------------
                         :    1 : tWVars_Pt_sys : 5.293e-01
                         :    2 : tWVars_E_Muon : 4.707e-01
                         : -----------------------------------------------

The importance measures how often a variable is used to split decision nodes, with each split weighted by the separation gain it achieves and the number of events in the node. It is normalised so that the importances of all variables sum to 1. If the importance of a variable is zero, it is not being used in the training and should be removed.

Both of these can give clues as to whether the selected variables are useful in the training and can be used to cull those that are not.

Checking the BDT training

Once the training has successfully run, we can check the output recorded to the 'training' directory of your cwd.

Open root and then the file in a TMVA GUI window:


root [0] TMVA::TMVAGui("training/tW_ttbar_training.root")

This will bring up a control window with numerous options on it that can be used to test and analyse the BDT training.


Check the following useful information from the GUI for your training:

Input Variables - (1a) Input variables (training sample)

The first button in the GUI produces plots comparing the signal and background distributions for each input variable, similar to what we did in the first exercise.

Check for Overtraining

It is possible that carrying out too much training can lead to artificially good separation of signal and background.

[refer to slides here]

In order to check whether this is occurring, the input trees are split into two independent sets; the training and test samples. The training is conducted on the training sample. The calculated weights are then applied to the test sample. The output distributions of the test and training samples can then be checked for consistency. Any large deviations indicate the BDT has been overtrained.

The TMVAGui can automatically produce this test using button 4.

  • (4a) plots the output of the training sample only,
  • (4b) plots the output of the training and test samples on the same canvas. A Kolmogorov-Smirnov (KS) test is also conducted to calculate the consistency of the test and training results for both the signal and background. In-depth details of the KS test can be found here, but for us it is enough to know that a value of 1 means the two distributions are identical, and lower values are worse.
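To get a feeling for what the KS test measures, here is a minimal unbinned two-sample sketch. It is an assumption-laden illustration: the number shown in the TMVA GUI comes from a histogram-based test in ROOT, so the values will not match exactly, and returning 1 for very small distances is a choice made here to avoid a slowly converging series.

```python
import math

def ks_distance(a, b):
    """Largest distance between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / float(len(a)) - j / float(len(b))))
    return d

def ks_probability(a, b, terms=100):
    """Asymptotic probability that both samples share a parent distribution."""
    z = ks_distance(a, b) * math.sqrt(len(a) * len(b) / float(len(a) + len(b)))
    if z < 0.2:  # series converges poorly here; the probability is ~1
        return 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * z * z)
                  for k in range(1, terms + 1))
    return min(1.0, max(0.0, p))

# Hypothetical BDT outputs for a training and a test sample.
train = [0.01 * k for k in range(100)]
test_same = list(train)
test_shifted = [x + 1.0 for x in train]

print(ks_probability(train, test_same))
print(ks_probability(train, test_shifted))
```

Identical samples give a probability of 1, whilst samples with no overlap give a probability close to 0, matching the interpretation of the value quoted on the overtraining plot.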

Consider the two overtraining plots below.

[Figures Selection_812.png and Selection_811.png: BDT overtraining, good vs bad]

One is problematic, whilst the other is a good example of a well-trained BDT.

In this example the bad training mostly derives from a lack of events for training, but it can clearly be seen that the shape of the signal sample is very different between the testing and training samples, implying that overtraining has indeed occurred.

If there are ample statistics to train the BDT but overtraining is still observed, then there is a systemic problem with the training parameters used. Tuning these parameters can solve this problem, and will be described below.

Input variable correlations

Button 3 produces a matrix that shows the correlations between the input variables for both the signal and background samples. See the example below:


Variables that are highly correlated supply the same information to the MVA training, making one of them redundant. Ideally the above plot is entirely green (outside the diagonal). You should remove any variables that are highly correlated with a number of other variables.
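The linear correlation coefficient between two variables can be sketched as below. The variable values are made up for illustration, and the helper is a plain Pearson coefficient, not the TMVA implementation.

```python
import math

def pearson(x, y):
    """Linear (Pearson) correlation coefficient of two equal-length lists."""
    n = float(len(x))
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-event values of three candidate input variables.
ht_values = [100.0, 150.0, 200.0, 250.0]
scaled_ht = [2.0 * h + 5.0 for h in ht_values]  # carries the same information
met_values = [30.0, 10.0, 40.0, 20.0]

print(pearson(ht_values, scaled_ht))
print(pearson(ht_values, met_values))
```

A coefficient close to +1 or -1, as for the first pair, means that one of the two variables adds little new information to the training.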

Examine the ROC

By placing a cut on the BDT discriminant we reject both signal and background events. The higher we make the cut, the more background events we reject, at the cost of removing additional signal events. The fractions of signal and background events that pass the cut are called the signal and background efficiency, respectively.

Ideally we want to maximise the signal efficiency whilst minimising the background efficiency. We can visualise this in the following ways:

  • Button (5a) shows us the two efficiencies plotted as a function of discriminant cut.
  • Button (5b) plots the two together in what is known as a 'Receiver Operating Characteristic' (ROC) curve.

The ROC curve gives us an immediate idea of how the MVA is performing: the closer the curve reaches to the top right (i.e. high signal efficiency with maximum background rejection) the better. The area under the curve can then be used as a proxy for the separation power the MVA can achieve, and can be interpreted as the probability that the BDT assigns a randomly chosen signal event a higher score than a randomly chosen background event.

  • An area under the ROC of 1. implies complete rejection of background with no loss of signal
  • An area under the ROC of 0.5 means as many background events are flagged as signal as signal events. It is equivalent to flipping a coin for each event.
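The two limiting cases can be checked with a small sketch that scans a cut over a list of discriminant values and integrates signal efficiency against background rejection. The scores are made up and the function names are hypothetical; this is not how TMVA computes its ROC integral internally.

```python
def roc_points(sig_scores, bkg_scores):
    """(signal efficiency, background rejection) for a cut at each observed score."""
    n_s, n_b = float(len(sig_scores)), float(len(bkg_scores))
    points = [(1.0, 0.0)]  # no cut applied: everything is kept
    for cut in sorted(set(sig_scores) | set(bkg_scores)):
        eff_s = sum(1 for s in sig_scores if s > cut) / n_s
        eff_b = sum(1 for b in bkg_scores if b > cut) / n_b
        points.append((eff_s, 1.0 - eff_b))
    return points

def area_under_roc(points):
    """Trapezoidal integral of rejection versus signal efficiency."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x0 - x1) * (y0 + y1) / 2.0
    return area

# Hypothetical BDT outputs.
perfect = area_under_roc(roc_points([0.7, 0.8, 0.9], [0.1, 0.2, 0.3]))
coin_flip = area_under_roc(roc_points([0.2, 0.5, 0.8], [0.2, 0.5, 0.8]))
print(perfect, coin_flip)
```

Fully separated scores give an area of 1, and identical signal and background scores give 0.5.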

See the example of 3 ROCs for separately trained BDTs below:


The area under the ROC is calculated during the training and displayed in the output:

                          : ------------------------------------------------------------------------------------
                         : Evaluation results ranked by best signal efficiency and purity (area)
                         : -------------------------------------------------------------------------------------------------------------------
                         : DataSet       MVA                       
                         : Name:         Method:          ROC-integ
                         : loader        tW_ttbar_BDT   : 0.553
                         : -------------------------------------------------------------------------------------------------------------------

Maximising the ROC whilst keeping a handle on the overtraining is the key to a successful MVA.

Exercise 3: Optimising your BDT response by tuning parameters

In the constructor of the mvaScript there are a number of settings that we can change to adjust the tune of the BDT. These will have different impacts on the training, and can heavily influence the final separating power and accuracy of the classifier.

The parameters to vary are:

| Parameter name | Default value | Description |
| nTrees | 800 | The number of trees to be trained. A higher number will lead to better separation but also overtraining. Usually the sweet spot is around 300-500 |
| nCuts | 20 | Number of cut points allowed for each decision node |
| MaxDepth | 3 | Maximum allowed depth of the trees |
| BoostType | AdaBoost | Different boosting regimes exist for training the trees. Usable options: 'AdaBoost', 'GradBoost' and 'Bagging' |

There are many more parameters that one can play around with, and all are explained in the TMVA Users' Guide (see reference material).

Now you should try changing the values of each and seeing how it affects the output and tests.
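To keep track of which settings have been tried, it can help to enumerate the scan up front. The sketch below is only bookkeeping: the grid values are arbitrary examples, the helper names are hypothetical, and nothing here talks to TMVA or mvaScript.

```python
from itertools import product

# Example values to scan; the keys follow the parameter names in the table.
grid = {
    "nTrees": [300, 500, 800],
    "nCuts": [20],
    "MaxDepth": [2, 3],
    "BoostType": ["AdaBoost", "GradBoost"],
}

def scan_points(grid):
    """All combinations of the grid values, as a list of dictionaries."""
    names = sorted(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]

def label(point):
    """A unique name for one combination, e.g. to label the output."""
    return "BDT_" + "_".join("{}{}".format(k, point[k]) for k in sorted(point))

points = scan_points(grid)
print(len(points))
print(label(points[0]))
```

Each combination then corresponds to one training, whose ROC integral and overtraining plots can be compared.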

Find the best possible ROC curve for the BDT whilst minimising overtraining and correlations by testing different variables and training parameters.

Hint: multiple classifiers can be added to the same factory and trained in the same script!
