- Oviedo Statistics Class 2017
- Summary
- Reference materials
- Computing environment
- Exercise #0: Fit a lower statistics data sample using RooFit
- Exercise #1: using toy MCs to test fit robustness
- Exercise #2: compute the excess significance
- Exercise #3: compute upper limits to the signal yield using frequentist and Bayesian methods
- Exercise #4: determine the cross section with full statistics
- Exercise #5: this is a mistery...
- Exercise #6: Nuisance parameters and systematics
- Exercise #7: Feldman-Cousins: a digression

The goal of the exercise is to get acquainted with few of the statistical analysis practices used in HEP. Introductory slides will present the main features of RooFit and RooStats, as well as giving some information on the hands-on session.

This exercises use the CMS public data from the 2010 proton-proton running. In particular, the invariant mass spectrum of muon pairs from data is provided. The purpose of the session is to approach the discovery of a "new" particle in the context of the observation of the ψ(2S) resonance in the CMS data. No specific knowledge of the CMS apparatus is needed. This is what the distribution of the dimuon invariant mass spectrum looks like for about 2% of the total 2010 statistics:

The J/ψ is of course very visible around the 3.1 GeV mass point. We expect to see an excess around 3.65 GeV for the ψ(2S). Is it there? Let's investigate!

Do notice that usually experiments do not proceed in this way towards a discovery. In general, blind analyses are performed and the methodology and modelization for data interpretation is decided before looking at the actual data. For the purpose of describing the statistical tools in this lecture, we will neglect these considerations for now.

The exercise session introduces methods and tools for tackling very common problems in statistics while complying with rigorous standards set in today's experimental particle physics. The main topics are finding a confidence interval for a parameter of interest, and estimating the statistical significance of an excess of events with respect to the background expectations. We cover the most popular techniques: profile likelihood, Bayesian. Participants will learn the main features of the RooFit and RooStats software frameworks for statistics and data modeling. We overview the popular cases of a hunting a signal peak over a predictable falling background.

- Statistics in Theory - a lecture by Bob Cousins
- Statistical methods in LHC data analysis - Luca Lista
- RooFit reference slides - by Wouter Verkerke, one of RooFit authors
- RooFit tutorials - a set of working macros that showcase all major features of RooFit
- RooStats Manual - concise, contains clear summary of statistics concepts and definitions
- RooStats tutorial - by Kyle Cranmer, one of the RooStats developers
- RooStats tutorials - a set of working macros that showcase all major features of RooStats

All exercises should run on any ROOT installation containing also the RooFit libraries (RooStats is included in the RooFit installation).
On your laptop you obtain this by passing the *--enable-roofit* on the *configure* step of the ROOT installation. If you have a debian based installation, you just need to install the libRooFit library.
All the code here uses PyROOT, which is a python interface to ROOT libraries, so a python installation is also required.

Alternatively, if you have a CERN account, most of the central ROOT installations contain RooFit.

PyROOT programs can be ran using the following command from the shell prompt:

python macro.py

In order to have all ROOT libraries available in PyROOT, the following line should be added at the begin of every program (`macro.py`, in the previous command line example):

import ROOT

First of all, you will need a copy of the data ROOT file containing the invariant mass information from dimuon events. We will start with a data sample which corresponds to about 20% of the statistics available on 2010:

wget https://twiki.cern.ch/twiki/pub/Main/OviedoStatRooStats2017/DataSet_lowstat.root .

Let's create a skeleton for a PyROOT program. Let's call it, for instance, `exercise_0.py`, and load the ROOT libraries:

#Import the ROOT libraries import ROOT

- Open the ROOT file containing the data, and extract the RooDataSet object. A RooDataSet is essentially an ntuple (similar to a TTree in simple ROOT) containing your N-dimensional observables (in this case it has one dimension, it contains the dimuon invariant mass measurement for the selected events):

fInput = ROOT.TFile("DataSet_lowstat.root") dataset = fInput.Get("data")

- Define the observables of the problem. In our case, this is a one-dimensional problem, and the observable is the invariant mass of the dimuon system. To handle this, RooFit uses the class
`RooRealVar`. Note that at the constructor you have several options. In the following, either a range will be specified, or an initial value plus a range.

#The observable mass = ROOT.RooRealVar("mass","#mu^{+}#mu^{-} invariant mass",2.,6.,"GeV")

- Modelization of the observable distribution. Let's start from the most striking feature: the J/ψ peak. We will describe this signal using a Crystal Ball function (see Wikipedia), which is essentially a Gaussian core with a polynomial left tail to account for energy losses. First, we have to define the parameters of the function.

#The Jpsi signal parametrization: we'll use a Crystal Ball meanJpsi = ROOT.RooRealVar("meanJpsi","The mean of the Jpsi Gaussian",3.1,2.8,3.2) sigmaJpsi = ROOT.RooRealVar("sigmaJpsi","The width of the Jpsi Gaussian",0.3,0.0001,1.) alphaJpsi = ROOT.RooRealVar("alphaJpsi","The alpha of the Jpsi Gaussian",1.5,-5.,5.) nJpsi = ROOT.RooRealVar("nJpsi","The alpha of the Jpsi Gaussian",1.5,0.5,5.)

- Define a PDF for the signal model. The Crystal Ball distribution, in this case, is provided by the
`RooCBshape`class:

CBJpsi = ROOT.RooCBShape("CBJpsi","The Jpsi Crystall Ball",mass,meanJpsi,sigmaJpsi,alphaJpsi,nJpsi)

- Now we have to define a model for the excess around 3.65 GeV. Since there aren't many events involved, the functional form to describe this peak doesn't need to be very sophisticated. A simple Gaussian function (described by
`RooGaussian`) is probably enough for the level of precision required. We are going to do an assumption here: the resolution in the mass region of this excess is probably similar to the dimuon resolution around the J/ψ mass peak. Therefore, we will use the width of the Gaussian core of the J/ψ Crystal Ball to describe the width of the Gaussian for this new excess. In this way, we use the high statistics of the J/ψ peak to better constrain our signal modelization

#The psi(2S) signal parametrization: width will be similar to Jpsi core of the CB (almost Gaussian) meanpsi2S = ROOT.RooRealVar("meanpsi2S","The mean of the psi(2S) Gaussian",3.7,3.65,3.75) gausspsi2S = ROOT.RooGaussian("gausspsi2S","The psi(2S) Gaussian",mass,meanpsi2S,sigmaJpsi)

- Now also define the background. A third order polynomial is enough here. We'll use Chebychev polynomials:

#Background parametrization: just a polynomial a1 = ROOT.RooRealVar("a1","The a1 of background",-0.7,-2.,2.) a2 = ROOT.RooRealVar("a2","The a2 of background",0.3,-2.,2.) a3 = ROOT.RooRealVar("a3","The a3 of background",-0.03,-2.,2.) backgroundPDF = ROOT.RooChebychev("backgroundPDF","The background PDF",mass,ROOT.RooArgList(a1,a2,a3))

- Now, define the yields for the different contributions to the spectrum, again using the
`RooRealVar`class:

NJpsi = ROOT.RooRealVar("NJpsi","The Jpsi signal events",1500.,0.1,10000.) Npsi = ROOT.RooRealVar("Npsi","The psi signal events",100.,0.1,5000.) Nbkg = ROOT.RooRealVar("Nbkg","The bkg events",5000.,0.1,50000.)

- The total PDF is a linear combination of the different components with weights that are proportional to
`NJpsi`,`Npsi`and`Nbkg`. This is implemented in RooFit in the class`RooAddPdf`. Note that the lists of PDF and yields are passed via the`RooArgList`class:

#Compose the total PDF totPDF = ROOT.RooAddPdf("totPDF","The total PDF",ROOT.RooArgList(CBJpsi,gausspsi2S,backgroundPDF),ROOT.RooArgList(NJpsi,Npsi,Nbkg))

- As we have the dataset ready, we can fit it to the PDF model in order to determine the desired parameters:

#Do the actual fit totPDF.fitTo(dataset, ROOT.RooFit.Extended(1)) #Print values of the parameters (that now reflect fitted values and errors) print "##############" meanpsi2S.Print() NJpsi.Print() Npsi.Print() print "##############"

- The result of the fit is printed as usual Minuit output, but you can also produce a plot of the fit PDF and its signal and background components, separately. Note that you need to use the class
`RooPlot`, you can't directly call the usual ROOT`Draw()`method for RooFit objects:

#Now plot the data and the fit result xframe = mass.frame() dataset.plotOn(xframe) totPDF.plotOn(xframe) #One can also plot the single components of the total PDF, like the background component totPDF.plotOn(xframe, ROOT.RooFit.Components("backgroundPDF"), ROOT.RooFit.LineStyle(ROOT.kDashed), ROOT.RooFit.LineColor(ROOT.kRed)) #Draw the results c1 = ROOT.TCanvas() xframe.Draw() c1.SaveAs("exercise_0.png")

- In order to make the PDF model and the dataset available for use in next exercises, you can save it into a
`RooWorkspace`object that can be written directly into a ROOT file:

#Now save the data and the PDF into a Workspace, for later use for statistical analysis ws = ROOT.RooWorkspace("ws") getattr(ws,'import')(dataset) getattr(ws,'import')(totPDF) fOutput = ROOT.TFile("Workspace_mumufit.root","RECREATE") ws.Write() fOutput.Write() fOutput.Close()

- Now execute the python program. The result should look like the following:

The entire exercise is available below:

- First, create a skeleton python and call it
`exercise_1.py`. Load ROOT:

import ROOT

- Import the workspace from the file produced in the previous exercise:

#Open the rootfile and get the workspace from the exercise_0 fInput = ROOT.TFile("Workspace_mumufit.root") ws = fInput.Get("ws") ws.Print()

- The MC study machinery needs to be fed the observable and the PDF, so we'll extract them from the RooWorkspace. Important: what we saved in the Workspace is the PDF (with all the parameters) in its last status, which is just after the fit in exercise 0:

#Get the observable and PDF out of the Workspace mass = ws.var("mass") totPDF = ws.pdf("totPDF")

- Initialize RooMCStudy. It requires to know the PDF and the observable (for the toy MC generation). You can pass additional arguments. Here, we add the request Extended(1), which means that in every single toy MC experiment the total number of events generated will be a Poissonian fluctuation of the expected total number of events (which is the fit result of the previous exercise). Also, we want to save the fit result of each experiment for later use.

#Initialize RooMCStudy mc_study = ROOT.RooMCStudy(totPDF, ROOT.RooArgSet(mass), ROOT.RooFit.Extended(1), ROOT.RooFit.FitOptions(ROOT.RooFit.Save(1)) )

- Now we need to generate N experiments, fit them, and save the result. It is done with this line:

#Generate 1000 experiments and fit each one, each fluctuating in Nevents = NJpsi + Npsi + Nbkg mc_study.generateAndFit(1000)

- Now we can study the behavior of our parameters along the experiments. RooFit provides some automatic tools to keep track of all the fit results and plot them. Let's consider our parameter of interest,
`cross_psi`, and verify its statistical properties, like its central value across the experiments, the error, and the pull:

#Now let's see the results. For example, the study of the cross section variable cross_psi = ws.var("cross_psi") frame_cross_par = mc_study.plotParam(cross_psi, ROOT.RooFit.Bins(40)) frame_cross_err = mc_study.plotError(cross_psi, ROOT.RooFit.Bins(40), ROOT.RooFit.FrameRange(0.,30.) ) frame_cross_pul = mc_study.plotPull(cross_psi, ROOT.RooFit.Bins(40), ROOT.RooFit.FitGauss(1) )

- We may also want to check that the NLL has a healthy distribution across the experiments:

#Also, let's see the distribution of the NLL for all the fits frame_nll = mc_study.plotNLL(ROOT.RooFit.Bins(40))

- Now let's plot all this:

#Now plot the whole thing ROOT.gStyle.SetOptStat(0) mcstudy_Canvas = ROOT.TCanvas("mcstudy_Canvas") mcstudy_Canvas.Divide(2,2) mcstudy_Canvas.cd(1) frame_cross_par.Draw() mcstudy_Canvas.cd(2) frame_cross_err.Draw() mcstudy_Canvas.cd(3) frame_cross_pul.Draw() mcstudy_Canvas.cd(4) frame_nll.Draw() mcstudy_Canvas.SaveAs("exercise_1.png")

- The
`cross_psi`variable may have some strongly non-Gaussian behavior, as it involves a low statistics object that may be affected by strong fluctuations. As a comparison, let's also save the behavior of a variable that involves the J/ψ. After all, this is a very large peak with relatively high statistics, and we expect that variables involved with this peak will show a Gaussian behavior:

#Now plot some less constroversial variable meanJpsi = ws.var("meanJpsi") frame_mjpsi_par = mc_study.plotParam(meanJpsi, ROOT.RooFit.Bins(40)) frame_mjpsi_pul = mc_study.plotPull(meanJpsi, ROOT.RooFit.Bins(40), ROOT.RooFit.FitGauss(1) ) mcstudy_Canvas_jpsi = ROOT.TCanvas("mcstudy_Canvas_jpsi") mcstudy_Canvas_jpsi.Divide(2,1) mcstudy_Canvas_jpsi.cd(1) frame_mjpsi_par.Draw() mcstudy_Canvas_jpsi.cd(2) frame_mjpsi_pul.Draw() mcstudy_Canvas_jpsi.SaveAs("exercise_1_mjpsi.png")

- You are now ready to run the exercise. What conclusions can you make about the
`cross_psi`parameter?

The entire exercise is available below:

- First of all, create a skeleton python program and call it, for instance,
`exercise_2.py`. Load ROOT:

import ROOT

- Import the workspace from the file produced in the previous exercise:

#Open the rootfile and get the workspace from the exercise_0 fInput = ROOT.TFile("Workspace_mumufit.root") ws = fInput.Get("ws") ws.Print()

- You can set as constant some of the parameters of the model. As an example, we set the J/ψ mean of the Crystal Ball function. A parameter that is constant is of course unchanged, for example, in a likelihood scan. The more parameters are constant, the higher your statistical sensitivity, but this is a strong statement: setting a parameter as constant means that its uncertainty is negligible in this problem. Note how you can access variables by name using the
`RooWorkspace`class:

#You can set constant parameters that are known #If you leave them floating, the fit procedure will determine their uncertainty ws.var("meanJpsi").setConstant(1)

- RooStats needs to configure the PDF model using the class
`ModelConfig`:

#Set the RooModelConfig and let it know what the content of the workspace is about model = ROOT.RooStats.ModelConfig() model.SetWorkspace(ws) model.SetPdf("totPDF")

- Let's access the parameter
`cross_psi`from the workspace and take a "snapshot" (i.e.: a clone). We want to set`cross_psi=0`in order to determine the signal significance, measured from the p-value corresponding to the background-only hypothesis.

#Here we explicitly set the value of the parameters for the null hypothesis #We want no signal contribution, so cross_psi = 0 cross_psi = ws.var("cross_psi") poi = ROOT.RooArgSet(cross_psi) nullParams = poi.snapshot() nullParams.setRealValue("cross_psi",0.)

- We use the class ProfileLikelihoodCalculator to implement the profile-likelihood method and determine the significance with the class HypoTestResult:

#Build the profile likelihood calculator plc = ROOT.RooStats.ProfileLikelihoodCalculator() plc.SetData(ws.data("data")) plc.SetModel(model) plc.SetParameters(poi) plc.SetNullParameters(nullParams)

* Now we just need to print out the result:

#We get a HypoTestResult out of the calculator, and we can query it. htr = plc.GetHypoTest() print "-------------------------------------------------" print "The p-value for the null is ", htr.NullPValue() print "Corresponding to a signifcance of ", htr.Significance() print "-------------------------------------------------" #PyROOT sometimes fails cleaning memory, this helps del plc

Run this exercise. Is the peak aound 3.65 GeV significant?

The complete exercise is available below:

- Create a new python file, call it, for instance,
`exercise_3.py`, and load the ROOT libraries. Then, import the workspace created in exercise #0. Also, set a few parameters as constant. You would normally not do this, but otherwise the processing time for calculator becomes too long for this hands-on session:

import ROOT #Open the rootfile and get the workspace from the exercise_0 fInput = ROOT.TFile("Workspace_mumufit.root") ws = fInput.Get("ws") ws.Print() #You can set constant parameters that are known #If you leave them floating, the fit procedure will determine their uncertainty #Right now we will fix all the nuisance parameters just to speed up the computing time ws.var("meanJpsi").setConstant(1) ws.var("sigmaJpsi").setConstant(1) ws.var("alphaJpsi").setConstant(1) ws.var("nJpsi").setConstant(1) ws.var("NJpsi").setConstant(1) ws.var("meanpsi2S").setConstant(1) ws.var("Nbkg").setConstant(1) ws.var("a1").setConstant(1) ws.var("a2").setConstant(1) ws.var("a3").setConstant(1)

- Now we need to have two PDF models: one with both signal and background, one with background only, which was the only model needed in the previous exercise. Again, this is just to setup the proper
`ModelConfig`for RooStats.

#Configure the model, we need both the S+B and the B only models sbModel = ROOT.RooStats.ModelConfig() sbModel.SetWorkspace(ws) sbModel.SetPdf("totPDF") sbModel.SetName("S+B Model") poi = ROOT.RooArgSet(ws.var("cross_psi")) poi.find("cross_psi").setRange(0.,40.) #this is mostly for plotting sbModel.SetParametersOfInterest(poi) bModel = sbModel.Clone() bModel.SetPdf("totPDF") bModel.SetName( sbModel.GetName() + "_with_poi_0") poi.find("cross_psi").setVal(0) bModel.SetSnapshot(poi)

- Part 1:
**compute a CLs, modified frequentist upper limit**. The computation is based on toy Monte Carlos that allow to compute confidence levels for s+b and b hypotheses. This requires the interplay of different RooStats classes:`FrequentistCalculator`, which is initialized taking as input the two PDF models (background only, signal plus background),`HypoTestInverter`, which, for a given calculator (a`FrequentistCalculator`in this case) computes the confidence interval and returns it as a`HypoTestInverterResult`object. In the computation, the profile-likelihood test statistics (via the`ProfileLikelihoodTestStat`class) is passed to the "sampler" (`ToyMCSampler`) present in the "calculator" object.

#First example is with a frequentist approach fc = ROOT.RooStats.FrequentistCalculator(ws.data("data"), bModel, sbModel) fc.SetToys(2500,1500) #Create hypotest inverter passing the desired calculator calc = ROOT.RooStats.HypoTestInverter(fc) #set confidence level (e.g. 95% upper limits) calc.SetConfidenceLevel(0.95) #use CLs calc.UseCLs(1) #reduce the noise calc.SetVerbose(0) #Configure ToyMC Samler toymcs = calc.GetHypoTestCalculator().GetTestStatSampler() #Use profile likelihood as test statistics profll = ROOT.RooStats.ProfileLikelihoodTestStat(sbModel.GetPdf()) #for CLs (bounded intervals) use one-sided profile likelihood profll.SetOneSided(1) #set the test statistic to use for toys toymcs.SetTestStatistic(profll) npoints = 8 #Number of points to scan # min and max for the scan (better to choose smaller intervals) poimin = poi.find("cross_psi").getMin() poimax = poi.find("cross_psi").getMax() print "Doing a fixed scan in interval : ", poimin, " , ", poimax calc.SetFixedScan(npoints,poimin,poimax); result = calc.GetInterval() #This is a HypoTestInveter class object upperLimit = result.UpperLimit()

- Part 2:
**compute a Bayesian upper limit**: this part is simple, since no`HypoTestInverter`is needed. The upper limit is just given by the posterior Bayesian PDF obtained by the`BayesianCalculator`. The Bayesian calculator returns a`SimpleInterval`

#Example using the BayesianCalculator #Now we also need to specify a prior in the ModelConfig #To be quicker, we'll use the PDF factory facility of RooWorkspace #Careful! For simplicity, we are using a flat prior, but this doesn't mean it's the best choice! ws.factory("Uniform::prior(cross_psi)") sbModel.SetPriorPdf(ws.pdf("prior")) #Construct the bayesian calculator bc = ROOT.RooStats.BayesianCalculator(ws.data("data"), sbModel) bc.SetConfidenceLevel(0.95) bc.SetLeftSideTailFraction(0.) # for upper limit bcInterval = bc.GetInterval()

- Now print the results of the two methods:

#Now let's print the result of the two methods #First the CLs print "################" print "The observed CLs upper limit is: ", upperLimit #Compute expected limit print "Expected upper limits, using the B (alternate) model : " print " expected limit (median) ", result.GetExpectedUpperLimit(0) print " expected limit (-1 sig) ", result.GetExpectedUpperLimit(-1) print " expected limit (+1 sig) ", result.GetExpectedUpperLimit(1) print "################" #Now let's see what the bayesian limit is print "Bayesian upper limit on cross_psi = ", bcInterval.UpperLimit()

- Plot the results of the two methods:

#Plot now the result of the scan #First the CLs freq_plot = ROOT.RooStats.HypoTestInverterPlot("HTI_Result_Plot","Frequentist scan result for psi xsec",result) #Then the Bayesian posterior bc_plot = bc.GetPosteriorPlot() #Plot in a new canvas with style dataCanvas = ROOT.TCanvas("dataCanvas") dataCanvas.Divide(2,1) dataCanvas.SetLogy(0) dataCanvas.cd(1) freq_plot.Draw("2CL") dataCanvas.cd(2) bc_plot.Draw() dataCanvas.SaveAs("exercise_3.png")

- Pay attention at the actual numbers printed (remember, the meaning of the upper limit is very different in the two approaches!)

- The plot will look like the following:

The complete exercise is available below:

- First of all, obtain the full statistics sample for the 2010 CMS data:

wget https://twiki.cern.ch/twiki/pub/Main/OviedoStatRooStats2017/DataSet.root .

- Return to the program
`exercise_0.py`, and modify it to read the new file. Rerun the program, this will create a new workspace containing the data and the fit for the full statistics - Remember: change the luminosity to the value of the full 2010 CMS statistics: 37 pb-1
- Then rerun exercise_2.py and notice the significance of the excess with full statistics. Congratulations, you're in the '70s!
- Create a new program
`exercise_4.py`that imports the workspace, as in the previous exercises. As before, declare a few parameters constants to speed up the computing (you would probably not do that normally):

import ROOT #Open the rootfile and get the workspace from the exercise_0 fInput = ROOT.TFile("Workspace_mumufit.root") ws = fInput.Get("ws") ws.Print() #You can set constant parameters that are known #If you leave them floating, the fit procedure will determine their uncertainty ws.var("meanJpsi").setConstant(1) ws.var("sigmaJpsi").setConstant(1) ws.var("alphaJpsi").setConstant(1) ws.var("nJpsi").setConstant(1) ws.var("NJpsi").setConstant(1) ws.var("meanpsi2S").setConstant(1) ws.var("Nbkg").setConstant(1) ws.var("a1").setConstant(1) ws.var("a2").setConstant(1) ws.var("a3").setConstant(1) #Let the model know what is the parameter of interest cross_psi = ws.var("cross_psi") cross_psi.setRange(4., 16.) #this is mostly for plotting reasons poi = ROOT.RooArgSet(cross_psi)

- Now, set the
`ModelConfig`and`cross_psi`as parameter of interest (poi). Also, now we want 68 percent intervals:

#Configure the model model = ROOT.RooStats.ModelConfig() model.SetWorkspace(ws) model.SetPdf("totPDF") model.SetParametersOfInterest(poi) #Set confidence level confidenceLevel = 0.68

- The computation of the 68 percent confidence interval is performed using the
`ProfileLikelihoodCalculator`:

#Build the profile likelihood calculator plc = ROOT.RooStats.ProfileLikelihoodCalculator() plc.SetData(ws.data("data")) plc.SetModel(model) plc.SetParameters(poi) plc.SetConfidenceLevel(confidenceLevel) #Get the interval pl_Interval = plc.GetInterval()

- As alternative method, a probability interval can be computed using a Bayesian method. While using the Bayesian calculator would be perfectly correct, in this example with high statistics the integration part of the method would require a really long time. On the other hand, one can use the RooStats Markov Chain MC implementation to speed things up:

#Now let's determine the Bayesian probability interval #We could use the standard Bayesian Calculator, but this would be very slow for the integration #So we profit of the Markov-Chain MC capabilities of RooStats to speed things up mcmc = ROOT.RooStats.MCMCCalculator(ws.data("data") , model) mcmc.SetConfidenceLevel(confidenceLevel) mcmc.SetNumIters(20000) #Metropolis-Hastings algorithm iterations mcmc.SetNumBurnInSteps(100) #first N steps to be ignored as burn-in mcmc.SetLeftSideTailFraction(0.5) #for central interval MCMC_interval = mcmc.GetInterval()

- Finally, plots for both methods can be produced:

#Let's make a plot dataCanvas = ROOT.TCanvas("dataCanvas") dataCanvas.Divide(2,1) dataCanvas.cd(1) plot_Interval = ROOT.RooStats.LikelihoodIntervalPlot(pl_Interval) plot_Interval.SetTitle("Profile Likelihood Ratio") plot_Interval.SetMaximum(3.) plot_Interval.Draw() dataCanvas.cd(2) plot_MCMC = ROOT.RooStats.MCMCIntervalPlot(MCMC_interval) plot_MCMC.SetTitle("Bayesian probability interval (Markov Chain)") plot_MCMC.Draw() dataCanvas.SaveAs("exercise_4.png")

- And one can print the intervals determined by the two methods:

#Now print the interval for mH for the two methods print "PLC interval is [", pl_Interval.LowerLimit(cross_psi), ", ", pl_Interval.UpperLimit(cross_psi), "]" print "Bayesian interval is [", MCMC_interval.LowerLimit(cross_psi), ", ", MCMC_interval.UpperLimit(cross_psi), "]" #PyROOT sometimes fails cleaning memory, this helps del plc

- The plot will look like this:

The complete exercise is available here:

In this exercise, we will consider a possible source of systematic uncertainty and how to deal with it. Imagine for example that our luminosity knowledge has a 10% systematic (this is a huge number that one would not expect commonly, but it is just the size we need to make the effect very visible in the exercise). How do we add this effect on our parameter of interest, `cross_psi`?

- Create a new program,
`exercise_6.py`that imports the workspace, similarly in the previous exercises:

import ROOT #Open the rootfile and get the workspace from the exercise_0 fInput = ROOT.TFile("Workspace_mumufit.root") ws = fInput.Get("ws") ws.Print() #You can set constant parameters that are known #If you leave them floating, the fit procedure will determine their uncertainty #Right now we will fix all the nuisance parameters just to speed up the computing time ws.var("meanJpsi").setConstant(1) ws.var("sigmaJpsi").setConstant(1) ws.var("alphaJpsi").setConstant(1) ws.var("nJpsi").setConstant(1) ws.var("meanpsi2S").setConstant(1) #now this is again constant, not anymore our parameter of interest! ws.var("Nbkg").setConstant(1) ws.var("a1").setConstant(1) ws.var("a2").setConstant(1) ws.var("a3").setConstant(1) #Now define the number of signal and background events NJpsi = ws.var("NJpsi") cross_psi = ws.var("cross_psi") lumi_psi = ws.var("lumi_psi") eff_psi = ws.var("eff_psi") Nbkg = ws.var("Nbkg")

- First of all, let's consider the baseline case of not having this additional uncertainty. We'll recalculate the significance of the excess, and then plot the likelihood scan of
`cross_psi`. We'll compare these with the result we'll obtain later with the uncertainty:

#Set the RooModelConfig and let it know what the content of the workspace is about model = ROOT.RooStats.ModelConfig() model.SetWorkspace(ws) model.SetPdf("totPDF") model.SetNuisanceParameters(ROOT.RooArgSet(ws.var("NJpsi"))) #Here we explicitly set the value of the parameters for the null. #We want no signal contribution, eg. cross_psi = 0 poi = ROOT.RooArgSet(cross_psi) nullParams = poi.snapshot() nullParams.setRealValue("cross_psi",0.) #Build the profile likelihood calculator plc = ROOT.RooStats.ProfileLikelihoodCalculator() plc.SetData(ws.data("data")) plc.SetModel(model) plc.SetParameters(poi) plc.SetNullParameters(nullParams) #We get a HypoTestResult out of the calculator, and we can query it. htr = plc.GetHypoTest() print "-------------------------------------------------" print "Without background uncertainty" print "The p-value for the null is ", htr.NullPValue() print "Corresponding to a signifcance of ", htr.Significance() print "-------------------------------------------------" #Now get the intervals to do some plots plc.SetConfidenceLevel(0.68) pl_Interval = plc.GetInterval() #Let's make a plot dataCanvas = ROOT.TCanvas("dataCanvas") dataCanvas.cd() plot_Interval = ROOT.RooStats.LikelihoodIntervalPlot(pl_Interval) plot_Interval.SetTitle("Profile Likelihood Ratio for cross_psi") plot_Interval.SetRange(4.,16.) plot_Interval.SetMaximum(10.) plot_Interval.Draw()

- In order to add the systematic uncertainty, we will partly reconstruct the total PDF of the problem, to add additional parameters to describe the uncertainty. First of all, recover from the workspace the J/ψ, ψ(2S) and background PDFs:

#We will partly redo exercise_0 to redefine the signal PDF to account for uncertainties on parameters #J/psi PDF cb_jpsi = ws.pdf("CBJpsi") #Signal PDF gauss2S = ws.pdf("gausspsi2S") #Background PDF backgroundPDF = ws.pdf("backgroundPDF")

- Now comes the difficult part. We will assume the uncertainty to be lognormally distributed. Basically, we will write lumi -> lumi*alpha_lumi, where alpha_lumi = kappa_lumi^beta_lumi, with beta_lumi Gaussian.

#Assume an uncertainty on the luminosity #Construct the uncertainty with a lognormal assumption kappa_lumi = ROOT.RooRealVar("kappa_lumi","Dimension of systematic variation",1.1) #10% systematic beta_lumi = ROOT.RooRealVar("beta_lumi","This is the real nuisance on lumi",0.,-5.,5.) alpha_lumi = ROOT.RooFormulaVar("alpha_lumi","pow(@0,@1)",ROOT.RooArgList(kappa_lumi,beta_lumi)) lumi_psi_nuis = ROOT.RooFormulaVar("lumi_psi_nuis","@0*@1",ROOT.RooArgList(lumi_psi,alpha_lumi)) #Now prepare a gaussian for the nuisance parameter, to be multiplied to the total PDF one = ROOT.RooRealVar("one","one",1.) lumi_syst = ROOT.RooRealVar("lumi_syst","The systematic uncertainty space on lumi",0.,-5.,5.) constr_lumi = ROOT.RooGaussian("constr_lumi","Lumi uncertainty constraint",beta_lumi,lumi_syst,one)

- Now write the new
`Npsi`that includes the new parametrization for the luminosity:

#Now write the new Npsi = xsec * eff * lumi * alpha_lumi Npsi_nuised = ROOT.RooFormulaVar("Npsi_nuised","@0*@1*@2", ROOT.RooArgList(eff_psi,lumi_psi_nuis,cross_psi))

- Now construct the total PDF, and multiply it by the Gaussian constrain on the nuisance:

#Now construct the "new" total PDF PDFtot_nuis_unconstr = ROOT.RooAddPdf("PDFtot_nuis_unconstr","PDFtot_nuis_unconstr",ROOT.RooArgList(cb_jpsi,gauss2S,backgroundPDF),ROOT.RooArgList(NJpsi,Npsi_nuised,Nbkg)) #Now add the gaussian constraint to the total PDF PDFtot_nuis = ROOT.RooProdPdf("PDFtot_nuis","PDFtot_nuis",ROOT.RooArgList(PDFtot_nuis_unconstr,constr_lumi))

- Go through the same procedure to recalculate the significance and plot the likelihood scan for the cross_psi parameter:

#We now have two PDFs: getattr(ws,'import')(PDFtot_nuis) model_nuis = ROOT.RooStats.ModelConfig() model_nuis.SetWorkspace(ws) model_nuis.SetPdf("PDFtot_nuis") model_nuis.SetGlobalObservables(ROOT.RooArgSet(ws.var("lumi_syst"))) model_nuis.SetNuisanceParameters(ROOT.RooArgSet(ws.var("beta_lumi"), ws.var("NJpsi"), ws.var("Nbkg"))) #Build the profile likelihood calculator with the Nbkg uncertainty plc_nuis = ROOT.RooStats.ProfileLikelihoodCalculator() plc_nuis.SetData(ws.data("data")) plc_nuis.SetModel(model_nuis) plc_nuis.SetParameters(poi) plc_nuis.SetNullParameters(nullParams) #We get a HypoTestResult out of the calculator, and we can query it. htr_nuis = plc_nuis.GetHypoTest() print "-------------------------------------------------" print "With background uncertainty" print "The p-value for the null is ", htr_nuis.NullPValue() print "Corresponding to a signifcance of ", htr_nuis.Significance() print "-------------------------------------------------" #Now get the intervals to do some plots plc_nuis.SetConfidenceLevel(0.68) pl_Interval_nuis = plc_nuis.GetInterval() #Let's make a plot plot_Interval_nuis = ROOT.RooStats.LikelihoodIntervalPlot(pl_Interval_nuis) plot_Interval_nuis.SetRange(4.,16.) plot_Interval_nuis.SetContourColor(ROOT.kRed) plot_Interval_nuis.Draw("SAME") dataCanvas.SaveAs("exercise_6.png") #PyROOT sometimes fails cleaning memory, this helps del plc del plc_nuis

- The output plot should be similar to this:

- Note that, according to Wilk's theorem, the significance Z should be approximately given by:

- You can verify this looking at the scan plot.

The complete exercise is available below:

We want to evaluate the confidence interval on a parameter of interest (the number of signal events) in a counting experiment.

- First of all, create a new program,
`exercise_7.py`. We will have to define a few variables: the number of observed events (the observable,`x`) and the number of expected events (which we will define as the sum of the number of expected signal and background events):

import ROOT #Make a simple model x = ROOT.RooRealVar("x","Number of observed events", 1.,0.,20.) mu = ROOT.RooRealVar("mu","The mu parameter", 2.5,0., 15.) #With a limit on mu>=0 b = ROOT.RooConstVar("b","", 3.) mean = ROOT.RooAddition("mean","", ROOT.RooArgList(mu,b)) pois = ROOT.RooPoisson("pois", "", x, mean) parameters = ROOT.RooArgSet(mu)

- Now let's generate a dataset, representing our experiment observation (in this case, this is just a number, the number of observed events):

#Create a toy dataset data = pois.generate(ROOT.RooArgSet(x), 1) data.Print("v") canvas = ROOT.TCanvas("dataCanvas") canvas.cd() frame = x.frame() data.plotOn(frame) frame.Draw() canvas.Update()

- Define the model:

w = ROOT.RooWorkspace() modelConfig = ROOT.RooStats.ModelConfig("poissonProblem",w) modelConfig.SetPdf(pois) modelConfig.SetParametersOfInterest(parameters) modelConfig.SetObservables(ROOT.RooArgSet(x)) w.Print()

- Initialize the Feldman-Cousins tools:

#Show use of Feldman-Cousins fc = ROOT.RooStats.FeldmanCousins(data,modelConfig) fc.SetTestSize(.05) #Set size of test fc.UseAdaptiveSampling(1) fc.AdditionalNToysFactor(10.) fc.FluctuateNumDataEntries(0) #Number counting analysis: dataset always has 1 entry with N events observed fc.SetNBins(200) #Number of points to test per parameter #Nse the Feldman-Cousins tool interval = fc.GetInterval() print "The interval is [", interval.LowerLimit(mu), ", ", interval.UpperLimit(mu), "]"

- Now also plot the result:

#No dedicated plotting class yet, so do it by hand: ROOT.gStyle.SetOptStat(0) parameterScan = fc.GetPointsToScan() #this is a RooDataHist hist = parameterScan.createHistogram("mu",30) hist.GetXaxis().SetTitle("#mu") canvas = ROOT.TCanvas() canvas.cd() hist.Draw() mark = [] #Loop over points to test for i in range(parameterScan.numEntries()) : print "on parameter point ", i, " out of ", parameterScan.numEntries() #Get a parameter point from the list of points to test. tmpPoint = parameterScan.get(i).clone("temp") mark.append(ROOT.TMarker(tmpPoint.getRealValue("mu"), 1, 25)) if (interval.IsInInterval(tmpPoint)): mark[i].SetMarkerColor(ROOT.kBlue) else: mark[i].SetMarkerColor(ROOT.kRed) mark[i].Draw("s") canvas.SaveAs("exercise_7.png") #PyROOT sometimes fails cleaning memory, this helps del fc

- The output plot should be similar to this:

The complete exercise is available below:

-- MarioPelliccioni - 2017-04-15

I | Attachment | History | Action | Size | Date | Who | Comment |
---|---|---|---|---|---|---|---|

root | DataSet.root | r1 | manage | 127.1 K | 2017-04-12 - 09:18 | MarioPelliccioni | Full 2010 statistics dataset for the dimuon mass distribution |

root | DataSet_lowstat.root | r1 | manage | 9.9 K | 2017-04-11 - 16:51 | MarioPelliccioni | Low statistics CMS dataset for the dimuon mass distribution |

png | Data_lowstat.png | r1 | manage | 13.0 K | 2017-04-11 - 16:52 | MarioPelliccioni | Low statistics data for the dimuon mass spectrum |

png | exercise_0.png | r1 | manage | 14.0 K | 2017-04-11 - 16:55 | MarioPelliccioni | Fit to the low statistics dimuon mass spectrum |

txt | exercise_0.py.txt | r1 | manage | 2.6 K | 2017-05-12 - 15:24 | MarioPelliccioni | Exercise 0: fit the dimuon invariant mass spectrum |

txt | exercise_0_p1.py.txt | r1 | manage | 3.4 K | 2017-05-19 - 12:21 | MarioPelliccioni | Exercise 0, but with the cross section |

txt | exercise_1.py.txt | r1 | manage | 1.8 K | 2017-04-28 - 10:20 | MarioPelliccioni | Exercise 1: use toy MC to test the fit results |

txt | exercise_2.py.txt | r1 | manage | 1.3 K | 2017-05-10 - 10:18 | MarioPelliccioni | Exercise 2: calculate the statistical significance of the excess |

png | exercise_3.png | r1 | manage | 17.0 K | 2017-04-28 - 16:26 | MarioPelliccioni | Comparison of different upper limit calculations |

txt | exercise_3.py.txt | r1 | manage | 3.8 K | 2017-05-12 - 15:49 | MarioPelliccioni | Exercise 3: determine the upper limit on Nsig in the low statistics case |

png | exercise_4.png | r1 | manage | 12.6 K | 2017-04-28 - 16:47 | MarioPelliccioni | Intervals on the psi cross section |

txt | exercise_4.py.txt | r1 | manage | 2.5 K | 2017-05-10 - 10:52 | MarioPelliccioni | Exercise 4: determine the interval on the psi cross section with full stat |

png | exercise_6.png | r1 | manage | 12.1 K | 2017-04-28 - 17:18 | MarioPelliccioni | Effect of systematic uncertainties on cross_psi |

txt | exercise_6.py.txt | r1 | manage | 5.2 K | 2017-05-10 - 11:04 | MarioPelliccioni | Exercise 6: deal with systematic uncertainties and nuisance parameters |

png | exercise_7.png | r1 | manage | 7.3 K | 2017-05-08 - 11:06 | MarioPelliccioni | A confidence interval using the Feldman-Cousins approach |

txt | exercise_7.py.txt | r1 | manage | 2.1 K | 2017-05-10 - 11:09 | MarioPelliccioni | Exercise 7: Feldman-Cousins in a counting experiment |

Topic revision: r17 - 2017-05-19 - MarioPelliccioni

**Webs**

- ABATBEA
- ACPP
- ADCgroup
- AEGIS
- AfricaMap
- AgileInfrastructure
- ALICE
- AliceEbyE
- AliceSPD
- AliceSSD
- AliceTOF
- AliFemto
- ALPHA
- Altair
- ArdaGrid
- ASACUSA
- AthenaFCalTBAna
- Atlas
- AtlasLBNL
- AXIALPET
- CAE
- CALICE
- CDS
- CENF
- CERNSearch
- CLIC
- Cloud
- CloudServices
- CMS
- Controls
- CTA
- CvmFS
- DB
- DefaultWeb
- DESgroup
- DPHEP
- DM-LHC
- DSSGroup
- EGEE
- EgeePtf
- ELFms
- EMI
- ETICS
- FIOgroup
- FlukaTeam
- Frontier
- Gaudi
- GeneratorServices
- GuidesInfo
- HardwareLabs
- HCC
- HEPIX
- ILCBDSColl
- ILCTPC
- IMWG
- Inspire
- IPv6
- IT
- ItCommTeam
- ITCoord
- ITdeptTechForum
- ITDRP
- ITGT
- ITSDC
- LAr
- LCG
- LCGAAWorkbook
- Leade
- LHCAccess
- LHCAtHome
- LHCb
- LHCgas
- LHCONE
- LHCOPN
- LinuxSupport
- Main
- Medipix
- Messaging
- MPGD
- NA49
- NA61
- NA62
- NTOF
- Openlab
- PDBService
- Persistency
- PESgroup
- Plugins
- PSAccess
- PSBUpgrade
- R2Eproject
- RCTF
- RD42
- RFCond12
- RFLowLevel
- ROXIE
- Sandbox
- SocialActivities
- SPI
- SRMDev
- SSM
- Student
- SuperComputing
- Support
- SwfCatalogue
- TMVA
- TOTEM
- TWiki
- UNOSAT
- Virtualization
- VOBox
- WITCH
- XTCA

Welcome Guest

Copyright &© 2008-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.

or Ideas, requests, problems regarding TWiki? use Discourse or Send feedback

or Ideas, requests, problems regarding TWiki? use Discourse or Send feedback