TWiki> Main Web>WebPreferences>INFNStatRooStats (2015-05-25, MarioPelliccioni)

# INFN SCHOOL OF STATISTICS 2015

The goal of the exercise is to get acquainted with few of the statistical analysis practices used in HEP. Introductory slides will present the main concepts related to test statistics, significance evaluation and upper limit settings in both the frequentist and Bayesian approaches. The exercise introduces the main features of RooFit and RooStats frameworks to define a probabilistic model of the physics problem and to adopt the main statistical method for significance and confidence limit evaluations.

## Summary

The exercise session introduces methods and tools for tackling very common problems in statistics while complying with rigorous standards set in today's experimental particle physics. The main topics are finding a confidence interval for a parameter of interest, and estimating statistical significance of an observation. We cover the most popular techniques: profile likelihood, Bayesian. Participants will learn the main features of RooFits and RooStats software frameworks for statistics and data modeling. We overview the popular cases of a hunting a signal peak over an exponential background. The exercise session will last about 4 hours.

## Computing environment

All exercises should run on any ROOT installation containing also the RooFit libraries (RooStats is included in the RooFit installation).

On your laptop you obtain this by passing the --enable-roofit on the configure step of the installation. If you have a debian based account, you just need to install the libRooFit library.

Alternatively, if you have a CERN account, most of the central ROOT installations contain RooFit.

ROOT macros can be ran using the following command from the shell prompt:

root macro.C

In order to load RooFit and/or RooStats libraries witih a ROOT macro, the following lines should be added at the begin of the macro (macro.c, in the previous command line example):

gSystem->Load("libRooFit");


## Exercise #0: set up the PDF model using RooFit

First of all, create a skeleton for a ROOT macro, call it, for instance, exercise_0.C, and load the RooFit library (RooStats will not be needed, for the moment):

//load the RooFit library, use the RooFit namespace
using namespace RooFit;

void exercise_0() {
// . . .
}

  //First, define the observable for the analysis: the particle's reconstructed mass
RooRealVar mass("mass", "mass", 100., 150.);

//Define mean and sigma for signal PDF
RooRealVar mean("mean", "mean of gaussian", 125., 110., 140.);
RooRealVar sigma("sigma", "width of gaussian", 2., 0.01, 10.);

    //Construct the signal P.D.F., a gaussian function
RooGaussian gauss("gauss", "Signal PDF", mass, mean, sigma);

   //Now define the background P.D.F, a simple exponential
RooRealVar tau("tau", "exponential function parameter", -0.05, -10., -0.001);
RooExponential exponential("exponential", "Background PDF", mass, tau);


//Define the number of signal and background events in the model
RooRealVar Nsig("Nsig", "Number of signal events", 5., 0., 100.);
RooRealVar Nbkg("Nbkg", "Number of background events", 100., 0., 1000.);

  //Now construct the total PDF
RooAddPdf PDFtot("PDFtot", "PDFtot", RooArgList(gauss, exponential), RooArgList(Nsig, Nbkg));

  //Now generate a sample with the total PDF
RooDataSet *data = PDFtot.generate(RooArgSet(mass), NumEvents(Nsig.getVal()+Nbkg.getVal()), Extended(1));

• As we have the dataset ready, we can fit it to the PDF model in order to extract the desired parameters. In this case we use a very low signal yield, so the fit is unable to determine the parameter mean and sigma which are kept constant in the fit to their nominal values:
  //Now fit the PDF to the data
//For low statistics, fix the mean and the width
mean.setConstant(kTRUE);
sigma.setConstant(kTRUE);
PDFtot.fitTo(*data);

// Print values of mean and sigma (that now reflect fitted values and errors, unless you fixed them)
mean.Print();
sigma.Print();

  //Now plot the data and the fitted PDF
RooPlot *massframe = mass.frame(50);
data->plotOn(massframe);
PDFtot.plotOn(massframe);

//One can also plot the single components of the total PDF, like the background component
PDFtot.plotOn(massframe, Components(exponential), LineStyle(kDashed), LineColor(kRed));

//Actually plot the result
TCanvas c1;
c1.cd();
massframe->Draw();
c1.SaveAs("exercise_0.pdf");


• The result will look like the following: • If you wish, you can now modify the signal yield and check the result.

  //Now save the data and the PDF into a Workspace, for later use for statistical analysis
//and save the workspace into a ROOT file
RooWorkspace* w = new RooWorkspace("w") ;
w->import(*data);
w->import(PDFtot);

TFile fOut("exercise_0.root","RECREATE");
fOut.cd();
w->Write();
fOut.Close();

• Done this, you are ready for next exercise.

The entire exercise macro is available below:

## Exercise #1: compute the signal significance

• First of all, create a skeleton for a ROOT macro, call it, for instance, exercise_1.C, and load both the RooFit and RooStats libraries:
gSystem->Load("libRooFit");

using namespace RooFit;
using namespace RooStats;

void exercise_1() {
// . . .
}

• Import the workspace from the file produced in the previous exercise:
  //Open the rootfile and get the workspace from the exercise_0
TFile fIn("exercise_0.root");
fIn.cd();
RooWorkspace *w = (RooWorkspace*) fIn.Get("w");

  //You can set constant parameters that are known
//If you leave them floating, the fit procedure will determine their uncertainty
w->var("mean")->setConstant(kTRUE);
w->var("sigma")->setConstant(kTRUE);
w->var("tau")->setConstant(kTRUE);
w->var("Nbkg")->setConstant(kTRUE);

  //Set the RooModelConfig and let it know what the content of the workspace is about
ModelConfig model;
model.SetWorkspace(*w);
model.SetPdf("PDFtot");

• Let's access the parameter Nsig from the workspace and take a "snapshot" (i.e.: a clone). We want to set Nsig=0 in order to determine the signal significance, measured from the p-value corresponding to the background-only hypothesis.
  // here we explicitly set the value of the parameters for the null.
// We want no signal contribution, eg. Nsig = 0
RooRealVar* Nsig = w->var("Nsig");
RooArgSet poi(*Nsig);
RooArgSet *nullParams = (RooArgSet*) poi.snapshot();
nullParams->setRealValue("Nsig", 0.);

  ///Build the profile likelihood calculator
ProfileLikelihoodCalculator plc;

plc.SetData(*(w->data("PDFtotData")));
plc.SetModel(model);
plc.SetParameters(poi);
plc.SetNullParameters(*nullParams);

/// We get a HypoTestResult out of the calculator, and we can query it
HypoTestResult* htr = plc.GetHypoTest();
cout << "-------------------------------------------------" << endl;
cout << "The p-value for the null is " << htr->NullPValue() << endl;
cout << "Corresponding to a signifcance of " << htr->Significance() << endl;
cout << "-------------------------------------------------\n\n" << endl;

• If you wish, you can change the signal yield in the exercise #0 and compute the significance for different signal yields.

The complete macro from this exercise is available below:

## Exercise #2: compute upper limits to the signal yield using frequentist and Bayesian methods

• Create a skeleton for a ROOT macro, call it, for instance, exercise_2.C, and load both the RooFit and RooStats libraries, and as before import the workspace created with the exercise #0:
gSystem->Load("libRooFit");

using namespace RooFit;
using namespace RooStats;

void exercise_2() {
//Open the rootfile and get the workspace from the exercise_0
TFile fIn("exercise_0.root");
fIn.cd();
RooWorkspace *w = (RooWorkspace*)fIn.Get("w");

//You can set constant parameters that are known
//If you leave them floating, the fit procedure will determine their uncertainty
w->var("mean")->setConstant(kTRUE);
w->var("sigma")->setConstant(kTRUE);
w->var("tau")->setConstant(kTRUE);
}

  //Configure the model, we need both the S+B and the B only models
ModelConfig sbModel;
sbModel.SetWorkspace(*w);
sbModel.SetPdf("PDFtot");
sbModel.SetName("S+B Model");
RooRealVar* poi = w->var("Nsig");
poi->setRange(0.,30.);    //this is mostly for plotting
sbModel.SetParametersOfInterest(*poi);

ModelConfig *bModel = (ModelConfig*) sbModel.Clone();
bModel.SetPdf("PDFtot");
bModel->SetName(TString(sbModel.GetName())+TString("_with_poi_0"));
poi->setVal(0);
bModel->SetSnapshot(*poi);

  FrequentistCalculator fc(*(w->data("PDFtotData")), *bModel, sbModel);
fc.SetToys(2000, 1000);

//Create hypotest inverter passing the desired calculator
HypoTestInverter calc(fc);

// set confidence level (e.g. 95% upper limits)
calc.SetConfidenceLevel(0.95);

//use CLs
calc.UseCLs(true);

//reduce the noise
calc.SetVerbose(false);

//configure ToyMC Sampler
ToyMCSampler *toymcs = (ToyMCSampler*)calc.GetHypoTestCalculator()->GetTestStatSampler();

//profile likelihood test statistics
ProfileLikelihoodTestStat profll(*(sbModel.GetPdf()));

//for CLs (bounded intervals) use one-sided profile likelihood
profll.SetOneSided(true);

//set the test statistic to use
toymcs->SetTestStatistic(&profll);

int npoints = 15;  // number of points to scan
// min and max (better to choose smaller intervals)
double poimin = poi->getMin();
double poimax = poi->getMax();

cout << "Doing a fixed scan  in interval : " << poimin << " , " << poimax << endl;
calc.SetFixedScan(npoints,poimin,poimax);

HypoTestInverterResult * r = calc.GetInterval();
double upperLimit = r->UpperLimit();

cout << "################" << endl;
cout << "The computed CLs upper limit is: " << upperLimit << endl;

//Compute expected limit
cout << "Expected upper limits, using the B (alternate) model : " << endl;
cout << " expected limit (median) " << r->GetExpectedUpperLimit(0) << endl;
cout << " expected limit (-1 sig) " << r->GetExpectedUpperLimit(-1) << endl;
cout << " expected limit (+1 sig) " << r->GetExpectedUpperLimit(1) << endl;

  //Example using the BayesianCalculator
//Now we also need to specify a prior in the ModelConfig
//To be quicker, we'll use the PDF factory facility of RooWorkspace
//NB!! For simplicity, we are using a flat prior, but this doesn't mean it's the best choice!
w->factory("Uniform::prior(Nsig)");
sbModel.SetPriorPdf(*w->pdf("prior"));

//Construct the bayesian calculator
BayesianCalculator bc(*(w->data("PDFtotData")), sbModel);
bc.SetConfidenceLevel(0.95);
bc.SetLeftSideTailFraction(0.); // for upper limit
SimpleInterval* bcInt = bc.GetInterval();

//Now let's see what the bayesian limit is
cout << "Bayesian upper limit on Nsig = " << bcInt->UpperLimit() << endl;

• Plot the results of the two methods:
  // plot now the result of the scan
//First the CLs
HypoTestInverterPlot *plot = new HypoTestInverterPlot("HTI_Result_Plot","HypoTest Scan Result",r);
//Then the Bayesian posterior
RooPlot *bcPlot = bc.GetPosteriorPlot();

// plot in a new canvas with style
TCanvas dataCanvas("dataCanvas");
dataCanvas.Divide(2, 1);
dataCanvas.SetLogy(false);
dataCanvas.cd(1);
plot->Draw("2CL");
dataCanvas.cd(2);
bcPlot->Draw();

dataCanvas.SaveAs("exercise_2.pdf");


• The plot will look like the following: The complete macro from this exercise is available below:

## Exercise #3: determine the mass of the new particle

• First of all, return to the macro exercise_0.C, and increase the signal yield from 5 to 50 and the background yield from 100 to 450, and rerun the macro. This will produce a more visible signal.
• Then rerun exercise_1.C and notice that now the significance is above 6 standard deviations
• Create a new macro exercise_3.C that imports the workspace, as in the previous exercises. This time, all parameters, except mean, are declared as constants:

void exercise_3() {
//Open the rootfile and get the workspace from the exercise_0
TFile fIn("exercise_0.root");
fIn.cd();
RooWorkspace *w = (RooWorkspace*)fIn.Get("w");

//You can set constant parameters that are known
//If you leave them floating, the fit procedure will determine their uncertainty
w->var("mean")->setConstant(kFALSE);   //don't fix the mean, it's what we want to know the interval for!
w->var("sigma")->setConstant(kTRUE);
w->var("tau")->setConstant(kTRUE);
w->var("Nsig")->setConstant(kTRUE);
w->var("Nbkg")->setConstant(kTRUE);

Float_t confidenceLevel = 0.68;
}

• Now, set the ModelConfig and mean as parameter of interest (poi):

//Set the RooModelConfig and let it know what the content of the workspace is about
ModelConfig model;
model.SetWorkspace(*w);
model.SetPdf("PDFtot");

//Let the model know what is the parameter of interest
RooRealVar* mean = w->var("mean");
mean->setRange(120., 130.);   //this is mostly for plotting reasons
RooArgSet poi(*mean);

  //Build the profile likelihood calculator
ProfileLikelihoodCalculator plc;
plc.SetData(*(w->data("PDFtotData")));
plc.SetModel(model);
plc.SetParameters(poi);
plc.SetConfidenceLevel(confidenceLevel);

//Get the interval
LikelihoodInterval* plInt = plc.GetInterval();
cout << "PLC interval is [" << plInt->LowerLimit(*mean) << ", " <<
plInt->UpperLimit(*mean) << "]" << endl;

• As alternative method, the confidence interval can be computed using the posterior Bayesian PDF:
  //Now let's do the same for the Bayesian Calculator
//Now we also need to specify a prior in the ModelConfig
//To be quicker, we'll use the PDF factory facility of RooWorkspace
//NB!! For simplicity, we are using a flat prior, but this doesn't mean it's the best choice!
w->factory("Uniform::prior(mean)");
model.SetPriorPdf(*w->pdf("prior"));

//Construct the bayesian calculator
BayesianCalculator bc(*(w->data("PDFtotData")), model);
bc.SetConfidenceLevel(confidenceLevel);
bc.SetParameters(poi);
SimpleInterval* bcInt = bc.GetInterval();
cout << "Bayesian interval is [" << bcInt->LowerLimit() << ", " <<
bcInt->UpperLimit() << "]" << endl;

• Finally, plots for both methods can be produced:
  TCanvas dataCanvas("dataCanvas");
dataCanvas.Divide(2, 1);
dataCanvas.cd(1);

LikelihoodIntervalPlot plotInt((LikelihoodInterval*)plInt);
plotInt.SetTitle("Profile Likelihood Ratio and Posterior for mH");
plotInt.SetMaximum(3.);
plotInt.Draw();

dataCanvas.cd(2);
RooPlot *bcPlot = bc.GetPosteriorPlot();
bcPlot->Draw();

dataCanvas.SaveAs("exercise_3.pdf");

• The plot will look like the one below: The complete macro from this exercise is available below:

## Exercise #4: add nuisance parameters (systematics)

• First of all, return to the macro exercise_0.C, and increase the signal yield from to 20. For plotting reasons, you may want to decrease the maximum range of Nsig to 100. Also, increase the sigma from 2 to 6, in order to have a broader signal, and rerun the macro.
• Create a new macro exercise_4.C that imports the workspace, similarly in the previous exercises:

void exercise_4() {
//Open the rootfile and get the workspace from the exercise_0
TFile fIn("exercise_0.root");
fIn.cd();
RooWorkspace *w = (RooWorkspace*)fIn.Get("w");

//You can set constant parameters that are known
//If you leave them floating, the fit procedure will determine their uncertainty
w->var("mean")->setConstant(kTRUE);
w->var("sigma")->setConstant(kTRUE);
w->var("tau")->setConstant(kTRUE);
w->var("Nbkg")->setConstant(kTRUE);

//We will partly redo exercise_0 to redefine the signal PDF to account for uncertainties on parameters
//Signal PDF
RooGaussian *gauss = (RooGaussian*)w->pdf("gauss");

//Background PDF
RooExponential *exponential = (RooExponential*)w->pdf("exponential");

//Now define the number of signal and background events
RooRealVar *Nsig = w->var("Nsig");
RooRealVar *Nbkg = w->var("Nbkg");
}

• Now, declare the parameters and PDF needed to model a Gaussian (could also be log-normal...) PDF, that will be used as model for the background yield uncertainty:
  //Assume an uncertainty on the number of background events
//Construct the uncertainty with a lognormal assumption
RooRealVar Nbkg_alpha("Nbkg_alpha", "Dimension of systematic variation of Nbkg", 1., 0.01, 10.);
RooFormulaVar Nbkg_nuis("Nbkg_nuis", "@0*@1", RooArgList(*Nbkg, Nbkg_alpha));

//Now prepare a gaussian for the nuisance parameter, to be multiplied to the total PDF
RooRealVar one("one", "one", 1.);
RooRealVar Nbkg_syst("Nbkg_syst", "The systematic uncertainty on Nbkg", 0.3);    //30% uncertainty
RooGaussian constr_Nbkg("constr_Nbkg", "Background uncertainty constraint", one, Nbkg_alpha, Nbkg_syst);

• Define the new total PDF model, including the nuisance parameter, and import it into the workspace:
  //Now construct the total PDF
RooAddPdf PDFtot_nuis_unconstr("PDFtot_nuis_unconstr", "PDFtot_nuis_unconstr", RooArgList(*gauss, *exponential), RooArgList(*Nsig, Nbkg_nuis));

//Now add the gaussian constraint to the total PDF
RooProdPdf PDFtot_nuis("PDFtot_nuis", "PDFtot_nuis", RooArgList(PDFtot_nuis_unconstr, constr_Nbkg));

PDFtot_nuis.fitTo(*w->data("PDFtotData"), RooFit::Constrain(RooArgSet(Nbkg_alpha)), Extended(1));

//We now have two PDFs and two datasets:
w->import(PDFtot_nuis);

• Configure the model, as before, but using the new total PDF.
  //Set the RooModelConfig and let it know what the content of the workspace is about
ModelConfig model;
model.SetWorkspace(*w);
model.SetPdf("PDFtot");

ModelConfig model_nuis;
model_nuis.SetWorkspace(*w);
model_nuis.SetPdf("PDFtot_nuis");

// here we explicitly set the value of the parameters for the null.
// We want no signal contribution, eg. Nsig = 0
RooRealVar* Nsig = w->var("Nsig");
RooArgSet poi(*Nsig);
RooArgSet *nullParams = (RooArgSet*) poi.snapshot();
nullParams->setRealValue("Nsig",0.);

• Set up two profile-likelihood calculator, one with and another without the nuisance parameter. Print the resulting significance in the two cases:
  //Build the profile likelihood calculator
ProfileLikelihoodCalculator plc;
plc.SetData(*(w->data("PDFtotData")));
plc.SetModel(model);
plc.SetParameters(poi);
plc.SetNullParameters(*nullParams);

// We get a HypoTestResult out of the calculator, and we can query it
HypoTestResult* htr = plc.GetHypoTest();

//Build the profile likelihood calculator with the Nbkg uncertainty
ProfileLikelihoodCalculator plc_nuis;
plc_nuis.SetData(*(w->data("PDFtotData")));
plc_nuis.SetModel(model_nuis);
plc_nuis.SetParameters(poi);
plc_nuis.SetNullParameters(*nullParams);

// We get a HypoTestResult out of the calculator, and we can query it
HypoTestResult* htr_nuis = plc_nuis.GetHypoTest();

cout << "-------------------------------------------------" << endl;
cout << "Without background uncertainty" << endl;
cout << "The p-value for the null is " << htr->NullPValue() << endl;
cout << "Corresponding to a signifcance of " << htr->Significance() << endl;
cout << "-------------------------------------------------\n\n" << endl;

cout << "-------------------------------------------------" << endl;
cout << "With background uncertainty" << endl;
cout << "The p-value for the null is " << htr_nuis->NullPValue() << endl;
cout << "Corresponding to a signifcance of " << htr_nuis->Significance() << endl;
cout << "-------------------------------------------------\n\n" << endl;

• Plot the test-statistics scan for the two cases:
  //Now get the intervals to do some plots
plc.SetConfidenceLevel(0.68);
plc_nuis.SetConfidenceLevel(0.68);
LikelihoodInterval* plInt = plc.GetInterval();
LikelihoodInterval* plInt_nuis = plc_nuis.GetInterval();

// Let's make a plot
TCanvas dataCanvas("dataCanvas");
dataCanvas.cd();

LikelihoodIntervalPlot plotInt((LikelihoodInterval*)plInt);
plotInt.SetTitle("Profile Likelihood Ratio for Nsig");
plotInt.SetMaximum(10.);
plotInt.Draw();

LikelihoodIntervalPlot plotInt_nuis((LikelihoodInterval*)plInt_nuis);
plotInt_nuis.SetContourColor(kRed);
plotInt_nuis.Draw("SAME");

dataCanvas.SaveAs("exercise_4.pdf");

• The output plot should be similar to the one below: • Note that, according to Wilk's theorem, the significance Z should be approximately given by: • You can verify this looking at the scan plot.

The complete macro from this exercise is available below:

-- MarioPelliccioni - 2015-05-19

Topic attachments
I Attachment History Action Size Date Who Comment c exercise_0.C r1 manage 2.6 K 2015-05-19 - 14:50 MarioPelliccioni gif exercise_0.gif r1 manage 11.0 K 2015-05-19 - 14:50 MarioPelliccioni c exercise_1.C r1 manage 1.9 K 2015-05-19 - 14:53 MarioPelliccioni c exercise_2.C r1 manage 4.2 K 2015-05-19 - 15:07 MarioPelliccioni gif exercise_2.gif r1 manage 13.4 K 2015-05-19 - 15:07 MarioPelliccioni c exercise_3.C r1 manage 2.9 K 2015-05-19 - 15:13 MarioPelliccioni gif exercise_3.gif r1 manage 11.8 K 2015-05-19 - 15:13 MarioPelliccioni c exercise_4.C r1 manage 4.9 K 2015-05-19 - 15:31 MarioPelliccioni gif exercise_4.gif r1 manage 8.4 K 2015-05-19 - 15:31 MarioPelliccioni
Topic revision: r2 - 2015-05-25 - MarioPelliccioni Log In

Webs

Welcome Guest  Cern Search TWiki Search Google Search Main All webs Copyright &© 2008-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
or Ideas, requests, problems regarding TWiki? use Discourse or Send feedback