RooStats tutorials
A first set of RooStats tutorials were given at CERN on October 15-16th 2009: (agenda:
http://indico.cern.ch/conferenceDisplay.py?confId=68912
). Below is given some materials used during the RooStats hands-on sessions. The RooFit tutorials are available
here
.
Short introduction to RooStats
RooStats is a software package relying on RooFit for the description of model and data which in turn provides utilities for fitting, plotting, Monte-Carlo studies, etc.
On top of RooFit, the RooStats package provides:
- tools to perform a statistical data analysis with various approaches coded in:
-
ProfileLikelihoodCalculator
,
-
HybridCalculator
,
-
HypoTestInvertor
,
-
BayesianCalculator
,
-
MCMCCalculator
,
-
NeymanConstruction
and FeldmanCousins
,
- etc.
- and utilities such as:
-
HLFactory
, ModelConfig
, BernsteinCorrection
, SPlot
, etc.
The RooStats package has been designed in common between Atlas (K. Cranmer), CMS (G. Schott), RooFit (W. Verkerke) and ROOT (L. Moneta) and is available from the ROOT releases since version 5.22. As of October 2009, the project is getting maturity, most of the for seen methods are included in ROOT version 5.25.02 while a few more developments are still planned. The link to the project web page is:
https://twiki.cern.ch/twiki/bin/view/RooStats/WebHome
Getting started with the software
A few advice words before starting. Remember to:
- work locally on your laptop (if possible) to reduce the WLAN usage,
- compile your macros and run with:
.L macro.C+
then macro();
,
- include:
using namespace RooFit; using namespace RooStats;
,
- avoid running a macro multiple times in the same ROOT CINT sessions (it might unfortunately crash sometimes if you do).
Setting ROOT 5.25.02 using the CERN AFS installation:
- on SLC4:
/afs/cern.ch/sw/lcg/app/releases/ROOT/5.25.02/slc4_amd64_gcc34/root/bin/thisroot.csh
or .sh
- on SLC5:
/afs/cern.ch/sw/lcg/app/releases/ROOT/5.25.02/x86_64-slc5-gcc34-opt/root/bin/thisroot.csh
or .sh
Some links to documentation:
Suggestions of exercises for this tutorial
Below will given code snippets showing how to try different analysis models:
- Poisson likelihood model: let's assume you observe 3 events in data while expecting 1 background event. We do not consider systematic uncertainties to start with. This problem could in principle be solved analytically but we'll use RooStats to solve it (we'll later make this analysis more complicated and then getting an analytical solution is not anymore easy while it is still with RooStats).
- Observable-based analysis: 10 Gaussian-distributed signal event over 100 flat-distributed background. The "mass" observable is distributed in range [0 - 500] and the signal has mean 250 and sigma 15. The prior is assumed Gaussian distributed.
- Assume on the above model a systematic uncertainty: 50% uncertainty on the background yield and/or 33% uncertainty on the sigma in the signal Gaussian model
- Covered tomorrow: combination of analyses
Compute:
- 68% CL 2-sided confidence interval and significance from the (profiled-) likelihood ratio
- 95% CL upper-limit with the (profile-) likelihood method
- Frequentist p-value in the S+B and B-only hypotheses, signal significance, CL_S ratio
- Covered tomorrow:
HypoTestInvertor
and BayesianCalculator
RooStats tutorials macro (an updated version is provided below) give the solution. They work in two steps:
- First specify the model; run for example with:
rs500d_PrepareWorkspace_GaussOverFlat_withSystematics("WS_GaussOverFlat_withSystematics.root",0)
- Different types of data generation are covered: binned/unbinned distributions, with Poisson-fluctuating or fixed (total or bin-by-bin) number of events
- Secondly choose one of the macros applying statistical methods; for example:
rs501_ProfileLikelihoodCalculator_limit("WS_GaussOverFlat_withSystematics.root")
Tutorials you could use are:
-
rs500a
to rs500d
for the model PDF specification,
-
rs501
and rs502
for the likelihood approach,
-
rs505
for the classical hypothesis test,
-
rs701
for the Bayesian credibility interval (tomorrow),
-
rs801
for classical upper limit (HypoTestInvertor
class).
Other tutorials go into more details but those above have been optimized to make them straightforward.
The updated version of the above tutorials is available from:
http://cern.ch/schott/public/updated_roostats_tutorials.zip
On linux, unzip the files with:
unzip updated_roostats_tutorials.zip
Specifying your analysis model
The model is handled by RooFit and in the examples below, the RooFit::RooWorkspace factory is used for that. A few examples are given:
Poisson likelihood model
// Use a RooWorkspace to store the PDF models, prior informations, list of parameters, ...
RooWorkspace myWS("myWS");
// Number of signal and background events
myWS.factory("S[2,0,10]"); // default value 2 and range [0,10]
myWS.factory("B[1]"); // value fixed to 1
// Observable (dummy observable used in case of Poisson likelihood function)
myWS.factory("x[0,1]"); // arbitrary range [0,1]
myWS.var("x")->setBins(1);
// Signal and background distribution of the observable
myWS.factory("Uniform::sigPdf(x)");
myWS.factory("Uniform::bkgPdf(x)");
// S+B and B-only models (both extended PDFs)
myWS.factory("SUM::model(S*sigPdf,B*bkgPdf");
myWS.factory("ExtendPdf::modelBkg(bkgPdf,B)");
// Set random seed to make the generation reproducible
(RooRandom::randomGenerator())->SetSeed(4357);
// generate binned data with fixed number of events
RooAbsData* data = myWS.pdf("model")->generateBinned(*myWS.var("x"),myWS.var("S")->getVal()+myWS.var("B")->getVal(),Name("data"));
Data generated from the Poisson likelihood model
Model with one discriminating variable: Gaussian signal over flat background
// Observable
myWS.factory("mass[0,500]"); // range [0,500]
// Signal and background distribution of the observable
myWS.factory("Gaussian::sigPdf(mass,250,15)");
myWS.factory("Uniform::bkgPdf(mass)");
Data generated from the GaussOverFlat likelihood model
Adding nuisance parameters
// set B and sigma as a non-constant variable ans specify a range
myWS->var("B")->setConstant(false);
myWS->var("B")->setRange(0,200);
myWS->var("sigSigma")->setConstant(false);
myWS->var("sigSigma")->setRange(0,30);
// Other prior PDFs
myWS.factory("Gaussian::prior_sigSigma(sigSigma,15,5)") ;
myWS.factory("Gaussian::prior_B(B,100,50)");
myWS.factory("PROD::priorNuisance(prior_sigSigma,prior_B)");
Using the RooStats statistical methods
ProfileLikelihoodCalculator
Implement an instance of ProfileLikelihoodCalculator
Specify the data, the model and a list of parameters of interest. Set the size of the confidence interval:
// Set up the ProfileLikelihoodCalculator
ProfileLikelihoodCalculator plc(*data,*model,*POI);
// ProfileLikelihoodCalculator usually make intervals: the 95% CL one-sided upper-limit is the same as the two-sided upper-limit of a 90% CL interval
plc.SetTestSize(0.10);
Run and retrieve the results
Confidence interval:
model->fitTo(*data,SumW2Error(kFALSE));
LikelihoodInterval* interval = plc.GetInterval();
model->fitTo(*data,SumW2Error(kFALSE));
const double lowerLimit = interval->LowerLimit(*parameterOfInterest);
const double upperLimit = interval->UpperLimit(*parameterOfInterest);
Significance:
// Create a copy of the POI parameters to set the values to zero
RooArgSet nullparams;
nullparams.addClone(*parameterOfInterest);
((RooRealVar *) (nullparams.first()))->setVal(0);
plc.SetNullParameters(nullparams);
HypoTestResult* plcResult = plc.GetHypoTest();
const double significance = plcResult->Significance();
Making a plot of the results
LikelihoodIntervalPlot plot(interval);
plot.Draw();
There is currently a problem with profile-likelihood plots in presence of systematic uncertainty; the interval gets out right but the plot doesn't. Below is a patch to get the correct plot until this is fixed in RooStats:
// patch to profile-likelihood plot with systematics
TF1 * tmp = interval->GetLikelihoodRatio()->asTF(*parameterOfInterest);
// clone to sample the PL calculation
TF1 * f1 = (TF1 *) tmp->Clone();
f1->Draw("SAME");
ProfileLikelihoodCalculator 95% CL upper limit
HybridCalculator
HybridCalculator
performs sampling of the selected test statistics in both S+B and B-only hypotheses. Then it integrates this distribution to compute the p-values or the CL_S ratio.
Bayesian prior on nuisance parameters are integrated out (using a Monte-Carlo technique).
Other features are not documented here, such as: switching test statistics, merging results run independently on batch, etc.
Implement an instance of HybridCalculator
Specify the data and the null and alternative hypotheses models. More options can be specified later.
See the rs201 macro for more details (but not today).
HybridCalculator hc("hc","HybridCalculator",*data,*modelSB,*modelB);
hc.SetNumberOfToys(5000);
hs.SetTestStatistics(1);
Run and retrieve the results
HybridResult* hcResult = hc.GetHypoTest();
double p_value_sb = hcResult->AlternatePValue();
double p_value_b = hcResult->NullPValue();
double cl_s = hcResult->CLs();
double significance = hcResult->Significance();
Making a plot of the results
HybridPlot* hcPlot = hcResult->GetPlot("hcPlot","p-Values plot",100);
hcPlot->Draw();
Distribution of test statistics (likelihood-ratio) with HybridCalculator from the Poisson likelihood model example
Distribution of test statistics (likelihood-ratio) with HybridCalculator from the GaussOverFlat likelihood model example
HypoTestInvertor
The
HypoTestInvertor
in ROOT 5.25.02 is using
HybridCalculatorResult
for different value of the parameter of interest until it identifies the value of the parameter of interest such as the alternate (ie. S+B) hypothesis is excluded at a given confidence level. In the next ROOT version
HypoTestInvertor
will be extended to also invert hypothesis tests from
ProfileLikelihoodCalculator
. The inversion is based on the CL_S ratio estimated by each hypothesis test. Usage of CL_SB has been make configurable for the next future release.
Patch needed for ROOT 5.25.02
In the rush for the ROOT release one line was forgotten in the constructor. A minor change fixes it for ROOT 5.25.02 (it has been fixed for the next release).
namespace Temp {
class HypoTestInvertor : public RooStats::HypoTestInvertor {
public:
HypoTestInvertor(const char * name, const char * title, RooStats::HypoTestCalculator * calc, RooRealVar * var) :
RooStats::HypoTestInvertor(name, title, calc, var){}
virtual void SetModel( const RooStats::ModelConfig & ){}
ClassDef(Temp::HypoTestInvertor,1);
};
}
This patch requires to run macros in compiled mode exclusively.
Configuring the HypoTestInvertor
One starts by instantiating and configuring a
HybridCalculator
object:
HybridCalculator myhc("myhc","HybridCalculator",*data, totPdf, bkgPdf);
myhc.SetTestStatistics(2);
myhc.SetNumberOfToys(1000);
Implement an instance of HypoTestInvertor
Use the patch version of
HypoTestInvertor
in the
Temp
namespace.
Temp::HypoTestInvertor myInvertor("myInvertor","",&myhc,¶meterOfInterest);
myInvertor.SetTestSize(0.05);
Three types of running modes are currently available:
- run hypothesis tests point-by-point:
myInvertor.RunOnePoint(3.9);
- run hypothesis in an interval for equally-spaced points:
myInvertor.RunFixedScan(5,1,6);
here the interval is [1,6] and 5 points will be tested
- run hypothesis in auto mode until a target value is found close to the test size that had been set:
myInvertor.RunAutoScan(1,6,0.005);
here the interval is [1,6] and points will be tested until one is found with CL_S = 0.050 +/- 0.005.
It is possible to refine a result by running manually further points once you have a rough idea of the upper limit. At least two scan points are needed in order to start estimating the location of the upper/lower limit.
Perform the hypothesis test inversion and plot the results
HypoTestInvertorResult* results = myInvertor.GetInterval();
HypoTestInvertorPlot myInvertorPlot("myInvertorPlot","",results);
TGraph* gr1 = myInvertorPlot.MakePlot();
The following plot show an example of hypothesis test inversion based on CL_SB assuming 1 observed event and 1 expected background event. From the PDG 2008 table 32.3, the upper limit is expected to be 4.74 as was obtained with RooStats (substracting the 1 background event you get 3.74 on the plot below).
HypoTestInvertorPlot overlayed to expected p-value as a function of the signal yield
BayesianCalculator
The
BayesianCalculator
class implements the Bayesian theorem. It also return a confidence interval for one single parameter of interest
Suggestions for the exercises:
- You can compare the limits obtained by
BayesianCalculator
to those from HybridCalculator
on the same model for both upper and lower limits.
- You may also like to compare results with different prior on the parameter of interest (Uniform in S or 1/S, ...)
Implement an instance of HypoTestInvertor
The
BayesianCalculator
constructor inputs: the data, the PDF model, the list of parameters of interest and their prior PDF on the parameter of interest and nuisance parameters to integrate out in the pdf model.
BayesianCalculator bc(data,*model,RooArgSet(*POI),*priorPOI,&nuisanceParameters);
Compute the credibility interval
Set the confidence level of the credibility interval and compute it. Returns a
SimpleInterval
.
bc.SetTestSize(0.05);
SimpleInterval* interval = bc.GetInterval();
double lowerLimit = interval->LowerLimit();
double upperLimit = interval->UpperLimit();
Plotting the posterior probability
// In ROOT >5.25.02:
// bc.GetPosteriorPlot()->Draw();
// This is the current plot:
// bcalc.PlotPosterior();
// The code below produce a much nicer plot:
RooAbsPdf* fPosteriorPdf = bcalc.GetPosteriorPdf();
RooPlot* plot = POI->frame();
plot->SetTitle(TString("Posterior probability of parameter \"")+TString(POI->GetName())+TString("\""));
fPosteriorPdf->plotOn(plot,RooFit::Range(interval->LowerLimit(),interval->UpperLimit(),kFALSE),RooFit::VLines(),RooFit::DrawOption("F"),RooFit::MoveToBack(),RooFit::FillColor(kGray));
fPosteriorPdf->plotOn(plot);
plot->GetYaxis()->SetTitle("posterior probability");
plot->Draw();
BayesianPlot and 90% CL credibility interval
Comparing different methods
Profile Likelihood, Feldman Cousins and MCMC
Download exercises from here:
http://cern.ch/cranmer/public/RooStats_CompareIntervals.tar
or on lxplus
~cranmer/public/RooStats_CompareIntervals.tar
We can modify the 500d tutorial to have a floating mass, and fit a 2-d scan. This is already done in
-
rs500e_PrepareWorkspace_GaussOverFlat_withSystematics_floatingMass.C
and then we compare three types of intervals (profile likelihood, Bayesian using Markov Chain Monte Carlo, and a Feldman Cousins technique that deals with nuisance parameters via the 'profile construction').
-
rs501_ThreeTypesOfLimits.C
The blue contour is from profile likelihood, the other contour is from Markov Chain Monte Carlo, and the squares are for the points tested by the Feldman Cousins tool.
Comparison of Profile Likelihood (blue), Feldman Cousins (red squares), and MCMC (black contour)
Running the same problem with BAT
A
http://www-ekp.physik.uni-karlsruhe.de/~schott/BCRooInterface/ has been developed to the
http://www.mppmu.mpg.de/bat/ (BAT). It is possible with it to run on the same problem as described above with Bayesian method using Markov Chain MC. Here is how to install BAT (on lxplus using the tcsh shell):
wget http://www.mppmu.mpg.de/bat/source/BAT-0.3.1.tar.gz
tar xvfz BAT-0.3.1.tar.gz
cd BAT-0.3.1
./configure --prefix=${PWD}
make
make install
setenv BATINSTALLDIR ${PWD}
setenv LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:${BATINSTALLDIR}/lib
setenv CPATH ${BATINSTALLDIR}/include
cd models/RooInterface
make
Then copy the input ROOT file (containing the workspace, ie.
WS_GaussOverFlat_withSystematics_floatingMass.root
) in the
RooInterface
directory and run with:
./runRooInterface.exe WS_GaussOverFlat_withSystematics_floatingMass.root myWS
gv bat_plots.ps
We are lucky, it worked out-of-the-box since we have been using the the workspace preparation the same name conventions as the defaults sets in the
BAT/RooInterface
. Here is one of the plots of the results:
Posterior projection obtained with the BAT package after MCMC integration.
--
GregorySchott - 15-Oct-2009 --
KyleCranmer - 16-Oct-2009 --
GregorySchott - 29-Oct-2009