This page is obsolete! Please refer to RooStats, the product into which RooStatsCms has converged.
RooStatsCms: a tool for analysis modelling, combination and statistical studies
IMPORTANT
Most of the RooStatsCms code has been moved to RooStats. We strongly advise you to refer to the latter product during this period of transition!
Prerequisites
The analyst should be somewhat familiar with RooFit, on which RSC is built. Moreover, an understanding of concepts such as likelihood, hypothesis testing, profiling and marginalisation, systematic uncertainties and significance is taken for granted.
Introduction
RooStatsCms (RSC) is a software framework based on RooFit and developed for the CMS experiment community. Its scope is to allow the modelling and combination of multiple analysis channels, together with the execution of statistical studies. These are performed through a variety of methods described in the literature and implemented as classes. The classes are designed for running multiple CPU-intensive jobs on batch systems or on the Grid, which eases the splitting of the calculations and the collection of the results. In addition, plots are produced transparently for the user by means of sophisticated formatting, drawing and graphics manipulation routines.
Analyses and their combinations are described in configuration files, thus separating the physics inputs from the C++ code. This eases the sharing of input models among the analysis groups and establishes common guidelines for summarising physics results.
The maximum statistical advantage can be drawn from combining analyses, since the framework allows the definition of common variables, constrained parameters and arbitrary correlations among the different quantities.
RSC is therefore meant to complement the existing analyses by combining them, thereby obtaining earlier discoveries, sharper limits and more refined measurements of physically relevant quantities.
Obtain RooStatsCms
RooStatsCms can be obtained from the CMSSW repository, in the package PhysicsTools/RooStatsCms. It can then be compiled either as a standalone package or within the CMS software using scram.
Standalone
In this case a standard makefile is used. It produces a dynamic library and all the CINT dictionaries needed to use RooStatsCms in interactive sessions or in macros.
The recipe to obtain the latest version (V01-01-06) and compile it is reported here (bash). Be sure to have ROOT (at least 5.22) correctly set up:
export CVSROOT=<yourusername>@cmscvs.cern.ch:/cvs_server/repositories/CMSSW
export CVS_RSH=ssh
cvs co -d ./ -r V01-01-06 PhysicsTools/RooStatsCms
cd RooStatsCms
make
make exe -j3
source scripts/RSCenv.sh
We are done! The environment is set, the library compiled and the executables ready in the bin directory.
CMSSW
-- To be done --
Quick Start
Let's now run a first example, m2lnQ_creator.exe, which performs a hypothesis separation study. From the RooStatsCms directory, run:
bin/m2lnQ_creator.exe macros/examples/example_qqhtt.rsc qqhtt 1000
This command line contains the path to a datacard (example_qqhtt.rsc), the name of the combined model in the datacard (qqhtt) and the number of pseudo-experiments to be performed. More about datacards is explained in the following sections.
The code will produce a plot, qqhtt_m2lnQ_distrinutions_1000.png, and its corresponding object in a ROOT file, qqhtt_m2lnQ_distrinutions_1000.root.
The plot should look like:
Hypothesis separation test
If you do not know what a separation of hypotheses is, or you are not familiar with this kind of plot, please see A. L. Read, "Modified Frequentist Analysis of Search Results (The CL_s Method)", CERN-OPEN-2000-205.
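As a rough illustration of what such a study computes, the sketch below throws pseudo-experiments for a single counting experiment and evaluates -2 ln Q = 2 (s - n ln(1 + s/b)) for each of them, once under the background-only hypothesis and once under signal plus background. It is not the code of m2lnQ_creator.exe: the function names, the yields and the simple Poisson generator are all assumptions made for this example.

```python
import math
import random


def m2lnq(n_obs, s, b):
    """-2 ln Q for one counting experiment, Q = L(s+b) / L(b) with Poisson likelihoods."""
    return 2.0 * (s - n_obs * math.log(1.0 + s / b))


def poisson(rng, mean):
    """Draw a Poisson-distributed integer (Knuth's method, adequate for small means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def throw_toys(s, b, n_toys, seed=1):
    """Return the -2 ln Q values of toys thrown under the b-only and s+b hypotheses."""
    rng = random.Random(seed)
    bkg_only = [m2lnq(poisson(rng, b), s, b) for _ in range(n_toys)]
    sig_bkg = [m2lnq(poisson(rng, s + b), s, b) for _ in range(n_toys)]
    return bkg_only, sig_bkg


if __name__ == "__main__":
    # Hypothetical yields, chosen only to keep the example fast.
    bkg_only, sig_bkg = throw_toys(s=10.0, b=50.0, n_toys=1000)
    print("mean -2lnQ (b only):", sum(bkg_only) / len(bkg_only))
    print("mean -2lnQ (s+b):   ", sum(sig_bkg) / len(sig_bkg))
```

The further apart the two distributions of -2 ln Q lie, the better the two hypotheses are separated.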
Datacards: the analysis configuration files
In the analysis of a physics process the description of its signal and background components, together with the correlations and constraints acting on the parameters, is a critical step. RooStatsCms makes it easy to model the analyses and their combinations through the description of their signal and background shapes, yields, parameters, systematics and correlations via analysis configuration files, called datacards. The goal of the modelling component of RSC is to parse the datacard and generate from it a model according to the RooFit standards. A few classes are devoted to this functionality, but the user really needs to deal with only one of them, the RscCombinedModel. This approach has three main advantages: the factorisation of the analysis description and the statistical treatment into two well-defined steps, a common base for the analysis groups to describe the outcomes of their studies, and a straightforward and documented sharing of the results.
A datacard is an ASCII file in the ".ini" format, therefore presenting key-value pairs organised in sections. This format was preferred to XML because of its simplicity and high readability. The parsing and processing of the datacard is achieved through an extension of the RooFit RooStreamParser utility class. This class is already rather advanced: beyond reading strings and numeric parameters from configuration files, it implements the interpretation of conditional statements, file inclusions and comments. For a complicated combination, the user can exploit these features in RSC by specifying a single model per datacard and then importing all of them into a "combination card".
Every analysis model can be described as a function of one or more observables, e.g. an invariant mass, the output of a neural network or topological information about the decay products. For each of these variables a description of the signal and of the background is to be given, where both signal and background can be divided into multiple components, e.g. multiple background sources. A shape and a yield can be assigned to each signal and background component. For the shape, a list of predefined models is available, and for shapes which are not easily parametrised a TH1 histogram in a ROOT file can be specified instead. The yields can be expressed as a product of an arbitrary number of single factors.
Using RooFit, all the parameters present in the datacard can be specified as constants or defined in a certain range. In addition, exploiting the RSC implementation of the constraints, the user can directly specify that a parameter is affected by a Gaussian or a log-normal systematic uncertainty. In the former case, correlations among the parameters can be specified via a correlation matrix.
In a combination some parameters might need to be the same throughout many analyses, e.g. the luminosity or a background rate. This is achieved in the modelling through a "same name, same pointer" mechanism. Every parameter is represented in memory as a RooRealVar or, in the presence of systematic uncertainties, as a derived object, the Constraint object. The RscCombinedModel merges all variables with the same name by associating them with the same pointer.
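The idea behind "same name, same pointer" can be sketched in a few lines of Python. This is only a toy illustration and not the RscCombinedModel implementation: the Parameter and ParameterRegistry classes, and the parameter names, are hypothetical stand-ins for RooRealVar and for the bookkeeping done by RSC.

```python
class Parameter:
    """Stand-in for a RooRealVar: a named, mutable value."""

    def __init__(self, name, value):
        self.name = name
        self.value = value


class ParameterRegistry:
    """Hands out one shared Parameter object per name."""

    def __init__(self):
        self._by_name = {}

    def get(self, name, value):
        # The first request creates the object; later requests with the same
        # name get the very same object back (their initial value is ignored).
        if name not in self._by_name:
            self._by_name[name] = Parameter(name, value)
        return self._by_name[name]


registry = ParameterRegistry()
lumi_a = registry.get("lumi", 1.0)  # requested by a hypothetical analysis A
lumi_b = registry.get("lumi", 1.0)  # requested by a hypothetical analysis B
bkg_a = registry.get("analysis_a_bkg_yield", 1000.0)

lumi_a.value = 1.1  # a change made through analysis A is seen by analysis B
print(lumi_b.value)
```

Because both analyses hold a reference to the same object, a fit that moves the shared parameter moves it consistently everywhere it appears.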
Writing datacards and testing them
It might be useful, to begin with, to look in detail at the example card macros/examples/example_qqhtt.rsc: it contains a very simple model, with a signal and a background divided into two components. The file contains many comments, which can help you understand the basic configuration syntax.
A helper script is distributed with RooStatsCms: create_card_skeleton.py. This script tries to guide you through the writing of a datacard, suggesting code snippets from the following selection:
- combinedModel
- singleModel
- sigTopLevel
- bkgTopLevel
- bkgTopLevelCompositeYield
- sigDeclaration
- bkgDeclaration
- bkgDeclarationMultipleComponents
To run it just type:
create_card_skeleton.py <snippet_name>
The Minimal Card
After this introduction let's write our first datacard, to describe a simple counting experiment. We will start from this basic model to then increase its complexity.
# The first RooStatsCms card
#section0
[my_combination]
model = combined
components = my_analysis
#section 1
[my_analysis]
variables = x
x = 0 L(0 - 1)
#section 2
[my_analysis_sig]
my_analysis_sig_yield = 100 L (0 - 1000)
#section 3
[my_analysis_sig_x]
model = yieldonly
#section 4
[my_analysis_bkg]
my_analysis_bkg_yield = 1000 C
#section 5
[my_analysis_bkg_x]
model = yieldonly
Let's now understand the card:
- As you can see, comments can be included if they begin with a "#".
- "my_analysis" is the name of our counting analysis. As you can see, the name of the analysis occurs several times.
- Section 0:
  - Let's leave this point aside for the moment. This is how a combined model is built using RooStatsCms.
- Section 1:
  - variables: here the user specifies the variables that are used in the analysis. Even though this is a counting analysis, we specify one variable: x.
  - x: having chosen the name x for the variable, we now specify its initial value and its interval of definition. The RooFit convention is used: initial_value L (min_value - max_value).
- Section 2 - the signal yield description:
  - my_analysis_sig_yield: expresses the yield of the signal component of the analysis. The choice of making it constant or variable is up to the user and depends on the context of the statistical treatment.
- Section 3 - the signal shape description:
  - my_analysis_sig_x: observe the form of the section name, "analysis_name"_"component"_"variable_name".
  - model: since this is a counting experiment we select the "yieldonly" model. Other possibilities will be explored later.
- Sections 4 and 5 are the background counterparts of sections 2 and 3.
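To make the structure of the card concrete, here is a small sketch that reads part of the minimal card with Python's standard configparser module and decodes the two parameter forms used above, "value L (min - max)" and "value C". RSC itself parses cards with its extension of RooStreamParser, so this is not how the framework does it; the parse_parameter helper and its regular expressions are assumptions made for this example and cover only these two forms.

```python
import configparser
import re

CARD = """
# The first RooStatsCms card
[my_combination]
model = combined
components = my_analysis

[my_analysis]
variables = x
x = 0 L(0 - 1)

[my_analysis_sig]
my_analysis_sig_yield = 100 L (0 - 1000)

[my_analysis_bkg]
my_analysis_bkg_yield = 1000 C
"""

# "<value> L (<min> - <max>)" is a floating parameter, "<value> C" a constant.
NUMBER = r"[-+]?\d+(?:\.\d+)?"
RANGED = re.compile(rf"^({NUMBER})\s*L\s*\(\s*({NUMBER})\s*-\s*({NUMBER})\s*\)$")
CONSTANT = re.compile(rf"^({NUMBER})\s*C$")


def parse_parameter(text):
    """Return (value, minimum, maximum, is_constant) for a parameter specification."""
    text = text.strip()
    match = RANGED.match(text)
    if match:
        value, low, high = (float(group) for group in match.groups())
        return value, low, high, False
    match = CONSTANT.match(text)
    if match:
        value = float(match.group(1))
        return value, value, value, True
    raise ValueError(f"unrecognised parameter specification: {text!r}")


parser = configparser.ConfigParser()
parser.optionxform = str  # keep the case of the keys as written in the card
parser.read_string(CARD)

print(parser["my_combination"]["components"])
print(parse_parameter(parser["my_analysis_sig"]["my_analysis_sig_yield"]))
print(parse_parameter(parser["my_analysis_bkg"]["my_analysis_bkg_yield"]))
```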
Let's try to get a visual impression of our model (this requires the dot program from the Graphviz suite):
create_diagram.exe macros/examples/simple_counting.rsc my_combination
This command creates a graph of the model:
[Figure: graph of the model produced by create_diagram.exe]
The Minimal Card with systematics
Let's now try to add systematics to the model, for example a 20% Gaussian systematic on the background yield:
# The second RooStatsCms card
#section0
[my_combination]
model = combined
components = my_analysis
#section 1
[my_analysis]
variables = x
x = 0 L(0 - 1)
#section 2
[my_analysis_sig]
my_analysis_sig_yield = 100 L (0 - 1000)
#section 3
[my_analysis_sig_x]
model = yieldonly
#section 4
[my_analysis_bkg]
my_analysis_bkg_yield = 1000 L (0 - 10000)
my_analysis_bkg_yield_constraint = Gaussian,1000,0.2
#section 5
[my_analysis_bkg_x]
model = yieldonly
As you can see some changes were applied to section 4: the my_analysis_bkg_yield variable is now free to float between 0 and 10000 and a constraint has been defined.
The syntax for the constraints is the following:
<parameter_name> = X L (X1 - X2)
<parameter_name>_constraint = gaussian,<central value>, <sigma>
Note that sigma is expressed as a fraction of the central value, i.e. as a number between 0 and 1: in the card above, 0.2 stands for the 20% uncertainty on the background yield.
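The arithmetic behind the relative sigma can be sketched as follows. The conversion to an absolute width follows from the 20% example above; the penalty function is the usual textbook form of a Gaussian constraint term and is only an assumption about how such a constraint enters a likelihood, not a description of the RSC Constraint class. Both function names are hypothetical.

```python
def absolute_sigma(central_value, relative_sigma):
    """Convert the relative sigma of the card into an absolute width."""
    return central_value * relative_sigma


def gaussian_penalty(value, central_value, relative_sigma):
    """Textbook -ln of an unnormalised Gaussian constraint term (an assumption)."""
    pull = (value - central_value) / absolute_sigma(central_value, relative_sigma)
    return 0.5 * pull * pull


# Values taken from the card above: Gaussian,1000,0.2
print(absolute_sigma(1000.0, 0.2))
print(gaussian_penalty(1200.0, 1000.0, 0.2))
```

With these numbers a background yield of 1200 sits one standard deviation away from the central value of 1000.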
Moving on: models other than the counting experiment
Of course, considering only counting experiments would be a limitation. RooStatsCms therefore lets you specify different shapes or, for functional forms that are not easily parametrised, plug in a histogram contained in a ROOT file.
To do so, replace the line model = yieldonly with something more elaborate.
In general, taking as an example the signal component of the variable x of the analysis, the form is:
[my_analysis_sig_x]
model = distribution_name
my_analysis_sig_x_par_name1 = ..
my_analysis_sig_x_par_name2 = ..
my_analysis_sig_x_par_name3 = ..
my_analysis_sig_x_par_name4 = ..
....
Observe that each parameter name contains the name of the analysis, the component and the name of the variable.
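The naming convention is easy to get wrong by hand, so a small generator can help. The sketch below builds the section and parameter names following the pattern described above and renders a shape section as text; the helper functions are hypothetical and are not part of RooStatsCms or of create_card_skeleton.py.

```python
def section_name(analysis, component, variable):
    """Name of a shape section: <analysis>_<component>_<variable>."""
    return f"{analysis}_{component}_{variable}"


def parameter_name(analysis, component, variable, parameter):
    """Name of a shape parameter: the section name followed by the parameter."""
    return f"{section_name(analysis, component, variable)}_{parameter}"


def shape_section(analysis, component, variable, model, parameters):
    """Render a shape section as datacard text."""
    lines = [f"[{section_name(analysis, component, variable)}]", f"model = {model}"]
    for name, spec in parameters.items():
        lines.append(f"{parameter_name(analysis, component, variable, name)} = {spec}")
    return "\n".join(lines)


print(shape_section("my_analysis", "sig", "x", "gauss",
                    {"mean": "70 C", "sigma": "10 C"}))
```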
The class that builds the different distributions is RscBaseModel. The available distributions, together with their parameters, are listed below:
- Gaussian
  - model name: gauss
  - variables: mean, sigma
- Double Gaussian
  - model name: dblgauss
  - variables: mean1, sigma1, mean2, sigma2, frac. This last parameter represents the relative amplitude of the two Gaussians.
- Sum of 4 Gaussians
  - model name: fourGaussians
  - variables: mean1, sigma1, mean2, sigma2, mean3, sigma3, mean4, sigma4, frac1, frac2, frac3
- Breit-Wigner
  - model name: BreitWigner
  - variables: mean1, sigma1
- Exponential
  - model name: exponential
  - variables: slope
- Polynomial, 7th degree
  - model name: poly7
  - variables: coef_0, coef_1, ... (assumed to be 0 if not set)
- Bifurcated Gaussian
  - model name: BifurGauss
  - variables: mean, sigmaL, sigmaR
- Sum of a Crystal Ball distribution and a Gaussian
  - model name: CBShapeGaussian
  - variables: m0, sigma, alpha, n, gmean, gsigma, frac
- Flat distribution
  - model name: flat
  - No variable is needed.
- Counting experiment
  - model name: yieldonly
  - No variable is needed.
- Histogram
  - model name: histo
  - variables: fileName, dataName. These are the name of the ROOT file and the name of a TH1 contained in it.
Let's now try a dummy model which presents a Gaussian signal over an exponentially falling spectrum.
The card that we will use is the following:
[my_combination]
model = combined
components = my_analysis
#section 1
[my_analysis]
variables = x
x = 0 L(0 - 300)
#section 2
[my_analysis_sig]
my_analysis_sig_yield = 40 L (0 - 1000)
#section 3
[my_analysis_sig_x]
model = gauss
my_analysis_sig_x_mean = 70 C
my_analysis_sig_x_sigma = 10 C
#section 4
[my_analysis_bkg]
my_analysis_bkg_yield = 500 C
#section 5
[my_analysis_bkg_x]
model = exponential
my_analysis_bkg_x_slope = -0.01 C
You can draw this model using create_sig_bkg_plot.exe as follows:
create_sig_bkg_plot.exe $RSCPATH/macros/examples/gaussian_on_expo.rsc my_combination
This produces the following plot:
A Gaussian signal over an exponentially falling spectrum
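For readers without the executable at hand, the expected shape of this model can be approximated with a few lines of Python. The sketch takes the yields and shape parameters from the card above and assumes that each shape is normalised to unit integral over the range of x before being scaled by its yield, which is the usual behaviour of extended RooFit models; it does not call RSC or RooFit, and the function names are hypothetical.

```python
import math

# Values taken from the card above.
X_MIN, X_MAX = 0.0, 300.0
SIG_YIELD, MEAN, SIGMA = 40.0, 70.0, 10.0
BKG_YIELD, SLOPE = 500.0, -0.01


def gauss_pdf(x):
    """Gaussian normalised to unit integral over [X_MIN, X_MAX]."""
    def cdf(t):
        return 0.5 * (1.0 + math.erf((t - MEAN) / (SIGMA * math.sqrt(2.0))))
    norm = SIGMA * math.sqrt(2.0 * math.pi) * (cdf(X_MAX) - cdf(X_MIN))
    return math.exp(-0.5 * ((x - MEAN) / SIGMA) ** 2) / norm


def expo_pdf(x):
    """exp(SLOPE * x) normalised to unit integral over [X_MIN, X_MAX]."""
    norm = (math.exp(SLOPE * X_MAX) - math.exp(SLOPE * X_MIN)) / SLOPE
    return math.exp(SLOPE * x) / norm


def expected_density(x):
    """Expected number of events per unit of x at the point x."""
    return SIG_YIELD * gauss_pdf(x) + BKG_YIELD * expo_pdf(x)


if __name__ == "__main__":
    for x in (0.0, 40.0, 70.0, 100.0, 200.0, 300.0):
        print(f"x = {x:5.1f}  expected events per unit x = {expected_density(x):.3f}")
```

The signal shows up as a bump around x = 70 on top of the falling background.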
Build your models directly in the C++ code
Coming soon: build a model with RSC directly in the C++ code!
The Profile Likelihood method
The Frequentist Approach
The Modified Frequentist prescription
Hypothesis test inversion
Binomial Confidence intervals
For more details about this topic, please see this page.
Toy MC with CRAB
For more details about this topic, please see this page.
Related publications
The CHEP09 proceedings are about to be published. For the time being, please refer to this preprint:
- Danilo Piparo, Gregory Schott and Gunter Quast, "RooStatsCms: a tool for analysis modelling, combination and statistical studies", arXiv:0905.4623, 2009.
-- DaniloPiparo - 16 Jun 2009