This page is obsolete!

Please refer to RooStats, the product into which RooStatsCms has been merged!

RooStatsCms: a tool for analyses modeling, combination and statistical studies

IMPORTANT

Most of the RooStatsCms code has been moved to RooStats. We strongly advise referring to the latter product during this period of transition!

Prerequisites

The analyst should be somewhat familiar with RooFit, the base chosen for RSC. Moreover, an understanding of concepts such as likelihood, hypothesis testing, profiling and marginalisation, systematic uncertainties and significance is taken for granted.

Introduction

RooStatsCms (RSC) is a software framework based on RooFit and developed for the CMS experiment community. Its scope is the modelling and combination of multiple analysis channels, together with the execution of statistical studies. This is done through a variety of methods described in the literature and implemented as classes. The design of these classes is oriented towards the execution of multiple CPU-intensive jobs on batch systems or on the Grid, which facilitates splitting the calculations and collecting the results. In addition, plots are produced transparently for the user by means of formatting, drawing and graphics manipulation routines.

Analyses and their combinations are described in configuration files, thus separating physics inputs from the C++ code. This eases the sharing of input models among analysis groups and helps to establish common guidelines for summarising physics results. The combination of analyses draws the maximum statistical advantage by allowing the definition of common variables, constrained parameters and arbitrary correlations among the different quantities. RSC is therefore meant to complement the existing analyses through their combination, thereby obtaining earlier discoveries, sharper limits and more refined measurements of physically relevant quantities.

Obtain RooStatsCms

RooStatsCms can be obtained from the CMSSW repository, in the package PhysicsTools/RooStatsCms. It can then be compiled either as a standalone package or within the CMS software using scram.

Standalone

In this case a standard makefile is used. It produces a dynamic library and all the CINT dictionaries needed to use RooStatsCms in interactive sessions or in macros. The recipe to obtain the latest version (V01-01-06) and compile it is given here (bash). Make sure that ROOT (at least version 5.22) is correctly set up:

export CVSROOT=<yourusername>@cmscvs.cern.ch:/cvs_server/repositories/CMSSW
export CVS_RSH=ssh
cvs co -d ./  -r V01-01-06 PhysicsTools/RooStatsCms
cd RooStatsCms
make 
make exe -j3
source scripts/RSCenv.sh

We are done! The environment is set, the library compiled and the executables ready in the bin directory.

CMSSW

-- To be done --

Quick Start

Let's now run a first example, m2lnQ_creator.exe, which performs a study of hypothesis separation. From the RooStatsCms directory, run:

bin/m2lnQ_creator.exe macros/examples/example_qqhtt.rsc qqhtt 1000
This command line contains the path to a datacard (example_qqhtt.rsc), the name of the combined model in the datacard (qqhtt) and the number of pseudo-experiments to be performed. Datacards are explained in more detail below. The program produces a plot, qqhtt_m2lnQ_distrinutions_1000.png, and the corresponding object in a ROOT file, qqhtt_m2lnQ_distrinutions_1000.root. The plot should look like this:

[Figure: qqhtt_m2lnQ_distrinutions_1000.png. Hypothesis separation test]
If you are not sure what a separation of hypotheses is, or you are not familiar with this kind of plot, please see A.L. Read, "Modified Frequentist Analysis of Search Results (The CL_s Method)", CERN-OPEN-2000-205.
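To give a feeling for what such a plot contains, here is a minimal Python sketch of the quantity -2 ln Q for a single counting channel. It is not the code of m2lnQ_creator.exe: it assumes a plain Poisson likelihood without systematics, for which Q = L(s+b) / L(b) gives -2 ln Q = 2s - 2 n ln(1 + s/b), and the yields s = 20 and b = 100 are purely illustrative. All function names are hypothetical.

```python
import math
import random

def poisson(rng, mean):
    """Draw a Poisson count by counting unit-rate exponential arrivals in [0, mean]."""
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count

def m2lnq(n_obs, s, b):
    """-2 ln Q for one counting channel, with Q = L(s+b) / L(b)."""
    return 2.0 * s - 2.0 * n_obs * math.log(1.0 + s / b)

def toy_distributions(s, b, n_toys, seed=1):
    """Generate -2 ln Q values for pseudo-experiments under both hypotheses."""
    rng = random.Random(seed)
    b_only = [m2lnq(poisson(rng, b), s, b) for _ in range(n_toys)]
    s_plus_b = [m2lnq(poisson(rng, s + b), s, b) for _ in range(n_toys)]
    return b_only, s_plus_b

def mean_of(values):
    return sum(values) / len(values)

# Illustrative yields only; they are not taken from example_qqhtt.rsc.
b_only, s_plus_b = toy_distributions(s=20.0, b=100.0, n_toys=1000)
print("mean -2lnQ, background only       :", round(mean_of(b_only), 2))
print("mean -2lnQ, signal plus background:", round(mean_of(s_plus_b), 2))
```

The two lists correspond to the two histograms in the plot: the further apart they are, the better the two hypotheses can be separated.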

Datacards: the analysis configuration files

In the analysis of a physics process, the description of the signal and background components, together with the correlations and constraints acting on the parameters, is a critical step. RooStatsCms makes it possible to model analyses and their combinations by describing their signal and background shapes, yields, parameters, systematics and correlations in analysis configuration files, called datacards. The goal of the modelling component of RSC is to parse the datacard and generate from it a model according to the RooFit standards. A few classes are devoted to this functionality, but the user only needs to deal with one of them, RscCombinedModel.

This approach has three main advantages: the factorisation of the analysis description and the statistical treatment into two well-defined steps, a common base for the analysis groups to describe the outcomes of their studies, and a straightforward and documented sharing of the results.

A datacard is an ASCII file in the ".ini" format, i.e. it contains key-value pairs organised in sections. This format was preferred to XML because of its simplicity and high readability. The datacard is parsed and processed by an extension of the RooFit RooStreamParser utility class. This class is already rather advanced: beyond reading strings and numeric parameters from configuration files, it interprets conditional statements, file inclusions and comments. For a complicated combination, the user can exploit these features in RSC by specifying a single model per datacard and then importing all of them into a "combination card".

Every analysis model can be described as a function of one or more observables, e.g. invariant mass, the output of a neural network or topological information regarding the decay products. For each of these variables a description of the signal and background cases has to be given, where both signal and background can be divided into multiple components, e.g. multiple background sources. A shape and a yield can be assigned to each signal and background component. For the shape, a list of models is available, and for shapes that are not easily parametrised a TH1 histogram in a ROOT file can be specified instead. The yields can be expressed as a product of an arbitrary number of factors.

As in RooFit, every parameter in the datacard can be declared constant or defined in a certain range. In addition, through the RSC implementation of constraints, the user can directly specify that a parameter is affected by a Gaussian or a log-normal systematic uncertainty. In the Gaussian case, correlations among the parameters can be specified via a correlation matrix.

In a combination some parameters might need to be the same across many analyses, e.g. the luminosity or a background rate. The modelling achieves this through a "same name, same pointer" mechanism: every parameter is represented in memory as a RooRealVar or, in the presence of systematic uncertainties, as a derived object, the Constraint object, and RscCombinedModel merges all variables with the same name by associating them with the same pointer.
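The "same name, same pointer" idea can be illustrated with a small Python sketch. This is not the RscCombinedModel code: the Parameter and ParameterRegistry classes are hypothetical stand-ins for a RooRealVar and for the merging step, and the rule that the first declaration wins is an assumption of this sketch.

```python
class Parameter:
    """Hypothetical stand-in for a RooRealVar: a named, mutable value."""

    def __init__(self, name, value):
        self.name = name
        self.value = value


class ParameterRegistry:
    """Hands out exactly one Parameter object per name."""

    def __init__(self):
        self._by_name = {}

    def get(self, name, value):
        # Reuse the existing object if the name is already known.
        # Assumption of this sketch: the first declaration wins.
        if name not in self._by_name:
            self._by_name[name] = Parameter(name, value)
        return self._by_name[name]


registry = ParameterRegistry()
analysis_a = {
    "lumi": registry.get("lumi", 1.0),
    "a_bkg_yield": registry.get("a_bkg_yield", 500.0),
}
analysis_b = {
    "lumi": registry.get("lumi", 1.0),
    "b_bkg_yield": registry.get("b_bkg_yield", 80.0),
}

# Both analyses hold the very same object, so a change made through
# one of them is seen by the other.
analysis_a["lumi"].value = 1.1
print(analysis_b["lumi"].value)
```

Because the two dictionaries share one object for "lumi", a fit that moves the luminosity moves it consistently in every analysis of the combination.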

Writing datacards and testing them

To begin with, it might be useful to look in detail at the example card macros/examples/example_qqhtt.rsc: it contains a very simple model, with a signal and a background divided into two components. The file contains many comments, which can help you understand the basic configuration syntax.

A helper script is distributed with RooStatsCms: create_card_skeleton.py. This script tries to guide you through the writing of a datacard by suggesting code snippets from the following selection:

  • combinedModel
  • singleModel
  • sigTopLevel
  • bkgTopLevel
  • bkgTopLevelCompositeYield
  • sigDeclaration
  • bkgDeclaration
  • bkgDeclarationMultipleComponents

To run it just type:

create_card_skeleton.py <snippet_name>

The Minimal Card

After this introduction let's write our first datacard, describing a simple counting experiment. We will start from this basic model and then increase its complexity.

# The first RooStatsCms card

#section0
[my_combination]
    model = combined
    components = my_analysis

#section 1
[my_analysis]
    variables = x
    x = 0 L(0 - 1)

#section 2
[my_analysis_sig]
    my_analysis_sig_yield = 100 L (0 - 1000)

#section 3
[my_analysis_sig_x]
    model = yieldonly

#section 4
[my_analysis_bkg]
    my_analysis_bkg_yield = 1000 C

#section 5
[my_analysis_bkg_x]
    model = yieldonly

Let's now understand the card:

  • Comments can be included if they begin with a "#".
  • "my_analysis" is the name of our counting analysis. As you can see, the name of the analysis occurs several times.
  • Section 0:
    • Let's leave this point aside for the moment. This is how a combined model is built using RooStatsCms.
  • Section 1:
    • variables: here the user specifies the variables that are used in the analysis. Even though this is a counting analysis, we specify one variable: x.
    • x: having chosen the name x for the variable, we now specify its initial value and its interval of definition. The RooFit convention is used: initial_value L (min_value - max_value).
  • Section 2 - the signal yield description:
    • my_analysis_sig_yield: expresses the yield of the signal component of the analysis. Whether to make it constant or variable is up to the user and depends on the context of the statistical treatment.
  • Section 3 - the signal shape description:
    • my_analysis_sig_x: observe the form of the section name, "analysis_name"_"component"_"variable_name".
    • model: since this is a counting experiment we select the "yieldonly" model. Other possibilities will be explored later.
  • Sections 4 and 5 are the background counterparts of sections 2 and 3.
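The two value forms used in the card, "value C" for a constant and "initial_value L (min_value - max_value)" for a limited range, can be read with a few lines of Python. This is only a sketch of the syntax as it appears in the examples on this page, not the RooStreamParser implementation; the function name parse_value is hypothetical, and the requirement of whitespace around the "-" separator is an assumption made here so that negative bounds stay unambiguous.

```python
import re

NUMBER = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?"
# "1000 C": a constant value.
CONSTANT = re.compile(rf"^\s*({NUMBER})\s+C\s*$")
# "100 L (0 - 1000)": an initial value with a range; the space after L is optional.
LIMITED = re.compile(
    rf"^\s*({NUMBER})\s+L\s*\(\s*({NUMBER})\s+-\s+({NUMBER})\s*\)\s*$"
)

def parse_value(text):
    """Turn the right-hand side of a datacard entry into a small dictionary."""
    match = CONSTANT.match(text)
    if match:
        return {"value": float(match.group(1)), "constant": True}
    match = LIMITED.match(text)
    if match:
        value, low, high = (float(group) for group in match.groups())
        if not low <= value <= high:
            raise ValueError(f"initial value outside its range: {text!r}")
        return {"value": value, "constant": False, "min": low, "max": high}
    raise ValueError(f"unrecognised value syntax: {text!r}")

for entry in ("0 L(0 - 1)", "100 L (0 - 1000)", "1000 C"):
    print(entry, "->", parse_value(entry))
```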

Let's try to get a visual impression of our model (this requires the dot program from the Graphviz suite):

create_diagram.exe macros/examples/simple_counting.rsc  my_combination

This command creates a graph of the model.

The Minimal Card with systematics

Let's now try to add systematics to the model, for example a 20% Gaussian systematic on the background yield:

# The second RooStatsCms card

#section0
[my_combination]
    model = combined
    components = my_analysis

#section 1
[my_analysis]
    variables = x
    x = 0 L(0 - 1)

#section 2
[my_analysis_sig]
    my_analysis_sig_yield = 100 L (0 - 1000)

#section 3
[my_analysis_sig_x]
    model = yieldonly

#section 4
[my_analysis_bkg]
    my_analysis_bkg_yield = 1000 L (0 - 10000)
    my_analysis_bkg_yield_constraint = Gaussian,1000,0.2

#section 5
[my_analysis_bkg_x]
    model = yieldonly

As you can see some changes were applied to section 4: the my_analysis_bkg_yield variable is now free to float between 0 and 10000 and a constraint has been defined. The syntax for the constraints is the following:

<parameter_name> = X L (X1 - X2)
<parameter_name>_constraint = Gaussian,<central value>,<sigma>
Note that the sigma is expressed as a fraction of the central value, i.e. as a number between 0 and 1: 0.2 stands for 20%.
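As a rough illustration of what such a constraint does in a fit, the Python sketch below reads a constraint string and evaluates the Gaussian penalty added to the negative log-likelihood. It is a sketch under stated assumptions and not the RSC Constraint class: the absolute width is taken to be the relative sigma times the central value, constant terms are dropped, and the function names are hypothetical.

```python
def parse_constraint(text):
    """Split "Gaussian,1000,0.2" into (kind, central value, relative sigma)."""
    kind, central, relative_sigma = (field.strip() for field in text.split(","))
    # Lower-casing the kind is a convenience of this sketch only.
    return kind.lower(), float(central), float(relative_sigma)

def gaussian_constraint_nll(value, central, relative_sigma):
    """Gaussian penalty term of the negative log-likelihood, constants dropped."""
    # Assumption: the relative sigma refers to the central value.
    absolute_sigma = relative_sigma * central
    pull = (value - central) / absolute_sigma
    return 0.5 * pull * pull

kind, central, relative_sigma = parse_constraint("Gaussian,1000,0.2")
for value in (1000.0, 1200.0, 1400.0):
    penalty = gaussian_constraint_nll(value, central, relative_sigma)
    print(f"bkg yield {value:6.0f}: penalty {penalty:.2f}")
```

With a 20% uncertainty on a central value of 1000, a background yield of 1200 sits one standard deviation away from the central value and a yield of 1400 two standard deviations away.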

Moving on: models different than the counting experiment

Of course, considering only counting experiments would be a limitation. RooStatsCms therefore lets you specify different shapes or, for functional forms that are not easy to parametrise, plug in a histogram contained in a ROOT file. To do this, replace the line model = yieldonly with something more elaborate. In general the form is as follows, taking as an example the signal component of the variable x of the analysis:

[my_analysis_sig_x]
    model = distribution_name
    my_analysis_sig_x_par_name1 = ..
    my_analysis_sig_x_par_name2 = ..
    my_analysis_sig_x_par_name3 = ..
    my_analysis_sig_x_par_name4 = ..
    ....
Note that the parameter names contain the name of the analysis, the component and the name of the variable.

The class that builds the different distributions is RscBaseModel. The available distributions and their parameters are listed below:

  • Gaussian
    • model name gauss.
    • variables mean, sigma.
  • Double Gaussian
    • model name dblgauss.
    • variables mean1, sigma1, mean2, sigma2, frac. This last parameter represents the relative amplitude of the two Gaussians.
  • Sum of 4 Gaussians
    • model name fourGaussians.
    • variables mean1, sigma1, mean2, sigma2, mean3, sigma3, mean4, sigma4, frac1, frac2, frac3.
  • Breit-Wigner
    • model name BreitWigner.
    • variables mean1, sigma1.
  • Exponential
    • model name exponential.
    • variables slope.
  • Polynomial, 7th degree
    • model name poly7.
    • variables coef_0, coef_1, ... assumed to be 0 if not set.
  • Bifurcated Gaussian
    • model name BifurGauss.
    • variables mean, sigmaL, sigmaR.
  • Sum of a Crystal Ball distribution and a Gaussian
    • model name CBShapeGaussian.
    • variables m0, sigma, alpha, n, gmean, gsigma, frac .
  • Flat distribution
    • model name flat.
    • No variable is needed.
  • Counting experiment
    • model name yieldonly
    • No variable is needed.
  • Histogram
    • model name histo.
    • variables fileName, dataName: the name of the ROOT file and of a TH1 contained in it.
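The naming rule and the list above can be combined into a small check that tells you which keys a shape section still needs. The Python sketch below is only an aid for writing cards and is not part of RooStatsCms: the table copies a subset of the model names and parameters listed above, the function names are hypothetical, and whether RSC itself complains about a missing key or falls back to a default is not covered here.

```python
# Model names and parameter suffixes copied from the list above (a subset).
MODEL_PARAMETERS = {
    "gauss": ["mean", "sigma"],
    "dblgauss": ["mean1", "sigma1", "mean2", "sigma2", "frac"],
    "BreitWigner": ["mean1", "sigma1"],
    "exponential": ["slope"],
    "flat": [],
    "yieldonly": [],
}

def expected_keys(analysis, component, variable, model):
    """Build the full key names: analysis_component_variable_parameter."""
    prefix = f"{analysis}_{component}_{variable}"
    return [f"{prefix}_{name}" for name in MODEL_PARAMETERS[model]]

def missing_keys(section, analysis, component, variable):
    """Return the keys the chosen model needs but the section does not define."""
    keys = expected_keys(analysis, component, variable, section["model"])
    return [key for key in keys if key not in section]

# A [my_analysis_sig_x] section in which the sigma has been forgotten.
section = {"model": "gauss", "my_analysis_sig_x_mean": "70 C"}
print(missing_keys(section, "my_analysis", "sig", "x"))
```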

Let's now try a dummy model which presents a Gaussian signal over an exponentially falling spectrum. The card that we will use is the following:

[my_combination]
    model = combined
    components = my_analysis

#section 1
[my_analysis]
    variables = x
    x = 0 L(0 - 300)

#section 2
[my_analysis_sig]
    my_analysis_sig_yield = 40 L (0 - 1000)

#section 3
[my_analysis_sig_x]
    model = gauss
    my_analysis_sig_x_mean = 70 C
    my_analysis_sig_x_sigma = 10 C

#section 4
[my_analysis_bkg]
    my_analysis_bkg_yield = 500 C

#section 5
[my_analysis_bkg_x]
    model = exponential
    my_analysis_bkg_x_slope = -0.01 C

You can draw a plot of this model using create_sig_bkg_plot.exe, like this:

create_sig_bkg_plot.exe $RSCPATH/macros/examples/gaussian_on_expo.rsc my_combination
This produces the following plot:
[Figure: my_combination_component0.png. A Gaussian signal over an exponentially falling spectrum]
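The shape drawn in this plot can be reproduced by hand with the numbers of the card. The Python sketch below is not RSC or RooFit code: it assumes that each shape is normalised to unity over the range of x, 0 to 300, and then scaled by its yield, which is the usual convention for extended models. The function names are hypothetical.

```python
import math

X_MIN, X_MAX = 0.0, 300.0  # range of x in the card

def gauss_pdf(x, mean, sigma):
    """Gaussian normalised to unity on [X_MIN, X_MAX]."""
    scale = sigma * math.sqrt(2.0)
    norm = 0.5 * (math.erf((X_MAX - mean) / scale) - math.erf((X_MIN - mean) / scale))
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi) * norm)

def expo_pdf(x, slope):
    """exp(slope * x) normalised to unity on [X_MIN, X_MAX]; slope must be non-zero."""
    norm = (math.exp(slope * X_MAX) - math.exp(slope * X_MIN)) / slope
    return math.exp(slope * x) / norm

def expected_density(x, sig_yield=40.0, bkg_yield=500.0):
    """Expected events per unit of x, with the parameter values of the card."""
    return sig_yield * gauss_pdf(x, 70.0, 10.0) + bkg_yield * expo_pdf(x, -0.01)

for x in (0.0, 40.0, 70.0, 100.0, 200.0, 300.0):
    print(f"x = {x:5.1f}  expected density = {expected_density(x):.3f}")
```

The density integrates to the sum of the two yields, 540 events, and shows the signal bump around x = 70 on top of the falling background.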

Build your models directly in the C++ code

Coming soon: build a model with RSC directly in the C++ code!

The Profile Likelihood method

The Frequentist Approach

The Modified Frequentist prescription

Hypothesis test inversion

Binomial Confidence intervals

For more details about this topic, please see this page.

Toy MC with CRAB

For more details about this topic, please see this page.

Related publications

The CHEP09 proceedings are about to be published. For the time being, please refer to this preprint:
  • Danilo Piparo, Gregory Schott and Gunter Quast, RooStatsCms: a tool for analysis modelling, combination and statistical studies, arXiv:0905.4623, 2009.

-- DaniloPiparo - 16 Jun 2009
