 TWiki> Main Web>TWikiUsers>DiegoCasadei>BayesianPoissonBasedInference (2019-10-10, DiegoCasadei)

# BayesianPoissonBasedInference

## Foreword

Here we sketch the key differences between frequentist and Bayesian inference. The expert reader can jump directly to the end of this section, where the rule for choosing between them is provided.

The Bayesian approach to statistical inference provides answers in terms of P(model|data) = probability of some model given the observed data, and can be summarized very tightly as follows:

• The probability is interpreted as degree of belief. This is general enough to be applicable to all situations. For example, one can speak about the probability that it rains tomorrow. Bernoulli's theorem (a.k.a. the law of large numbers) ensures that in the long run the observed frequency of a given outcome converges (in probability!) to its true probability. Hence in practice, one gets the very same numerical value of the probability as the frequentist definition, whenever the latter makes sense.
• Bayes' theorem plays a central role. It can be viewed as a model for our learning process, as it connects our prior probability distribution (representing our degree of belief about the possible models or parameter values before taking into account the result of the current experiment) to the posterior probability distribution given the observed data (representing our updated state of knowledge).
• Multidimensional problems with nuisance parameters are treated coherently: one finds the marginal probability distribution by integrating over each nuisance parameter (weighted by its prior density).
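As a minimal sketch of this prior-to-posterior update, consider a single Poisson rate with a conjugate Gamma prior (all numbers below are hypothetical; the full signal-plus-background model is treated in the next section):

```python
import math

# Prior belief about a Poisson rate: Gamma(shape=2, rate=1), i.e. E = 2 and V = 2
# (hypothetical numbers).
alpha, beta = 2.0, 1.0
n = 7  # counts observed in the current experiment

# Conjugacy: with a Poisson likelihood, the posterior is again a Gamma density,
# with shape alpha + n and rate beta + 1.
alpha_post, beta_post = alpha + n, beta + 1.0
post_mean = alpha_post / beta_post            # updated expectation
post_sd = math.sqrt(alpha_post) / beta_post   # updated uncertainty
print(post_mean, post_sd)
```

Observing 7 counts moves the expectation from 2 to 4.5 and shrinks the relative uncertainty: exactly the learning process described above.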

In contrast, the frequentist approach provides answers in terms of P(data|model). The probability is defined as the long-run frequency, hence a statement like "the probability that it rains tomorrow" is ill-defined in this approach, as one can not (even hypothetically) repeat the experiment. As mentioned above, when this interpretation makes sense the law of large numbers ensures that the probability assignment is the same as in the Bayesian approach (which is more general).

A typical problem in physics is the estimation of the value of some unknown parameter. Usually one wishes to obtain a best value (to be used in computations) and a range enclosing it which represents our uncertainty. The true value is not a random quantity, hence in the frequentist approach one can only think about the fraction of identically repeated experiments whose result (a range of possible values, called a confidence interval) contains the true but unknown value. This long-run fraction is called the coverage of the inference method. On the other hand, one can assign a degree of belief to different values of the unknown parameter, even though the true value is not a random quantity. Hence the Bayesian approach does not rely on a (possibly hypothetical) series of identically repeated experiments, and provides an answer based only on the assumed model, the chosen prior, and the observed data. The full Bayesian answer is the (marginal) posterior distribution, from which one can extract a best value (e.g. the posterior mean or mode) and a reasonable range (e.g. the narrowest interval covering 68.3% posterior probability, called a 68.3% credible interval).
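The last step can be sketched numerically: given a posterior density on a grid, find the narrowest interval enclosing 68.3% of the probability. The example below uses a hypothetical Gamma(9, 2) posterior, not tied to any specific measurement:

```python
import math

def gamma_pdf(x, a, b):
    """Gamma density with shape a and rate b."""
    return b**a * x**(a - 1) * math.exp(-b * x) / math.gamma(a)

# Hypothetical posterior: Gamma(shape=9, rate=2), mode 4.0, mean 4.5, sd 1.5.
a, b, ds = 9.0, 2.0, 0.001
xs = [i * ds for i in range(1, 20001)]        # grid on (0, 20]
mass = [gamma_pdf(x, a, b) * ds for x in xs]  # probability in each grid cell

# Slide a window over the grid, growing the upper edge until 68.3% is
# enclosed, and keep the narrowest such interval (two-pointer scan).
target, best, hi, acc = 0.683, None, 0, 0.0
for lo in range(len(xs)):
    while hi < len(xs) and acc < target:
        acc += mass[hi]
        hi += 1
    if acc < target:
        break
    if best is None or xs[hi - 1] - xs[lo] < best[1] - best[0]:
        best = (xs[lo], xs[hi - 1])
    acc -= mass[lo]

print(f"68.3% credible interval: [{best[0]:.2f}, {best[1]:.2f}]")
```

Because the Gamma is skewed, the narrowest interval is not symmetric about the mean; it hugs the mode instead.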

Frequentist inference is typically based on asymptotic properties (which is sort of unavoidable if the probability is defined as the long-run frequency). When the current experiment is far from the asymptotic regime (which is always the case for searches for rare events) one must check the coverage of the solution with Monte Carlo methods, by simulating a lot of identical experiments (called toy experiments). Bayesian inference requires no toy experiments and provides answers which are not based on asymptotics. At the same time, far from the asymptotic regime the influence of the prior can be strong, which means that one must assess the dependence of the solution on the choice of the prior. Luckily enough, "objective" priors exist and are typically used when one is far from the asymptotic regime. Indeed, the rest of this document is based on them.
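Such a coverage check can be sketched with toy experiments. The example below (hypothetical numbers, plain Python in place of a real analysis framework) measures the coverage of the simple asymptotic interval n ± √n for a Poisson mean of 3, far from the asymptotic regime:

```python
import math
import random

random.seed(42)

def poisson_sample(lam):
    """Sample a Poisson variate (Knuth's method, adequate for small means)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

true_mean, n_toys, covered = 3.0, 100_000, 0
for _ in range(n_toys):
    n = poisson_sample(true_mean)
    # Asymptotic 68.3% interval: n +/- sqrt(n).
    if n - math.sqrt(n) <= true_mean <= n + math.sqrt(n):
        covered += 1

coverage = covered / n_toys
print(f"empirical coverage of n ± sqrt(n) at true mean 3: {coverage:.3f}")
```

The empirical coverage (about 0.72 here) differs from the nominal 68.3% because of the discreteness of the counts, which is exactly why such checks are needed far from the asymptotic regime.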

Finally, nuisance parameters are treated differently in the two approaches. The Bayesian prescription is to integrate over them (which was numerically prohibitive until not long ago), whereas they are fixed to their best values in the frequentist approach (which is computationally much simpler).

Note that, although the posterior mode with a flat prior is the same as the maximum likelihood estimator of the unknown value, their interpretation is completely different. You'll understand this immediately if you compare P(rain|cloud) with P(cloud|rain).

Rule for choosing between frequentist and Bayesian inference: if you ask about P(model|data) you must choose the Bayesian approach; if you ask about P(data|model) you can choose the frequentist approach (but you can formulate statements about P(data|model) also within the Bayesian framework, as shown at the end of the next section).

## Bayesian inference for Poisson distributed counts

We consider here a counting experiment observing n events, as a result of trigger and offline selections, which originate from a known source called background (or bkg in short) with a possible additional contribution from another independent source called signal (or sig in short). The unknown true bkg parameter is b (given by the product of bkg cross-section and integrated luminosity), and the Poisson parameter s is the unknown signal intensity.

The goal is to perform statistical inference about the parameters s and b with the purpose of

1. assessing how well the background-only hypothesis Hb describes the outcome;
2. in case Hb poorly describes the data, estimating the signal intensity s; otherwise, placing an upper bound on s. In both cases, b is considered a nuisance parameter, as we are mainly interested in s.

For the first task, the usual method is to compute the p-value under the bkg-only hypothesis, which is the probability to obtain a discrepancy at least as large as the actual one when no signal is present, and convert it into a statistical significance, essentially mapping the deviation onto a Gaussian metric and counting how many "sigmas" away from the peak we are (examples and explicit formulae are provided in [1], with ROOT macros available on https://github.com/dcasadei/psde ). When the significance of the excess is high enough, one then estimates the signal intensity under the hypothesis that both signal and bkg contributed to the observed counts. If the significance is low, an upper bound on the signal is computed.
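The p-value-to-significance conversion can be sketched as follows, assuming for simplicity that the background is exactly known (the cited references treat the background uncertainty properly; all numbers are hypothetical):

```python
import math
from statistics import NormalDist

def poisson_pvalue(n_obs, bkg):
    """p-value under the bkg-only hypothesis: P(N >= n_obs) for N ~ Poisson(bkg)."""
    cdf = sum(math.exp(-bkg) * bkg**k / math.factorial(k) for k in range(n_obs))
    return 1.0 - cdf

def significance(p):
    """Convert a p-value to a one-sided Gaussian significance (number of sigmas)."""
    return NormalDist().inv_cdf(1.0 - p)

# Hypothetical numbers: 6 counts observed over an expected background of 1.5.
p = poisson_pvalue(6, 1.5)
z = significance(p)
print(f"p-value = {p:.4g}, significance = {z:.2f} sigma")
```

Here a p-value of about 0.0045 maps to roughly a 2.6 sigma excess, well below the conventional 5 sigma discovery threshold.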

As we want some answer for the signal in the form of P(model|data), we choose a Bayesian approach. The starting points are

1. the probability model, which in this case is a Poisson with mean s + b (the signal can be null):

   P(n|s,b) = (s+b)^n e^(-(s+b)) / n!

2. the information about sig and bkg available prior to looking at the experimental outcome, encoded into the prior density π(s,b).

The full solution is the joint posterior density of sig and bkg, obtained with Bayes' theorem:

   p(s,b|n) ∝ P(n|s,b) π(s,b)

where the normalization constant can be found by integrating the right-hand term over s and b and imposing that the result is one. Most often, one is only interested in the signal intensity, regardless of the background (which needs to be properly estimated, of course). In this case, the desired solution is the marginal posterior density for the signal alone:

   p(s|n) = ∫ p(s,b|n) db

If we are interested in statements in the form of P(data|model), we can either follow the frequentist approach or compute the predictive posterior distribution of counting k events in the next experiment:

   P(k|n) = ∫∫ P(k|s,b) p(s,b|n) ds db

As the analytic solution of the Bayesian problem is known in the case of Poi(sig+bkg) [2], and is valid for all observed counts n and for all possible future counts k (which is not true when using asymptotic expressions), it is easier to compute the predictive posterior distribution above than to solve the inference problem with frequentist methods.

## Background prior and marginal model

From auxiliary measurements and/or MC simulations, one gets the prior for the background parameter b. Most often, the best description of such prior information is provided by just two numbers, like the expected background E[b] and its "uncertainty" expressed by the square root of the bkg variance V[b]. In the following, we will describe the prior bkg information by means of a Gamma density with known shape α and rate β parameters (both positive real values):

   Ga(b; α, β) = β^α b^(α-1) e^(-βb) / Γ(α)

Shape and rate parameters are typically fixed by requiring E[b] = α/β and V[b] = α/β², or by imposing the peak value (α-1)/β and the uncertainty (the mode is zero when the shape parameter is not larger than one; when this happens, the uncertainty alone is not sufficient to uniquely identify the Gamma density and another quantity, e.g. the skewness or the 90% upper bound, shall be used to set one further constraint). When some actual sampling distribution of b exists, it can be fitted with a Gamma density to find the best values of its two parameters.
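Matching the first two moments can be sketched as follows (hypothetical numbers: E[b] = 10 counts with uncertainty 5):

```python
def gamma_params_from_moments(mean, sd):
    """Shape and rate of the Gamma density matching the given E[b] and sqrt(V[b])."""
    var = sd * sd
    beta = mean / var      # rate: E[b] / V[b]
    alpha = mean * beta    # shape: E[b]^2 / V[b]
    return alpha, beta

# Hypothetical background knowledge: E[b] = 10 counts with uncertainty 5.
alpha, beta = gamma_params_from_moments(10.0, 5.0)
mode = (alpha - 1.0) / beta if alpha > 1.0 else 0.0
print(alpha, beta, mode)
```

With these numbers one gets shape 4 and rate 0.4, whose mode (7.5) sits below the mean, as expected for a right-skewed density.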

Note that often people choose to model the bkg prior with a truncated Gaussian distribution. As the knowledge of the prior does not need to be extremely precise when the number n of observed counts is large enough (but this is seldom the problem of interest!), in principle one is free to make any reasonable choice for the bkg prior (unless n is very low). However, the choice of a truncated Gaussian has several drawbacks:

• it is not natural, as b cannot be negative while the Gaussian is defined on the whole real line. When looking for the prior distribution which best describes some parameter, the first thing to consider is the domain of that parameter, and one then chooses a function defined on that domain. Truncating a Gaussian does the job in a rather unnatural way. Gamma or log-normal densities are the natural choices, but the analytic solution is only available with a Gamma prior
• the σ parameter of the truncated Gaussian is no longer the standard deviation, hence it does not represent our "uncertainty". Similarly, the peak is no longer the expected value
• the truncated Gaussian assigns a finite probability density to b = 0, which is also unnatural
• the posterior can only be found after numerical integration, which is CPU intensive and error prone

While a log-normal distribution can model the background as naturally as a Gamma density, the analytic solution is available only with the latter. In practice, the shapes of the Gamma and log-normal densities which best model the same data differ in a negligible way, and the resulting changes of the posterior are unnoticeable, even when n is very small or zero. This is also true when using a Gaussian whose peak is far enough (in units of σ) from the origin, in which case the three densities do practically the same job, from the point of view of the posterior. However, it is only for the Gamma prior that the analytic solution is available, hence we only consider this family of probability density functions here.

As we are interested in the marginal posterior for s, we can integrate the left- and right-hand terms of Bayes' theorem over b, once the prior for b is specified. This reduces the problem to a one-dimensional inference problem in which the Poisson model is replaced by the marginal model

   P(n|s) = e^(-s) [β/(1+β)]^α P_n(s)

where the polynomial P_n(s) = (1/n!) Σ_{k=0..n} C(n,k) [Γ(α+k)/Γ(α)] (1+β)^(-k) s^(n-k) has degree n in s, as explained in [2].
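The marginal model follows from expanding (s+b)^n and integrating term by term against the Gamma prior. The sketch below (hypothetical values for n, s, α, β) cross-checks this closed form against a direct numerical integration over b:

```python
import math

def marginal_model(n, s, alpha, beta):
    """Closed-form marginal P(n|s): Poisson(s+b) integrated over b ~ Gamma(alpha, beta)."""
    pref = math.exp(-s) * (beta / (1.0 + beta))**alpha / math.factorial(n)
    poly = sum(math.comb(n, k) * s**(n - k)
               * math.gamma(alpha + k) / math.gamma(alpha) / (1.0 + beta)**k
               for k in range(n + 1))
    return pref * poly

def marginal_numeric(n, s, alpha, beta, db=0.001, bmax=60.0):
    """Direct midpoint-rule integration of Poisson(n|s+b) x Gamma(b|alpha, beta) over b."""
    tot, b = 0.0, db / 2.0
    while b < bmax:
        poi = (s + b)**n * math.exp(-(s + b)) / math.factorial(n)
        gam = beta**alpha * b**(alpha - 1.0) * math.exp(-beta * b) / math.gamma(alpha)
        tot += poi * gam * db
        b += db
    return tot

# Hypothetical values: n = 5 observed, signal s = 2, bkg prior Gamma(4, 0.4), i.e. E[b] = 10.
closed = marginal_model(5, 2.0, 4.0, 0.4)
numeric = marginal_numeric(5, 2.0, 4.0, 0.4)
print(closed, numeric)
```

The two results agree to numerical precision, and summing the closed form over all n returns one, as a probability model must.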

## Signal prior

If the result of some other measurement is available, the corresponding marginal posterior for s should be used as prior for the current experiment. Similarly, if some prior information is available about the signal, it is best to encode it into a Gamma density. When a Gamma prior is used for s, the corresponding marginal posterior is a linear combination of Gamma distributions, see [2].

Most often, one wants to find an "objective" Bayesian solution for the signal, which relies only on the assumed model and no other prior information. This solution is obtained when using the reference prior π(s) for the signal computed in [2] and available in the Bayesian Analysis Toolkit (see BAT/examples/advanced/referencecounting/ after installing version 0.93 or later). The corresponding marginal posterior is the reference posterior for s. The reference prior π(s) is improper and tends to become flatter and flatter for better prior knowledge about the background, but the reference posterior is proper in all cases.

Often the limiting situation of certain prior background knowledge provides a good approximation to the actual state of knowledge of the experimenter. When the bkg prior is a delta function centered on b0 (which obviously corresponds to E[b] in this case), the reference prior is

   π0(s) = (s + b0)^(-1/2)

and the corresponding marginal posterior for the signal is very simple:

   p(s|n) ∝ (s + b0)^(n - 1/2) e^(-s)

The limiting reference prior π0(s) tends to the flat prior for increasing b0, and provides a better approximation to the full reference prior π(s) than the flat prior does, for any bkg shape and rate parameters. Hence it is recommended as default prior in place of the uniform distribution (which is mathematically ill-defined and subject to lots of criticism). More precisely, it is recommended to use π0(s) with b0 being set to the "best" prior value for the background (typically the prior expectation or mode).
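In this limiting case the reference prior is 1/√(s+b0) and the posterior is proportional to (s+b0)^(n-1/2) e^(-s), which can be normalized numerically on a grid to extract, e.g., a 95% upper bound. A sketch with hypothetical n and b0:

```python
import math

# Limiting reference prior pi0(s) = 1/sqrt(s + b0); with the bkg fixed at b0 the
# marginal posterior is proportional to (s + b0)**(n - 0.5) * exp(-s).
n, b0 = 3, 2.0          # hypothetical observed counts and fixed background
ds, smax = 0.001, 40.0
grid = [i * ds for i in range(1, int(smax / ds) + 1)]
w = [(s + b0)**(n - 0.5) * math.exp(-s) for s in grid]
norm = sum(w) * ds      # normalization constant of the posterior

# Walk the normalized cumulative distribution up to 95%.
acc, upper95 = 0.0, None
for s, wi in zip(grid, w):
    acc += wi * ds / norm
    if acc >= 0.95:
        upper95 = s
        break

print(f"95% credible upper bound on s: {upper95:.2f}")
```

With 3 observed counts over a fixed background of 2, the 95% upper bound on the signal comes out somewhat above 5.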

## Example based on BAT

This example is based on BAT/examples/advanced/referencecounting/runReferenceCounting.cxx from v0.9.4.1, which was written to process a single triplet of values (Nobs, bkgE, bkgS) representing the observed counts (an integer number), the expected background (floating point value) and its standard deviation (floating point).

Preliminary steps: install BAT and test default example

Download BAT-0.9.4.1.tar.gz from https://www.mppmu.mpg.de/bat/ and extract the archive into the folder BAT-0.9.4.1, then configure ROOT (I tested it with root_v5.34.36), choose where you want to install BAT (here /usr/local/BAT/BAT_v0.9.4.1), and build it:

cd BAT-0.9.4.1
./configure --prefix=/usr/local/BAT/BAT_v0.9.4.1
make
make install


In order to use BAT, you need to set some environment variables:

# The variables below are suggested by BAT.
# Optional: insert lines below in setupBAT.sh
export PATH="/usr/local/BAT/BAT_v0.9.4.1/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/BAT/BAT_v0.9.4.1/lib:$LD_LIBRARY_PATH"
export CPATH="/usr/local/BAT/BAT_v0.9.4.1/include:$CPATH"
export PKG_CONFIG_PATH="/usr/local/BAT/BAT_v0.9.4.1/lib/pkgconfig:$PKG_CONFIG_PATH"


Now you can build and run the default example:

cd examples/advanced/referencecounting/
make
./runReferenceCounting


Get and run the new example

The default example processes the triplet (Nobs=20, bkgE=10.0, bkgS=5.0). Now rename it to runReferenceCounting_orig.cxx and replace it with the modified version runReferenceCounting.cxx, which is able to loop over a list of input triplets of any size. Two input methods are supported:

• ROOT histograms named "counts" and "background" (encoding both the total background and its uncertainty, as bin error). One example is provided in input.root

• Plain text file, with one triplet per line. One example is input.txt
The result of the computation is summarized in numerical form (mean, mode, variance, and several quantiles) and saved into a ROOT tree. Several plots are also generated for each input triplet.

Once you have replaced the BAT original example with this new one, you have to run make once more and then you only need to take care of preparing input files according to your needs.

Please note that this example does not come from the BAT developers, so it is not perfect. However, it performs its task well enough to be used. Execute ./runReferenceCounting without arguments to learn about its interface.

## Upper limits and upper bounds

It is useful to recall the difference between an upper limit on the intensity that one expects to detect with a predefined probability, which characterizes the sensitivity of the detection technique, and the upper bound on the intensity inferred from the actual measurement [4].

Upper bounds (usually called "upper limits" in ATLAS papers, although here we prefer to keep different names, following [4]) are a way of summarizing the result of a measurement when there is no significant evidence of an excess with respect to the counts expected from the background alone.

On the other hand, upper limits are connected to the detection technique and can/should be computed before looking at the experimental outcome. In other words, upper limits are connected to the sensitivity of the chosen technique and do not depend on the counts n in the signal region. However, they depend on the estimated background in the control region.

Explicit formulae to compute upper limits, hence to characterize the expected sensitivity of the analysis, are provided in appendix A of [5] for the simple case of two regions (signal and control, or source and off-source regions), in which n (signal region) and k (control region) counts have been measured, and the ratio of background rates in the two regions is known. The counts n in the signal region are not used to find the upper limits, but only to find the upper bounds (as explained in [5] for this simple "on/off" measurement).

## Multiple background sources

In addition to a single source of background, we can also treat the case of multiple (possibly correlated) background sources, as explained in [6]. If \vec{b} = (b_1, ..., b_k) denotes the k-dimensional vector of all background components, whose sum b = b_1 + ... + b_k represents the total background, the joint posterior of s and \vec{b} is the product of the joint posterior for the signal and the total background, which shall be treated as illustrated above, and the posterior for the relative proportions of each contribution. The second term, describing the relative contributions, is the product of the multinomial likelihood function with the Dirichlet prior, which is the conjugate prior for the multinomial model. This means that the posterior is again a Dirichlet density, whose parameters are the sums of the prior concentration parameters and the corresponding observed counts. The vector \vec{x} represents the (integer) counts in the signal channel and in each background channel, where x_i represents the counts coming from the i-th source of background.

The Dirichlet density for the vector of proportions φ = (φ_1, ..., φ_k) is

   Dir(φ | α) = (1/B(α)) φ_1^(α_1 - 1) ... φ_k^(α_k - 1)

where the multidimensional Beta function B(α) = Γ(α_1)...Γ(α_k) / Γ(α_1 + ... + α_k) has been used to write the normalization constant, and the prior concentration parameters α_i should encode all available prior information. With α_0 = α_1 + ... + α_k, the following equalities can be used to find the best prior concentration parameters:

   E[φ_i] = α_i / α_0,   V[φ_i] = α_i (α_0 - α_i) / [α_0² (α_0 + 1)]

Of course, when only the total number of counts n is known, only the first term in the joint posterior ( p(s,\vec{b}|y,\vec{x})) matters, and the problem reduces to the initial model.
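A minimal sketch of the conjugate Dirichlet update for the background proportions (hypothetical prior concentrations and counts, three background sources):

```python
# Conjugate Dirichlet update for the relative background proportions
# (hypothetical numbers: three background sources).
alpha_prior = [2.0, 1.0, 1.0]   # prior concentration parameters
x = [30, 12, 8]                 # counts attributed to each source
alpha_post = [a + xi for a, xi in zip(alpha_prior, x)]

a0 = sum(alpha_post)
mean = [a / a0 for a in alpha_post]                              # E[phi_i]
var = [a * (a0 - a) / (a0**2 * (a0 + 1)) for a in alpha_post]    # V[phi_i]
print(alpha_post, mean)
```

The posterior concentrations are just prior concentrations plus counts, and the posterior mean proportions sum to one by construction.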

The model with signal and background contributions in each bin is more complicated. Details are explained in [6].

## References

1. G. Choudalakis and D. Casadei, Plotting the Differences Between Data and Expectation, Eur. Phys. J. Plus 127 (2012) 25, arXiv:1111.2062
2. D. Casadei, Reference analysis of the signal + background model in counting experiments, JINST 7 (2012) P01012, arXiv:1108.4270
3. D. Casadei, Reference analysis of the signal + background model in counting experiments II. Approximate reference prior, JINST 9 (2014) T10006, arXiv:1407.5893
4. V.L. Kashyap et al., On Computing Upper Limits to Source Intensities, ApJ 719 (2010) 900
5. D. Casadei, Objective Bayesian analysis of "on/off" measurements, ApJ 798 (2015) 5, arXiv:1407.6610
6. D. Casadei et al., Objective Bayesian analysis of counting experiments with correlated sources of background, doi:10.1080/02664763.2017.1289367, arXiv:1504.02566

-- DiegoCasadei - 29 Aug 2014
