Recommended Procedure for Searching for New Physics at CMS
DRAFT v1.0, 20 May 2009
Search procedures typically consist of several steps of varying complexity.
Assuming that one has a specific physics signal in mind (as opposed to a
non-specific "fishing expedition"), the first step is to model the signal of
interest and identify a data selection method that will produce a potentially
signal-rich dataset. The magnitude and possibly also the shape of the
background content of that dataset must then be studied. Next, one must
calculate the statistical significance of any effect observed, decide whether
or not to claim discovery, and provide additional information to support and
further characterize the claims made. In this set of recommendations we
address the statistical aspects of the significance calculation, discovery claim,
and characterization of the claim.
Many of the methods described here are implemented in the combine package (described in detail here: SWGuideHiggsAnalysisCombinedLimit). While the tool provides reasonable defaults for most cases, analysts should still have a good understanding of the implemented methods in order to choose the options best suited to their analysis.
1. Calculating a significance
The numerical value of a significance depends on a number of choices that the user
must make before looking at the data. Obviously the first of these choices
is that of a hypothesis to test, also called null hypothesis. Next, and most
importantly, the user must construct a test statistic, whose purpose is to
quantify the discrepancy between the data and the null hypothesis in some
interesting direction. The hypothesis pointed to by the interesting direction
is referred to as the "alternative hypothesis". Finally, if the problem involves
nuisance parameters, i.e. parameters that one is not interested in but that are
needed in order to make meaningful inferences, the significance will also
depend on how these are handled.
1.1 Choice of null hypothesis
When testing one hypothesis against another, a useful guideline is to choose
as null hypothesis the one for which incorrect rejection is a worse error than
incorrect acceptance. The reason for this guideline is that in the standard
test procedure, one only has full control of the probability of incorrect
rejection, and it is by making that probability small that one can make the
test stringent. In HEP practice, this means that the null hypothesis will be the
"background-only" one, or the "standard model" one in searches for new physics.
Another useful guideline is that if one hypothesis is simple (involves no
unknown parameters) and the other composite (does involve unknown parameters),
then the null hypothesis should be the simple one, since this simplifies the
significance calculation.
Standard statistical terminology refers to the error of rejecting a true null
hypothesis as a "Type-I" error, whereas the error of accepting a false null
hypothesis is "Type-II". The probability of a Type-I error is also known as the
significance level, or discovery threshold of the corresponding hypothesis test,
and is represented by the Greek letter α.
1.2 Choice of test statistic
In a simple example, common in HEP, the null hypothesis posits that the true
cross section of a new physics process equals zero. In this case a reasonable
test statistic is the event count in a channel expected to be signal-rich. In
a more complex situation, the signal could manifest itself as a bump on top of
a smooth background spectrum; here one would typically do a maximum-likelihood
fit of the spectrum with and without a signal component, and take the ratio Q
of the resulting likelihood values as test statistic. This second example
illustrates the general definition of the likelihood ratio test statistic:
Q ≡ maxH0 L(θ) ⁄ maxH0+H1 L(θ),     (1)
where L(θ) is the likelihood function and depends on one or more parameters
θ. The notations maxH0 and maxH0+H1 indicate that in the numerator of Q, the
likelihood is maximized over all θ values allowed under the null hypothesis,
whereas in the denominator of Q, the likelihood is maximized over all θ values
allowed under either the null or alternative hypothesis. Thus, Q is a number
between 0 and 1. In the above example, θ would be the vector of parameters
needed to model the background and signal components of the fit to the data.
Under the null hypothesis one component of θ, namely the signal magnitude,
is zero; under the alternative hypothesis that component is constrained to be
positive.
This test statistic is used as the default in the combine tool and should be suitable
for the majority of analyses.
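As a standalone illustration, the following minimal sketch evaluates the likelihood ratio of equation (1) for a single-channel counting experiment with a known background expectation. The observed count and background value are hypothetical numbers chosen purely for illustration, and the sketch is not meant to represent what the combine tool does internally.

```python
# Minimal sketch: likelihood ratio Q of eq. (1) for a one-bin counting
# experiment, H0: s = 0 vs. H1: s > 0, with the background b assumed known.
# All numerical values are hypothetical.
import numpy as np
from scipy.stats import poisson

def q_statistic(n_obs, b):
    """Q = max_{H0} L / max_{H0+H1} L for L(s) = Poisson(n_obs | s + b)."""
    l_h0 = poisson.pmf(n_obs, b)              # numerator: s fixed to zero
    s_hat = max(0.0, n_obs - b)               # ML estimate of s, constrained to s >= 0
    l_h0h1 = poisson.pmf(n_obs, s_hat + b)    # denominator: likelihood at its maximum
    return l_h0 / l_h0h1                      # 0 < Q <= 1

n_obs, b = 25, 10.4                           # hypothetical observation and background
q = q_statistic(n_obs, b)
print(f"Q = {q:.3g},  -2 ln Q = {-2.0 * np.log(q):.2f}")
```

In practice one usually works with -2 ln Q, which is large when the data disfavor the null hypothesis.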
The likelihood ratio Q, or a one-to-one function of it, is usually considered a
good choice of test statistic. This is partly a consequence of the
Neyman-Pearson lemma, which states that when both the null and alternative
hypotheses are simple, then a test based on the likelihood ratio is optimal in the
sense that it maximizes the probability of accepting the alternative hypothesis
when it is true. Unfortunately this property does not generalize to the case where
the null and/or alternative hypothesis is composite. Furthermore, in the latter
case it is not even possible to give a fully general recipe for constructing optimal
test statistics. For example, simple testing problems are known where optimality
requires that the maximizations in equation (1) be replaced by integrations (see
Ref. [1]). Nevertheless, since a general recipe is not available, the likelihood
ratio is recommended as a starting point.
In these recommendations we will follow the standard convention of representing
test statistics with upper-case letters when we wish to view them as random variables,
and with lower-case letters when we wish to refer to their observed values.
1.3 Calculating the p-value
In general, observed values of test statistics cannot be directly compared across testing
problems since the latter may differ by their measurement resolution, by the number of fit
parameters, by the dimensionality of the data, etc. It is therefore necessary to calibrate
the evidence provided by the observed value t0 of a given test statistic T.
One way to do this is to calculate the so-called p-value of the test; this is
the probability of observing t0 or a "more extreme" value than t0 under the
null hypothesis, where "more extreme" means "more in the direction of the
alternative hypothesis". If we suppose that large values of T point to the
alternative, the p-value is defined by:
p ≡ Pr[ T ≥ t0 | H0 ].
As this equation implies, calculation of a p-value requires knowledge of the
distribution of T under H0. Only in very simple cases can this distribution be
written down explicitly, in analytical form. Much more common is the situation
where the distribution of T must be estimated by generating toy Monte Carlo
experiments, and p is then approximated by the appropriate tail probability of
this distribution. The main difficulty in this calculation is the handling of
parameters that are not fully specified by H0. Such parameters are known as
nuisance parameters or systematic uncertainties. In the simple example of
testing for the presence of a new physics signal by counting events, the
expected background level may not be exactly known; the question is then, how
does one take the uncertainty on the expected background into account when
calculating p? How does one generate the toy experiments? To answer these
questions in general it is helpful to classify the nuisance parameters in the
problem according to the type of information that is available about them.
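Before turning to the treatment of nuisance parameters, the following minimal sketch illustrates the toy Monte Carlo estimate of a p-value in the simplest possible setting: a counting experiment in which the test statistic is the event count and the background expectation is assumed to be exactly known. All numbers are hypothetical.

```python
# Minimal sketch: toy-MC estimate of the p-value for a counting experiment
# with exactly known background b; T is the event count.  Hypothetical numbers.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(seed=1)
b, n_obs = 10.4, 25                        # assumed background expectation and observed count

n_toys = 500_000
toys = rng.poisson(b, size=n_toys)         # pseudo-experiments generated under H0
p_toy = np.mean(toys >= n_obs)             # fraction of toys at least as extreme as the data

p_exact = poisson.sf(n_obs - 1, b)         # exact Poisson tail probability, for comparison
print(f"toy p-value = {p_toy:.2e},   exact p-value = {p_exact:.2e}")
```

The toy estimate converges to the exact tail probability; its precision is limited by the number of toys, which becomes the practical bottleneck when 5-sigma p-values are needed.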
1.3.1 Nuisance parameters constrained by auxiliary measurements
We first consider the case where the nuisance parameters can be measured in
an independent control sample or from sidebands in a fit to the main observation.
We therefore assume here that there is a likelihood function describing the information
about the nuisance parameters contained in the auxiliary data. Two general
approaches are possible:
1.3.1.1 Bootstrap methods
The idea here is to use an estimate of the nuisance parameter(s), for example
a maximum-likelihood estimate, to calculate the p-value or generate the toy
experiments. Since the p-value is by definition a probability calculated
under the null hypothesis, the estimate should also be calculated under the
null hypothesis. Suppose for example that we are testing for the presence of
a Gaussian signal peak on top of a smooth, polynomial background spectrum
whose coefficients are unknown and therefore nuisance parameters. One way to
determine the polynomial coefficients is by a fit to the data. However, this
fit should not be restricted to the "sidebands", but it should include the
signal region as well, since under the null hypothesis the signal region only
contains background. In fact, if one were to exclude the signal region from
the estimation of the background parameters, the resulting p-value would
almost certainly be too "liberal", i.e. exaggerate the signal significance.
Unfortunately, including the signal region in the background estimation has the
opposite drawback: it tends to make the resulting p-value too conservative.
Although there are ways to solve this problem (e.g. double bootstrapping), they
are generally too CPU-intensive to be of practical use in most HEP problems.
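As a simplified illustration of the plug-in idea (using a counting experiment with a control region rather than the polynomial-fit example above), the sketch below estimates the background under the null hypothesis using both the signal region and the control region, and compares the resulting p-value with a sideband-only estimate. The counts and the transfer factor are hypothetical.

```python
# Minimal sketch: plug-in (bootstrap) p-value for a counting experiment with a
# control region expected to contain tau * b background events on average.
# The ML estimate of b under H0 uses the signal region as well.  Hypothetical numbers.
from scipy.stats import poisson

n_obs, m_obs, tau = 25, 103, 10.0          # signal-region count, control-region count, transfer factor

b_hat_h0 = (n_obs + m_obs) / (1.0 + tau)   # ML estimate of b under H0 (signal region included)
b_hat_sb = m_obs / tau                     # sideband-only estimate, for comparison

p_plugin   = poisson.sf(n_obs - 1, b_hat_h0)   # plug-in p-value, T = signal-region count
p_sideband = poisson.sf(n_obs - 1, b_hat_sb)   # tends to be too small (liberal) when there is an excess
print(f"plug-in p = {p_plugin:.3e},   sideband-only p = {p_sideband:.3e}")
```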
1.3.1.2 Supremum methods
This is essentially a "worst-case analysis": since one does not know the true
value of the nuisance parameters, one just quotes the maximum of the p-value
over the entire space of nuisance parameters. This maximum is called the
supremum p-value, notation psup. It may of course happen that psup=1, in which
case the method is useless. Often however, the data contain some information
about the nuisance parameters, allowing one to construct a γ-level confidence
region Rγ for these parameters. The supremum method can then be refined as
follows. Instead of maximizing the p-value over the entire nuisance parameter
space, maximize it only over Rγ, and add 1-γ to the result to correct for the
fact that the confidence level of Rγ is γ and not 100%. The resulting p-value
is called a confidence interval p-value. For this method to work properly,
1-γ should be chosen much smaller than the discovery threshold α. In addition,
to avoid biasing the p-value, the values of γ and α, and the form of Rγ
(i.e. whether it is an upper limit, a central interval, a Feldman-Cousins interval...),
must be chosen before looking at the data. It often pays to choose the form of
Rγ carefully, since it affects the final value of the significance.
Unfortunately this is not a trivial task when several nuisance parameters are present.
It can be shown that the supremum and confidence interval p-value
constructions are both conservative: for a given value of α, the
probability for the p-value to be smaller than α is itself smaller than
α. In other words, the true probability for falsely claiming a discovery
is never larger than stated.
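The sketch below illustrates the confidence interval p-value for the counting experiment, with the background expectation constrained by an auxiliary measurement that is assumed to be Gaussian. The measured value, its uncertainty and the choice of γ are hypothetical; note that for a real discovery test 1-γ must be much smaller than the intended discovery threshold α.

```python
# Minimal sketch: confidence-interval p-value.  The p-value is maximized over a
# gamma-level central confidence interval R_gamma for the background b, taken
# from an (assumed Gaussian) auxiliary measurement, and 1 - gamma is added.
# All numerical values are hypothetical.
import numpy as np
from scipy.stats import poisson, norm

n_obs = 25
b_meas, sigma_b = 10.4, 1.2                # auxiliary measurement of b and its uncertainty
gamma = 1.0 - 1e-9                         # 1 - gamma chosen much smaller than alpha

# gamma-level central confidence interval for b from the auxiliary measurement
z = norm.isf((1.0 - gamma) / 2.0)
b_lo, b_hi = max(0.0, b_meas - z * sigma_b), b_meas + z * sigma_b

# Maximize the fixed-b p-value over R_gamma (a scan; here the maximum is at b_hi,
# since the tail probability increases with b for T = event count).
b_grid = np.linspace(b_lo, b_hi, 1001)
p_sup_region = max(poisson.sf(n_obs - 1, b) for b in b_grid)

p_ci = p_sup_region + (1.0 - gamma)        # confidence-interval p-value
print(f"R_gamma = [{b_lo:.2f}, {b_hi:.2f}],   p_ci = {p_ci:.3e}")
```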
1.3.2 Nuisance parameters constrained by Bayesian priors
For some nuisance parameters there are no real data available, only Monte Carlo
calculations or theoretical estimates. Examples include the parametrization of
a background shape, the modeling of parton showers, the amount of soft gluon
radiation, etc. In such cases one may be able to construct a distribution
representing one's degree of belief in various values of the nuisance
parameters. A typical procedure is to assume a Gaussian shape for this
distribution, with mean and width set equal to a reasonable central value and
uncertainty, respectively. If a distribution with heavier tails than the
Gaussian is deemed preferable, one would try a distribution from the Student
family. For parameters constrained to be positive, the Gamma or lognormal
family would be appropriate. For efficiency-type parameters, the beta
distribution is a good choice. It may also happen that one hardly has any
information about a given nuisance parameter. In this case one could choose
a so-called "formal" prior: for example a flat prior in a carefully thought-out
parametrization, or Jeffreys' prior.
Once a prior is given, there are two main methods for constructing a p-value.
The first one is known as the prior-predictive method and consists in first
calculating the p-value as if the nuisance parameters were known, and then
averaging the result over the nuisance parameter prior. This method will not
work if the prior is improper, i.e. if its integral diverges. In that case
one may be able to use the data to calculate a proper posterior for the relevant
nuisance parameter, and the p-value can then be averaged over the posterior.
This is known as the posterior-predictive method.
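For the counting experiment, the prior-predictive p-value amounts to averaging the fixed-background tail probability over the prior density of the background. The sketch below assumes a Gaussian prior truncated at zero; the central value and width are hypothetical.

```python
# Minimal sketch: prior-predictive p-value for a counting experiment.  The
# p-value computed for fixed b is averaged over a truncated-Gaussian prior.
# All numerical values are hypothetical.
from scipy.stats import poisson, norm
from scipy.integrate import quad

n_obs = 25
b0, sigma_b = 10.4, 1.2                    # assumed prior central value and width

def prior(b):
    """Gaussian prior for b, truncated at b = 0 and renormalized."""
    norm_const = 1.0 - norm.cdf(0.0, loc=b0, scale=sigma_b)
    return norm.pdf(b, loc=b0, scale=sigma_b) / norm_const

def p_fixed(b):
    """p-value for exactly known background b, with T = event count."""
    return poisson.sf(n_obs - 1, b)

p_prior_pred, _ = quad(lambda b: p_fixed(b) * prior(b), 0.0, b0 + 10.0 * sigma_b)
print(f"prior-predictive p-value = {p_prior_pred:.3e}")
```

The posterior-predictive variant is obtained in the same way, with the prior density replaced by the posterior density of b given the data.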
By construction, the prior-predictive and posterior-predictive p-values are both
smaller than the supremum p-value and therefore less conservative. However
there is no guarantee that the predictive p-values are not somewhat liberal:
there is a risk of overstating the true significance of an observation.
It can be shown that prior- and posterior-predictive p-values are in fact tail
probabilities of the corresponding prior- and posterior-predictive distributions.
In contrast, supremum and bootstrap p-values cannot be viewed as tail
probabilities. This has implications for some numerical computations (see section 3.1).
1.3.3 Hybrid methods
It is not always possible to handle all the nuisance parameters in a given
problem by the same method. Consider for example a search for new particles
decaying into dijets. The first step in this analysis is to model the QCD
production of dijets, more specifically the dijet mass spectrum. The latter
must be represented in some parametric form that is general enough to be able
to reproduce predictions from various Monte Carlo programs (e.g. Pythia, Herwig)
as well as NLO pQCD calculations. Unfortunately there is usually no external
(Bayesian or frequentist) information available about the true values of the
parameters of this model for the QCD spectrum. One possible solution is to
construct a formal prior for these parameters, and then adopt a prior- or
posterior-predictive approach depending on whether or not the formal prior is
proper. This approach is quite complex to implement however. A much easier
solution is based on the bootstrap: simply use the parameter values obtained
from a fit to the data, assuming that there are no new particles. Later in
the analysis it will be necessary to take into account uncertainties on the
jet energy resolution and QCD radiation. Usually these uncertainties are of
a Bayesian nature, and are then incorporated via a prior-predictive method.
The overall approach to the dijet search is therefore a "hybrid" of
frequentist bootstrap and Bayesian prior-predictive methods. In general it
is preferable to stick to one class of methods, frequentist or Bayesian, but
this is not always possible.
The parametrization used to model the QCD dijet mass spectrum in the above
example is to a large extent arbitrary. The only constraint it is subject to
is that it should be able to reproduce some Monte Carlo and pQCD calculations.
The hope is that it will then be general enough to fit the true QCD spectrum,
but there is of course no guarantee that this is the case. Thus, in principle
one should introduce a systematic uncertainty for the choice of parametrization.
This is however very difficult to do in a satisfactory way and is usually
ignored.
Some of the methods for handling nuisance parameters discussed here are
examined in greater detail in
Ref. [2].
1.4 Converting the p-value into a z-value (a number of standard deviations)
Very small p-values, such as those leading to discovery claims, have little
intuitive value. It is therefore traditional to convert p-values into z-values,
where z is the number of standard deviations beyond which the integral of the
tail on one side of a Gaussian distribution equals p. The relation between p and
z can be expressed with the help of an error function:
p = ½ [ 1 − erf( z ⁄ √2 ) ],     i.e.     z = √2 erf⁻¹( 1 − 2p ).
This is purely a convention, which is applied regardless of whether the underlying
test is "one-sided" or "two-sided". Thus we say for example that a p-value of 2.87E-7
corresponds to 5 sigmas, and one of 1.35E-3 to 3 sigmas.
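In practice the conversion is most easily done with the inverse of the standard normal survival function, as in the short sketch below.

```python
# Minimal sketch: p-value <-> z-value conversion with the one-sided Gaussian
# tail convention described above.
from scipy.stats import norm

def z_value(p):
    """Number of standard deviations whose one-sided Gaussian tail area equals p."""
    return norm.isf(p)

def p_value(z):
    """One-sided Gaussian tail area beyond z standard deviations."""
    return norm.sf(z)

print(z_value(2.87e-7))   # ~ 5.0
print(z_value(1.35e-3))   # ~ 3.0
print(p_value(5.0))       # ~ 2.87e-7
```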
2. Claiming or failing to claim discovery
The standard for discovery in high energy physics is a z-value of at least 5.
Nevertheless, observations with a z-value less than 5, but larger than 3, are
often considered interesting enough to merit their own designation of "evidence".
In the remainder of these recommendations we will only consider the
discovery/non-discovery alternative, although the same comments apply to evidence
situations.
When interpreting what appears to be a discovery observation, it is important
to keep in mind the distinction between scientific significance and statistical
significance. Thus for example:
- Any discrepancy with the null hypothesis, no matter how insignificant from a scientific point of view, can yield an arbitrarily large statistical significance if the sample size is large enough. In other words, small systematic biases in the modeling of a null hypothesis will certainly be detected given sufficient data.
- If one tests a large enough number of hypotheses, or if one repeats a given hypothesis test on a large enough number of datasets, then one is bound to find a statistically significant effect even if there is no underlying scientific effect. This is the so-called "look-elsewhere effect", which requires that the p-value be multiplied by an appropriate "trials factor" before deciding whether a discovery has been made. By "appropriate" we mean that the trials factor should only correct for the look-elsewhere effect within a given search: if relevant, it includes the effective number of histogram bins, the number of channels used, the number of histograms looked at, etc., but it does not include other searches in CMS. This issue is discussed in more detail on a separate page: LookElsewhereEffect
Strictly speaking, one cannot claim discovery on the sole basis of a discrepant
observation. What is needed is an alternative hypothesis that explains the data
better than the null hypothesis. To quantify how much better the alternative
hypothesis explains the data than the null hypothesis, one can report the
corresponding likelihood ratio, or the Bayes factor. The likelihood ratio was
defined in eq. (1) above; the Bayes factor is also a ratio, but it involves
integrations over proper priors instead of maximizations:
B01 ≡ ∫H0 L(θ) π(θ|H0) dθ ⁄ ∫H1 L(θ) π(θ|H1) dθ.     (2)
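As an illustration, the sketch below evaluates B01 of equation (2) for the counting experiment: H0 (no signal, known background) involves no free parameter, so its integral reduces to a single likelihood value, while under H1 the signal is assigned a proper prior. The flat prior range and all numerical values are assumptions made purely for illustration.

```python
# Minimal sketch: Bayes factor B01 of eq. (2) for a one-bin counting experiment
# with known background b.  A flat prior on the signal s over [0, s_max] is
# assumed under H1; all numerical values are hypothetical.
from scipy.stats import poisson
from scipy.integrate import quad

n_obs, b = 25, 10.4
s_max = 50.0                               # upper edge of the assumed flat prior on s

def likelihood(s):
    return poisson.pmf(n_obs, s + b)

numerator = likelihood(0.0)                                         # H0 is simple: L(s = 0)
denominator, _ = quad(lambda s: likelihood(s) / s_max, 0.0, s_max)  # pi(s|H1) = 1/s_max

B01 = numerator / denominator
print(f"B01 = {B01:.3g}")                  # values much smaller than 1 favor H1 over H0
```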
It sometimes happens that one wishes to test two hypotheses, but that the sample
size and/or the measurement resolution are insufficient to make a strong
statement with regard to either hypothesis. Nevertheless, one would like to
make a statement at some level of significance, lower than the usual 3- or
5-sigma level. The purpose of this would not be to make a firm decision
regarding the observation of a new effect, but rather to quantify what the data
can and do say about this effect. How should one choose the level of
significance α in this case? Our recommendation is to set the significance level
to that value for which the power of the test (the probability of accepting the
alternative hypothesis when it is true) is 50%. Requiring a higher significance
level, i.e. a lower Type-I error rate, seems undesirable since it would lead to
a better than even chance of a Type-II error. Of course if the Type-I error
rate resulting from this rule is larger than, say, 20%, then one ought to
conclude that the data simply don't have much to say about the hypothesis
of interest.
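For a counting experiment this recommendation has a simple implementation: place the critical count at the median of the count expected under the alternative, so that the power is approximately 50%, and read off the corresponding Type-I error rate. The background and the assumed alternative signal below are hypothetical.

```python
# Minimal sketch: choosing the significance level alpha such that the power of
# the test is ~50%, for a counting experiment.  Hypothetical numbers.
from scipy.stats import poisson

b, s1 = 10.4, 6.0                          # background and assumed alternative signal

n_crit = int(poisson.median(s1 + b))       # median count under H1: power ~ 50% there
alpha = poisson.sf(n_crit - 1, b)          # Type-I error rate of the test "N >= n_crit"
power = poisson.sf(n_crit - 1, s1 + b)

print(f"n_crit = {n_crit},  alpha = {alpha:.3f},  power = {power:.2f}")
if alpha > 0.2:
    print("alpha too large: the data have little to say about this hypothesis")
```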
3. Constructing an interval
For nested hypothesis testing regarding a continuous parameter, it is useful to
supplement the result of the test with a further characterization of the
parameter space. In the discussion that follows we assume that the test is of
the form H0: θ=θ0 versus H1: θ>θ0, where θ is the parameter of interest and
θ0 is a constant. Simple adjustments can be made when H1 is of the form θ<θ0
or θ≠θ0 instead. We separately consider the two possible results of a
significance test.
3.1 When failing to claim discovery
Failing to claim discovery does not imply that the null hypothesis is true, only
that the experiment failed to detect a large enough deviation δ from it.
However, one can argue that the higher the probability is of detecting a given
deviation from the null, the stronger the evidence is in favor of H0 when the
test fails to reject it (see Ref. [3]). Thus, one way to quantify the strength
of evidence contained in the data in favor of H0 is to examine the probability
of obtaining a smaller p-value than the one observed if the true value of θ is
θ0+δ, as a function of δ:
β(δ) ≡ Pr[ p(T) ≤ p(t0) | θ=θ0+δ ],     (3)
where p(T) is the p-value evaluated at the test statistic T, and t0 is the
observed value of T. If the p-value is a genuine tail probability (as is the case
when there are no nuisance parameters in the problem or when the p-value is of
the prior- or posterior-predictive kind), then p(T) is a one-to-one
transformation of T and the definition of β(δ) can be simplified to:
β(δ) = Pr[ T ≥ t0 | θ=θ0+δ ].
For consistency, the probability "Pr" in the previous two equations should be
calculated in the same way as for the p-value p(T) itself. In other words, the
nuisance parameters should be handled in the same way in both calculations, and
in the limit δ→0 one should obtain β(δ)→p(t0).
If for a given δ the probability β(δ) is high, then one can say that the data
provide strong evidence that the true value of θ is less than θ0+δ. Thus, a
plot of β(δ) versus δ provides a concise way of illustrating that evidence.
Often however, one will prefer to summarize that curve by a single number, say
the δ value δup such that
β(δup) = 0.95.     (4)
It is conventional to interpret θ0+δup as an upper limit on the true value of
θ and state that larger values of θ are excluded by the data at the 95%
confidence level. Whether this confidence level is truly frequentist depends
of course on how the p-value was calculated. For example, if a Bayesian method
was used to eliminate nuisance parameters, then the confidence level will not
be fully frequentist.
Note that if the observed p-value p(t0) is larger than 0.95, there will be no
solution to equation (4) since β(δ=0)=p(t0) and β(δ) increases with δ. In this
case all values of θ>θ0 are excluded at the 95% confidence level. If the null
hypothesis is true, and the p-value is uniformly distributed, there is a 5%
chance that the observed dataset will yield evidence as strong as this in
favor of H0.
It is important to keep in mind that the true purpose of calculating δup is to
quantify the strength of the evidence in favor of H0 obtained from the test
just performed. It would therefore be misguided:
- to reoptimize the analysis after failing to claim discovery, in order to obtain the best possible (i.e. lowest possible) upper limit; or
- to choose a method for calculating upper limits that is different from equation (4).
In both cases one would end up characterizing a different test than the one
performed.
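For the counting experiment with known background, β(δ) reduces to a Poisson tail probability and equation (4) can be solved directly, as in the hypothetical sketch below.

```python
# Minimal sketch: beta(delta) of eq. (3) and the 95% CL upper limit of eq. (4)
# for a counting experiment with known background b; theta0 = 0 (no signal) and
# T is the event count.  All numerical values are hypothetical.
from scipy.stats import poisson
from scipy.optimize import brentq

b, n_obs = 10.4, 12                        # background and an unremarkable observation

def beta(delta):
    """Pr[ T >= t0 | theta = theta0 + delta ], increasing in delta."""
    return poisson.sf(n_obs - 1, b + delta)

p0 = beta(0.0)                             # beta(0) reproduces the observed p-value
if p0 < 0.95:
    delta_up = brentq(lambda d: beta(d) - 0.95, 0.0, 100.0)    # solve eq. (4)
    print(f"p-value = {p0:.3f},  95% CL upper limit on the signal: {delta_up:.2f} events")
else:
    print("beta(0) > 0.95: all theta > theta0 are excluded at 95% CL")
```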
In the case that the test statistic used is a likelihood ratio, as in equation (1), and both the null and signal hypotheses are simple, the probabilities do not have to be computed with pseudo-experiments: Wilks' theorem allows a direct mapping of the likelihood ratio onto the probability β(δ), significantly simplifying the computations. This method is, for example, implemented in the tool-set used by the SMP group to construct confidence intervals on anomalous couplings (AtgcRooStats).
EXO/B2G/SUSY searches often probe kinematic regions where the predictions for H0 are far from perfect. In these cases the CLs construction is used instead, which also takes into account the probability of the data under the signal-plus-background hypothesis; see Ref. [4].
3.2 When claiming discovery
When a discovery is claimed, it is important to quantify the magnitude of the
physical effect that caused the discrepancy with the null hypothesis. This
can also be done with the β function defined in equation (3). A 68% confidence
interval, for example, can be defined as the set of θ values that satisfy:
0.16 ≤ β(θ−θ0) ≤ 0.84.     (5)
This is a central confidence interval since it excludes values of θ for which
the probability of obtaining a p-value smaller than the one observed is either
less than 16% or greater than 84%.
The same comment that was made in section 3.1 applies here:
after a discovery claim the analysis should not be reoptimized to yield (for example) the
shortest possible 68% confidence interval for θ; one should simply use equation
(5).
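In the counting-experiment illustration used above, the interval of equation (5) can be obtained by solving β = 0.16 and β = 0.84 for the signal, as in the hypothetical sketch below.

```python
# Minimal sketch: 68% central confidence interval of eq. (5) for a counting
# experiment with known background b, theta0 = 0 and T = event count.
# All numerical values are hypothetical.
from scipy.stats import poisson
from scipy.optimize import brentq

b, n_obs = 10.4, 25                        # background and a discovery-level observation

def beta(theta):
    """Pr[ T >= t0 | theta ], increasing in theta."""
    return poisson.sf(n_obs - 1, b + theta)

theta_lo = brentq(lambda t: beta(t) - 0.16, 0.0, 200.0)
theta_hi = brentq(lambda t: beta(t) - 0.84, 0.0, 200.0)
print(f"68% central confidence interval for the signal: [{theta_lo:.1f}, {theta_hi:.1f}] events")
```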
4. Summary of quantities to report
Always report the observed p-value. It is also a good idea to report the likelihood ratio Q
or the Bayes factor B01 in favor of the null hypothesis. Both quantities compare
the explanatory power of the null hypothesis to that of the alternative in the case of the data
at hand. When the testing problem is of the form
H0: θ=θ0 versus H1: θ>θ0,
the parameter of interest θ is not uniquely specified by the alternative. In this
case one can report the minimum Bayes factor in favor of H0, which is the
Bayes factor minimized over all values of θ allowed under H1.
Finally, for nested hypothesis testing of a continuous parameter:
- when no discovery is claimed, report the 95% confidence upper limit (4);
- when a discovery is claimed, report the 68% central confidence interval (5).
5. Sensitivity of a search procedure
For nested hypothesis testing of a continuous parameter θ, a good way to
quantify the sensitivity of a search procedure before looking at data is to
compute the set of θ values for which
Pr[ p(T) ≤ α | θ ] ≥ 0.95,     (6)
where α is the discovery threshold and p(T) is the p-value as before. The
interpretation of this "sensitivity set" of θ values is twofold (see Ref. [5]):
- if the true value of θ is in the set, there is a 95% probability of making a discovery;
- if no discovery can be claimed, it will be possible to exclude at least the entire sensitivity set with 95% confidence. This property follows immediately from the discussion in section 3.1.
For the case of testing H0: θ=θ0 versus
H1: θ>θ0, the sensitivity set has the
simple form of a one-sided interval from some θlow up to
infinity.
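For the counting experiment with known background, the sensitivity set of equation (6) can be computed as in the sketch below; the background expectation and the choice of a 5-sigma discovery threshold are hypothetical choices made for illustration.

```python
# Minimal sketch: sensitivity set of eq. (6) for a counting experiment with
# known background b and a 5-sigma discovery threshold; T is the event count.
# All numerical values are hypothetical.
from scipy.stats import poisson, norm
from scipy.optimize import brentq

b = 10.4                                   # assumed background expectation
alpha = norm.sf(5.0)                       # 5-sigma discovery threshold (~2.87e-7)

# Smallest observed count whose p-value is below alpha:
n_crit = 0
while poisson.sf(n_crit - 1, b) > alpha:
    n_crit += 1

def discovery_prob(theta):
    """Pr[ p(T) <= alpha | theta ] for T = event count."""
    return poisson.sf(n_crit - 1, b + theta)

theta_low = brentq(lambda t: discovery_prob(t) - 0.95, 0.0, 500.0)
print(f"critical count = {n_crit};  sensitivity set: theta >= {theta_low:.1f} events")
```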
References
[1] E. L. Lehmann, "On likelihood ratio tests," arXiv:math/0610835v1 [math.ST], 27 Oct 2006.
[2] R. D. Cousins, J. T. Linnemann, and J. Tucker, "Evaluation of three methods for calculating statistical significance when incorporating a systematic uncertainty into a test of the background-only hypothesis for a Poisson process," arXiv:physics/0702156v4 [physics.data-an], 20 Nov 2008.
[3] D. G. Mayo and D. R. Cox, "Frequentist statistics as a theory of inductive inference," arXiv:math/0610846v1 [math.ST], 27 Oct 2006.
[4] A. L. Read, "Presentation of search results: the CLs technique," J. Phys. G: Nucl. Part. Phys. 28, 2693 (2002).
[5] G. Punzi, "Sensitivity of searches for new signals and its optimization," http://www.slac.stanford.edu/econf/C030908/papers/MODT002.pdf.
-- MatthiasMozer - 2017-05-12