- 1. Calculating a significance
- 2. Claiming or failing to claim discovery
- 3. Constructing an interval
- 4. Summary of quantities to report
- 5. Sensitivity of a search procedure
- References

Search procedures typically consist of several steps of varying complexity.
Assuming that one has a specific physics signal in mind (as opposed to a
non-specific "fishing expedition"), the first step is to model the signal of
interest and identify a data selection method that will produce a potentially
signal-rich dataset. The magnitude and possibly also the shape of the
background content of that dataset must then be studied. Next, one must
calculate the statistical significance of any effect observed, decide whether
or not to claim discovery, and provide additional information to support and
further characterize the claims made. In this set of recommendations we
address the statistical aspects of the significance calculation, discovery claim,
and characterization of the claim.
Many of the methods described here are implemented in the *combine* package (described in detail at SWGuideHiggsAnalysisCombinedLimit). While the tool provides reasonable defaults for most cases, analysts should still understand the implemented methods well enough to choose the options best suited to their analysis.

Another useful guideline is that if one hypothesis is simple (involves no unknown parameters) and the other composite (does involve unknown parameters), then the null hypothesis should be the simple one, since this simplifies the significance calculation.

Standard statistical terminology refers to the error of rejecting a true null hypothesis as a "Type-I" error, and to the error of accepting a false null hypothesis as a "Type-II" error. The probability of a Type-I error is also known as the significance level, or discovery threshold, of the corresponding hypothesis test, and is represented by the Greek letter α.

Q ≡ max_{H0} L(θ) ⁄ max_{H0+H1} L(θ),   (1)

The likelihood ratio Q, or a one-to-one function of it, is usually considered a good choice of test statistic. This is partly a consequence of the Neyman-Pearson lemma, which states that when both the null and alternative hypotheses are simple, then a test based on the likelihood ratio is optimal in the sense that it maximizes the probability of accepting the alternative hypothesis when it is true. Unfortunately this property does not generalize to the case where the null and/or alternative hypothesis is composite. Furthermore, in the latter case it is not even possible to give a fully general recipe for constructing optimal test statistics. For example, simple testing problems are known where optimality requires that the maximizations in equation (1) be replaced by integrations (see Ref. [1]). Nevertheless, since a general recipe is not available, the likelihood ratio is recommended as a starting point.
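To make the likelihood ratio of equation (1) concrete, the sketch below evaluates it for a simple Poisson counting experiment, a hypothetical setup not taken from the text: H0 fixes the mean at a known background b, while H0+H1 allows an additional signal s ≥ 0. The names `poisson_loglik` and `llr_test_statistic` are illustrative.

```python
import math

def poisson_loglik(n, mu):
    # log of the Poisson probability, dropping the n! term (it cancels in ratios)
    return n * math.log(mu) - mu

def llr_test_statistic(n_obs, b):
    """-2 ln Q for H0: mu = b versus H0+H1: mu = b + s, s >= 0 (eq. 1).
    The maximum over H0+H1 is attained at s_hat = max(n_obs - b, 0)."""
    s_hat = max(n_obs - b, 0.0)
    ln_q = poisson_loglik(n_obs, b) - poisson_loglik(n_obs, b + s_hat)
    return -2.0 * ln_q

# Example: 12 events observed over an expected background of 5.0
print(round(llr_test_statistic(12, 5.0), 3))  # 7.011
```

Large values of -2 ln Q indicate data that the background-only hypothesis explains poorly; deficits (n_obs ≤ b) give zero, since the signal strength is constrained to be non-negative.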

In these recommendations we will follow the standard convention of representing test statistics with upper-case letters when we wish to view them as random variables, and with lower-case letters when we wish to refer to their observed values.

p ≡ Pr( T ≥ t_{0} | H_{0} ).
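When the distribution of T under H0 is not known in closed form, this tail probability is commonly estimated with pseudo-experiments. The sketch below does so for the same hypothetical Poisson counting setup (background b known, signal s ≥ 0); the function names are illustrative, not from any CMS tool.

```python
import numpy as np

rng = np.random.default_rng(42)

def t_stat(n, b):
    # -2 ln Q for the Poisson counting test; zero for deficits (s_hat = 0)
    s_hat = np.maximum(n - b, 0.0)
    ratio = np.where(s_hat > 0, (b + s_hat) / b, 1.0)
    return 2.0 * (n * np.log(ratio) - s_hat)

def toy_p_value(n_obs, b, n_toys=200_000):
    """p = Pr(T >= t_obs | H0), estimated from background-only pseudo-experiments."""
    t_obs = t_stat(n_obs, b)
    toys = rng.poisson(b, size=n_toys)
    return float(np.mean(t_stat(toys, b) >= t_obs))

print(toy_p_value(12, 5.0))
```

With 12 observed events over a background of 5.0, the estimate converges to the exact Poisson tail probability Pr(N ≥ 12 | 5) ≈ 0.0055, up to Monte Carlo fluctuations.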

It can be shown that the supremum and confidence interval p-value constructions are both conservative: for a given value of α, the probability for the p-value to be smaller than α is itself smaller than α. In other words, the true probability for falsely claiming a discovery is never larger than stated.

Once a prior is given, there are two main methods for constructing a p-value. The first one is known as the prior-predictive method and consists in first calculating the p-value as if the nuisance parameters were known, and then averaging the result over the nuisance parameter prior. This method will not work if the prior is improper, i.e. if its integral diverges. In that case one may be able to use the data to calculate a proper posterior for the relevant nuisance parameter, and the p-value can then be averaged over the posterior. This is known as the posterior-predictive method.

By construction, the prior-predictive and posterior-predictive p-values are both smaller than the supremum p-value and therefore less conservative. However there is no guarantee that the predictive p-values are not somewhat liberal: there is a risk of overstating the true significance of an observation.

It can be shown that prior- and posterior-predictive p-values are in fact tail probabilities of the corresponding prior- and posterior-predictive distributions. In contrast, supremum and bootstrap p-values cannot be viewed as tail probabilities. This has implications for some numerical computations (see section 3.1).

The parametrization used to model the QCD dijet mass spectrum in the above example is to a large extent arbitrary. The only constraint it is subject to is that it should be able to reproduce some Monte Carlo and pQCD calculations. The hope is that it will then be general enough to fit the true QCD spectrum, but there is of course no guarantee that this is the case. Thus, in principle one should introduce a systematic uncertainty for the choice of parametrization. This is however very difficult to do in a satisfactory way and is usually ignored.

Some of the methods for handling nuisance parameters discussed here are examined in greater detail in Ref. [2].

p = ½ [1 - erf(z/√2)]
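This conversion between the significance z and the one-sided p-value can be evaluated directly with the standard library; the inversion by bisection below is one simple way to go back from p to z (scipy's `norm.isf` would serve the same purpose).

```python
import math

def p_from_z(z):
    # one-sided tail probability for significance z: p = (1/2)[1 - erf(z/sqrt(2))]
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))

def z_from_p(p, tol=1e-12):
    # invert p_from_z by bisection; p_from_z is monotonically decreasing in z
    lo, hi = 0.0, 40.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if p_from_z(mid) > p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(f"{p_from_z(5.0):.3g}")   # 2.87e-07, the 5-sigma discovery threshold
print(f"{z_from_p(0.05):.3f}")  # 1.645
```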

When interpreting what appears to be a discovery observation, it is important to keep in mind the distinction between scientific significance and statistical significance. Thus for example:

- Any discrepancy with the null hypothesis, no matter how insignificant from a scientific point of view, can yield an arbitrarily large statistical significance if the sample size is large enough. In other words, small systematic biases in the modeling of a null hypothesis will certainly be detected given sufficient data.
- If one tests a large enough number of hypotheses, or if one repeats a given hypothesis test on a large enough number of datasets, then one is bound to find a statistically significant effect even if there is no underlying scientific effect. This is the so-called "look-elsewhere effect", which requires that the p-value be multiplied by an appropriate "trials factor" before deciding whether a discovery has been made. By "appropriate" we mean that the trials factor should only correct for the look-elsewhere effect within a given search. Thus, where relevant, it includes the effective number of histogram bins, the number of channels used, the number of histograms examined, and so on; it does not, however, include other searches in CMS. This issue is discussed in more detail on a separate page: LookElsewhereEffect

Strictly speaking, one cannot claim discovery on the sole basis of a discrepant observation. What is needed is an alternative hypothesis that explains the data better than the null hypothesis. To quantify how much better the alternative hypothesis explains the data than the null hypothesis, one can report the corresponding likelihood ratio, or the Bayes factor. The likelihood ratio was defined in eq.(1) above; the Bayes factor is also a ratio, but it involves integrations over proper priors instead of maximizations:

B_{01} ≡ ∫_{H0} L(θ) π(θ|H_{0}) dθ ⁄ ∫_{H1} L(θ) π(θ|H_{1}) dθ   (2)
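For the hypothetical Poisson counting example, the Bayes factor of equation (2) can be evaluated numerically. H0 (mean fixed at b) is simple, so its integral reduces to the likelihood at b; for H1 we assume, purely for illustration, a flat proper prior on the signal over [0, s_max].

```python
import math

def poisson_pmf(n, mu):
    # exact Poisson probability via log-space evaluation
    return math.exp(n * math.log(mu) - mu - math.lgamma(n + 1))

def bayes_factor_01(n_obs, b, s_max=20.0, n_grid=2000):
    """B_01 (eq. 2) for H0: mu = b (simple, so the integral is just L(b))
    versus H1: mu = b + s with an assumed flat proper prior pi(s) = 1/s_max."""
    numerator = poisson_pmf(n_obs, b)
    ds = s_max / n_grid
    # midpoint-rule integration of L(b+s) * pi(s) over the prior support
    denominator = sum(poisson_pmf(n_obs, b + (i + 0.5) * ds)
                      for i in range(n_grid)) * ds / s_max
    return numerator / denominator

print(round(bayes_factor_01(12, 5.0), 3))
```

A value of B_01 well below one, as obtained here, indicates that the data favor the signal hypothesis; note that the result depends on the assumed prior range s_max, which is an intrinsic feature of Bayes factors with diffuse priors.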

It sometimes happens that one wishes to test two hypotheses, but that the sample size and/or the measurement resolution are insufficient to make a strong statement with regard to either hypothesis. Nevertheless, one would like to make a statement at some level of significance, lower than the usual 3- or 5-sigma level. The purpose of this would not be to make a firm decision regarding the observation of a new effect, but rather to quantify what the data can and do say about this effect. How should one choose the level of significance α in this case? Our recommendation is to set the significance level to that value for which the power of the test (the probability of accepting the alternative hypothesis when it is true) is 50%. Requiring a higher significance level, i.e. a lower Type-I error rate, seems undesirable since it would lead to a better than even chance of a Type-II error. Of course if the Type-I error rate resulting from this rule is larger than, say, 20%, then one ought to conclude that the data simply don't have much to say about the hypothesis of interest.
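The 50%-power rule above has a simple closed form in an idealized one-sided Gaussian test, which we use here only as an illustration: if T ~ N(θ, σ) and we test θ = 0 against θ = θ1, then 50% power puts the critical value at the median of the alternative, t_c = θ1, so α is just the H0 tail probability beyond θ1.

```python
import math

def alpha_for_half_power(theta1, sigma):
    """Significance level alpha at which the one-sided Gaussian test of
    theta = 0 versus theta = theta1 has exactly 50% power.
    Power = 50% places the critical value at the median of H1 (t_c = theta1),
    so alpha = Pr(T >= theta1 | theta = 0)."""
    return 0.5 * (1.0 - math.erf(theta1 / (sigma * math.sqrt(2.0))))

# A hypothetical effect expected one resolution unit away from the null:
print(round(alpha_for_half_power(1.0, 1.0), 4))  # 0.1587
```

In this example α ≈ 16%, below the ~20% threshold mentioned above, so the test would still carry some (modest) information about the hypothesis.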

β(δ) ≡ Pr[ p(T) ≤ p(t_{0}) | θ=θ_{0}+δ ],   (3)

β(δ) = Pr[ T ≥ t_{0} | θ=θ_{0}+δ ].

If for a given δ the probability β(δ) is high,
then one can say that the data provide strong evidence that the true value of
θ is less than θ_{0}+δ. Thus, a plot of β(δ) versus δ
provides a concise way of illustrating that evidence. Often however, one will
prefer to summarize that curve by a single number, say the δ value δ_{up}
such that

β(δ=δ_{up}) = 0.95.   (4)

Note that if the observed p-value p(t_{0}) is larger than 0.95, there will be no
solution to equation (4) since β(δ=0)=p(t_{0}) and β(δ) increases
with δ. In this case all values of θ>θ_{0} are excluded at the 95%
confidence level. If the null hypothesis is true, and the p-value is uniformly distributed,
there is a 5% chance that the observed dataset will yield evidence as strong as this in
favor of H_{0}.
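The solution of equation (4) can be sketched for an idealized Gaussian measurement, T ~ N(θ, σ), an assumption made here only for illustration: β(δ) is then monotonically increasing in δ and δ_up can be found by bisection.

```python
import math

def beta(delta, t0, theta0, sigma):
    # beta(delta) = Pr[T >= t0 | theta = theta0 + delta] for Gaussian T (eq. 3)
    z = (t0 - theta0 - delta) / sigma
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))

def delta_up(t0, theta0, sigma, cl=0.95):
    # solve beta(delta_up) = cl (eq. 4) by bisection; beta is increasing in delta
    lo, hi = -10.0 * sigma, t0 - theta0 + 10.0 * sigma
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if beta(mid, t0, theta0, sigma) < cl:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Observation one sigma above theta0: delta_up = 1 + 1.645 = 2.645
print(round(delta_up(t0=1.0, theta0=0.0, sigma=1.0), 3))
```

In this Gaussian case the familiar form δ_up = (t0 − θ0) + 1.645 σ is recovered, and one can also see directly that equation (4) has no solution when β(0) = p(t0) > 0.95.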

It is important to keep in mind that the true purpose of calculating δ_{up}
is to quantify the strength of the evidence in favor of H_{0} obtained from the
test just performed. It would therefore be misguided:

- to reoptimize the analysis after failing to claim discovery, in order to obtain the best possible (i.e. lowest possible) upper limit; or
- to choose a method for calculating upper limits that is different from equation (4).

In the case where the test statistic is a likelihood ratio, as in equation (1), and the null and signal hypotheses are nested, the probabilities do not have to be computed with pseudo-experiments. Instead, in the asymptotic limit, Wilks' theorem provides a direct mapping from the likelihood ratio to the probability β(δ), significantly simplifying the computation. This method is, for example, implemented in the tool-set used by the SMP group to construct confidence intervals on anomalous couplings (AtgcRooStats).
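The asymptotic mapping is a sketch of a χ² tail evaluation rather than any particular tool's implementation: for one parameter of interest, Wilks' theorem states that q = −2 ln Q is distributed as χ² with one degree of freedom under H0, whose survival function has the closed form erfc(√(q/2)).

```python
import math

def p_from_q_wilks(q, ndof=1):
    """Asymptotic p-value for the likelihood-ratio statistic q = -2 ln Q.
    For one parameter of interest, Wilks' theorem gives q ~ chi2(1) under H0,
    so p = Pr(chi2_1 >= q) = erfc(sqrt(q/2)); no pseudo-experiments needed."""
    assert ndof == 1, "this sketch covers the one-parameter case only"
    return math.erfc(math.sqrt(q / 2.0))

# q = 25 corresponds to the (two-sided) 5-sigma threshold:
print(f"{p_from_q_wilks(25.0):.3g}")  # 5.73e-07
```

The validity of this shortcut rests on the asymptotic regularity conditions of Wilks' theorem; near physical boundaries (e.g. a non-negative signal strength) the asymptotic distribution is modified and must be handled accordingly.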

EXO/B2G/SUSY analyses often search in kinematic regions where the predictions for H_{0} are far from perfect. In these cases the CL_{s} construction is used instead, which also takes into account the probability of the data under the background-only hypothesis, thereby protecting against excluding signals to which the analysis has little or no sensitivity. See Ref. [4].

0.16 ≤ β(θ-θ_{0}) ≤ 0.84.   (5)

Finally, for nested hypothesis testing of a continuous parameter:

- when no discovery is claimed, report the 95% confidence upper limit (4);
- when a discovery is claimed, report the 68% central confidence interval (5).

Pr[ p(T) ≤ α | θ ] ≥ 0.95,   (6)

- if the true value of θ is in the set, there is a 95% probability of making a discovery;
- if no discovery can be claimed, it will be possible to exclude at least the entire sensitivity set with 95% confidence. This property follows immediately from the discussion in section 3.1.

For the case of testing H_{0}: θ=θ_{0} versus
H_{1}: θ>θ_{0}, the sensitivity set has the
simple form of a one-sided interval from some θ_{low} up to
infinity.
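For an idealized Gaussian measurement, T ~ N(θ, σ), the lower edge θ_low of this sensitivity set can be written down directly; the setup and function names below are illustrative assumptions, not from the text. Condition (6) requires T to exceed the discovery threshold θ0 + z_α σ with 95% probability, which happens once θ exceeds that threshold by a further 1.645 σ.

```python
import math

def z_from_alpha(alpha):
    # invert alpha = (1/2) erfc(z/sqrt(2)) by bisection (monotonically decreasing)
    lo, hi = 0.0, 40.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if 0.5 * math.erfc(mid / math.sqrt(2.0)) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def theta_low(theta0, sigma, z_disc=5.0, prob=0.95):
    """Lower edge of the sensitivity set (eq. 6) for a Gaussian measurement:
    the smallest theta with Pr[p(T) <= alpha | theta] >= prob, where alpha
    corresponds to a z_disc-sigma discovery threshold."""
    return theta0 + (z_disc + z_from_alpha(1.0 - prob)) * sigma

# 5-sigma discovery threshold, 95% discovery probability:
print(round(theta_low(0.0, 1.0), 3))  # 5 + 1.645 = 6.645
```

The sensitivity set is thus [θ_low, ∞): any true θ at least 6.645 σ above θ0 would be discovered with 95% probability in this idealized case.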

[2] R. D. Cousins, J. T. Linnemann, and J. Tucker, "Evaluation of three methods for calculating statistical significance when incorporating a systematic uncertainty into a test of the background-only hypothesis for a Poisson process," arXiv:physics/0702156v4 (physics.data-an) 20 Nov 2008.

[3] D. G. Mayo and D. R. Cox, "Frequentist statistics as a theory of inductive inference," arXiv:math/0610846v1 (math.ST) 27 Oct 2006.

[4] A.L. Read, "Presentation of search results: the CLs technique", J. Phys. G: Nucl. Part. Phys. 28, 2693 (2002).

[5] G. Punzi, "Sensitivity of searches for new signals and its optimization," http://www.slac.stanford.edu/econf/C030908/papers/MODT002.pdf.

-- MatthiasMozer - 2017-05-12

Topic revision: r3 - 2017-05-14 - MatthiasMozer

