Recommended Procedure for Searching for New Physics at CMS

DRAFT v1.0, 20 May 2009

Search procedures typically consist of several steps of varying complexity. Assuming that one has a specific physics signal in mind (as opposed to a non-specific "fishing expedition"), the first step is to model the signal of interest and identify a data selection method that will produce a potentially signal-rich dataset. The magnitude and possibly also the shape of the background content of that dataset must then be studied. Next, one must calculate the statistical significance of any effect observed, decide whether or not to claim discovery, and provide additional information to support and further characterize the claims made. In this set of recommendations we address the statistical aspects of the significance calculation, the discovery claim, and the characterization of that claim. Many of the methods described here are implemented in the combine package (described in detail at SWGuideHiggsAnalysisCombinedLimit). While the tool provides reasonable defaults for most cases, analysts should still understand the implemented methods well enough to choose the options best suited to their analysis.

1. Calculating a significance

The numerical value of a significance depends on a number of choices that the user must make before looking at the data. Obviously the first of these choices is that of a hypothesis to test, also called the null hypothesis. Next, and most importantly, the user must construct a test statistic, whose purpose is to quantify the discrepancy between the data and the null hypothesis in some interesting direction. The hypothesis pointed to by the interesting direction is referred to as the "alternative hypothesis". Finally, if the problem involves nuisance parameters, i.e. parameters that one is not interested in but that are needed in order to make meaningful inferences, the significance will also depend on how these are handled.

1.1 Choice of null hypothesis

When testing one hypothesis against another, a useful guideline is to choose as null hypothesis the one for which incorrect rejection is a worse error than incorrect acceptance. The reason for this guideline is that in the standard test procedure, one only has full control of the probability of incorrect rejection, and it is by making that probability small that one can make the test stringent. In HEP practice, this means that the null hypothesis will be the "background-only" one, or the "standard model" one in searches for new physics.

Another useful guideline is that if one hypothesis is simple (involves no unknown parameters) and the other composite (does involve unknown parameters), then the null hypothesis should be the simple one, since this simplifies the significance calculation.

Standard statistical terminology refers to the error of rejecting a true null hypothesis as a "Type-I" error, whereas the error of accepting a false null hypothesis is a "Type-II" error. The probability of a Type-I error is also known as the significance level, or discovery threshold, of the corresponding hypothesis test, and is represented by the Greek letter α.


1.2 Choice of test statistic

In a simple example, common in HEP, the null hypothesis posits that the true cross section of a new physics process equals zero. In this case a reasonable test statistic is the event count in a channel expected to be signal-rich. In a more complex situation, the signal could manifest itself as a bump on top of a smooth background spectrum; here one would typically do a maximum-likelihood fit of the spectrum with and without a signal component, and take the ratio Q of the resulting likelihood values as test statistic. This second example illustrates the general definition of the likelihood ratio test statistic:
Q ≡ [ max_{H0} L(θ) ] ⁄ [ max_{H0+H1} L(θ) ],   (1)
where L(θ) is the likelihood function and depends on one or more parameters θ. The notations max_{H0} and max_{H0+H1} indicate that in the numerator of Q the likelihood is maximized over all θ values allowed under the null hypothesis, whereas in the denominator it is maximized over all θ values allowed under either the null or the alternative hypothesis. Thus Q is a number between 0 and 1. In the above example, θ would be the vector of parameters needed to model the background and signal components of the fit to the data. Under the null hypothesis one component of θ, namely the signal magnitude, is zero; under the alternative hypothesis that component is constrained to be positive. This test statistic is used as the default in the combine tool and should be suitable for the majority of analyses.
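For a simple counting experiment the likelihood ratio (1) can be written down directly. The sketch below (Python with scipy, illustrative numbers only) computes Q for a Poisson count with known background b, where the signal strength s is the only free parameter and is constrained to be non-negative under the alternative:

```python
from scipy.stats import poisson

def likelihood_ratio_q(n_obs, b):
    """Q = max_{H0} L / max_{H0+H1} L for a Poisson counting experiment.

    H0: mean = b (signal strength zero); H1: mean = b + s with s > 0.
    Under the alternative the likelihood is maximised at s_hat = max(0, n_obs - b).
    """
    l_null = poisson.pmf(n_obs, b)            # numerator: s fixed to 0
    s_hat = max(0.0, n_obs - b)               # constrained MLE of s
    l_alt = poisson.pmf(n_obs, b + s_hat)     # denominator: maximised over s >= 0
    return l_null / l_alt

print(likelihood_ratio_q(3, 5.0))    # deficit: s_hat = 0, so Q = 1.0
print(likelihood_ratio_q(12, 5.0))   # excess: Q < 1, data prefer a positive signal
```

Note that a deficit (n ≤ b) gives ŝ = 0 and hence Q = 1: such data show no preference for the alternative.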

The likelihood ratio Q, or a one-to-one function of it, is usually considered a good choice of test statistic. This is partly a consequence of the Neyman-Pearson lemma, which states that when both the null and alternative hypotheses are simple, then a test based on the likelihood ratio is optimal in the sense that it maximizes the probability of accepting the alternative hypothesis when it is true. Unfortunately this property does not generalize to the case where the null and/or alternative hypothesis is composite. Furthermore, in the latter case it is not even possible to give a fully general recipe for constructing optimal test statistics. For example, simple testing problems are known where optimality requires that the maximizations in equation (1) be replaced by integrations (see Ref. [1]). Nevertheless, since a general recipe is not available, the likelihood ratio is recommended as a starting point.

In these recommendations we will follow the standard convention of representing test statistics with upper-case letters when we wish to view them as random variables, and with lower-case letters when we wish to refer to their observed values.


1.3 Calculating the p-value

In general, observed values of test statistics cannot be directly compared across testing problems since the latter may differ by their measurement resolution, by the number of fit parameters, by the dimensionality of the data, etc. It is therefore necessary to calibrate the evidence provided by the observed value t0 of a given test statistic T. One way to do this is to calculate the so-called p-value of the test; this is the probability of observing t0 or a "more extreme" value than t0 under the null hypothesis, where "more extreme" means "more in the direction of the alternative hypothesis". If we suppose that large values of T point to the alternative, the p-value is defined by:
p ≡ Pr( T ≥ t0 | H0 ).
As this equation implies, calculation of a p-value requires knowledge of the distribution of T under H0. Only in very simple cases can this distribution be written down explicitly, in analytical form. Much more common is the situation where the distribution of T must be estimated by generating toy Monte Carlo experiments, and p is then approximated by the appropriate tail probability of this distribution. The main difficulty in this calculation is the handling of parameters that are not fully specified by H0. Such parameters are known as nuisance parameters or systematic uncertainties. In the simple example of testing for the presence of a new physics signal by counting events, the expected background level may not be exactly known; the question is then, how does one take the uncertainty on the expected background into account when calculating p? How does one generate the toy experiments? To answer these questions in general it is helpful to classify the nuisance parameters in the problem according to the type of information that is available about them.
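As a minimal illustration of the toy Monte Carlo approach, consider again the counting experiment with an exactly known background b (no nuisance parameters yet); all numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def toy_pvalue(n_obs, b, n_toys=100_000):
    """p-value from toy experiments for T = event count, H0: Poisson(b).

    Large T points towards the alternative, so p = Pr(T >= n_obs | H0),
    approximated by the tail fraction of the toy distribution.
    """
    toys = rng.poisson(b, size=n_toys)
    return np.mean(toys >= n_obs)

p = toy_pvalue(n_obs=12, b=5.0)
```

In this simple case the toy estimate can be checked against the exact Poisson tail probability; in realistic problems the toy generation is the only practical route.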


1.3.1 Nuisance parameters constrained by auxiliary measurements

We first consider the case where the nuisance parameters can be measured in an independent control sample or from sidebands in a fit to the main observation. We therefore assume here that there is a likelihood function describing the information about the nuisance parameters contained in the auxiliary data. Two general approaches are possible:

1.3.1.1 Bootstrap methods
The idea here is to use an estimate of the nuisance parameter(s), for example a maximum-likelihood estimate, to calculate the p-value or generate the toy experiments. Since the p-value is by definition a probability calculated under the null hypothesis, the estimate should also be calculated under the null hypothesis. Suppose for example that we are testing for the presence of a Gaussian signal peak on top of a smooth, polynomial background spectrum whose coefficients are unknown and therefore nuisance parameters. One way to determine the polynomial coefficients is by a fit to the data. However, this fit should not be restricted to the "sidebands"; it should include the signal region as well, since under the null hypothesis the signal region contains only background. In fact, if one were to exclude the signal region from the estimation of the background parameters, the resulting p-value would almost certainly be too "liberal", i.e. it would exaggerate the signal significance. Unfortunately, including the signal region in the background estimation has the corresponding disadvantage that it makes the resulting p-value too conservative. Although there are ways to solve this problem (e.g. double bootstrapping), they are generally too CPU-intensive to be of practical use in most HEP problems.
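The plug-in idea can be sketched as follows for a toy spectrum with a hypothetical linear background (all numbers are invented for illustration). The background parameters are estimated by a fit to all bins, signal window included, and the toy experiments used for the p-value are then drawn from the fitted shape:

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Hypothetical binned mass spectrum with a falling linear background and
# no injected signal.
edges = np.linspace(100.0, 200.0, 21)
centers = 0.5 * (edges[:-1] + edges[1:])
counts = rng.poisson(260.0 - centers)     # true background, unknown to the fit

# Plug-in (bootstrap) estimate: fit the background model to ALL bins,
# signal window included, since under H0 the signal region is background only.
coeffs = np.polyfit(centers, counts, deg=1)
bkg_hat = np.polyval(coeffs, centers)

# Toy experiments for the p-value calculation are then thrown from bkg_hat:
toy = rng.poisson(bkg_hat)
```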

1.3.1.2 Supremum methods
This is essentially a "worst-case analysis": since one does not know the true value of the nuisance parameters, one simply quotes the maximum of the p-value over the entire space of nuisance parameters. This maximum is called the supremum p-value, denoted p_sup. It may of course happen that p_sup = 1, in which case the method is useless. Often, however, the data contain some information about the nuisance parameters, allowing one to construct a γ-level confidence region Rγ for them. The supremum method can then be refined as follows: instead of maximizing the p-value over the entire nuisance parameter space, maximize it only over Rγ, and add 1-γ to the result to correct for the fact that the confidence level of Rγ is γ and not 100%. The resulting p-value is called a confidence interval p-value. For this method to work properly, 1-γ should be chosen much smaller than the discovery threshold α. In addition, to avoid biasing the p-value, the values of γ and α, and the form of Rγ (i.e. whether it is an upper limit, a central interval, a Feldman-Cousins interval...), must be chosen before looking at the data. It often pays to choose the form of Rγ carefully, since it affects the final value of the significance. Unfortunately this is not a trivial task when several nuisance parameters are present.
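For a Poisson counting experiment with an uncertain background mean, the confidence interval p-value can be sketched as below; the confidence region for b is assumed given (here a hypothetical 99.9% CL interval, scanned on a grid):

```python
import numpy as np
from scipy.stats import poisson

def confidence_interval_pvalue(n_obs, b_region, gamma=0.999):
    """Supremum p-value restricted to a gamma-level confidence region for b.

    Maximise Pr(T >= n_obs | b) over the region, then add (1 - gamma) to
    correct for the region's finite coverage.
    """
    p_grid = poisson.sf(n_obs - 1, b_region)   # Pr(T >= n_obs | b) on the grid
    return min(1.0, p_grid.max() + (1.0 - gamma))

# Hypothetical 99.9% CL region for the background mean: b in [3, 7]
b_grid = np.linspace(3.0, 7.0, 401)
p = confidence_interval_pvalue(14, b_grid)
```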

It can be shown that the supremum and confidence interval p-value constructions are both conservative: for a given value of α, the probability for the p-value to be smaller than α is itself smaller than α. In other words, the true probability for falsely claiming a discovery is never larger than stated.


1.3.2 Nuisance parameters constrained by Bayesian priors

For some nuisance parameters there are no real data available, only Monte Carlo calculations or theoretical estimates. Examples include the parametrization of a background shape, the modeling of parton showers, the amount of soft gluon radiation, etc. In such cases one may be able to construct a distribution representing one's degree of belief in various values of the nuisance parameters. A typical procedure is to assume a Gaussian shape for this distribution, with mean and width set equal to a reasonable central value and uncertainty, respectively. If a distribution with heavier tails than the Gaussian is deemed preferable, one would try a distribution from the Student family. For parameters constrained to be positive, the Gamma or lognormal family would be appropriate. For efficiency-type parameters, the beta distribution is a good choice. It may also happen that one hardly has any information about a given nuisance parameter. In this case one could choose a so-called "formal" prior: for example a flat prior in a carefully thought-out parametrization, or Jeffreys' prior.

Once a prior is given, there are two main methods for constructing a p-value. The first one is known as the prior-predictive method and consists in first calculating the p-value as if the nuisance parameters were known, and then averaging the result over the nuisance parameter prior. This method will not work if the prior is improper, i.e. if its integral diverges. In that case one may be able to use the data to calculate a proper posterior for the relevant nuisance parameter, and the p-value can then be averaged over the posterior. This is known as the posterior-predictive method.
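A minimal sketch of the prior-predictive method for the counting example, assuming a Gaussian prior on the background mean truncated at zero (numbers are illustrative):

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(seed=3)

def prior_predictive_pvalue(n_obs, b0, sigma_b, n_samples=200_000):
    """Average the Poisson tail probability over a truncated Gaussian prior on b."""
    b = rng.normal(b0, sigma_b, size=n_samples)
    b = b[b > 0.0]                             # truncate the prior at zero
    return poisson.sf(n_obs - 1, b).mean()     # mean of Pr(T >= n_obs | b)

p = prior_predictive_pvalue(n_obs=12, b0=5.0, sigma_b=1.0)
```

Compared with the fixed-background value of section 1.3, the averaging over the prior typically inflates the tail probability, reflecting the background uncertainty.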

By construction, the prior-predictive and posterior-predictive p-values are both smaller than the supremum p-value and therefore less conservative. However there is no guarantee that the predictive p-values are not somewhat liberal: there is a risk of overstating the true significance of an observation.

It can be shown that prior- and posterior-predictive p-values are in fact tail probabilities of the corresponding prior- and posterior-predictive distributions. In contrast, supremum and bootstrap p-values cannot be viewed as tail probabilities. This has implications for some numerical computations (see section 3.1).


1.3.3 Hybrid methods

It is not always possible to handle all the nuisance parameters in a given problem by the same method. Consider for example a search for new particles decaying into dijets. The first step in this analysis is to model the QCD production of dijets, more specifically the dijet mass spectrum. The latter must be represented in some parametric form that is general enough to be able to reproduce predictions from various Monte Carlo programs (e.g. Pythia, Herwig) as well as NLO pQCD calculations. Unfortunately there is usually no external (Bayesian or frequentist) information available about the true values of the parameters of this model for the QCD spectrum. One possible solution is to construct a formal prior for these parameters, and then adopt a prior- or posterior-predictive approach depending on whether or not the formal prior is proper. This approach is quite complex to implement however. A much easier solution is based on the bootstrap: simply use the parameter values obtained from a fit to the data, assuming that there are no new particles. Later in the analysis it will be necessary to take into account uncertainties on the jet energy resolution and QCD radiation. Usually these uncertainties are of a Bayesian nature, and are then incorporated via a prior-predictive method. The overall approach to the dijet search is therefore a "hybrid" of frequentist bootstrap and Bayesian prior-predictive methods. In general it is preferable to stick to one class of methods, frequentist or Bayesian, but this is not always possible.

The parametrization used to model the QCD dijet mass spectrum in the above example is to a large extent arbitrary. The only constraint on it is that it should be able to reproduce some Monte Carlo and pQCD calculations. The hope is that it will then be general enough to fit the true QCD spectrum, but there is of course no guarantee that this is the case. Thus, in principle one should introduce a systematic uncertainty for the choice of parametrization. This is however very difficult to do in a satisfactory way, and it is usually ignored.

Some of the methods for handling nuisance parameters discussed here are examined in greater detail in Ref. [2].


1.4 Converting the p-value into a z-value (a number of standard deviations)

Very small p-values, such as those leading to discovery claims, have little intuitive value. It is therefore traditional to convert p-values into z-values, where z is the number of standard deviations beyond which the integral of the tail on one side of a Gaussian distribution equals p. The relation between p and z can be expressed with the help of an error function:
p = ½ [1 - erf(z/√2)]
This is purely a convention, which is applied regardless of whether the underlying test is "one-sided" or "two-sided". Thus we say for example that a p-value of 2.87E-7 corresponds to 5σ, and one of 1.35E-3 to 3σ.
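This convention is straightforward to evaluate numerically, for example with scipy:

```python
from scipy.stats import norm

def p_to_z(p):
    """z-value (one-sided Gaussian tail) corresponding to p-value p."""
    return norm.isf(p)       # inverse survival function: solves p = 1 - Phi(z)

def z_to_p(z):
    """p-value corresponding to z standard deviations."""
    return norm.sf(z)        # upper-tail probability of a standard Gaussian

print(z_to_p(5.0))           # ~ 2.87e-7
print(p_to_z(1.35e-3))       # ~ 3.0
```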


2. Claiming or failing to claim discovery

The standard for discovery in high energy physics is a z-value of at least 5. Nevertheless, observations with a z-value less than 5, but larger than 3, are often considered interesting enough to merit their own designation of "evidence". In the remainder of these recommendations we will only consider the discovery/non-discovery alternative, although the same comments apply to evidence situations.

When interpreting what appears to be a discovery observation, it is important to keep in mind the distinction between scientific significance and statistical significance. Thus for example:

  • Any discrepancy with the null hypothesis, no matter how insignificant from a scientific point of view, can yield an arbitrarily large statistical significance if the sample size is large enough. In other words, small systematic biases in the modeling of a null hypothesis will certainly be detected given sufficient data.
  • If one tests a large enough number of hypotheses, or if one repeats a given hypothesis test on a large enough number of datasets, then one is bound to find a statistically significant effect even if there is no underlying scientific effect. This is the so-called "look-elsewhere effect", which requires that the p-value be multiplied by an appropriate "trials factor" before deciding whether a discovery has been made. By "appropriate" we mean that the trials factor should only correct for the look-elsewhere effect within a given search: where relevant it includes the effective number of histogram bins, the number of channels used, the number of histograms examined, and so on, but it does not include other searches in CMS. This issue is discussed in more detail on a separate page: LookElsewhereEffect

Strictly speaking, one cannot claim discovery on the sole basis of a discrepant observation. What is needed is an alternative hypothesis that explains the data better than the null hypothesis. To quantify how much better the alternative hypothesis explains the data than the null hypothesis, one can report the corresponding likelihood ratio, or the Bayes factor. The likelihood ratio was defined in eq.(1) above; the Bayes factor is also a ratio, but it involves integrations over proper priors instead of maximizations:

B01 ≡ [ ∫_{H0} L(θ) π(θ|H0) dθ ] ⁄ [ ∫_{H1} L(θ) π(θ|H1) dθ ]   (2)
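As a toy illustration of equation (2), consider again the Poisson counting experiment with known background. H0 (s = 0) is simple, so the numerator needs no integration; for H1 we assume, purely for illustration, a flat proper prior on the signal strength over [0, s_max]:

```python
from scipy.stats import poisson
from scipy.integrate import quad

def bayes_factor_01(n_obs, b, s_max=20.0):
    """B01 for a Poisson count with known background b.

    H0: s = 0 (simple); H1: s > 0 with a flat proper prior on [0, s_max].
    The prior choice here is an illustrative assumption, not a recommendation.
    """
    num = poisson.pmf(n_obs, b)                                    # no integration: H0 is simple
    den, _ = quad(lambda s: poisson.pmf(n_obs, b + s) / s_max,     # flat prior density 1/s_max
                  0.0, s_max)
    return num / den

print(bayes_factor_01(12, 5.0))   # < 1: data favour the signal hypothesis
print(bayes_factor_01(3, 5.0))    # > 1: data favour the null hypothesis
```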

It sometimes happens that one wishes to test two hypotheses, but that the sample size and/or the measurement resolution are insufficient to make a strong statement with regard to either hypothesis. Nevertheless, one would like to make a statement at some level of significance, lower than the usual 3- or 5-sigma level. The purpose of this would not be to make a firm decision regarding the observation of a new effect, but rather to quantify what the data can and do say about this effect. How should one choose the level of significance α in this case? Our recommendation is to set the significance level to that value for which the power of the test (the probability of accepting the alternative hypothesis when it is true) is 50%. Requiring a higher significance level, i.e. a lower Type-I error rate, seems undesirable since it would lead to a better than even chance of a Type-II error. Of course if the Type-I error rate resulting from this rule is larger than, say, 20%, then one ought to conclude that the data simply don't have much to say about the hypothesis of interest.
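For a one-sided test of a Gaussian mean with unit variance (a deliberately simple stand-in for a real analysis), the 50%-power rule has a closed form: the critical value must sit exactly at the alternative mean, so α is the null tail probability beyond it:

```python
from scipy.stats import norm

def alpha_at_half_power(delta):
    """Significance level at which a one-sided Gaussian-mean test has 50% power.

    Testing mu = 0 vs mu = delta with X ~ N(mu, 1): power is exactly 50% when
    the critical value equals delta, so alpha = Pr(X >= delta | mu = 0).
    """
    return norm.sf(delta)

# A 1-sigma effect: the rule gives alpha ~ 16%, below a 20% cap, so the
# data still carry some information about the hypothesis of interest.
print(alpha_at_half_power(1.0))
```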


3. Constructing an interval

For nested hypothesis testing regarding a continuous parameter, it is useful to supplement the result of the test with a further characterization of the parameter space. In the discussion that follows we assume that the test is of the form H0: θ=θ0 versus H1: θ>θ0, where θ is the parameter of interest and θ0 is a constant. Simple adjustments can be made when H1 is of the form θ<θ0 or θ≠θ0 instead. We separately consider the two possible results of a significance test.

3.1 When failing to claim discovery

Failing to claim discovery does not imply that the null hypothesis is true, only that the experiment failed to detect a large enough deviation δ from it. However, one can argue that the higher the probability is of detecting a given deviation from the null, the stronger the evidence is in favor of H0 when the test fails to reject it (see Ref. [3]). Thus, one way to quantify the strength of evidence contained in the data in favor of H0 is to examine the probability of obtaining a smaller p-value than the one observed if the true value of θ is θ0+δ, as a function of δ:
β(δ) ≡ Pr[ p(T) ≤ p(t0) | θ=θ0+δ ],   (3)
where p(T) is the p-value evaluated at the test statistic T, and t0 is the observed value of T. If the p-value is a genuine tail probability (as is the case when there are no nuisance parameters in the problem or when the p-value is of the prior- or posterior-predictive kind), then p(T) is a one-to-one transformation of T and the definition of β(δ) can be simplified to:
β(δ) = Pr[ T ≥ t0 | θ=θ0+δ ].
For consistency, the probability "Pr" in the previous two equations should be calculated in the same way as for the p-value p(T) itself. In other words, the nuisance parameters should be handled in the same way in both calculations, and in the limit δ→0 one should obtain β(δ)→p(t0).

If for a given δ the probability β(δ) is high, then one can say that the data provide strong evidence that the true value of θ is less than θ0+δ. Thus, a plot of β(δ) versus δ provides a concise way of illustrating that evidence. Often however, one will prefer to summarize that curve by a single number, say the δ value δup such that

β(δ=δup) = 0.95.   (4)
It is conventional to interpret θ0 + δup as an upper limit on the true value of θ and state that larger values of θ are excluded by the data at the 95% confidence level. Whether this confidence level is truly frequentist depends of course on how the p-value was calculated. For example, if a Bayesian method was used to eliminate nuisance parameters, then the confidence level will not be fully frequentist.

Note that if the observed p-value p(t0) is larger than 0.95, there will be no solution to equation (4) since β(δ=0)=p(t0) and β(δ) increases with δ. In this case all values of θ>θ0 are excluded at the 95% confidence level. If the null hypothesis is true, and the p-value is uniformly distributed, there is a 5% chance that the observed dataset will yield evidence as strong as this in favor of H0.
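The construction of β(δ) and of the 95% CL upper limit (4) can be sketched for the counting experiment with known background (illustrative numbers; in this simple case p(T) ≤ p(t0) is equivalent to T ≥ t0):

```python
from scipy.stats import poisson
from scipy.optimize import brentq

# Counting experiment with known background b; theta is scaled so that
# theta signal events are expected on top of b (illustrative set-up).
b, n_obs = 5.0, 6

def beta(delta):
    """Pr[ p(T) <= p(t0) | theta = delta ]; here equivalent to Pr(T >= n_obs)."""
    return poisson.sf(n_obs - 1, b + delta)

# Consistency check: in the limit delta -> 0, beta equals the observed p-value.
assert abs(beta(0.0) - poisson.sf(n_obs - 1, b)) < 1e-12

# 95% CL upper limit from equation (4): solve beta(delta_up) = 0.95
delta_up = brentq(lambda d: beta(d) - 0.95, 0.0, 50.0)
print(delta_up)
```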

It is important to keep in mind that the true purpose of calculating δup is to quantify the strength of the evidence in favor of H0 obtained from the test just performed. It would therefore be misguided:

  1. to reoptimize the analysis after failing to claim discovery, in order to obtain the best possible (i.e. lowest possible) upper limit; or
  2. to choose a method for calculating upper limits that is different from equation (4).
In both cases one would end up characterizing a different test than the one performed.

In the case that the test statistic is a likelihood ratio, as in equation (1), and the asymptotic conditions of Wilks' theorem are satisfied, the probabilities do not have to be computed with pseudo-experiments. Instead, Wilks' theorem provides a direct mapping from the likelihood ratio to the probability β(δ), significantly simplifying the computation. This method is, for example, implemented in the tool-set used by the SMP group to construct confidence intervals on anomalous couplings (AtgcRooStats).
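Under the asymptotic conditions of Wilks' theorem, the mapping from an observed likelihood ratio q to a p-value reduces to a chi-square tail probability; the sketch below assumes one parameter of interest and uses the two-sided convention (for a one-sided alternative the asymptotic p-value is half of this):

```python
import numpy as np
from scipy.stats import chi2, norm

def significance_from_q(q):
    """Map an observed likelihood ratio q to (p-value, z-value) via Wilks' theorem.

    Assumes -2 ln Q follows a chi-square distribution with 1 degree of freedom
    under H0 (one parameter of interest, regularity conditions satisfied).
    """
    t = -2.0 * np.log(q)
    p = chi2.sf(t, df=1)
    return p, norm.isf(p)

p, z = significance_from_q(np.exp(-8.0))   # -2 ln Q = 16
print(p, z)
```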

Searches in EXO, B2G and SUSY often probe kinematic regions where the predictions for H0 are far from perfect. In these cases the CLs construction is used instead, which also takes into account the probability of the observation under the signal-plus-background hypothesis; see Ref. [4].


3.2 When claiming discovery

When a discovery is claimed, it is important to quantify the magnitude of the physical effect that caused the discrepancy with the null hypothesis. This can also be done with the β function defined in equation (3). A 68% confidence interval for example, can be defined as the set of θ values that satisfy:
0.16 ≤ β(θ-θ0) ≤ 0.84.   (5)
This is a central confidence interval since it excludes values of θ for which the probability of obtaining a p-value smaller than the one observed is either less than 16% or greater than (100-16)%. The same comment that was made in section 3.1 applies here: after a discovery claim the analysis should not be reoptimized to yield (for example) the shortest possible 68% confidence interval for θ; one should simply use equation (5).
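A sketch of the central interval (5) for a Gaussian measurement x ~ N(θ, σ²) with a one-sided test of θ = θ0 = 0 (illustrative numbers); the interval is found by scanning β over a grid of δ = θ − θ0 values:

```python
import numpy as np
from scipy.stats import norm

# Gaussian measurement with unit resolution; observed value x_obs (invented).
sigma, x_obs = 1.0, 4.2

def beta(delta):
    """Pr[ p(X) <= p(x_obs) | theta = delta ] for the one-sided Gaussian test."""
    return norm.sf(x_obs - delta, scale=sigma)   # = Pr(X >= x_obs | theta = delta)

# 68% central interval from eq. (5): all theta with 0.16 <= beta <= 0.84
grid = np.linspace(0.0, 10.0, 100_001)
mask = (beta(grid) >= 0.16) & (beta(grid) <= 0.84)
lo, hi = grid[mask].min(), grid[mask].max()
print(lo, hi)   # roughly x_obs -/+ 1 sigma
```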


4. Summary of quantities to report

Always report the observed p-value. It is also a good idea to report the likelihood ratio Q or the Bayes factor B01 in favor of the null hypothesis. Both quantities compare the explanatory power of the null hypothesis to that of the alternative in the case of the data at hand. When the testing problem is of the form H0: θ=θ0 versus H1: θ>θ0, the parameter of interest θ is not uniquely specified by the alternative. In this case one can report the minimum Bayes factor in favor of H0, which is the Bayes factor minimized over all values of θ allowed under H1.

Finally, for nested hypothesis testing of a continuous parameter:

  • when no discovery is claimed, report the 95% confidence upper limit (4);
  • when a discovery is claimed, report the 68% central confidence interval (5).


5. Sensitivity of a search procedure

For nested hypothesis testing of a continuous parameter θ, a good way to quantify the sensitivity of a search procedure before looking at data is to compute the set of θ values for which
Pr[ p(T) ≤ α | θ ] ≥ 0.95,   (6)
where α is the discovery threshold and p(T) is the p-value as before. The interpretation of this "sensitivity set" of θ values is twofold (see Ref. [5]):
  • if the true value of θ is in the set, there is a 95% probability of making a discovery;
  • if no discovery can be claimed, it will be possible to exclude at least the entire sensitivity set with 95% confidence. This property follows immediately from the discussion in section 3.1.

For the case of testing H0: θ=θ0 versus H1: θ>θ0, the sensitivity set has the simple form of a one-sided interval from some θlow up to infinity.
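For the counting experiment with known background, the lower edge θlow of the sensitivity set (6) can be found numerically (illustrative numbers; α is set to the 5σ discovery threshold):

```python
from scipy.stats import poisson, norm
from scipy.optimize import brentq

# Counting experiment: H0 mean is b, signal theta adds to it (illustrative).
b = 5.0
alpha = norm.sf(5.0)                       # 5-sigma discovery threshold

def discovery_prob(theta):
    """Pr[ p(T) <= alpha | theta ]: probability of claiming discovery."""
    # Smallest count whose p-value Pr(T >= n | b) does not exceed alpha:
    n_crit = poisson.isf(alpha, b) + 1
    return poisson.sf(n_crit - 1, b + theta)

# theta_low: smallest theta with a 95% discovery probability, eq. (6)
theta_low = brentq(lambda t: discovery_prob(t) - 0.95, 0.0, 200.0)
print(theta_low)
```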


References

[1] E. L. Lehmann, "On likelihood ratio tests," arXiv:math/0610835v1 (math.ST) 27 Oct 2006.
[2] R. D. Cousins, J. T. Linnemann, and J. Tucker, "Evaluation of three methods for calculating statistical significance when incorporating a systematic uncertainty into a test of the background-only hypothesis for a Poisson process," arXiv:physics/0702156v4 (physics.data-an) 20 Nov 2008.
[3] D. G. Mayo and D. R. Cox, "Frequentist statistics as a theory of inductive inference," arXiv:math/0610846v1 (math.ST) 27 Oct 2006.
[4] A.L. Read, "Presentation of search results: the CLs technique", J. Phys. G: Nucl. Part. Phys. 28, 2693 (2002).
[5] G. Punzi, "Sensitivity of searches for new signals and its optimization," http://www.slac.stanford.edu/econf/C030908/papers/MODT002.pdf.

-- MatthiasMozer - 2017-05-12
