BumpHunter

This is the TWiki page for a ROOT implementation of the BumpHunter algorithm (see http://arxiv.org/pdf/1101.0390v2.pdf).

It is installed with Athena (the ATLAS software)

bhExample.png
Example output of the tool. Top is the data vs background plot (in the search region), showing where the BumpHunter found the excess. Middle is the BumpHunter test statistic distribution, with the observed value of the test statistic shown. Bottom is the convergence graph, showing how the BumpHunter p-value (i.e. the global p-value) varies with the number of pseudo-experiments performed. The global significance of this excess is 2.86 sigma.

Quick Start Guide

Set up an Athena release (any release that has cmt in it), e.g.:
asetup AthAnalysisBase,2.4.40,here
Get and install the tool:
svn co svn+ssh://svn.cern.ch/reps/atlas-will/will/StatTools/trunk StatTools
cmt find_packages
cmt compile
Open ROOT and use the tool, e.g. if myFile.root contains two histograms, one the background (the MC prediction) and the other the data:
> root myFile.root
root [0] TBumpHunter b(background,data)
root [1] b.Run()
Depending on how many events are in your histograms and how many bins there are, things may take some time to run. See below for some options to deal with this.

Tool options

Change the number of pseudo experiments:
b.SetNPseudoExperiments(1000); (default: at least 1000, no more than 1,000,000 and determined automatically from data significance)
Change the model used for interpreting the bin values of the background distribution:
b.SetBinModel(model);
model=0 (default) A Poisson distribution whose mean parameter has an uncertainty and is distributed according to a gamma distribution. The bin value is the mean of a gamma distribution (with variance equal to binError^2) that is convolved with a Poisson distribution. [Expert note: When evaluating a bump window wider than one bin width, this model adds the bin errors in quadrature to determine the overall uncertainty of the combined mean (the sum of the bin contents). This means it assumes the bin errors are uncorrelated with each other, which is optimistic. If desirable, I can add a new model to the tool that adds the errors directly, i.e. treats them as fully correlated. Just send me an email.] THIS MODEL CAN BE VERY SLOW!
model=1 A Poisson distribution whose mean parameter is the bin value (the bin errors have no impact on the result). THIS IS FASTER but gives you bigger significances.
model=2 A Gaussian distribution with the mean and sigma given by the bin's value and error.
model=3 Like model=0 but with the systematics treated in a correlated way when generating pseudo-data.
Change the search range of the tool. Smaller regions are quicker to search and naturally lead to smaller global p-values (global significance decreases as you increase the search region):
b.SetSearchRegion(low,high);

Allow deficits to be included when hunting:

b.SetTestStatisticType(2);
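
As a rough illustration of how bin models 0 and 1 differ, here is a minimal standalone C++ sketch (standard library only, no ROOT). It is not the tool's code: the names poissonPValue, poissonGammaPValue and windowError are made up for this example, and it relies on the textbook result that a Poisson distribution whose mean is gamma-distributed is a negative binomial. The tool itself may evaluate this differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helpers for illustration; they are not part of TBumpHunter.

// Model 1 style: plain Poisson upper tail, P(N >= nObs | mean).
double poissonPValue(long nObs, double mean) {
    assert(mean > 0.0);
    if (nObs <= 0) return 1.0;
    double sum = 0.0;
    for (long k = nObs;; ++k) {
        // Poisson probability of k, computed in log space to avoid overflow.
        const double term = std::exp(k * std::log(mean) - mean - std::lgamma(k + 1.0));
        sum += term;
        if (k > mean && term <= 1e-16 * sum) break;  // rest of the tail is negligible
    }
    return std::min(sum, 1.0);
}

// Model 0 style: Poisson whose mean follows a gamma distribution with the
// given mean and standard deviation (err). The mixture is a negative binomial.
double poissonGammaPValue(long nObs, double mean, double err) {
    assert(mean > 0.0 && err >= 0.0);
    if (err == 0.0) return poissonPValue(nObs, mean);
    if (nObs <= 0) return 1.0;
    const double shape = mean * mean / (err * err);  // gamma shape parameter
    const double rate = mean / (err * err);          // gamma rate parameter
    const double logP = std::log(rate / (1.0 + rate));
    const double logQ = -std::log1p(rate);           // log(1 / (1 + rate))
    double sum = 0.0;
    for (long k = nObs;; ++k) {
        const double term = std::exp(std::lgamma(k + shape) - std::lgamma(k + 1.0)
                                     - std::lgamma(shape) + shape * logP + k * logQ);
        sum += term;
        if (k > mean && term <= 1e-16 * sum) break;
    }
    return std::min(sum, 1.0);
}

// Uncertainty of a window covering bins [first, last): bin errors are added
// in quadrature, i.e. they are assumed to be uncorrelated.
double windowError(const std::vector<double>& binErrors, std::size_t first, std::size_t last) {
    assert(first <= last && last <= binErrors.size());
    double sumSquares = 0.0;
    for (std::size_t i = first; i < last; ++i) sumSquares += binErrors[i] * binErrors[i];
    return std::sqrt(sumSquares);
}
```

For example, with 20 events observed on a background of 10 +/- 3, poissonGammaPValue(20, 10.0, 3.0) comes out larger than poissonPValue(20, 10.0), which is why model=1 gives bigger significances.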

Multichannel options

To do a multichannel bump hunt, you add additional mc and data plots to the tool like this:
b.AddChannelDistribution(mc,data);
mc and data are histograms for that channel.

When you run the bumphunter now, it will calculate the width of the common window of bumps from different channels (this is where bumps overlap). If at any point the quantity: commonWindowWidth/bumpWidth (for any of the channels) becomes less than a prechosen quantity called the bumpOverlapFactor, then the bumps are considered incompatible and the bumphunter will return a p-value of 1 (i.e. absolutely no bump). You can change the bumpOverlapFactor like this:

b.SetBumpOverlapFactor(..); 

It should be a number between 0 and 1. Zero would mean you are ok with bumps being completely non-overlapping. 0.5 would mean that, in every channel, the common window must cover at least 50% of the bump's width.

The p-value used in multichannel bumphunting is effectively the product of the individual bump p-values.
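
The overlap check and the combination described above can be sketched in a few lines of standalone C++. This is only my reading of the description, not the tool's code: the Bump struct and the combinedPValue function are hypothetical names, and how the tool picks each channel's bump, or whether it applies anything beyond the plain product, is not covered on this page.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical type and function for illustration; not the TBumpHunter API.
struct Bump {
    double low;     // lower edge of the bump window in this channel
    double high;    // upper edge of the bump window in this channel
    double pValue;  // p-value of the bump in this channel
};

double combinedPValue(const std::vector<Bump>& bumps, double overlapFactor) {
    assert(!bumps.empty());
    // The common window is the intersection of all the bump windows.
    double commonLow = bumps.front().low;
    double commonHigh = bumps.front().high;
    for (const Bump& b : bumps) {
        commonLow = std::max(commonLow, b.low);
        commonHigh = std::min(commonHigh, b.high);
    }
    const double commonWidth = std::max(0.0, commonHigh - commonLow);

    double product = 1.0;
    for (const Bump& b : bumps) {
        assert(b.high > b.low);
        // Incompatible bumps: report a p-value of 1 (i.e. no bump at all).
        if (commonWidth / (b.high - b.low) < overlapFactor) return 1.0;
        product *= b.pValue;
    }
    return product;
}
```

With bumps at [0, 10] and [5, 15] the common window is [5, 10], so each ratio is 0.5: an overlap factor of 0.5 accepts the pair and returns the product of the two p-values, while 0.6 rejects it and returns 1.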

How it works

The tool implements the following algorithm:
1. Devise a list of windows of various sizes and locations within the histogram range. Start at the left edge of the search region (use SetSearchLowEdge to define it) with a window size of twice the average bin width (or specify it manually with SetMinWindowSize). Subsequent windows are obtained by shifting to the right by half the window size until the high edge of the search region is reached (use SetSearchHighEdge to define it). Then return to the left of the search region, increase the window size by the average bin width, and repeat until the window size exceeds the maximum window size, which is half the search region (or specify it manually with SetMaxWindowSize). Basically, come up with a load of different sized and located windows. This step is done in the tool with:
b.EvaluateSearchPattern();
2. For each window size and location, calculate the p-value of the data in the window given the background prediction in the window. How this is done depends on the model selected. The default model uses the total background as the mean of a Poisson distribution and the error (added in quadrature across multiple bins) as the uncertainty on that mean; it is evaluated with:
b.GetPoissonConvGammaPValue(zVal,nObs,nExp,errExp);
3. Find the smallest p-value, i.e. the most discrepant data (note that by default only data > background is considered discrepant, unless SetDipHunter(true) is called). The value of the test statistic for the observed data is the negative log of this smallest p-value. This can be evaluated for a given "data" histogram with:
b.EvaluateTestStatistic(data);
4. Generate pseudo-data using the background distribution. This is done for each bin by randomly selecting a mean value for a Poisson distribution and then randomly selecting a value from this Poisson distribution. Pseudo-data can be generated with:
TH1* pseudoData = b.GenerateToyMC();
5. Evaluate the test statistic for this pseudo-data and use it to build up a PDF for the test statistic.
6. The global p-value of the observation is then the p-value of the observed test statistic, i.e. the fraction of the test statistic PDF that lies to the right of the observed test statistic value.
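
To make the six steps concrete, here is a minimal standalone C++ sketch of the whole loop (standard library only, no ROOT). It is a simplification rather than the tool's implementation: all names are hypothetical, bins are assumed to be uniform so that window sizes can be counted in bins, and it uses the plain Poisson model (model=1) throughout, so the "randomly selected mean" of step 4 is simply the bin content.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical names for illustration; none of this is the TBumpHunter API.
struct Window { std::size_t first; std::size_t size; };  // both in bins

// Step 1: window sizes from 2 bins up to half the range, each size scanned
// from the left in steps of half the window size.
std::vector<Window> searchPattern(std::size_t nBins) {
    std::vector<Window> windows;
    for (std::size_t size = 2; size <= nBins / 2; ++size) {
        const std::size_t step = std::max<std::size_t>(1, size / 2);
        for (std::size_t first = 0; first + size <= nBins; first += step)
            windows.push_back({first, size});
    }
    return windows;
}

// Step 2 (model=1 only): Poisson upper tail, P(N >= nObs | mean).
double poissonPValue(long nObs, double mean) {
    assert(mean > 0.0);
    if (nObs <= 0) return 1.0;
    double sum = 0.0;
    for (long k = nObs;; ++k) {
        const double term = std::exp(k * std::log(mean) - mean - std::lgamma(k + 1.0));
        sum += term;
        if (k > mean && term <= 1e-16 * sum) break;  // rest of the tail is negligible
    }
    return std::min(sum, 1.0);
}

// Step 3: t = -log(smallest window p-value), considering excesses only.
double testStatistic(const std::vector<double>& bkg, const std::vector<double>& data) {
    assert(bkg.size() == data.size());
    double pMin = 1.0;
    for (const Window& w : searchPattern(bkg.size())) {
        double b = 0.0, d = 0.0;
        for (std::size_t i = w.first; i < w.first + w.size; ++i) { b += bkg[i]; d += data[i]; }
        if (d > b) pMin = std::min(pMin, poissonPValue(std::lround(d), b));
    }
    return -std::log(pMin);
}

// Step 4: one pseudo-data histogram, Poisson-fluctuating each background bin.
std::vector<double> generateToy(const std::vector<double>& bkg, std::mt19937& rng) {
    std::vector<double> toy;
    toy.reserve(bkg.size());
    for (double mean : bkg) {
        assert(mean > 0.0);
        std::poisson_distribution<long> dist(mean);
        toy.push_back(static_cast<double>(dist(rng)));
    }
    return toy;
}

// Steps 5-6: fraction of pseudo-experiments at least as extreme as the data.
double globalPValue(const std::vector<double>& bkg, const std::vector<double>& data,
                    int nToys, unsigned seed) {
    assert(nToys > 0);
    std::mt19937 rng(seed);
    const double tObs = testStatistic(bkg, data);
    int nAbove = 0;
    for (int i = 0; i < nToys; ++i)
        if (testStatistic(bkg, generateToy(bkg, rng)) >= tObs) ++nAbove;
    return static_cast<double>(nAbove) / nToys;
}
```

Calling globalPValue(bkg, data, 1000, 42u) on two equal-length vectors of bin contents returns the estimated global p-value. With a finite number of pseudo-experiments the estimate cannot resolve p-values much below 1/nToys, which is presumably why more pseudo-experiments are needed for more significant data.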

Troubleshooting

Just drop me an email! -- WillButtinger - 31-Aug-2012
Topic revision: r8 - 2017-12-24 - WillButtinger
 