FittingExercise
Introduction
This page presents a simple exercise whose goal is to help us understand the theory behind the toy-sampling method for the likelihood. The main reason for moving to this simpler exercise is the high complexity present in our lifetime problem (sources of complication: parameter of interest extracted from the shape, bin-to-bin correlations, etc.).
Model and method
As a starting point a single region is defined, which is supposed to resemble a single bin. In this region a picking PDF has been chosen which is not Gaussian, in order to avoid approximations that may otherwise arise. The range, along with the various details entering the proof-of-principle exercise, is summarized in the following table:
| Range | Model | Model Parameters | Parameter of Interest (PoI) | PoI range in toys | PoI step | #Toys per mean |
| 10-200 | Landau | Mean, Sigma (fixed = 1) | Mean = 50 | 10-200 | 0.19 | 1000 |
The fitting method can be summarized in the following steps (a minimal generation sketch follows the list):
- Generate 1000 events from a Landau with mean = 50 and sigma = 1 in the range 10-200 (called the Reference PDF)
- Scan several mean values from 10 to 200 with a step of 0.19 (1000 values in total)
- For each mean value generate 1000 toys and create a histogram with binning proportional to the RMS of the toy sample
- Normalize the histogram (using the Clopper-Pearson method for correct error propagation) to obtain the binned PDF distribution
- Extract a random number from the Reference PDF and identify the bin value and error of this extraction in the PDF distribution of each mean
- Add the point to a new graph with the mean value on the x-axis and the PDF value taken from the PDF distribution on the y-axis
- This is the sampled likelihood graph, which has a maximum for a certain value of the mean
- If pulls and residuals need to be produced, repeat the two previous steps with different extractions from the Reference PDF, using the same toy samples
- Take the difference between the fitted and the generated value for the residual plot; divide by the error if pulls are needed
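As a reference, here is a minimal ROOT sketch of the generation steps 1-3 above. This is not the actual analysis code: the function name, the 100-bin reference histogram, and the accept-reject range truncation are illustrative assumptions, with the values taken from the table above.

```cpp
#include "TH1D.h"
#include "TMath.h"
#include "TRandom3.h"
#include "TString.h"
#include <algorithm>
#include <vector>

void generateToys() {
   TRandom3 rng(0);
   const int nEvents = 1000;
   // Step 1: reference sample from a Landau with mean = 50, sigma = 1,
   // restricted to the range [10, 200]
   TH1D ref("ref", "reference sample", 100, 10., 200.);
   for (int i = 0; i < nEvents; ++i) {
      double v;
      do { v = rng.Landau(50., 1.); } while (v < 10. || v > 200.);
      ref.Fill(v);
   }
   // Steps 2-3: scan the mean from 10 to 200 in steps of 0.19; for each mean
   // generate a toy sample and bin it with a width set by a fraction of its RMS
   for (double mean = 10.; mean < 200.; mean += 0.19) {
      std::vector<double> toy(nEvents);
      for (double &v : toy) {
         do { v = rng.Landau(mean, 1.); } while (v < 10. || v > 200.);
      }
      double rms = TMath::RMS(toy.begin(), toy.end());
      // 190 = width of the range [10, 200]; 50% of the RMS per bin
      int nBins = std::max(1, int(190. / (0.5 * rms)));
      TH1D h(Form("toy_%.2f", mean), "toy sample", nBins, 10., 200.);
      for (double v : toy) h.Fill(v);
      // ... normalize to a binned PDF (see "PDF distribution from toys")
   }
}
```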
The procedure defined above can be applied for any underlying picking PDF in a bin, and "maybe" can also be used if multiple uncorrelated bins are present (a sketch of that factorization is given below). For correlations something extra has to be defined, which is beyond the scope of this twiki.
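For the uncorrelated multi-bin case mentioned above, the sampled likelihood would factorize into a product of per-bin PDF values. A minimal sketch under that assumption (the function name is hypothetical; this is not part of the validated procedure):

```cpp
#include <vector>

// Sketch under the stated assumption of uncorrelated bins: the sampled
// likelihood for one PoI value is the product of the per-bin PDF values.
double sampledLikelihood(const std::vector<double> &pdfValuePerBin) {
   double L = 1.;
   for (double p : pdfValuePerBin) L *= p; // independent bins factorize
   return L;
}
```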
Procedure validation process
With the procedure defined above a value for the parameter of interest can be identified; however, several validation steps need to be taken to ensure that no bias or unexpected effect enters.
Reference PDF and sample for extractions
A projection of the PDF in the defined range is shown here, along with the sample from which the events are extracted.
| Reference PDF | Reference sample |
| Reference PDF in the defined range [10,200] | Reference sample used to perform extractions for residuals/pulls |
Both plots validate that the distribution meets the expectation and that the sample is derived from the same distribution.
PDF distribution from toys
The second step of the procedure described in section 2 (Model and method) requires the generation of toys from Landau distributions with different mean values, but with the sigma of the distribution fixed to the same value as the Reference PDF (sigma = 1). The mean values used for the toys range from 10 to 200, with a step set by the number of scan points (currently 0.19). For each mean value 1000 toy events are generated, in order to have enough statistics for the binned PDF distribution. The binning is calculated automatically by taking the RMS of the toy sample and dividing it by a constant factor that defines the fraction of the RMS contained in each bin. The final step to obtain the binned PDF distribution is to normalize the toy histogram to unity. This process is rather tricky because the error estimation needs to be done correctly, since the division of the bin content by the total entries involves highly correlated quantities. The TEfficiency class from ROOT has been used, where the error estimation is performed automatically with the Clopper-Pearson method. Following the steps described above, the plots below are obtained.
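A minimal sketch of this normalization step, assuming the toy histogram from the generation step (function and variable names are illustrative):

```cpp
#include "TEfficiency.h"
#include "TH1D.h"
#include <cstdio>

// Normalize a toy histogram to a binned PDF with TEfficiency, so that the
// bin errors come from the Clopper-Pearson method.
void normalizeToyHist(const TH1D &toyHist, int nEvents) {
   // "Total" histogram: every bin holds the total number of entries, so that
   // efficiency = bin content / total entries = binned PDF value per bin
   TH1D total(toyHist);
   for (int b = 1; b <= total.GetNbinsX(); ++b)
      total.SetBinContent(b, nEvents);
   TEfficiency eff(toyHist, total);
   eff.SetStatisticOption(TEfficiency::kFCP); // Clopper-Pearson (the default)
   for (int b = 1; b <= toyHist.GetNbinsX(); ++b) {
      double p   = eff.GetEfficiency(b);
      double eLo = eff.GetEfficiencyErrorLow(b);
      double eUp = eff.GetEfficiencyErrorUp(b);
      // p with asymmetric errors (eLo, eUp) is the normalized bin value
      std::printf("bin %d: %.5f -%.5f +%.5f\n", b, p, eLo, eUp);
   }
}
```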
| Binned PDF distribution of toys | Binned PDF distribution of toys with variable binning |
| Initial distribution with mean = 40, total number of events in each bin, and normalized distribution with correct error estimation | Initial distribution with mean = 40, total number of events in each bin, and normalized PDF distribution with correct error estimation; a cutoff merging bins with fewer than 10 events has been applied, causing the variable binning |
In the plots it can be seen that the shape of the distribution is not affected, which is a sign that the normalization is correct. Also, the integral of the lower distribution is 1 as expected, leading to the desired binned PDF. The same procedure was applied for a few sets of toys, which can be seen in the following plots.
| Binned PDF distribution for a range of mean values, all with 1000 toys per value | Binned PDF distribution for a range of mean values, all with 1000 toys per value, with variable binning due to the minimum-entries-per-bin requirement (10 events) |
Scale factor change for binning calculation of PDF distribution
In the distributions shown above a very coarse binning has been chosen as a starting point. The binning choice affects the result, because the shape of the PDF is washed out. For the plots shown above, 50% of the toy-sample RMS was used. The plots below are the equivalent for 10% of the RMS, which gives a much finer binning and a better representation of the PDF shape.
| Binned PDF distribution for a single value | Binned PDF distributions for the same set shown above |
| Calculation of the PDF distribution with correct normalization and 10% of sample RMS binning, mean = 40 | PDF distributions for various mean values, binning 10% of sample RMS |
Toy sampled likelihood plot
The next step is, for each of those binned PDF distributions, to extract the value of the PDF given an extraction from the Reference sample (which is used as the outcome of our experiment). The procedure is the following: a random value is extracted from the reference sample; the extracted value is then used to find the value of the PDF in the binned distribution, and the bin error is used as the error on the PDF value. Taking the value from all the binned PDFs in the same way, the following distribution is obtained (a minimal lookup sketch is given below).
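A minimal sketch of this lookup, assuming the binned PDFs are stored as histograms whose bin errors carry the Clopper-Pearson uncertainties (names are illustrative):

```cpp
#include "TGraphErrors.h"
#include "TH1D.h"
#include <vector>

// Build the sampled-likelihood graph: for each scanned mean, look up the
// binned-PDF value (and bin error) at the extracted observation xObs.
TGraphErrors likelihoodScan(const std::vector<double> &means,
                            const std::vector<TH1D> &binnedPdfs,
                            double xObs) {
   TGraphErrors g((int)means.size());
   for (int i = 0; i < (int)means.size(); ++i) {
      int b = binnedPdfs[i].FindFixBin(xObs);  // bin containing the extraction
      g.SetPoint(i, means[i], binnedPdfs[i].GetBinContent(b));
      g.SetPointError(i, 0., binnedPdfs[i].GetBinError(b)); // bin error as PDF error
   }
   return g;
}
```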
| Likelihood scan for different mean values for a given extraction from the reference sample, wrong choice of PDF value (set to the interpolation between the two points) | Likelihood scan for different mean values for a given extraction from the reference sample, with the rebinning included in the PDF distribution, wrong choice of PDF value (set to the interpolation between the two points) |
| Likelihood scan for different mean values for a given extraction from the reference sample, with corrected choice of the PDF value to be the bin content | Likelihood scan for different mean values for a given extraction from the reference sample, with the rebinning included in the PDF distribution and corrected choice of the PDF value to be the bin content |
Note: A bug was found in the value used for the likelihood: instead of the bin content, the interpolation between the two neighbouring points was used as the PDF value. This is clearly wrong, therefore two new plots have been added to show how the likelihood distribution looks for a PDF binning of 50% of the RMS.
The big jumps seen in the likelihood plots are due to the very coarse binning chosen for the binned PDF distribution (50% of the RMS). If a finer binning is chosen (10% of the RMS), the result is the following plot, which shows a much smoother representation of the likelihood curve.
Likelihood curve with 10% of RMS binning for the PDF distribution |
Histogram range troubleshooting (SOLVED)
The last point in the rebinned distribution, which seems to have zero uncertainty, is due to a bug introduced in the rebinning code. Bins below the mean value have zero content, since the Landau distribution falls rapidly below its mean position. The rebinning code did not account for those bins and therefore did not assign a Poisson uncertainty to them. This phenomenon is also clearly seen in the following plots, prepared to check the steps entering the calculation.
| Histogram filled with 1000 toys, mean value = 58, and rebinning | Rescaled version of the histogram with mean = 58 and variable binning | Distribution of the PDF with bins at 0 kept, which preserves the correct range of the histogram after the rebinning |
Toy sampled likelihood fit
Performing error estimation with the likelihood graph, obtained from evaluating the likelihood in the binned PDF distributions, requires fitting the distribution locally around the maximum. An asymmetric Gaussian has been chosen for the fit. The fit procedure can be summarized in the following steps (a minimal fit sketch follows the list):
- Generate 1000 toy samples across the whole range of the observable variable (x in [10,200], step 0.19)
- For each toy sample create a binned PDF distribution with 1000 entries and binning of 10% of the RMS
- Extract a single observation value from the reference sample
- For all the binned PDFs obtain the PDF value along with its error from the bin closest to the extracted value
- Create a binned sample with all the PDF values over the whole range
- Find the maximum of the sample
- Create a binned sample zoomed around the maximum, with a range arbitrarily chosen as a fraction of the sample RMS (in our case 10%)
- Fit the sample initially with a single Gaussian distribution
- Using only the single-Gaussian fit result, initialize an asymmetric Gaussian (AG) fit
- The mean of the fit along with sigma low and sigma high are the desired quantities
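A minimal sketch of the two fit stages, with the asymmetric Gaussian implemented as a piecewise Gaussian with independent lower and upper sigmas (the exact AG parameterization used in the analysis is an assumption here, and names are illustrative):

```cpp
#include "TF1.h"
#include "TGraphErrors.h"
#include <cmath>

// Zoomed fit around the likelihood maximum: single Gaussian seed, then an
// asymmetric Gaussian (AG) with independent sigma below/above the mean.
void fitLikelihoodMax(TGraphErrors &scan, double maxX, double halfWidth) {
   double lo = maxX - halfWidth, hi = maxX + halfWidth;
   // Single Gaussian, starting point (MaxOfToySample, 1) as in the table
   TF1 sg("sg", "gaus", lo, hi);
   sg.SetParameters(1., maxX, 1.);
   scan.Fit(&sg, "RQ");
   // AG seeded from the single-Gaussian result, with r = sigmaHigh/sigmaLow = 2
   TF1 ag("ag", [](double *x, double *p) {
      double s = (x[0] < p[1]) ? p[2] : p[3]; // sigma_low below, sigma_high above
      return p[0] * std::exp(-0.5 * std::pow((x[0] - p[1]) / s, 2));
   }, lo, hi, 4);
   ag.SetParameters(sg.GetParameter(0), sg.GetParameter(1),
                    sg.GetParameter(2), 2. * sg.GetParameter(2));
   scan.Fit(&ag, "RQ");
   // ag.GetParameter(1): mean; GetParameter(2)/(3): sigma low / sigma high
}
```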
Configuration:
| Observable range | Points in range | Mean value step | Binned PDF binning | Rebinning cutoff | Maximum range | Single Gaussian starting point (mu, sigma) | AG starting point (mu, sigma, r) | #Toys (extractions) |
| 10-200 | 1000 | 0.19 | 10% of RMS | At least 10 events | 10% of toy sample RMS (~+/- 5) | (MaxOfToySample, 1) | (SGMean, SGSigma, 2) | 1000 |
The corresponding plots for the steps described above can be seen in the following table of figures:
| Full range likelihood scan | Zoomed likelihood scan around the maximum | Single Gaussian fit on zoomed likelihood scan | AG fit on zoomed likelihood scan | Multiple likelihood scans for different extractions | Fitted likelihood scans for different extractions | Pulls/Residuals for different extractions |
| Full range scan for a single extraction from the reference sample | Zoomed likelihood scan around the binned sample maximum value | Fit on zoomed likelihood scan around the maximum with a single Gaussian fit | Fit on zoomed likelihood scan around the maximum with an asymmetric Gaussian after the single Gaussian fit | Likelihood scans for multiple extraction values from the reference sample | Likelihood scans for multiple extraction values from the reference sample with AG fits applied to them | Residual and pull plots for the 10 extractions shown in the previous likelihood scans |
Troubleshooting AG fit
A set of issues has been identified in the AG fit procedure. One of the major issues was the definition of the fit range. In the plots shown in the previous section, an arbitrary range around the maximum of the data sample was used. This choice is of course prone to cause bias, and for this reason a recursive approach has been adopted. The first step is to fit a single Gaussian in the full range and then recursively perform fits in windows shrinking from 5 sigma down to 1 sigma. The region defined by 2 sigma of the 1 sigma fit is then used as the starting point of the AG fit. For the AG fit, in order to converge to the correct definition of the maximum, another recursive method is used: the 2 sigma band defined by the AG is fitted repeatedly until the fit stabilizes. A minimal sketch of the recursive single-Gaussian step is given below, followed by a plot of the fit steps:
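A minimal sketch of the recursive range-narrowing single-Gaussian step described above (names are illustrative):

```cpp
#include "TF1.h"
#include "TGraphErrors.h"
#include <cmath>

// Recursive range narrowing: fit the full range, then refit within n-sigma
// windows from 5 down to 1 sigma around the current fitted mean.
TF1 recursiveGausFit(TGraphErrors &scan, double xmin, double xmax) {
   TF1 sg("sg", "gaus", xmin, xmax);
   scan.Fit(&sg, "RQ");                       // seed fit over the full range
   for (int n = 5; n >= 1; --n) {             // shrink window: 5,4,3,2,1 sigma
      double mu = sg.GetParameter(1);
      double s  = std::fabs(sg.GetParameter(2));
      sg.SetRange(mu - n * s, mu + n * s);
      scan.Fit(&sg, "RQ");
   }
   return sg; // its 2-sigma window seeds the asymmetric-Gaussian fit
}
```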
Single gaussian steps for identifying starting point of asymmetric gaussian fit |
For validating the AG fit on the sampled likelihood scan, an additional step of including the analytical version of the likelihood has been added, illustrated in the following plots. The calculation of the analytical likelihood can be seen in the second plot, where each point represents the value of the PDF for a given extraction. As a following step, the TGraph showing the distribution is transformed into a function, which corresponds to the blue curve seen in the AG fits.
| Likelihood scan fits with analytical likelihood version | Analytical likelihood plot for a single extraction | Analytical and numerical likelihood scans for different extractions (toys) |
Analytical likelihood model
For validating the fit procedure the analytical PDF model has been used. The likelihood is calculated using the analytical PDF expression of the Landau, corrected for the binning used in the binned PDF scenario. For this reason the analytical PDF value for a given x is replaced by the integral of the function over the given bin. The next graphs show the evolution of the PDF from the initial RooFit step all the way to the binning-corrected version (a minimal sketch of the correction is given below).
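A minimal sketch of this correction, assuming the analytical PDF is available as a TF1 and the binning as a histogram (names are illustrative):

```cpp
#include "TF1.h"
#include "TH1D.h"

// Replace the analytical PDF value at x by the integral of the PDF over the
// bin that contains x, matching the binned toy-PDF convention.
double binnedPdfValue(TF1 &pdf, const TH1D &binning, double x) {
   int b     = binning.FindFixBin(x);
   double lo = binning.GetBinLowEdge(b);
   double hi = lo + binning.GetBinWidth(b);
   return pdf.Integral(lo, hi); // probability content of the bin
}
```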
| RooFit step | TF1 version | Binned (uniform binning) | Binned (variable binning) | Binned PDF (from toys) | Binned (analytical) PDF vs binned (toys) PDF | Binned PDF vs toy PDF for different PoI values |
| RooFit plot of the Landau model used to generate toys | TF1 version generated from the RooFit PDF (normalized to 1) | Binned version of the PDF with uniform binning for cross-checking the method, binning 10% of sample RMS | Binned version of the PDF with variable binning following the binning convention of the toy PDF distribution | Binned PDF distribution generated from toys, variable binning | Binned PDF (analytical model) vs binned PDF (toys) with variable binning | Binned PDF and toy PDF for two different values of the Parameter of Interest |
Repeating the procedure shown in the previous graphs for a number of different PoI values, we are able to extract the likelihood distribution for both the toy PDFs and the analytical PDFs, together with the corresponding residual plot between the analytical and the toy likelihood.
Binned vs toy likelihood plot with the residuals calculated for each point (zoomed around maximum) |
Residual evaluation
A method to evaluate the fitting procedures described above is to produce residual plots for the different configurations. The configurations used are summarized in the following table:
| Method ID | Model | Binning | Binning scheme | Random extraction scheme | Maximum identification | Number of toys |
| 1 | Analytical PDF | Yes | Constant | Varying | Likelihood sampling | 1000 |
| 2 | Analytical PDF | Yes | Varying | Constant | Likelihood sampling | 1000 |
| 3 | Analytical PDF | No | None | Varying | Analytical model maximum | 1000 |
| 4 | Toy sample PDF | Yes | Constant | Varying | Likelihood sampling | 1000 |
| 5 | Toy sample PDF | Yes | Varying | Constant | Likelihood sampling | 1000 |
| 6 | Toy sample PDF | Yes | Constant | Varying | Fitting likelihood AG fit | 1000 |
The methods described above produce the following residual plots:
| Method | 1 | 2 | 3 | 4 | 5 | 6 |
| | Binned analytical model (same as toy based), varying extractions, 1000 toys | Binned analytical model with variable binning and constant extraction 49.3733, 1000 toys | Analytical unbinned model, 1000 toys | Toy sampled model with constant binning, different extractions, 1000 toys | Toy sampled model with varying binning for constant extraction 49.3733, 1000 toys | Toy sampled model with constant binning but different extractions, fitted with the AG model |
In addition to the residual plots shown for the different configurations, the theoretical calculation of the residual expectation value for a Gaussian has been performed using Mathematica: https://twiki.cern.ch/twiki/pub/Main/FittingExercise/GaussianMeanValue_Unbinned.nb. Additionally, the procedure shown for the Gaussian case has been applied to an asymmetric distribution, so that the calculation of the residual expectation value more closely resembles the Landau case: https://twiki.cern.ch/twiki/pub/Main/FittingExercise/GumbelDistribution_Unbinned.nb
Gumbel PDF
Since the Landau distribution has the issue that its mean value is not defined, the validation of the procedure has been switched to the Gumbel distribution. The benefits of using this distribution are multiple, with the main two being: 1) it is an asymmetric distribution (not falling under Gaussian assumptions); 2) its mean value can be calculated analytically (also by hand). With the Mathematica notebook linked above we are able to calculate the expectation value of the residuals for a Gumbel distribution with a = 50 and b = 1, which is:
$E[\mathrm{res}] = \int_{-\infty}^{+\infty} \left(\hat{a}(x) - a_{\mathrm{truth}}\right)\,\mathrm{PDF}(x)\,dx = -0.577216$, where $\hat{a}(x) = \operatorname{argmax}_a L(a|x)$.
Since the theoretical prediction has been calculated, the next step was to repeat some of the configurations listed in the table above, producing the toy-based calculation of the residual expectation value.
| Method | 1 | 3 | 4 |
| | Analytical likelihood binned according to the toy binning, 10k toys | Analytical likelihood without any binning, 10k toys | Variable-binned toy sampled likelihood, 10k toys |
Binning of PDF optimization - in connection to AG fit failures on the likelihood scan
In many situations the choice of binning for the PDF affected the performance of the asymmetric Gaussian fit around the maximum of the likelihood scan. The main reason for the failures of some of the AG fits was the binning biasing the likelihood in the regions far from the maximum, while also providing only a few points around the maximum. That left too few points for the fit to converge, leading it to unphysical states. For this reason, the decision was made to set aside the variable-binning method with a hard cutoff on the minimum entries, and replace it with an iterative process which chooses the best binning using the significance calculated for the different binning schemes. It is understood that allowing variable binning might lead to better performance in the likelihood extraction, but since the biggest effect of the binning is on the regions away from the maximum, at this point it is left aside. The significance plots produced for the iterative process are the following, with the configuration shown in the table:
To choose the best binning out of the iterations, the binned significance is projected onto the y-axis, so that the histogram with the most significant entries is chosen as the best binning. The steps can be seen in the multi-plots shown below.
The likelihood scan along with the fit around the maximum:
| Likelihood | Fit steps, Gaussian only | Fit step, AG |
| Likelihood scan for a given extraction with the new binning method | Single Gaussian fit on the likelihood scan | AG fit on the likelihood scan |
Fit procedure bias checking with residuals
The following configurations have been applied to our procedure in order to test the bias from each of the effects appearing in it.
* The last plot takes AGES !!!
Debugging residual plots
In the binned fit with the asymmetric Gaussian there are two events that underflow in the residual plot. Those fits are being investigated to check why this happened.
The first fit producing this issue is toy 335 out of 1000, with the following configuration and plots:
| Toy for binned PDF | Single Gaussian range | AG fit range | AG iteration cut 1 | AG iteration cut 2 | AG iteration cut 3 | Plot Gaussian step | Plot AG step | Parameter evolution in iterative process |
| 1000 | Up to 2 sigma | 2 sigma range | Fit moving within the same bin on both edges | Lower or upper sigma change by 20% of previous step value | 10 fits | | | Parameter evolution in iterative AG fit process |
* OUTCOME: SOLVED BY CHECKING THE INITIAL NORMALIZATION VALUE. THE STARTING POINT FOR THE AG FIT NORMALIZATION HAD A DIFFERENT VALUE FROM THE POINT WHERE THE SINGLE GAUSSIAN FIT CONVERGED
The next issue to be explored is the observation that the single Gaussian fit produces less biased results on the parameter estimation than the asymmetric Gaussian fit. For this reason a use case has been picked where the residual value obtained from the single Gaussian fit is better than the one from the asymmetric Gaussian. The steps of both fits can be seen in the following plots, along with the residual values and the corresponding fit logs.
An additional attempt to improve the final asymmetric Gaussian fit is to include a 1 sigma band fit, defined from the last successful 2 sigma asymmetric Gaussian fit. The result can be seen in the following table:
--
IoannisXiotidis - 2020-10-07