FittingExercise (2020-11-23, IoannisXiotidis)

In the following page a simple exercise is presented, whose goal is to build an understanding of the theory behind the toy-sampling method for the likelihood. The main reason for moving to this simpler exercise is the high complexity present in our lifetime problem (parameter of interest extracted from the shape, bin-to-bin correlations, etc.).

As a starting point a single region is defined, which is supposed to resemble a single bin. In this region a picking PDF has been chosen that is not Gaussian, in order to avoid approximations that could otherwise arise. The range, along with the other details entering the proof-of-principle exercise, is summarized in the following table:

| Range | Model | Model Parameters | Parameter of Interest (PoI) | PoI range in toys | PoI step | #Toys per mean |
| 10-200 | Landau | Mean, Sigma (fixed = 1) | Mean = 50 | 10-200 | 0.19 | 1000 |

The fitting method can be summarized in the following steps:

- Generate 1000 events from a Landau with mean = 50 and sigma = 1 in the range 10-200 (called the Reference PDF)
- Scan several mean values from 10-200 with a step of 0.19 (1000 values in total)
- For each mean value generate 1000 toys and fill a histogram with binning proportional to the RMS of the toy sample
- Normalize the histogram (using Clopper-Pearson intervals for correct error propagation) to obtain the binned PDF distribution

- Extract a random number from the Reference PDF and look up the bin value and error of this extraction in the binned PDF distribution of each mean
- Add the point to a new graph with the mean value on the x-axis and the PDF value taken from the binned PDF distribution on the y-axis
- This is the sampled likelihood graph, which has a maximum for a certain value of the mean

- If pulls and residuals are needed, repeat the two previous steps with different extractions from the Reference PDF, using the same toy samples
- Take the difference between the fitted and the generated value for the residual plot; divide by the error if pulls are needed
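
The steps above can be sketched in a few lines of Python. Since scipy has no exact Landau, `scipy.stats.moyal` (a Landau-like, right-skewed distribution) is used here as a stand-in, and the scan range is narrowed for speed, so all numbers are illustrative rather than the twiki's actual configuration:

```python
import numpy as np
from scipy.stats import moyal  # Landau-like stand-in; scipy has no exact Landau

rng = np.random.default_rng(1)

# Reference sample: 1000 events with location 50 and scale 1
ref = moyal.rvs(loc=50.0, scale=1.0, size=1000, random_state=rng)

# One "experiment" outcome extracted from the reference sample
x_obs = rng.choice(ref)

# Scan location values (narrowed from the twiki's 10-200 range for speed)
means = np.arange(45.0, 62.0, 0.1)
like = np.zeros_like(means)
for i, m in enumerate(means):
    toys = moyal.rvs(loc=m, scale=1.0, size=1000, random_state=rng)
    width = 0.1 * toys.std()                       # binning: 10% of toy RMS
    edges = np.arange(toys.min(), toys.max() + width, width)
    counts, edges = np.histogram(toys, bins=edges)
    dens = counts / (counts.sum() * width)         # normalized binned PDF
    j = np.searchsorted(edges, x_obs) - 1
    if 0 <= j < len(dens):
        like[i] = dens[j]                          # sampled likelihood point

best = means[np.argmax(like)]                      # maximum of the scan
```

For a location family like this one, the sampled likelihood peaks near the extracted observation; repeating the loop with fresh extractions gives the residual and pull distributions described above.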

The procedure defined above can be applied for any underlying picking PDF in a bin, and "maybe" can also be used if multiple uncorrelated bins are present. For correlations something extra has to be defined, which is beyond the scope of this twiki.

With the procedure defined above a value for the parameter of interest can be identified; however, several validation steps need to be taken to ensure that no bias or unexpected effect enters.

A projection of the PDF in the defined range is shown here, along with the sample from which the events are extracted.

Reference PDF | Reference sample |

Both plots confirm that the distribution meets the expectation and that the sample is derived from the same distribution.

The second step of the procedure described in section 2 (Model and method) requires the generation of toys from Landau distributions with different mean values but with the sigma of the distribution fixed to the same value as in the Reference PDF (sigma = 1). The mean values used for the toys span 10-200, with a step set by the number of scan points (currently 0.19). For each mean value 1000 samples are generated in order to have enough statistics for the binned PDF distribution. The binning is calculated automatically by taking the RMS of the toy sample and dividing it by a constant factor that defines the fraction of the RMS contained in each bin. The final step to obtain the binned PDF distribution is to normalize the toy histogram to unity. This step is rather tricky because the errors need to be estimated correctly: the ratio of a bin content to the total number of entries is a highly correlated quantity. The TEfficiency class from ROOT has been used, where the error estimation is performed automatically with the Clopper-Pearson method. Following the steps described above, the plots below are obtained.
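
The Clopper-Pearson intervals that TEfficiency computes can be reproduced directly from Beta-distribution quantiles; a minimal sketch (the function name and the 68% default are ours):

```python
from scipy.stats import beta

def clopper_pearson(k, n, cl=0.68):
    """Central Clopper-Pearson interval for a binomial efficiency k/n,
    built from Beta-distribution quantiles."""
    alpha = 1.0 - cl
    lo = 0.0 if k == 0 else beta.ppf(alpha / 2.0, k, n - k + 1)
    hi = 1.0 if k == n else beta.ppf(1.0 - alpha / 2.0, k + 1, n - k)
    return lo, hi

# Example: a bin holding 500 of the 1000 toy entries
lo, hi = clopper_pearson(500, 1000)
```

The interval is exact (it never undercovers), which is why it handles the empty and full bins that a naive sqrt(N) error would get wrong.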

Binned PDF distribution of toys | Binned PDF distribution of toys with variable binning |

In the plots it can be seen that the shape of the distribution is not affected, which is a sign that the normalization was correct. Also, the integral of the lower distribution is 1, as expected, leading to the desired binned PDF. The same procedure was applied for a few sets of toys, as can be seen in the following plot.

In the distribution shown above a very rough binning has been chosen as a starting point. The binning choice affects the result because the shape of the PDF is washed out. For the plots shown above 50% of the toy-sample RMS was used. The plots below are the equivalent for 10% of the RMS, which gives a much finer binning and a better representation of the PDF shape.

Binned PDF distribution for a single value | Binned PDF distributions for the same set shown above |

The next step is, for each of those binned PDF distributions, to extract the value of the PDF given the extraction from the Reference sample (which plays the role of the outcome of our experiment). The procedure is the following: from the reference sample a random value is extracted. The extracted value is then used to look up the value of the PDF in the binned distribution, with the bin error taken as the error on the PDF value. Reading off the value from all the binned PDFs in the same way, the following distribution is obtained.

**Note**: A bug was found in the value entering the likelihood: instead of the bin content, the interpolation between the points had been used for the PDF value. This is clearly wrong, so the two new plots below have been added to show how the likelihood distribution looks for a PDF binning of 50% of the RMS.

The big jumps visible in the likelihood plot are due to the very coarse binning chosen for the binned PDF distribution (50% of the RMS). If the binning is chosen finer (10% of the RMS), the result is the following plot, which shows a much smoother representation of the likelihood curve.

The last point in the rebinned distribution, which seems to have zero uncertainty, is due to a bug introduced in the rebinning code. Bins below the mean value have zero content, since the Landau distribution falls rapidly before its mean position. The rebinning code does not account for those bins and therefore does not assign a Poisson uncertainty to them. This is also clearly seen in the following two plots prepared to check the steps entering the calculation.

Performing error estimation with the likelihood graph obtained from evaluating the likelihood in the binned PDF distribution requires fitting the distribution locally around the maximum. An asymmetric Gaussian has been chosen for the fit. The fit procedure can be summarized in the following steps:

- Generate 1000 toys in the whole range of the observable variable (x in 10-200, step 0.019)
- For each toy create a binned PDF distribution with 1000 entries and binning 10% of the RMS
- Extract a single observation value from the reference sample
- For all the binned PDFs obtain the value of the PDF, along with the error, from the bin closest to the extracted value
- Create a binned sample with all the PDF values for the whole range
- Find the maximum of the sample

- Create a binned sample which zooms around the maximum, with a range arbitrarily chosen as a fraction of the sample RMS (in our case 10%)
- Fit the sample initially with a single Gaussian distribution
- Use the single-Gaussian fit result to initialize an Asymmetric Gaussian (AG) fit
- The mean of the fit, together with sigma low and sigma high, are the desired quantities
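
The two-stage fit above can be sketched with scipy's least-squares fitter; the pseudo-data, seed values, and parameter names below are illustrative assumptions, not the twiki's actual code:

```python
import numpy as np
from scipy.optimize import curve_fit

def asym_gauss(x, n, mu, s_lo, s_hi):
    """Gaussian with different widths below and above the mean."""
    s = np.where(x < mu, s_lo, s_hi)
    return n * np.exp(-0.5 * ((x - mu) / s) ** 2)

# Pseudo likelihood-scan points around a maximum (illustrative numbers)
rng = np.random.default_rng(2)
x = np.linspace(45.0, 55.0, 200)
y = asym_gauss(x, 1.0, 50.0, 0.5, 1.8) + rng.normal(0.0, 0.01, x.size)

# Step 1: single-Gaussian fit to get a starting point
sg = lambda x, n, mu, s: asym_gauss(x, n, mu, s, s)
popt_sg, _ = curve_fit(sg, x, y, p0=[y.max(), x[np.argmax(y)], 1.0])

# Step 2: AG fit seeded from the single-Gaussian result
popt_ag, _ = curve_fit(asym_gauss, x, y,
                       p0=[popt_sg[0], popt_sg[1], popt_sg[2], popt_sg[2]])
```

Seeding the AG fit from the symmetric result is what keeps the four-parameter fit from wandering off when the two widths are strongly unequal.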

Configuration:

| Observable range | Points in range | Mean value step | Binned PDF binning | Rebinning cutoff | Maximum range | Single Gaussian starting point (mu, sigma) | AG starting point (mu, sigma, r) | mToys (extractions) |
| 10-200 | 1000 | 0.019 | 10% of RMS | at least 10 events | 10% of toy-sample RMS (~ +/- 5) | (MaxOfToySample, 1) | (SGMean, SGSigma, 2) | 1000 |

The corresponding plots for the steps described above can be seen in the following table of figures:

Full range Likelihood scan | Zoomed likelihood scan around the maximum | Single Gaussian fit on zoomed Likelihood scan | AG fit on zoomed Likelihood scan | Multiple Likelihood scans for different extractions | Fitted Likelihood scans for different extractions | Pulls/Residuals for different extractions |

A set of issues has been identified in the AG fit procedure. One of the major issues was the definition of the fit range. In the plots shown in the previous section, an arbitrary range around the maximum of the data sample was used. This choice is of course prone to bias, and for this reason a recursive approach has been adopted. The first step is to fit a single Gaussian in the full range and then recursively perform fits from the 5-sigma region down to the 1-sigma region. The 2-sigma region defined by the 1-sigma fit is then used as the starting point of the AG fit. For the AG fit to converge to the correct definition of the maximum, another recursive method is used, in which the 2-sigma band defined by the AG is fitted repeatedly until the fit stabilizes. The recursive fitting of the single Gaussian can be seen in the following plot:
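
The recursive range narrowing for the single-Gaussian stage can be sketched as follows; the function name, the window sequence, and the demo data are our assumptions:

```python
import numpy as np
from scipy.optimize import curve_fit

def gauss(x, n, mu, s):
    return n * np.exp(-0.5 * ((x - mu) / s) ** 2)

def recursive_gauss_fit(x, y, sigmas=(5, 4, 3, 2, 1)):
    """Fit in the (near-)full range first, then refit in successively
    tighter n-sigma windows around the previously fitted mean."""
    p = [y.max(), x[np.argmax(y)], (x[-1] - x[0]) / 4.0]
    for ns in sigmas:
        mu, s = p[1], abs(p[2])
        sel = (x > mu - ns * s) & (x < mu + ns * s)
        p, _ = curve_fit(gauss, x[sel], y[sel], p0=p)
    return p

# Demo on a noisy Gaussian bump (all numbers here are illustrative)
rng = np.random.default_rng(7)
xs = np.linspace(40.0, 60.0, 400)
ys = gauss(xs, 1.0, 50.0, 1.2) + rng.normal(0.0, 0.01, xs.size)
fit = recursive_gauss_fit(xs, ys)
```

Each pass inherits the previous parameters as its seed, so the window shrinks toward the maximum instead of being fixed arbitrarily, which is exactly the bias the recursive approach is meant to remove.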

For validating the AG fit on the sampled likelihood scan, an additional step of including the analytical version of the likelihood has been added, illustrated in the following plot. The calculation of the analytical likelihood can be seen in the second plot, where each point represents the value of the PDF for a given extraction. As a following step the TGraph showing the distribution is transformed into a function, which corresponds to the blue curve seen in the AG fits.

For validating the fit procedure the analytical PDF model has been used. The likelihood is calculated using the analytical PDF expression of the Landau, corrected, however, for the binning used in the binned-PDF scenario. For this reason the analytical PDF value at a given x is replaced by the integral of the function over the given bin. The next graphs show the evolution of the PDF from the initial RooFit step all the way to the binning-corrected version.
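
The binning correction amounts to replacing the point value of the PDF at the bin centre by the CDF difference across the bin divided by the bin width. A minimal sketch, again with `scipy.stats.moyal` standing in for the Landau and illustrative bin edges:

```python
import numpy as np
from scipy.stats import moyal  # Landau-like stand-in for the analytical PDF

loc, scale = 50.0, 1.0
edges = np.linspace(45.0, 60.0, 151)               # 0.1-wide bins
centers = 0.5 * (edges[:-1] + edges[1:])

# Point value of the PDF at the bin centre ...
point = moyal.pdf(centers, loc=loc, scale=scale)

# ... versus the bin-averaged value: integral of the PDF over the bin
# (CDF difference) divided by the bin width
binned = np.diff(moyal.cdf(edges, loc=loc, scale=scale)) / np.diff(edges)
```

For fine bins the two agree closely; the correction only matters where the PDF curves strongly across a bin, which is where the toy-based binned PDF also deviates from the point-evaluated analytical one.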

RooFit step | TF1 version | Binned (uniformed binning) | Binned (variable binning) | Binned PDF (from toys) | Binned (analytical) PDF vs Binned (toys) PDF | Binned PDF vs Toy PDF for different PoI values |

TF1 version generated from RooFit PDF (normalized to 1) |

Repeating the procedure shown in the previous graphs for a number of different PoI values, we are able to extract the likelihood distribution for both the toy PDFs and the analytical PDFs, together with the corresponding residual plot between the analytical and the toy likelihood.

A method to evaluate the fitting procedures described above is to produce the residual plots for the different configurations. The configurations used are summarized in the following table:

| Method ID | Model | Binning | Binning scheme | Random extraction scheme | Maximum identification | Number of toys |
| 1 | Analytical PDF | Yes | Constant | Varying | Likelihood sampling | 1000 |
| 2 | Analytical PDF | Yes | Varying | Constant | Likelihood sampling | 1000 |
| 3 | Analytical PDF | No | None | Varying | Analytical model maximum | 1000 |
| 4 | Toy sample PDF | Yes | Constant | Varying | Likelihood sampling | 1000 |
| 5 | Toy sample PDF | Yes | Varying | Constant | Likelihood sampling | 1000 |
| 6 | Toy sample PDF | Yes | Constant | Varying | Fitting likelihood AG fit | 1000 |

The methods described above produce the following residual plots:

Method | 1 | 2 | 3 | 4 | 5 | 6 |

In addition to the residual plots shown for the different configurations, the theoretical calculation of the residual plot for a Gaussian has been performed with Mathematica: https://twiki.cern.ch/twiki/pub/Main/FittingExercise/GaussianMeanValue_Unbinned.nb. The procedure shown for the Gaussian case has additionally been applied to an asymmetric distribution, so that the calculation of the residual expectation value resembles the Landau case more closely: https://twiki.cern.ch/twiki/pub/Main/FittingExercise/GumbelDistribution_Unbinned.nb

Since the Landau distribution has the issue that its mean value is not defined, the validation of the procedure has been switched to the Gumbel distribution. The benefits of using this distribution are multiple, the main two being: 1) it is an asymmetric distribution (not falling under Gaussian assumptions); 2) its mean value can be calculated analytically (also by hand). With the Mathematica notebook linked above we are able to calculate the expectation value of the residuals for a Gumbel distribution with a = 50 and b = 1, which is -0.57721 (the analytical entry in the results table below). Since the theoretical prediction has been calculated, the next step was to repeat some of the configurations noted in the table above, producing the toy-based calculation of the residual expectation value.
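
The quoted expectation value can be cross-checked with a short Monte Carlo. For a pure location family the single-observation likelihood in the location parameter a is f(x - a), maximized when x - a sits at the mode; assuming the left-skewed Gumbel convention (scipy's `gumbel_l`, mode at a, mean a - gamma*b), the fitted location is then the observation itself and the residual has expectation -gamma (the Euler-Mascheroni constant, ~0.57722). This is an idealized unbinned version of the procedure, not the twiki's toy-sampled one:

```python
import numpy as np
from scipy.stats import gumbel_l

# Single-extraction ML fit of the location: a_hat = x, since the Gumbel
# PDF in a peaks when the observation sits at the mode (located at a).
rng = np.random.default_rng(3)
a, b, n_toys = 50.0, 1.0, 5000
x = gumbel_l.rvs(loc=a, scale=b, size=n_toys, random_state=rng)
mean_res = float(np.mean(x - a))   # Monte Carlo residual mean, ~ -0.57722
```

The statistical error on the mean with 5000 toys is about 0.018 (sigma of a unit Gumbel is pi/sqrt(6) ~ 1.28), consistent with the spread of the toy-based entries in the table below.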

Method | 1 | 3 | 4 |

In many situations the choice of the binning of the PDF affected the performance of the asymmetric Gaussian fit around the maximum of the likelihood scan. The main reason for the failure of some of the AG fits was the binning biasing the likelihood in the regions far from the maximum while providing only a few points around the maximum. This left too few points for the fit to converge, leading it to unphysical states. For this reason the decision was made to set aside the variable-binning method with a hard cutoff on the minimum entries, and to replace it with an iterative process that chooses the best binning using the significance calculated for different binning schemes. Allowing variable binning might well lead to a better performance in the likelihood extraction, but since the biggest effect of the binning is in the regions away from the maximum, we leave it aside at this point. The significance plots produced by the iterative process are the following, with the configuration shown in the table:
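
One plausible reading of the iterative binning choice can be sketched as follows; the function, the sqrt(N) per-bin significance, and the thresholds are our assumptions, not the twiki's exact code:

```python
import numpy as np

def choose_binning(sample, n_bins_init, z_min=3.0, frac=0.9, min_bins=8):
    """Halve the number of bins until at least `frac` of them are
    'significant', taking sqrt(N) as the per-bin Poisson significance."""
    n = n_bins_init
    edges = np.histogram_bin_edges(sample, bins=n)
    while n >= min_bins:
        counts, edges = np.histogram(sample, bins=n)
        if (np.sqrt(counts) > z_min).sum() >= frac * n:
            break
        n //= 2
    return edges

# Demo on a toy sample (numbers are illustrative, not the twiki's ~38k bins)
rng = np.random.default_rng(0)
sample = rng.normal(0.0, 1.0, 10000)
edges = choose_binning(sample, 1024)
```

Halving from a fine initial binning mirrors the InitBins/2 loop step quoted in the configuration table; the stopping criterion is what the projection of the binned significance is meant to decide.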

| Range | Initial binning | Loop step | Plots |
| 10-500 | ~38k | InitBins/2. | Significance plots |

To choose the best binning among the iterations, the binned significance is projected onto the y-axis, and the histogram with the most significant entries is chosen as the best binning. The steps can be seen in the multi-plots shown below:

PDF for different binning | Binned significance following PDF binning | Significance projection and entries above threshold plot |

The likelihood scan, along with the fit around the maximum:

Likelihood | Fit steps Gaussian only | Fit step AG |

The following configurations have been applied to our procedure in order to test the bias from each of the effects entering it.

| Type | Toys | Residual mean | Residual sigma | Plot |
| Analytical | - | -0.57721 | - | AnalyticalCalculations |
| Unbinned analytical | 1000 | -0.5398 +/- 0.04046 | 1.279 +/- 0.02861 | UnbinnedAnalyticalMethod |
| Binned analytical | 1000 | -0.5297 +/- 0.041 | 1.297 +/- 0.02899 | BinnedAnalyticalMethod |
| Binned toy sample | 1000 | -0.5109 +/- 0.04355 | 1.377 +/- 0.03079 | BinnedToySampleMethod |
| Binned toy sample with Gaussian fit | 10000 | -0.2423 +/- 0.01302 | 1.302 +/- 0.009206 | BinnedToySampleMethodWithGaussianFit |
| Binned toy sample with AG fit | 1000 | -0.6736 +/- 0.04598 | 1.453 +/- 0.03252 | BinnedToySampleMethodWithAGFit |
| Regeneration of binning * | 100 | -0.648 +/- 0.07889 | 0.7889 +/- 0.05578 | BinningVaryingMethod |

* The last plot takes AGES !!!

In the binned fit with the asymmetric Gaussian there are two events that underflowed in the residual plot. Those fits are being investigated to check why this happened:

The first fit producing this issue is toy 335 out of 1000, with the following configuration and plots:

| Toy for binned PDF | Single Gaussian range | AG fit range | AG iteration cut 1 | AG iteration cut 2 | AG iteration cut 3 | Plot Gaussian step | Plot AG step | Parameter evolution in iterative process |
| 1000 | Up to 2 sigma | 2 sigma range | Fit moving within same bin on both edges | Lower or upper sigma change by 20% of previous step value | 10 fits | | | |

* OUTCOME: solved by checking the initial normalization value. The starting point for the AG fit normalization had a different value from the point where the single Gaussian fit converged.

The next issue to be explored is the fact that the single Gaussian fit produces less biased results on the parameter estimation than the asymmetric Gaussian fit. For this reason a use case has been picked where the residual value obtained from the single Gaussian fit is better than the value from the asymmetric Gaussian. The steps of both fits can be seen in the following plots, along with the residual value and the corresponding fit logs.

| Fit name | Extraction value | Residual value | Fit result (mean, sigmaL, sigmaH) | Fit log | Fit plots | Fit starting point |
| Single Gaussian | 49.227592 | -0.328194 | (49.671806, 1.212885, 1.212885) | SingleGaussianFitLog | SingleGaussianPlots | |
| Asymmetric Gaussian | 49.227592 | -1.273056 | (48.726944, 0.509157, 1.838562) | AsymGaussianFitLog | AsymGaussianPlots | AsymGaussianStartingPoint |

An additional attempt to improve the final asymmetric Gaussian fit is to add a 1-sigma-band fit defined from the last successful 2-sigma asymmetric Gaussian fit. The result can be seen in the following table:

| Fit name | Extraction value | Initial fit step | Fit result | Fit log |
| Asymmetric Gaussian | 49.227592 | Initial fit points plots | Fit result plots | Fit log |

-- IoannisXiotidis - 2020-10-07

Topic revision: r22 - 2020-11-23 - IoannisXiotidis


Copyright © 2008-2021 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
