Background from Data

Maximum Log Likelihood

The idea of this study is to explore our ability to discriminate background photons (from pi0s, etas, etc.) from direct photons, using a Maximum Log Likelihood (MLL) fit to find the linear combination of MC signal and background samples that best matches a given data sample.

For that we have used the following datasets as MC 'reference samples'

PandaIdJob Process pT Dataset AOD
88 (Xabier) single gamma 20 GeV 7040 trig1_misal1_mc12.007040.singlepart_gamma_Et20.recon.AOD.v13003001
91 (Xabier) single gamma 60 GeV 7042 trig1_misal1_mc12.007042.singlepart_gamma_Et60.recon.AOD.v13003001
84 (Xabier) single pi0 20 GeV 7140 trig1_misal1_mc12.007140.singlepart_pi0_Et20.recon.AOD.v13003001
85 (Xabier) single pi0 60 GeV 7142 trig1_misal1_mc12.007142.singlepart_pi0_Et60.recon.AOD.v13003001

and ran the analysis over the output AANs from EventView rel 13.0.40.
Note: although all the results presented from here on are for the 20 GeV samples, the performance is similar for the 60 GeV datasets and the same conclusions are valid.

The variable used so far is fracs1, a shower shape in the shower core: [E(+-3) - E(+-1)]/E(+-1), where E(+-n) is the energy in +-n strips around the strip with the highest energy. The MC distributions for both signal and background photon candidates (20 GeV, after IsEM==0 selection) are shown next.
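
To make the definition concrete, here is a minimal sketch of how fracs1 could be computed from a list of strip energies. The function name, the input layout (a plain list of strip energies) and the truncation of the window at the edge of the list are assumptions made for illustration; this is not the reconstruction code.

```python
def fracs1(strip_energies):
    """Return [E(+-3) - E(+-1)] / E(+-1) around the strip with the highest energy."""
    # Index of the strip with the highest energy.
    i_max = max(range(len(strip_energies)), key=strip_energies.__getitem__)

    def window_sum(half_width):
        # Energy in +-half_width strips around the hottest strip
        # (assumption: the window is truncated at the edges of the list).
        lo = max(0, i_max - half_width)
        hi = min(len(strip_energies), i_max + half_width + 1)
        return sum(strip_energies[lo:hi])

    e1 = window_sum(1)
    e3 = window_sum(3)
    return (e3 - e1) / e1


# Toy strip profiles (made-up numbers): a narrow and a broader shower.
narrow = [0.0, 0.1, 0.2, 1.0, 8.0, 1.0, 0.2, 0.1, 0.0]
wide = [0.0, 0.5, 1.5, 2.5, 4.0, 2.5, 1.5, 0.5, 0.0]
print(fracs1(narrow), fracs1(wide))
```

A broader lateral profile gives a larger fracs1, which is the kind of shape difference the fit relies on.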


The Method

The method of analysis is based on the so-called binned maximum log likelihood. We have implemented a ROOT algorithm which uses TMinuit to minimize -2*ln L, where L is the likelihood function, defined in our case as the product over bins of Poisson probabilities:

$L = \prod_i \frac{e^{-\mu_i}\,\mu_i^{n_i}}{n_i!}$

where

n_i : number of events observed in bin i (see the two alternatives below)
$\mu_i$ = N*(purity*signal_i + (1-purity)*background_i) : number of events expected in bin i, where signal_i and background_i are the events in bin i taken from the pure MC samples.

The method then looks for the two parameters (N and purity) that give the minimum value of -2*ln L.
Note: the factor -2 allows MINUIT to compute errors with the same recipe as for least squares, i.e. by going up from the minimum by 1.
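
As an illustration of the quantity being minimized, the sketch below evaluates -2 ln L for given purity and N and reads off the interval where it rises by 1 above the minimum. It is a simplified stand-in, not the ROOT/TMinuit implementation: it uses a plain grid scan over purity with N held fixed, and all names and the toy histograms are invented for the example.

```python
import math


def minus_two_log_l(purity, norm, data, signal, background):
    """-2 ln L for Poisson-distributed bin contents."""
    total = 0.0
    for n, s, b in zip(data, signal, background):
        mu = norm * (purity * s + (1.0 - purity) * b)
        if mu <= 0.0:
            # Simplification for this sketch: forbid non-positive expectations.
            return float("inf")
        # Minus the log of one Poisson term: mu - n ln(mu) + ln(n!)
        total += mu - n * math.log(mu) + math.lgamma(n + 1.0)
    return 2.0 * total


def scan_purity(data, signal, background, norm, steps=1000):
    """Grid scan over purity in [0, 1]; returns (best, lower, upper)."""
    grid = [k / steps for k in range(steps + 1)]
    values = [minus_two_log_l(p, norm, data, signal, background) for p in grid]
    v_min = min(values)
    p_hat = grid[values.index(v_min)]
    # "Go up from the minimum by 1": points with -2 ln L <= minimum + 1.
    inside = [p for p, v in zip(grid, values) if v <= v_min + 1.0]
    return p_hat, min(inside), max(inside)


# Toy templates (made-up); the data are built with purity 0.3 and N = 2.
toy_signal = [40, 30, 20, 10]
toy_background = [10, 20, 30, 40]
toy_data = [38, 46, 54, 62]
p_hat, p_lo, p_hi = scan_purity(toy_data, toy_signal, toy_background, norm=2.0)
print(p_hat, p_lo, p_hi)
```

A gradient-based minimizer such as MIGRAD finds the same minimum far more efficiently; the scan is used here only because it makes the "up by 1" recipe easy to see.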

To study the capability of the method to extract the purity two different samples have been tried out:

  • an ad hoc combination of pure MC samples and
  • the remaining JF17 dijet sample after offline selection.

Ad hoc combination

In this approach, one half of the sample has been kept as 'truth' reference and the other used to make a new 'data' sample with a given purity. In this way we can study the robustness of the method for different signal & background ratios.
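
The 'Mixed Entries' values quoted in the results table further down are consistent with one simple recipe, sketched here. The recipe is an assumption inferred from those numbers, not something stated on this page: the majority component is taken in full from a pool of 922 events and the minority component is added (rounded down) to reach the requested purity. The function name and the pool size are likewise inferred.

```python
def mixed_sample_sizes(purity, pool=922):
    """Return (n_signal, n_background) for a mix with the requested purity.

    Assumption: both components come from pools of equal size, and the
    majority component is used in full.
    """
    if purity <= 0.5:
        n_bkg = pool
        n_sig = int(purity / (1.0 - purity) * pool)
    else:
        n_sig = pool
        n_bkg = int((1.0 - purity) / purity * pool)
    return n_sig, n_bkg


REFERENCE_ENTRIES = 1712  # entries in each MC reference histogram (from the fit printout)

for k in range(11):
    purity = k / 10
    n_sig, n_bkg = mixed_sample_sizes(purity)
    total = n_sig + n_bkg
    # The 'True N' column is reproduced by total / REFERENCE_ENTRIES.
    print(purity, total, total / REFERENCE_ENTRIES)
```

With this recipe the sample size peaks at purity 0.5 and is symmetric around it, as in the table.
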
Several configurations have been set up (initialization parameters, number of bins, fracs1 range, etc.). First, an example of the output is given, together with the corresponding distributions (for 10 and 50 bins).

*True Purity = 0.4*
True Normalization = 8.97E-001
Signal Entries = 1712    ,      Background Entries = 1712     ,        Mixed Entries = 1536  

    NO.   NAME         VALUE      STEP SIZE      LIMITS
     1 purity       0.00000e+00  1.00000e-02    0.00000e+00  1.00000e+00
     2 N            5.00000e-01  1.00000e-02     no limits
 **    1 **MIGRAD
 FCN=136.874 FROM MIGRAD    STATUS=CONVERGED      53 CALLS          54 TOTAL
                     EDM=2.13794e-08    STRATEGY= 1      ERROR MATRIX ACCURATE
  EXT PARAMETER                                   STEP         FIRST
  NO.   NAME      VALUE            ERROR          SIZE      DERIVATIVE
   1  purity       3.53865e-01   9.98784e-02   1.20858e-03  -2.65085e-05
   2  N            8.97194e-01   2.28924e-02   1.31226e-04  -9.02950e-03
  1.012e-02  2.287e-11
  2.287e-11  5.241e-04
       NO.  GLOBAL      1      2
        1  0.00000   1.000  0.000
        2  0.00000   0.000  1.000

10 bins MLL.MC.p04.10b.gif

50 bins MLL.MC.p04.50b.gif

Next, this analysis was extended to several true input purities. The results are summarized in the following table.

      10 bins 30 bins 50 bins
Mixed Entries True Purity True N Purity MLL Value N MLL Value Purity MLL Value N MLL Value Purity MLL Value N MLL Value
922 0. 5.38E-1 1.35E-18 +- 5.94E-2 5.38E-1 +- 1.77E-2 1.33E-18 +- 5.76E-2 5.38E-1 +- 1.77E-2 1.13E-18 +- 8.02E-2 5.38E-1 +- 1.77E-2
1024 0.1 5.98E-1 1.17E-9 +- 7.33E-1 5.98E-1 +- 1.87E-2 1.18E-9 +- 7.88E-1 5.98E-1 +- 1.87E-2 4.29E-2 +- 1.10E-1 5.98E-1 +- 1.87E-2
1152 0.2 6.73E-1 4.21E-2 +- 1.23E-1 6.73E-1 +- 1.98E-2 7.28E-2 +- 1.11E-1 6.73E-1 +- 1.98E-2 1.35E-1 +- 1.08E-1 6.73E-1 +- 1.98E-2
1317 0.3 7.69E-1 1.32E-1 +- 1.23E-1 7.69E-1 +- 2.12E-2 1.57E-1 +- 1.07E-1 7.69E-1 +- 2.12E-2 2.24E-1 +- 1.05E-1 7.69E-1 +- 2.12E-2
1536 0.4 8.97E-1 2.73E-1 +- 1.19E-1 8.97E-1 +- 2.29E-2 2.91E-1 +- 1.2E-1 8.97E-1 +- 2.29E-2 3.54E-1 +- 1.00E-1 8.97E-1 +- 2.29E-2
1844 0.5 1.07E0 4.41E-1 +- 1.11E-1 1.07E0 +- 2.50E-2 4.48E-1 +- 9.68E-2 1.07E0 +- 2.50E-2 4.84E-1 +- 9.27E-2 1.07E0 +- 2.50E-2
1536 0.6 8.97E-1 5.47E-1 +- 1.23E-1 8.97E-1 +- 2.29E-2 5.55E-1 +- 1.06E-1 8.97E-1 +- 2.29E-2 6.07E-1 +- 1.03E-1 8.97E-1 +- 2.29E-2
1317 0.7 7.69E-1 6.92E-1 +- 1.35E-1 7.69E-1 +- 2.12E-2 6.55E-1 +- 1.14E-1 7.69E-1 +- 2.12E-2 7.03E-1 +- 1.09E-1 7.69E-1 +- 2.12E-2
1152 0.8 6.73E-1 8.09E-1 +- 1.47E-1 6.73E-1 +- 1.98E-2 7.29E-1 +- 1.22E-1 6.73E-1 +- 1.98E-2 8.00E-1 +- 1.16E-1 6.73E-1 +- 1.98E-2
1024 0.9 5.98E-1 9.78E-1 +- 6.96E-1 5.98E-1 +- 1.87E-2 8.69E-1 +- 1.27E-1 5.98E-1 +- 1.87E-2 9.09E-1 +- 1.15E-1 5.98E-1 +- 1.87E-2
922 1.0 5.38E-1 1.00E0 +- 9.88E-2 5.38E-1 +- 1.77E-2 9.99E-1 +- 8.93E-1 5.38E-1 +- 1.77E-2 1.00E0 +- 7.13E-1 5.38E-1 +- 1.77E-2

As can be seen from this table, the purity estimate is, as expected, worst for few bins, where the shapes are not well resolved.
Note: the normalization factor does not depend on the number of bins, as it should be.
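
One way to see why N is insensitive to the binning is the short derivation below. It is a sketch under two assumptions: both MC templates contain the same number of entries (1712 each in the example above), and all events fall inside the histogram range. Here p is the purity, s_i and b_i are the signal and background contents of bin i, and S and B are their sums over all bins.

```latex
-2\ln L = 2\sum_i \left[\mu_i - n_i \ln\mu_i + \ln n_i!\right],
\qquad \mu_i = N\,t_i, \quad t_i = p\,s_i + (1-p)\,b_i

\frac{\partial(-2\ln L)}{\partial N} = 2\sum_i \left[t_i - \frac{n_i}{N}\right] = 0
\;\Longrightarrow\;
\hat{N} = \frac{\sum_i n_i}{\sum_i t_i} = \frac{\sum_i n_i}{p\,S + (1-p)\,B}

S = B \;\Longrightarrow\; \hat{N} = \frac{\sum_i n_i}{S}
```

With S = B the fitted N is just the number of events in the mixed sample divided by the number of entries in a template, whatever the purity or the number of bins. For the example above this gives 1536/1712 = 0.897, the value returned by the fit.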

These results can also be summarized in a plot:


A large error can be seen in a couple of points near both ends of the purity range. Further studies are under way to find out the reason.

Taking into account the independence of the normalization with respect to the binning and the good agreement over the whole purity range, we have also tried to perform the minimization fixing N to its 'true' known value. No improvement was achieved by doing so, though.

Different input parameters have also been tried, without any observable effect on the final estimation, which is a desirable feature of any method. The extreme case, in which both parameters are allowed to vary freely (no limits), is shown in this table (as said before, N does not depend on this, so it is not shown here):

50 bins
True Purity MLL purity
0. -8.81E-2 +- 1.17E-1
0.1 4.30E-2 +- 1.16E-1
0.2 1.35E-1 +- 1.10E-1
0.3 2.25E-1 +- 1.06E-1
0.4 3.54E-1 +- 1.01E-1
0.5 4.84E-1 +- 9.32E-2
0.6 6.07E-1 +- 1.04E-1
0.7 7.03E-1 +- 1.10E-1
0.8 8.00E-1 +- 1.18E-1
0.9 9.09E-1 +- 1.18E-1
1.0 1.00E0 +- 1.15E-1

Apart from the negative estimate at zero true purity (compatible with zero within the error), the method is rather stable. Moreover, in this way we avoid the border effect seen before (at least at the upper end).

Similar performance has been found for 10 and 30 bins over the whole purity range, except at the two lowest values, where the method returns negative purities (compatible with zero within the error, though).

We can summarise this MC analysis in a few points to keep in mind:

  • the bins have to be narrow enough to discriminate between the signal and background distributions. Among the cases studied so far, 50 bins seems to be the best choice.
  • the normalization factor depends neither on the binning nor on the purity range allowed. The purity estimation is exactly the same whether the normalization is fixed or left as a free parameter. Thus either of these configurations could be chosen.
  • by setting no limits on the values the purity can take, we get rid of some border effects, at least at the upper end, with a good estimation of the true value. When no signal is present in the sample (purity=0.) we obtain a negative estimate, although it is compatible with zero within the statistical error.
  • the method seems to be quite independent of the input parameter configuration (within reason!).

JF17 dijet sample

Having studied the reliability of the method on MC ad hoc mixes, the next step is to use a more 'real' background sample. This second approach uses as evaluation sample those reconstructed photons remaining in the JF17 sample after offline selection (/space1/data/J17NTUP-* --> MixedSample.JF17.root). As we have to compare this 'data' against MC distributions at 20 GeV, a cut on the highest-pt photon has been applied (15 GeV < pt < 25 GeV).

In order to compute its 'true' purity, each good offline photon (|eta|<2.5, 15 GeV < pt < 25 GeV, IsEM==0) was matched to a truth object (closest match in a cone of R=0.1) and, when the associated truth object is a photon, classified by mother id.
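
The matching step can be sketched as follows. This is a minimal illustration under assumptions: particles are plain dictionaries with hypothetical keys ("eta", "phi", "mother"), and the cone distance is the usual sqrt(d_eta^2 + d_phi^2) with the phi difference wrapped into [-pi, pi). It is not the EventView code used for the analysis.

```python
import math


def delta_r(eta1, phi1, eta2, phi2):
    """Cone distance, with the phi difference wrapped into [-pi, pi)."""
    d_eta = eta1 - eta2
    d_phi = (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(d_eta, d_phi)


def closest_truth_match(reco, truth_particles, max_dr=0.1):
    """Return the truth particle closest to `reco` within max_dr, or None."""
    best = None
    best_dr = max_dr
    for particle in truth_particles:
        dr = delta_r(reco["eta"], reco["phi"], particle["eta"], particle["phi"])
        if dr < best_dr:
            best, best_dr = particle, dr
    return best


# Toy example (made-up numbers): the first truth photon sits just across
# the phi = +-pi boundary from the reco photon and is the closest match.
reco_photon = {"eta": 0.50, "phi": 3.13}
truth = [
    {"eta": 0.52, "phi": -3.14, "mother": "pi0"},
    {"eta": 0.50, "phi": 2.90, "mother": "DP"},
]
match = closest_truth_match(reco_photon, truth)
print(match)
```

The phi wrapping matters for photons near phi = +-pi, which would otherwise look almost 2*pi apart from their true partner.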

The final fracs1 spectrum is shown in this figure, for all the reco photons found (left) and separated by mother id (right). (No geant particles have been considered.)


True Particle    Matched from...    #entries
gamma            pi0,eta            345
gamma            DP                 53
gamma            q/g line           98
fakes                               20

The photon-fake subsample has the following composition

ph-Fakes composition
e 3
p 2
pi 11
K 1
K^0_s 1
K^0_L 1
total 19

The converted photons (looking now at geant particle level, where the best match is then one of the final electrons) come mainly from pi0 decays, as can be seen in this table:

Conversion Mothers
#entries detailed
pi0,eta 50 pi0(42), eta(8)
DP 8  
q/g line 13 u(4), d(4), s(1), c(2), g(2)
e 9  
others 4 p(2), w(2)
total 84  

Thus how we treat the conversion recovery in the end will have more impact on the background distribution.

Running the MLL algorithm on it in blind mode (i.e. with exactly the same configuration as before), we got:

*50 bins, free varying configuration space*
Signal Entries = 1712
Background Entries = 1712
Mixed Entries = 518

    NO.   NAME         VALUE      STEP SIZE      LIMITS
     1 purity       0.00000e+00  1.00000e-02     no limits
     2 N            5.00000e-01  1.00000e-02     no limits
 **    1 **MIGRAD
 FCN=109.642 FROM MIGRAD    STATUS=CONVERGED      44 CALLS          45 TOTAL
                     EDM=4.99867e-09    STRATEGY= 1      ERROR MATRIX ACCURATE
  EXT PARAMETER                                   STEP         FIRST
  NO.   NAME      VALUE            ERROR          SIZE      DERIVATIVE
   1  purity       9.91902e-02   1.71884e-01   8.82752e-04  -5.61509e-04
   2  N            3.01402e-01   1.32685e-02   6.81398e-05   1.96871e-03
  2.954e-02 -2.458e-12
 -2.458e-12  1.761e-04
       NO.  GLOBAL      1      2
        1  0.00000   1.000 -0.000
        2  0.00000  -0.000  1.000

Note: as we don't have to mix the MC photons here, we have used the whole MC sample as reference.


So the MLL parameters we would have to compare with the true ones are

  • MLL_purity = 9.92E-02 +- 1.72E-01
  • MLL_N = 3.01E-01 +- 1.33E-02

On the one hand, the true normalization factor in our case was truthN = 3.02E-01, in very good agreement with the MLL value.
On the other hand, the puzzling thing is how to define our true purity. We can think of at least two options:

  1. signal == DP+q/g-line photons: truepurity = #(DP+q/g) / #total photons = (53+98)/518 = 2.91E-1 +- 2.4E-2
  2. signal == only DP photons: truepurity = #DP / #total photons = 53/518 = 1.02E-1 +- 1.40E-2
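
The two candidate true purities can be reproduced with a few lines. The error formula used here, sqrt(k)/n (Poisson error on the numerator only), is an assumption: it is the choice that reproduces the quoted errors. The fitted purity and its error are copied from the MLL result above.

```python
import math


def fraction_with_error(k, n):
    """Return k/n and sqrt(k)/n (Poisson error on the numerator only)."""
    return k / n, math.sqrt(k) / n


N_TOTAL = 518
purity_dp_qg, err_dp_qg = fraction_with_error(53 + 98, N_TOTAL)
purity_dp, err_dp = fraction_with_error(53, N_TOTAL)

# MLL result quoted above.
FIT_PURITY, FIT_ERROR = 9.92e-2, 1.72e-1

options = [
    ("DP + q/g line", purity_dp_qg, err_dp_qg),
    ("DP only", purity_dp, err_dp),
]
for label, value, err in options:
    # Distance between fit and truth in units of the fit error.
    distance = (FIT_PURITY - value) / FIT_ERROR
    print(f"{label}: true purity = {value:.3f} +- {err:.3f}, "
          f"(fit - true)/fit error = {distance:+.2f}")
```

The fitted value is much closer to the DP-only definition, although with the present fit error the DP + q/g-line definition is only about one standard deviation away.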

It can be seen that the agreement is rather good in the second case, as we could have expected. Since our MC signal consists of monochromatic photons, the method takes as 'signal' only those photons that closely resemble them (i.e. the DP photons in the sample, as they are quite isolated).
To check this, the effect of isolation at calocluster level for photons from different sources has been studied in detail here.

Finally, assuming we want only the amount of DP photons in our 'data', the MLL can extract it within an error of ~10-15%. We foresee a further reduction of this percentage, since the method depends on the MC statistics available.

  • Drawbacks??: This method (like almost every other) relies on our MC description of the shower shapes. The tuning of the Monte Carlo will be done as soon as enough data is gathered in the first period of LHC running, so this analysis should then be quite efficient for S/B discrimination.

  • Et 60 GeV: The same analysis has been performed on the 60 GeV sample. The main results are shown below.

MC performance

      50 bins, free varying config space
Mixed Entries True Purity True N Purity MLL Value N MLL Value
368 0. 5.38E-1 -1.41E-1 +- 1.3E-1 5.40E-1 +- 2.82E-2
408 0.1 5.98E-1 3.39E-2 +- 1.14E-2 5.99E-1 +- 2.97E-2
460 0.2 6.73E-1 9.63E-2 +- 1.13E-1 6.75E-1 +- 3.15E-2
525 0.3 7.69E-1 1.59E-1 +- 1.08E-1 7.70E-1 +- 3.36E-2
613 0.4 8.97E-1 2.83E-1 +- 9.84E-2 9.0E-1 +- 3.63E-2
736 0.5 1.07E0 3.65E-1 +- 8.82E-2 1.08E0 +- 3.98E-2
613 0.6 8.97E-1 4.8E-1 +- 9.56E-2 9.0E-1 +- 9.56E-2
525 0.7 7.69E-1 5.28E-1 +- 1.02E-1 7.71E-1 +- 3.36E-2
460 0.8 6.73E-1 6.86E-1 +- 1.04E-1 6.73E-1 +- 3.14E-2
408 0.9 5.98E-1 7.34E-1 +- 1.08E-1 5.99E-1 +- 2.96E-2
368 1.0 5.38E-1 8.26E-1 +- 9.99E-2 5.40E-1 +- 2.82E-2

The lack of statistics is clearly an issue here; however, the estimation of N remains insensitive to it.

Further plans TODO

  • Estimate errors
We have studied the effect of statistics on the purity error. For that we have used one half of the sample as control sample (1317 events) and taken different subsamples from the other half. The purity space was restricted to [0,1] and the normalization was left as a free parameter. The next plot (and the table below) shows the purity error as computed by the MLL method vs the number of events in the mixed MC sample. Sqrt(N) is shown in red, to compare the statistical error with the rest at a given N.


The "weird" points (in red below) are now being looked at more carefully to understand their odd behaviour.

N=500 N=700 N=900 N=1317
0.06 0.08 0.05 0.06
0.14 0.66 0.65 0.08
0.18 0.15 0.12 0.09
0.15 0.13 0.13 0.1
0.16 0.15 0.13 0.1
0.17 0.15 0.14 0.11
0.19 0.16 0.14 0.12
0.19 0.16 0.14 0.12
0.18 0.16 0.14 0.12
0.18 0.69 0.18 0.17
0.16 0.07 0.05 0.04

Open question: how is the error affected at the low and high ends? Is the limited parameter range somehow playing a role here? (i.e. the error is smaller at purity 0 and 1, which are the limits of the allowed parameter space).

  • Extend the method to multiple variables
  • Pt dependency: studies of performance in different pt bins (provided enough statistics is available)

-- MartinTripiana - 28 Aug 2008

Topic attachments
I Attachment History Action Size Date Who Comment
GIFgif Et20.fracs1.IsEM.gif r1 manage 9.3 K 2008-08-28 - 11:45 MartinTripiana  
GIFgif JF17.fracs1.IsEM.convrecov.gif r1 manage 11.1 K 2008-09-03 - 13:33 MartinTripiana fracs1 distribution from selected JF17 photons. all(left), by mother(right). Conversion recovered.
GIFgif JF17.fracs1.IsEM.gif r1 manage 10.8 K 2008-08-29 - 11:03 MartinTripiana fracs1 distribution from selected JF17 photons. all(left), by mother(right)
GIFgif JF17.fracs1.IsEM.nogeant.gif r1 manage 10.8 K 2008-09-03 - 15:27 MartinTripiana fracs1 distribution from selected JF17 photons. all(left), by mother(right) . No geant particles considered
GIFgif MLL.JF17.50b.gif r1 manage 9.8 K 2008-08-29 - 12:24 MartinTripiana  
GIFgif MLL.JF17.50b.nogeant.gif r1 manage 9.8 K 2008-09-03 - 15:28 MartinTripiana output ProfileMethod.MLL.cxx on JF17 dijets. No geant particles.
GIFgif MLL.MC.all2gether.gif r2 r1 manage 8.7 K 2008-09-02 - 21:37 MartinTripiana  
GIFgif MLL.MC.p04.10b.gif r1 manage 9.2 K 2008-08-28 - 17:47 MartinTripiana output ProfileMethod.MLL.cxx
GIFgif MLL.MC.p04.50b.gif r1 manage 9.9 K 2008-08-28 - 17:48 MartinTripiana output ProfileMethod.MLL.cxx
GIFgif MLL.errorEvolution.gif r1 manage 9.2 K 2008-09-15 - 20:20 MartinTripiana  
Topic revision: r18 - 2008-09-15 - MartinTripiana