Decorrelated Adversarial Neural Network

In this project the decorrelation is achieved by introducing a second network, which tries to estimate the mass of an event from the output of the classifier and the $p_T$ of the event. The loss function of the combined net, which we will call the adversarial net from now on, is given by

\[\mathcal{L}=L_{clf}-\lambda L_{reg}\]

where $L_{reg}$ is the loss function of the network that tries to estimate the mass and $L_{clf}$ is the loss function of the classifier.

The classifier uses cross entropy as its loss function, while the regression uses the mean squared error (MSE).

The training proceeds in three phases. For the first 50 epochs (a fixed number, as of now) only the classifier is trained; from epoch 50 to 100 only the regression is trained; after epoch 100 both nets are trained simultaneously.

The decorrelated neural network can be used by setting the parameter 'massless' to 'adversarial'.

The following hyperparameter can then be set:

'massless_importance': This is the factor $\lambda$ in the combined loss function
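
The training schedule and the combined loss described above can be sketched in a few lines of Python. This is only an illustration and not the code used in the project: the function names `trained_networks` and `combined_loss` are hypothetical, and counting epochs from zero is an assumption.

```python
def trained_networks(epoch, clf_only_epochs=50, reg_only_epochs=50):
    """Return which networks are updated in a given epoch (epochs counted from 0, an assumption)."""
    if epoch < clf_only_epochs:
        return ("classifier",)                 # phase 1: classifier only
    if epoch < clf_only_epochs + reg_only_epochs:
        return ("regression",)                 # phase 2: mass regression only
    return ("classifier", "regression")        # phase 3: both nets together


def combined_loss(l_clf, l_reg, massless_importance):
    """Combined loss L = L_clf - lambda * L_reg, with lambda = massless_importance."""
    return l_clf - massless_importance * l_reg
```

For example, `trained_networks(75)` gives `("regression",)`, and a larger 'massless_importance' gives the regression loss more weight relative to the classifier loss.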

Chi Square Distance for the Histograms

To quantify the sculpting, a chi-square distance between histograms is used.

The two histograms to be compared are the background without any cut on the DNN variable and the background after a cut on the DNN score is applied. The distance is averaged over 4 quantiles of the DNN score.
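
A minimal sketch of such a measure is given below. The exact definition used in the project is not reproduced on this page, so the symmetric form $\frac{1}{2}\sum_i (a_i-b_i)^2/(a_i+b_i)$ on unit-normalised histograms is an assumption, as are the function names and the choice to return 0 for an empty histogram.

```python
import numpy as np


def chi_square_distance(hist_ref, hist_cut):
    """Symmetric chi-square distance between two histograms (assumed definition)."""
    ref = np.asarray(hist_ref, dtype=float)
    cut = np.asarray(hist_cut, dtype=float)
    if ref.sum() == 0 or cut.sum() == 0:
        return 0.0                              # an empty histogram gives distance 0
    ref = ref / ref.sum()                       # compare shapes, not yields
    cut = cut / cut.sum()
    denom = ref + cut
    mask = denom > 0                            # skip bins that are empty in both
    return 0.5 * float(np.sum((ref[mask] - cut[mask]) ** 2 / denom[mask]))


def mean_chi_square(mass, score, bins, n_quantiles=4):
    """Average the distance over quantiles of the DNN score."""
    mass = np.asarray(mass, dtype=float)
    score = np.asarray(score, dtype=float)
    hist_ref, edges = np.histogram(mass, bins=bins)      # background without cut
    q_edges = np.quantile(score, np.linspace(0.0, 1.0, n_quantiles + 1))
    distances = []
    for i in range(n_quantiles):
        lo, hi = q_edges[i], q_edges[i + 1]
        if i == n_quantiles - 1:
            selected = (score >= lo) & (score <= hi)     # include the maximum score
        else:
            selected = (score >= lo) & (score < hi)
        hist_cut, _ = np.histogram(mass[selected], bins=edges)
        distances.append(chi_square_distance(hist_ref, hist_cut))
    return float(np.mean(distances))
```

With this definition the distance lies between 0 (identical shapes) and 1 (no overlap at all).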

In the following, two extreme cases can be seen:

[Figures: chi-square is 0 (left), chisquarehigh.png (right)]
On the left, $\chi^2$ is equal to 0 because the plot contains no entries after the cut on a high DNN score.
On the right, $\chi^2$ is high and the sculpting is clearly visible.

However, we want to find a balance between the two extreme cases. One example of this can be seen in the following figure:

$\chi^2$ is greater than zero but still close to zero, which is what we would like to see.

Kolmogorov Distance

Another distance measure was also used, namely the Kolmogorov distance (for more see here). It is bounded between 0 and 1, with higher values meaning more compatibility.

However, it is not as discriminating as the $\chi^2$; the reason can be seen in the following plots. For two cases with a Kolmogorov distance of 0, a big difference in the sculpting can be seen:


Issues with the current implementation of the adversarial neural network

Currently there is one major issue with the adversarial implementation: the network mostly converges to one of two extreme cases.

Either it does not decorrelate at all, or it gives every event the same score and thereby removes any information about the mass from the output. These can be seen in the following figures:

On the left there is a heatmap of the significance and on the right one of the $\chi^2$.

Another problem is a conceptual one. If we maximize the loss of the regression network, why shouldn't the regression network just estimate values that are far outside the mass range?

This is what happens if we choose to simply minimize the loss $L=L_{clf}-\lambda L_{reg}$:

[Figure: div loss adv.png]
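
The conceptual problem can be illustrated with a toy example. The sketch below is not the project's training code: it replaces the regression network with a single prediction value and applies plain gradient descent to $-\lambda L_{reg}$ with an MSE loss. The function name and all default values are hypothetical.

```python
def descend_on_negative_mse(prediction, mass, lam=1.0, learning_rate=0.1, steps=20):
    """Gradient descent on -lam * (prediction - mass)**2 for one toy prediction.

    Returns |prediction - mass| before the first step and after every step.
    """
    history = [abs(prediction - mass)]
    for _ in range(steps):
        # derivative of -lam * (p - m)^2 with respect to p
        grad = -2.0 * lam * (prediction - mass)
        prediction = prediction - learning_rate * grad
        history.append(abs(prediction - mass))
    return history
```

Each step multiplies the distance to the true mass by $1+2\lambda\cdot\text{learning rate}$, so the prediction runs away from the mass range and the loss is unbounded from below in this toy setup.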

Decorrelation by DisCo

Since it is not easy to get the adversarial network to converge, another method was investigated, namely distance correlation (DisCo). The distance correlation between the output variable and the mass, multiplied by a hyperparameter, is added to the loss function. DisCo is a lot easier to implement and gives promising results:

[Figure: best run quantile top.png]
Decorrelation with DisCo; little sculpting is observable in the highest DNN quantile.

The significance also did not drop dramatically:

[Figure: best run sig.png]
Decorrelation with DisCo
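
For illustration, the DisCo penalty can be sketched with NumPy as below. This uses the standard (biased) sample distance correlation and is not the implementation used in the project, which would need a differentiable version inside the training framework. The function names are hypothetical, and some DisCo implementations penalise the squared distance correlation instead.

```python
import numpy as np


def distance_correlation(x, y):
    """Sample distance correlation between two 1D arrays (biased estimator)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a = np.abs(x[:, None] - x[None, :])         # pairwise distances in x
    b = np.abs(y[:, None] - y[None, :])         # pairwise distances in y
    # double centring: subtract row and column means, add the grand mean
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()
    dcov2_xy = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    if denom == 0:
        return 0.0                              # one of the inputs is constant
    return float(np.sqrt(max(dcov2_xy, 0.0) / denom))


def disco_loss(l_clf, score, mass, massless_importance):
    """Classifier loss plus the DisCo penalty between DNN score and mass."""
    return l_clf + massless_importance * distance_correlation(score, mass)
```

The distance correlation is 0 only for independent variables and 1 for a linear dependence, so it also penalises non-linear dependence of the score on the mass.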

Scanned Hyperparameters

A grid search was conducted with the following parameters:

'massless_importance': over the range 1 to 100

'nEpochs': [50,200,400,600]

The result of this grid search can be seen here, with a heatmap of the significance on the left and one of $\log(\chi^2)$ on the right:


To use decorrelation with distance correlation, the parameter 'massless' has to be set to 'disco' and a 'massless_importance' has to be chosen.

The importance was also scanned for different network configurations; one can see that it saturates at 100:


Comparing Decorrelation with $p_T$

The effect of adding an additional decorrelation term, namely the DisCo term between the output and the variable $p_T$, has been investigated.

On the left there is a grid search with decorrelation with respect to the mass only; on the right a term with $p_T$ has been added. The upper row shows the significances and the bottom row the $\chi^2$.



For the best run of each, namely (importance=120, epochs=600) for mass-only decorrelation and (importance=100, epochs=600) with $p_T$ decorrelation, we get a significance of 1.89 in both cases.

This holds when we require the diboson peak to be visible. Otherwise there is a slightly better run when we decorrelate only the mass, but there the diboson peak cannot be recognized. However, the diboson peak differs from the Higgs peak in only one feature, namely the mass, which means that such a network is not completely mass decorrelated.


DisCo Adversarial

Since the biggest problem was getting the adversarial net to converge, another idea was implemented: a network that is first decorrelated for n epochs using the distance correlation and is afterwards trained adversarially against a mass regression. A grid search was conducted over the importances of both decorrelation methods. The results can be seen in the following:

On the left is the significance scan and on the right the $\log(\chi^2)$ scan. Note that the cases with a $\chi^2$ bigger than 10 were assigned a significance of 1.5 to avoid confusion, since they are clearly sculpting the mass.


On the left is the run in which the diboson peak is visible and which has the maximum significance; it uses an importance of 0.001 and a DisCo importance of 60, resulting in a significance of 1.91. For the run on the right, a high importance as well as a high DisCo importance was used, resulting in a significance of 1.71.


On the left is the score plot from 0 to 1. It can easily be misinterpreted: further investigation shows that the signal and background have different distributions, but only at low values of the DNN score. This is due to incorporating the adversarial loss. When softmax is used as usual in a classifier, the score represents the probability of belonging to one of the classes; this is no longer the case when an additional loss term, as in the adversarial training, is used.


Disco Adversarial with different learnrates

It has been recommended to train both networks with two different learning rates, the one of the regression being larger. This has been implemented, and about 500 random points in a 4-dimensional hyperparameter space have been tested. The results are inconclusive. The maximum significance reached was 1.85. The best run has a significance of 1.82, and the following mass plot is observed for it. The configuration for this run was: imp_1_epochs_800_discoepochs_0.7_learnrate_0.01-0.0005-0.1-0.001_discoimp_180


Final Comparison of all three methods

In one final grid search all methods of decorrelation were compared. When using only the adversarial decorrelation or only the DisCo decorrelation, just the hyperparameter $\lambda$ was scanned, together with different learning-rate configurations. For the combined method (first decorrelation with DisCo, then an adversarial net), $\lambda$ and $\lambda_{disco}$ were scanned, as well as the learning-rate configuration and the ratio between DisCo epochs and adversarial epochs (1 = only DisCo).

Just adversarial decorrelation


As expected, the $\chi^2$ as well as the significance drops with increasing $\lambda$. On the right are the mass plots with the highest significance (significances = 1.7, 1.69) in which we do not see significant sculpting. Note that the $\chi^2$ in the left plot is averaged over all quantiles, whereas in the right plot it is just the $\chi^2$ of the highest DNN score quantile. In general, every run with an averaged $\chi^2$ bigger than 5 shows significant sculpting. Most importantly, the diboson peak cannot be observed in any of the runs with only adversarial decorrelation.

Also note that in this training the regression network and the classifier network had different learning rates, with the classifier learning rate smaller than the regression learning rate by a factor of $10^3$.

Just disco decorrelation


For the DisCo decorrelation we see that the significance is distributed over a narrower range; this is also because a narrower range was scanned for the hyperparameter $\lambda$. The plots on the right again show the mass distribution. The first one corresponds to a significance of 1.99, but we can recognize minor sculpting; the right one is the first in which we do not see such an increase with mass, and it yields a significance of 1.94.

Note that the diboson peak cannot be seen in any of the mass plots, although it could be seen in previous scans. This might be because in previous scans around 400 grid points were scanned, of which only a few configurations showed the diboson peak. In this scan only 50 points were used.

Disco adversarial decorrelation


The plot on the left shows a heatmap of the $\chi^2$ and significance for ($\lambda$, $\lambda_{disco}$), the middle one for ($\lambda$, amount of DisCo epochs). Note that the maximal significance was always chosen for a given ($\lambda$, $\lambda_{disco}$, amount of DisCo) configuration, respectively ($\lambda$, amount of DisCo, $\lambda_{disco}$) in the middle plot. On the left plot we can recognize the diboson peak; it corresponds to a significance of 1.87.

Note that the diboson peak can be seen in nine of the top twelve grid points ranked by significance, where we exclude the cases with clear sculpting.

There is a presentation giving a quick overview of the steps that have been conducted until now: presentation_tfVHbb.pdf

Importance of the diboson peak

The diboson peak leaves the same trace in the detector as the Higgs signal if we exclude the invariant mass as a feature. This means a completely mass-decorrelated network should not assign low scores to events coming from the diboson contribution. Therefore an important step in the studies was to check whether we can observe the diboson peak in the highest signal quantile. A sanity test was also applied in every case where we see the diboson peak: if we remove the diboson contribution from the sample, the peak needs to disappear.


On the left we can see the evaluation on the test set including the diboson contribution, in the middle without it. Note that the excess of events around 90 GeV disappears. On the right is the distribution of scores for the diboson sample and the signal sample. Note that, as expected, there are few events where the net assigns low scores to diboson events.

All of these nets were trained with the diboson contribution in the sample. Another option would be to train without the diboson sample and check whether the neural net assigns the same scores to diboson events as to signal. The idea is that, since the classifier has never seen background events that leave exactly the same trace in the detector as the signal, it should classify all of them as signal. However, the resulting classifier would then carry a slight bias, since it was not trained on a sample that is distributed like the data we will finally feed it.

Evaluation of the neural nets

Note: All of the previous studies, as well as the following ones, have been conducted on the SR_medhigh_Znn region unless specified otherwise.

The evaluation of the neural nets was done with the Xbb framework. One problem was that the TensorFlow version used for the previous studies was V1.13, whereas the CMSSW/Xbb framework only runs TensorFlow V1.5.0; unfortunately, models from TensorFlow versions >V1.12 are incompatible with TensorFlow V1.5.0. Therefore the models had to be trained again, using the optimal configuration from previous runs. However, no large random search over all hyperparameters was conducted, in order to move forward and build the machinery for the 2D fit. For the configuration used, a significance of 1.891 was obtained. This will surely not be the configuration used in the final investigation once the machinery is complete.

Evaluation of the networks is done by using the following command:

./ -T Zvv2017 -F eval-tf_v1 -J run --addCollections Eval.SR_medhigh_Znn --set='Directories.samplefiles:=<!Directories|samplefiles_split!>' --input SYSout --output EVALout -i --set='systematics.systematics=Nominal' --force

There are a few things to mention:

Prior to this training.ini has to be edited to contain the following block for the --addCollections parameter:

MVAtype = <!MVAGeneral|type!>
signals = [<!Plot_general|allSIG!>]
backgrounds = [<!Plot_general|allBKG!>]
treeVarSet = ZvvBDTVarsWP
branchName = DNN
checkpoint = /work/kaechb/classifier/results/tfZllDNN/Zvv2017_SR_medhigh_Znn_200310_V11-Dec9.h5/adversarial_imp_0.1_epochs_800_discoepochs_0.75_learnrate_0.05-0.01-0.005-0.0001_discoimp_160/checkpoints
signalIndex = 0
bins = [0.0000, 0.0787, 0.1353, 0.2000, 0.2726, 0.3515, 0.4343, 0.5256, 0.6276, 0.7186, 0.7919, 0.8477, 0.8913, 0.9237, 0.9495, 1.0001]

In general.ini this line has to be included for the region to evaluate:

SR_medhigh_Znn = tensorflowEvaluator_fromCheckpoint.tensorflowEvaluator(mvaName='SR_medhigh_Znn',condition='hJidx[0]>-1&&hJidx[1]>-1')


--set='Directories.samplefiles:=<!Directories|samplefiles_split!>' -> this makes use of the individual samples and not the merged one

--input SYSout -> this just sets the input folder (SYSout in paths.ini)

--output EVALout-> this just sets the output folder (EVALout in paths.ini)

-i -> this uses the interactive command for the

--set='systematics.systematics=Nominal' -> this excludes the use of systematics

--force -> this overwrites the output files if they already exist

Two Dimensional Fit

1) Add all Control/Signal regions to training.ini

2) ./ -T Zvv2017 -F cachetraining-v1 -J cachetraining --set='Weights.useSpecialWeight:=False' -i

3) ./ -T Zvv2017 -J export_h5 -i --force

4) ..?

5) Profit

Scale Factors

Since our simulation is only a simulation and we cannot perfectly simulate nature, we have to apply some corrections to our MC data. These are the so-called scale factors. For every major background we look at a region that is dominated by that background. The $p_T$ distribution of the backgrounds can be seen in the following:


The data was compared to MC in one bin:


Out of pure interest, a check on whether the scale factor depends on m was done as well:


In that region we compare data and our simulation and fit the scale factor with a maximum likelihood fit, where the likelihood is given by:


For numerical reasons we do not maximize this term but rather minimize the negative log likelihood (nLL). We then fit the scale factors and get the uncertainty from the points where the nLL rises by 0.5 above its minimum, assuming a Gaussian distribution of the likelihood.
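
A minimal sketch of such a fit is given below. The likelihood formula itself did not survive on this page, so a single-bin Poisson likelihood is assumed here; the function names, the grid scan in place of a proper minimiser, and all numbers are hypothetical.

```python
import numpy as np


def poisson_nll(sf, n_obs, n_bkg, n_rest=0.0):
    """Negative log likelihood of one Poisson bin, up to the constant log(n_obs!).

    Expected yield = sf * n_bkg + n_rest, where n_bkg is the MC yield of the
    dominant background and n_rest the yield of everything else.
    """
    expected = sf * n_bkg + n_rest
    return expected - n_obs * np.log(expected)


def fit_scale_factor(n_obs, n_bkg, n_rest=0.0, sf_grid=None):
    """Scan the nLL on a grid; return the best fit and the nLL_min + 0.5 interval."""
    if sf_grid is None:
        sf_grid = np.linspace(0.5, 1.5, 10001)
    nll = poisson_nll(sf_grid, n_obs, n_bkg, n_rest)
    best = int(np.argmin(nll))
    inside = sf_grid[nll <= nll[best] + 0.5]    # one-sigma region for a Gaussian likelihood
    return float(sf_grid[best]), float(inside.min()), float(inside.max())
```

For example, `fit_scale_factor(1100, 1000)` gives a scale factor close to 1.1 with an uncertainty of roughly 0.03, as expected from $\sqrt{1100}/1000$.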


1d Histograms of Data/MC agreement in Signal Region

To check the MC/data agreement in the signal region, histograms were made. Note that the histograms were blinded at high DNN values and around the Higgs peak.


For better visibility of the Higgs contribution a log scale was also employed:


2d Histograms

The decorrelated neural net was applied to the signal region, binned in $m_{jj}$ and the score. The number of bins was scanned from 30x30 to 300x300.

Histograms were made for inclusive signal and background, signal only, background only, Zlf, Zhf, tt, and all other backgrounds only. These can be seen in the following:


Note that all of these histograms have the sample_weight of the ntuples applied.
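
A weighted two-dimensional histogram of this kind can be sketched with `numpy.histogram2d`. The function name and the $m_{jj}$ range below are hypothetical placeholders, not the values used in the analysis.

```python
import numpy as np


def weighted_2d_histogram(mjj, score, sample_weight, n_bins=30, mjj_range=(0.0, 250.0)):
    """Histogram events in (m_jj, DNN score) with per-event weights applied."""
    counts, mjj_edges, score_edges = np.histogram2d(
        mjj,
        score,
        bins=n_bins,                            # same number of bins in both dimensions
        range=[mjj_range, (0.0, 1.0)],          # score is assumed to lie in [0, 1]
        weights=sample_weight,
    )
    return counts, mjj_edges, score_edges
```

Calling it once per process (signal, Zlf, Zhf, tt, other) with the same binning gives per-bin yields that can be summed or fitted bin by bin.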

Fit signal strength

For the signal strength we apply a similar fit; in this case we bin in two variables, namely the mass and the score. We again maximize the likelihood, but with some extra steps: we go through the two-dimensional bins in $m_{jj}$ and DNN score, and we apply the scale factors to the MC by multiplying the yield of each process in each bin by its scale factor.

The likelihood then looks like this:

$\prod_i L(\mu|n_{i,obs},n_{i,s}+n_{i,tt}sf_{tt}+n_{i,Zlf}sf_{Zlf}+n_{i,Zhf}sf_{Zhf}+n_{other})=\prod_i(\mu n_{i,s}+n_{i,tt}sf_{tt}+n_{i,Zlf}sf_{Zlf}+n_{i,Zhf}sf_{Zhf}+n_{other})^{n_{i,obs}}\frac{e^{-(\mu n_{i,s}+n_{i,tt}sf_{tt}+n_{i,Zlf}sf_{Zlf}+n_{i,Zhf}sf_{Zhf}+n_{other})}}{n_{i,obs}!}$
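
The binned likelihood above can be turned into a small numerical sketch. This is a toy illustration under stated assumptions, not the fit code of the analysis: the function names are hypothetical, the minimum is found by a grid scan in $\mu$, and the scale factors are kept fixed.

```python
import numpy as np


def expected_yield(mu, n_s, backgrounds, scale_factors, n_other):
    """Per-bin expectation: mu * n_s + sum_b sf_b * n_b + n_other."""
    total = mu * n_s + n_other
    for name, yields in backgrounds.items():
        total = total + scale_factors[name] * yields
    return total


def nll(mu, n_obs, n_s, backgrounds, scale_factors, n_other):
    """Poisson negative log likelihood summed over bins, without the log(n_obs!) term."""
    nu = expected_yield(mu, n_s, backgrounds, scale_factors, n_other)
    return float(np.sum(nu - n_obs * np.log(nu)))


def fit_mu(n_obs, n_s, backgrounds, scale_factors, n_other, mu_grid=None):
    """Scan mu on a grid; return the best fit and the nLL_min + 0.5 interval."""
    if mu_grid is None:
        mu_grid = np.linspace(0.0, 3.0, 3001)
    values = np.array(
        [nll(mu, n_obs, n_s, backgrounds, scale_factors, n_other) for mu in mu_grid]
    )
    best = int(np.argmin(values))
    inside = mu_grid[values <= values[best] + 0.5]
    return float(mu_grid[best]), float(inside.min()), float(inside.max())
```

If the observed counts are set equal to the expectation at $\mu=1$, the scan returns a best fit of 1 and the width of the interval gives the expected uncertainty on $\mu$ for the toy yields.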

Note that currently we only perform an Asimov fit, i.e. $n_{i,obs}=n_{i,s}+n_{i,tt}sf_{tt}+n_{i,Zlf}sf_{Zlf}+n_{i,Zhf}sf_{Zhf}+n_{other}$. This gives us the expected sensitivity of our analysis. We obtain the following values (with 30 bins in each dimension):


BennoKach - 2020-03-23

Topic revision: r20 - 2020-06-27 - unknown