In this project the decorrelation was achieved by introducing a second network
which tries to estimate the mass of an event based on the output of the classifier.
The loss function of the combined net,
which we will call the adversarial net from now on, is given by
%$L = L_{clf} - \lambda \, L_{reg}$%,
where %$L_{reg}$% is the loss function of the neural net
that tries to estimate the mass and %$L_{clf}$% the one for the classifier.
The classifier uses cross entropy as its loss function while the regression uses mean squared error (MSE).
At first (as of now) only the classifier is trained, for a fixed 50 epochs;
from epoch 50 to 100 only the regression is trained; after epoch 100 both nets are trained simultaneously.
The decorrelated neural network can be used by setting the parameter 'massless' to 'adversarial'.
The following hyperparameters can then be set:
'massless_importance': this is the %$\lambda$% term in the combined loss function
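The combined loss and the three training phases described above can be sketched as follows; the function names are illustrative, not the actual implementation:

```python
import numpy as np

def cross_entropy(y_true, p):
    """Binary cross entropy, the classifier loss."""
    p = np.clip(p, 1e-7, 1.0 - 1e-7)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def mse(m_true, m_pred):
    """Mean squared error, the loss of the mass-regression (adversary) net."""
    return float(np.mean((np.asarray(m_true) - np.asarray(m_pred)) ** 2))

def combined_loss(y_true, p, m_true, m_pred, massless_importance):
    """L = L_clf - lambda * L_reg: the classifier is rewarded when the
    adversary fails to regress the mass from the classifier output."""
    return cross_entropy(y_true, p) - massless_importance * mse(m_true, m_pred)

def training_phase(epoch):
    """Schedule used here: classifier only, then regression only, then both."""
    if epoch < 50:
        return "classifier"
    if epoch < 100:
        return "regression"
    return "both"
```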
In order to get a quantification of the sculpting, a chi-square distance
as in https://root.cern.ch/doc/master/classTH1.html#a6c281eebc0c0a848e7a0d620425090a5 is used.
The two histograms to be compared are the background without any cut applied on the DNN variable
and the background with some cut on the DNN applied. The distance is averaged over 4 quantiles of the DNN score.
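A simplified sketch of such a histogram chi-square distance (ROOT's TH1::Chi2Test implements a more elaborate weighted version; this is only meant to illustrate the idea):

```python
import numpy as np

def chi2_distance(h1, h2):
    """Symmetric chi-square distance between two normalized histograms.
    A simplified stand-in for ROOT's TH1::Chi2Test."""
    h1 = np.asarray(h1, dtype=float) / np.sum(h1)
    h2 = np.asarray(h2, dtype=float) / np.sum(h2)
    denom = h1 + h2
    mask = denom > 0            # skip bins that are empty in both histograms
    return float(np.sum((h1[mask] - h2[mask]) ** 2 / denom[mask]))
```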
In the following, two extreme cases can be seen:
On the left the %$\chi^2$% is equal to 0; this is because that plot has no entries, due to the cut on a high DNN score.
On the right the %$\chi^2$% is high and we can see the sculpting clearly.
However we want to balance the two extreme cases; one example of that can be seen in the following figure:
Another measure of distance was also used, namely the Kolmogorov distance; for more see https://root.cern.ch/doc/master/classTH1.html#aeadcf087afe6ba203bcde124cfabbee4 . It is bounded between 0 and 1,
with higher values meaning more compatibility.
However it is not as discriminating as the %$\chi^2$% distance; the reason can be seen in the following plots. For two Kolmogorov distance values of 0, a big difference in the sculpting can be seen:
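The Kolmogorov statistic itself can be sketched as the maximum distance between the two cumulative distributions (note that ROOT's TH1::KolmogorovTest returns a compatibility probability derived from this statistic, not the raw distance):

```python
import numpy as np

def ks_statistic(h1, h2):
    """Maximum distance between the cumulative distributions of two
    histograms with identical binning; 0 = identical shapes."""
    c1 = np.cumsum(h1) / np.sum(h1)
    c2 = np.cumsum(h2) / np.sum(h2)
    return float(np.max(np.abs(c1 - c2)))
```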
Currently there is one major issue with the adversarial implementation, namely that the network mostly converges into one of two extreme cases.
Either it does not decorrelate at all, or it gives every event the same score, thereby removing any information about the mass from the output. These can be seen in the following figures:
On the left there is a heatmap for the significance and on the right one for the %$\chi^2$%.
Another problem is a conceptual one: if we maximize the loss of the regression network, why shouldn't the regression network just estimate values that are far off the mass range?
This is what happens if we choose to simply minimize the loss:
Since the network is not easy to get to converge, another method was looked at,
namely Distance Correlation. It adds the distance correlation (DisCo) between the output variable
and the mass, together with a hyperparameter as multiplier, to the loss function. The DisCo term is a lot easier to implement and gives promising results:
Also, the significance did not drop dramatically:
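A minimal numpy sketch of the distance correlation between two 1D samples (the training needs a differentiable implementation inside the loss, but the quantity is the same):

```python
import numpy as np

def distance_correlation(x, y):
    """Distance correlation between two 1D samples: zero only for
    independent variables, unlike Pearson correlation (biased V-statistic
    version)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a = np.abs(x[:, None] - x[None, :])           # pairwise distance matrices
    b = np.abs(y[:, None] - y[None, :])
    # double-center the distance matrices
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()                        # squared distance covariance
    dvar_x = (A * A).mean()
    dvar_y = (B * B).mean()
    return float(np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y)))
```

In the training, a term like massless_importance * distance_correlation(scores, masses) is then added to the classifier loss.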
A grid search has been conducted with the following parameters:
'massless_importance': over the range 1 to 100
'nEpochs': [50, 200, 400, 600]
The results of this grid search can be seen here, one heatmap for the significance on the left and one for the %$\log(\chi^2)$% on the right:
To use the decorrelation with the use of Distance Correlation, the parameter 'massless' has to be set to 'disco' and a 'massless_importance' has to be chosen.
The importance was also scanned for different network configurations, one can see that it saturates at 100:
The effect of adding an additional decorrelation term, namely a DisCo term between the output and the variable pT, has been investigated.
On the left there is a grid search with decorrelation w.r.t. the mass only; on the right the pT term has been added. The upper row shows the significances and
the bottom row the %$\chi^2$% values.
For the most optimal run of both, namely (importance=120, epochs=600) for mass-only decorrelation and (importance=100, epochs=600) with pT decorrelation on the right, we get a significance of 1.89 in both cases.
This is when we want to see the diboson peak; otherwise there is a slightly better run with mass-only decorrelation, but there the diboson peak cannot be recognized. However, the diboson peak only differs from the Higgs peak in one feature, the mass, meaning that such a network is not completely mass decorrelated.
Since the biggest problem was to get the adversarial net to converge, another idea was implemented: a network that is first decorrelated for n epochs using the DisCo term, and after that trained adversarially against a mass regression. A grid search was conducted over the importances of both decorrelation methods. The results can be seen in the following:
On the left is the significance scan and on the right the %$\log(\chi^2)$% scan; note that the cases with a %$\chi^2$% bigger than 10 were assigned a significance of 1.5, so that there is no confusion, since they are clearly sculpting the mass.
On the left there is the run where the diboson peak is visible and which has the maximum significance; this is with an adversarial importance of 0.001 and a DisCo importance of 60, resulting in a significance of 1.91. For the one on the right, a high adversarial importance as well as a high DisCo importance was used, resulting in a significance of 1.71.
On the left there is the score plot from 0 to 1; this can easily be misinterpreted. Further investigations show that the signal and background do have different distributions, just at low values of the DNN score. This is due to incorporating the adversarial loss: when using softmax as usual in a classifier, the score represents the probability of belonging to either class; however, this is no longer the case when an additional loss term such as the adversarial one is used.
In https://arxiv.org/pdf/1703.03507.pdf it is recommended to train the two networks with two different learning rates, the one of the regression being larger. An implementation of this has been done and about 500 random grid points in a 4-dimensional hyperparameter space have been tested. The results are inconclusive. The significance reached a maximum of 1.85. The most optimal run has a significance of 1.82 and the following mass decorrelation plot is observed; the configuration for this run was: imp_1_epochs_800_discoepochs_0.7_learnrate_0.010.00050.10.001_discoimp_180
In one final grid search all methods of decorrelation were compared. When using just the adversarial decorrelation or just the DisCo decorrelation, only the importance hyperparameter was scanned together with different learnrate configurations. For the combined method (first decorrelation with DisCo, then an adversarial net), the importance was scanned as well as the learnrate configuration and the ratio between DisCo epochs and adversarial epochs (1 = only DisCo).
As we expect, the %$\chi^2$% as well as the significance drops with increasing importance. On the right there are the mass plots with the highest significance (significances = 1.7 and 1.69) where we don't see significant sculpting; note that the %$\chi^2$% in the left plot is averaged over all quantiles, while in the right plot it is just the %$\chi^2$% of the highest DNN score quantile. Generally everything with a %$\chi^2$% bigger than 5 shows significant sculpting if we look at the averaged %$\chi^2$%. But most importantly, in none of the adversarial-only decorrelation runs can we observe the diboson peak.
Also note that in this training the regression network and the classifier network had different learnrates, the classifier learnrate being smaller than the regression learnrate by a fixed factor.
For the DisCo decorrelation we see that the significance is distributed on a narrower scale; this is also due to the fact that a narrower range was scanned for the hyperparameter %$\lambda$%. The plots on the right again show the mass distribution: the first one corresponds to a significance of 1.99, but we can recognize minor sculpting; the right one is the first where we don't see such an increase with mass, and it yields a significance of 1.94.
Note that the diboson peak cannot be seen in any of the mass plots, although in previous scans it could be seen. This might be because in previous scans around 400 grid points were scanned, resulting in a few configurations that show the diboson peak, while in this scan only 50 points were used.

The plot on the left shows a heatmap of the significance, the middle one shows it as a function of (%$\lambda$%, amount of DisCo epochs). Note that always the maximal significance for a given configuration was chosen. On the left plot we can recognize the diboson peak; it corresponds to a significance of 1.87.
Note that the diboson peak can be seen in nine of the top twelve grid points ranked by significance, where we exclude the cases with clear sculpting.
There is a presentation giving a quick overview of the steps that have been conducted until now: presentation_tfVHbb.pdf
The diboson peak leaves the same trace in the detector as the Higgs signal if we exclude the invariant mass as a feature. This means a completely mass-decorrelated network should not assign low scores to events coming from the diboson contribution. Therefore an important step in the studies was to check whether we can observe the diboson peak in the highest signal quantile. Also, a sanity test was applied in every case where we see the diboson peak: if we remove the diboson contribution from the sample, the peak needs to disappear.
On the left we can see the evaluation on the test set including the diboson contribution, in the middle without it. Note that the excess of events around 90 GeV disappears. On the right there is the distribution of scores for the diboson sample and the signal sample. Note that, as expected, there are few events where the net assigns low scores to diboson events.
All of these nets were trained with the diboson contribution in the sample. Another option would be to train without the diboson sample and check whether the neural net assigns the same scores to diboson events as to signal. The idea behind this is that since the classifier has never seen background events that leave exactly the same trace in the detector as the signal, it should classify all of them as signal. However, the resulting classifier would then carry a slight bias, since it was not trained on a sample that is distributed like the data we will finally feed it.
Note: all of the previous studies, as well as the following ones unless specified otherwise, have been conducted in the SR_medhigh_Znn region.
The evaluation of the neural nets was done with the Xbb framework. One problem was that the tensorflow version used for the previous studies was v1.13, but the CMSSW/Xbb framework only runs tensorflow v1.5.0, and unfortunately models from tensorflow versions >1.12 are incompatible with v1.5.0. Therefore the models had to be trained again, where the optimal configuration from previous runs was used. However, no big random search over all hyperparameters was conducted, in order to move forward and build the machinery for the 2D fit. For the configuration used, a significance of 1.891 was obtained. This will surely not be the configuration used in the final investigation once the machinery is complete.
Evaluation of the networks is done by using the following command:
./submit.py T Zvv2017 F evaltf_v1 J run addCollections Eval.SR_medhigh_Znn set='Directories.samplefiles:=<!Directoriessamplefiles_split!>' input SYSout output EVALout i set='systematics.systematics=Nominal' force
There are a few things to mention:
Prior to this, training.ini has to be edited to contain the following block for the addCollections parameter:
[SR_medhigh_Znn]
MVAtype = <!MVAGeneraltype!>
signals = [<!Plot_generalallSIG!>]
backgrounds = [<!Plot_generalallBKG!>]
treeVarSet = ZvvBDTVarsWP
branchName = DNN
checkpoint = /work/kaechb/classifier/results/tfZllDNN/Zvv2017_SR_medhigh_Znn_200310_V11Dec9.h5/adversarial_imp_0.1_epochs_800_discoepochs_0.75_learnrate_0.050.010.0050.0001_discoimp_160/checkpoints
signalIndex = 0
bins = [0.0000, 0.0787, 0.1353, 0.2000, 0.2726, 0.3515, 0.4343, 0.5256, 0.6276, 0.7186, 0.7919, 0.8477, 0.8913, 0.9237, 0.9495, 1.0001]
In general.ini this line has to be included for the region to evaluate:
SR_medhigh_Znn = tensorflowEvaluator_fromCheckpoint.tensorflowEvaluator(mvaName='SR_medhigh_Znn',condition='hJidx[0]>1&&hJidx[1]>1')
parameters:
set='Directories.samplefiles:=<!Directoriessamplefiles_split!>' > this makes use of the individual samples and not the merged one
input SYSout > this just sets the input folder (SYSout in paths.ini)
output EVALout> this just sets the output folder (EVALout in paths.ini)
i > this uses the interactive command for the submit.py
set='systematics.systematics=Nominal' > this excludes the use of systematics
force > this overwrites the output files if they already exist
1) Add all Control/Signal regions to training.ini
2) ./submit.py T Zvv2017 F cachetrainingv1 J cachetraining set='Weights.useSpecialWeight:=False' i
3) ./submit.py T Zvv2017 J export_h5 i force
4) ..?
5) Profit
Since our simulation is still a simulation and we cannot perfectly model nature, we have to apply some corrections to our MC. These are the so-called scale factors: for every major background we look at a region that is mostly dominated by that background. The pT distribution of the backgrounds can be seen in the following:
The data was compared to MC in one bin:
Just out of pure interest, a check on whether the scale factor is dependent on m was done as well:
In that region we compare data and our simulation and fit the scale factor with a maximum likelihood fit, where the likelihood is given by:
For numerical reasons we do not maximize this term but rather minimize the negative log likelihood (nLL). We then fit the scale factors and get the uncertainty from the points where the nLL rises by 0.5 above its minimum, assuming a Gaussian shape of the likelihood.
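As an illustration of the procedure (assuming a simple per-bin Poisson likelihood for the data given a scaled MC prediction; the actual fit setup may differ), a brute-force nLL scan:

```python
import numpy as np

def nll(sf, data, mc):
    """Negative log Poisson likelihood for data given sf * mc,
    dropping the sf-independent log(n!) terms."""
    pred = sf * np.asarray(mc, dtype=float)
    return float(np.sum(pred - np.asarray(data, dtype=float) * np.log(pred)))

def fit_scale_factor(data, mc, grid=np.linspace(0.5, 2.0, 3001)):
    """Scan the nLL over a grid of scale factors; the ~1 sigma uncertainty
    is read off where the nLL crosses its minimum + 0.5."""
    values = np.array([nll(sf, data, mc) for sf in grid])
    best = grid[np.argmin(values)]
    inside = grid[values <= values.min() + 0.5]
    return best, (inside.min(), inside.max())
```

For a single process the analytic optimum is sum(data)/sum(mc), which the scan reproduces.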
To check the MC-data agreement in the signal region, histograms were made. Note that at higher DNN values and around the Higgs peak the histograms were blinded.
For better visibility of the Higgs contribution a log scale was also employed:
The decorrelated neural net was applied to the signal region, binned in the mass and the score. The binning was scanned from 30x30 to 300x300.
Histograms were made for the inclusive signal and background, signal only, background only, Zlf, Zhf, tt, and all the other backgrounds. This can be seen in the following:
Note that all of these histograms have the sample_weight of the ntuples applied
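Building such a weighted 2D histogram can be sketched with numpy (variable names are illustrative):

```python
import numpy as np

def binned_yields(mass, score, weights, n_bins=30):
    """Weighted 2D histogram in (mass, DNN score), as used for the fit.
    Returns the per-bin yields for an n_bins x n_bins binning."""
    hist, mass_edges, score_edges = np.histogram2d(
        mass, score,
        bins=[n_bins, n_bins],
        range=[[np.min(mass), np.max(mass)], [0.0, 1.0]],
        weights=weights,                  # apply the ntuple sample_weight
    )
    return hist, mass_edges, score_edges
```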
For the signal strength we apply a similar fit; in this case we bin in two variables, namely the mass and the score. We then again maximize the likelihood, but note that there are some extra steps: we go to two dimensions, binning in mjj and the DNN score, and we also apply the scale factors to the MC, multiplying each scale factor with the per-bin yield of the corresponding process.
The likelihood then looks like this:
Note that currently we are only performing an Asimov fit, i.e. the pseudo-data is set equal to the MC expectation. This gives us the sensitivity of our analysis. We obtain the following values (this is with 30 bins in each dimension):
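A minimal sketch of such an Asimov fit for the signal strength mu, over one flattened list of bins (again assuming a plain Poisson likelihood, only for illustration):

```python
import numpy as np

def nll_mu(mu, signal, background, observed):
    """Negative log Poisson likelihood for observed given mu * s + b."""
    pred = mu * np.asarray(signal, dtype=float) + np.asarray(background, dtype=float)
    return float(np.sum(pred - np.asarray(observed, dtype=float) * np.log(pred)))

def asimov_fit(signal, background, grid=np.linspace(0.0, 3.0, 3001)):
    """Asimov fit: the pseudo-data equals the nominal expectation (mu = 1),
    so the fitted mu comes out at 1; the width of the nLL + 0.5 interval
    gives the expected sensitivity."""
    asimov = np.asarray(signal, dtype=float) + np.asarray(background, dtype=float)
    values = np.array([nll_mu(mu, signal, background, asimov) for mu in grid])
    best = grid[np.argmin(values)]
    inside = grid[values <= values.min() + 0.5]
    return best, (inside.min(), inside.max())
```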
Attachments:
2d_bkg.png
2d_other.png
2d_sig.png
2d_sigbkg.png
2d_tt.png
2d_zhf.png
2d_zlf.png
CR_hf.png
CR_lf.png
CR_tt.png
DNN.png
DNN_log.png
adversarial_2_bins.png
amountdisco_adversarial.png
best_run_quantile_top.png (DisCo decorrelation)
best_run_sig.png
bestrundiscoadv.png
bestrundiscodnn.png
bestrundiscowith.png
bestrundiscowo.png
bestrunrandomdiscoadv.png
chi_saturate.png
chisquare0.png
chisquarehigh.png
chisquareinbetween.png
disco.png
discoadvchigrid.png
discoadvsiggrid.png
div_loss_adv.png
firstgridsearch.png (First grid search; empty fields mean that the DNN only classified everything as the same class)
grid_adv_chi.png
grid_adv_sig.png
grid_adversarial.png
grid_chi_disco_with_pt.png
grid_chi_disco_wo_pt.png
grid_sig_disco_with_pt.png
grid_sig_disco_wo_pt.png
heatmap_log_chi.png
heatmap_sig.png
hist_tt.png
hist_zhf.png
hist_zlf.png
imp0.png (Kolmogorov is 0)
justadv.png
mass_sculpting_adv.png
max_decorr_adv.png
middlekol.png (Kolmogorov is 0.5)
mjj.png
mjj_log.png
mu_asimov.png
mu_asimov_corr.png
presentation_tfVHbb.pdf (Presentation at a PhD interview)
scores_adv.png
sf_tt.png
sf_tt_1.png
sf_tt_1bin.png
sf_zhf.png
sf_zhf_1.png
sf_zhf_1_.png
sf_zlf.png
sf_zlf_1.png
sf_zlf_1_.png
sig_1.692justadv_imp_10_epochs_800_discoepochs_0.95_learnrate_0.050.010.0050.0001_discoimp_160.png
sig_1.702justadv_imp_10_epochs_800_discoepochs_0.95_learnrate_0.050.010.0050.0001_discoimp_180.png
sig_1.871adversarial_imp_0.01_epochs_800_discoepochs_0.95_learnrate_0.050.010.0050.0001_discoimp_160.png
sig_1.941disco_imp_0.01_epochs_800_discoepochs_0.9_learnrate_0.050.010.0050.0001_discoimp_120.png
sig_1.993disco_imp_0.01_epochs_800_discoepochs_0.85_learnrate_0.050.010.0050.0001_discoimp_140.png
sig_saturate.png
smallkol.png (Kolmogorov is 0)
with_pt_mass.png
with_pt_mass_sig_189.png
wo_pt.png
wo_pt_mass_sig_189.png