Decorrelated Adversarial Neural Network
In this project the decorrelation was achieved by introducing a second network
which tries to estimate the mass of an event based on the output of the classifier and the

of the event. The loss functions of the combined net,
which we will call the adversarial net from now on, is given by
where

is the loss function for the neural net,
that tries to estimate the mass and

the one for the classifier.
The classifier uses cross entropy as a loss function while the regression uses mse.
At first only the classifier is trained (as of now) just for fix 50 epochs,
from epoch 50 to 100 then only the regression is trained, after epoch 100 both nets are trained simultaneously.
The decorrelated neural Network can be used by setting the parameter 'massless' to 'adversarial' .
The following hyperparameters can then be set:
'massless_importance': This is the term

in combined the loss function
Chi Square Distance for the Histograms
In order to get a quantisation of the sculpting a chi-square distance
as in
https://root.cern.ch/doc/master/classTH1.html#a6c281eebc0c0a848e7a0d620425090a5
is used.
The two histograms to be compared are the background without any cuts applied on the DNN variable
and the background where some cuts on the DNN are applied. The distance is averaged over 4 quantiles of the dnn score.
In the following two extreme cases can be seen:



On the left

is equal to 0, this was because in this plot there are no entries, because of the cut on a high DNN Score
On the right $\chi^2$% is high and we can see the sculpting clearly
However we want to balance the two extreme cases, one example for that can be seen in the following figure:

is greater than zero, but still close to zero, that is what we would like to see
---++ Kolmogorov Distance
Another measure of distance was also used , namely Kolmogorov distance for more see here
https://root.cern.ch/doc/master/classTH1.html#aeadcf087afe6ba203bcde124cfabbee4
. It is bound between 0 and 1,
with higher values meaning more compatability.
However it is not as discrimating as the

, the reason can be in the following plots. For 2 values of 0 of the kolmogorov distance, there can be seen a big difference in the sculpting:


Issues with the current implementation of the adversarial neural network
Currently there is one major issue with the adversarial implementation. Namely that the network mostly only choses to converge into two extreme cases.
Either it does not decorrelate at all or it gives every event the same score, and therefore removing any information about the mass fromt he output. These can be seen in the following figures:


On the left therese a heatmap for the significance and on the right one for the
Another problem is a conceptual one. if we maximize the loss of the regression network, why shouldn't the regression network just estimate values that are far off the mass range?
This is what happens if we choose to just simple minimize the loss

:
Decorrelation by DisCo
Since the network is not easy to get to converge, another method was looked at,
namely Distance Correlation. It adds the Distance correlation (
DisCo ) between the output variable
and the mass together with a hyperparameter as multiplier to the loss function. The
DisCo is a lot easier to implement and gives promising results:
Decorrelation with
DisCo, little Sculpting observable in highest DNN quantile
Also the significance did not reduce dramatically:
Scanned Hyperparameters
There has been conducted a grid search with the following parameters:
'massless_importance': over the range of 1 -100
'nEpochs': [50,200,400,600]
The result in this grid search can be seen here, one heatmap of the significance on the left and one for the %log($\chi^2 $)% on the right:

To use the decorrelation with the use of Distance Correlation, the parameter 'massless' has to be set to 'disco' and a 'massless_importance' has to be chosen.
The importance was also scanned for different network configurations, one can see that it saturates at 100:

Comparing Decorrelation with
The effects of adding an additional decorrelation term, namely the disco term between the ouput and the variable pt have been investigated.
On the left there is a grid search with decorreletion wrt the mass only, on the right a term with

has been added. In the upper row its for the significances and in
the bottom row for the


For the most optimal run of both, namely (importance=120,epochs=600) for only mass decorr and (imp=100,epochs=600) with pt decorr on the right, we get a value for the significance of 1.89 in both cases.
This is when we want to see the diboson peak, otherwise there is a slightly better run, when we only decorrelate the mass, but there you can't recognize the diboson peak. However the diboson peaks only differs from the higgs peak in one feature which is the mass, meaning that it is not completly mass decorrelated.

DisCo Adversarial
Since the biggest problem was to get the Adversarial net to converge, another idea was implemented. It is a network that is first decorrelated for n Epochs, using the disco correlation. After that it is trained adversarially vs a mass regression. A grid search was conducted in the importances of both decorrelation methods. The results can be seen in the following:
On the left is the significance scan and on the right the log(

) scan, note that the cases with a

bigger than 10 were given the significance of 1.5 - so that there is no confusion since they are clearly sculpting the mass.

On the left there is the run where the diboson peak is visible and it has the maximum significance, this is with a importance of 0.001 and a disco importance of 60, resulting in a significance of 1.91. For the one on the right, a high importance as well as disco importance was used, resulting in a significance of 1.71.

on the left there is the score plot from 0 to 1, this can be interpretted incorrectly, further inverstigations show that the signal and background have different distributions, just at low values of the DNN score. This is due to incoperating the adversarial loss. When using softmax as usual in a classifier, the score represents the probability that it is either one of the classes, however now this is no longer the case when using an additional loss term as in the adversarial.

Disco Adversarial with different learnrates
In
https://arxiv.org/pdf/1703.03507.pdf
it is recommanded to train both networks with 2 different learning rates, the one of the regression being bigger. An Implementation of this has been done and about 500 random grid points in a 4 dimensional hyperparameter space have been tested. The results are non conclusive. The significance reached its max as 1.85. For the most optimal run, it has a significance of 1.82 and the following mass plot decorrelation is observed, the configuration for this run was: imp_1_epochs_800_discoepochs_0.7_learnrate_0.01-0.0005-0.1-0.001_discoimp_180

:w
Final Comparison of all three methods
In one final grid search all methods of decorrelation were compared. When using just the adversarial decorrelation or disco decorrelation, only the hyperparameter

was scanned together with different learnrate configurations. For the combined method, first decorrelation with disco and then an adversarial net

and

was scanned, as well as the learnrate configuration and the ratio between disco epochs and adversarial epochs, 1 = Only disco.
Just adversarial decorrelation


As we expect the

as well as the significance drops with increasing

. On the right there are the mass plots with the highest significance (significances=1.7,1.69) were we dont see significant sculpting, not that the

in the left plot is averaged over all quantiles, meanwhile in the right plot it is just the

of the highest DNN score quantile. Generally everything with a

bigger than 5 has significant sculpting if we look at the averaged

. But most importantly in none of the just adversarial decorrelation we can observe the dibson peak.
Also note that in this training the regression network and the classifier network had different learnrates, where the classifier learnrate was smaller by a factor of

than the regression learnrate
Just disco decorrelation


For the disco decorrelation we see that the significance ist sitributed on a narrower scale, this is also due to the fact that a narrower range was scanned for the hyperparameter %\lambda%. The plots on the right again show the massdistribution, the first one corresponds to a significance of 1.99, but we can recognize minor sculpting, the right is the first one where we dont see such increase with mass, it yiels then a significane of 1.94.
Note that the diboson peak can not be seen in any of the mass plots, although in previous scans it could be seen. This might be due to the fact than in previous scans around 400 grid points were scanned, resulting in few configuarations that show the diboson peak. In this scan only 50 points were used.
Disco adversarial decorrelation



-
The plot on the left shows a heatmap of the

and significance for (

), the middle one as (%\lambda%, amountof disco epochs). Note that it was always chosen the maximal significance for a given (

,amount of disco) configuration, respectively (

,amount of disco,

) in the middle plot. On the left plot we can recognize the diboson peak, it corresponds to a significance of 1.87.
Not that the the diboson peak can be seen in nine of the top twelve grid points ranked in significance, where we exclude the cases with clear sculpting.
There is a presentation giving a quick overview of the steps that have been conducted until now:
presentation_tfVHbb.pdf
Importance of the diboson peak
The diboson peak leaves the same trace in the detector as the higgs signal if we exclude the invariant mass as a feature. This means a complete mass decorrelated network should not assign low scores to events coming from the diboson contribution. Therefore an important step in the studies was to check whether we can observe the diboson peak in the highest signal quantile. Also a sanity test was applied in every case were we see the diboson peak: If we remove the diboson contribution from the sample the peak needs to disappear.


On the left we can see the evaluation on the test set using the diboson contribution, in the middle without. Note that the excess of events around 90
GeV disappears. On the right there is the distribution of scores for the dibosn sample and signal sample. Note that as expected there are few events where the net assigns low scores to diboson events.
All of these nets were trained with the diboson contribution in sample. Another option would be to train without the diboson sample and check whether the neural net assigns the same scores to diboson eveents are for signal. The idea behind this would then be that since the classifier has never seen background events that leave exactly the same trace in the detector as the signal it should then classify all of them as signal. However the eventual classifier would the carry a slight bias, since it was not trained on a sample that is distributed as the data we will finally feed it.
Evaluation of the neural nets
Note: All of the previous studies have been conducted on the SR_medhigh_Znn region as well as the following unless specified otherwise
The evaluation of the neural nets was done with the Xbb framework, one problem was that tensorflow version that has been used to conduct the previous studies was V1.13, however the CMSSW/Xbb framework only runs tensorflow V1.5.0, unfortunately the models from version of tensorflow >V1.12 are incompatible with tensorflow V1.5.0. Therefore the models had to be run again, were the optimal configuration from previous runs was used. However not a big random search for all hyper parameters has been conducted in order to move forward, to build the machinery for the 2D fit. For the configuration used a significance of 1.891 was obtained. However this will surely not be the configuration used in the final investigation when the machinery is complete.
Evaluation of the networks is done by using the following command:
./submit.py -T Zvv2017 -F eval-tf_v1 -J run --addCollections Eval.SR_medhigh_Znn --set='Directories.samplefiles:=<!Directories|samplefiles_split!>' --input SYSout --output EVALout -i --set='systematics.systematics=Nominal' --force
There are few things to mention:
Prior to this training.ini has to be edited to contain the following block for the --addCollections parameter:
[SR_medhigh_Znn]
MVAtype = <!MVAGeneral|type!>
signals = [<!Plot_general|allSIG!>]
backgrounds = [<!Plot_general|allBKG!>]
treeVarSet =
ZvvBDTVarsWP branchName = DNN
checkpoint = /work/kaechb/classifier/results/tfZllDNN/Zvv2017_SR_medhigh_Znn_200310_V11-Dec9.h5/adversarial_imp_0.1_epochs_800_discoepochs_0.75_learnrate_0.05-0.01-0.005-0.0001_discoimp_160/checkpoints
signalIndex = 0
bins = [0.0000, 0.0787, 0.1353, 0.2000, 0.2726, 0.3515, 0.4343, 0.5256, 0.6276, 0.7186, 0.7919, 0.8477, 0.8913, 0.9237, 0.9495, 1.0001]
In general.ini this line has to be included:q for the region to evaluate:
SR_medhigh_Znn = tensorflowEvaluator_fromCheckpoint.tensorflowEvaluator(mvaName='SR_medhigh_Znn',condition='hJidx[0]>-1&&hJidx[1]>-1')
parameters:
--set='Directories.samplefiles:=<!Directories|samplefiles_split!>' -> this makes use of the individual samples and not the merged one
--input SYSout -> this just sets the input folder (SYSout in paths.ini)
--output EVALout-> this just sets the output folder (EVALout in paths.ini)
-i -> this uses the interactive command for the submit.py
--set='systematics.systematics=Nominal' -> this excludes the use of systematics
--force -> this overwrites the output files if they already exist
Two Dimensional Fit
1) Add all Control/Signal regions to training.ini
2) ./submit.py -T Zvv2017 -F cachetraining-v1 -J cachetraining --set='Weights.useSpecialWeight:=False' -i
3) ./submit.py -T Zvv2017 -J export_h5 -i --force
4) ..?
5) Profit
Scale Factors
Since our simulation is still a simulation and we can not perfectly simulate nature we have to apply some corrections to our MC data. This are the so called
scale factors: For every major background we look at a region that is mostly dominated by one of these backgrounds. The pT distributon of the backgrounds can be seen in the following:


The data was compared to mc in one bin:


Just out of pure interested, a check on whether the scalefactor is dependent on m was done as well:


In that region we compare data and our simulation and fit the scale factor with a max likelihood fit where the likelihood is given by:
For numerical reasons we do not maximize this term but rather minimize the negative log likelihood (nLL). We then fit the scale factors and get the uncertainty from nLL+0.5, assuming gaussian distribution of the likelihood.


1d Histograms of Data/MC agreement in Signal Region
To check the MC Data agreement in the Signal region, histograms were made. Note that around higher DNN values and around the highs peak the histograms were blinded.

For better visibility of the Higgs contribution a log scale was also employed:

2d Histograms
Applied decorrelated neural net to signal region, Binned in

and the score. The bin range was scanned from 30 x 30 to 300x300.
Histograms were made in the inclusive signal and background, signal only, background only, Zlf, Zhf, tt, and all the other backgrounds only. This can be seen in the following:





Note that all of these histograms have the sample_weight of the ntuples applied
Fit signal strength
For the signal strength we apply a similar fit, in this case we just bin in two variables, namely the mass and score. We then again maximize the likelihood but here note that there are some extra steps. We go through 2 dimension where we also bin in mjj and DNN, but also we apply the scalefactors to MC data, where we multiply the scale factor with the yield of the process per bin.
The likelihood then looks like this:
Note that currently we are only Asimov fitting , i.e.

. This gives us the sensitivity of our analysis. We receive the following values (this is with 30 bins in each dimension):