-- AntonioGiannini - 2019-11-07


Introduction

Sorry for typos in the copy-pasted code; please look at the full code in the git repository:

paths to ntuples:
  • ATLAS UI Napoli:
    • /data3/agiannini/DBLCode_19nov19/TreeAnalysis_MLUpdate_05mar19/TreeAnalysis_InputJetsOR_29may19/TreeAnalysis_mc16a/run/mc16a_RNN_2jets_TrainVBFggFH1000_04jun19/CutRNN_0p8_SelSpin0/
    • /data3/agiannini/DBLCode_19nov19/TruthDAOD/xAODReader/run/xMLTutorial_20jul19/
  • lxplus:
    • /afs/cern.ch/work/a/angianni/public/MLTutorial_22jul19/Trees
    • /afs/cern.ch/work/a/angianni/public/MLTutorial_22jul19/TruthTrees

Prerequisites

  • python

How to install miniconda and keras

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh

bash Miniconda3-latest-Linux-x86_64.sh

Log out and back in; now conda should work:

conda create -n AIenv

conda update conda

conda config --set auto_activate_base False

conda activate AIenv (if you have problems, try just: activate AIenv)

conda install keras scikit-learn pandas numpy theano matplotlib seaborn pydot colorama

pip install scikit-hep (if you have problems with the pip version, update the python version using the command: scl enable rh-python36 bash)

pip install uproot
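Before moving on, it can save time to check that the environment is complete. The small helper below is not part of the tutorial code; it just lists any packages whose import name cannot be found (note that scikit-learn's import name is `sklearn`):

```python
import importlib.util

def missing_modules(names):
    """Return the subset of module (import) names that cannot be found in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# import names of the packages installed above
needed = ["keras", "sklearn", "pandas", "numpy", "theano",
          "matplotlib", "seaborn", "colorama", "uproot"]
print("Missing packages:", missing_modules(needed))
```

If the list is non-empty, re-run the corresponding conda/pip install inside the AIenv environment.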

Useful links

Machine Learning Forum suggestions:
  • Dinos @ Exotics Workshop 2018 Roma RNN lecture: slazebni.cs.illinois.edu/spring17/lec20_rnn.pdf
  • Michela Paganini ML tutorial
  • ML chat with Silvia/Claudio
  • lwtnn: how to use a Keras model in a C++ based framework



Session1: Deep Neural Network (DNN)

Read Trees in ROOT files

Let's start by reading the Trees stored in the ROOT files we have. It is enough to know the path to the ROOT files, the names of the files and the name of the tree we would like to read.


#!/usr/bin/env python
import os, re, pickle
import uproot
import pandas as pd

path = '/data3/agiannini/DBLCode_19nov19/TreeAnalysis_MLUpdate_05mar19/TreeAnalysis_InputJetsOR_29may19/TreeAnalysis_mc16a/run/mc16a_RNN_2jets_TrainVBFggFH1000_04jun19/CutRNN_0p8_SelSpin0/'
files = ['FlatTree_ggFH1000_spin0.root', 'FlatTree_Diboson_spin0.root']

for f in files:
    print(f)
    file_name = path + '/' + f
    print(file_name)
    if os.path.isfile(file_name):
        theFile = uproot.open(file_name)
        theKeys = theFile.keys()
        print(theKeys)
        tree = theFile['nominal']
        DF = tree.pandas.df()
        print(DF)

Manage Pandas DataFrames

Given the dataframe from the tree, we can easily add any new information we need: for example, a new column fixing whether the sample will be used as category 0 or category 1 of the binary classification.


### Get Pandas DataFrame from tree
DF = tree.pandas.df()
if 'ggFH' in f: DF['Category'] = 1
else: DF['Category'] = 0
print(DF)

If you need, you can apply a pre-selection keeping only the events you really need for the training of the model:


### Do Pre-Selection
print('{:<20} {:<15}'.format('Events before preselection:', Fore.BLUE+str(DF.shape[0])))
presel = (DF.isGGFMergedPresel==1) #& ()
DF = DF[presel]
print('{:<20} {:<15}'.format('Events after preselection:', Fore.BLUE+str(DF.shape[0])))

We need to include a Python library in order to get the nice colour style:


from colorama import init, Fore, Back, Style
init(autoreset=True)

It can be VERY useful to write this dataframe to a pickle file or a numpy array file, for example to avoid re-reading the original ROOT file several times.


### Save DF as numpy or pickle file
file_name_simple = f
file_name_simple = file_name_simple.replace('FlatTree_', '')
file_name_simple = file_name_simple.replace('_spin0.root', '')
DF.to_pickle( path_output + file_name_simple + '_FullNoRandom.pkl')

Train/Test splitting

Given the dataframes read directly from the trees, it can be useful to shuffle the events before starting to work with them (think of how the different Z+jets slices are distributed within the trees...); of course, it is NOT strictly needed here. We need the scikit-learn library to perform the shuffling.


import sklearn.utils
...
### Shuffle the events
X = sklearn.utils.shuffle(DF, random_state=123) # '123' is the random seed

Once we have shuffled the events, we can split the sample into Train/Test sub-datasets.


### Split in Train/Test samples
print('{:<20} {:<15} {:<15}'.format('Number of events', Fore.GREEN+str(X.shape[0]), " Splitting to..."))
Ntrain_stop = int(round(X.shape[0] * 0.7))
X_Train = X[:Ntrain_stop]
X_Test = X[Ntrain_stop:]
print('{:<20} {:<15}'.format('Train events', Fore.GREEN+str(X_Train.shape[0])))
print('{:<20} {:<15}'.format('Test events', Fore.GREEN+str(X_Test.shape[0])+'\n'))
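As an aside, the shuffle plus 70/30 split above can also be obtained in a single step with scikit-learn's train_test_split; a minimal sketch on a dummy array (the array contents are illustrative, not tutorial data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# dummy stand-in for the dataframe: 1000 events, 12 features
X = np.arange(1000 * 12, dtype=float).reshape(1000, 12)

# shuffles and splits in one call; random_state makes the split reproducible
X_Train, X_Test = train_test_split(X, train_size=0.7, shuffle=True, random_state=123)
print(X_Train.shape, X_Test.shape)  # (700, 12) (300, 12)
```

This also works directly on a pandas DataFrame, returning two DataFrames.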

Now suppose we have several trees to use as samples for Category 0 of our binary problem and several for Category 1. The simplest way to work with them is to concatenate all the dataframes of Category 0 and all those of Category 1.


### Manage Train/Test DFs already created
samples_categ0 = ['Diboson']
samples_categ1 = ['ggFH1000']
X_train_categ0 = []
X_test_categ0 = []
X_train_categ1 = []
X_test_categ1 = []

for f in files:
    for i in samples_categ0:
        if i in f:
            print('{:<20} {:<15}'.format('Reading DataFrame ', Fore.GREEN+str(path_output + i + '_Train.pkl')))
            X_train = pd.read_pickle( path_output + i + '_Train.pkl')
            X_train_categ0 += [X_train]
            print('{:<20} {:<15}'.format('Reading DataFrame ', Fore.GREEN+str(path_output + i + '_Test.pkl')))
            X_test = pd.read_pickle( path_output + i + '_Test.pkl')
            X_test_categ0 += [X_test]
    for i in samples_categ1:
        if i in f:
            print('{:<20} {:<15}'.format('Reading DataFrame ', Fore.GREEN+str(path_output + i + '_Train.pkl')))
            X_train = pd.read_pickle( path_output + i + '_Train.pkl')
            X_train_categ1 += [X_train]
            print('{:<20} {:<15}'.format('Reading DataFrame ', Fore.GREEN+str(path_output + i + '_Test.pkl')))
            X_test = pd.read_pickle( path_output + i + '_Test.pkl')
            X_test_categ1 += [X_test]

DF_train_categ0 = pd.concat(X_train_categ0)
DF_train_categ1 = pd.concat(X_train_categ1)
DF_test_categ0 = pd.concat(X_test_categ0)
DF_test_categ1 = pd.concat(X_test_categ1)
DF_train_categ0.to_pickle( path_output + 'DF_Train_categ0.pkl')
DF_train_categ1.to_pickle( path_output + 'DF_Train_categ1.pkl')
DF_test_categ0.to_pickle( path_output + 'DF_Test_categ0.pkl')
DF_test_categ1.to_pickle( path_output + 'DF_Test_categ1.pkl')

Now put together and shuffle the events of the two categories:


### Mix together the 2 categories
DF_train = X_train_categ0 + X_train_categ1
DF_train = pd.concat(DF_train)
DF_train = sklearn.utils.shuffle(DF_train, random_state=123) # '123' is the random seed
DF_test = X_test_categ0 + X_test_categ1
DF_test = pd.concat(DF_test)
#DF_test = sklearn.utils.shuffle(DF_test, random_state=123) # shuffling the test events is not really needed

Finally, we can build the X and y arrays we need to perform the training of our model:


### build X and y for the training
VariablesToModel = [
    'fat_jet_pt', 'fat_jet_eta', 'fat_jet_phi', 'fat_jet_E',
    'l1_pt', 'l1_eta', 'l1_phi', 'l1_e',
    'l2_pt', 'l2_eta', 'l2_phi', 'l2_e']
X_train = DF_train[VariablesToModel].to_numpy() # as_matrix() was removed in recent pandas
y_train = DF_train['Category'].to_numpy()
X_test = DF_test[VariablesToModel].to_numpy()
y_test = DF_test['Category'].to_numpy()
DF_train[VariablesToModel].to_pickle( path_output + 'X_Train.pkl')
DF_train['Category'].to_pickle( path_output + 'y_Train.pkl')
DF_test[VariablesToModel].to_pickle( path_output + 'X_Test.pkl')
DF_test['Category'].to_pickle( path_output + 'y_Test.pkl')

Look at input features and features scaling

The last step before training the model is to scale the input features (and, why not, take a look at them if you have not done so already).

To make quick plots of an input feature, use:


import matplotlib.pyplot as plt

variable = 'fat_jet_pt'
bins = 50
a = 0
b = 1000
plt.hist(DF_train_categ0[variable], bins=bins, range=[a,b], histtype='step', lw=2, alpha=0.5, label=[r'Category0 Train'], density=True) # 'normed' was removed in recent matplotlib
plt.hist(DF_train_categ1[variable], bins=bins, range=[a,b], histtype='step', lw=2, alpha=0.5, label=[r'Category1 Train'], density=True)
plt.hist(DF_test_categ0[variable], bins=bins, range=[a,b], histtype='stepfilled', lw=2, alpha=0.5, label=[r'Category0 Test'], density=True)
plt.hist(DF_test_categ1[variable], bins=bins, range=[a,b], histtype='stepfilled', lw=2, alpha=0.5, label=[r'Category1 Test'], density=True)
plt.ylabel('Norm. Entries')
plt.xlabel(variable)
plt.legend(loc="upper right")
plt.savefig(path_output+"/plot_" + variable + ".pdf")
plt.clf()

Include the sklearn preprocessing module:


from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler, LabelEncoder

and do:


scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
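A quick way to convince yourself of what StandardScaler does: after the transform, each training column has mean ≈ 0 and standard deviation ≈ 1. A sketch on toy data (in the tutorial, X_train comes from the dataframe):

```python
import numpy as np
from sklearn import preprocessing

rng = np.random.RandomState(0)
# toy features with a large offset and spread, mimicking unscaled kinematic variables
X_train = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))

scaler = preprocessing.StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)

print(X_scaled.mean(axis=0))  # ~0 per column
print(X_scaled.std(axis=0))   # ~1 per column
```

Note that the same fitted scaler must be applied to the test sample (and later to any new sample), never a scaler re-fitted on the test data.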

Build model: dense layers and dropout

We are now ready to build our architecture and try the training of the model; first include:


from keras.models import Sequential, Model
from keras.layers.core import Dense, Activation
from keras.layers import BatchNormalization, Dropout, concatenate
from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.optimizers import Adam

and then build the model:


print(Fore.BLUE+"--------------------------------")
print(Back.BLUE+"       BUILDING DNN MODEL       ")
print(Fore.BLUE+"--------------------------------")
N_input = 12 # number of input variables (must match len(VariablesToModel))
width = 24 # number of neurons per layer
dropout = 0.3
depth = 3 # number of hidden layers
model = Sequential()
model.add(Dense(units=width, input_dim=N_input))
model.add(Activation('relu'))
model.add(Dropout(dropout))
for i in range(0, depth):
    model.add(Dense(width))
    model.add(Activation('relu'))
    model.add(Dropout(dropout))
model.add(Dense(1, activation='sigmoid'))

and perform the training:


print(Fore.BLUE+"-----------------------------")
print(Back.BLUE+"       TRAIN DNN MODEL       ")
print(Fore.BLUE+"-----------------------------")
model.compile(loss='binary_crossentropy', optimizer='Adam', metrics=['accuracy'])
from collections import Counter
cls_ytrain_count = Counter(y_train)
print(cls_ytrain_count)
Nclass = len(cls_ytrain_count)
print('{:<25}'.format(Fore.BLUE+'Training with class_weights because of unbalanced classes !!'))
wCateg0 = (cls_ytrain_count[1] / cls_ytrain_count[0])
wCateg1 = 1.0
print(Fore.GREEN+'Weights to apply:')
print('{:<15}{:<15}'.format('Category0', round(wCateg0, 3)))
print('{:<15}{:<15}'.format('Category1', wCateg1))
callbacks = [
    # if the validation loss does not decrease for 7 epochs, terminate the training.
    EarlyStopping(verbose=True, patience=7, monitor='val_loss'),
    # always save the model weights with the best validation loss.
    ModelCheckpoint(path_output+'/model.h5', monitor='val_loss', verbose=True, save_best_only=True)
]
modelMetricsHistory = model.fit(
    X_train,
    y_train,
    class_weight={
        0: wCateg0,
        1: wCateg1},
    epochs=100,
    batch_size=2048,
    validation_split=0.2,
    callbacks=callbacks,
    verbose=1
)

Output score and ROC curve


### Evaluate the model on Train/Test
X_train_categ0 = X_train[y_train==0]
X_train_categ1 = X_train[y_train==1]
print('Running model prediction on Xtrain_categ0')
yhat_train_categ0 = model.predict(X_train_categ0, batch_size=2048)
print('Running model prediction on Xtrain_categ1')
yhat_train_categ1 = model.predict(X_train_categ1, batch_size=2048)
print('Running model prediction on Xtrain')
yhat_train = model.predict(X_train, batch_size=2048)

X_test_categ0 = X_test[y_test==0]
X_test_categ1 = X_test[y_test==1]
print('Running model prediction on Xtest_categ0')
yhat_test_categ0 = model.predict(X_test_categ0, batch_size=2048)
print('Running model prediction on Xtest_categ1')
yhat_test_categ1 = model.predict(X_test_categ1, batch_size=2048)
print('Running model prediction on Xtest')
yhat_test = model.predict(X_test, batch_size=2048)

bins = np.linspace(0, 1, 50)
plt.hist(yhat_train_categ0, bins=bins, histtype='step', lw=2, alpha=0.5, label=[r'Category0 Train'], density=True)
plt.hist(yhat_train_categ1, bins=bins, histtype='step', lw=2, alpha=0.5, label=[r'Category1 Train'], density=True)
plt.hist(yhat_test_categ0, bins=bins, histtype='stepfilled', lw=2, alpha=0.5, label=[r'Category0 Test'], density=True)
plt.hist(yhat_test_categ1, bins=bins, histtype='stepfilled', lw=2, alpha=0.5, label=[r'Category1 Test'], density=True)
plt.ylabel('Norm. Entries')
plt.xlabel('DNN score')
plt.legend(loc="upper center")
plt.savefig(path_output+"/MC_Data_TrainTest_Score.pdf")
plt.clf()

Now, let's plot the ROC curve.


### Make ROC Curves
from sklearn.metrics import roc_curve, auc

w_test = DF_test['weight']
w_train = DF_train['weight']
fpr_w, tpr_w, thresholds_w = roc_curve(y_test, yhat_test, sample_weight=w_test)
roc_auc_w = auc(fpr_w, tpr_w) # 'reorder' was removed in recent scikit-learn
print('{:<35} {:<25.3f}'.format(Fore.GREEN+'ROC AUC weighted (Test)', roc_auc_w))
fpr_train_w, tpr_train_w, thresholds_train_w = roc_curve(y_train, yhat_train, sample_weight=w_train)
roc_auc_train_w = auc(fpr_train_w, tpr_train_w)
print('{:<35} {:<25.3f}'.format(Fore.GREEN+'ROC AUC weighted (Train)', roc_auc_train_w))
plt.plot(fpr_w, tpr_w, color='darkorange', lw=2, label='Full curve Test (area = %0.2f)' % roc_auc_w)
plt.plot(fpr_train_w, tpr_train_w, color='c', lw=2, label='Full curve Train (area = %0.2f)' % roc_auc_train_w)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--') # diagonal = random classifier
plt.xlim([-0.05, 1.0])
plt.ylim([0.0, 1.05])
plt.ylabel('True Positive Rate (weighted)')
plt.xlabel('False Positive Rate (weighted)')
plt.title('ROC curve for Category1 vs Category0')
plt.legend(loc="lower right")
plt.savefig(path_output + "/ROC_weighted.png")
plt.clf()

Reorganise the code into a (simple) helper class

main --> https://gitlab.cern.ch/angianni/mltool/blob/master/CodeForSession2/runDNN.py

helper tool --> https://gitlab.cern.ch/angianni/mltool/blob/master/CodeForSession2/Helpers.py

Session2: Deep Neural Network (DNN)

Accuracy and Loss functions

The Accuracy is defined as the fraction of events whose predicted category matches the true one:

Accuracy = N(correct predictions) / N(all predictions)

The Loss function (the binary cross-entropy set in model.compile) is defined as:

Loss = -(1/N) * sum_i [ y_i * log(yhat_i) + (1 - y_i) * log(1 - yhat_i) ]
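Both definitions are easy to reproduce by hand with numpy, which makes them concrete; this is just an illustrative sketch, Keras computes the same quantities internally during model.fit:

```python
import numpy as np

def accuracy(y_true, y_score, threshold=0.5):
    """Fraction of events whose thresholded score matches the true label."""
    y_pred = (y_score >= threshold).astype(int)
    return np.mean(y_pred == y_true)

def binary_crossentropy(y_true, y_score, eps=1e-7):
    """Mean binary cross-entropy; eps protects the logarithms at 0 and 1."""
    p = np.clip(y_score, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true  = np.array([1, 1, 0, 0])
y_score = np.array([0.9, 0.4, 0.2, 0.6])
print(accuracy(y_true, y_score))             # 2 of 4 correct -> 0.5
print(binary_crossentropy(y_true, y_score))
```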

Add this very simple function (or just the code, if you prefer):


def plotTrainPerformance(path_output, modelMetricsHistory):
    # summarize history for accuracy
    plt.plot(modelMetricsHistory.history['acc'])
    plt.plot(modelMetricsHistory.history['val_acc'])
    plt.title('Model accuracy')
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Val'], loc='lower right')
    plt.savefig(path_output + "/Accuracy.png")
    plt.clf()
    # summarize history for loss
    plt.plot(modelMetricsHistory.history['loss'])
    plt.plot(modelMetricsHistory.history['val_loss'])
    plt.title('Model loss')
    plt.ylabel('Loss')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Val'], loc='upper right')
    plt.savefig(path_output + "/Loss.png")
    plt.clf()

Features ranking: permutation importance method

Look at the Eli5 library:

https://eli5.readthedocs.io/en/latest/index.html

https://eli5.readthedocs.io/en/latest/blackbox/permutation_importance.html?highlight=permutator

Add it if there is time.

First, install the library:

conda install -c conda-forge eli5

Use the methods:


from eli5.permutation_importance import get_score_importances

def MyNNScore(X, y):
    modelpath = 'run_test/'
    model = load_model(modelpath+'/model.h5')
    yhat = model.predict(X, verbose=True, batch_size=2048)
    return yhat

def DoFeaturesRanking(path_output, X, y, variables):
    base_score, score_decreases = get_score_importances(MyNNScore, X, y)
    feature_importances = np.mean(score_decreases, axis=0)
    fpr0, tpr0, thresholds0 = roc_curve(y, base_score)
    AUC0 = auc(fpr0, tpr0)
    score = base_score - score_decreases
    #print(score.shape)
    nRandom = score.shape[0]
    nVars = score.shape[1]
    nEvents = score.shape[2]
    AUC = np.zeros( (nRandom, nVars) )
    for i in range(0, nRandom):
        for j in range(0, nVars):
            score_all = score[i][j][:]
            score_1 = score_all[y==1]
            score_0 = score_all[y==0]
            fpr, tpr, thresholds = roc_curve(y, score_all)
            AUC[i][j] = auc(fpr, tpr)
    AUCs = np.mean(AUC, axis=0)
    AUCs_min = np.min(AUC, axis=0)
    AUCs_max = np.max(AUC, axis=0)
    AUCs_error = (AUCs_max - AUCs_min) / 2
    plt.figure()
    AUC0_vec = np.zeros( (len(variables)) )
    AUC0_vec[:] = AUC0
    plt.plot(variables, AUC0_vec, label='AUC0')
    plt.errorbar(variables, AUCs, yerr=AUCs_error, label='AUC variables')
    plt.legend(loc="upper right")
    plt.xlabel('')
    plt.ylabel('AUC')
    plt.savefig(path_output + 'plot_FR_AUC.pdf')
    AUC_mean = np.zeros( (nVars) )
    plt.figure()
    for j in range(0, nVars):
        score_all = np.mean(score, axis=0)
        score_all = score_all[j][:]
        print(score_all.shape)
        score_1 = score_all[y==1]
        score_0 = score_all[y==0]
        label = 'Removing ' + variables[j]
        plt.hist(score_1, range=[0,1], bins=50, label=label, histtype='step', density=False)
        fpr, tpr, thresholds = roc_curve(y, score_all)
        AUC_mean[j] = auc(fpr, tpr)
        # plt.plot(fpr, tpr, label=label)
    plt.legend(loc="upper right")
    plt.xlabel('DNN Score')
    plt.ylabel('Events')
    plt.savefig(path_output + 'plot_FR_scores.pdf')
    plt.figure()
    plt.plot(variables, AUC0_vec, label='AUC0')
    #plt.plot(variables, 1-(AUC_mean/AUC0_vec), label='Features Weight')
    plt.plot(variables, (1-AUC_mean)/(1-AUC0_vec), label='Model Inefficiency')
    plt.legend(loc="upper right")
    plt.xlabel('')
    plt.ylabel('Rank')
    plt.savefig(path_output + 'plot_FR_Rank.pdf')
    return base_score, score_decreases, feature_importances
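The idea behind the permutation method can be seen on a toy problem without eli5 or Keras: shuffle one input column at a time and measure how much the AUC drops. An informative feature causes a large drop, a noise feature almost none. A self-contained sketch, using a logistic regression as a stand-in for the DNN:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(42)
n = 2000
y = rng.randint(0, 2, size=n)
X = np.zeros((n, 2))
X[:, 0] = y + rng.normal(scale=0.5, size=n)  # informative feature
X[:, 1] = rng.normal(size=n)                 # pure noise feature

clf = LogisticRegression().fit(X, y)
base_auc = roc_auc_score(y, clf.predict_proba(X)[:, 1])

drops = []
for j in range(X.shape[1]):
    Xp = X.copy()
    rng.shuffle(Xp[:, j])  # destroy the information in column j only
    auc_j = roc_auc_score(y, clf.predict_proba(Xp)[:, 1])
    drops.append(base_auc - auc_j)

print("AUC drop per feature:", drops)  # feature 0 dominates, feature 1 is ~0
```

eli5's get_score_importances does the same thing, repeating the shuffling several times per feature to estimate the spread.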

Example DNN1: Test a trained model over new datasets

The idea is that the new dataset contains only test events, and we would like to load the model we already have and get its prediction on these new samples.

We need to load the model and the scaler used during the training phase:


from keras.models import load_model
...
def LoadDNNModel(modelpath):
    if not os.path.isfile(modelpath + '/model.h5'):
        print(Fore.RED + 'Model path does not exist')
        quit()
    print(Back.BLUE + "Loading model at path " + modelpath)
    model = load_model(modelpath+'/model.h5')
    scaler = joblib.load(modelpath + '/scaler.save')
    return model, scaler

Do not forget to save the scaler during the training phase:


from sklearn.externals import joblib
...
### Save the scaler
scaler_file = "scaler.save"
joblib.dump(scaler, path_output + '/' + scaler_file)

Now we just need to convert the ROOT files, manage the dataframes (this time only feature selection and scaling) and load the model. Use the function:


def DNNResultsOnlyTest(path_output, model, scaler, samples, VariablesToModel):
    ### read the DFs
    for s in samples:
        print('{:<20} {:<15}'.format('Reading DataFrame ', Fore.GREEN+str(path_output + s + '_FullNoRandom.pkl')))
        DF = pd.read_pickle( path_output + s + '_FullNoRandom.pkl')
        X = DF[VariablesToModel].to_numpy()
        y = DF['Category'].to_numpy()
        ### Features scaling
        X = scaler.transform(X)
        ### Evaluate the model on Test
        print('Running model prediction on Xtest')
        yhat = model.predict(X, batch_size=2048)
        bins = np.linspace(0, 1, 50)
        plt.hist(yhat, bins=bins, histtype='step', lw=2, alpha=0.5, label=[r'' + s], density=True)
        ### add the score to the DF
        DF['DNNScore'] = yhat
        DF.to_pickle( path_output + s + '_WithScore.pkl')
    plt.ylabel('Norm. Entries')
    plt.xlabel('DNN score')
    plt.legend(loc="upper center")
    plt.savefig(path_output+"/MC_Data_NewSamples_Score.pdf")
    plt.clf()

So we can use this method in the main:


#!/usr/bin/env python
import os, re, pickle
import Helpers as h

path = '/data3/agiannini/DBLCode_19nov19/TreeAnalysis_MLUpdate_05mar19/TreeAnalysis_InputJetsOR_29may19/TreeAnalysis_mc16a/run/mc16a_RNN_2jets_TrainVBFggFH1000_04jun19/CutRNN_0p8_SelSpin0/'
files = ['FlatTree_ggFH1000_spin0.root',
         'FlatTree_ttBar_spin0.root', 'FlatTree_Diboson_spin0.root', 'FlatTree_Zjets_spin0.root',
         'FlatTree_Data15_spin0.root', 'FlatTree_Data16_spin0.root'
         ]
path_output = 'run_test/'
if not os.path.exists(path_output):
    os.makedirs(path_output)
###
BranchesToRead = [
    'weight', 'NJETS', 'isGGFMergedPresel',
    'fat_jet_pt', 'fat_jet_eta', 'fat_jet_phi', 'fat_jet_E',
    'l1_pt', 'l1_eta', 'l1_phi', 'l1_e',
    'l2_pt', 'l2_eta', 'l2_phi', 'l2_e',
]
h.ConvertROOTFiles(path, files, path_output, BranchesToRead)
###
samples_categ0 = ['Diboson']
samples_categ1 = ['ggFH1000']
#DF_train, DF_test = h.PrepareDF(path, files, path_output, samples_categ0, samples_categ1)
###
VariablesToModel = [
    'fat_jet_pt', 'fat_jet_eta', 'fat_jet_phi', 'fat_jet_E',
    'l1_pt', 'l1_eta', 'l1_phi', 'l1_e',
    'l2_pt', 'l2_eta', 'l2_phi', 'l2_e',
]
#X_train, X_test, y_train, y_test = h.PrepareDFForTraining(path_output, VariablesToModel, DF_train, DF_test)
###
#model = h.DNNBuilder()
#model, modelMetricsHistory = h.DNNTrainer(path_output, model, X_train, y_train)
###
#h.DNNResults(path_output, model, X_train, X_test, y_train, y_test, DF_train, DF_test)
#h.plotTrainPerformance(path_output, modelMetricsHistory)
### Load model and Test
model, scaler = h.LoadDNNModel(path_output)
samples = ['Zjets', 'Diboson', 'ttBar', 'Data15', 'Data16', 'ggFH1000']
h.DNNResultsOnlyTest(path_output, model, scaler, samples, VariablesToModel)

It can be necessary (and faster) to read only a subset of branches from the ROOT files:


def ConvertROOTFiles(path, files, path_output, branchesToRead):
    ...
    DF = tree.pandas.df(branchesToRead)

Read output DataFrames with scores and make plots (stack)

Use the very simple function:


# skh_plt is assumed to come from the scikit-hep package installed above,
# e.g. from skhep.visual import MplPlotter as skh_plt
def MakeStackPlot(path_output):
    Zjets = pd.read_pickle( path_output + 'Zjets_WithScore.pkl')
    Diboson = pd.read_pickle( path_output + 'Diboson_WithScore.pkl')
    ttBar = pd.read_pickle( path_output + 'ttBar_WithScore.pkl')
    Data15 = pd.read_pickle( path_output + 'Data15_WithScore.pkl')
    Data16 = pd.read_pickle( path_output + 'Data16_WithScore.pkl')
    d = [Data15, Data16]
    Data = pd.concat( d )
    bins = np.linspace(0, 1, 50)
    plt.hist( Zjets['DNNScore'], bins=bins, stacked=True, lw=2, alpha=0.5, label='Zjets', weights=Zjets['weight'] )
    plt.hist( Diboson['DNNScore'], bins=bins, stacked=True, lw=2, alpha=0.5, label='Diboson', weights=Diboson['weight'] )
    plt.hist( ttBar['DNNScore'], bins=bins, stacked=True, lw=2, alpha=0.5, label='ttBar', weights=ttBar['weight'] )
    _ = skh_plt.hist(Data['DNNScore'], bins=bins, errorbars=True, histtype='marker', label='Data', color='black')
    plt.yscale('log')
    plt.ylabel('Entries')
    plt.xlabel('DNN Score')
    plt.legend(loc="upper right")
    plt.savefig(path_output+"/StackPlot_Score.pdf")
    plt.clf()

Example DNN2: Parametrised-DNN

The architecture of the pDNN is the same as the DNN; there are simply as many extra input variables as the problem has parameters. In our case, let's try to parametrise on the truth mass of the signals. What we need is an extra variable in the DF and in the DNN inputs; let's use:


def PrepareDFForTrainingPDNN(path_output, VariablesToModel, DF_train, DF_test):
    DF_train_0 = DF_train[ DF_train['Category']==0 ]
    DF_train_1 = DF_train[ DF_train['Category']==1 ]
    # parse the truth mass out of the sample name, e.g. 'ggFH1000' -> 1000
    DF_train_1['TruthMass'] = DF_train_1['Sample']
    DF_train_1['TruthMass'] = DF_train_1.TruthMass.replace({'ggFH': ''}, regex=True)
    DF_train_1['TruthMass'] = pd.to_numeric(DF_train_1['TruthMass'])
    # the background gets a mass drawn randomly from the signal distribution
    DF_train_0['TruthMass'] = np.random.choice( a=DF_train_1['TruthMass'], size=DF_train_0.shape[0] )
    DF_train = pd.concat( [DF_train_0, DF_train_1] )
    #print(DF_train_0['TruthMass'])
    #plt.figure()
    #plt.hist(DF_train_1['TruthMass'])
    #plt.hist(DF_train_0['TruthMass'])
    #plt.show()
    DF_test_0 = DF_test[ DF_test['Category']==0 ]
    DF_test_1 = DF_test[ DF_test['Category']==1 ]
    DF_test_1['TruthMass'] = DF_test_1['Sample']
    DF_test_1['TruthMass'] = DF_test_1.TruthMass.replace({'ggFH': ''}, regex=True)
    DF_test_1['TruthMass'] = pd.to_numeric(DF_test_1['TruthMass'])
    DF_test_0['TruthMass'] = np.random.choice( a=DF_test_1['TruthMass'], size=DF_test_0.shape[0] )
    DF_test = pd.concat( [DF_test_0, DF_test_1] )
    ### build X and y for the training
    X_train = DF_train[VariablesToModel].to_numpy()
    y_train = DF_train['Category'].to_numpy()
    X_test = DF_test[VariablesToModel].to_numpy()
    y_test = DF_test['Category'].to_numpy()
    DF_train[VariablesToModel].to_pickle( path_output + 'X_Train.pkl')
    DF_train['Category'].to_pickle( path_output + 'y_Train.pkl')
    DF_test[VariablesToModel].to_pickle( path_output + 'X_Test.pkl')
    DF_test['Category'].to_pickle( path_output + 'y_Test.pkl')
    ### Features scaling
    scaler = preprocessing.StandardScaler().fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)
    ### Save the scaler
    scaler_file = "scaler.save"
    joblib.dump(scaler, path_output + '/' + scaler_file)
    return X_train, X_test, y_train, y_test
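The two key tricks above, parsing the mass out of the sample name and assigning a random mass to the background, can be seen in isolation on toy data (the 'Sample' values are illustrative):

```python
import numpy as np
import pandas as pd

# toy stand-in for the signal dataframe: 'Sample' holds names like 'ggFH1000'
df_sig = pd.DataFrame({'Sample': ['ggFH300', 'ggFH1000', 'ggFH3000']})

# strip the 'ggFH' prefix and convert the remainder to a number
df_sig['TruthMass'] = pd.to_numeric(df_sig.Sample.replace({'ggFH': ''}, regex=True))
print(df_sig['TruthMass'].tolist())  # [300, 1000, 3000]

# the background gets a parameter value drawn from the signal mass distribution,
# so the network cannot separate the classes using the parameter alone
rng = np.random.RandomState(123)
bkg_mass = rng.choice(df_sig['TruthMass'], size=5)
print(bkg_mass)
```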

Once the model has been trained (the procedure is exactly the same as before, so just re-use the previous methods), you can also test the model on any samples you like and, in particular, re-evaluate it for each "truth value" of the parameter:


def PDNNResultsOnlyTest(path_output, model, scaler, samples, VariablesToModel):
    ### read the DFs
    for s in samples:
        print('{:<20} {:<15}'.format('Reading DataFrame ', Fore.GREEN+str(path_output + s + '_Train.pkl')))
        DF = pd.read_pickle( path_output + s + '_Train.pkl')
        masses = [300, 700, 1000, 2000, 3000]
        for i in masses:
            DF['TruthMass'] = i
            X = DF[VariablesToModel].to_numpy()
            y = DF['Category'].to_numpy()
            ### Features scaling
            X = scaler.transform(X)
            ### Evaluate the model on Test
            print('Running model prediction on Xtest')
            yhat = model.predict(X, batch_size=2048)
            bins = np.linspace(0, 1, 50)
            plt.hist(yhat, bins=bins, histtype='step', lw=2, alpha=0.5, label=[r'' + s + ' ' + str(i)], density=True)
            ### add the score to the DF
            DF['DNNScore_' + str(i)] = yhat
        DF.to_pickle( path_output + s + '_WithScore.pkl')
    plt.ylabel('Norm. Entries')
    plt.xlabel('DNN score')
    plt.legend(loc="upper center")
    plt.savefig(path_output+"/MC_Data_NewSamples_Score.pdf")
    plt.clf()

Make the function MakeStackPlot parametrised according to the variable to plot:


def MakeStackPlot(path_output, variable):
    Zjets = pd.read_pickle( path_output + 'Zjets_WithScore.pkl')
    Diboson = pd.read_pickle( path_output + 'Diboson_WithScore.pkl')
    ttBar = pd.read_pickle( path_output + 'ttBar_WithScore.pkl')
    Data15 = pd.read_pickle( path_output + 'Data15_WithScore.pkl')
    Data16 = pd.read_pickle( path_output + 'Data16_WithScore.pkl')
    d = [Data15, Data16]
    Data = pd.concat( d )
    bins = np.linspace(0, 1, 50)
    plt.hist( Zjets[variable], bins=bins, stacked=True, lw=2, alpha=0.5, label='Zjets', weights=Zjets['weight'] )
    plt.hist( Diboson[variable], bins=bins, stacked=True, lw=2, alpha=0.5, label='Diboson', weights=Diboson['weight'] )
    plt.hist( ttBar[variable], bins=bins, stacked=True, lw=2, alpha=0.5, label='ttBar', weights=ttBar['weight'] )
    _ = skh_plt.hist(Data[variable], bins=bins, errorbars=True, histtype='marker', label='Data', color='black')
    plt.yscale('log')
    plt.ylabel('Entries')
    plt.xlabel('DNN Score')
    plt.legend(loc="upper right")
    plt.savefig(path_output+"/StackPlot_" + variable + ".pdf")
    plt.clf()

You can then quickly plot the scores for all the mass hypotheses you have:


h.MakeStackPlot(path_output, 'DNNScore_300')

h.MakeStackPlot(path_output, 'DNNScore_700')

h.MakeStackPlot(path_output, 'DNNScore_1000')

h.MakeStackPlot(path_output, 'DNNScore_2000')

h.MakeStackPlot(path_output, 'DNNScore_3000')
