NN for JER improvement

Getting prepared

Here is how to use the software package Adrian wrote and put in svn as "HiggsAnalysis" in order to do the VH analyses with the help of the GlaNtp package. We will check out the HiggsAnalysis package.

First you have to check out and build the latest versions of the GlaNtp and HiggsAnalysis packages.

GlaNtp: AdrianBuzatuZHFrameworkGlasgowGlaNtpCheckout

HiggsAnalysis: AdrianBuzatuZHFrameworkGlasgowGlaNtpHiggsAnalysisCheckout

Then we define the environment variables in a new xterm. First, log in:

ssh -Y ppepc137.physics.gla.ac.uk

Then open an xterm terminal.


Go to the HiggsAnalysis package

Do you want to run a tagged version of HiggsAnalysis? Typically users who are not developers of HiggsAnalysis want that. Go to the tagged version you already have installed.

cd $HOME/public_ppe/HiggsAnalysis/HiggsAnalysis-00-00-03

Do you want to run the trunk (head) version of HiggsAnalysis? Typically developers of HiggsAnalysis want that (Adrian is the creator of this package and always uses this, so the tagged version will not be the latest version, but the latest stable one).

cd $HOME/public_ppe/HiggsAnalysis/trunk

Setup GlaNtp and others (short practical version)

Do you want to set up a tagged version of GlaNtp? Typically users who are not developers of GlaNtp want that. Set up the tagged version you already have installed. The tag version is hard-coded in the script.

source setupTag.sh

Do you want to set up the trunk (head) version of GlaNtp? Typically developers of GlaNtp want that (Adrian has added some executables in the RootUtil/test folder of GlaNtp, so Adrian uses this).

source setupTrunk.sh

Be careful to always source and never execute these scripts. If you execute them, the setup is done only in a child shell and the new environment variables do not exist afterwards. To check that the setup worked, verify for example that ROOT is now defined and that one command from GlaNtp is accessible to you:

which root
which CheckEntriesrv5
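The difference between sourcing and executing can be demonstrated with a toy script (the script name and variable below are made up for illustration, not part of GlaNtp):

```shell
# Toy demonstration of source vs execute (names are hypothetical)
cat > /tmp/demo_setup.sh <<'EOF'
export DEMO_VAR=set_by_setup
EOF
chmod +x /tmp/demo_setup.sh

/tmp/demo_setup.sh          # executed: runs in a child shell
echo "after execute: '${DEMO_VAR}'"   # prints empty: the export is lost

source /tmp/demo_setup.sh   # sourced: runs in the current shell
echo "after source:  '${DEMO_VAR}'"   # prints set_by_setup
```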

Setup GlaNtp and others (long fancy useless version)

Skip this section if you already set up GlaNtp the quick way in the section above.

Define the environment values we will use:

export WORKING_AREA=$HOME/public_ppe
export PACKAGE_GN=GlaNtpPackage
export PACKAGE_GN_TAG="00-00-66" # for the tagged version
export PACKAGE_GN_TAG="h"        # for the trunk (head) version

Set up the already checked out version of GlaNtp:

source $HOME/public_ppe/scripts/setup_glantp.sh -v $PACKAGE_GN_TAG

Check that you can access one of the commands in GlaNtp:

which CheckEntriesrv5

Now that we are done with setting up GlaNtp, we define the environment variables for the package HiggsAnalysis.

export PACKAGE_HA=HiggsAnalysis
export PACKAGE_HA_TAG="00-00-03" # for the tagged version
export PACKAGE_HA_TAG="h"        # for the trunk (head) version (not used anywhere, but for consistency with GlaNtp)

Now we are ready to do the analyses, so we go to the appropriate HiggsAnalysis folder, from which we run everything.


To check that we are indeed here


Prepare the Input Flattuple

Mike produced flattuples on the grid using AtlasHbb, and I copied them to my afs area. You should copy them to your afs area or to another location, using the naming format "flattuple_Process-1.root", where Process is one of "WH115", "WH120", "WH130", "ZH115", "ZH120", "ZH125", "ZH130". We use several processes so that we are not biased by a single process and Higgs mass, but also to have more statistics when training and validating our neural network.

cd /afs/phas.gla.ac.uk/data/atlas/abuzatu06/fromMike/flattuples
ls -l

Next we need to thin these flattuples to keep only the events that we want, so that we loop through fewer events and the training is quicker. We create a new folder for the thinned files, which keep the same names.

mkdir /afs/phas.gla.ac.uk/data/atlas/abuzatu06/fromMike/flattuples2

The unit command to thin a flattuple using the new script Adrian added to GlaNtp is this one:

ThinFlattuplerv5 /afs/phas.gla.ac.uk/data/atlas/abuzatu06/fromMike/flattuples/flattuple_ZH120-1.root /afs/phas.gla.ac.uk/data/atlas/abuzatu06/fromMike/flattuples2/flattuple_ZH120-1.root physics globaldata 2 11000000000000001111111111111100 00001111111000000011001111111100 0

In the "ThinFlattuplerv5" command we can give the cutmask and invertword in base 2 or base 10. Passing "0" as the last parameter runs over all events; a positive number limits the run to that many events. The cutmask above makes sure that we have only 2 jets and that both are b-tagged, without looking at the charged lepton. Therefore, this cutmask can be used for both WH and ZH.
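For example, the base-10 equivalents of the two binary words above can be computed with bash arithmetic:

```shell
# Convert the binary cutmask and invertword above to base 10 with bash
cutmask_b2=11000000000000001111111111111100
invertword_b2=00001111111000000011001111111100
cutmask_b10=$(( 2#${cutmask_b2} ))
invertword_b10=$(( 2#${invertword_b2} ))
echo "cutmask:    ${cutmask_b2} = ${cutmask_b10}"
echo "invertword: ${invertword_b2} = ${invertword_b10}"
```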

Since we now have several processes, we create a script to run over all of them with one command. Besides the thinning, the script will print how many events were generated and how many remain after the selection. We can see that about 10% of the events remain.

cd ~abuzatu/public_ppe/HiggsAnalysis/trunk/JER
./doThinTrees.sh 11000000000000001111111111111100 00001111111000000011001111111100 0         WH115+WH120+WH130+ZH115+ZH120+ZH125+ZH130
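The script itself is not reproduced here; a minimal sketch of what doThinTrees.sh might look like, assuming it simply splits the "+"-separated process list and runs the unit command once per process (written as a function, and only echoing the command as a dry run so the loop logic can be seen):

```shell
# Hypothetical sketch of doThinTrees.sh: one ThinFlattuplerv5 call per
# process; the command is only echoed here (dry run)
do_thin_trees() {
  local CUTMASK=$1 INVERTWORD=$2 NEVENTS=$3 PROCESSES=$4
  local IN=/afs/phas.gla.ac.uk/data/atlas/abuzatu06/fromMike/flattuples
  local OUT=/afs/phas.gla.ac.uk/data/atlas/abuzatu06/fromMike/flattuples2
  local PROCESS
  for PROCESS in ${PROCESSES//+/ }; do
    echo ThinFlattuplerv5 ${IN}/flattuple_${PROCESS}-1.root \
      ${OUT}/flattuple_${PROCESS}-1.root physics globaldata 2 \
      ${CUTMASK} ${INVERTWORD} ${NEVENTS}
  done
}
do_thin_trees 11000000000000001111111111111100 00001111111000000011001111111100 0 \
  WH115+WH120+WH130+ZH115+ZH120+ZH125+ZH130
```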

Now we have these thinned ntuples in the folder

ls -lh /afs/phas.gla.ac.uk/data/atlas/abuzatu06/fromMike/flattuples2

And we want to merge all of them into one file.

cd ~abuzatu/public_ppe/HiggsAnalysis/trunk/JER
./doMergeTrees.sh  WH115+WH120+WH130+ZH115+ZH120+ZH125+ZH130 ALL
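doMergeTrees.sh is not reproduced here either; the merging itself can be done with ROOT's standard hadd utility, so a sketch of what the script might do (again written as a function that only echoes the command, as a dry run) is:

```shell
# Hypothetical sketch of doMergeTrees.sh: merge the thinned files of all
# listed processes into one file with ROOT's hadd (command echoed only)
do_merge_trees() {
  local PROCESSES=$1 OUTNAME=$2
  local DIR=/afs/phas.gla.ac.uk/data/atlas/abuzatu06/fromMike/flattuples2
  local PROCESS INPUTS=""
  for PROCESS in ${PROCESSES//+/ }; do
    INPUTS="${INPUTS} ${DIR}/flattuple_${PROCESS}-1.root"
  done
  echo hadd ${DIR}/flattuple_${OUTNAME}-1.root${INPUTS}
}
do_merge_trees WH115+WH120+WH130+ZH115+ZH120+ZH125+ZH130 ALL
```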

Now you see that a new file was created, "flattuple_ALL-1.root", obtained by merging files with a total of 399843 generated events; the physics tree now contains a total of 39490 events, each with two jets. So we indeed have larger statistics than from any single file alone.

Prepare the Training and Testing Trees

We will use the macro "bjets_perEvent.C", which is called from the script "doCreateTrainingTrees.sh". Some settings are hard-coded in the .C and .sh files, if you want to modify them. By default it takes the file from the path above and saves in the "root" folder inside the "JER" folder. While running, the script prints a detailed account of what it did.

Basically, it loops over all events and keeps only those for which both reconstructed jets are within dR of 0.4 of their matched truth jets. If the number of events is odd, we remove the last event, as we need an even number of events for the booking. For these events we fill the trees with jets, as we will do the training on a jet-by-jet basis. We create a training tree (with the odd-numbered events: first, third, etc., with both the leading and subleading jet of each event) and a testing tree (the same for the even-numbered events: second, fourth, etc.). We also fill a third tree with all jets, reordered so that all leading jets come first, in order, followed by all subleading jets, in order. We pass this third tree to the MLP neural network, which splits it into a training tree and a testing tree identical to the ones we filled ourselves. We do this in order to be able to do closure and overtraining tests of our own, with any histogram we want.
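The odd/even split can be illustrated with a toy loop over event indices (indices only; the real macro fills ROOT trees with the jet variables):

```shell
# Toy illustration of the split: odd-numbered events go to the training
# tree, even-numbered events to the testing tree; both jets of an event
# always land in the same tree
N_EVENTS=10        # assumed even (the macro drops the last event if odd)
training="" ; testing=""
for (( i=1; i<=N_EVENTS; i++ )); do
  if (( i % 2 == 1 )); then
    training="${training} ${i}"
  else
    testing="${testing} ${i}"
  fi
done
echo "training events:${training}"
echo "testing events: ${testing}"
```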


The output is in the folder root

ls -lh root

If you look inside the file, you see the three trees

root.exe root/bjets_NN.root
TBrowser a

Choose Variables to Train on

We also create the analogues of Figures 3-7 from the CDF NIM. We plot several variables for jets with generator-level Pt smaller and larger than a cut (chosen by me to be 60 GeV instead of the 50 GeV used at CDF). If there are shape differences, then the variable is able to help correct the reconstructed Pt towards the generator-level Pt.


Train the NN

Now we train the artificial neural network using the MLP NN that comes with ROOT (later we can use a version that comes with TMVA). We use the file "bjets_parametrization.C" called from the script "doParametrization.sh". You need to edit "doParametrization.sh" to set the number of epochs of training, and to choose "status2" or "status3" for the truth jets. You need to edit "bjets_parametrization.C" in order to decide which input variables to use, how many hidden layers you want, and which target (output) variables to learn. The target can be one variable, such as Pt, or several, such as the entire four-vector (Pt, Eta, Phi, E).


We will see how the learn error (from the training tree) decreases with more epochs, and how the test error (from the testing tree) also decreases at first, but at some point starts increasing again (beyond that number of epochs you start to overtrain, by simply memorizing the training events).

Training the Neural Network
Epoch: 0 learn=0.214905 test=0.213905
Epoch: 10 learn=0.191387 test=0.190204
Epoch: 20 learn=0.187376 test=0.186353
Epoch: 30 learn=0.185891 test=0.184649
Epoch: 40 learn=0.185136 test=0.183819
Epoch: 50 learn=0.185028 test=0.183846
Epoch: 60 learn=0.184952 test=0.183715
Epoch: 70 learn=0.18494 test=0.183695
Epoch: 80 learn=0.184896 test=0.183734
Epoch: 90 learn=0.184791 test=0.183644
Epoch: 99 learn=0.184776 test=0.183667
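From such a log you can read off the epoch with the smallest test error, which is roughly where training should stop to avoid overtraining. For example (the log lines are pasted inline here; in practice you would pipe in the job output):

```shell
# Pick out the epoch with the minimum test error from an MLP training log
best=$(awk '/^Epoch:/ { split($4, t, "=")
                        if (b == "" || t[2] + 0 < b + 0) { b = t[2]; e = $2 } }
            END { print "best epoch: " e " test error: " b }' <<'EOF'
Epoch: 0 learn=0.214905 test=0.213905
Epoch: 50 learn=0.185028 test=0.183846
Epoch: 90 learn=0.184791 test=0.183644
Epoch: 99 learn=0.184776 test=0.183667
EOF
)
echo "${best}"
```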

Below you also see how much each variable contributed to the training. The numbers come from the testing tree.

@Pt -> 0.0595232 +/- 0.061599
@svp_M -> 0.00182973 +/- 0.00182009
@svp_Lxy -> 0.00960417 +/- 0.00937878

For example, above we see that the Pt really dominates.

Our script then also saves to our webpage the testing plots produced automatically by MLP:

Info in <TCanvas::Print>: eps file /afs/phas.gla.ac.uk/user/a/abuzatu/public_html/JER/NN/bjets_NN.eps has been created
Info in <TCanvas::Print>: pdf file /afs/phas.gla.ac.uk/user/a/abuzatu/public_html/JER/NN/bjets_NN.pdf has been created

The training is exported as a .cxx and a .h file produced in the local folder; the .sh script moves them to the "NN" folder. Ideally we would want some text file to be created instead, so that we do not have to recompile the files that use the NN every time we retrain. Unfortunately we do not know yet how to do this. This means that we need to include the .h file by name in each file that uses the NN, and change the name in the body of those files as well.
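Until a better export is found, the renaming step can at least be scripted; a sketch (the class and file names below are hypothetical, not the actual ones in HiggsAnalysis):

```shell
# Hypothetical sketch: after MLP exports a new class, point the consumer
# file at the new header and class name with sed
OLD=bjets_NN_v1 ; NEW=bjets_NN_v2    # hypothetical generated class names
CONSUMER=/tmp/useTheNN.C             # hypothetical file that uses the NN
printf '#include "%s.h"\n%s nn;\n' "${OLD}" "${OLD}" > "${CONSUMER}"
sed -i "s/${OLD}/${NEW}/g" "${CONSUMER}"
cat "${CONSUMER}"
```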

ls -lh NN

Perform Closure Test

We want to see if, on the training tree, we were able to reproduce what we wanted in the first place. This is called a closure test. For this we use the training tree we created, with the macro "bjets_perJet.C" and the script "doClosureTest.sh". Of course, we can then make the same plots for the testing tree to check for overtraining, and later we could change the macros to overlay the training and testing trees.


This produces plots for several variables: Pt, Eta, Phi, E, M. For each variable, the figure in fact contains two plots. The plot on the left shows the ratio between the truth variable and the reconstructed one (blue), and between the value the NN learned and the reconstructed one (red). Basically, the goal of the NN is for the red shape to match the blue shape. If they agree, the training was able to learn the features of the training sample. If not, we need to redo the training with different input variables, different hidden layers, different output variables, etc. The right plot shows the absolute values of the truth variable (blue), the reconstructed variable (black), and what the NN has learned (red). The goal is to go from black to blue; what we actually do with the NN is to go from black to red. Again, the goal is for the red shape to agree with the blue shape. At the moment the code corrects only Pt and leaves the other variables unchanged, so the red will agree with the black for Eta, Phi, E and M. But in the future we can change those as well, to try to correct the entire four-vector.

Perform Overtraining Test

If you believe the training worked, it is time to check for overtraining, which means producing the same plots for an orthogonal sample of jets not used in the training, i.e. the testing tree. Just run the same macro on the testing tree instead of the training tree.


If the red and the blue agree again, it means you did not overtrain. You are then ready to make plots on an event-by-event basis, which means correcting both jets in the event and then making plots like the invariant mass of the two jets. Then check the plots on the webpage.

Produce Event-by-Event plots (like mbb)

We will use the macro "bjets_perEvent.C" once more and the script "doMbb.sh".


Then check the mbb plot on the webpage, as well as the rms and mean values in the text output.

RMS  r=16.218 n=16.0144 t=3.35711e-05 n/r=0.987447
Mean r=102.717 n=117.563 t=115 n/r=1.14454 r/t=0.893187 n/t=1.02229

Here r = reconstructed; t = truth (status2 or status3, depending on what you chose); n = reconstructed corrected with the help of the NN. You want a mean closer to the Higgs mass of your sample (n/t closer to 1) to improve the mass scale, and a smaller RMS to improve the mass resolution (n/r smaller than 1.0).
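These ratios can be recomputed directly from the printed values (the numbers below are the ones from the output above):

```shell
# Recompute the resolution and scale ratios from the printed numbers
r_rms=16.218 ; n_rms=16.0144                   # RMS: reconstructed, NN-corrected
r_mean=102.717 ; n_mean=117.563 ; t_mean=115   # means
awk -v nr="${n_rms}" -v rr="${r_rms}" -v nm="${n_mean}" \
    -v rm="${r_mean}" -v tm="${t_mean}" \
    'BEGIN { printf "RMS  n/r = %.4f (want < 1 for better resolution)\n", nr / rr
             printf "Mean n/r = %.4f\n", nm / rm
             printf "Mean n/t = %.4f (want ~ 1 for better scale)\n", nm / tm }'
```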

Note that when we created the trees for training we used all processes merged ("ALL"), but now that we want an improved resolution we can check the mbb plots for the different processes and masses separately ("WH115", for example). So we put the process in the name of the file, and you can repeat the test for all processes.

Produce Latex Presentation

Then we automatically create or update the .pdf presentation produced with beamer in LaTeX, so that after each new training of the neural network we can easily see all the plots together.


The presentation will be on our webpage.


Back to mother page: AdrianBuzatuZHFrameworkGlasgowGlaNtp.

-- AdrianBuzatu - 14-Aug-2012
