Page for resources related to Machine Learning (ML) topics within the B-Physics and Light States (BLS) group
Introduction
This page aims to serve the B-physics community (and anyone else with an interest), providing links to various ML resources, a quick setup guide for creating a virtual environment, an overview of common ML software and techniques used within the BLS group and ATLAS, and a summary of current toolkits/methods within the BLS group. Where possible, links to some basic scripts are provided.
The ATLAS machine learning forum (MLF) does not encourage the use of TMVA for ML practices; however, the TMVA package offered by ROOT is still very powerful and will be discussed here.
Neural Network (NN)
Boosted Decision Trees (BDTs)
A decision tree is a binary tree structure classifier, visually similar to a flow chart, where repeated (yes/no) decisions are taken on a single variable at a time until a stop criterion is applied. The phase space is split into many regions that are signal-like or background-like.
Boosting a decision tree extends this concept from a single tree to several trees, forming a forest. This overcomes a decision tree's instability with respect to statistical fluctuations in the training sample from which the tree structure is derived.
A BDT takes a set of input variables with labelled binary classes, and finds the optimal threshold on one of these observables to separate the classes. The point of a split is called a node, and the subsequent path resulting from a decision is called a branch. The splitting into daughter nodes may continue indefinitely until full separation occurs, or it may be limited by a maximum tree depth (MaxDepth) or a minimum number of events in a terminal node (MinNodeSize). The end points of the splitting process are called leaves.
The trees are derived from the same training ensemble by re-weighting events, and are combined into a single classifier given by a weighted average of the individual decision trees; the final discriminant that is formed is called the BDT response. Boosting increases the stability of the classifier and improves the separation performance compared to a single tree.
Some concepts related to BDTs are listed below.
Separation Index
To determine the optimal location on an observable at which to separate the classes, a separation index such as the Gini index or the cross-entropy index is chosen, defined in terms of the purity of the classification of signal and background events. If a node contains a pure sample of one class, the separation index is very small; if it contains a mixture of classes, the separation index is large. By the end of a tree, the separation index should be small. Cuts and variables are therefore chosen to minimise the separation index and maximise the purity of a class, as illustrated in the sketch below.
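As a rough illustration (not taken from TMVA itself), the sketch below shows how the Gini and cross-entropy indices behave as functions of the node purity p; the function names are just for illustration.

import numpy as np

def gini_index(p):
    # Gini separation index: small for pure nodes (p near 0 or 1), largest at p = 0.5.
    return p * (1.0 - p)

def cross_entropy_index(p):
    # Cross-entropy separation index, clipped to avoid log(0).
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return -p * np.log(p) - (1.0 - p) * np.log(1.0 - p)

# Purity p = (signal events) / (signal + background events) in a node.
for p in (0.01, 0.5, 0.99):
    print(f"p = {p:.2f}  Gini = {gini_index(p):.3f}  cross-entropy = {cross_entropy_index(p):.3f}")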
Boosting
A single decision tree is a weak learner. Adding a lot of complexity can lead to overtraining, while a less complex tree leads to less overtraining but limited performance (less pure leaves of a particular class and more event misclassification).
Boosting can improve the algorithm. After an initial tree is built, misclassified events are re-weighted so that they have more importance in the next tree's decisions, altering the separation index and building a new tree. The results of this tree are weighted and the process continues until a stop criterion applies.
Boosting works best when applied to trees that individually do not have much classification power. These weak classifiers are small trees, limited in growth to a typical tree depth as small as two, depending on the interaction between different input variables.
The output of the forest of trees, derived from the same training sample, is the weighted performance of each tree, and their combined performance makes them a strong learner.
You want each tree in a BDT to be a weak learner, but the ensemble of trees to be a strong learner. Making a single tree a good learner is actually very difficult without overtraining or growing a massive tree, so there is a trade-off. Instead, many decision trees are combined with a boosting method to achieve good performance: each successive tree is there to minimise the mistakes of the previous tree. When the scores/weights from every tree are combined for each event, the result is the BDT output plot. A minimal sketch of this idea is given below.
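As an illustration of the boosting idea (using scikit-learn rather than the TMVA implementation used in ATLAS), the sketch below combines many shallow trees with AdaBoost on a made-up toy dataset.

# Illustrative only: scikit-learn AdaBoost with shallow (weak) trees,
# not the TMVA BDT used in the analyses described on this page.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy "signal"/"background" dataset; in practice these would be analysis ntuples.
X, y = make_classification(n_samples=5000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Each tree is a weak learner (max_depth=2); boosting re-weights misclassified
# events so that the next tree focuses on them, and the forest is the
# weighted combination of all trees.
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2), n_estimators=400)
bdt.fit(X_train, y_train)
print("Test accuracy:", bdt.score(X_test, y_test))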
Pruning and Bagging
Overtraining can be minimised through pruning and bagging. Bagging uses a randomly chosen subset of the training sample for each tree. This can increase performance because the forest is less likely to be tuned precisely to the full training data, with outlier events being heavily boosted. Pruning is cutting back a tree from the bottom up after it has been built to its maximum size, removing leaves of unimportant nodes without reducing performance, which reduces overtraining.
Pruning is unnecessary when using boosting algorithms, which work on weak classifiers whose depth is already limited.
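A minimal sketch of bagging, again using scikit-learn on a toy dataset rather than TMVA, where each tree is trained on a random subset of the training sample:

# Illustrative only: bagging trains each tree on a random subset of the
# training sample, reducing sensitivity to fluctuations in any single sample.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=8, random_state=0)  # toy data
forest = BaggingClassifier(DecisionTreeClassifier(max_depth=4),
                           n_estimators=100,
                           max_samples=0.5)   # each tree sees a random 50% of events
forest.fit(X, y)
print("Training accuracy:", forest.score(X, y))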
Training (ELI5)
This section is very much an 'explain it like I'm 5' (ELI5) description of how a BDT works, and introduces a number of hyper-parameters (see brackets) which can be configured by the user.
The BDT starts to build a tree, which could be up to 4 layers of nodes (MaxDepth: 4 layers of nodes means four layers of splitting; the first node does not count). Each layer uses only ONE training variable, which is divided into NCuts intervals to try and find the optimal value for splitting (NCuts: e.g. the variable mass is divided into 100 intervals and the BDT looks at each one to see how signal-like or background-like that slice is). The tree may not reach 4 layers of nodes if a node contains < 5% of events (MinNodeSize).
At each node the BDT uses the Gini index (SeparationType) to find the most powerful variable and cut value at that point to maximise the separation. The boosting algorithm AdaBoost (BoostType) then acts on the training events: since these are training events, the BDT knows exactly how each one is classified, and if at a particular node some events have been misclassified, AdaBoost assigns each misclassified event an error. The tree is then built.
A new tree is then grown and the whole process starts again, but the misclassified events of the previous tree are given a weighting to make them more prominent and to try and classify them better. At the end there will be 400 trees (NTrees).
At each layer there are some nodes, and each node in that layer uses only one of the input training variables to train on. That variable is divided into NCuts intervals. The BDT scans each interval to determine whether it looks signal-like or background-like, and uses this to determine that node's purity, i.e. how many events look background-like and how many look signal-like.
At the end, the BDT performs a calculation based on each event in each tree and its weight, and determines the BDT score for that event. A sketch of booking such a BDT in TMVA is shown below.
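A PyROOT sketch of booking a TMVA BDT with the hyper-parameters discussed above; the file, tree and variable names are placeholders, and the option values simply echo the examples in this section.

# Sketch only: replace the file, tree and variable names with your own ntuples.
import ROOT

out_file = ROOT.TFile("TMVA_output.root", "RECREATE")
factory = ROOT.TMVA.Factory("TMVAClassification", out_file,
                            "AnalysisType=Classification")
loader = ROOT.TMVA.DataLoader("dataset")

for var in ("mass", "pT", "eta"):          # placeholder training variables
    loader.AddVariable(var, "F")

sig_file = ROOT.TFile.Open("signal.root")      # placeholder input files
bkg_file = ROOT.TFile.Open("background.root")
loader.AddSignalTree(sig_file.Get("tree"), 1.0)
loader.AddBackgroundTree(bkg_file.Get("tree"), 1.0)
loader.PrepareTrainingAndTestTree(ROOT.TCut(""), "SplitMode=Random:NormMode=NumEvents")

# Hyper-parameters described in the ELI5 section above.
factory.BookMethod(loader, ROOT.TMVA.Types.kBDT, "BDT",
                   "NTrees=400:MaxDepth=4:MinNodeSize=5%:nCuts=100:"
                   "BoostType=AdaBoost:SeparationType=GiniIndex")

factory.TrainAllMethods()
factory.TestAllMethods()
factory.EvaluateAllMethods()
out_file.Close()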
Testing
The weights determined during the BDT training stage are applied to a different set of events with similar properties (the testing sample), to see how well the BDT classified events during training. The testing and training BDT score distributions should agree; testing is therefore a measure of how well the BDT performed during training. A sketch of such a comparison is shown below.
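One common way to make this comparison (an illustration, not a prescription) is a Kolmogorov-Smirnov test between the training and testing BDT score distributions; the score arrays below are placeholders for values read from the TMVA output.

# Sketch of an overtraining check: compare the BDT score distributions of
# the training and testing samples.
import numpy as np
from scipy.stats import ks_2samp

train_scores = np.random.normal(0.1, 0.3, 10000)   # placeholder for training-sample BDT scores
test_scores = np.random.normal(0.1, 0.3, 10000)    # placeholder for testing-sample BDT scores

ks_stat, p_value = ks_2samp(train_scores, test_scores)
print(f"KS statistic = {ks_stat:.3f}, p-value = {p_value:.3f}")
# A very small p-value suggests the two distributions differ, i.e. possible overtraining.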
Evaluation
Use the BDT on data for which the classification is not already known, i.e. it is not known which events are signal and which are background. A sketch of this step using the TMVA Reader is shown below.
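A sketch of applying a trained TMVA BDT to events whose class is unknown; the variable names and the weights-file path are placeholders matching the training sketch above.

from array import array
import ROOT

reader = ROOT.TMVA.Reader("!Color:Silent")
mass = array("f", [0.0])
pT = array("f", [0.0])
eta = array("f", [0.0])
reader.AddVariable("mass", mass)
reader.AddVariable("pT", pT)
reader.AddVariable("eta", eta)
reader.BookMVA("BDT", "dataset/weights/TMVAClassification_BDT.weights.xml")

# Fill the variables for each data event, then evaluate the BDT response.
mass[0], pT[0], eta[0] = 5.28, 12.3, 0.7   # example values for one event
print("BDT score:", reader.EvaluateMVA("BDT"))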
Importance of lots of training and testing events
Links
Popular ML software used within ATLAS
TMVA
Keras
Creating a Virtual Environment
ATLAS does not support Python 3 by default, therefore when setting up a virtual environment please follow these instructions exactly to avoid later conflicts:
- setupATLAS
- lsetup "root 6.22.00-python3-x86_64-centos7-gcc8-opt" # Please check compatibility if not on a CentOS 7 machine
- python3 -m venv myenv
- source myenv/bin/activate
- pip install --upgrade pip
Returning to your virtual environment:
- setupATLAS
- lsetup "root 6.22.00-python3-x86_64-centos7-gcc8-opt" # Please check compatibility if not on a CentOS 7 machine
- source myenv/bin/activate
Basic Scripts
Provided here are links to some useful scripts (please note some are a little out of date, and some features have since been deprecated and replaced with others).
Keras:
https://github.com/YaleATLAS/CERNDeepLearningTutorial/tree/93a7aeef67fc6108fe810c784a38cb06fd650cf2
Very useful! Provides a first introduction to using Python for machine learning and plotting variables using matplotlib. Very easy to follow; note that as_matrix() is deprecated and should be replaced with to_numpy().
https://github.com/nican2018/event-classification
Event classification based on jet substructure.
ML techniques used within B-Physics
J/psi + gamma analysis
The J/psi + gamma analysis uses a BDT trained with the TMVA package offered by ROOT to separate signal from background. The events are selected at random based on event ID, with a 50/50 train/test split, as sketched below.
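The exact selection is analysis specific, but a 50/50 split based on the event ID could look something like the following sketch (the event IDs here are placeholders, and parity is just one possible choice).

import numpy as np

# Placeholder event IDs; in the analysis these would be read from the ntuple.
event_ids = np.arange(100000)
is_train = event_ids % 2 == 0    # roughly 50/50 split, stable across reprocessing
print(is_train.sum(), "training events,", (~is_train).sum(), "testing events")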
Resources and Links
Papers
A continually updated list of ML papers for particle physics:
'A Living Review of Machine Learning for Particle Physics'
An introduction to Neural Networks for LHC physics:
Deep Learning and its Application to LHC Physics: https://arxiv.org/abs/1806.11484
HEP community white paper regards ML:
Machine Learning in High Energy Physics Community White Paper: https://arxiv.org/abs/1807.02876
--
AmyTee - 2020-10-29