Boosted Decision Trees

Introduction

This page summarises the studies on Boosted Decision Trees (BDTs) carried out as part of the MVA algorithm benchmarking in CMS.

Algorithm configuration

Comparative studies/configuration optimisation

  • BDT (TMVA implementation) has multiple internal parameters. Different configurations will be studied to find the optimal combination.
  • Separation criteria for node splitting - comparative studies of the following indices, each written in terms of the signal purity $p = s/(s+b)$ of a node containing $s$ signal and $b$ background events (evaluated numerically in the sketch after this list):
    • Gini Index [default]: $G = p\,(1-p)$
      • Or in another form: $G = \frac{s\,b}{(s+b)^2}$
    • Cross entropy: $C = -p\ln p - (1-p)\ln(1-p)$
    • Misclassification error: $E = 1 - \max(p,\,1-p)$
    • Statistical significance: $S = s/\sqrt{s+b}$
    • Gini Index with Laplace is listed in the parameters but not described in the documentation:
      • It arises from adding a small correction to $p$, replacing it by $p' = \frac{s+1}{s+b+2}$, so that $G_L = p'(1-p') = \frac{(s+1)(b+1)}{(s+b+2)^2}$
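
The formulas above can be checked numerically with a minimal standalone C++ sketch (independent of TMVA; the example node population is chosen arbitrarily):

    // separation.cc - evaluate the node-splitting criteria listed above
    // for a node containing s signal and b background events.
    #include <cmath>
    #include <cstdio>
    #include <algorithm>

    double purity(double s, double b)        { return s / (s + b); }
    double gini(double s, double b)          { double p = purity(s, b); return p * (1. - p); }
    double crossEntropy(double s, double b)  { double p = purity(s, b);
                                               return -p * std::log(p) - (1. - p) * std::log(1. - p); }
    double misclassError(double s, double b) { double p = purity(s, b); return 1. - std::max(p, 1. - p); }
    double significance(double s, double b)  { return s / std::sqrt(s + b); }
    // Laplace-corrected Gini: replace p by (s+1)/(s+b+2)
    double giniLaplace(double s, double b)   { double p = (s + 1.) / (s + b + 2.);
                                               return p * (1. - p); }

    int main() {
      const double s = 80., b = 20.;   // example node population
      std::printf("Gini            : %.4f\n", gini(s, b));
      std::printf("Gini (Laplace)  : %.4f\n", giniLaplace(s, b));
      std::printf("Cross entropy   : %.4f\n", crossEntropy(s, b));
      std::printf("Misclass. error : %.4f\n", misclassError(s, b));
      std::printf("Significance    : %.4f\n", significance(s, b));
      return 0;
    }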

TMVA parameter table

| Option | Default | Predefined Values | Description |
| NTrees | 200 | - | Number of trees in the forest (training iterations) |
| BoostType | AdaBoost | AdaBoost, Bagging, Grad | Boosting type for the trees in the forest |
| UseBaggedGrad | False | - | Use only a random sub-sample of all events for growing the trees in each iteration (only for GradBoost) |
| GradBaggingFraction | 0.6 | - | Fraction of events to be used in each iteration when UseBaggedGrad is true |
| Shrinkage | 1 | - | Learning rate for the GradBoost algorithm |
| AdaBoostBeta | 1 | - | Parameter for the AdaBoost algorithm |
| UseRandomisedTrees | False | - | Choose a random set of variables at each node splitting |
| UseNvars | 4 | - | Number of variables used if the randomised-tree option is chosen |
| UseNTrainEvents | N | - | Number of training events used in each tree building if the randomised-tree option is chosen |
| UseWeightedTrees | True | - | Use weighted trees or a simple average in classification from the forest |
| UseYesNoLeaf | True | - | Use Sig or Bkg categories, or the purity, as the classification of the leaf node |
| NodePurityLimit | 0.5 | - | In boosting/pruning, nodes with purity > NodePurityLimit are signal; background otherwise |
| SeparationType | GiniIndex | CrossEntropy, GiniIndex, GiniIndexWithLaplace, MisClassificationError, SDivSqrtSPlusB | Separation criterion for node splitting |
| nEventsMin | max(20, NEvtsTrain/NVar^2/10) | - | Minimum number of events required in a leaf node |
| nCuts | 20 | - | Number of steps during node-cut optimisation |
| PruneStrength | -1 | - | Pruning strength; a negative number means the optimum is found by TMVA |
| PruneMethod | CostComplexity | NoPruning, ExpectedError, CostComplexity | Method used for pruning (removal) of statistically insignificant branches |
| PruneBeforeBoost | False | - | Flag to prune the tree before applying the boosting algorithm |
| PruningValFraction | 0.5 | - | Fraction of events to use for optimising automatic pruning |
| NNodesMax | 100000 | - | Maximum number of nodes in a tree |
| MaxDepth | 100000 | - | Maximum depth allowed for the decision tree |
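
For reference, a minimal sketch of how these options are passed to TMVA when booking the method; the file name, tree names, and variables (input.root, SignalTree, var1, var2) are placeholders:

    #include "TFile.h"
    #include "TTree.h"
    #include "TMVA/Factory.h"
    #include "TMVA/Types.h"

    void trainBDT() {
      // Placeholder input: a file containing signal and background trees
      TFile* input  = TFile::Open("input.root");
      TTree* sig    = (TTree*)input->Get("SignalTree");
      TTree* bkg    = (TTree*)input->Get("BackgroundTree");

      TFile* output = TFile::Open("TMVA_BDT.root", "RECREATE");
      TMVA::Factory factory("TMVAClassification", output, "AnalysisType=Classification");

      factory.AddVariable("var1", 'F');   // placeholder input variables
      factory.AddVariable("var2", 'F');
      factory.AddSignalTree(sig, 1.0);
      factory.AddBackgroundTree(bkg, 1.0);
      factory.PrepareTrainingAndTestTree("", "SplitMode=Random:NormMode=NumEvents");

      // Options from the table above, colon-separated in the booking string
      factory.BookMethod(TMVA::Types::kBDT, "BDT",
                         "NTrees=200:BoostType=AdaBoost:AdaBoostBeta=1:"
                         "SeparationType=GiniIndex:nCuts=20:"
                         "PruneMethod=CostComplexity:PruneStrength=-1");

      factory.TrainAllMethods();
      factory.TestAllMethods();
      factory.EvaluateAllMethods();
      output->Close();
    }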

Training

  • Study the performance of the classifier as a function of the number of (uncorrelated) input variables
  • Study the number of training events required, as a function of the number of (uncorrelated) input variables, to reach a given classification performance
  • Overtraining - check whether the behaviour below (observed by Harrison Prosper, CMS MVA Workshop 2007) is confirmed and, if so, find explanations; a sketch of a standard overtraining check follows the figure
(Figure: bdtovertrainingpoint.png - BDT overtraining behaviour)
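
One common overtraining diagnostic compares the BDT response on the training and test samples. A minimal sketch, assuming the default TMVA output layout (the Method_BDT/BDT directory and histogram names are taken from standard TMVA output and should be checked against the actual file):

    #include "TFile.h"
    #include "TH1.h"
    #include <cstdio>

    void overtrainingCheck() {
      // Open the TMVA output file produced by the training job
      TFile* f = TFile::Open("TMVA_BDT.root");

      // Assumed default TMVA locations of test and training response histograms
      TH1* sigTest  = (TH1*)f->Get("Method_BDT/BDT/MVA_BDT_S");
      TH1* sigTrain = (TH1*)f->Get("Method_BDT/BDT/MVA_BDT_Train_S");
      TH1* bkgTest  = (TH1*)f->Get("Method_BDT/BDT/MVA_BDT_B");
      TH1* bkgTrain = (TH1*)f->Get("Method_BDT/BDT/MVA_BDT_Train_B");

      // Kolmogorov-Smirnov probability that the train and test responses come
      // from the same distribution; small values hint at overtraining
      std::printf("KS probability (signal)    : %.3f\n", sigTest->KolmogorovTest(sigTrain));
      std::printf("KS probability (background): %.3f\n", bkgTest->KolmogorovTest(bkgTrain));
    }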

Preprocessing

  • Study the effect of different pre-processing transformations of the input data on the BDT performance (see the sketch after this list)
  • Study how the amount of correlation between input variables influences the classifier's performance
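
TMVA exposes input-variable transformations through the VarTransform option of the booking string, so the effect of pre-processing can be studied by booking otherwise identical BDTs. A minimal sketch, assuming a factory set up as in the earlier example:

    #include "TMVA/Factory.h"
    #include "TMVA/Types.h"
    #include "TString.h"

    // Book otherwise identical BDTs that differ only in input pre-processing.
    // Letters: D = decorrelation, P = PCA, G = gaussianisation.
    void bookPreprocessedBDTs(TMVA::Factory& factory) {
      const TString base = "NTrees=200:BoostType=AdaBoost:SeparationType=GiniIndex";
      factory.BookMethod(TMVA::Types::kBDT, "BDT_raw",   base);
      factory.BookMethod(TMVA::Types::kBDT, "BDT_decor", base + ":VarTransform=D");
      factory.BookMethod(TMVA::Types::kBDT, "BDT_pca",   base + ":VarTransform=P");
      factory.BookMethod(TMVA::Types::kBDT, "BDT_gauss", base + ":VarTransform=G");
    }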

Ensemble learning algorithms

  • Comparative studies of (a booking sketch follows this list):
    • AdaBoost (adaptive boosting)
    • GradBoost (gradient boosting)
    • Bagging
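
The three boosting schemes can be compared within a single factory by varying only BoostType; a sketch assuming the same placeholder setup as above, with the GradBoost-specific options taken from the parameter table:

    #include "TMVA/Factory.h"
    #include "TMVA/Types.h"

    // Book one BDT per boosting scheme so that TMVA's evaluation stage
    // compares them on the same training/test split.
    void bookBoostComparison(TMVA::Factory& factory) {
      factory.BookMethod(TMVA::Types::kBDT, "BDT_Ada",
                         "NTrees=200:BoostType=AdaBoost:AdaBoostBeta=1");
      factory.BookMethod(TMVA::Types::kBDT, "BDT_Grad",
                         "NTrees=200:BoostType=Grad:Shrinkage=1:"
                         "UseBaggedGrad=True:GradBaggingFraction=0.6");
      factory.BookMethod(TMVA::Types::kBDT, "BDT_Bag",
                         "NTrees=200:BoostType=Bagging");
    }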

-- MarkRTurner - 07-Jun-2011
