MVA Algorithms and Tools

Introduction

This page collects information relevant to the study of MVA techniques inside CMS.

Topics for study (to be continued...)

  1. Methods and tools for multi-dimensional validation of input data (MC-data comparison) - useful for any type of analysis (proponents: Harrison Prosper & Pushpa Bhat)
  2. Variable selection
    • Survey methods/algorithms proposed in the literature, perform comparative studies, propose methods to be used in CMS
  3. Pre-processing
    • Compare the current methods/algorithms implemented in the CMS MVA tool/framework and TMVA -> extract conclusions
    • Survey other methods/algorithms available (for example what NeuroBayes does, other methods proposed in HEP or outside of HEP)
    • Proposal for new methods to be implemented (or interfaced) in CMS MVA tool
    • Implement the proposed/agreed methods, study their behaviour, compare them with the previously implemented methods
    • Develop recommendations on what/when to be used (one or more methods, depending on the behaviour and performance of the method)
    • Maintain and document the pre-processing part of the CMS MVA tool
  4. Discriminants (algorithms to determine the discriminant function)
    1. Comparison of the different software implementations - for each algorithm:
      1. make a survey of what software implementations are commonly available (e.g. in CMS MVA framework, TMVA, others)
      2. compare/benchmark the different implementations
      3. propose a baseline implementation if necessary, implement or interface the agreed implementation in the CMS MVA tool
      4. maintain and document the implementation
    2. Benchmark each algorithm
      1. Develop and document a detailed methodology specific to each algorithm
        • study different configurations of the algorithm (vary the architecture of the algorithm, the internal parameters, etc.) -> extract guidelines on what works in most cases, and when a detailed optimisation is necessary
        • study the dependence of the algorithm performance on the pre-processing of the data -> develop recommendations on what pre-processing is necessary for each algorithm
        • study the dependence between the number of input variables and the number of training events (different algorithms have different needs) -> extract recommendations
        • study overtraining (some methods are more sensitive than others); if necessary, propose and implement methods for overtraining treatment
        • develop and document the methodology: how the method is to be used, what sanity checks are to be performed, and how the results are to be presented in order to ensure transparency of the analysis, etc.
      2. Compare results given by different methods on the same data sets (using the methodology developed) -> extract recommendations
        • Algorithms to be investigated - the commonly used ones for the time being (a TMVA booking sketch is given after this list)
          • Likelihood estimators
          • Linear discriminants
          • Artificial Neural Networks
          • Boosted Decision Trees (also comparison of different boosting and bagging methods)
          • Bayesian Neural Networks
          • Support Vector Machine
  5. Cut optimisation
    • Algorithms available in TMVA
      • Monte Carlo sampling
      • Genetic Algorithms
      • Simulated Annealing
    • Algorithm proposed by Harrison Prosper
      • Random Grid Search
    • To do
      • comparison of the methods -> extract recommendations
      • if necessary, implement/interface new algorithms in the CMS MVA tool, maintain and document them
  6. Output
    • Survey the performance plots implemented in different tools (e.g. TMVA, NeuroBayes, etc)
    • Proposal for implementation of what is missing
    • Implement, maintain and document this part of the code
    • Survey methods for interpretation of the output as a probability (a sketch is given after this list)
    • Propose methods for implementation
    • Implement, maintain and document the agreed method
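
As a concrete reference for the discriminant algorithms listed above, the following minimal sketch books the commonly used TMVA classifiers through the standard factory interface. It assumes a TMVA::Factory that has already been configured elsewhere (variables, trees, preselection), and the option strings are illustrative defaults taken as assumptions, not agreed CMS settings.

// Sketch: book the commonly used TMVA classifiers on an already configured factory.
// Option strings are illustrative defaults, not agreed CMS settings.
#include "TMVA/Factory.h"
#include "TMVA/Types.h"

void bookCommonClassifiers(TMVA::Factory* factory) {
   // Projective likelihood estimator
   factory->BookMethod(TMVA::Types::kLikelihood, "Likelihood", "!H:!V:NAvEvtPerBin=50");
   // Linear (Fisher) discriminant
   factory->BookMethod(TMVA::Types::kFisher, "Fisher", "!H:!V");
   // Artificial neural network (multi-layer perceptron)
   factory->BookMethod(TMVA::Types::kMLP, "MLP", "!H:!V:NeuronType=tanh:NCycles=500:HiddenLayers=N+5");
   // Boosted decision trees
   factory->BookMethod(TMVA::Types::kBDT, "BDT", "!H:!V:NTrees=400:BoostType=AdaBoost:SeparationType=GiniIndex:nCuts=20");
   // Support vector machine
   factory->BookMethod(TMVA::Types::kSVM, "SVM", "!H:!V:Gamma=0.25:Tol=0.001");
   // Bayesian neural networks are not shipped with standard TMVA and would have to be interfaced separately.
}

For the interpretation of the output as a probability (item 6), TMVA offers a built-in rescaling of the classifier response at application time. The sketch below assumes the method was trained with the CreateMVAPdfs option (so that the response PDFs are stored with the weights) and uses a prior signal fraction of 0.5; the method name and the prior are assumptions, not recommendations.

// Sketch: interpret a TMVA classifier response as a signal probability.
// Requires that the method was trained with the CreateMVAPdfs option.
#include "TMVA/Reader.h"

double signalProbability(TMVA::Reader* reader) {
   // reader->AddVariable(...) and reader->BookMVA("BDT", "weights/....xml") are
   // assumed to have been done at initialisation; the raw classifier response
   // would be reader->EvaluateMVA("BDT").
   return reader->GetProba("BDT", 0.5); // 0.5 = assumed prior signal fraction
}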

Current commitments

Preprocessing - Salvatore Tupputi

  • Compare the current methods/algorithms implemented in the CMS MVA tool/framework and TMVA and extract conclusions (a TMVA sketch is given below)
  • Develop recommendations on what/when to be used (one or more methods, depending on the behaviour and performance of the method)
  • Maintain and document the pre-processing part of the CMS MVA tool
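
For the TMVA side of this comparison, input pre-processing can be exercised directly through the per-method VarTransform option, as in the minimal sketch below. The choice of classifier (MLP) and the set of transformations are assumptions for illustration only.

// Sketch: compare TMVA input pre-processing options by booking the same
// classifier with different VarTransform settings (assumes an already
// configured factory).
#include "TMVA/Factory.h"
#include "TMVA/Types.h"

void bookPreprocessingVariants(TMVA::Factory* factory) {
   factory->BookMethod(TMVA::Types::kMLP, "MLP_Norm",  "!H:!V:VarTransform=Norm");        // variable normalisation
   factory->BookMethod(TMVA::Types::kMLP, "MLP_Deco",  "!H:!V:VarTransform=Decorrelate"); // decorrelation
   factory->BookMethod(TMVA::Types::kMLP, "MLP_PCA",   "!H:!V:VarTransform=PCA");         // principal component analysis
   factory->BookMethod(TMVA::Types::kMLP, "MLP_Gauss", "!H:!V:VarTransform=Gauss");       // Gaussianisation of the inputs
}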

Algorithms benchmarking: likelihood estimators - Michele de Gruttola

  • benchmark and study the likelihood estimators implemented in the CMS MVA framework and TMVA (a TMVA booking sketch is given after this list)
    • compare the algorithms implemented in the CMS MVA framework and TMVA, extract conclusions and propose a baseline (recommended algorithms)
    • study different configurations of the algorithm, the dependence between the number of input variables and the number of training events, and the influence of the pre-processing of the input data
    • develop and document the methodology: how the method is to be used, what sanity checks are to be performed, and how the results are to be presented
  • prepare and maintain a set of benchmark data
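
A possible starting point for this benchmark is to book the TMVA likelihood variants side by side. The option strings below follow those shipped with the TMVA classification example and are meant as an illustrative sketch, not the agreed baseline.

// Sketch: book several TMVA likelihood estimators for a side-by-side comparison
// (assumes an already configured factory). Options are close to TMVA example defaults.
#include "TMVA/Factory.h"
#include "TMVA/Types.h"

void bookLikelihoodVariants(TMVA::Factory* factory) {
   // Projective likelihood with spline-interpolated reference PDFs
   factory->BookMethod(TMVA::Types::kLikelihood, "Likelihood",
       "!H:!V:TransformOutput:PDFInterpol=Spline2:NSmooth=1:NAvEvtPerBin=50");
   // Same estimator on decorrelated / PCA-transformed inputs
   factory->BookMethod(TMVA::Types::kLikelihood, "LikelihoodD",
       "!H:!V:TransformOutput:PDFInterpol=Spline2:NSmooth=5:NAvEvtPerBin=50:VarTransform=Decorrelate");
   factory->BookMethod(TMVA::Types::kLikelihood, "LikelihoodPCA",
       "!H:!V:TransformOutput:PDFInterpol=Spline2:NSmooth=5:NAvEvtPerBin=50:VarTransform=PCA");
   // Kernel density estimation of the reference PDFs
   factory->BookMethod(TMVA::Types::kLikelihood, "LikelihoodKDE",
       "!H:!V:TransformOutput:PDFInterpol=KDE:KDEtype=Gauss:KDEiter=Adaptive:KDEborder=None:NAvEvtPerBin=50");
}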

Algorithms benchmarking: Boosted Decision Trees - Mark Turner (supervisor Liliana Teodorescu)

  • study different configurations, the dependence between the number of input variables and the number of training events, and the influence of the pre-processing of the input data
  • compare different boosting and bagging algorithms (a booking sketch is given below)
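
One way to set up the boosting vs. bagging comparison is to book the same tree ensemble with different BoostType settings, as in the sketch below; the ensemble sizes and other option values are placeholder assumptions, not tuned settings.

// Sketch: book BDTs with different boosting / bagging strategies for comparison
// (assumes an already configured factory). Option values are placeholders.
#include "TMVA/Factory.h"
#include "TMVA/Types.h"

void bookBDTVariants(TMVA::Factory* factory) {
   // Adaptive boosting (AdaBoost)
   factory->BookMethod(TMVA::Types::kBDT, "BDT_AdaBoost",
       "!H:!V:NTrees=400:BoostType=AdaBoost:AdaBoostBeta=0.5:SeparationType=GiniIndex:nCuts=20");
   // Gradient boosting
   factory->BookMethod(TMVA::Types::kBDT, "BDT_Grad",
       "!H:!V:NTrees=400:BoostType=Grad:Shrinkage=0.10:SeparationType=GiniIndex:nCuts=20");
   // Bagging (bootstrap-aggregated trees, no boosting weights)
   factory->BookMethod(TMVA::Types::kBDT, "BDT_Bagging",
       "!H:!V:NTrees=400:BoostType=Bagging:SeparationType=GiniIndex:nCuts=20");
}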

Cut optimisation - Joe Bochenek (supervisor Harrison Prosper)

  • Compare algorithms from TMVA (a booking sketch is given below)
    • Genetic algorithms
    • Monte Carlo sampling
    • Simulated annealing (maybe)
  • Provide Random Grid Search to CMS
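
The three TMVA cut optimisers can be compared directly by booking the rectangular-cut classifier with different FitMethod options, as sketched below; Random Grid Search is not part of TMVA and would be interfaced separately. All option values are illustrative assumptions, close to the TMVA example defaults.

// Sketch: book TMVA rectangular-cut optimisation with the three fit methods
// to be compared (assumes an already configured factory).
#include "TMVA/Factory.h"
#include "TMVA/Types.h"

void bookCutOptimisers(TMVA::Factory* factory) {
   // Monte Carlo sampling of the cut values
   factory->BookMethod(TMVA::Types::kCuts, "CutsMC",
       "!H:!V:FitMethod=MC:EffSel:SampleSize=200000:VarProp=FSmart");
   // Genetic algorithm
   factory->BookMethod(TMVA::Types::kCuts, "CutsGA",
       "!H:!V:FitMethod=GA:EffSel:Steps=30:Cycles=3:PopSize=400:SC_steps=10:SC_rate=5:SC_factor=0.95");
   // Simulated annealing
   factory->BookMethod(TMVA::Types::kCuts, "CutsSA",
       "!H:!V:FitMethod=SA:EffSel:MaxCalls=150000:KernelTemp=IncAdaptive:InitialTemp=1e+6:MinTemp=1e-6:Eps=1e-10");
}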

Starting Ntuples

We can start our studies/optimisation/comparison from the ntuples located at

rfdir /castor/cern.ch/user/d/degrutto/2011/MVA/

You will find a tar ball there to untar. The /test folder contains ntuples and TMVA macros already pointing to them. The scripts are TMVAClassification_X.C, pointing to X*.root (X=ZH,WH,ZinvH). We therefore have three sets of analyses we can perform:

  • ZH analyses: low-mass Higgs decaying into b-b pairs, in association with a Z decaying into leptons
  • WH analyses: low-mass Higgs decaying into b-b pairs, in association with a W decaying into leptons
  • ZinvH analyses: low-mass Higgs decaying into b-b pairs, in association with a Z decaying into neutrinos

We can choose one of the three analyses, or all of them.

The variable names in the ntuples should be self-explanatory; just a reminder:

Always require Zpt (Wpt, MetEt) > 150, because the Z+jets background has been generated with that requirement: this is a boosted Higgs analysis!

Looking, for example, at the file TMVA/test/WHWH-115.root, we find the following variables (a minimal TMVA setup sketch using some of them is given after the list):

hjjMass --> mass of the jet-jet pair (hopefully around 115 GeV ;) )
hjjPt --> pt of the jet-jet pair 
WenPt --> pt of the W-->e + nu candidate; please always require pt > 150
WmnPt --> pt of the W-->mu + nu candidate; please always require pt > 150
jcsv1 --> CSV tag of the leading jet (hopefully >~ 0.5)
jcsv2 --> CSV tag of the second jet (hopefully >~ 0.5)
jjdr --> deltaR between the two jets
hjjWendPhi --> deltaPhi between the H and W candidate (hopefully back to back!)
.......................
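
To make the setup concrete, the sketch below shows how these variables and the pt > 150 preselection could be declared in a TMVA macro. The input file and tree names ("WH-115.root", "treeS", "treeB") are placeholders taken as assumptions; the actual names should be taken from the TMVAClassification_X.C macros in the tar ball.

// Minimal sketch of a TMVA setup using the WH ntuple variables listed above.
// File and tree names ("WH-115.root", "treeS", "treeB") are placeholders.
#include "TFile.h"
#include "TTree.h"
#include "TCut.h"
#include "TMVA/Factory.h"
#include "TMVA/Types.h"

void whMvaSketch() {
   TFile* input   = TFile::Open("WH-115.root");          // placeholder input file
   TFile* outFile = TFile::Open("TMVA_WH.root", "RECREATE");

   TMVA::Factory* factory = new TMVA::Factory("TMVAClassification", outFile,
       "!V:!Silent:Color:DrawProgressBar:AnalysisType=Classification");

   // Discriminating variables from the ntuple
   factory->AddVariable("hjjMass",    'F');  // mass of the jet-jet pair
   factory->AddVariable("hjjPt",      'F');  // pt of the jet-jet pair
   factory->AddVariable("jcsv1",      'F');  // CSV tag of the leading jet
   factory->AddVariable("jcsv2",      'F');  // CSV tag of the second jet
   factory->AddVariable("jjdr",       'F');  // deltaR between the two jets
   factory->AddVariable("hjjWendPhi", 'F');  // deltaPhi between H and W candidates

   // Signal and background trees (tree names are placeholders)
   factory->AddSignalTree((TTree*)input->Get("treeS"), 1.0);
   factory->AddBackgroundTree((TTree*)input->Get("treeB"), 1.0);

   // Boosted-Higgs preselection required above (here for the W->e nu channel)
   TCut presel = "WenPt > 150";
   factory->PrepareTrainingAndTestTree(presel, presel,
       "SplitMode=Random:NormMode=NumEvents:!V");

   // Book one or more methods, e.g. the variants sketched in the sections above
   factory->BookMethod(TMVA::Types::kBDT, "BDT", "!H:!V:NTrees=400:BoostType=AdaBoost");

   factory->TrainAllMethods();
   factory->TestAllMethods();
   factory->EvaluateAllMethods();

   outFile->Close();
}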

The list of samples used and their corresponding luminosities is given in the attached figure samples.png.

Documentation

Responsible: MicheleDegruttola

-- MicheleDegruttola - 12-May-2011

MVA Mailing List (Subscribe)

cms-mva@cern.ch

Meetings

  • Informal Meeting on Dec. 2, 2011 [Bat40 Atrium]

Topic attachments
  • samples.png (1053.5 K, 2011-05-12, MicheleDegruttola) - samples
  • MTurner121211.pdf (2191.4 K, 2011-12-12, MarkRTurner)
  • slvtr_12dec11.pdf (7069.9 K, 2011-12-12) - slides for the December 12th 2011 meeting