MVA Framework Computer Offline Guide

Contents:

MVAComputer documentation and CMSSW interface

The MVAComputer

The MVAComputer is implemented as a single user-visible class MVAComputer in the PhysicsTools namespace. The constructor takes a pointer to a calibration object of the type PhysicsTools::Calibration::MVAComputer. This calibration object can either be constructed by hand, obtained from the CondDB or the MVATrainer (for the training procedure). More details on how to store and obtain a calibration object can be found further below.

A calibration object defines a number of input variables for the MVA computer. These input variables are identified by their name (a string constant). Depending on the use case, each variable can appear a different number of times for each computation. In the typical case each variable appears exactly once. In some cases, though, it is useful to allow optional variables (e.g. when a physical property is not always available because it cannot be reliably reconstructed). Other properties can appear more than once per computation, so the corresponding input variables can appear multiple times; a typical use case is track variables for b-tagging. Special precautions have to be taken for these atypical input variables to work, as the MVA computer sub-modules usually do not allow this flexibility themselves: additional preprocessing modules for these variables have to be explicitly specified in the calibration object.

The input for each individual discriminator computation is passed as a collection of PhysicsTools::Variable objects (any STL-compatible iterable type, like a simple C array or a std::vector). A PhysicsTools::Variable is a simple type, containing the identifier for an input variable and a double for the value for that variable instance.

The variable identifier is essentially a string constant stored in a PhysicsTools::AtomicId object. Such an identifier object can be transparently constructed from any C or C++ string, but in order to avoid the expensive string lookup involved, it is recommended to create and cache the AtomicId objects beforehand (e.g. once in the class constructor) and pass them directly in time-critical inner loops.

A simple example looks like:

#include <vector>

#include "CondFormats/PhysicsToolsObjects/interface/MVAComputer.h"
#include "PhysicsTools/MVAComputer/interface/Variable.h"
#include "PhysicsTools/MVAComputer/interface/MVAComputer.h"

using namespace PhysicsTools;

// obtain a calibration object from somewhere
Calibration::MVAComputer *calibration = ...;

// create MVA computer
MVAComputer *mva = new MVAComputer(calibration);

// loop over all events
for(...) {
        // one Variable per input, identified by its name
        std::vector<Variable> input;

        input.push_back(Variable("pt", event->pt));
        input.push_back(Variable("eta", event->eta));
        input.push_back(Variable("deltaR", event->deltaR));
        input.push_back(Variable("isolation", event->isolation));

        double discriminator = mva->eval(input);

        if (discriminator > discriminatorCut) {
                // signal event
        } else {
                // background event
        }
}

As mentioned, constructing the identifier from a string on every call wastes CPU cycles. Re-allocating and filling the vector on each iteration is equally wasteful. For a constant number of input variables, something like the following example is more appropriate:

#include "CondFormats/PhysicsToolsObjects/interface/MVAComputer.h"
#include "PhysicsTools/MVAComputer/interface/Variable.h"
#include "PhysicsTools/MVAComputer/interface/MVAComputer.h"

using namespace PhysicsTools;

enum InputVariables {
        kPt = 0,
        kEta,
        kDeltaR,
        kIsolation,
        kNumInputVars
};

// identifiers are constructed once; the names must match the calibration
static const AtomicId inputVarIds[kNumInputVars] = {
        "pt",
        "eta",
        "deltaR",
        "isolation"
};

// obtain a calibration object from somewhere
Calibration::MVAComputer *calibration = ...;

// create MVA computer
MVAComputer *mva = new MVAComputer(calibration);

// loop over all events
for(...) {
        Variable input[kNumInputVars] = {
                Variable(inputVarIds[kPt], event->pt),
                Variable(inputVarIds[kEta], event->eta),
                Variable(inputVarIds[kDeltaR], event->deltaR),
                Variable(inputVarIds[kIsolation], event->isolation)
        };

        // overloaded method that takes begin and end "iterators"
        double discriminator = mva->eval(input, input + kNumInputVars);

        if (discriminator > discriminatorCut) {
                // signal event
        } else {
                // background event
        }
}

The MVAComputer calibration objects

The class definitions for the calibration objects are found in the CondFormats/PhysicsToolsObjects package and are therefore accompanied by an LCG dictionary to allow persistent storage in the CMS conditions database using POOL and CORAL. The CMS conditions database allows storage in any supported SQL database. The central online database at CERN is an Oracle database; it is queried through a web interface using Frontier, which allows site-local caching of results via a local web proxy (squid). For non-centralised private storage of CondDB objects, the lightweight sqlite backend allows storage in a local database file.

All CondDB objects are identified by a record, which corresponds to a special instance of a C++ class inside of CMSSW that has to be explicitly defined and registered with the framework as a plugin. Secondly, different instances of an object associated with a record can be distinguished by tags, which are simply string identifiers associated with the individual calibrations. Finally, each calibration has a certain interval of validity (IOV), so that for real data the calibration that was valid for the currently processed event can be retrieved.

The MVA computer calibration objects, if not instantiated by hand, are available inside CMSSW from an event setup source or event setup producer.

The objects can be retrieved from the CondDB using the following cfg snippet:

include "CondCore/DBCommon/data/CondDBSetup.cfi"

es_source = PoolDBESSource {
        using CondDBSetup   

        string connect = "sqlite_file:MVAComputerObjects.db"
        untracked string catalog = "file:mycatalog.xml"
        string timetype = "runnumber"

        VPSet toGet = {
                {
                        string record = "SomeRcd"
                        string tag = "Foobar_tag"
                }
        }
}

The ``connect'' parameter describes the database in which the object resides, in this case a local sqlite database file. The ``catalog'' parameter specifies the accompanying PoolFileCatalog. It is created together with the database file and has to be kept, because access to the database does not work without it. The ``timetype'' parameter defines that the run number is used for the IOV (which is appropriate if the calibration does not change within a run). The ``toGet'' parameter vector defines a list of objects to be retrieved from the database, each identified by ``record'' and ``tag''. The record corresponds to the C++ class name of the record, and the tag is a user-definable identifier chosen when storing the object.

There are two types of calibration objects that can be stored in the CondDB. First there is the calibration object for a single MVAComputer itself, namely PhysicsTools::Calibration::MVAComputer. Secondly there is a container class which can contain multiple MVAComputer calibrations, each identified by a name (similar to map<string, Calibration::MVAComputer>).

The CondDB record is defined as follows (replace SomeRcd with your own record):

In the header file:

#include "FWCore/Framework/interface/EventSetupRecordImplementation.h"

class SomeRcd :
        public edm::eventsetup::EventSetupRecordImplementation<SomeRcd> {};

and in the corresponding .cc file:

#include "CondFormats/DataRecord/interface/SomeRcd.h"
#include "FWCore/Framework/interface/eventsetuprecord_registration_macro.h"

EVENTSETUP_RECORD_REG(SomeRcd);

Then the type of the calibration object has to be defined for that record and registered as a plugin in a .cc file:

#include "CondCore/PluginSystem/interface/registration_macros.h"
#include "CondFormats/PhysicsToolsObjects/interface/MVAComputer.h"
#include "CondFormats/DataRecord/interface/SomeRcd.h"

using namespace PhysicsTools::Calibration;
REGISTER_PLUGIN(SomeRcd, MVAComputer);

Do not forget to build that file as an EDM_PLUGIN.

After having completed these CMSSW registration steps, the calibration objects for the newly created record can be retrieved from an edm::EventSetup object, e.g. from the analyze method in an EDAnalyzer as follows:

#include "FWCore/Framework/interface/EventSetup.h"
#include "FWCore/Framework/interface/ESHandle.h"  
#include "FWCore/Framework/interface/EventSetupRecord.h"
#include "FWCore/Framework/interface/EventSetupRecordKey.h"

#include "CondFormats/DataRecord/interface/SomeRcd.h"
#include "CondFormats/PhysicsToolsObjects/interface/MVAComputer.h"

#include "PhysicsTools/MVAComputer/interface/MVAComputer.h"

using namespace PhysicsTools;

...
   
void FoobarTest::analyze(const edm::Event& iEvent,
                         const edm::EventSetup& iSetup)
{
        // define an EventSetup handle of type Calibration::MVAComputer.
        edm::ESHandle<Calibration::MVAComputer> calibHandle;

        // retrieve the ES handle for record SomeRcd
        iSetup.get<SomeRcd>().get(calibHandle);

        // the raw object can be accessed as follows:
        const Calibration::MVAComputer *calibration = calibHandle.product();

        // the MVA computer can be instantiated from this object:
        MVAComputer *mva = new MVAComputer(calibration);

        ...
}

For performance reasons this example is suboptimal: the MVA computer is instantiated for each event. It is much more reasonable to instantiate it for the first event, cache the MVAComputer pointer for the following events and destroy it at the end of processing. If you expect the calibration to change in between, you have to check whether the calibration object has changed and re-create the MVAComputer if that happens. The calibration objects define a cache identifier for that purpose:

std::auto_ptr<MVAComputer> mva;
Calibration::MVAComputer::CacheId cacheId;

...
   
if (!mva.get() || calibration->changed(cacheId)) {
        mva = std::auto_ptr<MVAComputer>(new MVAComputer(calibration));
        cacheId = calibration->getCacheId();
}

The container class for multiple MVAComputer calibrations mentioned above is called PhysicsTools::Calibration::MVAComputerContainer. The procedure for retrieving it and for checking the validity of a cached MVA computer works analogously. The individual calibrations can be retrieved from the container using the find(const string &label) method, which returns a reference to a PhysicsTools::Calibration::MVAComputer object and throws an exception if no entry is found for the label passed.

MVAComputer layout

The MVAComputer calibration object describes how an MVA computer is structured internally. An MVA computer can consist of multiple ``variable computers'' (also called variable processors below) that can be freely interconnected in any allowed configuration. Like the whole MVA computer, each variable computer can take a number of input variables. The allowed kinds of variables depend on the type of variable processor. Also, depending on the type and the number of input variables, each variable processor provides one or more output variables, which in turn can be connected as input variables to other variable processors. One of the variables available is then selected as the output variable for the whole MVA computer. The connection graph is strictly directed in the forward direction, so it is not possible to create loops. When the MVA computer is constructed from a calibration object, some basic consistency checks are done. It is still possible to create a broken configuration, though: some errors can only be caught at runtime or not at all, so care has to be taken not to pass arbitrary calibrations to the MVAComputer. Successful calibrations obtained from the trainer should always be valid, so it should typically not be necessary to create a calibration manually.

The calibration for the variable computers derives from a common base class. This base class contains the input variables information. It is represented by a simple bit set (i.e. array of bits) that selects the input variables from the pool of all variables available at that point. The logic works as follows:

The variable computers are instantiated in the order they are defined in the uppermost PhysicsTools::Calibration::MVAComputer object. At the beginning, the ``variable pool'' only contains the input variables defined in the MVAComputer calibration object, in the order of appearance. The first variable is assigned the index 0, the second variable the index 1 and so on. The first variable computer defined can select its input variables from that pool. The size of the bit set has to exactly match the number of variables in the pool, or instantiation is aborted with an exception. Each variable used is represented by a 1 in the corresponding bit (bit 0 set means variable 0 is picked, etc.). Note that this does not allow free ordering of the input variables; the order is always restricted to the order of appearance in the variable pool. This is generally not a limitation, as it just means that the ordering of the configuration variables of each variable computer is given by the context. In the few cases (actually only one at the moment) where the ordering does matter, the variable computer calibration objects can define a permutation map to explicitly reorder the input variables to match the requirements.

Each variable computer then defines a number of output variables (depending on its type and the number of its input variables), which are added to the variable pool in that order, meaning that its output variables are now also available as input variables to the succeeding variable processors. The input variable bit set of the next variable processor therefore has to be larger, to accommodate both the initial input variables and the output variables of preceding variable processors. The process is repeated until all variable processors are instantiated. At the end, the uppermost MVAComputer calibration picks one variable out of the final pool as the final output variable.
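
The selection and pool-growth logic described above can be illustrated with a minimal, self-contained sketch. It does not use the CMSSW classes: VariablePool, selectInputs and appendOutputs are hypothetical names introduced only for this illustration, and the pool simply holds variable names.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the variable pool: variable names in pool order.
typedef std::vector<std::string> VariablePool;

// Pick the inputs of one variable processor. The mask needs exactly one bit
// per pool entry; bit i set means that pool variable i is picked.
VariablePool selectInputs(const VariablePool &pool,
                          const std::vector<bool> &mask)
{
        if (mask.size() != pool.size())
                throw std::invalid_argument(
                        "bit set size does not match variable pool size");

        VariablePool selected;
        for (std::size_t i = 0; i < pool.size(); ++i)
                if (mask[i])
                        selected.push_back(pool[i]);

        // the selection keeps pool order and can never exceed the pool
        assert(selected.size() <= pool.size());
        return selected;
}

// Append the outputs of a processor, making them visible to later processors.
void appendOutputs(VariablePool &pool, const VariablePool &outputs)
{
        pool.insert(pool.end(), outputs.begin(), outputs.end());
}
```

With a pool of "pt", "eta", "deltaR" and the mask 1, 0, 1, the sketch selects "pt" and "deltaR"; once a processor has appended an output, the same three-bit mask no longer matches the pool and is rejected.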

Care has to be taken that variables are compatible with respect to their allowed number of appearances (see optional and multiple appearance of input variables). Some variable computers can handle missing variables, others cannot. The same holds for variables that appear multiple times. Special variable computers are available that convert the type of variables to make them compatible with other variable processors. For example, missing variables can be replaced with a default value using the ProcOptional variable computer. A likelihood (ProcLikelihood) is able to collapse variables that can appear multiple times into a single variable by multiplication of the probabilities. Special variable processors can also split up a single variable into multiple distinct ones by selecting the first, second, etc. instance and passing each to an individual output variable connector (see ProcSplitter), or iterate over each instance of such a multi-variable and call other variable processors each time (ProcForeach). Simple counting of instances is also possible (ProcCount).

Variable processors

In the following, the individual variable processors available as of CMSSW 1.5.0 are described in detail, starting with processors meant for glue and preprocessing, and concluding with actual MVA processors:

ProcClassed:

This variable processor splits a category selection variable into one variable for every possible category, each of which describes whether that category was selected or not.

Input variables: 1 non-optional and non-multiple variable

The input variable is considered a ``class'' variable, meaning that it describes a category, represented by an integer value corresponding to some enumeration. Its allowed values are 0, 1, ..., n - 1, where the number of categories n is defined in the calibration.

Configuration parameters:

  • the number of possible categories n

Output variables: n non-optional, non-multiple variables

Each allowed category 0, 1, ..., n - 1 is assigned one output variable (in that order). If the input variable selects that category, the corresponding output variable is set to one, otherwise to zero. If the input variable is out of range, all output variables are set to zero.
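
As an illustration of this behaviour, the following standalone sketch computes the indicator outputs for one class variable. It is not the CMSSW implementation: classedOutputs is a hypothetical helper, and it assumes that categories are numbered from 0 (so that n categories map onto n outputs) and that the input is rounded to the nearest integer.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: one indicator output per category 0 .. nClasses - 1.
std::vector<double> classedOutputs(double classVariable, unsigned int nClasses)
{
        assert(nClasses > 0);

        // all outputs start at zero and stay there for out-of-range input
        std::vector<double> out(nClasses, 0.0);

        if (classVariable >= 0.0 && classVariable < nClasses) {
                // assumption: round to the nearest integer category
                unsigned int category =
                        static_cast<unsigned int>(classVariable + 0.5);
                if (category < nClasses)
                        out[category] = 1.0;
        }
        return out;
}
```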

ProcCount:

Every input variable is counted and the result made available as output variable.

Input variables: n optional, multiple variables

Configuration parameters:

Output variables: n non-optional and non-multiple variables

Each input variable is assigned an output variable that contains the number of instances of that input variable.
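
A minimal sketch of the counting, assuming each multi-variable is represented as a plain vector of its instances (the Instances typedef and countInstances are hypothetical names used only for this illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical representation: all instances of one input variable.
typedef std::vector<double> Instances;

// One output per input variable, holding the number of its instances.
std::vector<double> countInstances(const std::vector<Instances> &inputs)
{
        std::vector<double> counts;
        counts.reserve(inputs.size());
        for (std::size_t i = 0; i < inputs.size(); ++i)
                counts.push_back(static_cast<double>(inputs[i].size()));

        assert(counts.size() == inputs.size());
        return counts;
}
```

An empty (missing) input variable simply yields a count of zero, which is why the outputs are neither optional nor multiple.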

ProcForeach:

The ProcForeach variable processor runs a specified number of the following variable processors in a loop, passing only one instance of each selected input variable at each iteration. This makes it possible to run variable processors that cannot deal with optional or multiple appearances of input variables themselves, once for each instance of the multi-variables. The output variables of those processors are then collected and turned into multi-variables that match the configuration of the ProcForeach input variables.

Input variables: n optional, multiple variables

The input variables that are to be passed to the following m variable processors one by one have to be specified here. Note that each of these variables has to appear exactly the same number of times.

Configuration parameters:

The number m of directly following variable processors to be executed in the loop has to be specified. These variable processors are executed in the defined order for each instance of the input variables. All variables that those m variable processors choose as input, starting from their master ProcForeach variable processor (including the output variables of the ProcForeach itself), are always passed as exactly one non-optional and non-multiple variable. All variables in the pool before the ProcForeach are passed unmodified. If the input variables are empty, the following m variable processors are simply skipped and their output variables set to empty.

Output variables: n non-optional, non-multiple variables

Each of the n input variables is represented by an output variable. The only difference is that the output is suited as input for the variable processors to be looped over: in contrast to the input variables, the variable instances are passed individually into the following m variable processors. Note that if any of these variables or the output variables of the following m variable processors is used inside the loop, only the one instance produced by the current iteration is visible, whereas for the variable processors after the loop, all output variables are visible as optional and multiple input variables again.
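
The loop semantics can be sketched in a few lines of standalone C++. This is only an illustration under simplifying assumptions: the m looped-over processors are collapsed into a single function with one output, and forEachInstance, LoopBody and sumOfInputs are hypothetical names, not part of the MVAComputer package.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical representation: all instances of one input variable.
typedef std::vector<double> Instances;

// Stand-in for the looped-over processors: sees exactly one value per variable.
typedef double (*LoopBody)(const std::vector<double> &);

// Example loop body (hypothetical): adds up its single-instance inputs.
double sumOfInputs(const std::vector<double> &values)
{
        double sum = 0.0;
        for (std::size_t i = 0; i < values.size(); ++i)
                sum += values[i];
        return sum;
}

// Run the body once per instance and collect the results as a multi-variable.
Instances forEachInstance(const std::vector<Instances> &inputs, LoopBody body)
{
        Instances results;
        if (inputs.empty())
                return results;

        // every input variable has to appear the same number of times
        const std::size_t count = inputs[0].size();
        for (std::size_t i = 0; i < inputs.size(); ++i)
                assert(inputs[i].size() == count);

        for (std::size_t k = 0; k < count; ++k) {
                std::vector<double> single(inputs.size());
                for (std::size_t i = 0; i < inputs.size(); ++i)
                        single[i] = inputs[i][k];
                results.push_back(body(single));
        }
        return results; // empty if the inputs were empty (body is skipped)
}
```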

ProcSplitter:

This variable processor splits the first m instances of the n multi-instance input variables into individual non-multiple output variables (and all remaining instances as one multi-instance output variable).

Input variables: n optional, multiple variables

Configuration parameters:

  • the number m of leading instances of every input variable that are to be split off

Output variables: n * m optional, non-multiple variables and n optional, multiple variables

For every input variable the first m instances appear as individual optional output variables (empty if fewer than m instances are available), followed by all remaining instances as one optional and multiple variable, in that order.
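
For a single input variable, the splitting can be sketched as follows. The names SplitResult and splitFirst are hypothetical, and an optional variable is represented here as a vector holding zero or one value.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical representation: all instances of one variable.
typedef std::vector<double> Instances;

struct SplitResult {
        std::vector<Instances> heads; // m optional outputs (0 or 1 value each)
        Instances rest;               // all remaining instances
};

// Split the first m instances of one input variable into individual outputs.
SplitResult splitFirst(const Instances &input, std::size_t m)
{
        SplitResult result;
        result.heads.resize(m); // outputs stay empty if instances are missing

        for (std::size_t i = 0; i < input.size(); ++i) {
                if (i < m)
                        result.heads[i].push_back(input[i]);
                else
                        result.rest.push_back(input[i]);
        }

        assert(result.heads.size() == m);
        return result;
}
```

For n input variables the function would be applied to each variable in turn, giving the n * m individual outputs and n remainder outputs listed above.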

ProcMultiply:

Allows the multiplication of input variables.

Input variables: n non-optional, non-multiple variables

Configuration parameters:

For each of the m output variables, a set of input variables is multiplied and the result returned as output variable: the $i$-th output variable is the product of $k_i$ input variables. The ordering is arbitrary. The number of input variables required is $\sum_{i=1}^{m} k_i$.

Output variables: m

The m results of the multiplications.
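
A minimal sketch of the multiplication, assuming the configuration is given as one list of input indices per output variable (multiplyGroups and this index representation are assumptions made for the illustration, not the actual calibration layout):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// groups[i] lists the indices of the inputs multiplied into output i.
std::vector<double> multiplyGroups(
        const std::vector<double> &inputs,
        const std::vector<std::vector<std::size_t> > &groups)
{
        std::vector<double> outputs;
        outputs.reserve(groups.size());

        for (std::size_t i = 0; i < groups.size(); ++i) {
                double product = 1.0;
                for (std::size_t j = 0; j < groups[i].size(); ++j) {
                        assert(groups[i][j] < inputs.size());
                        product *= inputs[groups[i][j]];
                }
                outputs.push_back(product);
        }
        return outputs;
}
```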

ProcOptional:

The ProcOptional variable processor simply replaces empty variables with a provided default value.

Input variables: n optional, non-multiple variables

Configuration parameters:

  • for each of the n variables a default value has to be provided that is set for empty input variables.

Output variables: n non-optional, non-multiple variables
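
The substitution can be sketched as below, again representing an optional variable as a vector with zero or one value. fillDefaults is a hypothetical name for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical representation: an optional variable holds 0 or 1 value.
typedef std::vector<double> Instances;

// Replace every empty input by the default configured for that variable.
std::vector<double> fillDefaults(const std::vector<Instances> &inputs,
                                 const std::vector<double> &defaults)
{
        assert(inputs.size() == defaults.size());

        std::vector<double> outputs;
        outputs.reserve(inputs.size());
        for (std::size_t i = 0; i < inputs.size(); ++i) {
                assert(inputs[i].size() <= 1); // non-multiple inputs only
                outputs.push_back(inputs[i].empty() ? defaults[i]
                                                    : inputs[i][0]);
        }
        return outputs;
}
```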

ProcNormalize:

Applies a ``normalisation'' process to the variables. Here ``normalisation'' means two things: first, the range of the variables is transformed to lie between 0 and 1. Secondly, the transformation is a non-linear one. The transformation function is chosen so that the output distribution is more uniform than the input distribution. The exact behaviour is controlled by the trainer.

Input variables: n optional, multiple variables

The input variables to be normalised.

Configuration parameters:

  • the range and distribution histograms for each input variable

Output variables: n optional, multiple variables

Each output variable exactly corresponds to an input variable with the transformation of the value applied according to the range and histogram provided for that variable.
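
One transformation with the stated properties is the cumulative distribution of the input histogram, which maps the range to [0, 1] and flattens the distribution. The sketch below uses that as an assumption; the text above does not specify the exact function, and normalizeValue, its equal-width binning and the linear interpolation inside a bin are choices made for this illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Map x to [0, 1] using the cumulative sum of an equal-width histogram
// covering [min, max). Assumption: linear interpolation inside each bin.
double normalizeValue(double x, double min, double max,
                      const std::vector<double> &histogram)
{
        assert(max > min && !histogram.empty());

        double total = 0.0;
        for (std::size_t i = 0; i < histogram.size(); ++i)
                total += histogram[i];
        assert(total > 0.0);

        if (x <= min)
                return 0.0;
        if (x >= max)
                return 1.0;

        // fractional bin position of x
        const double pos = (x - min) / (max - min) * histogram.size();
        std::size_t bin = static_cast<std::size_t>(pos);
        if (bin >= histogram.size())
                bin = histogram.size() - 1;

        // full bins below x plus the covered fraction of the current bin
        double integral = 0.0;
        for (std::size_t i = 0; i < bin; ++i)
                integral += histogram[i];
        integral += histogram[bin] * (pos - bin);

        return integral / total;
}
```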

ProcMatrix:

This variable processor provides a general linear matrix transformation for the variables. n input variables are transformed to m output variables using an m x n matrix.

Input variables: n non-optional, non-multiple variables

Configuration parameters:

  • an m x n matrix

Output variables: m non-optional, non-multiple variables
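
The transformation is an ordinary matrix-vector product. A minimal sketch, assuming the matrix is stored as m rows of n coefficients (applyMatrix and this storage layout are assumptions of the illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Multiply an m x n matrix (m rows of n coefficients) with n input values.
std::vector<double> applyMatrix(
        const std::vector<std::vector<double> > &matrix,
        const std::vector<double> &inputs)
{
        std::vector<double> outputs;
        outputs.reserve(matrix.size());

        for (std::size_t row = 0; row < matrix.size(); ++row) {
                assert(matrix[row].size() == inputs.size());
                double sum = 0.0;
                for (std::size_t col = 0; col < inputs.size(); ++col)
                        sum += matrix[row][col] * inputs[col];
                outputs.push_back(sum);
        }
        return outputs;
}
```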

ProcLinear:

Provides a computer for a simple linear discriminant analysis. This is implemented as a simple linear combination of input variables and an offset.

Input variables: n non-optional, non-multiple variables

Configuration parameters:

  • n coefficients
  • 1 offset

Output variables: 1 non-optional, non-multiple variable
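
In code, the linear combination amounts to a dot product plus the offset. The function name linearDiscriminant is hypothetical and used only for this sketch:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// offset + sum over i of coefficients[i] * inputs[i]
double linearDiscriminant(const std::vector<double> &coefficients,
                          double offset,
                          const std::vector<double> &inputs)
{
        assert(coefficients.size() == inputs.size());

        double result = offset;
        for (std::size_t i = 0; i < inputs.size(); ++i)
                result += coefficients[i] * inputs[i];
        return result;
}
```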

ProcLikelihood:

This variable processor implements a simple probability density estimator using a likelihood approach. Each input variable is assigned a background and a signal probability distribution histogram, which is interpolated using cubic splines. The output variable is evaluated as the likelihood of the input variables describing a signal event, i.e. $L = s / (b + s)$, where $s$ and $b$ describe the probability of the input variables describing a signal event or a background event, respectively. $s$ and $b$ are defined as the product over the individual probabilities for each input variable being signal- or background-like (obtained from the respective normalised histograms for signal and background for that variable).

Input variables: n optional, multiple variables

Note that the probabilities for each instance of multiple variables are obtained using the same distributions. If one wishes to use different histograms for different instances, ProcSplitter should be used beforehand.

Configuration parameters:

  • n signal and n background distribution histograms

Output variables: one optional, non-multiple variable

If all input variables are empty, no output variable is computed at all, i.e. will be empty and can be assigned a default using ProcOptional.
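
The combination step $L = s / (b + s)$ can be sketched as follows. The histogram lookup and spline interpolation are left out: the sketch assumes the per-variable signal and background probabilities have already been obtained. The function name likelihoodRatio and the handling of $s + b = 0$ (treated as "no output") are assumptions of this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Combine per-variable probabilities into L = s / (s + b).
// Returns false (no output) if there are no inputs at all.
bool likelihoodRatio(const std::vector<double> &signalProbs,
                     const std::vector<double> &backgroundProbs,
                     double &result)
{
        assert(signalProbs.size() == backgroundProbs.size());

        if (signalProbs.empty())
                return false; // all input variables empty: output stays empty

        double s = 1.0;
        double b = 1.0;
        for (std::size_t i = 0; i < signalProbs.size(); ++i) {
                s *= signalProbs[i];
                b *= backgroundProbs[i];
        }

        if (s + b <= 0.0)
                return false; // assumption: undefined ratio gives no output

        result = s / (s + b);
        return true;
}
```

For two variables with signal probabilities 0.8 and 0.5 and background probabilities 0.2 and 0.5, the products are s = 0.4 and b = 0.1, which gives L = 0.8.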

ProcMLP:

The ProcMLP module implements a simple feed-forward multi-layer perceptron (a simple artificial neural network). The n input variables represent the first layer, the m output variables the last layer. The layout and coefficients of the intermediate layers are defined in the configuration.

Input variables: n non-optional, non-multiple variables

Configuration parameters:

  • the number of hidden layers
  • the activation function for each hidden layer (sigmoid or linear)
  • the number of neurons in each layer
  • the coefficients for the inputs to each neuron (a coefficient for every connection to the nodes of the preceding layer and an offset)

Output variables: m non-optional, non-multiple variables
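
The evaluation of such a network can be sketched in standalone C++ as below. The Neuron and Layer structs and evalMLP are hypothetical and do not mirror the actual calibration classes; the sketch also assumes that every computed layer, including the last one, carries its own activation flag.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Neuron {
        double offset;               // bias term
        std::vector<double> weights; // one per node of the preceding layer
};

struct Layer {
        bool sigmoid;                // true: sigmoid, false: linear
        std::vector<Neuron> neurons;
};

// Propagate the input values through all layers in order.
std::vector<double> evalMLP(const std::vector<Layer> &layers,
                            std::vector<double> values)
{
        for (std::size_t l = 0; l < layers.size(); ++l) {
                std::vector<double> next;
                next.reserve(layers[l].neurons.size());

                for (std::size_t k = 0; k < layers[l].neurons.size(); ++k) {
                        const Neuron &neuron = layers[l].neurons[k];
                        assert(neuron.weights.size() == values.size());

                        double sum = neuron.offset;
                        for (std::size_t i = 0; i < values.size(); ++i)
                                sum += neuron.weights[i] * values[i];

                        if (layers[l].sigmoid)
                                sum = 1.0 / (1.0 + std::exp(-sum));
                        next.push_back(sum);
                }
                values.swap(next);
        }
        return values; // outputs of the last layer
}
```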

ProcTMVA:

This processor provides glue to the evaluator of the TMVA framework (provided by ROOT) and therefore allows the use of any algorithm provided by TMVA inside the MVAComputer framework.

Input variables: n non-optional, non-multiple variables

Configuration parameters:

  • the number and names of the input variables
  • the name of the TMVA method
  • the TMVA training data (gzip-compressed TMVA weights file contents)

Output variables: one non-optional, non-multiple variable

-- ChristopheSaout - 06 Dec 2007

Topic revision: r3 - 2008-02-15 - ChristopheSaout
 