MVA Framework Offline Guide tutorial

Introduction to the MVA framework in CMSSW with examples.

Introduction

The main classes, MVAComputer on the evaluation side and MVATrainer on the training side, are available in FWLite. These interfaces are in part fairly low-level, so additional convenience classes are available. They are introduced in the FWLite-based examples below.

To follow the examples you probably need the latest version of the code: use CMSSW CVS HEAD with CMSSW_2_0_0 or later, or the 1_6_X backport branch btag_CMSSW_1_6_X_backport_branch for older CMSSW versions. (Don't be alarmed by "btag": I'm just reusing that branch to avoid duplication, since the MVA code is also contained in the backport.)

In your project area, under CMSSW_X_Y_Z/src:

project CMSSW
cvs co -r btag_CMSSW_1_6_X_backport_branch CondFormats/PhysicsToolsObjects
cvs co -r btag_CMSSW_1_6_X_backport_branch PhysicsTools/MVAComputer
cvs co -r btag_CMSSW_1_6_X_backport_branch PhysicsTools/MVATrainer

The examples can be found in the respective /test directories.

the MVA trainer

Let's start with a simple didactic example. We assume a two-dimensional space of input variables, x and y, with simple signal and background distributions: two Gaussian blobs with a large overlap. Like this:

image1.gif

These background and signal distributions are created on the fly in some examples, while others use ROOT TTrees. In PhysicsTools/MVATrainer/test the files testFWLiteTreeTrainer.C, testFWLiteTrainerViaTreeReader.C, testFWLiteTrainer.C and testMVATrainer.cpp do so. The first example uses the simplest trainer interface; the others access the internal classes more and more directly. The last two examples do not use the ROOT interface, and the last example does not use FWLite.
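As a rough illustration of such a toy sample, the following standalone C++ sketch draws points from two overlapping Gaussian blobs. The blob centres at (+1, +1) and (-1, -1), the unit width and the names Point and makeBlob are assumptions made for this sketch; they are not taken from the test macros.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// One toy event in the two-dimensional input space.
struct Point {
    double x;
    double y;
};

// Draw n points from a Gaussian blob. The centres (+1, +1) for signal and
// (-1, -1) for background and the unit width are assumptions chosen only
// to give a large overlap.
std::vector<Point> makeBlob(bool signal, std::size_t n, unsigned seed)
{
    assert(n > 0);
    const double centre = signal ? +1.0 : -1.0;
    std::mt19937 rng(seed);
    std::normal_distribution<double> gauss(centre, 1.0);

    std::vector<Point> points;
    points.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double x = gauss(rng);
        const double y = gauss(rng);
        points.push_back(Point{x, y});
    }
    return points;
}
```

Calling makeBlob(true, 20000, 1u) and makeBlob(false, 20000, 2u) would give one signal and one background sample of the size used in the examples.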

If you are just looking for a way to train on prepared ROOT trees without doing any programming, see the section "A simple MVATrainer executable" below. If you have a TMVA weights file and want it converted, see the paragraph on mvaConvertTMVAWeights in the section "evaluation using the MVAComputer" below.

TreeTrainer example

Let's look at testFWLiteTreeTrainer.C. The second method, TTree *createTree(bool signal), creates a TTree on the fly for either signal or background events. Each TTree contains two branches, "x" and "y", and 20000 entries.

        TTree *sig = createTree(true);
        TTree *bkg = createTree(false);

        cout << "Training with " << sig->GetEntries()
             << " signal events." <<  endl;
        cout << "Training with " << bkg->GetEntries()
             << " background events." << endl;

Now we instantiate a trainer. Note that we are not using the MVATrainer class itself in this example, but rather the ROOT tree interface to the trainer, called TreeTrainer. The simplest constructor takes a signal and a background tree as arguments:

        TreeTrainer trainer(sig, bkg);

Other possibilities include passing only a single tree. In this case, the tree must contain a branch that tells the trainer whether the entry constitutes a signal or a background event. The branch must be named __TARGET__ and should ideally be a Bool_t, or any other number which is either zero or non-zero. Alternatively one can call the empty constructor and add trees by hand using the addTree(tree, [target, [weight]]) method. Again, if the target (true = signal, false = background) is not specified, a __TARGET__ branch is expected. Additionally a weight can be specified. Otherwise it defaults to 1.0 (or to the per-event value of a branch called __WEIGHT__). The last and most flexible possibility is to instantiate TreeReader instances and pass those using addReader(reader). The TreeReader is the interface used internally. It allows manual specification of branch-to-variable mappings. An example of how to use the TreeReader can be found further below.
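To make the __TARGET__ and __WEIGHT__ conventions concrete, here is a small standalone C++ sketch that sorts toy entries into signal and background by a zero / non-zero target and applies a default weight of 1.0. The types ToyEntry and WeightSums and the helper functions are hypothetical and only mimic the convention; they are not part of the TreeTrainer interface.

```cpp
#include <cassert>
#include <vector>

// One toy tree entry carrying the two special values described above.
struct ToyEntry {
    double target; // __TARGET__: zero = background, non-zero = signal
    double weight; // __WEIGHT__: per-event weight
};

// Build an entry; the weight falls back to 1.0 when it is not given.
ToyEntry makeEntry(double target, double weight = 1.0)
{
    ToyEntry entry = {target, weight};
    return entry;
}

struct WeightSums {
    double signal;
    double background;
};

// Sum the weights separately for signal and background entries.
WeightSums sumWeights(const std::vector<ToyEntry> &entries)
{
    assert(!entries.empty()); // a training sample needs at least one entry
    WeightSums sums = {0.0, 0.0};
    for (const ToyEntry &entry : entries) {
        if (entry.target != 0.0)
            sums.signal += entry.weight;
        else
            sums.background += entry.weight;
    }
    return sums;
}
```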

Now the trainer can be told to start the actual training process. But, obviously, it first needs to know /how/ to train. In the MVA framework a whole network of trainer modules (called "variable processors") can be configured, so the configuration is done through a non-trivial language described in an XML file, rather than a bunch of strings or lots of C++ code. So, in order to modify the training algorithm or the preprocessing, only the training description (the XML file) needs to be exchanged. Voilà. We'll come to the XML language later. The file testMVATrainer.xml is provided with the examples; it calls TMVA for the variables "x" and "y" and trains a neural network.

The trainer is started by calling

        Calibration::MVAComputer *calib = trainer.train("testMVATrainer.xml");

The argument points directly to the training description file. Alternatively, a pointer to an MVATrainer object that has been instantiated by the user can be passed; otherwise the TreeTrainer creates one transparently itself. Passing your own MVATrainer can be useful in several cases:
  • the MVATrainer object is created by outside code
  • manual access to the MVATrainer is desired (to call its methods, etc...)
  • one wants to iterate manually (see below)

Note the result returned by the train(...) method, an object of type PhysicsTools::Calibration::MVAComputer. The Calibration namespace indicates that this object is persistable and can be fed to an MVAComputer later. This object contains the result of the training, with all the expertise from the processed data. If you are interested, you can look in CondFormats/PhysicsToolsObjects/interface/MVAComputer.h to see how this object is constructed. All the details of the trained "variable processors" and also the layout of the trained MVAComputer can be extracted from that object. The XML file used for training is not needed for later evaluation of the calibration.

This calibration object can now be stored persistently on disk, either in a plain file or, especially important when used inside the CMSSW reconstruction framework, in the CMS conditions database. In FWLite only the first option is available, since access to the CondDB requires running in the CMSSW framework and interfacing with the EventSetup records and the PoolDBOutput service. Options for reading a standalone calibration file and writing it out again into the CondDB (central Oracle DB or a local SQLite file) are available as plugins for CMSSW. No coding is needed there either, only writing a .cfg (see the example .cfg file in the test directory).

Let's write the object into a plain file and delete the calibration object (it is created using new in the trainer, so it is our responsibility to clean up afterwards):

        MVAComputer::writeCalibration("TrainedGauss.mva", calib);

        cout << "TrainedGauss.mva written." << endl;

        // the trainer allocated the calibration with new; we own it
        delete calib;

The file TrainedGauss.mva now contains a calibration, which can be read in again and used for obtaining a discriminator from a given point in the multi-dimensional space of input variables.

Note that in the default configuration, the MVATrainer also creates some files while training. For each variable processor trained, typically one or more files are written, containing the results of the training in a machine-readable form (mostly .xml files, not a compressed binary like the final calibration object). Whether the MVATrainer creates this output can be controlled through member functions when instantiating it manually. It can also be told to read the state back from disk after an interrupted training process, to avoid reprocessing the whole sample. Have a look at the saveState() and loadState() methods.

In this particular example the file train_norm.xml contains the total distributions (histograms) of the input variables, which are used for the normalizer configured in this XML training description. Since TMVA is used in this example, the input trees passed to TMVA are written to train_tmva_input.root, the TMVA output file is available as train_tmva_output.root, and the TMVA weights file is in weights/MVATrainer_tmva_MLP.weights.txt. Note again that these files are just informational: the important information is included in the final calibration object, so they do not have to be kept around.

Continue with the section "example MVAComputer evaluation with TreeReader" below if you wish to quickly find out how to use the calibration object for classification.

A simple MVATrainer executable

A binary executable that reads ROOT trees and trains on them, essentially the same as the example above, is provided in CMSSW. After eval `scramv1 runtime -sh` the binary "mvaTreeTrainer" can be called directly. Here is its help output:

Syntax: mvaTreeTrainer <train.xml> <output.mva> <data.root>
   mvaTreeTrainer <train.xml> <output.mva> <signal.root> <background.root>

Trees can be selected as (<tree name>@)<file name>

It takes the training description as the first argument, the output file name as the second argument, and one or two ROOT trees. If one ROOT tree is specified, it has to contain the __TARGET__ branch. If two are specified, the first is implicitly the signal tree and the second implicitly the background tree. If the ROOT files contain more than one tree, the tree to use can be selected using the tree@file syntax.
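The tree@file notation can be pictured with the following standalone C++ sketch, which splits such an argument into a tree name and a file name. The function splitTreeSpec is hypothetical, and splitting at the first '@' is an assumption; this is not the parser used by mvaTreeTrainer.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Split "(<tree name>@)<file name>" into {tree name, file name}.
// An empty tree name means that no tree was selected explicitly.
std::pair<std::string, std::string> splitTreeSpec(const std::string &spec)
{
    assert(!spec.empty());
    const std::string::size_type at = spec.find('@');
    if (at == std::string::npos)
        return std::make_pair(std::string(), spec);
    return std::make_pair(spec.substr(0, at), spec.substr(at + 1));
}
```

For an argument such as signal@data.root this yields the tree name "signal" and the file name "data.root"; for a plain data.root the tree name stays empty.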

more technical MVATrainer details (training without TreeTrainer)

One does not need a ROOT tree for feeding data into the MVATrainer. In fact, the CMSSW plugins allow a user to use EDM files as input, and use an MVATrainer CMSSW plugin to loop over EDM files and pass the data using an EDAnalyzer.

The MVAComputer can consist of multiple "variable processors" that are stacked. This means that they cannot all be trained at once in a single iteration over the data. The MVATrainer will notify the user if another iteration over the data is needed or if the training has been completed. One can use the TreeTrainer::iterate() method to do exactly one iteration. The returned boolean tells whether training is complete. After that the calibration object can be retrieved via MVATrainer::getCalibration().
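Why stacked processors need several passes can be seen in a toy model. The standalone C++ sketch below has two stages, where the second one can only be trained on the output of the first. The class ToyStackedTrainer and the function trainToCompletion are hypothetical and are not the MVATrainer or TreeTrainer API; only the convention that iterate() returns true once training is complete follows the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy stand-in for a stack of two "variable processors".
class ToyStackedTrainer {
public:
    // Run one pass over the data and train the next untrained stage.
    // Returns true once every stage has been trained.
    bool iterate(const std::vector<double> &data)
    {
        assert(!data.empty());
        if (stage_ == 0) {
            // Stage 1: learn the range used to normalize to [0, 1].
            min_ = *std::min_element(data.begin(), data.end());
            max_ = *std::max_element(data.begin(), data.end());
        } else if (stage_ == 1) {
            // Stage 2: learn the mean of the *normalized* values, which
            // only exist after stage 1 has been trained.
            double sum = 0.0;
            for (double v : data)
                sum += normalize(v);
            mean_ = sum / data.size();
        }
        if (stage_ < 2)
            ++stage_;
        return stage_ == 2;
    }

    double normalize(double v) const
    {
        return max_ > min_ ? (v - min_) / (max_ - min_) : 0.0;
    }

    double normalizedMean() const { return mean_; }

private:
    int stage_ = 0;
    double min_ = 0.0, max_ = 0.0, mean_ = 0.0;
};

// Loop over the data until the trainer reports completion and return the
// number of passes that were needed.
int trainToCompletion(ToyStackedTrainer &trainer,
                      const std::vector<double> &data)
{
    int passes = 1;
    while (!trainer.iterate(data))
        ++passes;
    return passes;
}
```

In this toy, two passes over the same data are needed: one for each stage of the stack.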

Now, if we don't use the TreeTrainer because we don't have a ROOT tree as input, we need to do the iteration and feed the data by hand. (Note that there is an example that does the MVATrainer iteration by hand but uses the TreeReader to read from actual ROOT trees, see testFWLiteTrainerViaTreeReader.C.)

Have a look at testFWLiteTrainer.C:

        MVATrainer trainer("testMVATrainer.xml");

No magic here. Instantiate a trainer with the training description.

        for(;;) {
                Calibration::MVAComputer *calib = trainer.getTrainCalibration();

Start an endless loop and get the "train calibration". Note that this "train calibration" works exactly like the final calibration object, except that it does not return an actual discriminator. The only use of this train calibration is to instantiate an MVAComputer with it, which you can then feed the data. This means that you can use the exact same code as for evaluating, with the only difference that you have to pass the target information (the truth, signal or background) through the magic variable __TARGET__, which is either 1 for signal or 0 for background. Optionally, also pass a __WEIGHT__ variable with the weight (it defaults to 1 otherwise).

The pointer returned is zero when the training has been completed. In this case abort the loop:

                if (!calib)
                        break;

Now, as mentioned above, instantiate the MVAComputer and feed it the variables:

                MVAComputer computer(calib, true);
                train(&computer);

The train() method is defined further below and basically calls computer->eval(...) 20000 times for signal and background.

                trainer.doneTraining(calib);
        }

Tell the trainer that one training iteration is done. Some algorithms will now start doing the complicated work (fitting, minimum finding).

Now we can obtain the final calibration, like in our example above:

        Calibration::MVAComputer *calib = trainer.getCalibration();

        MVAComputer::writeCalibration("TrainedGauss.mva", calib);

the XML training description file

The description can be found here.

monitoring plots

Trainer monitoring is enabled with MVATrainer::setMonitoring(true) or by passing untracked bool monitoring = true to the CMSSW trainer looper.

Monitoring plots are then collected from the trainer and put in a file (typically train_monitoring.root), which contains general information, such as the variable distributions going into the variable processors, as well as information specific to the variable processors.

Running root -l ViewMonitoring.C opens a simple GUI for inspecting these plots in a reasonably preformatted way.

Let's have a look at our example:

input_norm_input_x.png input_norm_input_y.png (reachable via "input variables"/"norm")

These are the input variables to the normalizer module labelled "norm" in the configuration file.

The ProcNormalize module then derives a smoothed PDF approximation for the combined signal + background distribution of each variable and uses it to obtain a transformation function that normalizes the distribution to the range between 0 and 1. Those PDFs look like:

ProcNormalize_norm_input_x_pdf.png ProcNormalize_norm_input_y_pdf.png (reachable via "ProcNormalize"/"norm")
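The idea of such a normalizing transformation can be sketched in standalone C++ as a mapping through the cumulative distribution of a histogram. The class ToyNormalizer is hypothetical and simplified: it uses a plain histogram without smoothing, so it illustrates the principle only and is not the ProcNormalize implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Maps a variable onto [0, 1] through the cumulative distribution of a
// histogram of the combined signal + background sample.
class ToyNormalizer {
public:
    ToyNormalizer(const std::vector<double> &sample, std::size_t nBins)
        : cdf_(nBins + 1, 0.0)
    {
        assert(!sample.empty() && nBins > 0);
        min_ = *std::min_element(sample.begin(), sample.end());
        max_ = *std::max_element(sample.begin(), sample.end());
        assert(max_ > min_);

        // Fill the histogram, then accumulate it into a CDF with
        // cdf_[0] = 0 and cdf_[nBins] = 1.
        std::vector<double> hist(nBins, 0.0);
        for (double v : sample)
            hist[binOf(v, nBins)] += 1.0;
        for (std::size_t i = 0; i < nBins; ++i)
            cdf_[i + 1] = cdf_[i] + hist[i] / sample.size();
        cdf_[nBins] = 1.0; // guard against rounding
    }

    // Linear interpolation of the CDF inside the bin; values outside the
    // training range are clamped to 0 or 1.
    double operator()(double v) const
    {
        const std::size_t nBins = cdf_.size() - 1;
        if (v <= min_) return 0.0;
        if (v >= max_) return 1.0;
        const double pos = (v - min_) / (max_ - min_) * nBins;
        const std::size_t bin =
            std::min(static_cast<std::size_t>(pos), nBins - 1);
        const double frac = pos - bin;
        return cdf_[bin] + frac * (cdf_[bin + 1] - cdf_[bin]);
    }

private:
    std::size_t binOf(double v, std::size_t nBins) const
    {
        const double pos = (v - min_) / (max_ - min_) * nBins;
        return std::min(static_cast<std::size_t>(pos), nBins - 1);
    }

    double min_, max_;
    std::vector<double> cdf_;
};
```

Applied to the sample it was built from, this mapping gives an approximately flat distribution between 0 and 1, whatever the shape of the original distribution.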

After the normalization module the variables are fed into the ProcMatrix module, which does a principal components analysis to try to figure out the correlation. The output of the normalization, which is the input to ProcMatrix, is as follows:

input_rot_norm_x.png input_rot_norm_y.png (reachable via "input variables"/"rot")

One can immediately see that the input variable distributions have been transformed to give a flat distribution between 0 and 1 (signal + background combined). Peaks are washed out and tails compacted. In a two-dimensional representation the signal and background distributions are now contained in a rectangle between (0, 0) and (1, 1), with the signal concentrated in the upper right corner and the background in the bottom left corner.

In the next step, the ProcMatrix principal components analysis finds the following correlations:

ProcMatrix_rot.png (reachable via "ProcMatrix"/"rot")

Here the analyzer finds a strong correlation along the diagonal axis through the centers of both signal and background distributions (along an imaginary line through the centers of the two Gaussian blobs in the original distribution). As a result the variables will be rotated to eliminate those correlations. In this case, this means that the centers of the signal/background blobs will be aligned on one of the axes, i.e. a rotation in X-Y by 45°.
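For two variables the rotation can be written in closed form. The standalone C++ sketch below computes the angle of the first principal axis from the 2x2 covariance matrix and rotates a point accordingly. The names Point2D, principalAxisAngle and rotateToPrincipalAxes are hypothetical; this is a textbook two-dimensional PCA, not the ProcMatrix code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point2D {
    double x;
    double y;
};

// Angle (radians) of the first principal axis of a 2D sample, from the
// closed-form eigen-decomposition of the 2x2 covariance matrix.
double principalAxisAngle(const std::vector<Point2D> &points)
{
    assert(points.size() > 1);
    double meanX = 0.0, meanY = 0.0;
    for (const Point2D &p : points) {
        meanX += p.x;
        meanY += p.y;
    }
    meanX /= points.size();
    meanY /= points.size();

    double varX = 0.0, varY = 0.0, covXY = 0.0;
    for (const Point2D &p : points) {
        varX += (p.x - meanX) * (p.x - meanX);
        varY += (p.y - meanY) * (p.y - meanY);
        covXY += (p.x - meanX) * (p.y - meanY);
    }
    // The common 1/N factor cancels in the angle, so it is omitted.
    return 0.5 * std::atan2(2.0 * covXY, varX - varY);
}

// Rotate a point by -angle so the principal axis lies along the new x axis.
Point2D rotateToPrincipalAxes(const Point2D &p, double angle)
{
    const double c = std::cos(angle), s = std::sin(angle);
    return Point2D{c * p.x + s * p.y, -s * p.x + c * p.y};
}
```

For points that lie along the diagonal, with equal variances in x and y, the angle comes out as 45°, and after the rotation the second coordinate carries no spread.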

The distributions used as signal/background PDFs for the likelihood ratio in the third step then look like:

ProcLikelihood_lkh_rot_rot1.png ProcLikelihood_lkh_rot_rot2.png (reachable via "ProcLikelihood"/"lkh")

One can see that, as a result, the dimension orthogonal to the principal component has been rid of information; it is all contained in the dimension along it. Note that because of this, this particular simple example is also the best candidate for a Fisher discriminant (because all information is basically in one dimension).
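How per-variable PDFs turn into a single discriminator can be sketched in standalone C++ as follows. The function likelihoodRatio is hypothetical and assumes the common form S / (S + B), with S and B the products of the per-variable signal and background PDF values; whether ProcLikelihood uses exactly this form is not stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Combine per-variable signal and background PDF values into a
// discriminator in [0, 1] using the ratio S / (S + B), where S and B are
// the products of the per-variable PDF values.
double likelihoodRatio(const std::vector<double> &signalPdf,
                       const std::vector<double> &backgroundPdf)
{
    assert(signalPdf.size() == backgroundPdf.size());
    double s = 1.0, b = 1.0;
    for (std::size_t i = 0; i < signalPdf.size(); ++i) {
        s *= signalPdf[i];
        b *= backgroundPdf[i];
    }
    // With no information at all, fall back to the neutral value 0.5.
    return (s + b > 0.0) ? s / (s + b) : 0.5;
}
```

A variable whose signal and background PDFs are identical contributes the same factor to S and B and so does not change the result, which is why the dimension without information drops out.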

And finally the result of our MVA training can now be looked at:

output_lkh_discriminator_pdf.png output_lkh_discriminator_effs.png

Here, the discriminator distribution is shown for signal and background, as well as the corresponding cut efficiencies (meaning which fraction of signal / background events is selected at a given minimum discriminator cut).

output_lkh_discriminator_effpur.png output_lkh_discriminator_effsigbkg.png

This shows the efficiency plotted against the purity (defined as one minus the background rate at the same cut). A logarithmic variant of the same plot (with the background rate shown logarithmically on the y axis) is added to zoom into the high-purity region, which is often the interesting region in physics analyses.
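The quantities in these plots can be computed from lists of discriminator values, as in the following standalone C++ sketch. The names passRate, CutPerformance and evaluateCut are hypothetical; the purity follows the definition used in the text (one minus the background rate at the same cut).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fraction of events whose discriminator is at least `cut`.
double passRate(const std::vector<double> &discriminators, double cut)
{
    assert(!discriminators.empty());
    std::size_t passed = 0;
    for (double d : discriminators)
        if (d >= cut)
            ++passed;
    return static_cast<double>(passed) / discriminators.size();
}

struct CutPerformance {
    double signalEfficiency;
    double backgroundRate;
    double purity; // as defined in the text: 1 - background rate
};

// Evaluate one minimum discriminator cut on a signal and a background sample.
CutPerformance evaluateCut(const std::vector<double> &signal,
                           const std::vector<double> &background,
                           double cut)
{
    CutPerformance perf;
    perf.signalEfficiency = passRate(signal, cut);
    perf.backgroundRate = passRate(background, cut);
    perf.purity = 1.0 - perf.backgroundRate;
    return perf;
}
```

Scanning the cut from 0 to 1 and recording the result of evaluateCut at each step would give the points of the efficiency curves.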

Using the small ROOT macro PlotDiscr2D.C attached to this page, which uses the MVAComputer evaluation described in the next chapter, one can plot the discriminator distribution in the two-dimensional input variable space to see the output of the MVA decision:

discr.png

The central signal and background regions are clearly visible as well as the transition region in-between.

evaluation using the MVAComputer

The MVAComputer is the core of all computations in the MVA framework. It acts as a sort of dispatcher and evaluates the "variable processor" network based on the information available in an MVAComputer calibration object. These objects can be put together by hand (see PhysicsTools/MVAComputer/test for examples), but trust me, you really don't want to (unless you know exactly what you're doing). That's what the MVATrainer is for. There is one exception: it was a popular request to be able to wrap TMVA, so if you already have a TMVA weights file that you just want to use in the MVA framework, use the executable mvaConvertTMVAWeights to convert the TMVA weights file into an MVA file.

The MVAComputer is constructed from a calibration object and can subsequently evaluate the network to obtain a discriminator. This is done by calling the eval method of the MVAComputer object, which is actually a templated method. The variables can be passed in any iterable form of key-value pairs: eval can be called with one container as argument or with a begin and an end iterator. The class Variable::Value is predefined for this purpose. It is constructed with the variable name and its value. The container can be an STL vector of those, or a simple C array of them. The key is actually an AtomicId, but it can be transparently converted from any string. This is done for performance reasons (string lookups are expensive). More details can be found in the MVAComputer documentation. A default container class, Variable::ValueList, is also provided for simplicity. It has an add(key, value) method and otherwise acts mostly like a normal STL vector. Examples using this interface can be found in PhysicsTools/MVAComputer/test/testFWLiteRead.C and PhysicsTools/MVAComputer/test/testReadMVAComputerCondDB.cc.
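The performance argument for an id type can be illustrated with a toy string-interning class in standalone C++: each distinct name is looked up once and mapped to a small integer, so later comparisons are integer comparisons. The class ToyId is hypothetical and only shows the general idea; it is not how AtomicId is implemented.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Toy string interning: every distinct name is mapped once to a small
// integer, so later comparisons are plain integer comparisons.
class ToyId {
public:
    explicit ToyId(const std::string &name) : index_(intern(name)) {}
    bool operator==(const ToyId &other) const { return index_ == other.index_; }
    bool operator!=(const ToyId &other) const { return index_ != other.index_; }

private:
    static int intern(const std::string &name)
    {
        assert(!name.empty());
        static std::map<std::string, int> table;
        // insert() leaves an existing entry untouched, so a known name
        // keeps the index it was first given.
        const int next = static_cast<int>(table.size());
        return table.insert(std::make_pair(name, next)).first->second;
    }

    int index_;
};
```

The expensive string lookup happens only in the constructor, which is why such ids are best created once, outside the performance-critical loop.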

example MVAComputer evaluation with TreeReader

Here we will use another interface provided by TreeReader, which should be more familiar to ROOT users, as it works similarly to the TTree::Fill() mechanism.

Have a look at testFWLiteEvaluation.C (in PhysicsTools/MVATrainer/test). Here we go:

        MVAComputer mva("TrainedGauss.mva");

Here we instantiate the MVAComputer from the file we've just written. Note that the calibration can also be read manually using MVAComputer::readCalibration and passed to the MVAComputer as a pointer (which will be the normal use case anyway).

We will use the TreeReader convenience interface to access MVAComputer::eval through a TTree::Fill-like interface. Create the TreeReader and configure the variables and the pointers to the variables. Please do not use this in performance-critical code (instead, preallocate a Variable::ValueList of values, fill in the variable names once and then use the MVAComputer::eval interface directly).

        double x, y;

        TreeReader reader;
        reader.addSingle("x", &x);
        reader.addSingle("y", &y);

Here you can see calls to addSingle. This means that the variable is a single value. Alternatively addMulti can be used if one wishes to use variables which contain multiple values. Pass a pointer to a std::vector<...> in this case. addMulti also has a hidden third boolean parameter, which indicates whether the variable is considered omittable (missing). If this flag is set to true and the value is set to -999.0, the variable will be considered missing. This default magic value can be changed using setOptional on each variable (note that this issue only crops up when using TreeReader; when calling MVAComputer::eval manually, simply omit the variable).
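The magic-value convention can be pictured with the following standalone C++ sketch, which collects the values that would be passed on and skips optional variables that hold the magic value. The type ToyVariable and the function collectValues are hypothetical and only mimic the behaviour described above; they are not the TreeReader code.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// One registered variable: its name, a pointer to the user's value, and
// whether the magic value marks it as missing.
struct ToyVariable {
    std::string name;
    const double *value;
    bool optional;
};

// Collect the name/value pairs that would be handed to the evaluation,
// skipping optional variables that currently hold the magic value.
std::vector<std::pair<std::string, double> >
collectValues(const std::vector<ToyVariable> &variables,
              double missingValue = -999.0)
{
    std::vector<std::pair<std::string, double> > result;
    for (const ToyVariable &var : variables) {
        assert(var.value != nullptr);
        if (var.optional && *var.value == missingValue)
            continue; // treated as "not present"
        result.push_back(std::make_pair(var.name, *var.value));
    }
    return result;
}
```

A variable that is not flagged as optional is passed on even when it happens to hold -999.0, so the flag and the magic value only take effect together.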

Now, evaluation of the MVA is very simple. This example speaks for itself.

        x = 2, y = 2;
        cout << "at (+2.0, +2.0): " << reader.fill(&mva) << endl;

The reader is passed a pointer to our MVAComputer object. It passes the defined variables to it and returns a double with the discriminator.

The output from our example will look like the following:

root -b -l testFWLiteEvaluation.C
Processing testFWLiteEvaluation.C...
at (+2.0, +2.0): 0.94144
at (+0.1, +0.1): 0.554656
at (+0.0, +0.0): 0.514649
at (-0.1, -0.1): 0.474332
at (-2.0, -2.0): 0.0639852

If you look back at the image with the signal and background distributions, this is what we expect. In the middle between the two we are around 0.5 (in the middle between 0 and 1), and near the center of each blob close to 1 (signal) or 0 (background) respectively.

more about the TreeReader

The TreeReader does not only act as an interface similar to TTree::Fill(), it can also be used to read actual ROOT trees. One can call the setTree(TTree *tree) method to set a tree and then add branches using addBranch("someName"). Note that addSingle, addMulti and addBranch can also be mixed, if one wishes to read some variables from the ROOT file and provide others manually (for example the __TARGET__ variable). If the TreeReader is constructed with a ROOT tree as argument, all branches found in the file are mapped automatically (which is used by the default behaviour of TreeTrainer).

MVAComputer evaluation using the eval() method

One direct interface to MVAComputer evaluation looks like this. Variable::ValueList is just a wrapper around std::vector<Variable::Value>, because FWLite has problems otherwise. Variable::ValueList::add() just calls values.push_back(Variable::Value(key, value)) internally.

        Variable::ValueList vars;
        vars.add("toast", 4.4);
        vars.add("toast", 4.5);
        vars.add("test", 4.6);
        vars.add("toast", 4.7);
        vars.add("test", 4.8);
        vars.add("normal", 4.9);

        cout << mva.eval(vars) << endl;

As mentioned, do not pass strings in the performance-critical loop, as lookups are expensive. Do something equivalent to this instead:

        static const AtomicId idToast("toast");
        static const AtomicId idTest("test");
        static const AtomicId idNormal("normal");

        ...

        Variable::ValueList vars;
        vars.add(idToast, 4.4);
        vars.add(idToast, 4.5);
        vars.add(idTest, 4.6);
        vars.add(idToast, 4.7);
        vars.add(idTest, 4.8);
        vars.add(idNormal, 4.9);

        cout << mva.eval(vars) << endl;

Or even better, if you have a fixed number of variables and no STL vector at all, but a simple array (and begin/end pointers):

        Variable::Value vars[3] = {
                Variable::Value("toast", 0.0),
                Variable::Value("test", 0.0),
                Variable::Value("normal", 0.0)
        };

        ...
        vars[0].setValue(4.4);
        vars[1].setValue(4.5);
        vars[2].setValue(4.6);

        cout << mva.eval(vars, vars + 3) << endl;

That last example, unfortunately, only seems to work in real C++ and not in CINT.


ChristopheSaout - 10 Dec 2007 - Page author

Responsible: ChristopheSaout

Topic attachments
PlotDiscr2D.C (C source code, r1, 1.4 K, 2008-03-17 11:45, ChristopheSaout): plot discriminator distribution in two-dimensional input variable space
Topic revision: r10 - 2008-03-17 - ChristopheSaout



 