MVA Framework Trainer Offline Guide

MVATrainer documentation and CMSSW interface

The MVATrainer

The MVATrainer provides a training framework that implements a set of training algorithms and writes MVAComputer calibration objects as a result. Additionally, a set of CMSSW modules and module templates is provided to allow tight integration with the EDM framework, i.e. to facilitate automatic looping over EDM data sets.

For training, the way the data is passed to the trainer does not differ much from the way data is passed to the MVAComputer for actual evaluation. The calibration object is replaced by a ``training calibration object'' that is obtained from the MVATrainer and from which an MVAComputer has to be constructed in the same manner. After every pass over the training data the MVATrainer has to be notified so that it can update its state. After each pass it decides whether an additional pass over the data is needed or whether the final calibration object is ready, in which case it can be retrieved and stored in the CMS CondDB.

While training, additional variables have to be passed alongside the input variables. The target information, a boolean value, is mandatory and tells the trainer whether the input variables describe a signal or a background event. Optionally, an event weight can be passed. This is important if the signal and background events are not homogeneously and equally distributed, since a balanced sample is necessary to obtain a good training.

The detailed configuration of the targeted MVAComputer and the training parameters are described in a user-defined training definition file, written in XML syntax.

The low-level interface to the MVATrainer (as opposed to using one of the provided ESProducerLoopers) is briefly illustrated in the following.

The PhysicsTools::MVATrainer class is instantiated and passed the filename of the XML training description. Then the method getTrainCalibration() is called to retrieve a calibration suitable for constructing an MVAComputer. This MVAComputer then should be fed the training data with target information. Note that in order to notify the trainer that a training run is completed, the calibration object should either be destroyed or the trainer explicitly notified by calling doneTraining() and passing the training calibration object in question (in the example below the auto_ptr takes care of destroying the calibration object at the end of the loop iteration). Then the process should be repeated until no calibration object is returned. In this case, a call to getCalibration() should return the final calibration object.

#include <assert.h>
#include <memory>

#include "CondFormats/PhysicsToolsObjects/interface/MVAComputer.h"
#include "PhysicsTools/MVAComputer/interface/Variable.h"
#include "PhysicsTools/MVAComputer/interface/MVAComputer.h"
#include "PhysicsTools/MVATrainer/interface/MVATrainer.h"

using namespace PhysicsTools;

// instantiate trainer with training description
MVATrainer trainer("testMVATrainer.xml");

// loop over data until training is complete
for(;;) {
        // retrieve training calibration from trainer 
        std::auto_ptr<Calibration::MVAComputer> calib(
                                              trainer.getTrainCalibration());

        // if no training calibration is available, training is complete
        if (!calib.get())
                break;

        // construct MVAComputer from training calibration
        std::auto_ptr<MVAComputer> computer(new MVAComputer(calib.get()));

        // call our train method with MVAComputer instance
        trainWithData(computer.get());
}

// retrieve final calibration
Calibration::MVAComputer *calib = trainer.getCalibration();

// if it's not set something has gone wrong
assert(calib != 0);

// store calibration
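
Since std::auto_ptr was deprecated in C++11 and removed in C++17, the same ownership pattern can also be written with std::unique_ptr. The following is a minimal, self-contained sketch of the pass loop only. ToyTrainer and ToyCalibration are stand-ins invented for illustration, not the CMSSW classes, and the number of passes is simply assumed to be fixed in advance.

```cpp
#include <cassert>
#include <memory>

// Stand-in for a training calibration object; NOT the CMSSW class.
struct ToyCalibration { int pass; };

// Stand-in trainer that needs a fixed number of passes over the data.
class ToyTrainer {
public:
    explicit ToyTrainer(int passesNeeded) : remaining_(passesNeeded), done_(0) {}

    // Returns a training calibration while passes remain, otherwise null.
    std::unique_ptr<ToyCalibration> getTrainCalibration() {
        if (remaining_ == 0)
            return nullptr;
        return std::unique_ptr<ToyCalibration>(new ToyCalibration{done_});
    }

    // Notifies the trainer that one pass over the data is complete.
    void doneTraining(const ToyCalibration &) { --remaining_; ++done_; }

    int passesDone() const { return done_; }

private:
    int remaining_;
    int done_;
};

// Runs the pass loop and returns the number of passes performed.
int runTraining(ToyTrainer &trainer) {
    for (;;) {
        std::unique_ptr<ToyCalibration> calib = trainer.getTrainCalibration();
        if (!calib)
            break;  // no training calibration left: training is complete
        // ... feed all training events to the computer here ...
        trainer.doneTraining(*calib);  // explicit notification
    }  // calib is destroyed at the end of each iteration
    return trainer.passesDone();
}
```

Calling runTraining() on a ToyTrainer constructed with 3 performs three passes and then stops because no further training calibration is handed out.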

The trainWithData() method looks similar to a regular MVAComputer evaluation, with the target and the optional weight information passed in addition. It is crucial that every iteration passes an identical number of events, or the trainer might get confused. Note that the MVAComputer's eval() method simply returns the target information when training calibrations are used.

void trainWithData(MVAComputer *mva)
{
        // loop over all events
        for(...) {
                std::vector<Variable> input;

                bool target = event->isSignal();        // mandatory
                double weight = 1.0;                    // optional 

                input.push_back(Variable(MVATrainer::kTargetId, target));
                input.push_back(Variable(MVATrainer::kWeightId, weight));

                input.push_back(Variable("pt", event->pt));
                input.push_back(Variable("eta", event->eta));
                input.push_back(Variable("deltaR", event->deltaR));
                input.push_back(Variable("isolation", event->isolation));

                mva->eval(input);
        }
}
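
The requirement that every pass delivers the same number of events can be guarded with a small counter. The sketch below is one possible way to do this in user code. PassGuard is a hypothetical helper invented here and is not part of the MVATrainer package.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical helper: remembers the event count of the first pass and
// rejects any later pass that delivered a different number of events.
class PassGuard {
public:
    PassGuard() : expected_(0), havePass_(false) {}

    // Call once at the end of every pass with the number of events fed.
    void endPass(std::size_t nEvents) {
        if (!havePass_) {
            expected_ = nEvents;
            havePass_ = true;
            return;
        }
        if (nEvents != expected_)
            throw std::runtime_error(
                "event count differs between training passes");
    }

private:
    std::size_t expected_;
    bool havePass_;
};
```

Counting the events inside trainWithData() and calling endPass() after each loop iteration would turn a silently inconsistent training into an immediate error.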

The performance considerations that apply to the MVAComputer should also be respected here (the example is suboptimal). The target and weight identifier constants in MVATrainer are predefined AtomicId objects. Their string representations are "__TARGET__" and "__WEIGHT__".
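
One plausible reading of the performance remark is that identifiers should not be rebuilt from strings, and the input vector should not be reallocated, for every event. The sketch below illustrates that idea with stand-in types only. ToyId, ToyVariable and ToyEvent are invented for this example and are not the CMSSW AtomicId or Variable classes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-ins, NOT the CMSSW types: an identifier built once from a string
// and a (identifier, value) pair.
struct ToyId {
    std::string name;
    explicit ToyId(const char *n) : name(n) {}
};
struct ToyVariable { const ToyId *id; double value; };
struct ToyEvent { bool isSignal; double pt, eta; };

// Fills `input` for every event, reusing both the identifiers and the
// vector storage. Returns the total number of variables filled.
std::size_t fillAll(const std::vector<ToyEvent> &events,
                    std::vector<ToyVariable> &input) {
    // identifiers are constructed once, not once per event
    static const ToyId kTarget("__TARGET__"), kWeight("__WEIGHT__");
    static const ToyId kPt("pt"), kEta("eta");

    std::size_t filled = 0;
    input.reserve(4);
    for (const ToyEvent &ev : events) {
        input.clear();  // keeps the allocated capacity
        input.push_back({&kTarget, ev.isSignal ? 1.0 : 0.0});
        input.push_back({&kWeight, 1.0});
        input.push_back({&kPt, ev.pt});
        input.push_back({&kEta, ev.eta});
        // the call to eval(input) would go here
        filled += input.size();
    }
    return filled;
}
```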

An optional feature of the MVATrainer is that the current state of a training can be stored on disk. This state can then be inspected by the user or reloaded into the MVATrainer. This way it is possible to keep the state after a calibrated preprocessing procedure and save some CPU time when experimenting with different MVA algorithms that share the same preprocessing. Also, if a fully trained MVAComputer is saved this way, the state can be reloaded and stored into the CondDB without needing to touch the training data at all (two CMSSW modules are provided for this).

One or more files are typically written by each variable processor in the MVATrainer, except for variable processors that don't need to be trained (ProcCount, ProcForeach, ProcSplitter and ProcOptional). For most variable processors this state is stored as an XML file, which contains the histograms, covariance matrices and similar quantities that are used while training. The information stored there might exceed what is stored in the final calibration object (e.g. ProcMatrix additionally stores the covariance matrix that is used to compute the final rotation matrix). The format of the filenames is defined in the training description.

The PhysicsTools::MVATrainer class has two methods, saveState() and loadState(), to explicitly trigger the save and load procedure. saveState() saves all trained processors. loadState() loads the state into the variable processors for which files are found; processors whose files are missing are left untrained and can subsequently be trained using the normal mechanism described above.
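
The effect of a partial reload can be pictured with a small model. The sketch below is not the MVATrainer implementation. It only mirrors the rule stated above, using a hypothetical function untrainedAfterLoad() and plain processor ids in place of real state files.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of the described loadState() behaviour: processors
// with a saved state are restored, all others remain untrained.
// Returns the ids of the processors that still need training.
std::vector<std::string> untrainedAfterLoad(
        const std::vector<std::string> &processors,
        const std::set<std::string> &savedStates) {
    std::vector<std::string> untrained;
    for (const std::string &id : processors) {
        if (savedStates.count(id) == 0)
            untrained.push_back(id);  // no state found: train as usual
    }
    return untrained;
}
```

With processors "norm" and "tmva" and a saved state only for "norm", just "tmva" is left to train, which is the case of reusing a preprocessing step while trying a different MVA algorithm.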

The additional setAutoSave() and setCleanup() methods can be called after instantiating the MVATrainer. Each is passed a boolean flag to turn the feature on or off. setAutoSave(true) activates auto-saving: as soon as a variable processor has been trained, it automatically saves its state to disk. This way, the execution of the trainer can be interrupted without losing the whole state information. Some variable processors produce temporary files while training, which are automatically cleaned up afterwards unless cleanup is turned off. Cleanup is turned on by default and is implicitly deactivated by a setAutoSave(true) call.

Persistent storage of MVAComputer CondDB objects

The resulting PhysicsTools::Calibration::MVAComputer objects can be stored to the CMS CondDB using the PoolDBOutputService of CMSSW. The configuration file snippet needed for this looks like:

service = PoolDBOutputService {
        string connect = "sqlite_file:MVAComputerObjects.db"
        string timetype = "runnumber"
        untracked uint32 authenticationMethod = 1
        untracked string catalog = "file:mycatalog.xml"

        VPSet toPut = {
                {
                        string record = "SomeRcd"
                        string tag = "Foobar_tag"
                }
        }

        PSet DBParameters = {
                untracked string authenticationPath = "."
                untracked int32 messageLevel = 0
                untracked bool loadBlobStreamer = true
        }
}

Storage of multiple objects can be specified by adding multiple toPut parameter sets. As when retrieving a calibration object, the record has to match the name of the C++ record class. The tag is a string identifier that is used to identify the object in question when retrieving it out of multiple versions. In general, the parameters should match those used when retrieving the object with PoolDBESSource.

In case sqlite files are being used for storage, additional objects and tags can be added to an existing database, but tags cannot be replaced. If the file does not exist, it will be created.

An ObjectRelationPOOLMapping XML file is available for the MVAComputer calibration classes. This mapping file can be used to create the POOL SQL databases before storing actual CondDB objects into them. The mapping file provides nicer SQL table and column names than the automatically created POOL defaults and also uses the BLOB streamer for large data vectors to reduce unnecessary clutter and size. Unfortunately, as of CMSSW 1.5.0_pre5 the BLOB streamer still seems to be broken for this purpose.

The procedure to call the PoolDBOutputService from C++ is as follows:

#include "FWCore/Framework/interface/IOVSyncValue.h"
#include "FWCore/ServiceRegistry/interface/Service.h"
#include "CondCore/DBOutputService/interface/PoolDBOutputService.h"
#include "CondFormats/PhysicsToolsObjects/interface/MVAComputer.h" 

// write PhysicsTools::Calibration::MVAComputer object in calib (see above)
calib = trainer->getCalibration();

// retrieve edm service
edm::Service<cond::service::PoolDBOutputService> dbService;
if (!dbService.isAvailable())
        return; // service is not available, throw exception or something

// schedule object for storage
dbService->createNewIOV<Calibration::MVAComputer>(
                                calib, dbService->endOfTime(), "SomeRcd");

Calling createNewIOV creates a new interval of validity (IOV) range for the given record. The dbService->endOfTime() method makes sure the interval is unlimited (i.e. a dummy, always valid IOV is created). The record name has to match a definition in the configuration file or the call will fail. The createNewIOV method will take over ownership of the calibration object and write it out at the end of the CMSSW execution and destroy it afterwards. The PoolDBOutputService is only available inside CMSSW modules, so a suitable place for the call above would be e.g. in the endJob() method of an EDAnalyzer.

In case a MVAComputerContainer is to be used to store multiple MVAComputer calibration objects, the procedure is the same. An empty PhysicsTools::Calibration::MVAComputerContainer has to be created and the individual computers added using the add() method.

In case no data has to be processed by the EDM framework, a dummy CMSSW configuration file has to be created. This means adding an empty source:

source = EmptySource {
        untracked uint32 firstRun = 1
        untracked uint32 numberEventsInRun = 1
}

untracked PSet maxEvents = {
        untracked int32 input = 1
}

CMSSW framework training modules

The MVATrainer package also provides a CMSSW module template that can take over the looping over EDM data. This way a more or less traditional EDAnalyzer can be used for passing the training data. Retrieval of the training calibration object is then done similarly to that of the actual calibration object when evaluating an MVAComputer.

This module template has to be instantiated with the C++ record class and registered as an EDM plugin.

The PhysicsTools/MVATrainer/interface/MVATrainerLooperImpl.h header defines two template classes, PhysicsTools::MVATrainerLooperImpl and PhysicsTools::MVATrainerContainerLooperImpl, that can be instantiated with a CondDB record class and define CMSSW modules. Consider this example Subsystem/Package/plugins/module.cc:

#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/Framework/interface/LooperFactory.h"
#include "CondFormats/DataRecord/interface/SomeRcd.h"
#include "PhysicsTools/MVATrainer/interface/MVATrainerLooperImpl.h"
#include "PhysicsTools/MVATrainer/interface/MVATrainerSaveImpl.h"

// trainer helpers
using namespace PhysicsTools;

// the name passed to the macro has to match the typedef
typedef MVATrainerLooperImpl<SomeRcd> SomeMVATrainerLooper;
DEFINE_FWK_LOOPER(SomeMVATrainerLooper);

typedef MVATrainerSaveImpl<SomeRcd> SomeMVATrainerSaver;
DEFINE_FWK_MODULE(SomeMVATrainerSaver);

and the accompanying BuildFile:

<use name=FWCore/Framework>
<use name=FWCore/ParameterSet>
<use name=FWCore/ServiceRegistry>
<use name=FWCore/Utilities>

<library file="module.cc" name="SomeMVATrainerPlugins">
   <use name=CondFormats/DataRecord>
   <use name=PhysicsTools/MVATrainer>
   <flags EDM_PLUGIN="1">
</library>

The strings and plugin names containing the substring ``Some'' in the above example are arbitrary and should be replaced by names matching the concrete implementation. The above example also defines a module for an MVATrainerSaver, which is the EDAnalyzer that does the actual storage of the final calibration object to the CondDB. For the PhysicsTools::Calibration::MVAComputerContainer objects, respective MVATrainerContainerLooperImpl and MVATrainerContainerSaveImpl templates are also available in the same header files.

In order to use the EDLooper plugin just defined, the following definitions in the CMSSW config file are required. Note that the config file snippet contains parts for both a plain MVATrainerLooper/Saver and an MVATrainerContainerLooper/Saver. Please look at the comments.

module someMVATrainer = SomeMVATrainer {
        # some EDAnalyzer trainer module to be implemented by the user
        # should retrieve training calibration from ESProducerLooper  
        # from event setup, instantiate the MVAComputer and pass
        # the input variables and target information
}
 
module someMVATrainerSaver = SomeMVATrainerSaver {
        # for PhysicsTools::Calibration::MVAComputer:

        # empty

        # ----8<----8<----8<----8<----8<----8<----8<----8<----8<----
        # for PhysicsTools::Calibration::MVAComputerContainer:

        vstring toCopy = {
                # list the calibration labels in container from
                # PoolDBESSource to copy over to the new container
                "oldLabel"
        }
        vstring toPut = {
                # list the calibration labels from completed trainings
                # to store to the new container
                "label1", "label2"
        }
}
 
looper = SomeMVATrainerLooper {
        # for PhysicsTools::Calibration::MVAComputer:

        untracked string trainDescription = "SomeMVATrainingDescription.xml"
        untracked bool loadState = false
        untracked bool saveState = false

        # ----8<----8<----8<----8<----8<----8<----8<----8<----8<----
        # for PhysicsTools::Calibration::MVAComputerContainer:

        VPSet trainers = {
                {
                        # the calibrationRecord should match toPut
                        # in the MVATrainerContainerSaveImpl<SomeRcd>
                        string calibrationRecord = "label1"

                        # as in the plain MVATrainerLooper (see above)
                        untracked string trainDescription =
                                        "SomeMVATrainingDescriptionLabel1.xml"
                },
                { 
                        string calibrationRecord = "label2"
                        untracked string trainDescription =
                                        "SomeMVATrainingDescriptionLabel2.xml"
                }
                # ...
        }
}
 
path p = {
        # ...
        someMVATrainer
}
 
endpath outpath = { someMVATrainerSaver }

The two MVATrainerContainer plugins also support copying existing MVAComputer calibrations inside a container to the new database. For this to work, the PoolDBESSource containing the old calibration has to be added to the config file, together with an es_prefer statement that gives it precedence over the ESProducer defined by the looper. The labels inside the container for which the MVAComputer calibrations are to be copied have to be listed in the toCopy parameter of the MVATrainerSaver plugin.

The MVATrainer training description file

The MVATrainer training description file is the central steering configuration for the trainer. It contains both the desired MVAComputer layout and the settings for the preprocessing and MVA algorithm. The training description file is formulated in XML. The root node of the document is an ``MVATrainer'' tag. This tag contains a mandatory ``general'' tag, a mandatory ``input'' tag, a number of ``processor'' tags and a final ``output'' tag. These define some general options, the input variables, the variable processors and the MVAComputer output variable, respectively. The details are exemplified by the following simple configuration file:

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<MVATrainer>
        <general>
                <option name="id">SomeMVATrainer</option>
                <option name="trainfiles">train_%1$s%2$s.%3$s</option>
        </general>
        <input id="input">
                <var name="x" multiple="false" optional="false"/>
                <var name="y" multiple="false" optional="false"/>
        </input>
        <processor id="norm" name="ProcNormalize">
                <input>
                        <var source="input" name="x"/>
                        <var source="input" name="y"/>
                </input>
                <config>
                        <pdf/>
                        <pdf/>
                </config>
                <output> 
                        <var name="x"/>
                        <var name="y"/>
                </output>
        </processor>
        <processor id="tmva" name="ProcTMVA">
                <input>
                        <var source="norm" name="x"/>
                        <var source="norm" name="y"/>
                </input>
                <config>
                        <method type="MLP" name="MLP">!V:NCycles=50:HiddenLayers=5:TestRate=10</method>
                </config>
                <output> 
                        <var name="discriminator"/>
                </output>
        </processor>
        <output>
                <var source="tmva" name="discriminator"/>
        </output>
</MVATrainer>

The ``general'' section found at the beginning currently contains a number of options, which are a list of key-value pairs defined as <option name="key">value</option>. The currently available options are a name for the training description and the filename template for the optionally storable state information of the variable processors:

        <option name="id">SomeMVATrainer</option>
        <option name="trainfiles">train_%1$s%2$s.%3$s</option>

The positional parameter %1$s is replaced by the name of the variable processor, %2$s optionally by some additional identifier like "_input" and %3$s by the filename extension. It is suggested to put the name of the training description into the filenames to avoid filename clashes when experimenting with different trainings.
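
The template uses the positional conversion syntax of POSIX printf. The following sketch shows how such a template expands. It assumes a POSIX C library (positional %n$s conversions are not part of ISO C), and stateFileName() is a hypothetical helper written for this illustration, not a function of the MVATrainer package.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Expands a printf-style template with positional %n$s parameters.
// Hypothetical helper; relies on POSIX positional conversions.
// Returns an empty string if formatting fails or the result is too long.
std::string stateFileName(const char *tmpl,
                          const std::string &processor,
                          const std::string &extra,
                          const std::string &extension) {
    char buf[256];
    int n = std::snprintf(buf, sizeof buf, tmpl,
                          processor.c_str(),   // %1$s
                          extra.c_str(),       // %2$s
                          extension.c_str());  // %3$s
    if (n < 0 || static_cast<std::size_t>(n) >= sizeof buf)
        return std::string();
    return std::string(buf);
}
```

With the template from the example, a processor named "norm", no additional identifier and the extension "xml" give train_norm.xml, while the additional identifier "_input" gives train_norm_input.xml.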

The ``input'' section contains the list of input variables and their flags. This is a simple list of variable names, each with two boolean attributes: ``optional'' indicates whether the variable may be absent, and ``multiple'' indicates whether the variable consists of exactly one value or can have multiple values.

        <input id="input">
                <var name="x" multiple="false" optional="false"/>
                <var name="y" multiple="false" optional="false"/>
        </input>

The variable processor definitions are added inside a ``processor'' tag. The attributes contain some unique identifier ``id'' within the training description and the name of the C++ class for the variable processor.

Every variable processor then has an ``input'' section, a ``config'' section and an ``output'' section. The input section describes the input variable set for the variable processor using ``var'' tags. Each variable has a ``source'' attribute to indicate from which previous variable processor the variable is selected. The global MVAComputer input variables use the magic source name "input". The ``name'' attribute selects the variable from the selected source.

The ``config'' section then defines the details of the variable processor configuration and training parameters. The different training configuration directives for the individual variable processors are described below. Note that the input variables have to be defined in the strict order of appearance within the training description.

The final ``output'' tag defines the output variables for that variable processor. Similar to the input variables, the names of the output variables are defined here, without the ``source'' attribute.

        <processor id="norm" name="ProcNormalize">
                <input>
                        <var source="input" name="x"/>
                        <var source="input" name="y"/>
                </input>
                <config>
                        <pdf/>
                        <pdf/>
                </config>
                <output> 
                        <var name="x"/>
                        <var name="y"/>
                </output>
        </processor>
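
The (source, name) addressing scheme implies that a processor can only refer to variables that were declared before it. The sketch below checks this property on a simplified in-memory model of a training description. ToyProcessor, VarRef and resolves() are invented for this illustration and say nothing about how the MVATrainer parses or validates the XML internally.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, std::string> VarRef;  // (source, name)

// Simplified stand-in for one ``processor'' block of the description.
struct ToyProcessor {
    std::string id;
    std::vector<VarRef> inputs;
    std::vector<std::string> outputs;
};

// Returns true if every processor input refers to a global input
// variable or to an output of a processor declared earlier.
bool resolves(const std::vector<std::string> &globalInputs,
              const std::vector<ToyProcessor> &procs) {
    std::set<VarRef> known;
    for (const std::string &name : globalInputs)
        known.insert(VarRef("input", name));  // magic source name

    for (const ToyProcessor &p : procs) {
        for (const VarRef &ref : p.inputs) {
            if (known.count(ref) == 0)
                return false;  // refers to something not yet declared
        }
        for (const std::string &out : p.outputs)
            known.insert(VarRef(p.id, out));
    }
    return true;
}
```

Modelled this way, the example description with "norm" followed by "tmva" resolves, while listing "tmva" before "norm" does not.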

The final ``output'' tag for the MVATrainer then selects the variable used as global MVAComputer output variable. The syntax is identical to the input variable declaration for the variable processors.

        <output>
                <var source="tmva" name="discriminator"/>
        </output>

The variable processor training directives

For details about what each variable processor does, please refer to the table above.

ProcCount:

This variable processor trainer does not need an entry in the config section. The output variables have to match the input variables.

ProcForeach:

The only configuration option is the number n of directly following variable processors to include into the ProcForeach loop:

        <procs next="n"/>

The output variables have to match the input variables.

ProcSplitter:

The ProcSplitter defines one configuration variable that selects the number n of instances to separate off from the input variables.

        <select first="n"/>

more...

This part is not finished. Basically all variable processors described in the MVAComputer guide can be configured here. If you need a certain one, just contact me and I will fix up the documentation on how to configure this one.


ChristopheSaout - 06 Dec 2007 - page author

Responsible: ChristopheSaout

Topic revision: r4 - 2008-02-26 - ChristopheSaout



 