Thinned Collections and Creating a Thinned Collection Producer

Goal: Users can reduce the size of their data by using thinned collections. This page describes thinned collections and how to produce them.

Introduction

Collections in CMS data can be quite large and sometimes we are only interested in a small subset of the elements of the collection. For example, consider a hit collection. If we have selected high transverse momentum tracks, we may only be interested in the hits those tracks reference. A thinned collection is made from a master collection by selecting the interesting elements and copying them into a new smaller collection. For example, a collection of hits referenced by high transverse momentum tracks would be a thinned collection.

Products can contain references into collections in the form of a Ref, Ptr, RefToBase, RefVector, PtrVector, RefToBaseVector, or other types derived from them. If one makes a thinned collection as described below, then there is special support in the Framework that will enable those references to continue to function and find the objects they point to when the master collection has been dropped and only the thinned collection kept. Without that special support, the fact that the ProductID and indexes into the collection change will break all the references.

This special support for thinned collections was added in CMSSW_7_3_X_2014-09-26-0200 and does not exist in earlier releases.

Creating the Producer

You create a producer for a thinned collection by defining a selector class and using that as a template parameter to the ThinningProducer template class. Here is how that looks:

#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/Framework/interface/ConsumesCollector.h"
#include "FWCore/Framework/interface/stream/ThinningProducer.h"
#include "FWCore/Framework/interface/Frameworkfwd.h"
#include "DataFormats/TestObjects/interface/ThingCollectionfwd.h"
#include "DataFormats/TestObjects/interface/Thing.h"

namespace edm {
  class ParameterSetDescription;
}

namespace {
  class MySelector {
  public:
    MySelector(edm::ParameterSet const& pset, edm::ConsumesCollector&& cc);
    static void fillDescription(edm::ParameterSetDescription & desc);
    void preChoose(edm::Handle<edmtest::ThingCollection> tc, edm::Event const& event, edm::EventSetup const& es);
    bool choose( unsigned int iIndex, edmtest::Thing const& iItem);
  };

  MySelector::MySelector(edm::ParameterSet const& pset, edm::ConsumesCollector&& cc) {
    // Get the values of parameters that you need

    // Note that you should NOT declare the thinned collection or association that
    // will be produced or the master collection that will be consumed. The
    // edm::ThinningProducer template does that automatically. You will need
    // to declare anything else that you consume.
  }

  void MySelector::fillDescription(edm::ParameterSetDescription & desc) {
    // The inputTag parameter is added automatically by the template
    // and if it is the only parameter then leave this function body empty. 

    // Add any additional parameters for the selector here to the ParameterSetDescription.
    // For example, if you want to read a track collection in the selector, maybe this    
    // desc.add<edm::InputTag>("trackTag");
  }

  void MySelector::preChoose(edm::Handle<edmtest::ThingCollection> tc, edm::Event const& event, edm::EventSetup const& es) {
    // Get whatever information you might need in the choose function.
    // For example if thinning a hit collection, one might get the high transverse
    // momentum tracks here and record the keys to the hits they reference.
  }

  bool MySelector::choose( unsigned int iIndex, edmtest::Thing const& iItem) {
    // Return true if you select a particular element of the master collection
    // and want it saved in the thinned collection. Otherwise return false.
    return true;
  }
}
typedef edm::ThinningProducer<edmtest::ThingCollection, MySelector> MyProducer;
DEFINE_FWK_MODULE(MyProducer);

In the above sample code, you would need to replace "ThingCollection" and "Thing" with the type of the collection you want to thin and the type of the elements it contains. You would also need to replace "MySelector" and "MyProducer" with names you make up. Then replace the comments with whatever logic you need to select the elements to be copied to the thinned collection.

You can find a more fleshed out example in FWCore/Integration/test/ThinningThingProducer.cc. This code is run in a unit test and so has extra things in it only needed for the test. But it does show in more detail how one might implement the functions. The details of the logic needed for the selection could be very different in different cases.
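As one illustration of such selection logic, here is a minimal standalone sketch of the hit and track example from the introduction. It does not use CMSSW at all: Hit, Track, HighPtHitSelector, and thinHits are hypothetical names made up for this sketch, and preChoose here takes the tracks directly instead of a Handle, an Event, and an EventSetup. It assumes the producer calls preChoose once per event and then choose once per element of the master collection, as the comments in the sample above suggest.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical, simplified stand-ins; these are NOT CMSSW data formats.
struct Hit {
  double charge;
};
struct Track {
  double pt;
  std::vector<std::size_t> hitKeys;  // indexes into the master hit collection
};

// Assumed selector shape, loosely mirroring preChoose/choose in the sample above.
class HighPtHitSelector {
public:
  explicit HighPtHitSelector(double minPt) : minPt_(minPt) { assert(minPt_ >= 0.0); }

  // Called once per event: record the keys of hits used by high-pt tracks.
  void preChoose(std::vector<Track> const& tracks) {
    keysToKeep_.clear();
    for (Track const& track : tracks) {
      if (track.pt >= minPt_) {
        keysToKeep_.insert(track.hitKeys.begin(), track.hitKeys.end());
      }
    }
  }

  // Called once per master element: keep the hit if its key was recorded.
  bool choose(std::size_t index, Hit const& /*hit*/) const { return keysToKeep_.count(index) != 0; }

private:
  double minPt_;
  std::set<std::size_t> keysToKeep_;
};

// Stand-in for the loop a thinning producer would run (an assumption, not the real template).
std::vector<Hit> thinHits(std::vector<Hit> const& master,
                          std::vector<Track> const& tracks,
                          HighPtHitSelector& selector) {
  selector.preChoose(tracks);
  std::vector<Hit> thinned;
  for (std::size_t i = 0; i < master.size(); ++i) {
    if (selector.choose(i, master[i])) {
      thinned.push_back(master[i]);
    }
  }
  return thinned;
}
```

Doing the track loop in preChoose keeps choose cheap, since choose runs once for every element of the master collection.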

Configuration

You must configure the producer to run. There is nothing special about this. Configure it the same way you would configure any producer.

process.myProducerModuleLabel = cms.EDProducer("MyProducer",
    inputTag = cms.InputTag('myMasterCollectionModuleLabel')
)

The "inputTag" parameter is required. The template class adds it to the ParameterSetDescription automatically. If it is the only parameter, you can leave the body of the fillDescription function empty. If you want to use more configuration parameters in your selector, you must add them in the fillDescription function as described in SWGuideConfigurationValidationAndHelp.

Normally, you will configure the output modules to keep the thinned collection and drop the master collection. If this is done, the Framework automatically makes the Refs and Ptrs continue to work and get the elements from the thinned collection if they are there. If you have getByToken calls that directly access the data in these collections, you will have to adjust the InputTags to specify the collection that is kept. The master and thinned collections have the same type, but their module label, instance name, and process name will not all be the same.

There is an additional product the Framework produces and uses when dereferencing Refs and Ptrs which has the type "ThinnedAssociation". The user should not need to use this, only the Framework. Keeping and dropping of the ThinnedAssociation type is handled automatically by the Framework based on which thinned collections are kept. Any keep or drop statements related to it in the configuration will be ignored.

Performance

Dereferencing a Ref or Ptr to an element of a thinned collection takes more CPU time than dereferencing when the master collection is present. When dereferencing, the Framework goes through several steps:

  1. It looks for the master collection and uses it if it is found.
  2. Otherwise, it looks in a table in the Framework that relates the BranchIDs of thinned collections, master collections, and ThinnedAssociations.
  3. It loops over the ThinnedAssociation products related to that master collection. Each ThinnedAssociation contains a vector of indices that associates each element in the thinned collection with an element in the master collection.
  4. If the desired element is in a thinned collection, it tries to get that collection. The loop stops if the collection is found.
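The steps above can be sketched in standalone C++. This is a simplified model and not the Framework implementation: Association, ThinnedEntry, and resolve are hypothetical names, the BranchID table is collapsed into a plain vector of entries for one master collection, and the index vector is assumed to be stored in ascending order (which follows if elements are copied in master order).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of a ThinnedAssociation: masterIndexes[k] is the index in
// the master collection of element k of the thinned collection (ascending).
struct Association {
  std::vector<std::size_t> masterIndexes;
};

// One thinned collection made from the master, plus its association.
template <typename T>
struct ThinnedEntry {
  Association association;
  std::vector<T> const* thinned;  // nullptr if this collection was dropped
};

// Resolve an index into the master collection to an element, or nullptr.
template <typename T>
T const* resolve(std::vector<T> const* master,
                 std::vector<ThinnedEntry<T>> const& entries,
                 std::size_t masterIndex) {
  // Step 1: use the master collection if it is available.
  if (master != nullptr) {
    return masterIndex < master->size() ? &(*master)[masterIndex] : nullptr;
  }
  // Steps 2-3: loop over the associations related to this master collection.
  for (ThinnedEntry<T> const& entry : entries) {
    std::vector<std::size_t> const& idx = entry.association.masterIndexes;
    auto it = std::lower_bound(idx.begin(), idx.end(), masterIndex);
    if (it == idx.end() || *it != masterIndex) {
      continue;  // this thinned collection does not hold the element
    }
    // Step 4: the element was kept; stop if the collection can be read.
    if (entry.thinned != nullptr) {
      std::size_t const pos = static_cast<std::size_t>(it - idx.begin());
      assert(pos < entry.thinned->size());
      return &(*entry.thinned)[pos];
    }
  }
  return nullptr;
}
```

In this model, every lookup with the master collection missing costs one search per associated thinned collection, which is the extra CPU the text refers to.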

It is possible to make thinned collections from thinned collections repeatedly if one wants. It is also possible to make multiple thinned collections from one master collection. The machinery should handle these situations without difficulty, searching through multiple levels of thinning and looping over all the thinned collections associated with a particular master collection. The only downside is the additional CPU time needed to look for the thinned collections holding the desired elements.
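A rough way to picture the multi-level search is a recursive walk over a tree of collections. The sketch below is only a model under stated assumptions: Node and find are hypothetical names, elements are plain ints, and each node stores the ascending parent indexes of its elements. The Framework's actual data structures are different.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical node: one collection (master or thinned) in a thinning tree.
struct Node {
  std::vector<int> const* data;            // nullptr if the collection was dropped
  std::vector<std::size_t> parentIndexes;  // index in the parent of each element (ascending); empty for the master
  std::vector<Node const*> children;       // collections thinned from this one
};

// Find the element that had position 'index' in 'node', searching this
// collection first and then, recursively, every collection thinned from it.
int const* find(Node const& node, std::size_t index) {
  if (node.data != nullptr) {
    return index < node.data->size() ? &(*node.data)[index] : nullptr;
  }
  for (Node const* child : node.children) {
    assert(child != nullptr);
    std::vector<std::size_t> const& idx = child->parentIndexes;
    auto it = std::lower_bound(idx.begin(), idx.end(), index);
    if (it != idx.end() && *it == index) {
      // Translate to the child's own indexing and descend one level.
      std::size_t const childIndex = static_cast<std::size_t>(it - idx.begin());
      if (int const* found = find(*child, childIndex)) {
        return found;
      }
    }
  }
  return nullptr;
}
```

Each extra level of thinning adds one more index translation, and each extra sibling adds one more search, which matches the CPU cost described above.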

When dereferencing a Ref or Ptr to a master collection, the result is cached and the lookup only needs to be done once. This has been true since before thinned collections were implemented, and it remains true when the master collection is present. For a Ptr, this caching also occurs when the element is in a thinned collection. For a Ref, there is no caching when the element is in a thinned collection, so in many cases Refs will perform worse than Ptrs with thinned collections. The internal design of Ref and especially RefVector makes this caching impossible.
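The cost difference can be illustrated with a toy model. The two classes below are hypothetical and are not edm::Ptr or edm::Ref; they only contrast a handle that remembers the result of its first lookup with one that repeats the lookup on every access. The Lookup callback stands in for the search through thinned collections, and the sketch ignores thread safety.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Stand-in for the (possibly expensive) search through thinned collections.
using Lookup = std::function<int const*(std::size_t)>;

// Hypothetical handle that caches the pointer after the first successful lookup.
class CachingPointer {
public:
  CachingPointer(std::size_t key, Lookup lookup) : key_(key), lookup_(std::move(lookup)) {
    assert(static_cast<bool>(lookup_));
  }
  int const* get() const {
    if (cache_ == nullptr) {
      cache_ = lookup_(key_);  // search only while nothing is cached
    }
    return cache_;
  }

private:
  std::size_t key_;
  Lookup lookup_;
  mutable int const* cache_ = nullptr;
};

// Hypothetical handle that repeats the lookup on every access.
class NonCachingReference {
public:
  NonCachingReference(std::size_t key, Lookup lookup) : key_(key), lookup_(std::move(lookup)) {
    assert(static_cast<bool>(lookup_));
  }
  int const* get() const { return lookup_(key_); }

private:
  std::size_t key_;
  Lookup lookup_;
};
```

If code dereferences the same reference many times, the second class pays for the full search each time, which is why repeated access through a Ref into a thinned collection can be noticeably slower.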

PtrVector has one added performance advantage with thinned collections. In the case where the functions "begin", "end", or "isAvailable" are called, the PtrVector will look for and cache pointers to all its elements in one search over the thinned collections.

The memory used by a Ref or Ptr did not change when thinned collections were implemented.

At this point, we have not performance tested or profiled the use of thinned collections. If someone has a realistic use case where performance is an issue, please report it to the Framework hypernews. With realistic profiling tests, we might revisit tradeoffs made in the design or find places where the implementation is not optimal.

Miscellaneous Details

  1. The redirection of Refs and Ptrs also works when reading data with FWLite and ROOT.
  2. The feature works with EDM I/O (PoolSource and PoolOutputModule) and Streamer I/O.
  3. It works with secondary file input.
  4. It works if the master or thinned collections are EDAlias'd, although it will not allow using EDAlias for the ThinnedAssociation.
  5. It works with the SubProcess feature.

Review status

Reviewer/Editor and Date (copy from screen) Comments
DavidDagenhart - 29 Sep 2014 created page
DavidDagenhart - 29 Sep 2014 reviewed

Responsible: DavidDagenhart
Last reviewed by: DavidDagenhart - 29 September 2014

Topic revision: r5 - 2014-09-30 - DavidDagenhart
 