3.1 Analysis Overview: an Introduction


Goals of this page:

This page presents a big-picture overview of performing an analysis at CMS.

  • The first task is to describe how data flow within CMS, from data taking through the various layers of skimming. This also introduces the concept of a data tier (RECO, AOD) and defines each of them, as well as the PAT data format, which is described in detail in Chapter 4. This is the scope of this section.
  • We then need to understand the most important CMS data formats, RECO and AOD, so they are described next. PAT is also mentioned, although it is described in detail later.
  • Finally, we explore two options for a quantitative analysis of CMS events:
    • FW Lite -- using ROOT enhanced with libraries that can understand CMS data formats and aid in fetching object collections from the event
    • the full Framework -- using C++ modules in cmsRun

The data flow, from detector to analysis

(For a more thorough overview, please see WorkBookComputingModel; this section necessarily distills information that is presented there in much greater detail.)

To enable the most effective access to CMS data, the data are first split into Primary Datasets (PDs) and then the events are filtered. The division into Primary Datasets is based on the trigger decision. The primary datasets are structured and placed to make life as easy as possible, e.g. to minimize the need for an average user to run over very large amounts of data. The datasets group or split triggers so as to keep their sizes balanced.

However, the Primary Datasets will still be too large for direct access by users to be reasonable, or even feasible. The main strategy for dealing with such a large number of events is to filter them, in layers of ever-tighter event selection. (After all, the Level-1 trigger and the HLT do the same online.) The process of selecting events and saving them to output is called `skimming'. The intended modus operandi of CMS analysis groups is the following:

  1. the primary datasets and skims are produced; they are defined using the trigger information (for stability) and produced centrally on Tier 1 systems
  2. the secondary skims are produced by the physics groups (say, the Higgs group) by running on the primary skims; the secondary skims are usually produced by group members running on the Tier 2 clusters assigned to the given group
  3. optionally, the user then skims once again, applying an even tighter event selection
  4. the final sample (with almost-final cuts) can then be analyzed with FW Lite. It can also be analyzed with the full Framework, but we recommend using FW Lite, as it is interactive and far more portable

The primary skims (step 1 above) reduce the size of the primary datasets in order to reduce the time of subsequent layers of skimming. The target of the primary skims is a reduction of about a factor of 10 in size with respect to the primary datasets.

The secondary skimming (step 2 above) must be tight enough to keep the secondary skims feasible in terms of size, and yet not so tight that certain analyses find themselves starved for data. What counts as `tight' is analysis-dependent, so it is vital for the group members to be involved in the definition of their group's secondary skims!

The user selection (step 3) is made on the Tier 2 by the user, and it is the main opportunity to reduce the size of the samples the user will need to deal with (and lug around). In many cases this is where the preliminary event selection is done, and thus it is the foundation of the analysis. The user may well need to re-run this step (e.g. upon finding that the cuts were too tight), but this is not a problem, since these tertiary skims run on the secondary skims, which are already reduced in size.

That being said, it is important to tune the user's skim to be as close to `just right' as possible: the event selection should be looser than it is expected to be after the final cut optimization, but not too loose -- otherwise the skimming would not serve its purpose. If done right, this not only saves your own time, but also preserves the collaboration's CPU resources.

Reduction in event size: CMS Data Formats and Data Tiers

(For a more thorough overview, please see WorkBookComputingModel; this section necessarily distills information that is presented there in much greater detail.)

In addition to the reduction of the number of events, in steps 1-3 it is also possible to reduce the size of each event by

  • removing unneeded collections (e.g. after we make PAT candidates, for most purposes the rest of the AOD information is not needed); this is called stripping or slimming.
  • removing unneeded information from objects; this is called thinning. It is an advanced topic; it is still experimental and is not covered here.

Stripping, slimming and thinning in the context of analysis are discussed further below.

Starting from the detector output ("RAW" data), the information is progressively refined and what is not needed is dropped. This defines the CMS data tiers. Each bit of data in an event must be written in a supported data format. A data format is essentially a C++ class, where a class defines a data structure (a data type with data members). The term data format can refer either to the format of the data written using the class (i.e. the data format as a sort of template) or to the instantiated class object itself. The DataFormats package and the SimDataFormats package (for simulated data) in the CMSSW repository contain all the supported data formats that can be written to an Event file. So, for example, if you wish to add data to an Event, your EDProducer module must instantiate one or more of these data format classes.

Data formats (classes) for reconstructed data, for example, include reco::Track, reco::TrackExtra, and many more. See the Offline Guide section SWGuideRecoDataTable for the full listing.
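
To make the last point concrete, here is a minimal sketch of an EDProducer that instantiates one of these data format classes (reco::TrackCollection) and adds it to the Event. It follows the legacy EDProducer interface of this era; the class name is hypothetical and the collection is left empty, so treat it as an illustration rather than a working algorithm.

// ExampleTrackProducer.cc -- hypothetical producer adding a reco::TrackCollection to the Event
#include <memory>

#include "FWCore/Framework/interface/EDProducer.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "DataFormats/TrackReco/interface/Track.h"
#include "DataFormats/TrackReco/interface/TrackFwd.h"

class ExampleTrackProducer : public edm::EDProducer {
public:
  explicit ExampleTrackProducer(const edm::ParameterSet&) {
    // Declare what this module will add to the Event.
    produces<reco::TrackCollection>();
  }

  virtual void produce(edm::Event& event, const edm::EventSetup&) {
    // Instantiate the data format class and (in a real module) fill it.
    std::auto_ptr<reco::TrackCollection> output(new reco::TrackCollection());
    // ... fill 'output' with reco::Track objects here ...
    event.put(output);  // the collection is now part of the Event
  }
};

DEFINE_FWK_MODULE(ExampleTrackProducer);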

Data Tiers: Reconstructed (RECO) Data and Analysis Object Data (AOD)

Event information from each step in the simulation and reconstruction chain is logically grouped into what we call a data tier, which has already been introduced in the Workbook section describing the Computing Model. Examples of data tiers include RAW and RECO, and, for MC, GEN, SIM and DIGI. A data tier may contain multiple data formats, as mentioned above for reconstructed data. A given dataset may consist of multiple data tiers, e.g. the term GenSimDigi covers the generation (MC), simulation (Geant) and digitization steps. The most important tiers from a physicist's point of view are RECO (all reconstructed objects and hits) and AOD (a smaller subset of RECO covering what is needed by analyses).

RECO data contains objects from all stages of reconstruction. AOD data are derived from the RECO information to provide data for physics analyses in a convenient, compact format. Typically, physics analyses don't require you to rerun the reconstruction process on the data. Most physics analyses can run on AOD data.

[Figure: whats_in_aod_reco.gif -- an overview of what is stored in the RECO and AOD data tiers]

RECO

RECO is the name of the data-tier which contains objects created by the event reconstruction program. It is derived from RAW data and provides access to reconstructed physics objects for physics analysis in a convenient format. Event reconstruction is structured in several hierarchical steps:

  1. Detector-specific processing: Starting from detector data unpacking and decoding, detector calibration constants are applied and cluster or hit objects are reconstructed.
  2. Tracking: Hits in the silicon and muon detectors are used to reconstruct global tracks. Pattern recognition in the tracker is the most CPU-intensive task.
  3. Vertexing: Reconstructs primary and secondary vertex candidates.
  4. Particle identification: Produces the objects most associated with physics analyses. Using a wide variety of sophisticated algorithms, standard physics-object candidates are created (electrons, photons, muons, missing transverse energy and jets, as well as heavy-quark and tau-decay tags).

The normal completion of the reconstruction task will result in a full set of these reconstructed objects usable by CMS physicists in their analyses. You would only need to rerun these algorithms if your analysis requires you to take account of such things as trial calibrations, novel algorithms etc.

Reconstruction is expensive in terms of CPU and is dominated by tracking. The RECO data tier provides compact information for analysis, so that access to the RAW data is not necessary for most analyses. Following the hierarchy of event reconstruction, RECO contains objects from all stages of reconstruction: at the lowest level, reconstructed hits, clusters and segments; based on these, reconstructed tracks and vertices; at the highest level, reconstructed jets, muons, electrons, b-jets, etc. Direct references from high-level objects to low-level objects avoid duplication of information, and in addition the RECO format preserves links to the RAW information.

The RECO data includes quantities required for typical analysis usage patterns such as: track re-finding, calorimeter reclustering, and jet energy calibration. The RECO event content is documented in the Offline Guide at RECO Data Format Table.

AOD

AOD data are derived from the RECO information to provide data for physics analyses in a convenient, compact format, and are usable directly by physics analyses. AOD data are produced by the same (or subsequent) processing steps as the RECO data, and are made easily available at multiple sites to CMS members. The AOD contains enough information about the event to support all the typical usage patterns of a physics analysis. Thus, it contains a copy of all the high-level physics objects (such as muons, electrons, taus, etc.), plus a summary of the RECO information sufficient to support typical analysis actions such as track refitting with improved alignment or kinematic constraints, or re-evaluation of the energy and/or position of ECAL clusters based on analysis-specific corrections. Because its limited size does not allow it to contain all the hits, the AOD typically does not support the application of novel pattern-recognition techniques or of new calibration constants; these usually require the RECO or RAW information.

The AOD data tier will contain physics objects: tracks with associated Hits, calorimetric clusters with associated Hits, vertices, jets and high-level physics objects (electrons, muons, Z boson candidates, and so on).

Because the AOD data tier is relatively compact, all Tier-1 computing centres are able to keep a full copy of the AOD, while they will hold only a subset of the RAW and RECO data tiers. The AOD event content is documented in the Offline Guide at AOD Data Format Table.

PAT

The information in RECO and AOD is stored in a way that uses the least amount of space and allows for the greatest flexibility. This is particularly true for DataFormats that contain objects linking to each other. However, accessing these links between RECO or AOD objects requires more experience with C++. To simplify the user's analysis, a set of new data formats has been created which aggregate the related RECO information. These new formats, along with the tools used to make and manipulate them, are called the Physics Analysis Toolkit, or PAT. PAT is the de facto way for users to access the physics objects that are the output of reconstruction.

PAT's content is flexible -- it is up to the user to define it. For this reason, PAT is not a data tier. The content of PAT may differ from one analysis to another, and certainly from one PAG to another. However, PAT defines a standard for the physics objects and for the variables stored in those physics objects. It is like a menu in a restaurant: every patron can choose different things from the menu, but everybody is reading from the same menu. This facilitates sharing both tools and people between analyses and physics groups.

PAT is discussed in more detail in WorkBookPATTupleCreationExercise. Here we continue the story of defining the user content of an analysis, in which PAT plays a crucial role.

Group and user skims: RECO, AOD and PAT-tuples

Now we can refine the descriptions of the primary, group and user-defined skims, with some examples. In almost all cases the primary skims read AOD and produce AOD with a reduced number of events. (During physics commissioning, the primary skims may also read and write RECO instead of AOD.) The group and user skims may also read and write AOD (or RECO). However, they could also produce PAT-tuples, as decided by the group or the user. As an illustration, these steps could be:

  1. primary skims read AOD, write AOD.
  2. a group-wide skim filters events in AOD and produces PAT with lots of information (such PAT-tuples are sometimes called for, since they need to benefit multiple efforts within the group)
  3. the user modifies the PAT workflow to read PAT and produce another version of PAT with much smaller content (stripping/slimming), and possibly even thinned PAT objects (thinning); the event-selection part of such a user skim is sketched right after this list
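
As an illustration of the event selection that drives a skim, here is a minimal sketch of a full-Framework EDFilter that keeps only events containing at least one muon above a configurable pT threshold. The class name and parameter names are hypothetical; in a real skim the accompanying cmsRun configuration would wire this filter into a path and, via the output module, also drop the unwanted collections (stripping/slimming).

// ExampleMuonSkim.cc -- hypothetical skim filter selecting events with a high-pT muon
#include "FWCore/Framework/interface/EDFilter.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "FWCore/Utilities/interface/InputTag.h"
#include "DataFormats/Common/interface/Handle.h"
#include "DataFormats/MuonReco/interface/Muon.h"
#include "DataFormats/MuonReco/interface/MuonFwd.h"

class ExampleMuonSkim : public edm::EDFilter {
public:
  explicit ExampleMuonSkim(const edm::ParameterSet& cfg)
    : src_(cfg.getParameter<edm::InputTag>("src")),   // e.g. "muons"
      ptMin_(cfg.getParameter<double>("ptMin")) {}

  virtual bool filter(edm::Event& event, const edm::EventSetup&) {
    edm::Handle<reco::MuonCollection> muons;
    event.getByLabel(src_, muons);
    for (reco::MuonCollection::const_iterator mu = muons->begin();
         mu != muons->end(); ++mu) {
      if (mu->pt() > ptMin_) return true;  // keep the event in the skim
    }
    return false;  // the event is not written to the skim output
  }

private:
  edm::InputTag src_;
  double ptMin_;
};

DEFINE_FWK_MODULE(ExampleMuonSkim);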

All the operations that involve skimming, stripping, and thinning are done within the full Framework. Therefore, every user needs at least to know what these jobs do in each of the steps, even if s/he does not need to make any changes to any of the processing steps. More likely, however, some changes will be needed, especially in the last stage, where the skimming and further processing is run by the user. In some cases, the user may even need to write Framework modules such as EDProducers -- to add new DataFormats to the events -- or EDAnalyzers -- to compute quantities that require access to conditions.
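
For completeness, here is a similarly minimal sketch of an EDAnalyzer, the kind of full-Framework module a user might write when a computation needs the Framework environment. It fills a muon pT histogram through the TFileService; the class name and input label are hypothetical, and the job's configuration is assumed to load the TFileService. Conditions (calibrations, alignment, magnetic field, ...) would be retrieved from the EventSetup, which is not needed for this simple example.

// ExampleMuonAnalyzer.cc -- hypothetical analyzer filling a muon pT histogram
#include "FWCore/Framework/interface/EDAnalyzer.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "FWCore/ServiceRegistry/interface/Service.h"
#include "FWCore/Utilities/interface/InputTag.h"
#include "CommonTools/UtilAlgos/interface/TFileService.h"
#include "DataFormats/Common/interface/Handle.h"
#include "DataFormats/MuonReco/interface/Muon.h"
#include "DataFormats/MuonReco/interface/MuonFwd.h"
#include "TH1F.h"

class ExampleMuonAnalyzer : public edm::EDAnalyzer {
public:
  explicit ExampleMuonAnalyzer(const edm::ParameterSet& cfg)
    : src_(cfg.getParameter<edm::InputTag>("src")) {
    // Book the histogram in the output file managed by the TFileService.
    edm::Service<TFileService> fs;
    hPt_ = fs->make<TH1F>("muonPt", "muon p_{T} [GeV]", 100, 0., 200.);
  }

  virtual void analyze(const edm::Event& event, const edm::EventSetup& setup) {
    // Conditions would be fetched from 'setup' via edm::ESHandle if needed.
    edm::Handle<reco::MuonCollection> muons;
    event.getByLabel(src_, muons);
    for (reco::MuonCollection::const_iterator mu = muons->begin();
         mu != muons->end(); ++mu)
      hPt_->Fill(mu->pt());
  }

private:
  edm::InputTag src_;
  TH1F* hPt_;
};

DEFINE_FWK_MODULE(ExampleMuonAnalyzer);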

In the above example, the end of the skimming chain produces a "PAT-tuple", which should be small enough to fit easily on a laptop. Moreover, it should also fit within the memory of the ROOT process, thus facilitating interactive speed on par with plain TTrees. However, to be able to read CMS data (RECO, AOD or PAT) from ROOT, we need to teach ROOT to understand CMS DataFormats by loading the DataFormats libraries themselves, accompanied by a couple of helper classes that simplify the user's manipulation of CMS events in ROOT. ROOT with these additional libraries loaded is called Framework-lite, or FW Lite.
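
As a sketch of what FW Lite looks like in practice, the ROOT macro below loops over a PAT-tuple and fills a muon pT histogram. The file name and the collection label ("selectedPatMuons") are assumptions that depend on how the PAT-tuple was produced; in releases of this era the FW Lite libraries are enabled in an interactive session with gSystem->Load("libFWCoreFWLite") followed by AutoLibraryLoader::enable().

// fwlite_example.C -- hypothetical FW Lite macro reading pat::Muons from a PAT-tuple
#include <vector>

#include "TFile.h"
#include "TH1F.h"
#include "DataFormats/FWLite/interface/Event.h"
#include "DataFormats/FWLite/interface/Handle.h"
#include "DataFormats/PatCandidates/interface/Muon.h"

void fwlite_example() {
  // Book the histogram first and detach it from any file.
  TH1F* hPt = new TH1F("muonPt", "muon p_{T} [GeV]", 100, 0., 200.);
  hPt->SetDirectory(0);

  TFile* file = TFile::Open("patTuple.root");  // hypothetical file name
  fwlite::Event event(file);

  for (event.toBegin(); !event.atEnd(); ++event) {
    // Fetch the pat::Muon collection by its label (assumed here).
    fwlite::Handle<std::vector<pat::Muon> > muons;
    muons.getByLabel(event, "selectedPatMuons");
    for (unsigned i = 0; i < muons.ptr()->size(); ++i)
      hPt->Fill(muons.ptr()->at(i).pt());
  }

  hPt->Draw();
}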

Tools for interactive analysis: FW Lite, edmBrowser, Fireworks

The interactive stage is where most of the analysis is actually done, and where most of the `analysis time' is actually spent. Every analysis is different, and many take a number of twists and turns towards their conclusion, solving an array of riddles along the way. However, most analyses need (or could benefit from):

  • a way to examine the content of CMS data files, especially PAT-tuples. CMS has several tools that can examine the file content, including stand-alone executables edmDumpEventContent (which dumps a list of the collections present in the file to the terminal), and edmBrowser (which has a nice graphical interface).
  • a way to obtain the history of the file. The CMS files contain embedded information sufficient to tell the history of the objects in the file. This information is called provenance, and is crucial for the analysis, as it allows the user to establish with certainty what kind of operations (corrections, calibrations, algorithms) were performed on the data present in the file. The stand-alone executable edmProvDump prints the provenance to the screen.
  • a way to visualize the event. CMS has two event viewers: Iguana is geared toward the detailed description of the event, and is described later, in the advanced section of the workbook. In contrast, the main objective of Fireworks is to display analysis-level quantities. Moreover, Fireworks is well-suited for investigating events in CMS data files, since it can read them directly.
  • a way to manipulate data quantitatively. In HEP, most quantitative analysis of data is performed within the ROOT framework. ROOT has, over the years, subsumed an impressive collection of statistical and other tools, including the fitting package RooFit and the multivariate analysis toolkit TMVA. ROOT can access CMS data directly, provided the DataFormats libraries are loaded, turning it into FW Lite.

The following pages in this Chapter of the WorkBook will illustrate each of these steps, especially data analysis (including making plots) in FW Lite (WorkBookFWLite). But first, the choice of a release is discussed (WorkBookWhichRelease), and the ways to get data are illustrated (WorkBookDataSamples). At the end of the exercise in WorkBookDataSamples, we will end up with one or more small files, which we explore next, first using command-line utilities, and then with graphical tools like edmBrowser (WorkBookEdmInfoOnDataFile) and Fireworks event display (WorkBookFireworks).

Review status

Reviewer/Editor and Date    Comments

PetarMaksimovic - 20 Jun 2009    Created.
PetarMaksimovic - 30 Nov 2009    Some clean-up.
XuanChen - 17 Jul 2014    Changed the links from CVS to GitHub. Went through Chapter 3, Section 1; the information is relevant and clear. Created links (suggested by Kati L. P.) at "edmDumpEventContent", "edmBrowser", "provenance", "edmProvDump", "Iguana" and "Fireworks".

Responsible: SalvatoreRappoccio
Last reviewed by: PetarMaksimovic - 2 March 2009
