Difference: WorkBookAnalysisOverviewIntroduction (1 vs. 22)

Revision 22 (2014-07-17) - XuanChen

Line: 1 to 1
 
META TOPICPARENT name="WorkBook"

3.1 Analysis Overview: an Introduction

Line: 54 to 54
 Stripping, slimming and thinning in the context of analysis are discussed further below.
Changed:
<
<
Starting from the detector output ("RAW" data), the information is being refined and what is not needed is being dropped. This defines the CMS data tiers. Each bit of data in an event must be written in a supported data format. A data format is essentially a C++ class, where a class defines a data structure (a data type with data members). The term data format can be used to refer to the format of the data written using the class (e.g., data format as a sort of template), or to the instantiated class object itself. The DataFormats package and the SimDataFormats package (for simulated data) in the CMSSW CVS repository contain all the supported data formats that can be written to an Event file. So, for example, if you wish to add data to an Event, your EDProducer module must instantiate one or more of these data format classes.
>
>
Starting from the detector output ("RAW" data), the information is being refined and what is not needed is being dropped. This defines the CMS data tiers. Each bit of data in an event must be written in a supported data format. A data format is essentially a C++ class, where a class defines a data structure (a data type with data members). The term data format can be used to refer to the format of the data written using the class (e.g., data format as a sort of template), or to the instantiated class object itself. The DataFormats package and the SimDataFormats package (for simulated data) in the CMSSW CVS repository contain all the supported data formats that can be written to an Event file. So, for example, if you wish to add data to an Event, your EDProducer module must instantiate one or more of these data format classes.
  Data formats (classes) for reconstructed data, for example, include Reco.Track, Reco.TrackExtra, and many more. See the Offline Guide section SWGuideRecoDataTable for the full listing.
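As a concrete, purely illustrative sketch of the last point, the fragment below shows a minimal EDProducer (legacy C++ interface of this era) that instantiates one of the supported data format classes, a reco::TrackCollection, and adds it to the Event. The module name, input label and pT threshold are hypothetical, and include paths may differ slightly between releases.

<verbatim>
// Hypothetical example: a producer that instantiates a supported data format
// (reco::TrackCollection) and writes it to the Event. Names and the pT cut
// are illustrative only; header locations follow the usual CMSSW layout.
#include <memory>
#include "FWCore/Framework/interface/EDProducer.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "FWCore/Utilities/interface/InputTag.h"
#include "DataFormats/Common/interface/Handle.h"
#include "DataFormats/TrackReco/interface/Track.h"
#include "DataFormats/TrackReco/interface/TrackFwd.h"

class GoodTrackProducer : public edm::EDProducer {
public:
  explicit GoodTrackProducer(const edm::ParameterSet& cfg)
      : src_(cfg.getParameter<edm::InputTag>("src")) {
    produces<reco::TrackCollection>();          // declare the data format we add to the Event
  }
  virtual void produce(edm::Event& event, const edm::EventSetup&) {
    edm::Handle<reco::TrackCollection> tracks;
    event.getByLabel(src_, tracks);             // read an existing track collection
    std::auto_ptr<reco::TrackCollection> out(new reco::TrackCollection);
    for (reco::TrackCollection::const_iterator t = tracks->begin(); t != tracks->end(); ++t)
      if (t->pt() > 5.0) out->push_back(*t);    // keep only tracks above an illustrative threshold
    event.put(out);                             // the new collection becomes part of the Event
  }
private:
  edm::InputTag src_;
};

DEFINE_FWK_MODULE(GoodTrackProducer);
</verbatim>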
Line: 139 to 139
 
<!-- Add your review status in this table structure with 2 columns delineated by three vertical bars. Add comments for editing, reviewing, etc -->

Reviewer/Editor and Date (copy from screen) Comments
Added:
>
>
 
PetarMaksimovic - 20 Jun 2009 Created.
PetarMaksimovic - 30 Nov 2009 Some clean-up.
Changed:
<
<
>
>
XuanChen - 17 Jul 2014 Changed the links from cvs to github

Revision 21 (2013-10-31) - TerjeAndersen

Line: 1 to 1
 
META TOPICPARENT name="WorkBook"

3.1 Analysis Overview: an Introduction

Revision 20 (2012-10-05) - AntonioMorelosPineda

Line: 1 to 1
 
META TOPICPARENT name="WorkBook"

3.1 Analysis Overview: an Introduction

Line: 123 to 123
 

Tools for interactive analysis: FW Lite, edmBrowser, Fireworks

Deleted:
<
<
KLP: add links
 The interactive stage is where most of the analysis is actually done, and where most of the `analysis time' is actually spent. Every analysis is different, and many take a number of twists and turns towards their conclusions, solving an array of riddles on the way. However, most analyses need (or could benefit from):
Changed:
<
<
  • a way to examine the content of CMS data files, especially PAT-tuples. CMS has several tools that can examine the file content, including stand-alone executables edmDumpEventContent (which dumps a list of the collections present in the file to the terminal), and edmBrowser (which has a nice graphical interface).
  • a way to obtain the history of the file. The CMS files contain embedded information sufficient to tell the history of the objects in the file. This information is called provenance, and is crucial for the analysis, as it allows the user to establish with certainty what kind of operations (corrections, calibrations, algorithms) were performed on the data present in the file. The stand-alone executable edmProvDump prints the provenance to the screen.
  • a way to visualize the event. CMS has two event viewers: Iguana is geared toward the detailed description of the event, and is described later, in the advanced section of the workbook. In contrast, the main objective of Fireworks is to display analysis-level quantities. Moreover, Fireworks is well-suited for investigating events in CMS data files, since it can read them directly.
>
>
  • a way to examine the content of CMS data files, especially PAT-tuples. CMS has several tools that can examine the file content, including stand-alone executables edmDumpEventContent (which dumps a list of the collections present in the file to the terminal), and edmBrowser (which has a nice graphical interface).
  • a way to obtain the history of the file. The CMS files contain embedded information sufficient to tell the history of the objects in the file. This information is called provenance, and is crucial for the analysis, as it allows the user to establish with certainty what kind of operations (corrections, calibrations, algorithms) were performed on the data present in the file. The stand-alone executable edmProvDump prints the provenance to the screen.
  • a way to visualize the event. CMS has two event viewers: Iguana is geared toward the detailed description of the event, and is described later, in the advanced section of the workbook. In contrast, the main objective of Fireworks is to display analysis-level quantities. Moreover, Fireworks is well-suited for investigating events in CMS data files, since it can read them directly.
 
  • a way to manipulate data quantitatively. In HEP, most quantitative analysis of data is performed within the ROOT framework. ROOT has, over the years, subsumed an impressive collection of statistical and other tools, including the fitting package RooFit and a collection of multi-variate analysis tools, TMVA. ROOT can access the CMS data directly, provided the DataFormats libraries are loaded, turning it into FW Lite.

The following pages in this Chapter of the WorkBook will illustrate each of these steps, especially data analysis (including making plots) in FW Lite (WorkBookFWLite). But first, the choice of a release is discussed (WorkBookWhichRelease), and the ways to get data are illustrated (WorkBookDataSamples). At the end of the exercise in WorkBookDataSamples, we will end up with one or more small files, which we explore next, first using command-line utilities, and then with graphical tools like edmBrowser (WorkBookEdmInfoOnDataFile) and Fireworks event display (WorkBookFireworks).
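To give a flavour of the FW Lite workflow mentioned above, here is a minimal ROOT macro; it is a hedged illustration, not an official recipe. It loops over events in a hypothetical PAT-tuple and histograms muon pT. The file name and the "selectedPatMuons" label are assumptions, and the CMS FW Lite libraries must be loaded in the ROOT session first (the exact incantation depends on the release).

<verbatim>
// Minimal FW Lite sketch (illustrative only): loop over a hypothetical
// PAT-tuple and fill a histogram with muon pT. Load the CMS libraries
// (FW Lite auto-library loader) before running this in ROOT.
#include "DataFormats/FWLite/interface/Event.h"
#include "DataFormats/FWLite/interface/Handle.h"
#include "DataFormats/PatCandidates/interface/Muon.h"
#include "TFile.h"
#include "TH1F.h"
#include <vector>

void muonPt() {
  TFile file("patTuple.root");                       // hypothetical input file
  fwlite::Event event(&file);
  TH1F hist("muPt", "muon p_{T};p_{T} [GeV];muons", 50, 0., 100.);
  for (event.toBegin(); !event.atEnd(); ++event) {
    fwlite::Handle<std::vector<pat::Muon> > muons;
    muons.getByLabel(event, "selectedPatMuons");     // hypothetical collection label
    for (size_t i = 0; i < muons->size(); ++i)
      hist.Fill(muons->at(i).pt());                  // one entry per reconstructed muon
  }
  hist.Draw();
}
</verbatim>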

Line: 144 to 143
 
PetarMaksimovic - 30 Nov 2009 Some clean-up.
Added:
>
>

I went through chapter 3 section 1. The information is relevant and clear. I created a few links that were suggested by Kati L. P.

created link at " edmDumpEventContent "

created link at " edmBrowser "

created link at " provenance "

created link at " edmProvDump "

created link at " Iguana "

created link at " Fireworks "

 
<!-- Here the "responsible" is the contact person for maintenance of the page - often initially the page author -->
<!-- The "review" person is the most recent person to carry out a full review of the page  - on this field please include the date in the format shown -->
Responsible: SalvatoreRappoccio

Revision 19 (2011-02-02) - KatiLassilaPerini

Line: 1 to 1
 
META TOPICPARENT name="WorkBook"

3.1 Analysis Overview: an Introduction

Line: 28 to 28
  However, the Primary Datasets will be too large to make direct access by users reasonable or even feasible. The main strategy in dealing with such a large number of events is to filter them, and do that in layers of ever-tighter event selection. (After all, the Level 1 trigger and HLT are doing the same online.) The process of selecting events and saving them in output is called `skimming'. The intended modus operandi of CMS analysis groups is the following:
Changed:
<
<
  1. the primary datasets and skims are produced; they are defined using the trigger information (for stability) and produced centrally on Tier 1 systems
  2. the secondary skims are produced by the physics groups (say a Higgs group) by running on the primary skims; the secondary skims are usually produced by group members running on the Tier 2 clusters assigned to the given group
>
>
  1. the primary datasets and skims are produced; they are defined using the trigger information (for stability) and produced centrally on Tier 1 systems
  2. the secondary skims are produced by the physics groups (say a Higgs group) by running on the primary skims; the secondary skims are usually produced by group members running on the Tier 2 clusters assigned to the given group
 
  1. optionally, the user then skims once again, applying an ever tighter event selection
  2. the final sample (with almost final cuts) can then be analyzed by FW Lite. It can also be analyzed by the full framework, however we recommend using FW Lite as it is interactive and far more portable
Line: 37 to 37
  The secondary skimming (step 2 above) must be tight enough to make the secondary skims feasible in terms of size. And yet it must not be too tight since otherwise certain analyses might find themselves starved for data. However, in this case what is `tight' is analysis-dependent, so it is vital for the group members to be involved in the definition of their group's secondary skims!
Changed:
<
<
The user selection (step 3) is made on the Tier 2 by the user, and it's the main opportunity to reduce the size of the samples the user will need to deal with (and lug around). In many cases, this is where the preliminary event selection is done, and thus it is the foundation of the analysis. It is expected that the user may need to re-run this step (e.g., in case of finding out that the cuts were too tight), but this is not a problem since the tertiary skims are being run on the secondary skims which are already reduced in size.
>
>
The user selection (step 3) is made on the Tier 2 by the user, and it's the main opportunity to reduce the size of the samples the user will need to deal with (and lug around). In many cases, this is where the preliminary event selection is done, and thus it is the foundation of the analysis. It is expected that the user may need to re-run this step (e.g., in case of finding out that the cuts were too tight), but this is not a problem since the tertiary skims are being run on the secondary skims which are already reduced in size.
  That being said, it is important to tune the user's skim to be as close to `just right' as possible: the event selection should be looser than they are expected to be after the final cut optimization, but not too loose -- otherwise the skimming would not serve its purpose. If done right, this will not only save your own time, but also preserve the collaboration's CPU resources.
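As an illustration of what such a user skim can look like in the full Framework, the sketch below shows a minimal EDFilter that keeps only events with at least two muons above a configurable pT threshold; the module name, input label and cut value are hypothetical and would have to be tuned to the analysis at hand.

<verbatim>
// Hypothetical skim filter: keep only events containing at least two muons
// above a configurable pT threshold. Labels and cuts are illustrative.
#include "FWCore/Framework/interface/EDFilter.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "FWCore/Utilities/interface/InputTag.h"
#include "DataFormats/Common/interface/Handle.h"
#include "DataFormats/MuonReco/interface/Muon.h"
#include "DataFormats/MuonReco/interface/MuonFwd.h"

class DiMuonSkim : public edm::EDFilter {
public:
  explicit DiMuonSkim(const edm::ParameterSet& cfg)
      : src_(cfg.getParameter<edm::InputTag>("src")),
        minPt_(cfg.getParameter<double>("minPt")) {}
  virtual bool filter(edm::Event& event, const edm::EventSetup&) {
    edm::Handle<reco::MuonCollection> muons;
    event.getByLabel(src_, muons);
    unsigned nGood = 0;
    for (reco::MuonCollection::const_iterator m = muons->begin(); m != muons->end(); ++m)
      if (m->pt() > minPt_) ++nGood;
    return nGood >= 2;   // returning false drops the event from the skim output
  }
private:
  edm::InputTag src_;
  double minPt_;
};

DEFINE_FWK_MODULE(DiMuonSkim);
</verbatim>

In a cmsRun configuration, such a filter is placed in a path, and the output module is set up to keep only events that pass that path, so the skim output contains just the selected events.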
Changed:
<
<
>
>
 

Reduction in event size: CMS Data Formats and Data Tiers

(For a more thorough overview, please see WorkBookComputingModel; this section necessarily distills the information which was presented there in much more detail.)

Line: 129 to 129
 
  • a way to examine the content of CMS data files, especially PAT-tuples. CMS has several tools that can examine the file content, including stand-alone executables edmDumpEventContent (which dumps a list of the collections present in the file to the terminal), and edmBrowser (which has a nice graphical interface).
  • a way to obtain the history of the file. The CMS files contain embedded information sufficient to tell the history of the objects in the file. This information is called provenance, and is crucial for the analysis, as it allows the user to establish with certainty what kind of operations (corrections, calibrations, algorithms) were performed on the data present in the file. The stand-alone executable edmProvDump prints the provenance to the screen.
  • a way to visualize the event. CMS has two event viewers: Iguana is geared toward the detailed description of the event, and is described later, in the advanced section of the workbook. In contrast, the main objective of Fireworks is to display analysis-level quantities. Moreover, Fireworks is well-suited for investigating events in CMS data files, since it can read them directly.
Changed:
<
<
  • a way to manipulate data quantitatively. In HEP, most quantitative analysis of data is performed within ROOT framework. ROOT has, over the years, subsumed an impressive collection of statistical and other tools, including the fitting package RooFit, or a collection of multi-variate analysis tools, TMVA. ROOT can access the CMS data directly, provided the DataFormats libraries are loaded, turning it into FW Lite.
>
>
  • a way to manipulate data quantitatively. In HEP, most quantitative analysis of data is performed within ROOT framework. ROOT has, over the years, subsumed an impressive collection of statistical and other tools, including the fitting package RooFit, or a collection of multi-variate analysis tools, TMVA. ROOT can access the CMS data directly, provided the DataFormats libraries are loaded, turning it into FW Lite.
  The following pages in this Chapter of the WorkBook will illustrate each of these steps, especially data analysis (including making plots) in FW Lite (WorkBookFWLite). But first, the choice of a release is discussed (WorkBookWhichRelease), and the ways to get data are illustrated (WorkBookDataSamples). At the end of the exercise in WorkBookDataSamples, we will end up with one or more small files, which we explore next, first using command-line utilities, and then with graphical tools like edmBrowser (WorkBookEdmInfoOnDataFile) and Fireworks event display (WorkBookFireworks).

Revision 18 (2010-04-27) - RogerWolf

Line: 1 to 1
 
META TOPICPARENT name="WorkBook"

3.1 Analysis Overview: an Introduction

Line: 103 to 103
  PAT's content is flexible -- it is up to the user to define it. For this reason, PAT is not a data tier. The content of PAT may change from one analysis to another, let alone from one PAG to another. However, PAT defines a standard for the physics objects and variables stored in those physics objects. It is like a menu in a restaurant -- every patron can choose different things from the menu, but everybody is reading from the same menu. This facilitates sharing both tools and people between analyses and physics groups.
Changed:
<
<
PAT is discussed in more detail in WorkBookPATTuple. Here we continue the story of defining the user content of an analysis, in which PAT plays a crucial role.
>
>
PAT is discussed in more detail in WorkBookPATTupleCreationExercise. Here we continue the story of defining the user content of an analysis, in which PAT plays a crucial role.
 

Revision 17 (2010-04-27) - RogerWolf

Line: 1 to 1
 
META TOPICPARENT name="WorkBook"

3.1 Analysis Overview: an Introduction

Line: 103 to 103
  PAT's content is flexible -- it is up to the user to define it. For this reason, PAT is not a data tier. The content of PAT may change from one analysis to another, let alone from one PAG to another. However, PAT defines a standard for the physics objects and variables stored in those physics objects. It is like a menu in a restaurant -- every patron can choose different things from the menu, but everybody is reading from the same menu. This facilitates sharing both tools and people between analyses and physics groups.
Changed:
<
<
PAT is discussed in more detail in WorkBookPAT. Here we continue the story of defining the user content of an analysis, in which PAT plays a crucial role.
>
>
PAT is discussed in more detail in WorkBookPATTuple. Here we continue the story of defining the user content of an analysis, in which PAT plays a crucial role.
 

Revision 16 (2010-03-02) - PetarMaksimovic

Line: 1 to 1
 
META TOPICPARENT name="WorkBook"

3.1 Analysis Overview: an Introduction

Line: 14 to 14
 This page presents a big-picture overview of performing an analysis at CMS.
  • The first task is to describe how the data flows within CMS, from data taking through various layers of skimming. This also introduces a concept of a data tier (RECO, AOD) and defines all of them. It also introduces the PAT data format which is described in detail in Chapter 4. This is the scope of this section.
  • We need to understand the most important CMS data formats, RECO and AOD, so they are described next. PAT is also mentioned, although it will be described in detail later.
Deleted:
<
<
  • Next, the ways to get data are illustrated. At the end of this exercise, we will end up with one or more small files, which we explore next, first using command-line utilities, and then with graphical tools like edmBrowser and Fireworks event display.
 
  • Finally, we explore two options for a quantitative analysis of CMS events:
    • FW Lite -- using ROOT enhanced with libraries that can understand CMS data formats and aid in fetching object collections from the event
    • the full Framework -- using C++ modules in cmsRun
Line: 23 to 22
 

The data flow, from detector to analysis

Added:
>
>
(For a more thorough overview, please see WorkBookComputingModel; this section necessarily distills the information which was presented there in much more detail.)
 To enable the most effective access to CMS data, the data are first split into Physics Datasets (PDs) and then the events are filtered. The division into the Physics Datasets is done based on the trigger decision. The primary datasets are structured and placed to make life as easy as possible, e.g. to minimize the need of an average user to run on very large amounts of data. The datasets group or split triggers in order to achieve balance in their size.

However, the Primary Datasets will be too large to make direct access by users reasonable or even feasible. The main strategy in dealing with such a large number of events is to filter them, and do that in layers of ever-tighter event selection. (After all, the Level 1 trigger and HLT are doing the same online.) The process of selecting events and saving them in output is called `skimming'. The intended modus operandi of CMS analysis groups is the following:

Line: 44 to 45
 

Reduction in event size: CMS Data Formats and Data Tiers

Added:
>
>
(For a more thorough overview, please see WorkBookComputingModel; this section necessarily distills the information which was presented there in much more detail.)
 In addition to the reduction of the number of events, in steps 1-3 it is also possible to reduce the size of each event by

  • removing unneeded collections (e.g. after we make PAT candidates, for most purposes the rest of the AOD information is not needed); this is called stripping or slimming.
  • removing unneeded information from objects; this is called thinning . It is an advanced topic; it's still experimental and not covered in here.
Added:
>
>
 Stripping, slimming and thinning in the context of analysis are discussed further below.
 Starting from the detector output ("RAW" data), the information is being refined and what is not needed is being dropped. This defines the CMS data tiers. Each bit of data in an event must be written in a supported data format. A data format is essentially a C++ class, where a class defines a data structure (a data type with data members). The term data format can be used to refer to the format of the data written using the class (e.g., data format as a sort of template), or to the instantiated class object itself. The DataFormats package and the SimDataFormats package (for simulated data) in the CMSSW CVS repository contain all the supported data formats that can be written to an Event file. So, for example, if you wish to add data to an Event, your EDProducer module must instantiate one or more of these data format classes.

Data formats (classes) for reconstructed data, for example, include Reco.Track, Reco.TrackExtra, and many more. See the Offline Guide section SWGuideRecoDataTable for the full listing.

Line: 98 to 103
  PAT's content is flexible -- it is up to the user to define it. For this reason, PAT is not a data tier. The content of PAT may change from one analysis to another, let alone from one PAG to another. However, PAT defines a standard for the physics objects and variables stored in those physics objects. It is like a menu in a restaurant -- every patron can choose different things from the menu, but everybody is reading from the same menu. This facilitates sharing both tools and people between analyses and physics groups.
Added:
>
>
PAT is discussed in more detail in WorkBookPAT. Here we continue the story of defining the user content of an analysis, in which PAT plays a crucial role.
 

Group and user skims: RECO, AOD and PAT-tuples

Line: 124 to 131
 
  • a way to visualize the event. CMS has two event viewers: Iguana is geared toward the detailed description of the event, and is described later, in the advanced section of the workbook. In contrast, the main objective of Fireworks is to display analysis-level quantities. Moreover, Fireworks is well-suited for investigating events in CMS data files, since it can read them directly.
  • a way to manipulate data quantitatively. In HEP, most quantitative analysis of data is performed within ROOT framework. ROOT has, over the years, subsumed an impressive collection of statistical and other tools, including the fitting package RooFit, or a collection of multi-variate analysis tools, TMVA. ROOT can access the CMS data directly, provided the DataFormats libraries are loaded, turning it into FW Lite.
Changed:
<
<
The following pages in this Chapter of the WorkBook will illustrate each of these steps, especially data analysis (including making plots) in FW Lite.
>
>
The following pages in this Chapter of the WorkBook will illustrate each of these steps, especially data analysis (including making plots) in FW Lite (WorkBookFWLite). But first, the choice of a release is discussed (WorkBookWhichRelease), and the ways to get data are illustrated (WorkBookDataSamples). At the end of the exercise in WorkBookDataSamples, we will end up with one or more small files, which we explore next, first using command-line utilities, and then with graphical tools like edmBrowser (WorkBookEdmInfoOnDataFile) and Fireworks event display (WorkBookFireworks).
 

Line: 139 to 146
 
<!-- Here the "responsible" is the contact person for maintenance of the page - often initially the page author -->
<!-- The "review" person is the most recent person to carry out a full review of the page  - on this field please include the date in the format shown -->
Changed:
<
<
Responsible: PetarMaksimovic
Last reviewed by: PetarMaksimovic - 30 Nov 2009
>
>
Responsible: SalvatoreRappoccio
Last reviewed by: PetarMaksimovic - 2 March 2009
 

META FILEATTACHMENT attachment="whats_in_aod_reco.gif" attr="" comment="" date="1259610756" name="whats_in_aod_reco.gif" path="whats_in_aod_reco.gif" size="29241" stream="whats_in_aod_reco.gif" tmpFilename="/usr/tmp/CGItemp13217" user="petar" version="1"

Revision 15 (2010-02-17) - KatiLassilaPerini

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="WorkBook"
 

3.1 Analysis Overview: an Introduction

<!-- Enter an integer between 0 (zero) and 5 after the word COMPLETE, below, to indicate how complete the page/topic is; 0 means empty, 1 is for very incomplete, ascending to 5 for complete.  -->

Revision 14 (2010-02-16) - KatiLassilaPerini

Line: 1 to 1
 

3.1 Analysis Overview: an Introduction

<!-- Enter an integer between 0 (zero) and 5 after the word COMPLETE, below, to indicate how complete the page/topic is; 0 means empty, 1 is for very incomplete, ascending to 5 for complete.  -->
Line: 143 to 143
 

META FILEATTACHMENT attachment="whats_in_aod_reco.gif" attr="" comment="" date="1259610756" name="whats_in_aod_reco.gif" path="whats_in_aod_reco.gif" size="29241" stream="whats_in_aod_reco.gif" tmpFilename="/usr/tmp/CGItemp13217" user="petar" version="1"
Deleted:
<
<
META PREFERENCE name="ALLOWTOPICVIEW" title="ALLOWTOPICVIEW" type="Set" value="cms-web-access, cms-cern-it-web-access, KatiLassilaPerini"

Revision 13 (2010-02-12) - KatiLassilaPerini

Line: 1 to 1
 

3.1 Analysis Overview: an Introduction

<!-- Enter an integer between 0 (zero) and 5 after the word COMPLETE, below, to indicate how complete the page/topic is; 0 means empty, 1 is for very incomplete, ascending to 5 for complete.  -->
Line: 143 to 143
 

META FILEATTACHMENT attachment="whats_in_aod_reco.gif" attr="" comment="" date="1259610756" name="whats_in_aod_reco.gif" path="whats_in_aod_reco.gif" size="29241" stream="whats_in_aod_reco.gif" tmpFilename="/usr/tmp/CGItemp13217" user="petar" version="1"
Changed:
<
<
META PREFERENCE name="ALLOWTOPICVIEW" title="ALLOWTOPICVIEW" type="Set" value="cms-web-access, cms-cern-it-web-access"
>
>
META PREFERENCE name="ALLOWTOPICVIEW" title="ALLOWTOPICVIEW" type="Set" value="cms-web-access, cms-cern-it-web-access, KatiLassilaPerini"

Revision 12 (2010-01-01) - KatiLassilaPerini

Line: 1 to 1
 

3.1 Analysis Overview: an Introduction

<!-- Enter an integer between 0 (zero) and 5 after the word COMPLETE, below, to indicate how complete the page/topic is; 0 means empty, 1 is for very incomplete, ascending to 5 for complete.  -->
Line: 143 to 143
 

META FILEATTACHMENT attachment="whats_in_aod_reco.gif" attr="" comment="" date="1259610756" name="whats_in_aod_reco.gif" path="whats_in_aod_reco.gif" size="29241" stream="whats_in_aod_reco.gif" tmpFilename="/usr/tmp/CGItemp13217" user="petar" version="1"
Added:
>
>
META PREFERENCE name="ALLOWTOPICVIEW" title="ALLOWTOPICVIEW" type="Set" value="cms-web-access, cms-cern-it-web-access"

Revision 11 (2009-12-11) - KatiLassilaPerini

Line: 1 to 1
 

3.1 Analysis Overview: an Introduction

<!-- Enter an integer between 0 (zero) and 5 after the word COMPLETE, below, to indicate how complete the page/topic is; 0 means empty, 1 is for very incomplete, ascending to 5 for complete.  -->
Line: 115 to 115
 

Tools for interactive analysis: FW Lite, edmBrowser, Fireworks

Added:
>
>
KLP: add links
 The interactive stage is where most of the analysis is actually done, and where most of the `analysis time' is actually spent. Every analysis is different, and many take a number of twists and turns towards its conclusion, solving an array of riddles on the way. However, most analyses need (or could benefit from):

  • a way to examine the content of CMS data files, especially PAT-tuples. CMS has several tools that can examine the file content, including stand-alone executables edmDumpEventContent (which dumps a list of the collections present in the file to the terminal), and edmBrowser (which has a nice graphical interface).

Revision 10 (2009-12-03) - PetarMaksimovic

Line: 1 to 1
 

3.1 Analysis Overview: an Introduction

Added:
>
>
<!-- Enter an integer between 0 (zero) and 5 after the word COMPLETE, below, to indicate how complete the page/topic is; 0 means empty, 1 is for very incomplete, ascending to 5 for complete.  -->

Complete: 5
Detailed Review status

Goals of this page:

 This page presents a big-picture overview of performing an analysis at CMS.
  • The first task is to describe how the data flows within CMS, from data taking through various layers of skimming. This also introduces a concept of a data tier (RECO, AOD) and defines all of them. It also introduces the PAT data format which is described in detail in Chapter 4. This is the scope of this section.
  • We need to understand the most important CMS data formats, RECO and AOD, so they are described next. PAT is also mentioned, although it will be described in detail later.
Line: 132 to 141
 Last reviewed by: PetarMaksimovic - 30 Nov 2009
Deleted:
<
<

-- PetarMaksimovic - Last edit 11 June 2009

 
META FILEATTACHMENT attachment="whats_in_aod_reco.gif" attr="" comment="" date="1259610756" name="whats_in_aod_reco.gif" path="whats_in_aod_reco.gif" size="29241" stream="whats_in_aod_reco.gif" tmpFilename="/usr/tmp/CGItemp13217" user="petar" version="1"

Revision 9 (2009-11-30) - PetarMaksimovic

Line: 1 to 1
 

3.1 Analysis Overview: an Introduction

This page presents a big-picture overview of performing an analysis at CMS.

Line: 26 to 26
  The secondary skimming (step 2 above) must be tight enough to make the secondary skims feasible in terms of size. And yet it must not be too tight since otherwise certain analyses might find themselves starved for data. However, in this case what is `tight' is analysis-dependent, so it is vital for the group members to be involved in the definition of their group's secondary skims!
Changed:
<
<
The user selection (step 3) is made on the Tier 2 by the user, and it's the main opportunity to reduce the size of the samples the user will need to deal with (and lug around). In many cases, this is where the preliminary event selection is done, and thus it is the foundation of the analysis. It is expected that the user may need to re-run this step (e.g., in case of finding out that the cuts were too tight), but this is not a problem since the tertiary skims are ran on the already reduced secondary skims.
>
>
The user selection (step 3) is made on the Tier 2 by the user, and it's the main opportunity to reduce the size of the samples the user will need to deal with (and lug around). In many cases, this is where the preliminary event selection is done, and thus it is the foundation of the analysis. It is expected that the user may need to re-run this step (e.g., in case of finding out that the cuts were too tight), but this is not a problem since the tertiary skims are being run on the secondary skims which are already reduced in size.
 
Changed:
<
<
That being said, it is important to preserve the collaboration's CPU resources (as well as one's own time) and tune the user's skim to be as close to `just right' as possible -- the cuts should be looser than they are expected to be after the final cut optimization, but not too loose otherwise the skimming would not serve its purpose.
>
>
That being said, it is important to tune the user's skim to be as close to `just right' as possible: the event selection should be looser than they are expected to be after the final cut optimization, but not too loose -- otherwise the skimming would not serve its purpose. If done right, this will not only save your own time, but also preserve the collaboration's CPU resources.
 

Reduction in event size: CMS Data Formats and Data Tiers

Line: 115 to 116
 The following pages in this Chapter of the WorkBook will illustrate each of these steps, especially data analysis (including making plots) in FW Lite.
Added:
>
>

Review status

<!-- Add your review status in this table structure with 2 columns delineated by three vertical bars. Add comments for editing, reviewing, etc -->

Reviewer/Editor and Date (copy from screen) Comments
PetarMaksimovic - 20 Jun 2009 Created.
PetarMaksimovic - 30 Nov 2009 Some clean-up.

<!-- Here the "responsible" is the contact person for maintenance of the page - often initially the page author -->
<!-- The "review" person is the most recent person to carry out a full review of the page  - on this field please include the date in the format shown -->
Responsible: PetarMaksimovic
Last reviewed by: PetarMaksimovic - 30 Nov 2009

 -- PetarMaksimovic - Last edit 11 June 2009
Added:
>
>
META FILEATTACHMENT attachment="whats_in_aod_reco.gif" attr="" comment="" date="1259610756" name="whats_in_aod_reco.gif" path="whats_in_aod_reco.gif" size="29241" stream="whats_in_aod_reco.gif" tmpFilename="/usr/tmp/CGItemp13217" user="petar" version="1"

Revision 8 (2009-08-25) - KatiLassilaPerini

Line: 1 to 1
Changed:
<
<

Analysis Overview: an Introduction

>
>

3.1 Analysis Overview: an Introduction

  This page presents a big-picture overview of performing an analysis at CMS.
  • The first task is to describe how the data flows within CMS, from data taking through various layers of skimming. This also introduces a concept of a data tier (RECO, AOD) and defines all of them. It also introduces the PAT data format which is described in detail in Chapter 4. This is the scope of this section.

Revision 7 (2009-07-22) - JeffDandoy

Line: 1 to 1
 

Analysis Overview: an Introduction

This page presents a big-picture overview of performing an analysis at CMS.

  • The first task is to describe how the data flows within CMS, from data taking through various layers of skimming. This also introduces a concept of a data tier (RECO, AOD) and defines all of them. It also introduces the PAT data format which is described in detail in Chapter 4. This is the scope of this section.
  • We need to understand the most important CMS data formats, RECO and AOD, so they are described next. PAT is also mentioned, although it will be described in detail later.
Changed:
<
<
  • Next, the ways how to get the data are illustrated. At the end of this exercise, we will end up with one or more small files, which we explore next, first using command-line utilities, and then with graphical tools like edmBrowser and Fireworks event display.
>
>
  • Next, the ways to get data are illustrated. At the end of this exercise, we will end up with one or more small files, which we explore next, first using command-line utilities, and then with graphical tools like edmBrowser and Fireworks event display.
 
  • Finally, we explore two options for a quantitative analysis of CMS events:
    • FW Lite -- using ROOT enhanced with libraries that can understand CMS data formats and aid in fetching object collections from the event
    • the full Framework -- using C++ modules in cmsRun
Line: 13 to 13
 

The data flow, from detector to analysis

Changed:
<
<
To enable the most effective access to CMS data, the data is first split into Physics Datasets (PDs) and then the events filtered. The division into the Physics Datasets is done based on the trigger decision. The primary datasets are structured and placed to make life as easy as possible, e.g. to minimize the need of an average user to run on very large amounts of data. The datasets group or split triggers in order to achieve balance in their size.
>
>
To enable the most effective access to CMS data, the data are first split into Physics Datasets (PDs) and then the events are filtered. The division into the Physics Datasets is done based on the trigger decision. The primary datasets are structured and placed to make life as easy as possible, e.g. to minimize the need of an average user to run on very large amounts of data. The datasets group or split triggers in order to achieve balance in their size.
  However, the Primary Datasets will be too large to make direct access by users reasonable or even feasible. The main strategy in dealing with such a large number of events is to filter them, and do that in layers of ever-tighter event selection. (After all, the Level 1 trigger and HLT are doing the same online.) The process of selecting events and saving them in output is called `skimming'. The intended modus operandi of CMS analysis groups is the following:
Line: 46 to 46
 

Data Tiers: Reconstructed (RECO) Data and Analysis Object Data (AOD)

Changed:
<
<
Event information from each step in the simulation and reconstruction chain is logically grouped into what we call a data tier, which have already been introduced in the Workbook section describing the Computing Model. Examples of data tiers include RAW and RECO, and for MC, GEN, SIM and DIGI. A data tier may contain multiple data formats, as mentioned above for reconstructed data. A given dataset may consist of multiple data tiers, e.g., the term GenSimDigi includes the generation (MC), the simulation (Geant) and digitalization steps. The most important tiers from a physicist's point of view are RECO (all reconstructed objects and hits) and AOD (a smaller subset of RECO which is needed by analysis).
>
>
Event information from each step in the simulation and reconstruction chain is logically grouped into what we call a data tier, which has already been introduced in the Workbook section describing the Computing Model. Examples of data tiers include RAW and RECO, and for MC, GEN, SIM and DIGI. A data tier may contain multiple data formats, as mentioned above for reconstructed data. A given dataset may consist of multiple data tiers, e.g., the term GenSimDigi includes the generation (MC), the simulation (Geant) and digitalization steps. The most important tiers from a physicist's point of view are RECO (all reconstructed objects and hits) and AOD (a smaller subset of RECO which is needed by analysis).
 
Changed:
<
<
RECO data contains objects from all stages of reconstruction. AOD are derived from the RECO information to provide data for physics analyses in a convenient, compact format. Typically, physics analyses don't require you to rerun the reconstruction process on the data. Most physics analyses can run on AOD data.
>
>
RECO data contains objects from all stages of reconstruction. AOD data are derived from the RECO information to provide data for physics analyses in a convenient, compact format. Typically, physics analyses don't require you to rerun the reconstruction process on the data. Most physics analyses can run on AOD data.
  whats_in_aod_reco.gif

Revision 6 (2009-06-11) - PetarMaksimovic

Line: 1 to 1
 

Analysis Overview: an Introduction

This page presents a big-picture overview of performing an analysis at CMS.

Line: 83 to 83
 

PAT

Changed:
<
<
The information is stored in DataFormats in RECO and AOD in a way that uses the least amount of space and allows for the greatest flexibility. However, accessing connections between various RECO objects requires more experience with C++. To simplify the user's analysis, a set of new data formats are created, which aggregate the related RECO information. These new formats, along with the tools used to make and manipulate them, are called Physics Analysis Toolkit, or PAT. The PAT is de facto the way how the users will access the physics objects which are the output of RECO.
>
>
The information is stored in RECO and AOD in a way that uses the least amount of space and allows for the greatest flexibility. This is particularly true for DataFormats that contain objects that link to each other. However, accessing these links between RECO or AOD objects requires more experience with C++. To simplify the user's analysis, a set of new data formats are created, which aggregate the related RECO information. These new formats, along with the tools used to make and manipulate them, are called Physics Analysis Toolkit, or PAT. The PAT is de facto the way how the users will access the physics objects which are the output of RECO.
 
Changed:
<
<
PAT's content is flexible -- it is up to the user to define it. For this reason, PAT is not a data tier! The content of PAT may change from one analysis to another, let alone from one PAG to another. However, PAT is important because it defines a standard for the physics objects and variables stored in those physics objects. It is like a menu in a restaurant -- every patron can choose different things from the menu, but everybody is reading from the same menu. This facilitates sharing both tools and people between analyses and physics groups.
>
>
PAT's content is flexible -- it is up to the user to define it. For this reason, PAT is not a data tier. The content of PAT may change from one analysis to another, let alone from one PAG to another. However, PAT defines a standard for the physics objects and variables stored in those physics objects. It is like a menu in a restaurant -- every patron can choose different things from the menu, but everybody is reading from the same menu. This facilitates sharing both tools and people between analyses and physics groups.
 

Group and user skims: RECO, AOD and PAT-tuples

Changed:
<
<
Now we can refine the descriptions of primary, group and user-defined skims, with some examples. In almost all cases the primary skims will read AOD and produce AOD with a reduced number of events. (During the physics commissioning, the primary skims may also read and write RECO instead of AOD.) The group and user skims may also read and write AOD (or RECO). However, they could also produce PAT-tuples, as decided by the group or the user. For example
  1. primary skims read AOD, write AOD
  2. group-wide skim filters events, and produces PAT with lots of information (as it needs to benefit multiple efforts within the group)
  3. the user modifies the PAT workflow to read PAT and produce another version of PAT, with much smaller content (stripping/slimming), and possibly even compressed PAT object (thinning).
>
>
Now we can refine the descriptions of primary, group and user-defined skims, with some examples. In almost all cases the primary skims will read AOD and produce AOD with a reduced number of events. (During the physics commissioning, the primary skims may also read and write RECO instead of AOD.) The group and user skims may also read and write AOD (or RECO). However, they could also produce PAT-tuples, as decided by the group or the user. As an illustration, these steps could be:
 
Changed:
<
<
The resulting "PAT-tuple" should be small enough to not only fit onto a laptop, but also to fit within a memory of the ROOT process, thus facilitating interactive use with identical speed with TTrees. However, to be able to read CMS data (RECO, AOD or PAT) from ROOT, we first need to teach ROOT to understand CMS DataFormats. We do it by loading CMS DataFormats libraries themselves, as well as a couple of helper classes that simplify the user's manipulation of CMS "events" in ROOT. ROOT with these additional libraries installed is called Framework-lite, or FW Lite.
>
>
  1. primary skims read AOD, write AOD.
  2. group-wide skim filters events in AOD, and produces PAT with lots of information. (Such PAT-tuples are sometimes called for as they need to benefit multiple efforts within the group)
  3. the user modifies the PAT workflow to read PAT and produce another version of PAT, but with much smaller content (stripping/slimming), and possibly even compressed PAT object (thinning).

All the operations that involve skimming, stripping, and thinning are done within the full-Framework. Therefore, every user needs to at least know what these jobs do in each of the steps, even if s/he does not need to make any changes to any of the processing steps. However, it is more likely that some changes will be needed, especially in the last stage where the skimming and further processing is run by the user. In some cases, the user may even need to write Framework modules like EDProducers -- to add new DataFormats to the events, or EDAnalyzers -- to compute quantities that require access to conditions.

In the above example, the end of the skimming chain produces a "PAT-tuple", which should be small enough to easily fit onto a laptop. Moreover, it should also fit within the memory of the ROOT process, thus facilitating interactive speed on par with TTrees. However, to be able to read CMS data (RECO, AOD or PAT) from ROOT, we need to teach ROOT to understand CMS DataFormats by loading the DataFormats libraries themselves, accompanied by a couple of helper classes that simplify the user's manipulation of CMS "events" in ROOT. ROOT with these additional libraries installed is called Framework-lite, or FW Lite.

 

Tools for interactive analysis: FW Lite, edmBrowser, Fireworks

Changed:
<
<
The interactive stage is where most of the analysis is actually done, and where most of the `analysis time' is actually spent. Every analysis is different, but most take a number of twists and turns towards its conclusion, solving an array of riddles on the way.
>
>
The interactive stage is where most of the analysis is actually done, and where most of the `analysis time' is actually spent. Every analysis is different, and many take a number of twists and turns towards its conclusion, solving an array of riddles on the way. However, most analyses need (or could benefit from):

  • a way to examine the content of CMS data files, especially PAT-tuples. CMS has several tools that can examine the file content, including stand-alone executables edmDumpEventContent (which dumps a list of the collections present in the file to the terminal), and edmBrowser (which has a nice graphical interface).
  • a way to obtain the history of the file. The CMS files contain embedded information sufficient to tell the history of the objects in the file. This information is called provenance, and is crucial for the analysis, as it allows the user to establish with certainty what kind of operations (corrections, calibrations, algorithms) were performed on the data present in the file. The stand-alone executable edmProvDump prints the provenance to the screen.
  • a way to visualize the event. CMS has two event viewers: Iguana is geared toward the detailed description of the event, and is described later, in the advanced section of the workbook. In contrast, the main objective of Fireworks is to display analysis-level quantities. Moreover, Fireworks is well-suited for investigating events in CMS data files, since it can read them directly.
  • a way to manipulate data quantitatively. In HEP, most quantitative analysis of data is performed within ROOT framework. ROOT has, over the years, subsumed an impressive collection of statistical and other tools, including the fitting package RooFit, or a collection of multi-variate analysis tools, TMVA. ROOT can access the CMS data directly, provided the DataFormats libraries are loaded, turning it into FW Lite.

The following pages in this Chapter of the WorkBook will illustrate each of these steps, especially data analysis (including making plots) in FW Lite.

 
Changed:
<
<
-- PetarMaksimovic - 17 Mar 2009
>
>
-- PetarMaksimovic - Last edit 11 June 2009
 

Revision 5 (2009-06-11) - PetarMaksimovic

Line: 1 to 1
 

Analysis Overview: an Introduction

This page presents a big-picture overview of performing an analysis at CMS.

Line: 54 to 54
  whats_in_aod_reco.gif

Changed:
<
<

RECO

>
>

RECO

 RECO is the name of the data-tier which contains objects created by the event reconstruction program. It is derived from RAW data and provides access to reconstructed physics objects for physics analysis in a convenient format. Event reconstruction is structured in several hierarchical steps:

  1. Detector-specific processing: Starting from detector data unpacking and decoding, detector calibration constants are applied and cluster or hit objects are reconstructed.
Line: 70 to 71
 

Changed:
<
<

AOD

>
>

AOD

  AOD are derived from the RECO information to provide data for physics analysis in a convenient, compact format. AOD data are usable directly by physics analyses. AOD data will be produced by the same, or subsequent, processing steps as produce the RECO data; and AOD data will be made easily available at multiple sites to CMS members. The AOD will contain enough information about the event to support all the typical usage patterns of a physics analysis. Thus, it will contain a copy of all the high-level physics objects (such as muons, electrons, taus, etc.), plus a summary of the RECO information sufficient to support typical analysis actions such as track refitting with improved alignment or kinematic constraints, re-evaluation of energy and/or position of ECAL clusters based on analysis-specific corrections. The AOD, because of the limited size that will not allow it to contain all the hits, will typically not support the application of novel pattern recognition techniques, nor the application of new calibration constants, which would typically require the use of RECO or RAW information.
Line: 88 to 89
 

Changed:
<
<

Group and user skims: RECO, AOD and PAT-tuples

>
>

Group and user skims: RECO, AOD and PAT-tuples

Now we can refine the descriptions of primary, group and user-defined skims, with some examples. In almost all cases the primary skims will read AOD and produce AOD with a reduced number of events. (During the physics commissioning, the primary skims may also read and write RECO instead of AOD.) The group and user skims may also read and write AOD (or RECO). However, they could also produce PAT-tuples, as decided by the group or the user. For example

  1. primary skims read AOD, write AOD
  2. group-wide skim filters events, and produces PAT with lots of information (as it needs to benefit multiple efforts within the group)
  3. the user modifies the PAT workflow to read PAT and produce another version of PAT, with much smaller content (stripping/slimming), and possibly even compressed PAT object (thinning).
 
Added:
>
>
The resulting "PAT-tuple" should be small enough to not only fit onto a laptop, but also to fit within a memory of the ROOT process, thus facilitating interactive use with identical speed with TTrees. However, to be able to read CMS data (RECO, AOD or PAT) from ROOT, we first need to teach ROOT to understand CMS DataFormats. We do it by loading CMS DataFormats libraries themselves, as well as a couple of helper classes that simplify the user's manipulation of CMS "events" in ROOT. ROOT with these additional libraries installed is called Framework-lite, or FW Lite.
 

Tools for interactive analysis: FW Lite, edmBrowser, Fireworks

Changed:
<
<
>
>
The interactive stage is where most of the analysis is actually done, and where most of the `analysis time' is actually spent. Every analysis is different, but most take a number of twists and turns towards its conclusion, solving an array of riddles on the way.
 

-- PetarMaksimovic - 17 Mar 2009

Revision 4 (2009-06-10) - PetarMaksimovic

Line: 1 to 1
 

Analysis Overview: an Introduction

This page presents a big-picture overview of performing an analysis at CMS.

Line: 43 to 43
 Data formats (classes) for reconstructed data, for example, include Reco.Track, Reco.TrackExtra, and many more. See the Offline Guide section SWGuideRecoDataTable for the full listing.
Added:
>
>

Data Tiers: Reconstructed (RECO) Data and Analysis Object Data (AOD)

Event information from each step in the simulation and reconstruction chain is logically grouped into what we call a data tier, which have already been introduced in the Workbook section describing the Computing Model. Examples of data tiers include RAW and RECO, and for MC, GEN, SIM and DIGI. A data tier may contain multiple data formats, as mentioned above for reconstructed data. A given dataset may consist of multiple data tiers, e.g., the term GenSimDigi includes the generation (MC), the simulation (Geant) and digitalization steps. The most important tiers from a physicist's point of view are RECO (all reconstructed objects and hits) and AOD (a smaller subset of RECO which is needed by analysis).

RECO data contains objects from all stages of reconstruction. AOD are derived from the RECO information to provide data for physics analyses in a convenient, compact format. Typically, physics analyses don't require you to rerun the reconstruction process on the data. Most physics analyses can run on AOD data.

whats_in_aod_reco.gif

RECO

RECO is the name of the data-tier which contains objects created by the event reconstruction program. It is derived from RAW data and provides access to reconstructed physics objects for physics analysis in a convenient format. Event reconstruction is structured in several hierarchical steps:

  1. Detector-specific processing: Starting from detector data unpacking and decoding, detector calibration constants are applied and cluster or hit objects are reconstructed.
  2. Tracking: Hits in the silicon and muon detectors are used to reconstruct global tracks. Pattern recognition in the tracker is the most CPU-intensive task.
  3. Vertexing: Reconstructs primary and secondary vertex candidates.
  4. Particle identification: Produces the objects most associated with physics analyses. Using a wide variety of sophisticated algorithms, standard physics object candidates are created (electrons, photons, muons, missing transverse energy and jets; heavy-quarks, tau decay).

The normal completion of the reconstruction task will result in a full set of these reconstructed objects usable by CMS physicists in their analyses. You would only need to rerun these algorithms if your analysis requires you to take account of such things as trial calibrations, novel algorithms etc.

Reconstruction is expensive in terms of CPU and is dominated by tracking. The RECO data-tier will provide compact information for analysis to avoid the necessity to access the RAW data for most analysis. Following the hierarchy of event reconstruction, RECO will contain objects from all stages of reconstruction. At the lowest level it will be reconstructed hits, clusters and segments. Based on these objects reconstructed tracks and vertices are stored. At the highest level reconstructed jets, muons, electrons, b-jets, etc. are stored. A direct reference from high-level objects to low-level objects will be possible, to avoid duplication of information. In addition the RECO format will preserve links to the RAW information.

The RECO data includes quantities required for typical analysis usage patterns such as: track re-finding, calorimeter reclustering, and jet energy calibration. The RECO event content is documented in the Offline Guide at RECO Data Format Table.
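To make the preceding description concrete, the following hedged C++ sketch shows an EDAnalyzer that reads the reconstructed muon collection and follows the embedded reference from the high-level muon down to its inner track. The "muons" label is the usual RECO collection name, but the module itself is purely illustrative.

<verbatim>
// Hypothetical analyzer: reads reconstructed muons from a RECO/AOD file and
// follows the reference from the high-level muon to its inner track.
#include <iostream>
#include "FWCore/Framework/interface/EDAnalyzer.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "DataFormats/Common/interface/Handle.h"
#include "DataFormats/MuonReco/interface/Muon.h"
#include "DataFormats/MuonReco/interface/MuonFwd.h"
#include "DataFormats/TrackReco/interface/Track.h"

class MuonTrackInspector : public edm::EDAnalyzer {
public:
  explicit MuonTrackInspector(const edm::ParameterSet&) {}
  virtual void analyze(const edm::Event& event, const edm::EventSetup&) {
    edm::Handle<reco::MuonCollection> muons;
    event.getByLabel("muons", muons);                // standard RECO muon collection label
    for (reco::MuonCollection::const_iterator m = muons->begin(); m != muons->end(); ++m) {
      if (m->innerTrack().isNonnull())               // reference from high-level object to low-level track
        std::cout << "muon pt=" << m->pt()
                  << "  inner-track chi2/ndof=" << m->innerTrack()->normalizedChi2() << std::endl;
    }
  }
};

DEFINE_FWK_MODULE(MuonTrackInspector);
</verbatim>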

AOD

AOD are derived from the RECO information to provide data for physics analysis in a convenient, compact format. AOD data are usable directly by physics analyses. AOD data will be produced by the same, or subsequent, processing steps as produce the RECO data; and AOD data will be made easily available at multiple sites to CMS members. The AOD will contain enough information about the event to support all the typical usage patterns of a physics analysis. Thus, it will contain a copy of all the high-level physics objects (such as muons, electrons, taus, etc.), plus a summary of the RECO information sufficient to support typical analysis actions such as track refitting with improved alignment or kinematic constraints, re-evaluation of energy and/or position of ECAL clusters based on analysis-specific corrections. The AOD, because of the limited size that will not allow it to contain all the hits, will typically not support the application of novel pattern recognition techniques, nor the application of new calibration constants, which would typically require the use of RECO or RAW information.

The AOD data tier will contain physics objects: tracks with associated Hits, calorimetric clusters with associated Hits, vertices, jets and high-level physics objects (electrons, muons, Z boson candidates, and so on).

Because the AOD data tier is relatively compact, all Tier-1 computing centres are able to keep a full copy of the AOD, while they will hold only a subset of the RAW and RECO data tiers. The AOD event content is documented in the Offline Guide at AOD Data Format Table.

PAT

The information is stored in DataFormats in RECO and AOD in a way that uses the least amount of space and allows for the greatest flexibility. However, accessing connections between various RECO objects requires more experience with C++. To simplify the user's analysis, a set of new data formats are created, which aggregate the related RECO information. These new formats, along with the tools used to make and manipulate them, are called Physics Analysis Toolkit, or PAT. The PAT is de facto the way how the users will access the physics objects which are the output of RECO.

PAT's content is flexible -- it is up to the user to define it. For this reason, PAT is not a data tier! The content of PAT may change from one analysis to another, let alone from one PAG to another. However, PAT is important because it defines a standard for the physics objects and variables stored in those physics objects. It is like a menu in a restaurant -- every patron can choose different things from the menu, but everybody is reading from the same menu. This facilitates sharing both tools and people between analyses and physics groups.

Group and user skims: RECO, AOD and PAT-tuples

Tools for interactive analysis: FW Lite, edmBrowser, Fireworks

 

Revision 3 (2009-06-10) - PetarMaksimovic

Line: 1 to 1
 

Analysis Overview: an Introduction

Changed:
<
<
This intention of this chapter is to present a big-picture overview of performing an analysis on data at CMS. The objective is running on the data, so
>
>
This page presents a big-picture overview of performing an analysis at CMS.
 
  • The first task is to describe how the data flows within CMS, from data taking through various layers of skimming. This also introduces a concept of a data tier (RECO, AOD) and defines all of them. It also introduces the PAT data format which is described in detail in Chapter 4. This is the scope of this section.
Changed:
<
<
  • Next, the ways how to get the data are illustrated.
  • Once we lay our hands on the data, we need to know how to access the data; there are basically three options:
    • the full Framework -- using C++ modules in cmsRun
>
>
  • We need to understand the most important CMS data formats, RECO and AOD, so they are described next. PAT is also mentioned, although it will be described in detail later.
  • Next, the ways to get the data are illustrated. At the end of this exercise, we will end up with one or more small files, which we explore next, first using command-line utilities, and then with graphical tools like edmBrowser and the Fireworks event display.
  • Finally, we explore two options for a quantitative analysis of CMS events:
 
    • FW Lite -- using ROOT enhanced with libraries that can understand CMS data formats and aid in fetching object collections from the event
Changed:
<
<
    • bare ROOT -- accessing CMS events this way is still possible, but it is not recommended; however it is sufficient for TNtuples and TTrees
  • In the case of using either the full Framework or FW Lite, we need to understand various CMS data formats, so they are described next.
>
>
    • the full Framework -- using C++ modules in cmsRun
 

The data flow, from detector to analysis

Changed:
<
<
To enable the most effective access to CMS data, the data will be both split into Physics Datasets (PDs) and filtered. The division into the Physics Datasets is done based on the trigger decision. The primary datasets are structured and placed to make life as easy as possible, e.g. to minimize the need of an average user to run on very large amounts of data. The datasets will group or split triggers in order to achieve balance in their size.
>
>
To enable the most effective access to CMS data, the data is first split into Physics Datasets (PDs) and then the events are filtered. The division into the Physics Datasets is based on the trigger decision. The primary datasets are structured and placed to make life as easy as possible, e.g. to minimize the need for an average user to run on very large amounts of data. The datasets group or split triggers in order to balance their sizes.
 
Changed:
<
<
Eventually, the Primary Datasets will be too large to make direct access by users reasonable or even feasible. The main strategy in dealing with such a large number of events is to filter them, and do that in layers of ever tighter event selection. (After all, the Level 1 trigger and HLT are doing the same online.) The process of selecting events and saving them in output is called `skimming'. The intended modus operandi of CMS analysis groups is the following:
>
>
However, the Primary Datasets will be too large to make direct access by users reasonable or even feasible. The main strategy in dealing with such a large number of events is to filter them, and do that in layers of ever-tighter event selection. (After all, the Level 1 trigger and HLT are doing the same online.) The process of selecting events and saving them in output is called `skimming'. The intended modus operandi of CMS analysis groups is the following:
 
  1. the primary datasets and skims are produced; they are defined using the trigger information (for stability) and produced centrally on Tier 1 systems
  2. the secondary skims are produced by the physics groups (say a Higgs group) by running on the primary skims; the secondary skims are usually produced by group members running on the Tier 2 clusters assigned to the given group
  3. optionally, the user then skims once again, applying an ever tighter event selection
  4. the final sample (with almost final cuts) can then be analyzed by FW Lite. It can also be analyzed by the full framework, however we recommend using FW Lite as it is interactive and far more portable
Changed:
<
<
The secondary skimming (step 2 above) is of utmost importance! The selection must be tight enough to make the secondary skims feasible in terms of size. And yet it must not be too tight since otherwise certain analyses might find themselves starved for data. However, in this case what is `tight' is analysis-dependent, so it is vital for the group members to be involved in the definition of their group's secondary skims!
>
>
The primary skims (step 1 above) reduce the size of the primary datasets in order to reduce the time of subsequent layers of skimming. The target of the primary skims is a reduction of about a factor of 10 in size with respect to the primary datasets.

The secondary skimming (step 2 above) must be tight enough to make the secondary skims feasible in terms of size. And yet it must not be too tight since otherwise certain analyses might find themselves starved for data. However, in this case what is `tight' is analysis-dependent, so it is vital for the group members to be involved in the definition of their group's secondary skims!

 
Changed:
<
<
The user selection (step 3) is the user's main opportunity to reduce the size of the samples s/he will need to deal with (and lug around). In many cases, this is where the preliminary selection is done, and thus it should be viewed as the foundation of the analysis. It is extremely important -- although in this case finding out that the cuts were too tight is not as disastrous since the tertiary skims could then be remade on the Tier 2 by the user. The existence of the secondary skims is precisely what makes this possible.
>
>
The user selection (step 3) is made on the Tier 2 by the user, and it is the main opportunity to reduce the size of the samples the user will need to deal with (and lug around). In many cases, this is where the preliminary event selection is done, and thus it is the foundation of the analysis. It is expected that the user may need to re-run this step (e.g., if the cuts turn out to be too tight), but this is not a problem since the tertiary skims are run on the already reduced secondary skims.
  That being said, it is important to preserve the collaboration's CPU resources (as well as one's own time) and tune the user's skim to be as close to `just right' as possible -- the cuts should be looser than they are expected to be after the final cut optimization, but not too loose otherwise the skimming would not serve its purpose.
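As an illustration of step 3, the sketch below shows the general shape of a user skim configuration for cmsRun. The filter module type MyMuonSkimFilter, its parameters and the file names are hypothetical placeholders; PoolSource, PoolOutputModule and the SelectEvents mechanism are the standard ingredients.

<verbatim>
# Sketch of a user skim configuration; the filter and its cuts are hypothetical.
import FWCore.ParameterSet.Config as cms

process = cms.Process("USERSKIM")

process.source = cms.Source("PoolSource",
    fileNames = cms.untracked.vstring("file:secondarySkim.root"))  # assumed input

# Hypothetical filter: accept events with at least one muon above a pT threshold.
process.muonFilter = cms.EDFilter("MyMuonSkimFilter",
    muons = cms.InputTag("selectedPatMuons"),
    minPt = cms.double(20.0))

process.skimPath = cms.Path(process.muonFilter)

# Only events that pass skimPath are written to the output file.
process.out = cms.OutputModule("PoolOutputModule",
    fileName     = cms.untracked.string("userSkim.root"),
    SelectEvents = cms.untracked.PSet(SelectEvents = cms.vstring("skimPath")))

process.outpath = cms.EndPath(process.out)
</verbatim>

Running cmsRun on such a configuration produces the reduced sample that is then analyzed interactively.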
Changed:
<
<

Reduction in event size: CMS Data Tiers

>
>

Reduction in event size: CMS Data Formats and Data Tiers

  In addition to the reduction of the number of events, in steps 1-3 it is also possible to reduce the size of each event by
Changed:
<
<
  • removing unneeded collections (e.g. after we make PAT candidates, for most purposes the rest of the AOD information is not needed); this is called stripping
>
>
  • removing unneeded collections (e.g. after we make PAT candidates, for most purposes the rest of the AOD information is not needed); this is called stripping or slimming (a configuration sketch follows this list).
 
  • removing unneeded information from objects; this is called thinning. It is an advanced topic; it is still experimental and not covered here.
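In configuration terms, stripping is usually expressed through the outputCommands of the output module: drop everything, then keep only the collections the analysis needs. The sketch below shows an output module of this kind, to be attached to a process as in the skim example above; the keep patterns are only examples and must be adapted to the actual event content.

<verbatim>
# Sketch of stripping via outputCommands; the keep patterns are examples only.
import FWCore.ParameterSet.Config as cms

strippedOutput = cms.OutputModule("PoolOutputModule",
    fileName = cms.untracked.string("stripped.root"),
    outputCommands = cms.untracked.vstring(
        "drop *",                             # start from an empty event
        "keep *_selectedPat*_*_*",            # e.g. the PAT candidates
        "keep recoTracks_generalTracks_*_*"   # e.g. the general track collection
    )
)
</verbatim>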
Changed:
<
<
Starting from the detector output ("RAW" data), the information is being refined and what is not needed is being dropped. This defines the CMS data tiers. Each bit of data in an event must be written in a supported data format. A data format is essentially a C++ class, where a class defines a data structure (a data type with data members). The term data format can be used to refer to the format of the data written using the class (e.g., data format as a sort of template), or to the instantiated class object itself. The DataFormats package and the SimDataFormats package (for simulated data) in the CMSSW CVS repository contain all the supported data formats that can be written to an Event file. So, for example, if you wish to add data to an Event, your EDProducer module must instantiate one or more of these data format classes.
>
>
Starting from the detector output ("RAW" data), the information is being refined and what is not needed is being dropped. This defines the CMS data tiers. Each bit of data in an event must be written in a supported data format. A data format is essentially a C++ class, where a class defines a data structure (a data type with data members). The term data format can be used to refer to the format of the data written using the class (e.g., data format as a sort of template), or to the instantiated class object itself. The DataFormats package and the SimDataFormats package (for simulated data) in the CMSSW CVS repository contain all the supported data formats that can be written to an Event file. So, for example, if you wish to add data to an Event, your EDProducer module must instantiate one or more of these data format classes.
  Data formats (classes) for reconstructed data, for example, include Reco.Track, Reco.TrackExtra, and many more. See the Offline Guide section SWGuideRecoDataTable for the full listing.
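For instance, the reconstructed tracks can be read back in FW Lite through exactly these data format classes. In the sketch below the file name is an assumption, while generalTracks is the usual label of the main track collection in RECO and AOD.

<verbatim>
# Minimal FW Lite sketch reading a RECO data format; the file name is an assumption.
from DataFormats.FWLite import Events, Handle

events = Events("reco.root")                  # hypothetical RECO or AOD file
tracks = Handle("std::vector<reco::Track>")   # the reco::Track data format

for event in events:
    event.getByLabel("generalTracks", tracks)
    print "tracks in this event:", tracks.product().size()
</verbatim>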
Deleted:
<
<

About Data Tiers

Event information from each step in the simulation and reconstruction chain is logically grouped into what we call a data tier. Examples of data tiers include RAW and RECO, and for MC, GEN, SIM and DIGI. A data tier may contain multiple data formats, as mentioned above for reconstructed data. A given dataset may consist of multiple data tiers, e.g., the term GenSimDigi includes the generation (MC), the simulation (Geant) and digitalization steps. The most important tiers from a physicist's point of view are probably RECO (all reconstructed objects and hits) and AOD (a smaller subset of RECO). The following table gives an overview.

E.g., the RAW data tier collects detector data after online formatting plus some trigger results, while the RECO tier collects reconstructed objects.

Data Tier Listing

| Event Format | Contents | Purpose | Data Type Ref | Event Size (MB) |
| DAQ-RAW | Detector data from front end electronics + L1 trigger result. | Primary record of physics event. Input to online HLT. | | 1-1.5 |
| RAW | Detector data after online formatting, the L1 trigger result, the result of the HLT selections (HLT trigger bits), potentially some of the higher level quantities calculated during HLT processing. | Input to Tier-0 reconstruction. Primary archive of events at CERN. | | 1.5 |
| RECO | Reconstructed objects (tracks, vertices, jets, electrons, muons, etc.) and reconstructed hits/clusters. | Output of Tier-0 reconstruction and subsequent rereconstruction passes. Supports re-finding of tracks, etc. | RECO & AOD | 0.25 |
| AOD | Subset of RECO. Reconstructed objects (tracks, vertices, jets, electrons, muons, etc.). Possible small quantities of very localised hit information. | Physics analysis, limited refitting of tracks and clusters. | RECO & AOD | 0.05 |
| TAG | Run/event number, high-level physics objects, e.g. used to index events. | Rapid identification of events for further study (event directory). | | 0.01 |
| FEVT | Full Event: term used to refer to RAW+RECO together (not a distinct format). | | multiple | 1.75 |
| GEN | Generated Monte Carlo event. | - | | - |
| SIM | Energy depositions of MC particles in detector (sim hits). | - | | - |
| DIGI | Sim hits converted into detector response. Basically the same as the RAW output of the detector. | - | | 1.5 |
The Data Type Ref column entries point to the CMSSW Reference Manual, which is not complete.

Data Tiers: Reconstructed (RECO) Data and Analysis Object Data (AOD)

RECO data contains objects from all stages of reconstruction. AOD are derived from the RECO information to provide data for physics analyses in a convenient, compact format. Typically, physics analyses don't require you to rerun the reconstruction process on the data. Most physics analyses can run on AOD data.

(Figure whats_in_aod_reco.gif: overview of what is contained in the RECO and AOD data tiers.)

RECO

RECO is the name of the data-tier which contains objects created by the event reconstruction program. It is derived from RAW data and provides access to reconstructed physics objects for physics analysis in a convenient format. Event reconstruction is structured in several hierarchical steps:

  1. Detector-specific processing: Starting from detector data unpacking and decoding, detector calibration constants are applied and cluster or hit objects are reconstructed.
  2. Tracking: Hits in the silicon and muon detectors are used to reconstruct global tracks. Pattern recognition in the tracker is the most CPU-intensive task.
  3. Vertexing: Reconstructs primary and secondary vertex candidates.
  4. Particle identification: Produces the objects most associated with physics analyses. Using a wide variety of sophisticated algorithms, standard physics object candidates are created (electrons, photons, muons, missing transverse energy and jets; heavy-quarks, tau decay).

The normal completion of the reconstruction task will result in a full set of these reconstructed objects usable by CMS physicists in their analyses. You would only need to rerun these algorithms if your analysis requires you to take account of such things as trial calibrations, novel algorithms etc.

Reconstruction is expensive in terms of CPU and is dominated by tracking. The RECO data-tier will provide compact information for analysis to avoid the necessity to access the RAW data for most analysis. Following the hierarchy of event reconstruction, RECO will contain objects from all stages of reconstruction. At the lowest level it will be reconstructed hits, clusters and segments. Based on these objects reconstructed tracks and vertices are stored. At the highest level reconstructed jets, muons, electrons, b-jets, etc. are stored. A direct reference from high-level objects to low-level objects will be possible, to avoid duplication of information. In addition the RECO format will preserve links to the RAW information.

The RECO data includes quantities required for typical analysis usage patterns such as: track re-finding, calorimeter reclustering, and jet energy calibration. The RECO event content is documented in the Offline Guide at RECO Data Format Table.

AOD

AOD are derived from the RECO information to provide data for physics analysis in a convenient, compact format. AOD data are usable directly by physics analyses. AOD data will be produced by the same, or subsequent, processing steps as produce the RECO data; and AOD data will be made easily available at multiple sites to CMS members. The AOD will contain enough information about the event to support all the typical usage patterns of a physics analysis. Thus, it will contain a copy of all the high-level physics objects (such as muons, electrons, taus, etc.), plus a summary of the RECO information sufficient to support typical analysis actions such as track refitting with improved alignment or kinematic constraints, re-evaluation of energy and/or position of ECAL clusters based on analysis-specific corrections. The AOD, because of the limited size that will not allow it to contain all the hits, will typically not support the application of novel pattern recognition techniques, nor the application of new calibration constants, which would typically require the use of RECO or RAW information.

The AOD data tier will contain physics objects: tracks with associated Hits, calorimetric clusters with associated Hits, vertices, jets and high-level physics objects (electrons, muons, Z boson candidates, and so on).

Because the AOD data tier is relatively compact, all Tier-1 computing centres are able to keep a full copy of the AOD, while they will hold only a subset of the RAW and RECO data tiers. The AOD event content is documented in the Offline Guide at AOD Data Format Table.

Reference Documentation for RECO and AOD Data Format Packages

The reference documentation is provided in the Offline Guide sections: These links provide a list of all packages related to the RECO and AOD data formats within the CMSSW repository. Links there point to the class documentation of each data object. Short instructions on how to access the objects are given.
 
Deleted:
<
<
More about the different data inside an event can be found in the "ADVANCED TOPICS" part of the WorkBook where the objects and their creation are explained in detail. In the "ESSENTIALS" part, the section about Particle Candidates gives an overview of the candidate model.
 

Revision 2 - 2009-03-18 - PetarMaksimovic

Line: 1 to 1
 

Analysis Overview: an Introduction

The intention of this chapter is to present a big-picture overview of performing an analysis on data at CMS. The objective is to run on the data, so

Changed:
<
<
  • The first task is to describe how the data flows within CMS, from data taking through various layers of skimming. This also introduces a concept of a data tier (RECO, AOD, PAT) and defines all of them.
>
>
  • The first task is to describe how the data flows within CMS, from data taking through various layers of skimming. This also introduces a concept of a data tier (RECO, AOD) and defines all of them. It also introduces the PAT data format which is described in detail in Chapter 4. This is the scope of this section.
 
  • Next, the ways how to get the data are illustrated.
  • Once we lay our hands on the data, we need to know how to access the data; there are basically three options:
    • the full Framework -- using C++ modules in cmsRun
Line: 10 to 10
 
    • bare ROOT -- accessing CMS events this way is still possible, but it is not recommended; however it is sufficient for TNtuples and TTrees
  • In the case of using either the full Framework or FW Lite, we need to understand various CMS data formats, so they are described next.
Added:
>
>

The data flow, from detector to analysis

To enable the most effective access to CMS data, the data will be both split into Physics Datasets (PDs) and filtered. The division into the Physics Datasets is done based on the trigger decision. The primary datasets are structured and placed to make life as easy as possible, e.g. to minimize the need of an average user to run on very large amounts of data. The datasets will group or split triggers in order to achieve balance in their size.

Eventually, the Primary Datasets will be too large to make direct access by users reasonable or even feasible. The main strategy in dealing with such a large number of events is to filter them, and do that in layers of ever tighter event selection. (After all, the Level 1 trigger and HLT are doing the same online.) The process of selecting events and saving them in output is called `skimming'. The intended modus operandi of CMS analysis groups is the following:

  1. the primary datasets and skims are produced; they are defined using the trigger information (for stability) and produced centrally on Tier 1 systems
  2. the secondary skims are produced by the physics groups (say a Higgs group) by running on the primary skims; the secondary skims are usually produced by group members running on the Tier 2 clusters assigned to the given group
  3. optionally, the user then skims once again, applying an ever tighter event selection
  4. the final sample (with almost final cuts) can then be analyzed by FW Lite. It can also be analyzed by the full framework, however we recommend using FW Lite as it is interactive and far more portable

The secondary skimming (step 2 above) is of utmost importance! The selection must be tight enough to make the secondary skims feasible in terms of size. And yet it must not be too tight since otherwise certain analyses might find themselves starved for data. However, in this case what is `tight' is analysis-dependent, so it is vital for the group members to be involved in the definition of their group's secondary skims!

The user selection (step 3) is the user's main opportunity to reduce the size of the samples s/he will need to deal with (and lug around). In many cases, this is where the preliminary selection is done, and thus it should be viewed as the foundation of the analysis. It is extremely important -- although in this case finding out that the cuts were too tight is not as disastrous since the tertiary skims could then be remade on the Tier 2 by the user. The existence of the secondary skims is precisely what makes this possible.

That being said, it is important to preserve the collaboration's CPU resources (as well as one's own time) and tune the user's skim to be as close to `just right' as possible -- the cuts should be looser than they are expected to be after the final cut optimization, but not too loose otherwise the skimming would not serve its purpose.

Reduction in event size: CMS Data Tiers

In addition to the reduction of the number of events, in steps 1-3 it is also possible to reduce the size of each event by

  • removing unneeded collections (e.g. after we make PAT candidates, for most purposes the rest of the AOD information is not needed); this is called stripping
  • removing unneeded information from objects; this is called thinning . It is an advanced topic; it's still experimental and not covered in here.

Starting from the detector output ("RAW" data), the information is being refined and what is not needed is being dropped. This defines the CMS data tiers. Each bit of data in an event must be written in a supported data format. A data format is essentially a C++ class, where a class defines a data structure (a data type with data members). The term data format can be used to refer to the format of the data written using the class (e.g., data format as a sort of template), or to the instantiated class object itself. The DataFormats package and the SimDataFormats package (for simulated data) in the CMSSW CVS repository contain all the supported data formats that can be written to an Event file. So, for example, if you wish to add data to an Event, your EDProducer module must instantiate one or more of these data format classes.

Data formats (classes) for reconstructed data, for example, include Reco.Track, Reco.TrackExtra, and many more. See the Offline Guide section SWGuideRecoDataTable for the full listing.

About Data Tiers

Event information from each step in the simulation and reconstruction chain is logically grouped into what we call a data tier. Examples of data tiers include RAW and RECO, and for MC, GEN, SIM and DIGI. A data tier may contain multiple data formats, as mentioned above for reconstructed data. A given dataset may consist of multiple data tiers, e.g., the term GenSimDigi includes the generation (MC), the simulation (Geant) and digitalization steps. The most important tiers from a physicist's point of view are probably RECO (all reconstructed objects and hits) and AOD (a smaller subset of RECO). The following table gives an overview.

E.g., the RAW data tier collects detector data after online formatting plus some trigger results, while the RECO tier collects reconstructed objects.

Data Tier Listing

| Event Format | Contents | Purpose | Data Type Ref | Event Size (MB) |
| DAQ-RAW | Detector data from front end electronics + L1 trigger result. | Primary record of physics event. Input to online HLT. | | 1-1.5 |
| RAW | Detector data after online formatting, the L1 trigger result, the result of the HLT selections (HLT trigger bits), potentially some of the higher level quantities calculated during HLT processing. | Input to Tier-0 reconstruction. Primary archive of events at CERN. | | 1.5 |
| RECO | Reconstructed objects (tracks, vertices, jets, electrons, muons, etc.) and reconstructed hits/clusters. | Output of Tier-0 reconstruction and subsequent rereconstruction passes. Supports re-finding of tracks, etc. | RECO & AOD | 0.25 |
| AOD | Subset of RECO. Reconstructed objects (tracks, vertices, jets, electrons, muons, etc.). Possible small quantities of very localised hit information. | Physics analysis, limited refitting of tracks and clusters. | RECO & AOD | 0.05 |
| TAG | Run/event number, high-level physics objects, e.g. used to index events. | Rapid identification of events for further study (event directory). | | 0.01 |
| FEVT | Full Event: term used to refer to RAW+RECO together (not a distinct format). | | multiple | 1.75 |
| GEN | Generated Monte Carlo event. | - | | - |
| SIM | Energy depositions of MC particles in detector (sim hits). | - | | - |
| DIGI | Sim hits converted into detector response. Basically the same as the RAW output of the detector. | - | | 1.5 |
The Data Type Ref column entries point to the CMSSW Reference Manual, which is not complete.

Data Tiers: Reconstructed (RECO) Data and Analysis Object Data (AOD)

RECO data contains objects from all stages of reconstruction. AOD are derived from the RECO information to provide data for physics analyses in a convenient, compact format. Typically, physics analyses don't require you to rerun the reconstruction process on the data. Most physics analyses can run on AOD data.

(Figure whats_in_aod_reco.gif: overview of what is contained in the RECO and AOD data tiers.)

RECO

RECO is the name of the data-tier which contains objects created by the event reconstruction program. It is derived from RAW data and provides access to reconstructed physics objects for physics analysis in a convenient format. Event reconstruction is structured in several hierarchical steps:

  1. Detector-specific processing: Starting from detector data unpacking and decoding, detector calibration constants are applied and cluster or hit objects are reconstructed.
  2. Tracking: Hits in the silicon and muon detectors are used to reconstruct global tracks. Pattern recognition in the tracker is the most CPU-intensive task.
  3. Vertexing: Reconstructs primary and secondary vertex candidates.
  4. Particle identification: Produces the objects most associated with physics analyses. Using a wide variety of sophisticated algorithms, standard physics object candidates are created (electrons, photons, muons, missing transverse energy and jets; heavy-quarks, tau decay).

The normal completion of the reconstruction task will result in a full set of these reconstructed objects usable by CMS physicists in their analyses. You would only need to rerun these algorithms if your analysis requires you to take account of such things as trial calibrations, novel algorithms etc.

Reconstruction is expensive in terms of CPU and is dominated by tracking. The RECO data-tier will provide compact information for analysis to avoid the necessity to access the RAW data for most analysis. Following the hierarchy of event reconstruction, RECO will contain objects from all stages of reconstruction. At the lowest level it will be reconstructed hits, clusters and segments. Based on these objects reconstructed tracks and vertices are stored. At the highest level reconstructed jets, muons, electrons, b-jets, etc. are stored. A direct reference from high-level objects to low-level objects will be possible, to avoid duplication of information. In addition the RECO format will preserve links to the RAW information.

The RECO data includes quantities required for typical analysis usage patterns such as: track re-finding, calorimeter reclustering, and jet energy calibration. The RECO event content is documented in the Offline Guide at RECO Data Format Table.

AOD

AOD are derived from the RECO information to provide data for physics analysis in a convenient, compact format. AOD data are usable directly by physics analyses. AOD data will be produced by the same, or subsequent, processing steps as produce the RECO data; and AOD data will be made easily available at multiple sites to CMS members. The AOD will contain enough information about the event to support all the typical usage patterns of a physics analysis. Thus, it will contain a copy of all the high-level physics objects (such as muons, electrons, taus, etc.), plus a summary of the RECO information sufficient to support typical analysis actions such as track refitting with improved alignment or kinematic constraints, re-evaluation of energy and/or position of ECAL clusters based on analysis-specific corrections. The AOD, because of the limited size that will not allow it to contain all the hits, will typically not support the application of novel pattern recognition techniques, nor the application of new calibration constants, which would typically require the use of RECO or RAW information.

The AOD data tier will contain physics objects: tracks with associated Hits, calorimetric clusters with associated Hits, vertices, jets and high-level physics objects (electrons, muons, Z boson candidates, and so on).

Because the AOD data tier is relatively compact, all Tier-1 computing centres are able to keep a full copy of the AOD, while they will hold only a subset of the RAW and RECO data tiers. The AOD event content is documented in the Offline Guide at AOD Data Format Table.

Reference Documentation for RECO and AOD Data Format Packages

The reference documentation is provided in the Offline Guide sections: These links provide a list of all packages related to the RECO and AOD data formats within the CMSSW repository. Links there point to the class documentation of each data object. Short instructions on how to access the objects are given.

More about the different data inside an event can be found in the "ADVANCED TOPICS" part of the WorkBook where the objects and their creation are explained in detail. In the "ESSENTIALS" part, the section about Particle Candidates gives an overview of the candidate model.

 -- PetarMaksimovic - 17 Mar 2009
Added:
>
>

Revision 1 - 2009-03-17 - PetarMaksimovic

Line: 1 to 1
Added:
>
>

Analysis Overview: an Introduction

The intention of this chapter is to present a big-picture overview of performing an analysis on data at CMS. The objective is to run on the data, so

  • The first task is to describe how the data flows within CMS, from data taking through various layers of skimming. This also introduces a concept of a data tier (RECO, AOD, PAT) and defines all of them.
  • Next, the ways how to get the data are illustrated.
  • Once we lay our hands on the data, we need to know how to access the data; there are basically three options:
    • the full Framework -- using C++ modules in cmsRun
    • FW Lite -- using ROOT enhanced with libraries that can understand CMS data formats and aid in fetching object collections from the event
    • bare ROOT -- accessing CMS events this way is still possible, but it is not recommended; however it is sufficient for TNtuples and TTrees
  • In the case of using either the full Framework or FW Lite, we need to understand various CMS data formats, so they are described next.

-- PetarMaksimovic - 17 Mar 2009

 