5.10 Data Organization Explained


Goals of this page:

This page gives an overview of the terms used in CMS Data Management, so that you can appreciate how data are organized. It is background information only.


Dataset Bookkeeping System (DBS): “Which data exist?”

The Dataset Bookkeeping System (DBS) provides the means to define, discover and use CMS event data. The main features that DBS provides are:

  • Data Description: keeps the dataset definition along with attributes that characterise the dataset, such as the application that produced the data and the type of content resulting from the degree of processing applied (RAW, RECO, etc.). The DBS also provides information on the “provenance” of the data it describes.
  • Data Discovery: stores information about real and simulated CMS data in a queryable format. The supported queries let users discover which data are available and how they are organized logically in terms of packaging units (files and file blocks).

The DBS answers the question “Which data exist?” The easiest way for a user to query this information is via the Data Aggregation Service (DAS), as described in the chapter Locating Data Samples.

Data Location Service (DLS): "Where is the data?"

The Data Location Service (DLS) provides the means to locate replicas of data in the distributed computing system. The DLS provides the names of the Storage Elements of the sites hosting the data. It answers the question “Where is the data?”

The Event Data Model (EDM) in CMSSW is based on simple files. In data management you will see two terms used when discussing files:

Logical File Name (LFN)

  • This is a site-independent name for a file.
  • It contains neither the protocol used to read the file nor any site-specific information about where the file is located.
  • Prefer LFNs for all production files: a site can then change the specifics of access and location without breaking your config file.
  • A production LFN in general begins with /store and looks like this in a cmsRun cfg file:
process.source = cms.Source("PoolSource",
    fileNames = cms.untracked.vstring(
        # one or more LFNs, each beginning with /store
    )
)

Physical File Name (PFN)

  • This is a site-dependent name for a file.
  • It gives local access to a file at a site. (Note that reading files at remote sites by specifying the protocol in the PFN does not work.)
  • The cmsRun application automatically converts production LFNs into the appropriate PFNs for the site where you are running, so you do not need to know the PFN yourself.
  • If you really want to know the PFN: the algorithm that converts an LFN to a PFN is site-dependent and is defined in the site's so-called TrivialFileCatalog (the TrivialFileCatalog files of the various sites are in CVS under COMP/SITECONF/SiteName/PhEDEx/storage.xml).

The edmFileUtil utility in your CMSSW environment can be used to get the PFN for a given LFN:

cd work/CMSSW_5_3_5/src/
edmFileUtil -d /store/mc/SAM/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0013/CE4D66EB-5AAE-E111-96D6-003048D37524.root
This prints the PFN corresponding to the given LFN.

For example, for local access to data at CERN the algorithm is:

  PFN = root://eoscms//eos/cms/ + LFN

and the cmsRun cfg file looks like:

process.source = cms.Source("PoolSource",
    fileNames = cms.untracked.vstring(
        # one or more PFNs built with the rule above
    )
)
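
The CERN rule above is a plain string concatenation, which can be sketched in a few lines of Python. This is an illustration only: the helper name lfn_to_pfn_cern is hypothetical and not part of CMSSW, and the sketch assumes the prefix and the LFN are joined with a single slash. In real jobs, let cmsRun or edmFileUtil do the conversion through the site's TrivialFileCatalog.

```python
# Prefix taken from the CERN rule quoted above (PFN = root://eoscms//eos/cms/ + LFN).
# Assumption: the LFN's own leading slash acts as the separator, so the
# prefix is stored here without its trailing slash.
CERN_PREFIX = "root://eoscms//eos/cms"


def lfn_to_pfn_cern(lfn):
    """Illustrative helper (not a CMSSW function): apply the CERN prefix rule."""
    if not lfn.startswith("/store/"):
        raise ValueError("expected a production LFN beginning with /store/: %r" % lfn)
    return CERN_PREFIX + lfn


# The LFN used in the edmFileUtil example above.
example_lfn = ("/store/mc/SAM/GenericTTbar/GEN-SIM-RECO/"
               "CMSSW_5_3_1_START53_V5-v1/0013/"
               "CE4D66EB-5AAE-E111-96D6-003048D37524.root")
example_pfn = lfn_to_pfn_cern(example_lfn)
print(example_pfn)
```

Because the rule lives in one place, a change of prefix at the site would only touch CERN_PREFIX, which is the same reason LFNs are preferred in config files.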

File Blocks

  • Files are grouped together into file blocks.
  • A file block is the minimum quantum of data that is replicated between sites.
  • Any given file block may be at one or more sites.
  • File blocks are grouped into datasets.
  • A dataset is a set of file blocks corresponding to a single sample and produced with a single cfg file.
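
The hierarchy of files, file blocks and datasets can be pictured as a small data model. The sketch below is only a mental model under stated assumptions: the class names, the block names and the site names (T2_A, T2_B) are all hypothetical, and none of this is the DBS or DLS API.

```python
from dataclasses import dataclass, field


@dataclass
class FileBlock:
    name: str
    lfns: list = field(default_factory=list)  # files grouped in this block
    sites: set = field(default_factory=set)   # sites holding a replica of the block


@dataclass
class Dataset:
    path: str
    blocks: list = field(default_factory=list)

    def files(self):
        """'Which data exist?': every LFN in the dataset, regardless of location."""
        return [lfn for block in self.blocks for lfn in block.lfns]

    def sites_with_full_copy(self):
        """'Where is the data?': sites that host every block of the dataset."""
        if not self.blocks:
            return set()
        return set.intersection(*(block.sites for block in self.blocks))


# Hypothetical content: two blocks, replicated independently of each other.
ds = Dataset(
    path="/ExamplePrimary/ExampleProcessed-v1/RECO",
    blocks=[
        FileBlock("block-1", ["/store/example/a.root", "/store/example/b.root"],
                  {"T2_A", "T2_B"}),
        FileBlock("block-2", ["/store/example/c.root"], {"T2_A"}),
    ],
)
print(ds.files())
print(ds.sites_with_full_copy())
```

Replication is tracked per block rather than per file, which is why the site sets are attached to FileBlock and not to individual LFNs.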


The DatasetPath is a string that identifies a dataset. It consists of three parts:

  • Primary dataset: a name that describes the physics channel.
  • Processed dataset: a name that describes the kind of processing applied.
  • Data tier: describes the kind of event information stored from each step in the simulation and reconstruction chain. Examples of data tiers include RAW and RECO and, for MC, GEN, SIM and DIGI. A given dataset may consist of multiple data tiers; for example, GEN-SIM-DIGI-RECO covers the generation (MC), simulation (Geant), digitization and reconstruction steps.
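
To make the three parts concrete, here is a minimal Python sketch that splits a DatasetPath. It assumes the parts are joined with slashes in the form /Primary/Processed/DataTier; the text above does not show the layout, so treat that as an assumption. The helper name and the example path are hypothetical.

```python
def split_dataset_path(path):
    """Illustrative helper: split '/Primary/Processed/DataTier' into its parts."""
    parts = path.strip("/").split("/")
    if not path.startswith("/") or len(parts) != 3 or not all(parts):
        raise ValueError("expected /Primary/Processed/DataTier, got %r" % path)
    primary, processed, tier = parts
    # A combined tier such as GEN-SIM-RECO lists several steps joined by hyphens.
    return {"primary": primary, "processed": processed, "tiers": tier.split("-")}


info = split_dataset_path("/ExamplePrimary/ExampleProcessed-v1/GEN-SIM-RECO")
print(info)
```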

Review status

Reviewer/Editor and Date (copy from screen) Comments
StefanoBelforte - 29-Aug-2013 replace reference to DBS/DAS with ref. to Chapter 5.4
JohnStupak - 2-July-2013 Review and update to 5_3_5
NitishDhingra - 07-Apr-2012 See the detailed comments below
FrankWuerthwein - 04-Dec-2009 Complete Reorganization 1st draft ready for review

Complete review. Information regarding deprecation of DBS and migration to DAS has been added. Figures have been added for better understanding.

Last reviewed by: Main.David L Evans - fill in date when done - Responsible: StefanoBelforte

-- FrankWuerthwein - 06-Dec-2009

Topic revision: r13 - 2014-07-02 - TitasRoy


