5.10 Data Organization Explained
Goals of this page:
This page gives an overview of the terms used in Data Management in CMS, so that you can appreciate how
data are organized. It is background information only.
Dataset Bookkeeping System (DBS): “Which data exist?”
The Dataset Bookkeeping System (DBS) provides the means to define, discover and use CMS event data.
The main features that DBS provides are:
- Data Description: keeps the dataset definition along with attributes characterising the dataset, such as the application that produced the data and the type of content resulting from the degree of processing applied (RAW, RECO, etc.). The DBS also provides information regarding the “provenance” of the data it describes.
- Data Discovery: stores information about (real and simulated) CMS data in a queryable format. The supported queries allow users to discover which data are available and how they are organized logically in terms of packaging units (files and file blocks).
Answers the question “Which data exist?”
- The easiest way for a user to query this information is via the Data Aggregation Service (DAS), as described in the chapter Locating Data Samples.
Data Location Service (DLS): "Where is the data?"
The Data Location Service (DLS) provides the means to locate replicas of data in the distributed computing system.
The DLS provides the names of the Storage Elements of the sites hosting the data.
Answers the question “Where is the data?”
The Event Data Model (EDM) in CMSSW is based on simple files.
In data management, two terms are used when discussing files:
Logical File Name (LFN)
- This is a site-independent name for a file.
- It contains neither the protocol used to read the file nor any site-specific information about where the file is located.
- You should preferably use this for all production files, because a site can then change the specifics of access and location without breaking your config file.
- A production LFN in general begins with /store and looks like this in a cmsRun cfg file:
process.source = cms.Source("PoolSource",
    fileNames = cms.untracked.vstring(
        '/store/mc/SAM/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0013/CE4D66EB-5AAE-E111-96D6-003048D37524.root'
    )
)
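The properties listed above (site-independent, no protocol, beginning with /store) can be turned into a quick sanity check. The following is only a minimal sketch: the helper name `looks_like_lfn` is hypothetical and not part of CMSSW, and the rule "starts with /store/ and carries no protocol or query string" is an assumption drawn from the description on this page.

```python
import re

# Assumption: a production LFN is an absolute path under /store with no
# protocol prefix (such as "root://") and no query string.
_PROTOCOL = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def looks_like_lfn(name):
    """Return True if `name` has the shape of a production LFN."""
    if _PROTOCOL.match(name):
        return False  # a protocol prefix makes the name site-specific
    if "?" in name:
        return False  # query strings carry access details
    return name.startswith("/store/")


lfn = ("/store/mc/SAM/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/"
       "0013/CE4D66EB-5AAE-E111-96D6-003048D37524.root")
pfn = "root://eoscms//eos/cms" + lfn + "?svcClass=default"

print(looks_like_lfn(lfn))  # the LFN from the cfg example
print(looks_like_lfn(pfn))  # the same file addressed with a protocol
```

A check like this could be run over the `fileNames` list of a config before submission to catch site-specific names that slipped in.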
Physical File Name (PFN)
- This is a site-dependent name for a file.
- It provides local access to a file at a site. (Note that reading files at remote sites by specifying the protocol in the PFN doesn't work.)
- The cmsRun application automatically converts production LFNs into the appropriate PFNs for the site where you are running, so you don't need to know the PFN yourself.
- If you really want to know the PFN: the algorithm that converts an LFN to a PFN is site-dependent and is defined in the so-called TrivialFileCatalog at the site (the TrivialFileCatalogs of the various sites are in CVS under COMP/SITECONF/SiteName/PhEDEx/storage.xml).
The edmFileUtil utility in your CMSSW environment can be used to get the PFN for a given LFN:
cd work/CMSSW_5_3_5/src/
cmsenv
edmFileUtil -d /store/mc/SAM/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0013/CE4D66EB-5AAE-E111-96D6-003048D37524.root
results in:
root://eoscms//eos/cms/store/mc/SAM/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0013/CE4D66EB-5AAE-E111-96D6-003048D37524.root?svcClass=default
For example, when accessing data locally at CERN the algorithm is:
PFN = root://eoscms//eos/cms + LFN
and the cmsRun cfg file looks like:
process.source = cms.Source("PoolSource",
    fileNames = cms.untracked.vstring(
        'root://eoscms//eos/cms/store/mc/SAM/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0013/CE4D66EB-5AAE-E111-96D6-003048D37524.root?svcClass=default'
    )
)
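The LFN-to-PFN translation described above can be illustrated with a small rule table. This is a simplified sketch, not the actual TrivialFileCatalog implementation: the `RULES` list and the `lfn_to_pfn` function are hypothetical, real catalogs are XML files that differ from site to site, and the `?svcClass=default` suffix shown in the edmFileUtil output is deliberately left out.

```python
import re

# Hypothetical, simplified rule table: (pattern on the LFN, PFN template).
# The single rule mirrors the CERN example on this page; a real site
# catalog would define its own rules.
RULES = [
    (re.compile(r"^/store/(.*)$"), r"root://eoscms//eos/cms/store/\1"),
]


def lfn_to_pfn(lfn, rules=RULES):
    """Apply the first matching rule to turn an LFN into a PFN."""
    for pattern, template in rules:
        match = pattern.match(lfn)
        if match:
            # Substitute the captured part of the LFN into the template.
            return match.expand(template)
    raise ValueError("no rule matches LFN: " + lfn)


lfn = ("/store/mc/SAM/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/"
       "0013/CE4D66EB-5AAE-E111-96D6-003048D37524.root")
print(lfn_to_pfn(lfn))
```

Because only the rule table is site-specific, the same LFN can resolve to different PFNs at different sites without any change to the config file that lists it.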
File Blocks
- Files are grouped together into file blocks.
- A file block is the minimum quantum of data that is replicated between sites.
- A given file block may be hosted at one or more sites.
Dataset
- File blocks are grouped into datasets.
- A dataset is a set of file blocks corresponding to a single sample and produced with a single cfg file.
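The file, file block and dataset hierarchy can be pictured as a small data model. The sketch below is a toy illustration only: the class names, the dataset path, the block names and the site names are all hypothetical, and it does not reflect the actual DBS or DLS schema. It shows why replication at block granularity matters: a site holds the full dataset only if it hosts every block.

```python
from dataclasses import dataclass, field


@dataclass
class FileBlock:
    """A group of files; the unit that is replicated between sites."""
    name: str
    lfns: list
    sites: set = field(default_factory=set)


@dataclass
class Dataset:
    """A group of file blocks belonging to one sample."""
    path: str
    blocks: list = field(default_factory=list)

    def files(self):
        # "Which data exist?": every LFN in every block.
        return [lfn for block in self.blocks for lfn in block.lfns]

    def sites_with_full_copy(self):
        # "Where is the data?": sites hosting all blocks of the dataset.
        if not self.blocks:
            return set()
        return set.intersection(*(block.sites for block in self.blocks))


# Hypothetical names throughout, for illustration only.
dataset = Dataset(
    path="/Sample/Processing-v1/RECO",
    blocks=[
        FileBlock("block1", ["/store/a.root", "/store/b.root"], {"SiteA", "SiteB"}),
        FileBlock("block2", ["/store/c.root"], {"SiteA"}),
    ],
)
print(dataset.files())
print(dataset.sites_with_full_copy())
```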
DatasetPath
The DatasetPath is a string that identifies a dataset. It consists of 3 parts:
/Primarydataset/Processeddataset/DataTier
where:
- Primary dataset: name that describes the physics channel
- Processed dataset: name that describes the kind of processing applied
- Data Tier: describes the kind of event information stored from each step in the simulation and reconstruction chain. Examples of data tiers include RAW and RECO, and for MC, GEN, SIM and DIGI. A given dataset may consist of multiple data tiers, e.g., the term GEN-SIM-DIGI-RECO includes the generation (MC), simulation (Geant), digitization and reconstruction steps.
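The three-part structure of a DatasetPath lends itself to a small parser. This is a minimal sketch under stated assumptions: the function `parse_dataset_path` is hypothetical, the example path is illustrative (assembled from components of the LFN used earlier on this page), and splitting the data tier on "-" simply follows the GEN-SIM-DIGI-RECO convention described above.

```python
from typing import NamedTuple, Tuple


class DatasetPath(NamedTuple):
    primary: str             # physics channel
    processed: str           # kind of processing applied
    tiers: Tuple[str, ...]   # individual data tiers, e.g. ("GEN", "SIM")


def parse_dataset_path(path):
    """Split '/Primary/Processed/Tier' into its three components."""
    parts = path.split("/")
    # A valid path starts with '/' and has exactly three non-empty parts.
    if len(parts) != 4 or parts[0] != "" or not all(parts[1:]):
        raise ValueError("expected /Primarydataset/Processeddataset/DataTier")
    primary, processed, tier = parts[1:]
    return DatasetPath(primary, processed, tuple(tier.split("-")))


# Illustrative path, not taken from a catalog query.
example = parse_dataset_path(
    "/GenericTTbar/SAM-CMSSW_5_3_1_START53_V5-v1/GEN-SIM-RECO"
)
print(example.primary)
print(example.tiers)
```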
Review status
Complete review. Information regarding deprecation of DBS and migration to DAS has been added. Figures have been added for better understanding.
Last reviewed by: David L Evans
Responsible: StefanoBelforte
-- FrankWuerthwein - 06-Dec-2009