Datasets and Data Preparation Exercise


This exercise page is an updated version of the original exercises created by Giovanni Franzoni, big thanks to him!


In this set of exercises you will learn how to look for datasets and find out their key properties relevant for analysis: how to navigate their parent-child relationships and determine the software release, production configuration and alignment-calibration conditions they have been produced with. You will be exposed to the principal services providing status and details of datasets: DAS, brilcalc, McM, pMp, cmsDBbrowser. You will also find out how to compute the integrated luminosity for your analysis.

Introduction to datasets

(To do: summarise here the content of the slides and add a pointer to them.)

Getting ready for these exercises

A bit of preparation will ease following this tutorial. Below are a few concrete actions you can take before starting the hands-on session.


In order to carry out the exercises of this session you need:

  • a CMS account at CERN to access web services (you can access this twiki if you have one) and to log into lxplus

Color coding conventions

The following color scheme is used throughout these exercises:

  • Commands to be typed into a shell will be embedded in a grey box:
    ls -ltr
  • Other code, scripts and queries will be displayed in a pink box:
     const TString var = "ST";
  • Output and screen printouts will be embedded in a green box:
     Number of events with ST > 2000 and N>= 3 is 10099. Acceptance = 0.842075

It is expected that you can cut-and-paste from the command box into the command line. Similarly, you should be able to cut-and-paste from the configuration fragments directly into a text editor, when necessary.

Setup a CMSSW area

Most of the exercises will be carried out using your web browser.

A CMSSW work area will also be needed; please create a working directory in your lxplus home:

ssh -XY
mkdir data-preparation-exercises/
cd data-preparation-exercises/
cmsrel CMSSW_8_0_21
cd CMSSW_8_0_21/src

Now you are ready to proceed to the exercises.

Note: CMSSW_8_0_21 is the release which contains the changes for the L1 software, desired for the Moriond17 Monte Carlo (Digi-Reco) production.

Exercise 1: find an accessible file with single electron events in MINIAOD format from the latest data reprocessing of 2016D

Reminder about the general structure of a dataset name:

 dataset = /PrimaryDataset/ProcessingVersion/DataTier


dataset = /SingleElectron/Run2016D-23Sep2016-v1/MINIAOD

To start with, you need to establish the full dataset name "with single electron". Either you know it from a reference, or you need to construct its key elements and put them together in a search. With the key elements at hand, you'll be able to use the DAS web interface and queries with wildcards in order to establish or confirm the complete dataset name.

  • The PrimaryDataset string for real data (i.e. collected at CMS) is, strictly speaking, specified in the HLT configuration accessible via the HLT browser; however, in most cases you can make a reasonable guess to find what you need:
  • SingleElectron.

  • the latest reprocessing of the 2016 data can be found by looking at the PdmVDataReprocessing (data reprocessing campaigns documentation) twiki: the version of the reprocessing is indicated by a date
  • 23Sep2016
  • the acquisition era is part of the ProcessingVersion and indicates the portion of the 2016 run when the data were collected
  • 2016D
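Putting the key elements together is plain string assembly; a minimal sketch in Python (the variable names are illustrative, not part of any CMS tool) that builds the wildcarded DAS query used below:

```python
# Key elements gathered above (values for this exercise)
primary_dataset = "SingleElectron"
acquisition_era = "Run2016D"
reprocessing    = "23Sep2016"
data_tier       = "MINIAOD"

# Assemble a wildcarded dataset pattern for a DAS query;
# the wildcards absorb the naming details we don't remember
pattern = "/{pd}*/*{era}*{reproc}*/{tier}".format(
    pd=primary_dataset, era=acquisition_era,
    reproc=reprocessing, tier=data_tier)
query = "dataset dataset=" + pattern
print(query)  # → dataset dataset=/SingleElectron*/*Run2016D*23Sep2016*/MINIAOD
```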


You can now place a query to DAS to find the dataset; using wildcards increases the chances of finding at the first try what you're looking for, with no need to remember the details of the naming conventions. Of course, you ought to use wildcards with a pinch of salt, so as not to be flooded with too many results matching your query.

dataset dataset=/SingleElectron*/*Run2016D*23Sep2016*/MINIAOD 

  • → check the Sites
  • ex1-Sites-in-das.png
  • note that some sites are not accessible to the users (e.g. : tape storage)
  • ex1-list-of-sites.png

Place your final query to DAS looking for the file you want at a site where the dataset has a presence (note: the site where the dataset is present might change with time, thus the site chosen in the detailed query which follows might need to be updated):

file dataset=/SingleElectron/Run2016D-23Sep2016-v1/MINIAOD site=T1_US_FNAL_Buffer
Does the query return a file? If not, why?

You can now run on one of the files and find out its basic properties, exploiting the fact that xrootd will serve the file you've chosen from the CMS site where it's available on disk to your cmsRun process:

 edmFileUtil   --eventsInLumis  -P   root:// 
 edmFileUtil   --eventsInLumis  -P   /store/data/Run2016D/SingleElectron/MINIAOD/23Sep2016-v1/70000/04E8F72C-AF89-E611-9D2F-FA163E1D7951.root (*update with the actual file you've found*) 

Exercise 2: compute the integrated luminosity collected by CMS in run 276775

The data collected by CMS are certified on a luminosity-section basis to determine which data are of good quality to be included in physics analyses. The data certification is carried out taking into account both the operational health of the sub-detectors and the scrutiny of the reconstructed physics objects by DPG and POG experts. The outcome of the certification process, as more data get collected and for each new version of the data processing, is regularly updated by the DQM-DataCertification team with reports at the PPD General Meeting and by means of json files, also available in this certification repository:

ls -ltrFh /afs/

The json files from the certification are used to restrict the events to be included in analysis, typically setting the lumiMask in the crab configuration.
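In a CRAB3 configuration this looks like the fragment below (the json filename is a placeholder for the certification file you download from the repository above):

```python
# Fragment of a CRAB3 configuration file (crabConfig.py)
from CRABClient.UserUtilities import config
config = config()

# ... other Data settings ...
# Restrict processing to the certified luminosity sections;
# the filename here is a placeholder, use the actual certification json
config.Data.lumiMask = 'Cert_Run2016_GoodLumis_JSON.txt'
```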

You can see the run and luminosity section structure by opening one of the files:

cat  /afs/

Only successfully processed luminosity sections should be used to compute the integrated luminosity of your analysis: that's typically achieved by asking for the crab report, which is also in json format, and provides a summary file of the runs and luminosity sections processed by completed jobs. Here, for simplicity, we'll use directly the certification file for the luminosity calculation, assuming all processing jobs for run 276775 have been successful.
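As a sketch of what such a json contains, the snippet below builds a toy lumi mask for run 276775 (the luminosity-section ranges are invented for illustration; a real certification file has the same structure) and counts the certified luminosity sections:

```python
import json

# Toy certification json: run number -> list of [first_ls, last_ls] ranges
# (the ranges here are invented; a real file has the same structure)
lumi_mask = {"276775": [[1, 400], [403, 1167]]}

def count_good_ls(mask, run):
    """Number of certified luminosity sections for a given run."""
    return sum(last - first + 1 for first, last in mask.get(str(run), []))

print(json.dumps(lumi_mask))
print(count_good_ls(lumi_mask, 276775))  # → 1165
```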

The luminosity information can be accessed via the BRIL Work Suite, which needs a simple installation procedure:

*bash* : export PATH=$HOME/.local/bin:/afs/$PATH
*tcsh* : setenv PATH $HOME/.local/bin:/afs/$PATH

pip install --install-option="--prefix=$HOME/.local" brilws

(after the installation, set the PATH again)
*bash* : export PATH=$HOME/.local/bin:/afs/$PATH
*tcsh* : setenv PATH $HOME/.local/bin:/afs/$PATH

The integrated luminosity as measured during the data taking (Norm tag: onlineresult), delivered and recorded, is provided for the luminosity sections specified in the json, limited to run 276775:

brilcalc lumi --help
brilcalc lumi -b 'STABLE BEAMS' -r 276775 -i /afs/ [--byls]

#Data tag : v1 , Norm tag: onlineresult
| run:fill    | time              | nls  | ncms | delivered(/ub) | recorded(/ub) |
| 276775:5093 | 07/12/16 21:26:20 | 1165 | 1165 | 222078295.493  | 210692911.427 |
| nfill | nrun | nls  | ncms | totdelivered(/ub) | totrecorded(/ub) |
| 1     | 1    | 1165 | 1165 | 222078295.493     | 210692911.427    |
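brilcalc reports luminosity in /ub by default (it also accepts a unit option, e.g. `-u /pb`); converting the recorded value above to /pb and /fb is just a factor of 10^6 and 10^9 respectively:

```python
# Recorded luminosity for run 276775 from the brilcalc output, in /ub
recorded_ub = 210692911.427

recorded_pb = recorded_ub / 1e6   # 1 /pb = 1e6 /ub
recorded_fb = recorded_ub / 1e9   # 1 /fb = 1e9 /ub
print(round(recorded_pb, 3), round(recorded_fb, 3))  # → 210.693 0.211
```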

You can verify that you get the same output if you construct yourself a json file limited to run 276775 and process it without run restrictions:

cd CMSSW_8_0_21/src 
cmsenv
filterJSON.py --min=276775 --max=276775 /afs/  | tee 276775.txt
cat 276775.txt
brilcalc lumi -b 'STABLE BEAMS'    -i    276775.txt
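What the filtering step does can be sketched in plain Python (the full mask below is a toy example; the logic, keeping only the runs in a given range, is what the filterJSON.py utility shipped with CMSSW provides):

```python
import json

def filter_json(mask, run_min, run_max):
    """Keep only the runs in [run_min, run_max] from a certification json."""
    return {run: ls for run, ls in mask.items()
            if run_min <= int(run) <= run_max}

# Toy certification json with three runs (invented lumi-section ranges)
full_mask = {"276774": [[1, 50]], "276775": [[1, 1165]], "276776": [[1, 30]]}
single_run = filter_json(full_mask, 276775, 276775)
print(json.dumps(single_run))  # → {"276775": [[1, 1165]]}
```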

Exercise 3: for a given Monte Carlo AODSIM sample, find: the global tag, the digi-reco configuration, the production history/advancement and all the pile up scenarios available for it

The sample we start from is the neutrino gun overlaid with the pile up which matches the profile of instantaneous luminosity of the 2015 data taking:

dataset dataset=/SingleNeutrino/RunIIFall15DR76-PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1/AODSIM 

  • The global tag is fully specified in the ProcessingVersion, following the campaign name (RunIIFall15DR76) and the pile up scenario, and preceding the processing string (here absent) and the dataset version (v1)
  • 76X_mcRun2_asymptotic_v12

  • ex1-Sites-in-das.png
  • You find multiple files output-config-*, one for each processing step: digitization, reconstruction, miniaod/PAT
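Following the naming convention described above, the ProcessingVersion of this dataset can be decomposed as in the sketch below (splitting on the first underscore works for this example because the global tag starts at the release cycle, 76X; a special processing string would add another field):

```python
dataset = "/SingleNeutrino/RunIIFall15DR76-PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1/AODSIM"

# General structure: /PrimaryDataset/ProcessingVersion/DataTier
_, primary, processing, tier = dataset.split("/")
campaign, middle, version = processing.split("-")
# Here the middle field is "<pile up scenario>_<global tag>"; the global
# tag starts at the release cycle (76X), i.e. after the first underscore
pileup, global_tag = middle.split("_", 1)

print(campaign)    # → RunIIFall15DR76
print(pileup)      # → PU25nsData2015v1
print(global_tag)  # → 76X_mcRun2_asymptotic_v12
print(version)     # → v1
```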

Any Monte Carlo sample is associated to a prepID, a unique identifier of the production request which has produced it. prepIDs are strings like HCA-RunIIFall15DR76-00002, formed by the physics group which placed the production request, the production campaign and an integer number.

prepIDs are used at the Monte Carlo Management Meeting, where production requests are notified and prioritized, and by the computing operation teams; they are the identifiers used in the two key web based platforms: Monte Carlo Management (McM) and the production Monitoring platform (pMp).

  • Click on the Search tab, enter HCA-RunIIFall15DR76-00002 in the prepId field and click Search ex2-mcm-search-by-prepid.png

  • Each column shows different elements of the request. You can view more using "Select View" ex2-mcm-onerequest-user.png

  • Click the tick icon in the column Actions
  • ex2-mcm-onerequest-user-test.png
  • The sequence of commands can be executed to run all the steps of the digi-reco processing over a few events

The production Monitoring platform (pMp) is a service available to CMS members to monitor the progress status of single Monte Carlo production requests, full campaigns, and groups of requests (defined by physics working group, processing configuration, priority, etc.). It can be accessed directly or linked from Monte Carlo Management (McM).

  • Click the movie-film icon in the column Actions (see screen-shot of the previous bullet) to get the status of the request: events for a given production request are split across the different statuses they traverse in production: new, approved, submitted, done
  • ex2-pmp-present-growing.png
  • Click the video-camera icon in the column Actions to get the historical development of the events produced in this request
  • ex2-pmp-history.png

Most datasets in any campaign are produced with a single pile up scenario: the one typical of the campaign. In 2015, because of the transition from 50 to 25 ns data taking, the main production campaign had 2 pile up scenarios for a large set of requests. Some datasets, however, get processed with multiple pile up configurations to support specific studies. Different versions of the pile up in the DIGI-RECO step all process the same parent GEN-SIM: this is the simplest way of finding out if there's more than one pile up version for a physics dataset.

  • ex2-found-parent.png
  • Click on the Children link in the das presentation of the dataset
  • ex2-found-children.png
  • Multiple children are found, spanning 6 pile up scenarios
  • Note that:
    • the ProcessingVersion contains also the optional strings indicating special processing configuration (related to ECAL zero-suppression settings)
    • the datasets with special processing configuration also have a dedicated data tier

Exercise 4: Simulate your private sample

-- PhatSrimanobhas - 2016-11-11
