5.4 Locating Data Samples

Goals of this page:

This page describes how to find collision data and Monte Carlo (MC) samples. In particular, you will learn:

  • How to use the Data Aggregation System (DAS) to locate your samples.
  • How to transfer a few files to your desktop/laptop so you can test your executable interactively.
  • Where to find information on what collision data and MC samples exist, and are most recent.

How to find samples with DAS Interface

All published samples, whether official or unofficial, are searchable through the DAS web interface. To open it, you need a valid grid certificate installed in your browser.

Open the DAS web page.

As you will see, the DAS interface is quite simple:

[Figure: screenshot of the DAS web interface (DAS_Interface_v2.png)]

To perform your search, you need to know the DAS query language. For an explanation and examples, refer to either the DAS FAQ page or the DAS documentation guide. Essentially, the query for a specific dataset should be of the form:

dataset=/PrimaryDataset/ProcessedDataset/DataTier

For example, to find Z->ee samples your query will look like:

dataset=/*Zee*/*/*

To list the files of a known dataset, use:

file dataset=/PrimaryDataset/ProcessedDataset/DataTier
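
As a minimal sketch, the query forms above can be assembled with a small Python helper. The function names `dataset_query` and `file_query` are hypothetical and introduced here for illustration only; they just build query strings and do not contact DAS.

```python
def dataset_query(primary="*", processed="*", tier="*"):
    """Return a DAS dataset query; each part may contain '*' wildcards."""
    return "dataset=/{}/{}/{}".format(primary, processed, tier)


def file_query(dataset):
    """Return a DAS query listing the files of one dataset path."""
    # Tolerate a trailing slash, then require exactly three components.
    dataset = dataset.rstrip("/")
    if not dataset.startswith("/") or dataset.count("/") != 3:
        raise ValueError("expected /PrimaryDataset/ProcessedDataset/DataTier")
    return "file dataset={}".format(dataset)


print(dataset_query(primary="*Zee*"))  # dataset=/*Zee*/*/*
```

The resulting string can be pasted into the DAS web interface or passed to the command line client.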

Command line interface for DAS (das client)

You can copy/paste or download the DAS client script from the DAS web page using the "CLI" link in the upper menu. The DAS CLI is a simple Python script (it requires Python 2.7 or above) and is straightforward to use. Before using it, create a proxy with the command voms-proxy-init -voms cms -rfc. Assuming you saved the script as das_client.py:

python das_client.py --help

Usage: das_client.py [options]
For more help please visit https://cmsweb.cern.ch/das/faq

Options:
  -h, --help            show this help message and exit
  -v VERBOSE, --verbose=VERBOSE
                        verbose output
  --query=QUERY         specify query for your request
  --host=HOST           host name of DAS cache server, default is
                        https://cmsweb.cern.ch
  --idx=IDX             start index for returned result set, aka pagination,
                        use w/ limit (default is 0)
  --limit=LIMIT         number of returned results (default is 10), use
                        --limit=0 to show all results
  --format=FORMAT       specify return data format (json or plain), default
                        plain.
  --threshold=THRESHOLD
                        query waiting threshold in sec, default is 5 minutes
  --key=CKEY            specify private key file name, default
                        $X509_USER_PROXY
  --cert=CERT           specify private certificate file name, default
                        $X509_USER_PROXY
  --capath=CAPATH       specify CA path, default currently is /etc/grid-
                        security/certificates
  --retry=RETRY         specify number of retries upon busy DAS server message
  --das-headers         show DAS headers in JSON format (obsolete, keep for
                        backward compatibility)
  --base=BASE           specify power base for size_format, default is 10 (can
                        be 2)
  --cache=CACHE         a file which contains a cached json dictionary for
                        query -> files mapping
  --query-cache=QCACHE  a query cache value
  --list-attributes=KEYS_ATTRS
                        List DAS key/attributes, use "all" or specific DAS key
                        value, e.g. site

Thus, to run your query, type (using the same example query as above):

python das_client.py --query="dataset=/*Zee*/*/*"

To specify the DBS instance (selected in the web interface through the drop-down menu), include it in the query, for example:

python das_client.py --query="dataset=/*Zee*/*/* instance=prod/phys03"
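
The command lines above can also be assembled from Python, for instance when several queries are scripted. The following is a sketch only: the helper names `das_command` and `run_das`, and the assumption that `das_client.py` sits in the current directory, are illustrative. Only options listed in the help output above (`--query`, `--limit`) are used, and a valid proxy is still needed before anything is actually run.

```python
import subprocess


def das_command(query, instance=None, limit=10, script="das_client.py"):
    """Build the argument list for one das_client.py call (nothing is run)."""
    if instance:
        # The DBS instance is appended to the query text itself.
        query = "{} instance={}".format(query, instance)
    return ["python", script, "--query={}".format(query),
            "--limit={}".format(limit)]


def run_das(query, **kwargs):
    """Run the query and return the non-empty output lines.

    Assumes a valid proxy exists and das_client.py is at the given path.
    """
    out = subprocess.check_output(das_command(query, **kwargs))
    return [line for line in out.decode().splitlines() if line.strip()]


print(das_command("dataset=/*Zee*/*/*", instance="prod/phys03"))
```

Passing the arguments as a list, rather than as one shell string, keeps the shell from expanding the `*` wildcards in the query.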

Suggestion: You can store the das_client.py script somewhere in your home directory and define an alias for it. For example, on lxplus (tcsh syntax):

alias dasCLI 'python /afs/cern.ch/user/x/xyz/tools/das_client.py'

and then run from anywhere as:

dasCLI --query="dataset=/*Zee*/*/* instance=prod/phys03" 

Using DBS python client

DAS is a tool that aggregates data from several sources: DBS, PhEDEx, ReqMgr, SiteDB, etc. However, not all details of the information stored in those databases are available through DAS, nor is a DAS query as efficient as asking one of those services directly. DAS should therefore be the first choice when looking for dataset information, but advanced users who find its level of detail or its performance inadequate for their needs can query DBS directly via its Python client API. Instructions, examples and guidelines are in this twiki.

Accessing remote samples for interactive testing

The ability to access remote files (i.e. located at some Tier2) of various samples is essential to users for interactive testing and debugging. A remote file can be either copied to a local space (e.g. desktop/laptop) or directly opened inside cmsRun, using the Xrootd Service. Please refer to the dedicated chapter in this workbook: Using Xrootd Service (AAA) for Remote Data Access.
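
As a rough illustration of what opening a file through Xrootd involves, the sketch below turns a logical file name into a root:// URL. The helper name `xrootd_url`, the default redirector host and the example file name are assumptions not taken from this page; the dedicated Xrootd chapter remains the reference for the actual service endpoints.

```python
def xrootd_url(lfn, redirector="cms-xrd-global.cern.ch"):
    """Turn a logical file name (/store/...) into a root:// URL.

    The default redirector host is an assumption, not taken from this page.
    """
    if not lfn.startswith("/store/"):
        raise ValueError("expected a logical file name starting with /store/")
    # The double slash after the host is intentional: host + '/' + absolute path.
    return "root://{}/{}".format(redirector, lfn)


# Hypothetical file name, for illustration only.
print(xrootd_url("/store/example/file.root"))
```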

Finding existing MC samples for various physics processes

A list of MC samples requested in the latest production campaign can be found at the MC co-ordination twiki. The collision data (MINI)AOD from the last year and all (MINI)AODSIM samples from the past couple of production campaigns will always be available at some disk site. If a sample is popular in CRAB, the Dynamic Data Management (DDM) team at CMS distributes replicas of such hot samples via an automatic procedure, which also removes extra copies of unused datasets.

As an example, to search for samples from the RunII Summer16 MINIAODSIM production campaign, one can use a wildcard query like:

dataset=/*/RunIISummer16DR80X*/MINIAODSIM
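
To see how such a wildcard pattern selects datasets, the sketch below applies the same pattern locally to a list of names, comparing the three path components one by one. The helper `matches` and the dataset names are hypothetical, and the matching done by DAS itself may differ in detail (Python's `fnmatchcase` also understands `?` and `[...]`).

```python
from fnmatch import fnmatchcase


def matches(dataset, pattern):
    """Check a /Primary/Processed/Tier path against a wildcard pattern,
    comparing the three components one by one."""
    d_parts = dataset.strip("/").split("/")
    p_parts = pattern.strip("/").split("/")
    if len(d_parts) != 3 or len(p_parts) != 3:
        return False
    return all(fnmatchcase(d, p) for d, p in zip(d_parts, p_parts))


# Hypothetical dataset names, for illustration only.
names = [
    "/SampleA/RunIISummer16DR80X-v1/MINIAODSIM",
    "/SampleB/RunIISummer16DR80X-v1/AODSIM",
    "/SampleC/OtherCampaign-v1/MINIAODSIM",
]
pattern = "/*/RunIISummer16DR80X*/MINIAODSIM"
print([n for n in names if matches(n, pattern)])
```

Only the first name is kept: the second has a different data tier and the third a different processed dataset name.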

Availability of Samples

Recent collision data and MC samples are always on disk at some site. If a sample is not found on disk, or has already been archived to tape, you can file a ticket at the JIRA link.

Release Validation (RelVal) samples

As new releases are integrated and readied for large-scale MC production or data reprocessing, CMS goes through a process referred to as "Release Validation" (RelVal). As part of that, the Data Operations team makes a variety of samples with that release at small scale. These RelVal samples are often your best opportunity to develop analysis code for a new release, as they are the first to appear.

  • There are RelVal samples for all major releases.
  • These samples have been produced to validate CMSSW pre-releases and releases and the production workflow.
  • In general you should run on these with the release with which they were produced (in particular for K_L_M_preX releases).
  • One can easily find RelVal samples using the DAS interface. For example, to find ttbar samples your query will look like:

dataset=/*RelValTTbar*/*/*


Review status

Reviewer/Editor and Date (copy from screen) Comments
NitishDhingra - 2017-08-30 Revision with updated information on DAS interface, CLI, RelVal samples. Some modifications in subsection structure.
StefanoBelforte - 2015-08-19 point to DDM for data distribution
JohnStupak - 4-June-2013 Minor revisions
NitishDhingra - 01-Apr-2012 See detailed comments below
StefanoBelforte - 29-Jan-2010 Complete Expert Review, minor changes
FrankWuerthwein - 06-Dec-2009 complete reorganization
SudhirMalik - 4 Nov 2009 updated examples to CMSSW_3_3_1, updated DBS snapshots
KatiLassilaPerini - 28 Feb 2008 removed the LPC samples
CMSUserSupport - 05 Sep 2007 added CSA07 samples from Filip Moortgat's presentation August07 Physics Days
AlessandraFanfani - 21 Jun 2007 updated DM concepts and Data discovery examples
KatiLassilaPerini - 17 Apr 2007 updated DBS description, added a simple example search
JennyWilliams - 15 Sep 2006 Slight editing, added some comments
AnneHeavey - 30 Aug 2006 Created new page; info from Peter Elmer

Complete review. Some broken links have been fixed. The page provides complete information on finding data samples with DAS for physics analysis.

Responsible: StefanoBelforte
Last reviewed by: StefanoBelforte -

-- FrankWuerthwein - 04-Dec-2009
