5.4 Locating Data Samples
Goals of this page:
This page describes how to find collision data and Monte Carlo (MC) samples. In particular, you will learn:
- How to use the Data Aggregation System (DAS) to locate your samples.
- How to transfer a few files to your desktop/laptop so you can test your executable interactively.
- Where to find information on which collision data and MC samples exist and which are the most recent.
How to find samples with DAS Interface
All published samples, whether official or unofficial, are searchable through the DAS web interface. To open it, you need a valid grid certificate installed in your browser.
Open the DAS web page.
As you will see, the DAS interface is quite simple.
To perform your search, you need to know the DAS query language. For an explanation and examples, refer to either the DAS FAQ page or the DAS documentation guide.
Essentially, the query for a specific dataset has the form:
dataset=/PrimaryDataset/ProcessedDataset/DataTier
For example, to find Z->ee samples your query will look like:
dataset=/*Zee*/*/*
To list the files of a known dataset, use:
file dataset=/PrimaryDataset/ProcessedDataset/DataTier
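The two query forms above can be assembled programmatically. The following is a minimal sketch, not part of DAS itself: the helper names `dataset_query` and `file_query` are hypothetical, and rejecting wildcards in a file query is a choice made here because the text speaks of a known dataset.

```python
def dataset_query(primary="*", processed="*", tier="*"):
    """Build a DAS dataset query from the three parts of a dataset name."""
    return "dataset=/{0}/{1}/{2}".format(primary, processed, tier)


def file_query(dataset):
    """Build a DAS query that lists the files of one known dataset."""
    # Assumption of this sketch: a file lookup needs a full dataset name.
    if "*" in dataset:
        raise ValueError("file queries need a full dataset name, not a pattern")
    return "file dataset={0}".format(dataset)


# The Z->ee example from the text, then a file query for a placeholder name.
print(dataset_query(primary="*Zee*"))
print(file_query("/PrimaryDataset/ProcessedDataset/DataTier"))
```

The resulting strings can be pasted into the DAS web interface search box as they are.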
Command line interface for DAS (das client)
You can copy/paste or download the DAS client script from the DAS web page using the "CLI" link in the upper menu.
The DAS CLI is a simple Python script (it requires Python 2.7 or above) and is straightforward to use. Before using it, create a proxy with the command
voms-proxy-init -voms cms -rfc
Assuming you saved the script as das_client.py, display its options with:
python das_client.py --help
Usage: das_client.py [options]
For more help please visit https://cmsweb.cern.ch/das/faq

Options:
  -h, --help            show this help message and exit
  -v VERBOSE, --verbose=VERBOSE
                        verbose output
  --query=QUERY         specify query for your request
  --host=HOST           host name of DAS cache server, default is
                        https://cmsweb.cern.ch
  --idx=IDX             start index for returned result set, aka pagination,
                        use w/ limit (default is 0)
  --limit=LIMIT         number of returned results (default is 10), use
                        --limit=0 to show all results
  --format=FORMAT       specify return data format (json or plain), default
                        plain.
  --threshold=THRESHOLD
                        query waiting threshold in sec, default is 5 minutes
  --key=CKEY            specify private key file name, default
                        $X509_USER_PROXY
  --cert=CERT           specify private certificate file name, default
                        $X509_USER_PROXY
  --capath=CAPATH       specify CA path, default currently is
                        /etc/grid-security/certificates
  --retry=RETRY         specify number of retries upon busy DAS server message
  --das-headers         show DAS headers in JSON format (obsolete, keep for
                        backward compatibility)
  --base=BASE           specify power base for size_format, default is 10 (can
                        be 2)
  --cache=CACHE         a file which contains a cached json dictionary for
                        query -> files mapping
  --query-cache=QCACHE  a query cache value
  --list-attributes=KEYS_ATTRS
                        List DAS key/attributes, use "all" or specific DAS key
                        value, e.g. site
Thus, if you want to run your query you'll type (we'll use the same query example as shown above):
python das_client.py --query="dataset=/*Zee*/*/*"
To specify the DBS instance (selected with the drop-down menu in the web interface), include it in the query, for example:
python das_client.py --query="dataset=/*Zee*/*/* instance=prod/phys03"
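If you call the client from a script, it can help to build the command line in one place. The sketch below only uses the `--query` and `--limit` options listed in the help output above. The function names `das_command` and `run_das` are hypothetical. It assumes that das_client.py sits in the current directory, that a valid proxy exists, and that the plain output format prints one result per line.

```python
import subprocess


def das_command(query, instance=None, limit=10, script="das_client.py"):
    """Return the argument list for one das_client.py call."""
    if instance:
        # The DBS instance is part of the query string itself.
        query = "{0} instance={1}".format(query, instance)
    return ["python", script,
            "--query={0}".format(query),
            "--limit={0}".format(limit)]


def run_das(query, **kwargs):
    """Run the query and return the non-empty lines of the plain output."""
    # Needs das_client.py and a valid proxy; it is not called in this sketch.
    out = subprocess.check_output(das_command(query, **kwargs))
    return [line.strip() for line in out.decode().splitlines() if line.strip()]


# Show the command for the example above; limit=0 asks for all results.
print(das_command("dataset=/*Zee*/*/*", instance="prod/phys03", limit=0))
```

Passing the arguments as a list avoids shell quoting problems with the `*` characters in the query.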
Suggestion: store the das_client.py script somewhere in your home directory and make an alias to it. For example, on lxplus (tcsh syntax):
alias dasCLI 'python /afs/cern.ch/user/x/xyz/tools/das_client.py'
and then run from anywhere as:
dasCLI --query="dataset=/*Zee*/*/* instance=prod/phys03"
Using DBS python client
DAS aggregates data from several sources: DBS, PhEDEx, ReqMgr, SiteDB, etc. However, not every detail stored in those databases is available through DAS, and a DAS query is less efficient than asking one of those services directly.
DAS should therefore be the first choice when looking for dataset information, but advanced users who find its level of detail or its performance inadequate can query DBS directly via its Python client API.
Instructions, examples and guidelines are in this twiki.
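As a rough illustration of a direct DBS query, the sketch below lists dataset names that match a pattern. It rests on several assumptions: that the DBS client package is installed (normally only inside a CMS environment), that it exposes `DbsApi` with a `listDatasets` call as used here, and that the reader URL follows the pattern shown. The helper names `dbs_reader_url` and `list_datasets` are hypothetical. The twiki referenced above remains the authoritative source.

```python
def dbs_reader_url(instance="prod/global"):
    """Build the read-only DBS URL for an instance (assumed URL layout)."""
    return "https://cmsweb.cern.ch/dbs/{0}/DBSReader".format(instance)


def list_datasets(pattern, instance="prod/global"):
    """Return the dataset names matching a pattern, asking DBS directly."""
    # Imported here because the client is only available in a CMS setup.
    from dbs.apis.dbsClient import DbsApi
    api = DbsApi(url=dbs_reader_url(instance))
    # Assumption: each returned entry is a dict with a "dataset" key.
    return [entry["dataset"] for entry in api.listDatasets(dataset=pattern)]


# Only the URL is printed; list_datasets needs a CMS environment and a proxy.
print(dbs_reader_url("prod/phys03"))
```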
Accessing remote samples for interactive testing
The ability to access remote files (i.e. files located at some Tier-2 site) is essential for interactive testing and debugging. A remote file can either be copied to local space (e.g. a desktop or laptop) or opened directly inside cmsRun, using the Xrootd Service. Please refer to the dedicated chapter in this workbook:
Using Xrootd Service (AAA) for Remote Data Access.
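Both options start from an Xrootd URL for the file. The sketch below turns a logical file name into such a URL and into an `xrdcp` copy command. The redirector host name used as a default, the helper names and the example file name are assumptions for illustration only. The dedicated chapter describes which redirector to use.

```python
def xrootd_url(lfn, redirector="cms-xrd-global.cern.ch"):
    """Turn a logical file name (/store/...) into an Xrootd URL."""
    if not lfn.startswith("/store/"):
        raise ValueError("expected a logical file name starting with /store/")
    # The URL keeps a double slash between the host and the file name.
    return "root://{0}/{1}".format(redirector, lfn)


def copy_command(lfn, destination="."):
    """Return the xrdcp argument list that copies the file locally."""
    return ["xrdcp", xrootd_url(lfn), destination]


# Hypothetical file name, used only to show the shape of the result.
print(copy_command("/store/user/xyz/example.root"))
```

The same URL can be given as an input file name to cmsRun to open the file remotely instead of copying it.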
Finding existing MC samples for various physics processes
A list of MC samples requested in the latest production campaign can be found at the MC co-ordination twiki. The collision data (MINI)AOD from the last year and all (MINI)AODSIM samples from the past couple of production campaigns will always be available at some disk site. If a sample is popular in CRAB, the Dynamic Data Management (DDM) team at CMS distributes replicas of it via an automatic procedure, which also removes extra copies of unused datasets.
As an example, to search for samples from the RunII Summer16 MINIAODSIM production campaign, use a query like:
dataset=/*/RunIISummer16DR80X*/MINIAODSIM
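To see how such a pattern selects datasets, it can be applied locally to a list of names. This is only an approximation under stated assumptions: DAS does the real matching on the server, and the sketch assumes that `*` behaves like a shell-style wildcard within each of the three parts of the name. The helper names and the dataset names are hypothetical.

```python
from fnmatch import fnmatchcase


def split_dataset(name):
    """Split /PrimaryDataset/ProcessedDataset/DataTier into its three parts."""
    parts = name.split("/")
    if len(parts) != 4 or parts[0] != "" or not all(parts[1:]):
        raise ValueError("expected /PrimaryDataset/ProcessedDataset/DataTier")
    return tuple(parts[1:])


def matches(name, pattern):
    """Compare a dataset name with a pattern, one part at a time."""
    pairs = zip(split_dataset(name), split_dataset(pattern))
    return all(fnmatchcase(part, pat) for part, pat in pairs)


# Hypothetical names: the second one has a different data tier.
names = ["/ExampleProcess/RunIISummer16DR80X-v1/MINIAODSIM",
         "/ExampleProcess/RunIISummer16DR80X-v1/AODSIM"]
pattern = "/*/RunIISummer16DR80X*/MINIAODSIM"
print([name for name in names if matches(name, pattern)])
```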
Availability of Samples
Recent collision data and MC samples are always on disk at some site. If a sample is not found on disk, or has already been archived to tape, you can file a ticket at the JIRA link.
Release Validation (RelVal) samples
As new releases are integrated and readied for large-scale MC production or data reprocessing, CMS goes through a process referred to as "Release Validation" (RelVal). As part of that, the Data Operations team produces a variety of small-scale samples with that release. These RelVal samples are often your best opportunity to develop analysis code for a new release, as they are the first to appear.
- There are RelVal samples for all major releases.
- These samples have been produced to validate CMSSW pre-releases and releases and the production workflow.
- In general, you should run on these samples with the release with which they were produced (in particular for K_L_M_preX releases).
- RelVal samples are easy to find using the DAS interface. For example, to find ttbar samples your query will look like:
dataset=/*RelValTTbar*/*/*
Review status
Complete review. Some broken links have been fixed. The page provides complete information on finding data samples with DAS for physics analysis.
Responsible: StefanoBelforte
Last reviewed by: StefanoBelforte -
-- FrankWuerthwein - 04-Dec-2009