Process CMS OpenData with CRAB

Introduction

This TWiki page explains how to get and work with files stored at CERN Open Data, and how to access datasets at the T3_CH_CERN_OpenData site with CRAB3. You may need this when you work with a dataset that has DISK status at the T3_CH_CERN_OpenData site and only TAPE status at other sites. You can check this through the DAS (see locating data samples).

Although data at T3_CH_CERN_OpenData are not accessible in the same way as at a standard site, you can still access files from datasets located at CERN Open Data, both directly for specific files and via CRAB requests. This is easier and quicker than requesting a tape recall of the dataset.

Accessing a file directly

Suppose you want to get a file from, for example, the /MuOniaParked/Run2012B-22Jan2013-v1/AOD dataset. If the DAS shows that it is at the T3_CH_CERN_OpenData site, you can find it at CMS Open Data (this link). Just type the dataset name into the search bar at the top of the CERN Open Data website.

Now you can get the Logical File Name (LFN) of each dataset file through the DAS, for example:

/store/data/Run2012B/MuOniaParked/AOD/22Jan2013-v1/30000/F65C0D24-DC6F-E211-AFA6-E41F131815B8.root 

To access it, you need the special CMS Open Data redirector: root://eospublic.cern.ch//eos/opendata/cms/. Replace the /store/data/ part of the LFN with this redirector. You can then access the file with ROOT or CMSSW (with cmsRun etc.) through this path:

 root://eospublic.cern.ch//eos/opendata/cms/Run2012B/MuOniaParked/AOD/22Jan2013-v1/30000/F65C0D24-DC6F-E211-AFA6-E41F131815B8.root 
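The substitution described above can be sketched in a few lines of Python. This is only a minimal illustration, not an official tool: the function name lfn_to_opendata_path is hypothetical, and it handles only LFNs that start with /store/data/, since that is the only case this page describes.

```python
# Redirector and LFN prefix as given on this page.
OPENDATA_REDIRECTOR = 'root://eospublic.cern.ch//eos/opendata/cms/'
LFN_PREFIX = '/store/data/'


def lfn_to_opendata_path(lfn):
    """Replace the /store/data/ prefix of an LFN with the Open Data redirector.

    Hypothetical helper for illustration; other LFN prefixes are rejected
    because this page does not describe how they map.
    """
    lfn = lfn.strip()
    if not lfn.startswith(LFN_PREFIX):
        raise ValueError('Unexpected LFN prefix: {0}'.format(lfn))
    return OPENDATA_REDIRECTOR + lfn[len(LFN_PREFIX):]


example_lfn = ('/store/data/Run2012B/MuOniaParked/AOD/22Jan2013-v1/30000/'
               'F65C0D24-DC6F-E211-AFA6-E41F131815B8.root')
print(lfn_to_opendata_path(example_lfn))
```

The printed string should be the same path as shown above. In PyROOT it could then be passed to ROOT.TFile.Open, assuming ROOT is available in your environment.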

You can also get file paths with the redirector already set from the .txt File Indexes on the dataset's page on the CERN Open Data website (like this for example).

Accessing a dataset via CRAB

You can't simply send a CRAB request for a dataset at T3_CH_CERN_OpenData with a CRAB config like this:

config.Data.inputDataset = '/MuOniaParked/Run2012B-22Jan2013-v1/AOD' 

However, you can use config.Data.userInputFiles and assign to it a list of all dataset files with the CMS Open Data redirector. Below is an example of how to send a CRAB request this way for the /MuOniaParked/Run2012B-22Jan2013-v1/AOD dataset. The same approach works for any dataset that is at CERN Open Data. Of course, you can write your own method for setting a list of a few thousand filename strings as the value of config.Data.userInputFiles.

Getting all filenames of a dataset

First, you need to create a .txt file with all filenames of the dataset. Go to the dataset's page and download all .txt File Indexes. For /MuOniaParked/Run2012B-22Jan2013-v1/AOD there are 15 .txt files containing 4335 filenames in total. Note that the DAS also shows 4335 files.

Now, in the directory with the downloaded files:

 ls 
CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_20000_file_index.txt
CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_20001_file_index.txt
CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_20002_file_index.txt
CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_20003_file_index.txt
CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_20013_file_index.txt
CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_20014_file_index.txt
CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_20020_file_index.txt
CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_20025_file_index.txt
CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_20030_file_index.txt
CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_20034_file_index.txt
CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_20035_file_index.txt
CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_20040_file_index.txt
CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_210000_file_index.txt
CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_30000_file_index.txt
CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_30001_file_index.txt

You can run a simple Python script like this to merge all of these files:

file_names = [
    'CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_20000_file_index.txt',
    'CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_20001_file_index.txt',
    'CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_20002_file_index.txt',
    'CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_20003_file_index.txt',
    'CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_20013_file_index.txt',
    'CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_20014_file_index.txt',
    'CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_20020_file_index.txt',
    'CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_20025_file_index.txt',
    'CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_20030_file_index.txt',
    'CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_20034_file_index.txt',
    'CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_20035_file_index.txt',
    'CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_20040_file_index.txt',
    'CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_210000_file_index.txt',
    'CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_30000_file_index.txt',
    'CMS_Run2012B_MuOniaParked_AOD_22Jan2013-v1_30001_file_index.txt',
]

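# Concatenate all index files into one merged list of filenames.
# The 'with' blocks close every file, including the input files.
with open('MuOniaParkedRun2012B.txt', 'w') as output_file:
    for txtfile in file_names:
        with open(txtfile) as index_file:
            for line in index_file:
                line = line.strip()
                if line:  # skip blank lines
                    output_file.write(line + '\n')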

Now you have the MuOniaParkedRun2012B.txt file with the 4335 filenames of the entire /MuOniaParked/Run2012B-22Jan2013-v1/AOD dataset.
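As an alternative to typing the index file names by hand, the merge can be sketched with glob. This is a minimal sketch under assumptions: the function name merge_file_indexes and the pattern '*_file_index.txt' are illustrative choices based on the file names listed above, and the index files are assumed to contain one filename per line.

```python
import glob
import os


def merge_file_indexes(index_dir, output_path, pattern='*_file_index.txt'):
    """Merge all index files found in index_dir into output_path.

    Returns the number of filenames written, so it can be compared
    with the file count shown by the DAS.
    """
    index_files = sorted(glob.glob(os.path.join(index_dir, pattern)))
    n_written = 0
    with open(output_path, 'w') as output_file:
        for index_path in index_files:
            with open(index_path) as index_file:
                for line in index_file:
                    name = line.strip()
                    if name:  # skip blank lines
                        output_file.write(name + '\n')
                        n_written += 1
    return n_written
```

For the dataset used here, merge_file_indexes('.', 'MuOniaParkedRun2012B.txt') should return 4335 if all 15 index files were downloaded.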

CRAB config

Now you need to set these filenames as the value of config.Data.userInputFiles. One way to do this in the CRAB config:

# Replace '/path/to/file/' with the actual path to the merged file you want to use.
with open('/path/to/file/MuOniaParkedRun2012B.txt') as file_in:
    # One filename per line; strip the trailing newline and skip blank lines.
    for_crab_files = [line.strip() for line in file_in if line.strip()]

config.Data.userInputFiles = for_crab_files
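Before submitting, it may be worth checking that the list looks as expected, for example that no entry still carries a trailing newline or lacks the redirector. The following is a hypothetical helper (check_input_files is not part of CRAB) sketched under the assumption that every entry should start with the CMS Open Data redirector and end with .root.

```python
OPENDATA_REDIRECTOR = 'root://eospublic.cern.ch//eos/opendata/cms/'


def check_input_files(file_names, expected_count=None):
    """Return a copy of the list if it looks sane, otherwise raise ValueError."""
    bad = [name for name in file_names
           if not name.startswith(OPENDATA_REDIRECTOR)
           or not name.endswith('.root')]
    if bad:
        raise ValueError('{0} unexpected entries, first: {1!r}'.format(
            len(bad), bad[0]))
    if len(set(file_names)) != len(file_names):
        raise ValueError('Duplicate filenames in the list')
    if expected_count is not None and len(file_names) != expected_count:
        raise ValueError('Expected {0} files, found {1}'.format(
            expected_count, len(file_names)))
    return list(file_names)
```

In the CRAB config this could be used as config.Data.userInputFiles = check_input_files(for_crab_files, expected_count=4335), where 4335 is the file count quoted on this page for the example dataset.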

That's all. This way you send one CRAB request for the entire /MuOniaParked/Run2012B-22Jan2013-v1/AOD dataset, specifying each file of the dataset directly.

Note that config.Data.userInputFiles can't be used together with config.Data.inputDataset. Also, you should use only 'FileBased' for config.Data.splitting:

config.Data.splitting = 'FileBased'

You may also want to set

config.Data.unitsPerJob = 1

for better splitting and more stable running. You will then have one job for each file of the dataset.
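To see how the choice of config.Data.unitsPerJob affects the size of the task, the job count can be estimated with a short calculation. This is a rough sketch, assuming that 'FileBased' splitting puts up to unitsPerJob files into each job; the function name n_jobs is hypothetical.

```python
import math


def n_jobs(n_files, units_per_job):
    """Estimated number of jobs when each job takes up to units_per_job files."""
    return int(math.ceil(float(n_files) / units_per_job))


print(n_jobs(4335, 1))   # 4335: one job per file
print(n_jobs(4335, 10))  # 434: the last job gets the remaining 5 files
```

With the 4335 files of the example dataset, unitsPerJob = 1 therefore corresponds to 4335 jobs in a single task.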

Example sources

You can find the example sources, the MuOniaParkedRun2012B.txt file and the full CRAB config file, attached to this TWiki page.

-- KirillIvanov - 2020-01-30

Topic attachments
Attachment                  History  Size     Date               Who
MuOniaParkedRun2012B.txt    r1       550.4 K  2020-01-30 19:16   KirillIvanov
crab_B_C.py.txt             r1       2.8 K    2020-01-30 19:18   KirillIvanov
Topic revision: r5 - 2020-01-30 - KirillIvanov
 