
Data handling in CRAB


THIS TWIKI IS BEING REVISITED. ALTHOUGH MOST OF THE INFORMATION IS OK, SOME DETAILS HAVE TO STILL BE POLISHED.

Introduction

All successful jobs in a task produce output files which are eventually copied (staged out) into the CMS namespace on the permanent storage element of the site specified in the Site.storageSite parameter of the CRAB configuration file. The jobs also produce stdout/stderr log files from the code run in the job, which by default are not copied to the permanent storage. The user can force the stage-out of the log files to the permanent storage by setting General.transferLogs = True in the CRAB configuration file. The user can also disable the stage-out of the output files by means of the General.transferOutputs parameter, but in that case publication will not be possible. If the output files are successfully transferred to the permanent storage, CRAB will automatically (and by default) publish the output dataset in DBS. This is done by the same service (ASO) that does the stage-out. If the user wants to disable the publication, he/she can set Data.publication = False in the CRAB configuration file. The only output files that CRAB can publish are those in EDM format; other output files can be transferred, but will not be published even if Data.publication is enabled.
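
For reference, a minimal sketch of the relevant part of a CRAB configuration file could look like the following (the request name, the CMSSW parameter-set file and the storage site are placeholders, not recommendations):

from CRABClient.UserUtilities import config
config = config()

config.General.requestName = 'tutorial_MC_analysis'  # placeholder task name
config.General.transferOutputs = True                # stage out the output files (default)
config.General.transferLogs = True                   # also stage out the log files (default is False)

config.JobType.pluginName = 'Analysis'
config.JobType.psetName = 'pset.py'                  # placeholder CMSSW parameter-set file

config.Data.publication = True                       # publish the EDM output dataset in DBS (default)

config.Site.storageSite = 'T2_XX_Example'            # placeholder permanent storage site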

Logical File Names of output (and log) files

All successful (and sometimes also unsuccessful) jobs produce output (and log) files which are eventually copied into the CMS namespace at a user-specified final location. The LFNs of these files have the following form:

/store/[user|group|local]/<dir>[/<subdirs>]/<primary-dataset>/<publication-name>/<time-stamp>/<counter>[/log]/<file-name>

For example: /store/user/atanasi/GenericTTbar/CRAB3_tutorial_MC_analysis_1/140503_173849/0000/output_1.root.

The log subdirectory is used for the jobs' stdout and stderr log files. Each job's stdout and stderr log files are zipped into an archive file together with a framework job report file.

Before the files are transferred to the permanent storage, a "local" stage-out into a temporary storage at the running site (or another configured site) is performed. The naming of the temporary files is the same, except that the /store/[user|group|local]/<dir> prefix is replaced by /store/temp/user/<username>.<DN-hash> (here, <DN-hash> is the SHA-1 hash of the user's DN, added to ease the transition of users to new DNs). If the files do not have to be transferred, they are discarded and not even copied into the temporary storage.
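
As an illustration of the relation between the two name spaces, here is a small Python sketch (this is not CRAB code; in particular, the exact hashing convention, a hexadecimal SHA-1 digest of the DN string, is an assumption):

import hashlib

def temp_lfn(permanent_lfn, username, user_dn):
    # Replace the /store/[user|group|local]/<dir> prefix of the permanent LFN
    # by /store/temp/user/<username>.<DN-hash>.
    dn_hash = hashlib.sha1(user_dn.encode()).hexdigest()
    tail = '/'.join(permanent_lfn.split('/')[4:])  # everything after /store/user/<dir>/
    return '/store/temp/user/%s.%s/%s' % (username, dn_hash, tail)

# Example with the LFN shown above (the DN is a made-up example):
print(temp_lfn('/store/user/atanasi/GenericTTbar/CRAB3_tutorial_MC_analysis_1/140503_173849/0000/output_1.root',
               'atanasi', '/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atanasi/CN=000000/CN=Some User'))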

The LFNs detailed above can be summarized in the following general form:

<lfn-prefix>/<primary-dataset>/<publication-name>/<time-stamp>/<counter>[/log]/<file-name>

The table below provides an explanation of each of the fields appearing in this generic LFN form.

Field Description
dir A directory. For /store/user/, by default CRAB will use the CERN primary account username of the user who submitted the task.
subdirs A series of (optional) subdirectories.
lfn-prefix The first part of the LFN as specified by the user in the Data.outLFNDirBase parameter in the CRAB configuration file.
primary-dataset The primary dataset name as specified in the Data.outputPrimaryDataset parameter in the CRAB configuration file (when running over private users input files or running MC generation) or as extracted from the Data.inputDataset parameter (when running an analysis over an input dataset). For example, the primary dataset name of /GenericTTbar/HC-CMSSW_5_3_1_START53_V5-v1/GEN-SIM-RECO is GenericTTbar.
publication-name The user-specified portion of the publication name of the dataset as specified in the Data.outputDatasetTag parameter in the CRAB configuration file. Defaults to crab_<General.requestName>. This name is used even if Data.publication = False.
time-stamp A timestamp, based on when the task was submitted. A task submitted at 17:38:49 on 27 April 2014 would result in a timestamp of 140427_173849. The timestamp is used to prevent multiple otherwise-identical user tasks from overwriting each others' files.
counter A four-digit counter, used to prevent more than 1000 files from residing in the same directory. Files 1-999 are kept in directory 0000; files 1000-1999 in directory 0001; files 2000-2999 in directory 0002; and so on.
file-name For output files, this is the file name specified in the user's CMSSW parameter-set configuration file (or the name of an additional output file specified in the JobType.outputFiles CRAB configuration parameter), with the job counter added in. If the parameter-set specifies an output file named output.root, the output file name from job N will be output_N.root. For log files, this is cmsRun_<ID>.log.tar.gz, where <ID> is the ID of the job within the task.
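
Putting the pieces together, the following Python sketch (not CRAB code; the job-number-to-counter mapping is an assumption based on the table above) shows how the fields combine into the final LFN:

def output_lfn(lfn_prefix, primary_dataset, publication_name, time_stamp,
               job_number, file_name, is_log=False):
    counter = '%04d' % (job_number // 1000)  # 0000 for jobs 1-999, 0001 for jobs 1000-1999, ...
    base, ext = file_name.rsplit('.', 1)
    name = 'cmsRun_%d.log.tar.gz' % job_number if is_log else '%s_%d.%s' % (base, job_number, ext)
    parts = [lfn_prefix, primary_dataset, publication_name, time_stamp, counter]
    if is_log:
        parts.append('log')
    parts.append(name)
    return '/'.join(parts)

# Reproduces the example LFN given earlier:
print(output_lfn('/store/user/atanasi', 'GenericTTbar', 'CRAB3_tutorial_MC_analysis_1',
                 '140503_173849', 1, 'output.root'))
# -> /store/user/atanasi/GenericTTbar/CRAB3_tutorial_MC_analysis_1/140503_173849/0000/output_1.root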

Input Data

The following CRAB3 configuration parameters control the input data behavior:

Parameter Name Default Explanation
Data.inputDataset None The (primary) input dataset for CRAB3 to analyze.
Data.lumiMask None The lumi-mask JSON file to apply to the input dataset before analysis.
Data.ignoreLocality False Set to True to allow jobs to run at any site, regardless of whether the dataset is located at that site. The user-specified whitelist and blacklist are still respected.
Data.inputDBS https://cmsweb.cern.ch/dbs/prod/global/DBSReader The URL of the DBS instance for the input dataset. Defaults to the global DBS3 instance.
Data.allowNonValidInputDataset False Set to True to allow CRAB to run over (the valid files of) the input dataset given in Data.inputDataset even if its status in DBS is not VALID.

If the jobs require multiple simultaneous input files (e.g. a RAW-RECO analysis), set Data.inputDataset to the primary input. The user is responsible for making sure that the secondary input dataset is available to the jobs, either via whitelisting or by relying on AAA.
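
A sketch of the corresponding part of a CRAB configuration file (the lumi-mask file name is a placeholder; the input dataset is the one used as example elsewhere in this page):

from CRABClient.UserUtilities import config
config = config()

config.Data.inputDataset = '/GenericTTbar/HC-CMSSW_5_3_1_START53_V5-v1/GEN-SIM-RECO'
config.Data.inputDBS = 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader'  # the default (global DBS3 instance)
config.Data.lumiMask = 'Cert_Example_JSON.txt'                             # placeholder lumi-mask file
config.Data.ignoreLocality = False                                         # the default
config.Data.allowNonValidInputDataset = False                              # the default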

Output Data

The following CRAB3 configuration parameters control the output data behavior.

Parameter Name Default Explanation
Data.publication True Whether to publish the EDM output files in DBS or not. Notice that for publication to be possible, the corresponding output files have to be transferred to the permanent storage element.
Data.publishDBS https://cmsweb.cern.ch/dbs/prod/phys03/DBSWriter/ The URL of the DBS instance where the output files will be published.
Data.outputDatasetTag crab_ + General.requestName The user-specified portion of the output dataset name; publication-name above.
Data.publishWithGroupName False If Data.outLFNDirBase starts with /store/group/<groupname>, use the groupname instead of the username in the publication dataset name.
Data.outputPrimaryDataset CRAB_UserFiles or CRAB_PrivateMC When running an analysis over private input files or running MC generation, this parameter specifies the primary dataset name that should be used in the LFN of the output/log files and in the publication dataset name.
Site.storageSite None The final storage site to which CRAB3 should copy the outputs. Job outputs initially go to the temporary storage of the site where the job ran.
General.transferOutputs True Whether or not to transfer the output files to the storage site. If set to False, the output files are discarded and the user cannot recover them.
General.transferLogs False Whether or not to copy the jobs' log files to the storage site. If set to False, the log files are discarded and the user cannot recover them. Notice however that a short version of the log files, containing the first 1000 lines and the last 3000 lines, is still available through the monitoring web pages.

Note that only output files produced with the cmsRun PoolOutputModule can be published into DBS.
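
For example, a CMSSW parameter-set fragment could contain both an EDM output and a plain ROOT output; only the former can be published (process name and file names below are placeholders):

import FWCore.ParameterSet.Config as cms

process = cms.Process("ANA")

# EDM output, produced by PoolOutputModule: can be published in DBS.
process.out = cms.OutputModule("PoolOutputModule",
    fileName = cms.untracked.string("edm_output.root")
)

# Non-EDM output (e.g. a TFileService histogram file): can be transferred
# to the storage site, but will never be published.
process.TFileService = cms.Service("TFileService",
    fileName = cms.string("histos.root")
)

Additional (non-EDM) output files can be transferred by listing them in the JobType.outputFiles CRAB configuration parameter, but they will not appear in DBS.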

After a job finishes, the AsyncStageOut (ASO) component of CRAB3 copies the files from the temporary storage at the runtime site to the user-specified storage site. This uses an external FTS3 server and, for most sites, performs the transfer using the SRMv2 protocol. Once the stage-out is complete, the outputs can be retrieved with crab getoutput and the files are published into DBS. Publication is done in steps: by default, every 100 completed output files are published as a new block.

Publication in DBS

Output dataset names in DBS

Only EDM output files (i.e. output files produced by PoolOutputModule) can be published in DBS. The output files from each PoolOutputModule instance will be published in a different dataset.

For jobs with a single PoolOutputModule instance, the output dataset name has the following format for DBS3:

/<primary-dataset>/<CERN-username_or_groupname>-<publication-name>-<pset-hash>/USER

Note that this is different from CRAB2; this is done to better align dataset names with production. The variables are documented in the table below.

For jobs with multiple PoolOutputModule instances, the output dataset names have the following format:

/<primary-dataset>/<CERN-username_or_groupname>-<publication-name>-<module-label>-<pset-hash>/USER

For example, if user bbockelm has a job with pset-hash = 12345 analyzing /GenericTTbar/HC-CMSSW_5_3_1_START53_V5-v1/GEN-SIM-RECO, with publication name set to test1, and with the following output modules defined in the CMSSW parameter-set configuration:

process.bar = cms.OutputModule("PoolOutputModule",
    fileName = cms.untracked.string("nothing.root"),
    outputCommands = cms.untracked.vstring("drop *")
)
process.baz = cms.OutputModule("PoolOutputModule",
    fileName = cms.untracked.string("everything.root"),
    outputCommands = cms.untracked.vstring("keep *")
)

then the resulting output datasets would be:

/GenericTTbar/bbockelm-test1-bar-12345/USER

/GenericTTbar/bbockelm-test1-baz-12345/USER

The following table explains the variables in the above templates.

Variable Explanation
primary-dataset The value of the CRAB3 configuration parameter Data.outputPrimaryDataset (or, when running over an input dataset, the primary dataset name extracted from Data.inputDataset).
CERN-username_or_groupname The task owner's CERN username as registered in SiteDB, or, if Data.publishWithGroupName = True and Data.outLFNDirBase starts with /store/group/<groupname>, the group name extracted from the third field of Data.outLFNDirBase.
publication-name The value of the CRAB3 configuration parameter Data.outputDatasetTag.
pset-hash The PSet hash of the cmsRun job. This is not the output of edmConfigHash, but is taken from edmProvDump.
module-label The module label of the PoolOutputModule.

The fields primary-dataset, CERN-username and publication-name are the same as those used for the naming of the output files. The remaining field, pset-hash, is a hash produced from the CMSSW code used by the cmsRun job. The hash guarantees that every different piece of CMSSW code has a distinct output dataset name in DBS, even if publication-name is not changed. It also allows keeping the same publication-name when re-doing a dataset after a modification to the CMSSW code (e.g. a modification to fix a bug in the previously produced dataset). Notice also that, by keeping the same publication-name, a user can add new files to an already published dataset by running a new task, as long as the CMSSW code used to produce the new files is the same as the one used to produce the original files. This is useful, for example, when a user wants to extend a dataset as data become available over the course of an LHC run, or when the user wants to run a second task for only the failed jobs of the first task (as opposed to resubmitting the failed jobs).
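
As an illustration (parameter values are placeholders), a follow-up task extending an already published dataset would keep the same Data.outputDatasetTag and the same CMSSW parameter-set, while using a new General.requestName:

from CRABClient.UserUtilities import config
config = config()

config.General.requestName = 'tutorial_MC_analysis_part2'      # new task name (placeholder)
config.JobType.psetName = 'pset.py'                            # same CMSSW code as the original task
config.Data.outputDatasetTag = 'CRAB3_tutorial_MC_analysis_1'  # same publication-name as the original task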

Changing a dataset or file status in DBS

Note: The instructions in this section are independent of whether the publication was done with CRAB3, CRAB2, or any other service.

It is common for users to publish datasets in DBS that, sooner or later, are not going to be used anymore, or that are buggy to begin with. A frequent question is then: can a user delete a dataset from DBS? The answer is no. Users do not have permission to delete a dataset or a file (i.e. completely remove an entry) from DBS. Instead, what users can do is invalidate a dataset, or individual files in a dataset, that they have published. This corresponds to changing the status of the dataset or file in DBS to invalid. It is important to know that:

  • Invalidating a dataset will automatically invalidate all the files in the dataset.
  • The invalidation of a dataset or file is a reversible procedure.
  • By default invalidated datasets will not be listed by Data Discovery.
  • Invalid datasets can still be used as input in CRAB (although this requires setting Data.allowNonValidInputDataset = True in the CRAB configuration; see the configuration sketch after this list), but any invalid file will be skipped. The rationale is that a file marked as invalid is assumed to be corrupted, deleted, lost or otherwise unreadable. On the other hand, a dataset marked as invalid simply means it is not good for physics, but there may be reasons to look at it anyway (e.g. to figure out why it is marked as invalid and whether it could be turned back into valid by removing some files or similar).
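
A sketch of the relevant CRAB configuration lines for running over such a dataset published in the phys03 local DBS instance (the dataset name is the example one from the previous section; the 'phys03' alias is assumed to be equivalent to the full DBS reader URL):

from CRABClient.UserUtilities import config
config = config()

config.Data.inputDataset = '/GenericTTbar/bbockelm-test1-bar-12345/USER'
config.Data.inputDBS = 'phys03'               # the local DBS instance where user datasets are published
config.Data.allowNonValidInputDataset = True  # needed if the dataset status in DBS is not VALID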

The DBS client provides corresponding scripts DBS3SetDatasetStatus.py and DBS3SetFileStatus.py for changing the status of a dataset or file in DBS. The DBS client is shipped together with the CRAB client. The above scripts are available in $DBS3_CLIENT_ROOT/examples/. After sourcing the CRAB environment, the environment variable $DBS3_CLIENT_ROOT will point to the DBS client location.

Below are examples of how to use the scripts to invalidate a dataset or a file published in the phys03 local DBS instance:

python $DBS3_CLIENT_ROOT/examples/DBS3SetDatasetStatus.py --dataset=<datasetname> --url=https://cmsweb.cern.ch/dbs/prod/phys03/DBSWriter --status=INVALID --recursive=False

python $DBS3_CLIENT_ROOT/examples/DBS3SetFileStatus.py --url=https://cmsweb.cern.ch/dbs/prod/phys03/DBSWriter --status=invalid --recursive=False  --files=<LFN>

The scripts also have a help option:

python $DBS3_CLIENT_ROOT/examples/DBS3SetDatasetStatus.py --help

python $DBS3_CLIENT_ROOT/examples/DBS3SetFileStatus.py --help

The same scripts can be used to turn the status of a dataset or file back to valid by specifying --status=VALID (for datasets) or --status=valid (for files) respectively. Changing the status of a dataset to valid will automatically change the status of all the files in the dataset to valid as well.
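
To check the current status before or after such a change, one can use the DBS client Python API shipped with the CRAB client. A sketch (the dataset name is a placeholder) could look like this:

from dbs.apis.dbsClient import DbsApi

api = DbsApi(url='https://cmsweb.cern.ch/dbs/prod/phys03/DBSReader')
dataset = '/GenericTTbar/bbockelm-test1-bar-12345/USER'

# Dataset status (e.g. VALID, INVALID, DEPRECATED).
for d in api.listDatasets(dataset=dataset, dataset_access_type='*', detail=True):
    print(d['dataset'], d['dataset_access_type'])

# Per-file validity flag.
for f in api.listFiles(dataset=dataset, detail=True):
    print(f['logical_file_name'], 'valid' if f['is_file_valid'] else 'invalid')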

General remarks about publication in DBS

Elevating a dataset from local DBS to global DBS

Users are not allowed to publish data in the global DBS; only data operations people can do it. So if a user needs to elevate a dataset, he/she first of all has to publish the complete dataset in a local DBS instance and then follow https://twiki.cern.ch/twiki/bin/view/CMSPublic/WorkBookGroupActivities to request the transfer of the data to the "right" directory and the publication in the global DBS.

  • Warning: The Store Results service used for dataset migration and elevation does not support growing datasets. This means that the dataset has to be complete before the elevation request; one cannot make another elevation request with the same dataset name. If one produces more similar data, one needs to do something like adding _v1, _v2, _v3, etc. to the dataset name.

Importing a parent dataset from one local DBS instance to another is not allowed

One cannot publish a user dataset in a local DBS instance, analyze it to produce a new user dataset, and publish the new dataset in a different local DBS instance. This is because the import of parents (done by default during publication) cannot transfer data from one local DBS to another; the publication will simply fail. The import is allowed only from the global DBS instance into a local DBS instance. The new dataset can be published only in the same local DBS instance where the "user" parent dataset is published.

-- AndresTanasijczuk - 07 Oct 2014
