
CRAB3 Frequently Asked Questions


Help Notice: This is a large page; it works best if you search for your problem using the browser search function.

If you still have problems, check the CRAB3Troubleshoot guide before asking for support.


Certificates, proxies and all that stuff

crab command fails with Impossible to retrieve proxy from myproxy.cern.ch ...

This can be due to a stale credential in myproxy. CRAB client always tries to keep a valid one there, but there are some known edge cases where this fails, e.g. https://github.com/dmwm/CRABServer/issues/5168.

Therefore you should remove the credentials from myproxy and then issue the crab command again. To remove stale credentials:

grep myproxy-info <CRAB project directory>/crab.log
# example:  grep myproxy-info crab_20160308_140433/crab.log 
you will get something like
 command: myproxy-info -l ec95456d3589ed395dc47d3ada8c94c67ee588f1 -s myproxy.cern.ch
and/or
 command : myproxy-info -l belforte_CRAB -s myproxy.cern.ch
# In this  case you will see your CERN computer username in place of "belforte", of course

Ideally the human-readable credential name is the one that matters and can be located easily, but for reasons you do not want to know, at times CRAB needs the horrible hex string.

Then simply issue a myproxy-destroy command with the same arguments:

# example. In real life replace the long hex string with the one from your crab.log
myproxy-destroy -l ec95456d3589ed395dc47d3ada8c94c67ee588f1 -s myproxy.cern.ch
# example. In real life put your CERN username
myproxy-destroy -l <username>_CRAB -s myproxy.cern.ch

If things still fail after that, send the following additional info in your request for support, replacing the long hex string with the one that you found in crab.log (ec95456d3589ed395dc47d3ada8c94c67ee588f1 in the above example):

  • output of voms-proxy-info -all
  • output of myproxy-info -d -l <long-hex-string> -s myproxy.cern.ch
  • output of myproxy-info -d -l <username>_CRAB -s myproxy.cern.ch
  • content of your crab.log as an attachment

crab command fails with "Error UsernameException: Error contacting CRIC."

In particular, the crab output contains something like this:
Error UsernameException: Error contacting CRIC.
Details follow:
  Executed command: curl -sS --capath /etc/grid-security/certificates --cert /tmp/x509up_u8516 --key /tmp/x509up_u8516 'https://cms-cric.cern.ch/api/accounts/user/query/?json&preset=whoami'
    Stdout:
      
    Stderr:
      curl: (35) error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version

This is known to happen with CMSSW_7_* and very likely older releases as well, both on SL6 and SL7. The problem is with the version of the curl binary in the CMSSW release, which is not up to date with the security requirements of modern https servers.

The best solution is of course to use a more recent CMSSW release, but if you are really stuck in the past, a workaround is to issue this command right after cmsenv:

BEWARE THIS WILL MAKE CRAB WORK BUT BREAK YOUR gcc ENVIRONMENT
YOU WILL NOT BE ABLE TO COMPILE OR DO scram build UNTIL YOU OPEN A NEW SHELL

# for SL6
source /cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/curl/7.59.0/etc/profile.d/init.sh
# for SL7
source /cvmfs/cms.cern.ch/slc7_amd64_gcc630/external/curl/7.59.0/etc/profile.d/init.sh

CRAB setup

Does CRAB setup conflict with CMSSW setup?

No. The CRAB client runs within the CMSSW environment.
Make sure you always do cmsenv before using CRAB.

I need to use an (old?) CMSSW release where the CRAB client fails, what can I do ?

CRAB is currently tested and compatible with any SL7 CMSSW release (CMSSW_8, 9, 10, 11, 12). It is possible to use CRAB for SL6 releases (CMSSW_7_x) using the CMSSW singularity/apptainer container cmssw_cc6. There is an important caveat though:
  • the CMS SL6 container does not have the myproxy-* commands, therefore from there crab commands can not create the long-lived credential in myproxy.cern.ch needed to run CRAB tasks, nor verify it
  • therefore if you need to use CRAB on SL6, you need to:
    1. take care of myproxy in separate SL7 session by executing a crab command at least once a week (e.g. crab createmyproxy)
    2. make sure to have a valid local grid proxy to use in the SL6 container to connect to CMSWEB via voms-proxy-init -voms cms -rfc -valid 192:00
    3. tell CRAB to use this proxy and skip any attempt to check myproxy credentials
      • export X509_USER_PROXY=`voms-proxy-info -file`
      • crab submit --proxy=$X509_USER_PROXY ....
      • and similarly for any other crab command, always do
      • crab   --proxy=$X509_USER_PROXY  ...

The same procedure as above can work for earlier CMSSW versions on SL6, but those are not currently validated.

More information about CMSSW singularity containers at http://cms-sw.github.io/singularity.html.

CRAB configuration file

Documentation about the CRAB configuration file: CRAB3ConfigurationFile.

What is the maximum memory per job (maxMemoryMB) I can request?

CRAB requests by default a maximum memory of 2000 MB. This is the maximum memory per core that all sites guarantee they will run. Some sites, but not many, offer a bit more (typically 2500 MB), and some sites even offer 4000 MB for special users. The memory limits offered by each site are accessible in the GlideinWMS VO Factory Monitor page, http://glidein.grid.iu.edu/factory/monitor/ (choose "Current Status of the Factory" and click on a site CE listed under "Entry Name" in the table), but this should not be considered documentation. The best advice we can give is: stick to the default, and if you think you need more, find out first if there are sites (and which ones) that can run such jobs. If you need help, you can write to us.

Note: In case of a multi-threaded job (config.JobType.numCores > 1) the default memory value is most likely not enough. The user share of computing resources is accounted for based on the requested memory per core.
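
For example, a minimal configuration sketch for a 4-thread job, following the (1 + 1*num_threads) GB rule of thumb mentioned in the exit code 50660 FAQ below (the numbers are illustrative, not a recommendation):

# illustrative values only: 4 threads, (1 + 4) GB = 5 GB total memory for the job
config.JobType.numCores = 4
config.JobType.maxMemoryMB = 5000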

What is the 'Automatic' splitting mode?

The Data.splitting parameter now has a default value: 'Automatic'.
With such a setting the task processing is split into three stages:
  1. A "probe" stage, where some probe jobs are submitted to estimate the event throughput of the CMSSW parameter-set configuration provided by the user in the JobType.psetName parameter and possible further arguments. Probe jobs have a job id of the form 0-[1,2,3,...]; they can not be resubmitted, and the task will fail if none of the probe jobs completes successfully. The transfer of output files is disabled for probe jobs.
  2. A "main" stage, very similar to the conventional stage for other splitting modes, in which a number of main jobs (automatically determined by the probe stage) process the dataset. These jobs can not be manually resubmitted and have a fixed maximum runtime (specified in the Data.unitsPerJob parameter), after which they gracefully stop processing input data. The remaining data will be processed in the next (tail) stage, and such jobs are labelled as "rescheduled" in the main stage (in the dashboard they will always appear as "failed").
  3. Some possible "tail" stages. If some main job does not finish successfully ("rescheduled" in the previous stage) or does not completely process the amount of data assigned to it due to the automatically configured maximum job run time, tail jobs are created and submitted in order to fully process the dataset. Tail jobs have a job id of the form n-[1,2,3,...], where n=1,2,... represents the tail stage number. For small tasks, less than 100 jobs, one tail stage is started when all jobs have completed (successfully or failed). For larger tasks, a first tail stage collects all remaining input data from the first 50% of completed jobs, followed by a stage that processes data when 80% of jobs have completed, and finally a stage collecting leftover input data at 100% job completion.
    Failed tail jobs can be manually resubmitted by the users.

Once the probe stage is completed, the plain crab status command shows only the main and tail jobs. For the list of all jobs add the --long option.

In the short format of the crab status command, the total number of jobs is computed by removing from the list of main jobs those which "failed" and are being "rescheduled", and adding the current number of "tail" jobs. Note that the total number of submitted jobs will increase at every "tail stage" and that the grafana dashboard will instead show every job (probes plus main plus tails); therefore the total number of jobs in the dashboard will be different from what is printed by crab status. As usual: when in doubt, trust the crab status output.
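
As a reference, a minimal configuration sketch for a task using Automatic splitting; here Data.unitsPerJob is the target runtime of main jobs in minutes (the value is illustrative):

config.Data.splitting = 'Automatic'
# target runtime per main job, in minutes (illustrative value)
config.Data.unitsPerJob = 180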

Should I use Automatic splitting or dryrun or ... ?

PRELIMINARY WRITEUP. CRAB tries to support these two use cases, for which the only solution was --dryrun when we initially introduced CRAB3, so many users are still sticking to it:
  1. a quick check that a given configuration runs with CRAB
  2. a preliminary look at time/memory usage for tuning the splitting
But other tools are now available (crab preparelocal and Automatic splitting) which are better suited for this:

  1. This can be done with crab preparelocal, but we need to write down some details on how to do it and possibly add some options to make it easy. Work currently planned for 2022.
  2. One of the main reasons for introducing Automatic splitting (by the way, by the very same people who developed --dryrun) was that --dryrun gives a poor estimate of running time and memory needs, since it only samples the first events of the first file. Many datasets show large variations in those from one lumi to another, and there are also variations due to the hardware the jobs are executed on. So Automatic splitting runs 5 probe jobs which sample random files and run about 15 min each, and it has a refined way to recover jobs which did not fit the initial guess. This latter part is the real strong point of Automatic splitting.
    Of course there are times when the automatic splitting machinery may be overkill, and running on a handful of events is good enough to decide how to organize the processing. One could certainly submit a CRAB task which only processes one lumi, or simply run cmsRun locally. Of course performance on lxplus can randomly be much better or much worse than on the grid, and so any estimate obtained there or on a local desktop should be taken with care before extrapolating to a possibly large task.
    A "probe" task on a small fraction of a large dataset is always a very good thing.

Can I use CRAB to process data described only as a Rucio container ?

Yes. You can indicate a Rucio DID (scope:containerName) as input to a CRAB job in the Data.inputDataset configuration parameter:
config.Data.inputDataset = 'user.belforte:/GenericTTbar/belforte-crab_20230306_162534-94ba0e06145abd65ccb1d21786dc7e1d/USER'
Important Notes:
  • This feature is meant to address the use case of datasets produced by previous CRAB tasks but not saved in DBS, or of files simply placed in /store/user/username/... at some site, leaving to the user the creation and filling of the Rucio container. A properly formed Rucio container is also a byproduct of running standard CRAB tasks using Rucio for Asynchronous Stage Out (i.e. sending output to /store/user/rucio/username/...).
  • This is not meant to be an alternative way to access official CMS data described in DBS.
  • Proper creation of a Rucio container is not described here
  • Each file replica must be registered on a disk CMS RucioStorageElement (RSE)
  • There is no tape recall
  • The containerName needs to follow the CMS DBS dataset naming syntax (like all DIDs in the CMS instance of Rucio). CRAB will then use Rucio to find the location of blocks (in DBS terms) and the file names, and create jobs as usual.
  • Warning: since Rucio does not have run/lumi/event information, you can only use FileBased splitting, and can not ask for parents or secondary datasets, which rely on lumi information to connect primary/secondary files.
  • Warning: since there is no tape recall, if some Rucio-datasets/DBS-blocks do not have a full disk replica, they will be skipped and CRAB will process what's available.
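
A minimal configuration sketch for processing the Rucio container from the example above; FileBased splitting is required, as noted in the warnings:

config.Data.inputDataset = 'user.belforte:/GenericTTbar/belforte-crab_20230306_162534-94ba0e06145abd65ccb1d21786dc7e1d/USER'
# only FileBased splitting is possible, since Rucio has no run/lumi/event information
config.Data.splitting = 'FileBased'
config.Data.unitsPerJob = 1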

My dataset is (partially) on tape (and partially on disk), how can I process it with CRAB?

When the input dataset is not fully on disk, CRAB automatically creates a Rucio rule in CMS Rucio to copy the data from the tape site to any site with available disk space. The CRAB task status will change from NEW to TAPERECALL.
Every 4 hours, CRAB picks up TAPERECALL tasks and checks the Rucio rules. If a rule changes state from REPLICATING to OK, CRAB changes the task status back to NEW, waiting for the main thread to process it like a usual task.
We call this procedure "automatic tape recall".

Note that recall of a whole dataset in CRAB works only for *AOD* tiers, because other data tiers are usually very large and hardly needed in their entirety.

Still, you can request tape recall for other data tiers if you limit processing to a few blocks via the Data.inputBlocks parameter. In this case the blocks will be automatically recalled to disk, but only if their total size is small (currently 10 TB).

Another option is to force CRAB to process only the data currently available on disk, via the Data.partialDataset = True parameter, which skips all tape recall procedures.
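
For example, a minimal sketch that processes only the disk-resident part of the dataset used in the example below and skips any tape recall:

config.Data.inputDataset = '/ZeroBias/Commissioning2021-v1/RAW'
# process only what is currently on disk, do not trigger tape recall
config.Data.partialDataset = True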

As an example of use of Data.inputBlocks parameter: let's assume we want to recall /ZeroBias/Commissioning2021-v1/RAW#214fbb11-2d5c-4b8c-8ee5-b68de1d89153 and /ZeroBias/Commissioning2021-v1/RAW#8760f6e6-6c5e-4c65-854d-02d4bc3312c5 from /ZeroBias/Commissioning2021-v1/RAW dataset.

CRAB config will be:

config.JobType.pluginName = 'Analysis'
config.Data.inputDataset = '/ZeroBias/Commissioning2021-v1/RAW'
config.Data.inputBlocks = [
    '/ZeroBias/Commissioning2021-v1/RAW#214fbb11-2d5c-4b8c-8ee5-b68de1d89153',
    '/ZeroBias/Commissioning2021-v1/RAW#8760f6e6-6c5e-4c65-854d-02d4bc3312c5',
]
# or provide only block UUID
#config.Data.inputBlocks = [
#    '214fbb11-2d5c-4b8c-8ee5-b68de1d89153',
#    '8760f6e6-6c5e-4c65-854d-02d4bc3312c5',
#]

CRAB cache

User quota in the CRAB cache

There is no quota on CRAB Cache anymore

The CRAB User File Cache, or CRAB cache for short, is the place where:

  • the CRAB client puts the user input sandboxes when submitting a task to the CRAB server;
  • the CRAB client puts the crab.log files when they are uploaded via crab uploadlog;
  • the CRAB server puts the archives produced by the dry run submissions;
  • etc.

What is the maximum allowed size of the user input sandbox?

120 MB.

What are the files CRAB adds to the user input sandbox?

CRAB adds the following directories/files to the user input sandbox; when a directory is included, all the contained subdirectories are also recursively included:
  • The directories $CMSSW_BASE/lib, $CMSSW_BASE/biglib and $CMSSW_BASE/module. One can also tell CRAB to include the directory $CMSSW_BASE/python by setting JobType.sendPythonFolder = True in the CRAB configuration.
  • Any data, interface and python directory recursively found in $CMSSW_BASE/src.
  • All additional directories/files specified in the CRAB configuration parameter JobType.inputFiles.
  • The original CRAB configuration file (added as debug/crabConfig.py).
  • The original CMSSW parameter-set configuration file (added as debug/originalPSet.py).
  • The tweaked CMSSW parameter-set configuration file in pickle format (added as PSet.pkl) plus a simple PSet.py file to load the pickle file.

How are the inputFiles handled in the user input sandbox?

Depending on whether filenames or directories are used in the config.JobType.inputFiles parameter, the directory structure inside the sandbox may be different and affect where the files are placed in the working directory of the job.
  • Specific file names are always added to the root directory of the sandbox, whether an absolute or relative file name is used. For example, both /afs/cern.ch/user/e/erupeika/supportFiles/foo.root or myfiles/foo.root will appear as foo.root in the sandbox and will be extracted as foo.root to the job's root working directory.
  • The directory structure inside each additional input file directory is maintained in the sandbox. The additional directories themselves will be located in the root directory of the sandbox. For example, if a directory foo with files bar1 and bar2 inside it is specified in the inputFiles parameter, the sandbox will contain foo, foo/bar1 and foo/bar2 (the working directory of the job will therefore also contain a directory foo with files bar1 and bar2).
  • For example, if your application expects to find mydir/file1, you should put config.JobType.inputFiles='mydir' in the CRAB configuration and of course avoid having extra stuff in that directory. If instead you put config.JobType.inputFiles='mydir/file1', your application needs to open file1.
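
A minimal sketch illustrating the two cases above (mydir and file1 are hypothetical names; inputFiles takes a python list):

# case 1: the job finds mydir/file1 (the whole directory is shipped)
config.JobType.inputFiles = ['mydir']
# case 2: the job finds file1 in its top-level working directory
# config.JobType.inputFiles = ['mydir/file1']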

Stageout and publication

Documentation about (input/output) data handling in CRAB: Crab3DataHandling.

What are the allowed stageout LFN paths with CRAB?

CRAB allows only the following LFN directory paths for stageout:

  • /store/user/[rucio/]<username>[/<subdirs>] where username is the CERN primary account username;
  • /store/group[/rucio]/<groupname>[/<subdirs>] where groupname can be any already existing directory under /store/group/.

If not publishing, /store/local/<dir>[/<subdirs>] is also allowed.

If /rucio is present, CRAB will use Rucio for stageout. The user will need to have Rucio quota at the destination storage site.

These are all the allowed paths that can be set in the CRAB configuration parameter Data.outLFNDirBase. If any other path is given, the submission of the task will fail.
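
For example (username and subdirectory are placeholders to be replaced):

config.Data.outLFNDirBase = '/store/user/<username>/<subdir>'
# or, to stage out with Rucio (requires Rucio quota at the destination site):
# config.Data.outLFNDirBase = '/store/user/rucio/<username>/<subdir>'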

Can I stage out my files into a /store/user/ area that uses a different username than the one of my CERN primary account?

Short answer: No. Explanation and details: CRAB will look up the user's username registered in CMS (which is the username of the CERN primary account), using for the query the user's DN (which in turn is extracted from the user's credentials), and will try to stage out to /store/user/<username>/ (by default). If the store user area uses a different username, it's up to the destination site to remap that (via a symbolic link or something similar). The typical case is Fermilab; to request the mapping of the store user area, FNAL users should follow the directions on the usingEOSatLPC web page to open a ServiceNow ticket to get this fixed.

To prevent stage out failures, and in case the user has provided in the Data.outLFN parameter of the CRAB configuration file an LFN directory path of the kind /store/user/[<some-username>/<subdir>*] (i.e. a store path that starts with /store/user/), CRAB will check if some-username matches with the globally unique username extracted from the credential. If it doesn't, it will give an error message and not submit the task. The error message would be something like this:

Error contacting the server.
Server answered with: Invalid input parameter
Reason is: The parameter Data.outLFN in the CRAB configuration file must start with either '/store/user/<username>/' or '/store/group/<groupname>/' (or '/store/local/<something>/' if publication is off), wher...

Unfortunately the "Reason is:" message is cut at 200 characters. The message should read:

Reason is: The parameter Data.outLFN in the CRAB configuration file must start with either '/store/user/<username>/' or '/store/group/<groupname>/' (or '/store/local/<something>/' if publication is off), where username is your username as registered in CMS services (i.e. the username of your CERN primary account).

A similar message should be given by crab checkwrite if the user does crab checkwrite --site=<CMS-site-name> --lfn=/store/user/<some-username>.

Why is CRAB not transferring an output file?

First of all, does CRAB know at all that the job should produce the output file in question? To check that, open one of the job log files linked from the task monitoring pages. Very close to the top, the list of output files that CRAB expects to see once the job finishes is printed (shown below is the case of job number 1 in the task):

======== HTCONDOR JOB SUMMARY at ... START ========
CRAB ID: 1
Execution site: ...
Current hostname: ...
Destination site: ...
Output files: my_output_file.root=my_output_file_1.root 

If the output file in question doesn't appear in that list, then CRAB doesn't know about it, and of course it will not be transferred. This doesn't mean that the output file was not produced; it is simply that CRAB has to know beforehand what are the output files that the job produces.

If the output file is produced by either PoolOutputModule or TFileService, CRAB will automatically recognize the name of the output file when the user submits the task and it will add the output file name to the list of expected output files. On the other hand, if the output file is produced by any other module, the user has to specify the output file name in the CRAB configuration parameter JobType.outputFiles in order for CRAB to know about it. Note that this parameter takes a python list, so the right way to specify it is:

config.JobType.outputFiles = ['my_output_file.root']

Can I delete a dataset I published in DBS?

Users do not have permissions to delete a dataset or a file from DBS. Instead, what users can do is to change the status of the dataset or of individual files in the dataset. For more details see Changing a dataset or file status in DBS.

Can I send CRAB output to CERNBOX ?

Yes, by simply indicating T3_CH_CERNBOX as the storage location in the CRAB configuration.

Be aware that CERNBOX is a service offered by CERN, not CMS. CMS simply offers a way for CRAB users to send CRAB output data to their CERNBOX disk space via the T3_CH_CERNBOX site name.
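
A minimal configuration sketch (replace <username> with your CERN username):

config.Site.storageSite = 'T3_CH_CERNBOX'
config.Data.outLFNDirBase = '/store/user/<username>'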

Explanation:

The T3_CH_CERNBOX site is not a real CMS computing site, but a fake one created in https://cms-cric.cern.ch/cms/site/index/ so that a trivial file catalog exists for it and it is known to Rucio, i.e. it is a known storage location for CMS. This allows CRAB to use the T3_CH_CERNBOX string and the Rucio client service to map logical file names of the kind /store/user/somename to gsiftp://eosuserftp.cern.ch/eos/user/s/somename, which is the proper end point for writing to CERNBOX.

NOTE: since CERNBOX is NOT a CMS storage, files in there can not be listed in DBS nor moved with Rucio, nor transparently accessed by grid jobs.

Note about certificate: since June 2021 (at least) there is no need anymore to contact CERN IT support as indicated previously. A certificate which works for CRAB submission already satisfies the requirements for working for CERNBOX as well. If you have problems you should verify in a CRAB-independent way via the following commands on lxplus (make sure to replace s/somename with <the first letter of your CERN username>/<your CERN username> e.g. b/belforte )

  • voms-proxy-init -voms cms
  • gfal-ls gsiftp://eosuserftp.cern.ch/eos/user/s/somename

Stageout with Rucio

Please see CrabASOwithRucio

Jobs/Task status

My jobs are still idle/pending/queued. How can I know why and what can I do?

If jobs are pending for more than ~12 hours, there is certainly a problem somewhere. The first thing to do is to identify to which site(s) the jobs were submitted and check the site(s) status in the Site Readiness Monitor page, http://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html. For example, the "HammerCloud" row will tell whether analysis jobs are running at the site and their success rate, and the "Maintenance" row will tell whether the site had/has a downtime (clicking on the corresponding date inset in the table will open a new web page where the downtime reason is explained). If everything looks fine with the site(s) status, it may be that the user jobs are not running because they requested more resources (memory per core) than what the site(s) can offer (see What is the maximum memory per job (maxMemoryMB) I can request?).

crab status says that Task is FAILED w/o any other information

This can happen when using Automatic Splitting if all of the probe jobs failed (see CRAB3FAQ#What_is_the_Automatic_splitting for a description of probe jobs). Example:
CRAB project directory:      /afs/cern.ch/work/b/belforte/CRAB3/TC3/dbg/zuolo/crab_TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8_1
Task name:         190823_123943:dzuolo_crab_TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8_1
Grid scheduler - Task Worker:   crab3@vocms0107.cern.ch - crab-prod-tw01
Status on the CRAB server:   SUBMITTED
Task URL to use for HELP:   https://cmsweb.cern.ch/crabserver/ui/task/190823_123943%3Adzuolo_crab_TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8_1
Dashboard monitoring URL:   http://dashb-cms-job.cern.ch/dashboard/templates/task-analysis/#user=dzuolo&refresh=0&table=Jobs&p=1&records=25&activemenu=2&status=&site=&tid=190823_123943%3Adzuolo_crab_TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8_1
New dashboard monitoring URL:   https://monit-grafana.cern.ch/d/cmsTMDetail/cms-task-monitoring-task-view?orgId=11&var-user=dzuolo&var-task=190823_123943%3Adzuolo_crab_TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8_1
In case of issues with the new dashboard, please provide feedback to hn-cms-computing-tools@cern.ch
Status on the scheduler:   FAILED

No publication information (publication has been disabled in the CRAB configuration file)
Log file is /afs/cern.ch/work/b/belforte/CRAB3/TC3/dbg/zuolo/crab_TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8_1/crab.log

In this case you can get more information with crab status --long

When using Automatic splitting all probe jobs fail with 50664 (time limit exceeded)

User configuration parameters have no effect on the time limit for probe jobs. They are always configured to terminate after 15 min, but cmsRun can only stop on lumi boundaries. Probe jobs are allowed to run up to 1 h, but if not even that is sufficient, they will fail as in the example below. In this case you should avoid using Automatic splitting and fall back to LumiBased splitting with a few lumis per job (see the configuration sketch after the table).

 Job State        Most Recent Site        Runtime   Mem (MB)      CPU %    Retries   Restarts      Waste       Exit Code
 0-1 no output    T1_US_FNAL              1:00:18       1548         97          0          0    0:00:10           50664
 0-2 no output    T1_US_FNAL              1:00:15       1509         97          0          0    0:00:10           50664
 0-3 no output    T1_US_FNAL              1:00:18       1445        100          0          0    0:00:10           50664
 0-4 no output    T1_US_FNAL              1:00:17       1402         98          0          0    0:00:11           50664
 0-5 no output    T2_US_MIT               1:00:16       1656         94          0          0    0:00:10           50664
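
A minimal sketch of the suggested fallback configuration (the number of lumis per job is illustrative and should be tuned to your payload):

config.Data.splitting = 'LumiBased'
config.Data.unitsPerJob = 5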

CRAB commands

crab checkusername fails with "Error: Failed to retrieve username from CRIC."

OBSOLETE. Needs to be modified for the new method based on CRIC. crab checkusername uses the following sequence of bash commands, which you should try to execute one by one (make sure you have a valid proxy) to check if they return what is expected.

1) It gets the path to the user's proxy file with the command

which scram >/dev/null 2>&1 && eval `scram unsetenv -sh`; voms-proxy-info -path

which should return something like

/tmp/x509up_u57506

2) It defines the path to the CA certificates directory with the following python command

import os
capath = os.environ['X509_CERT_DIR'] if 'X509_CERT_DIR' in os.environ else "/etc/grid-security/certificates"
print(capath)

which should be equivalent to the following bash command

if [ "x$X509_CERT_DIR" != "x" ]; then capath=$X509_CERT_DIR; else capath=/etc/grid-security/certificates; fi
echo $capath

and which in lxplus should result in

/etc/grid-security/certificates

3) It uses the proxy file and the capath to query https://cmsweb.cern.ch/sitedb/data/prod/whoami

curl -s --capath <output-from-command-2-above> --cert <output-from-command-1-above> --key <output-from-command-1-above> 'https://cmsweb.cern.ch/sitedb/data/prod/whoami'

which should return something like

{"result": [
 {"dn": "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atanasi/CN=710186/CN=Andres Jorge Tanasijczuk", "login": "atanasi", "method": "X509Proxy", "roles": {"operator": {"group": ["crab3"], "site": []}}, "name": "Andres Jorge Tanasijczuk"}
]}

4) Finally it parses the output from the above query to extract the username from the "login" field (in my case it is atanasi).

When reporting a problem with crab checkusername with "Failed to retrieve username from SiteDB." to the CRAB experts, it would be useful to add the output from the above commands.

Note: Even if crab checkusername gives an error retrieving the username from SiteDB, this should not stop you from trying to submit jobs with CRAB, because the error might just be a problem with crab checkusername itself and not a real problem with your registration in SiteDB (CRAB uses a different mechanism than the one described above to check the users' registration in SiteDB when attempting to submit jobs to the grid).

crab submit fails with "Task could not be submitted because there is no DISK replica"

Jobs can only read data which is on DISK. Data which is on tape needs to be recalled and placed on disk first.

If your desired input is on tape you have various options depending on the situation.

First check with your Physics Group if that input is the correct one. At times a dataset has been replaced with a better version, and if you point to the old one you will find that it is on tape, while the new one is on disk. In CMS it is expected that all analyses can be done from MINIAOD* and/or nanoAOD*, which are mostly on disk.

Recalling a dataset from tape takes days and adds load to CMS storage systems, competing for space and tape access with activities needed for production, so make sure you really need those tape data before going further.

If you absolutely need data which is not on disk now, here are your options:

1. ask your physics group to request that the given dataset is put on disk, they have a communication channel with CMS Data Operations to do this.

2. CRAB will automatically submit a tape recall request to the CMS Dynamic Data Management system (currently based on Rucio), which will act on it. Then CRAB will place your task in a special hold status called TAPERECALL, and resume it automatically once the data is on disk; the crab status command will tell you how to monitor progress. Note that rucio commands are currently not compatible with the CMSSW environment and you will need a new shell window. You can find documentation for Rucio in the CMS Rucio twiki page.

3. If you have a special need which can't be solved by the available automatic procedure, you can contact the CMS Data Transfer team via mail: cms-comp-ops-transfer-team@cern.ch

crab submit fails with "User quota limit reached; cannot upload the file"

When doing crab submit the user may get this error message:

Error contacting the server.
Server answered with: Invalid input parameter
Reason is: User quota limit reached; cannot upload the file

Error explanation: The user has reached the limit of 4.88GB in their CRAB cache area. Read more in this FAQ.

What to do: Files in the CRAB cache are automatically deleted after 5 days, but the user can clean his/her cache area at any time. See how in this FAQ.

crab (re)submit fails with "Trapped exception in Dagman.Fork"

Typical error in crab status:

Failure message: The CRAB server backend was not able to (re)submit the task, because the Grid scheduler answered with an error. This is probably a temporary glitch. Please try again later. If the error persists send an e-mail to hn-cms-computing-tools@cern.ch<mailto:hn-cms-computing-tools@cern.ch>. Error reason: Trapped exception in Dagman.Fork: <type 'exceptions.RuntimeError'> Unable to edit jobs matching constraint <traceback object at 0xa113368>
  File "/data/srv/TaskManager/3.3.1512.rc6/slc6_amd64_gcc481/cms/crabtaskworker/3.3.1512.rc6/lib/python2.6/site-packages/TaskWorker/Actions/DagmanResubmitter.py", line 113, in executeInternal
    schedd.edit(rootConst, "HoldKillSig", 'SIGKILL')

As the error message says, this should be a temporary failure. One should just keep trying until it works. But after doing crab resubmit, give it some time to process the resubmission request; it may take a couple of minutes to see the jobs reacting to the resubmission.

crab submit fails with "Task failed to bootstrap on schedd"

After doing crab submit and crab status the user may get this error message:

Task status: UNKNOWN

Error during task injection:    Task failed to bootstrap on schedd

Error explanation: The submission of the task to the scheduler machine has failed.

What to do: Submit again.

crab submit fails with "Failed to contact Schedd"

When doing crab status the user may get one of these error messages:

Error during task injection:        <task-name>: Failed to contact Schedd: Failed to fetch ads from schedd.

Error during task information retrieval:        <task-name>: Failed to contact Schedd: .

Error explanation: This is a temporary communication error with the scheduler machine (submission node), most probably because the scheduler is overloaded.

What to do: Try again after a couple of minutes.

crab submit fails with "Splitting task ... on dataset ... with ... method does not generate any job"

This is not a CRAB error.

This usually happens when there is no lumi to process, i.e. the intersection of

  1. the input lumimask (if any)
  2. the selected run range (if any)
  3. the set of runs and lumis in the input dataset
is empty. Typical reasons are using a golden json lumimask from some data acquisition era on data from a different era or looking for a specific run in a dataset which does not include that run.

You should carefully cross check what you are trying to select, possibly use lumi arithmetic to verify, and only report this as a problem if you are sure that there is a bug.

crab submit fails with "Block ...  contains more than 100000 lumis and cannot be processed for splitting. For memory/time contraint big blocks are not allowed. Use another dataset as input."

The message is self-explanatory. The CRAB server would die due to lack of memory if it had to process luminosity lists with millions of entries per block. This can only happen with MC datasets which have been created with improper use of lumisections, since the limit of 100k lumisections in one block would correspond, for real data, to 100 days of continuous data taking. For MC, lumi sections have no relation with luminosity but are used only to allow processing less than a file in one job via the split-by-lumi algorithm; in this case it makes no sense to have more lumis than events.

Some more discussion is in this thread: https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/2928.html

There are a few datasets in DBS which do not satisfy this limit. If someone really needs to process those, the only way is to do one job per file using the userInputFiles feature of CRAB. An annotated example of how to do this in python is below; note that you have to disable DBS publication, indicate split-by-file and provide the input file locations; other configuration parameters can be set as usual:


# this will use CRAB client API
from CRABAPI.RawCommand import crabCommand

# talk to DBS to get list of files in this dataset
from dbs.apis.dbsClient import DbsApi
dbs = DbsApi('https://cmsweb.cern.ch/dbs/prod/global/DBSReader')

dataset = '/BsToJpsiPhiV2_BFilter_TuneZ2star_8TeV-pythia6-evtgen/Summer12_DR53X-PU_RD2_START53_V19F-v3/AODSIM'
fileDictList=dbs.listFiles(dataset=dataset)

print ("dataset %s has %d files" % (dataset, len(fileDictList)))

# DBS client returns a list of dictionaries, but we want a list of Logical File Names
lfnList = [ dic['logical_file_name'] for dic in fileDictList ]

# this is now standard CRAB configuration

from WMCore.Configuration import Configuration
config = Configuration()

config.section_("General")
config.General.transferLogs = False

config.section_("JobType")
config.JobType.pluginName = 'Analysis'

# in following line of course replace with your favorite pset
config.JobType.psetName = 'demoanalyzer.py'
config.section_("Data")

# following 3 lines are the trick to skip DBS data lookup in CRAB Server
config.Data.userInputFiles = lfnList
config.Data.splitting = 'FileBased'
config.Data.unitsPerJob = 1

# since the input will have no metadata information, output can not be put in DBS
config.Data.publication = False

config.section_("User")
# 

config.section_("Site")

# since there is no data discovery and no data location lookup in CRAB
# you have to say where the input files are
config.Site.whitelist = ['T2_CH_CERN']

config.Site.storageSite = 'T2_CH_CERN'

result = crabCommand('submit', config = config)

print (result)

crab submit fails with "Block ...  contains more than 100000 lumis."

The message is self-explanatory. The CRAB server would die due to lack of memory if it had to process luminosity lists with millions of entries per block. There are two known cases where this can happen:
  • MC datasets which have been created with improper use of lumisections. MC lumi sections have no relation with luminosity but are used only to allow processing less than a file in one job via the split-by-lumi algorithm; in this case it makes no sense to have more lumis than events.
  • nanoAOD or similar super-extra-high compact event formats where one year of data fits in a few files
Those datasets can only be processed if CRAB can ignore the lumi-list information, i.e. using config.Data.splitting = 'FileBased' and avoiding any extra request which would eventually result in the need to use lumi information. This means no run range, no lumi mask, and no secondary dataset (since CRAB would need to use lumi info to match input files from the two datasets). Note that useParent is allowed, since in that case CRAB uses parentage information stored in DBS to match input files.

In practice your crabConfig file must have:

config.Data.splitting = 'FileBased'
config.Data.runRange = ''
config.Data.lumiMask  = ''

(the parameters with an assigned null value '' can be omitted, but if present must indicate the null string)

and must NOT contain the following parameter

config.Data.secondaryInputDataset

CRAB fails to resubmit some jobs

It is important that you as a user are prepared for this to happen and know how to remain productive in your physics analysis with the least effort. While there is a long tradition of "resubmit them until they work", this is hardly useful any more. And while we can't prevent users from trying, we can not guarantee that it will work, nor that resubmitted jobs will succeed.

In the case where the missing data sample is important, the best recommendation we can give to users is to
USE GENERAL/GENERIC RESCUE PROCEDURES, RATHER THAN TRY AT ALL COSTS TO REVIVE A DEAD TASK.
We will always welcome problem reports and will try to improve when resubmission failures can be due to CRAB internals, but surely you do not want to hold your breath in the meanwhile.

The safest path is therefore:

  1. let running jobs die or complete and dust settle
  2. use crab kill to make sure everything stops
  3. take stock of what's published in DBS at that point and make sure that it matches what's on disk
    • if your output is not in DBS, you can use crab report, but while DBS information is available forever, crab commands on a specific task may not be
  4. assess whether it is more important to get the last percentage of statistics or go on with other work. Do you really need 100% completion in this task ?
  5. if full statistics is needed, create a recovery task for the missing lumis and run it writing to the same dataset as the original task.

Recovery task is an important concept that can be useful in many circumstances. Please find instructions in this FAQ

At the other extreme there's: forget about this, and resubmit a new task with a new output dataset. In between lies a murky land where many recipes may be more efficient according to details, but no general simple rule can be given and there's space for individual creativity and/or desperation.

I get a "Syntax error in CRAB configuration"

When doing crab submit the user may get one of these error messages:

Syntax error in CRAB configuration:
invalid syntax (<CRAB-configuration-file-name>.py, <line-where-error-occurred>)

Syntax error in CRAB configuration:
'Configuration' object has no attribute '<attribute-name>'

Error explanation: The CRAB configuration file could not be loaded, because there is a syntax error somewhere in it.

What to do: Check the CRAB configuration file and fix it. There could be a misspelled parameter or section name, or you could be trying to use a configuration attribute (parameter or section) that was not defined. To get more details on where the error occurred, do:

python
import <CRAB-configuration-file-name> #without the '.py'

which gives:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<CRAB-configuration-file-name>.py", <line-where-error-occurred>
    <error-python-code>
                      ^

or

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<CRAB-configuration-file-name>.py", <line-where-error-occurred>, in <module>
    <error-python-code>
AttributeError: 'Configuration' object has no attribute '<attribute-name>'

For more information about the CRAB configuration file, see CRAB3ConfigurationFile.

Problems with the .requestcache file and/or the CRAB project directory

If a crab command fails with messages like "Cannot find .requestcache file" or "... is not a valid CRAB project directory", or otherwise complains that it can not find the task you are trying to send the command to, a problem with the local directory where crab submit caches relevant information is likely (maybe the disk got full or corrupted, or you removed a file unintentionally).

Please find more information about the CRAB project directory and possible recovery actions on your side in CRAB3Commands#CRAB_project_directory

Problems with job execution

Jobs status changes back to idle (my jobs are being resubmitted)

It means that jobs failed with a possibly transient error (like an error opening a file) and CRAB is automatically resubmitting them (up to 3 times). If the problem persists and you want to reproduce it locally, make sure to use exactly the same input file(s) as the failed job(s). Also check the job stdout; maybe the error only happens after some number of events and/or files are read.

Please see: https://twiki.cern.ch/twiki/bin/view/CMSPublic/CRAB3Troubleshoot

Exit code 8001

This indicates that cmsRun encountered a not better specified fatal exception. It usually means a problem in the user code or configuration. You should inspect the stdout of one job to find the exception message and traceback, which may guide you to the solution.

A particular case is when the exception says An exception of category 'DictionaryNotFound' occurred, like in this example:

----- Begin Fatal Exception 08-Jun-2017 18:18:04 CEST-----------------------
An exception of category 'DictionaryNotFound' occurred while
   [0] Constructing the EventProcessor
Exception Message:
No Dictionary for class: 'edm::Wrapper<edm::DetSetVector<CTPPSDiamondDigi> >'
----- End Fatal Exception -------------------------------------------------

In this case, most likely the input data have been produced with a CMSSW version not compatible with the one used in the CRAB job. In general, reading data with a release older than the one they were produced with is not supported.

To find out which release was used to produce a given dataset or file, adapt the following examples to your situation:

belforte@lxplus045/~> dasgoclient --query "release dataset=/DoubleMuon/Run2016C-18Apr2017-v1/AOD"
["CMSSW_8_0_28"]
belforte@lxplus045/~> 

belforte@lxplus045/~> dasgoclient --query "release file=/store/data/Run2016C/DoubleMuon/AOD/18Apr2017-v1/100001/56D1FA6E-D334-E711-9967-0025905A48B2.root" 
["CMSSW_8_0_28"]
belforte@lxplus045/~> 

Exit code 8021

cmsRun could open the file, but found a problem with its content:
  1. look in the job stdout for the exact error message; at times it may clearly indicate a mismatch between the objects requested by the code and those present in the file, i.e. the same problem as the one below
  2. if your job failed multiple times, check whether the file was always read from the same site or not; this may help to tell file corruption (most likely it will affect only one replica) from a software mismatch (which will affect all replicas)
  3. make sure that the CMSSW version you are using is compatible with the one used to create the file
  4. copy the file locally with xrdcp and reproduce the problem
  5. if there is no other explanation it is possible that the file is corrupted; ask the Data Transfer team to verify the checksum and, if needed, re-transfer or invalidate the file replica

Exit code 8028

Normally this has the same meaning as 8020: a needed file is not present at the site where it was expected to be, and the site needs to fix this. If the problem is still reproducible after O(1 day) or is happening in large amounts, write to the support list and it will be followed up.

Only in the special cases in which you intentionally try to read files from a site different from the one where jobs run, more investigation and action are needed. In those cases, keep reading the details below.

Exit code 8028 means "FileOpenError with fallback" (as documented here). That means that some input file could not be opened, neither in the first attempt from the local storage at the execution site nor in the fallback attempt from a remote site using AAA. Note that even if the file is not present at the execution site, the job will still try to find/open it from the local storage, and only in case of failure use the fallback procedure.

Leaving the AAA error aside for a while, the first thing to contemplate here is to understand why the file could not be loaded from the local storage. Was it because the file is not available at the execution site? And if so, was it supposed to be available? If not, can you force CRAB to submit the jobs to the site(s) where the file is hosted? CRAB should do that automatically if the input file is one from the input dataset specified in the CRAB configuration parameter Data.inputDataset, unless you have set Data.ignoreLocality = True, or except in cases like using a (secondary) pile-up dataset. If yours is the last case, please read Using pile-up in this same twiki.

If you intentionally wanted (and had a good reason) to run jobs reading the input files via AAA, then yes, we have to care about why AAA failed. Since AAA must be able to access any CMS site, the next thing is to discard a transient problem: you can submit your jobs again and see if the error persists. Ultimately, you should write to the Computing Tools HyperNews forum (this is a forum for all kind of issues with CMS computing tools, not only CRAB) following these instructions.

Exit code 50660

Exit code 50660 means "Application terminated by wrapper because using too much RAM (RSS)" (as documented here). The amount of RAM that a job can use on a grid node is always limited, and if the memory need keeps increasing as the job runs (a so-called "memory leak") the job will need to be killed. Grid sites used by CMS guarantee at least 2.5 GB of RAM per core, so allowing for some overhead, the CRAB default is to ask for 2 GB per job. This is usually enough to run full RECO, and user jobs should not normally need more. So the user's first action when getting this error is to make sure that the code is not leaking memory nor allocating uselessly large structures. If more RAM is really needed, it can be requested via the JobType.maxMemoryMB parameter in the CRAB configuration file. Requesting too much RAM is very likely to result in wasted CPU (fewer jobs will run than there are CPU cores available in a node, in order to spread the available RAM in fewer, larger chunks), so you have to be careful; abuse will be monitored and tasks may get killed.

Each user is responsible for her/his code and needs to make sure that memory usage is under control. Various tools exist to identify and prevent memory leaks in C++, which are outside the scope of the CRAB documentation. Generally speaking, when investigating memory usage you want to make sure that you run on the same input as a job which resulted in memory problems, as usage can depend on the number, sequence and kind of events processed. Users may also benefit from the crab preparelocal command to replay one specific job interactively and monitor memory usage.

An important exception is in case the user runs multi-threaded applications, in particular CMSSW. In that case a single job will use multiple cores and not only can, but must use more than the default 2GB of RAM. It is up to the user to request the proper amount of memory, e.g. after measuring it running the code interactively, or by looking up what Production is using in similar workflows. As a generic rule of thumb, (1+1*num_threads) GB may be a good starting point.
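
As a worked example of that rule of thumb, for a hypothetical 8-thread cmsRun job:

# (1 + 1*8) GB = 9 GB total memory for the whole job (illustrative, measure your own code)
config.JobType.numCores = 8
config.JobType.maxMemoryMB = 9000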

Illegal parameter found in configuration. The parameter is named: 'numberEventsInLuminosityBlock'

The most common reasons for this error are:
  1. The user is trying to analyze an input dataset, but he/she has specified in the CRAB configuration file JobType.pluginName = 'PrivateMC' instead of JobType.pluginName = 'Analysis'.
  2. The user is generating MC events, correctly specifying in the CRAB configuration file JobType.pluginName = 'PrivateMC', but in the CMSSW parameter set configuration he/she has specified a source of type PoolSource. The solution is to not specify a PoolSource. Note: This doesn't mean to remove process.source completely, as this attribute must be present. One could set process.source = cms.Source("EmptySource") if no input source is used.
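
A minimal sketch of the fix for the second case, in the CMSSW parameter-set configuration:

# keep process.source defined, but do not use a PoolSource when generating events
process.source = cms.Source("EmptySource")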

Segmentation Fault (exit code 11 or 139)

Usually segmentation faults are well reproducible and can be debugged by running locally on the same input files as the CRAB job (e.g. using [[https://twiki.cern.ch/twiki/bin/view/CMSPublic/CRAB3Commands#crab_preparelocal][crab preparelocal]]). Here are some general hints on how to tackle them, from https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/5166/2.html :

A segfault is typically caused by invalid memory access (e.g. reading out of bounds of an array or dereferencing a null or random pointer). A simple step forward is to recompile the offending code with debug symbols, e.g.

  • USER_CXXFLAGS="-g" scram b
and run again. Then the stack trace will show the source file and line number where the segfault occurred. If the cause is not evident, you can add printouts or use gdb.

[ERROR] Operation expired

Some site configurations can not handle remote access of large files (> 10 GB) and XRootD fails with a message like:
== CMSSW:    [1] Reading branch EventAuxiliary
== CMSSW:    [2] Calling XrdFile::readv()
== CMSSW:    Additional Info:
                 [a] Original error: '[ERROR] Operation expired' (errno=0, code=206, source=xrootd.echo.stfc.ac.uk:1094 (site T1_UK_RAL)).
As of winter 2019 this almost only happens for files stored at T1_UK_RAL. If you are in this situation, a way out is to submit a new task using CMSSW ≥ 10_4_0 with the following duplicateCheckMode option in the PSet PoolSource
process.source = cms.Source("PoolSource",
   [...]
   duplicateCheckMode = cms.untracked.string("noDuplicateCheck")
)

When that is not an option and the problem is persistent, you may need to ask for a replica of the data at another site.

CRAB Client API

Multiple submission fails with a CMSSW "duplicate process" or "A different CMSSSW configuration was already cached" error

The general problem is that CMSSW parameter-set configurations don't like to be loaded twice. In that respect, each time the CRAB client loads a CMSSW configuration, it saves it in a local (temporary) cache identifying the loaded module with a key constructed out of the following three pieces: the full path to the module and the python variables sys.path and sys.argv.

A problem arises when the CRAB configuration parameter JobType.pyCfgParams is used. The arguments in JobType.pyCfgParams are added by CRAB to sys.argv, affecting the value of the key that identifies a CMSSW parameter-set in the above mentioned cache. And that's in principle fine, as changing the arguments passed to the CMSSW parameter-set may change the event processor. But when a python process has to do more than one submission (like the case of multicrab for multiple submissions), the CMSSW parameter-set is loaded again every time the JobType.pyCfgParams is changed and this may result in "duplicate process" errors. Below are two examples of these kind of errors:

CmsRunFailure
CMSSW error message follows.
Fatal Exception
An exception of category 'Configuration' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing module: class=...... label=......
Exception Message:
Duplicate Process The process name ...... was previously used on these products.
Please modify the configuration file to use a distinct process name.

CmsRunFailure
CMSSW error message follows.
Fatal Exception
An exception of category 'Configuration' occurred while
   [0] Constructing the EventProcessor
Exception Message:
MessageLogger PSet:
in vString categories duplication of the string ......
The above are from MessageLogger configuration validation.
In most cases, these involve lines that the logger configuration code
would not process, but which the cfg creator obviously meant to have effect.

FATAL ERROR: A different CMSSSW configuration was already cached.
 Either configuration file name or configuration parameters have been changed.
 But CMSSW configuration files can't be loaded more than once in memory.

One option would be to try to not use JobType.pyCfgParams. But if this is not possible, the more general ad-hoc solution would be to fork the submission into a different python process. For example, if you are doing something like documented in Multicrab using the crabCommand API then we suggest to replace each

submit(config)

by

from multiprocessing import Process
p = Process(target=submit, args=(config,))
p.start()
p.join()

(Of course, from multiprocessing import Process needs to be executed only once, so put it outside any loop.)

Multiple submission produces different PSetDump.py files

If the PSetDump.py file (found in task_directory/inputs) differs for the tasks from a multiple-submission python file, try forking the submission into different python processes, as recommended in the previous FAQ.

More on CRAB tasks

Recovery task

Recovery task: What

Recovery task is an important concept that can be useful in many circumstances.

The general idea is that a CRAB task has run to completion, all re-submission attempts done, but some of the necessary input data was not processed. A recovery task will run the same executable and configuration on the missed input data, and will add results to the same output destination (and DBS dataset) as the original task.

Recovery task: Why

CRAB developers try hard to give you a tool with perfect bookkeeping and full automation which brings each task to 100% success. So do the operators of the global CMS submission infrastructure (aka HTCondor pool, aka glideIn) and the administrators of the many sites that contribute hardware resources for CMS data analysis. Yet at times things can go wrong, and we may not be able to investigate and fix every small glitch, and surely never within hours or days.

It is impossible to guarantee that a given task will always complete to 100% success in a short amount of time. At the same time it is impossible to make sure that all desired input data is available when the task is submitted. Moreover, both good sense and experience show that the larger a task is, the larger the chance that it hits some problem. Large workflows therefore benefit from the possibility to run them sort of iteratively, with a short (hopefully one or two at most) succession of smaller and smaller tasks.

Recovery task: When

A partial list of real life events where a recovery task is user's fastest and simplest way to get work done:
  • Something went wrong in the global infrastructure and some jobs are lost beyond recovery
  • Something went wrong inside CRAB (bugs, hardware...) which can't be fixed by crab resubmit command
  • Some site went down for longer than it makes sense to keep jobs in the queue
  • Some data was not available and had to be retransferred and took longer than... see above
  • More data have been added to the input dataset since the original task ran (pretty much as the above)
  • A new lumimask was prepared were lumis declared bad earlier are now good
  • ... more ...

Recovery task: How

You must of course still have the original

  • scram project area
  • crab configuration file, including the pset and any other file referenced in there
  • original CRAB project directory, if you did not publish output in DBS

The procedure to generate a recovery task is based on these simple steps:

  1. issue a crab kill. Killing the current task guarantees that no further changes happen
  2. make a list of lumis present in the desired input dataset (listIn)
  3. make a list of lumis successfully processed by the original CRAB task A (listA)
  4. submit a new CRAB task B which processes the missing lumis (listB = listIn - listA)

Details are slightly different depending on whether you published the output in DBS or not:

output in DBS
follow the procedure in this FAQ
output not in DBS
follow the procedure in this Workbook example
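
As an illustration of step 4, here is a minimal sketch of the lumi arithmetic for the case where the output was not published in DBS. The JSON file names are placeholders: for listA use the lumi JSON produced by crab report for Task-A (e.g. the LumiSummary.json mentioned in the next FAQ), and for listIn use whatever JSON describes the lumis you wanted to process (e.g. your original lumi-mask).

from WMCore.DataStructs.LumiList import LumiList

listIn = LumiList(filename='inputDatasetLumis.json')   # placeholder: lumis you wanted to process
listA  = LumiList(filename='processedLumis.json')      # placeholder: lumis Task-A actually processed
listB  = listIn - listA
listB.writeJSON('recovery_lumi_mask.json')
# then in the recovery task configuration:
# config.Data.lumiMask = 'recovery_lumi_mask.json'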

Dealing with a growing input dataset and/or changing lumi-mask

While data taking is progressing, the corresponding datasets in DBS and lumi-mask files keep growing. Data quality is also sometimes re-assessed for already existing data, leading to updated lumi-masks which, compared to older ones, include luminosity sections that were previously filtered out. Both situations lead to the common case where one would like to run a task (let's call it task B) over an input dataset already partially analyzed in a previous task (let's call it task A), where task B should skip the data already analyzed in task A.

This can be accomplished with a few lines in the CRAB configuration file, see an annotated example below.

from CRABClient.UserUtilities import config, getLumiListInValidFiles
from WMCore.DataStructs.LumiList import LumiList

config = config()

config.General.requestName = 'TaskB'
...
 # you want to use same Pset as in previous task, in order to publish in same dataset
config.JobType.psetName = <TaskA-psetName>
...
# and of course same input dataset
config.Data.inputDataset = <TaskA-input-dataset-name>
config.Data.inputDBS = 'global'  # but this will work for a dataset in phys03 as well

# now the list of lumis that you successfully processed in Task-A
# it can be done in two ways. Uncomment and edit the appropriate one:
#1. (recommended) when Task-A output was a dataset published in DBS
#taskALumis = getLumiListInValidFiles(dataset=<TaskA-output-dataset-name>, dbsurl='phys03')
# or 2. when output from Task-A was not put in DBS
#taskALumis = LumiList(filename=<the LumiSummary.json file from running crab report on Task-A>)

# now the current list of golden lumis for the data range you are interested in; it can be different from the one used in Task-A
officialLumiMask = LumiList(filename='<some-kosher-name>.json')

# this is the main trick: mask out also the lumis which you processed already
newLumiMask = officialLumiMask - taskALumis

# write the new lumiMask file, now you can use it as input to CRAB
newLumiMask.writeJSON('my_lumi_mask.json')
# and there we go: process from the input dataset all the lumis listed in the current officialLumiMask file, skipping the ones you already have
config.Data.lumiMask = 'my_lumi_mask.json' 
config.Data.outputDatasetTag = <TaskA-outputDatasetTag> #  add to your existing dataset
...

IMPORTANT NOTE: in this way you will also pick up any lumi section in the initial dataset that was turned from bad to good in the golden list after you ran Task-A; but if some of those data evolved the other way around (from good to bad), there is no way to remove them from your published dataset.

Using pile-up

Important Instructions:
Make sure you run your jobs at the sites where the pile-up sample is hosted, not where the signal is.
This requires you to override the location list that CRAB would extract from the inputDataset.

Rationale and details:
The pile-up files have to be specified in the CMSSW parameter-set configuration file. There is no way yet to tell in the CRAB configuration file that one wants to use a pile-up dataset as a secondary input dataset. That means that CRAB doesn't know that the CMSSW code will want to access pile-up files; CRAB only knows about the primary input dataset (if any). This means that, assuming there is a primary input dataset, when CRAB does data discovery to figure out to which sites it should submit the jobs, it will only take into account the input dataset specified in the CRAB configuration file (in the Data.inputDataset parameter) and submit the jobs to sites where this dataset is hosted. If there is no primary input dataset, CRAB will submit the jobs to the least busy sites. In any case, if the pile-up files are not hosted at the execution sites, they will be accessed via AAA (Xrootd). But reading the "signal" events directly from the local storage and the pile-up events via AAA is much less efficient than doing it the other way around, since for each "signal" event that is read one needs in general to read many (> 20) pile-up events. Therefore, it is highly recommended that the user forces CRAB to submit the jobs to the sites where the pile-up dataset is hosted, by whitelisting these sites using the parameter Site.whitelist in the CRAB configuration file. Note that, when using a primary input dataset, one also needs to set Data.ignoreLocality = True in the CRAB configuration file, so that CRAB does not do data discovery and complain (and fail to submit) that the input dataset is not available at the whitelisted sites. One can use DAS to get the list of sites that host a dataset; see the configuration sketch below.
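
A minimal sketch of the relevant configuration lines (the site names are placeholders; use the sites that actually host your pile-up dataset, as found in DAS):

config.section_("Site")
config.Site.whitelist = ['T2_DE_DESY', 'T2_IT_Pisa']  # placeholder: sites hosting the pile-up dataset
# only needed when there is a primary input dataset, so that CRAB does not do data discovery
# and refuse to submit because that dataset is not hosted at the whitelisted sites
config.Data.ignoreLocality = True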

Miscellanea

How can I tell CRAB to run jobs on a GPU ?

If your code can make use of a GPU coprocessor, you can tell CRAB to add the requirement that jobs in your task run on cores which have a GPU available. Of course this will be ANDed with any other requirement on data, memory, number of cores, time etc. This is done by adding to the CRAB configuration file:

config.section_("Site")
config.Site.requireAccelerator = True

You can view the sites and the available GPUs on the CMS Submission Infrastructure: GPUs monitor dashboard. Be careful not to send your GPU jobs to a site that does not have a GPU.

Users who find the current functionality insufficient and have a clear use scenario to describe are encouraged to do so with comments on dmwm/CRABServer #6989.

How can I use CRAB to submit to my local batch system

CRAB3 only supports submission to the CMS global pool, an HTCondor pool comprised of glideinWms pilots running on distributed grid sites. But it is possible to use the CRAB machinery for data discovery, job splitting, and job preparation (i.e. to configure a set of jobs to be executed to process a given input dataset) and then execute those jobs locally, interactively or in the user's preferred batch system.

In this case the CRAB Server machinery will not be involved, and the following differences must be kept in mind:

  • crab submit will not be used and no crab commands will make sense
  • bookkeeping and resubmissions will be on the user's side
  • there will be no stageout and no publication; it will be up to users to use the local batch system machinery to take care of output retrieval
  • CRAB will not prepare batch-system-specific submission instructions, only one generic script to be executed in each job and a set of files to be sent together with that script so that it can customize itself with the input specifics of each single job
  • it will be up to the user to find and use whatever feature is available in the local system to pass a different numeric argument to each job, so that they execute as job 1...N in the task, rather than as N copies of job 1
  • if the running jobs require a grid proxy (e.g. to use xrootd to read from remote sites) it is the user's responsibility to take care of it

The way to do this is via the crab preparelocal command. Please refer to crab preparelocal help

There is an example of how to use this to submit one CRAB task on the CERN HTCondor batch system. Those instructions will also work for FNAL LPC HTCondor.

How CRAB finds data in input datasets from DBS

The following remarks apply to the main input dataset provided to CRAB via the Data.inputDataset configuration parameter:

  • Dataset status: Datasets in DBS can have different status. This is controlled by Production and the norm is to use VALID datasets. Datasets with a different status may occasionally be useful, e.g. for comparison or for a dedicated study of the problems which led them to be deprecated. In order to run over a dataset whose status is not VALID, one has to set Data.allowNonValidInputDataset = True in the CRAB configuration (see the snippet after this list).
  • File status: Files have a is_file_valid flag in DBS, usually set to False when file is lost or corrupted. CRAB considers only valid files in the dataset. Invalid files are skipped.
  • Data location: A dataset in DBS is divided in blocks. Blocks can be migrated with PhEDEx, and PhEDEx is the only service that knows about the current locations (host sites) of a dataset block. Therefore CRAB queries PhEDEx to retrieve the locations of the blocks in a dataset. Next, a data location (aka PNN = PhEDEx Node Name) is turned into a site where to run (aka PSN = Processing Site Name) using CRIC. If a block has no valid locations in PhEDEx or no PSN associated in CRIC, CRAB skips the block.
  • User datasets: For datasets created by users and published in DBS phys03 instance, the above is modified as follows:
    • Dataset status and File status flags are initially set to VALID by CRAB when the dataset is published; they can then be changed by the user.
    • Data block location is tracked as origin_site_name in DBS and data are assumed to never move. If datasets are moved, the user can update the origin_site_name. There is no way to have multiple locations.
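
For reference, a minimal configuration snippet for running over a non-VALID dataset, illustrating the parameter mentioned in the first bullet above (the dataset name is a placeholder):

config.Data.inputDataset = '/SomePrimary/SomeProcessing-v1/AOD'  # placeholder dataset name
config.Data.allowNonValidInputDataset = True  # accept a dataset whose DBS status is not VALID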

How many jobs can I run at the same time ?

CRAB runs jobs on the Grid using a global HTCondor pool created via the glideInWms machinery; think of it as a global batch system with execution nodes all over the world. The most important thing which controls how many jobs you can run is the overall number of execution slots (CPUs) available for your jobs, i.e. those that match your requirements on data access, memory and running time. Then HTCondor tries hard to give every user the same share of computing resources, i.e. equal resources to everyone at any given time. You are not penalized for having run more jobs yesterday, and not rewarded either for not having used your share in the past. To assess the user share, HTCondor considers only the number of cores that you are using (until May 2017 the number of GB of RAM was also accounted for).

Thus beware of asking for much more memory per core than you need (see What is the maximum memory per job (maxMemoryMB) I can request?).
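
If you do need more memory than the default, request it explicitly and keep the value close to what your jobs actually use; a minimal illustration (the 2500 MB value is just an example):

config.JobType.maxMemoryMB = 2500  # example value only: request just what your jobs really need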

How to predict how long my jobs will run for?

Use the --dryrun option when doing crab submit. See crab submit --dryrun.

How much time do I have to complete a task

Tasks are purged from the system 30 days after submission to make room for new ones. You need to actively follow up on submitted tasks, make sure your output is collected and bookkeeping completed before that time.

What do I do if my task can not complete in 30 days

If things are not done by then, chances are that something in the task submission was wrong, or the jobs critically depend on one particular site which is down or overfull. In any case, usually after a week or so every job in the task has been submitted at least once. It is better to stop, take stock of the available output, review things and submit a new task for the remaining data, possibly improving something. CRAB automatically retries jobs up to 3 times in cases where the error message hints at a possible transient problem. There is also evidence that user resubmission of jobs which failed because of too much wall clock time or too much memory can sometimes succeed (as long as those are isolated cases). But blindly hitting the "crab resubmit" button in the hope that things change is a pointless waste of resources and will only delay completion of your workflow. We have seen jobs resubmitted tens of times, failing every time with the same error!

Why are tasks removed after 30 days

High level information on each task (name, configuration parameters, input...) is kept almost forever in the CRAB DataBase. But in order to run a task we need a working DAGMAN process on an HTCondor scheduler machine, which takes memory and disk space. We can not keep those around forever, or the system would grind to a halt. So we need a time after which "everything goes". This is currently configured to be 30 days, based on our operational experience, in order to keep the system running smoothly.

I can not find information on my old tasks (can not resubmit etc..)

Tasks are purged from the system 30 days after submission to make room for new ones. See above FAQ


How to list/copy/remove files/directories in a storage element area?

You can use the gfal-* commands from a machine that has GFAL2 utility tools installed (e.g. lxplus). You have to pass Physical File Names (PFNs) as arguments to the commands. To get the Physical File Name given a Logical File Name and a CMS node name, you can use the lfn2pfn PhEDEx API. LFNs are names like /store/user/mario/myoutput; note that a directory is also a file name.

For example, for the LFN /store/user/username/myfile.root stored at T2_IT_Pisa you can do the following (make sure you did cmsenv before, so as to use a recent version of curl), where you can replace the first two lines with the values that are useful to you and simply copy/paste the long curl command:

site=T2_IT_Pisa
lfn=/store/user/username/myfile.root
curl -ks "https://cmsweb.cern.ch/phedex/datasvc/perl/prod/lfn2pfn?node=${site}&lfn=${lfn}&protocol=srmv2" | grep PFN | cut -d "'" -f4

which returns:

srm://stormfe1.pi.infn.it:8444/srm/managerv2?SFN=/cms/store/user/username/myfile.root

Before executing the gfal commands, make sure to have a valid proxy:

voms-proxy-init -voms cms

Enter GRID pass phrase for this identity:
Contacting voms2.cern.ch:15002 [/DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch] "cms"...
Remote VOMS server contacted succesfully.


Created proxy in /tmp/x509up_u<user-id>.

Your proxy is valid until <some date-time 12 hours in the future>

The most useful gfal commands and their usage syntax for listing/removing/copying files/directories are in the examples below (it is recommended to unset the environment when executing gfal commands, i.e. to add env -i in front of the commands). See also the man entry for each command (man gfal-ls etc.):

List a (remote) path:

env -i X509_USER_PROXY=/tmp/x509up_u$UID gfal-ls <physical-path-name-to-directory>

Remove a (remote) file:

env -i X509_USER_PROXY=/tmp/x509up_u$UID gfal-rm <physical-path-name-to-file>

Recursively remove a (remote) directory and all files in it:

env -i X509_USER_PROXY=/tmp/x509up_u$UID gfal-rm -r <physical-path-name-to-directory>

Copy a (remote) file to a directory in the local machine:

env -i X509_USER_PROXY=/tmp/x509up_u$UID gfal-copy <physical-path-name-to-source-file> file://<absolute-path-to-local-destination-directory>
Note: the <absolute-path-to-local-destination-directory> starts with / therefore there are three consecutive / characters like file:///tmp/somefilename.root

Why are my jobs submitted to a site that I had explicitly blacklisted (not whitelisted)?

There is a site overflow mechanism in place, which kicks in after CRAB submission. Sites are divided in regions of good WAN/xrootd connectivity (e.g. US, Italy, Germany etc.); jobs queued at one site A for too long are then allowed to overflow to a well connected site B which does not host the requested input data but from which the data will be read over xrootd. The rationale is that even if those jobs were to fail, due to being unable to read the data or to a problem at site B, they would be automatically resubmitted, so nothing is lost with respect to keeping those jobs idle in the queue waiting for free slots at site A. The site overflow can be turned off via the Debug.extraJDL CRAB configuration parameter:

config.section_("Debug")
config.Debug.extraJDL = ['+CMS_ALLOW_OVERFLOW=False']

Note: this option cannot be changed for an already-created task (for instance if you notice a lot of job failures at a particular site and, even after blacklisting it, jobs keep going back there). You cannot simply change the option in the configuration and resubmit: you have to kill the existing task and submit a new one for the option to be accepted.

What is glideinWms Overflow and how can I avoid using it ?

See above FAQ

Doing lumi-mask arithmetics

There is a tool written in python called LumiList.py (available in the WMCore library; it is the same code as cmssw/FWCore/PythonUtilities/python/LumiList.py) that can be used to do lumi-mask arithmetics. The arithmetics can even be done inside the CRAB configuration file (that's the advantage of having the configuration file written in python). Below are some examples.

Example 1: A run range selection can be achieved by selecting from the original lumi-mask file the run range of interest.

from WMCore.DataStructs.LumiList import LumiList

lumiList = LumiList(filename='my_original_lumi_mask.json')
lumiList.selectRuns([x for x in range(193093,193999+1)])
lumiList.writeJSON('my_lumi_mask.json')

config.Data.lumiMask = 'my_lumi_mask.json'

Example 2: Use a new lumi-mask file that is the intersection of two other lumi-mask files.

from WMCore.DataStructs.LumiList import LumiList

originalLumiList1 = LumiList(filename='my_original_lumi_mask_1.json')
originalLumiList2 = LumiList(filename='my_original_lumi_mask_2.json')
newLumiList = originalLumiList1 & originalLumiList2
newLumiList.writeJSON('my_lumi_mask.json')

config.Data.lumiMask = 'my_lumi_mask.json'

Example 3: Use a new lumi-mask file that is the union of two other lumi-mask files.

from WMCore.DataStructs.LumiList import LumiList

originalLumiList1 = LumiList(filename='my_original_lumi_mask_1.json')
originalLumiList2 = LumiList(filename='my_original_lumi_mask_2.json')
newLumiList = originalLumiList1 | originalLumiList2
newLumiList.writeJSON('my_lumi_mask.json')

config.Data.lumiMask = 'my_lumi_mask.json'

Example 4: Use a new lumi-mask file that is the subtraction of two other lumi-mask files.

from WMCore.DataStructs.LumiList import LumiList

originalLumiList1 = LumiList(filename='my_original_lumi_mask_1.json')
originalLumiList2 = LumiList(filename='my_original_lumi_mask_2.json')
newLumiList = originalLumiList1 - originalLumiList2
newLumiList.writeJSON('my_lumi_mask.json')

config.Data.lumiMask = 'my_lumi_mask.json'

Obsolete/Deprecated stuff (kept here as permanent documentation just in case)

User quota in the CRAB scheduler machines

Each user has a home directory with 100GB of disk space in each of the scheduler machines (schedd for short) assigned to CRAB3 for submitting jobs to the Grid. Whenever a task is submitted by the CRAB server to a schedd, a task directory is created in this space containing, among other things, the CRAB libraries and scripts needed to run the jobs. Log files from Condor/DAGMan and CRAB itself are also placed there. (What is not available in the schedds are the cmsRun log files, except for the snippet available in the CRAB job log file.) As a guidance, a task with 100 jobs uses on average 50MB of space, but this number depends a lot on the number of resubmissions, since each resubmission produces its own log files. If a user reaches his/her quota in a given schedd, he/she will not be able to submit more jobs via that schedd (he/she may still be able to submit via another schedd, but since the user can not choose the schedd to which to submit -the choice is done by the CRAB server-, he/she would have to keep trying the submission until the task goes to a schedd with non-exhausted quota). To avoid that, task directories are automatically removed from the schedds 30 days after their last modification. If a user reaches 50% of his/her quota in a given schedd, an automatic e-mail similar to the one shown below is sent to him/her.

Subject: WARNING: Reaching your quota

Dear analysis user <username>,

You are using <X>% of your disk quota on the server <schedd-name>. The moment you reach the disk quota of <Y>GB, you will be unable to
run jobs and will experience problems recovering outputs. In order to avoid that, you have to clean up your directory at the server. 
Here are the instructions to do so:
 https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrabFaq#How_to_clean_up_your_directory_i
Here it is a more detailed description of the issue:
 https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrabFaq#Disk_space_for_output_files
If you have any questions, please contact hn-cms-computing-tools(AT)cern.ch
 Regards,
CRAB support

This e-mail has a link (https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrabFaq#How_to_clean_up_your_directory_i) to the instructions on how to clean up space in the user's home directory in a schedd. A user can follow the instructions in that page, or alternatively use the crab purge command.

crab-env-bootstrap.sh script to overcome the CRAB3 and CMSSW environment conflicts

To overcome the CRAB3 vs CMSSW environment conflicts, you can use the following script available in CVMFS (/cvmfs/cms.cern.ch/crab3/crab-env-bootstrap.sh) without the need to source the CRAB3 environment. You could do something like this:

cmsenv
# DO NOT setup the CRAB3 environment
alias crab='/cvmfs/cms.cern.ch/crab3/crab-env-bootstrap.sh'
crab submit
crab status
...
# check that you can run cmsRun locally

Details:

The usual way to setup CRAB3 is to first source the CMSSW environment using cmsenv and then source the CRAB3 environment using source /cvmfs/cms.cern.ch/crab3/crab.(c)sh. This setup procedure has the disadvantage that, depending on which CMSSW version is used, once the CRAB3 environment is sourced CMSSW commands like cmsRun stop working (other useful commands like gfal-copy will not work either). Solving this at the root by making the CRAB client RPM compatible with the CMSSW ones is not possible, because of the way the tools in the COMP repository are built and because cmsweb has its own release cycle, independent from CMSSW.

To overcome this limitation we are now providing a wrapper bash script that can be run in place of the usual crab command. This wrapper script takes care of setting the environment in the correct way before running the usual crab command, and leaves the environment as it was when exiting. The script will soon be available in the CMSSW distribution under the name 'crab' and its usage will be transparent to the user: you will just run the crab commands as you would have done before. In the meantime, the script is available for testing here: /cvmfs/cms.cern.ch/crab3/crab-env-bootstrap.sh.

ERROR: SyntaxError: invalid syntax (Mixins.py, line 714)

The problematic file is FWCore/ParameterSet/Mixins.py from CMSSW:

[line 714] p = tLPTest("MyType",** { "a"+str(x): tLPTestType(x) for x in xrange(0,300) } )

This line uses dictionary comprehensions, a feature available starting from python 2.7. While CMSSW (setup via cmsenv) uses python 2.7 or newer, CRAB (setup via /cvmfs/cms.cern.ch/crab3/crab.sh) still uses python 2.6.8. To overcome this problem, don't setup the CRAB environment and instead use the crab-env-bootstrap.sh script (see this FAQ).

LeonardoCristella - 2018-04-23
