
CRAB3 Frequently Asked Questions


Notice: This is a large page; it works best if you search it for your problem using the browser search function.
By default all answers are collapsed and the search then only covers the question text; expand the answers if you want to search inside them as well.

If you still have problems, check the CRAB3Troubleshoot guide before asking for support.


Certificates, proxies and all that stuff

crab command fails with Impossible to retrieve proxy from myproxy.cern.ch ...

This can be due to a stale credential in myproxy. CRAB client always tries to keep a valid one there, but there are some known edge cases where this fails, e.g. https://github.com/dmwm/CRABServer/issues/5168.

Therefore you should remove the credentials from myproxy and then issue the crab command again. To remove stale credentials:

grep myproxy-info <CRAB project directory>/crab.log
# example:  grep myproxy-info crab_20160308_140433/crab.log 
you will get something like
 command: myproxy-info -l ec95456d3589ed395dc47d3ada8c94c67ee588f1 -s myproxy.cern.ch

then simply issue a myproxy-destroy command with same arguments:

# example. In real life replace the long hex string with the one from your crab.log
myproxy-destroy -l ec95456d3589ed395dc47d3ada8c94c67ee588f1 -s myproxy.cern.ch

If things still fail after that, send the following additional info in your request for support, replacing the long hex string with the one that you found in crab.log (ec95456d3589ed395dc47d3ada8c94c67ee588f1 in the above example):

  • output of voms-proxy-info -all
  • output of myproxy-info -d -l <long-hex-string> -s myproxy.cern.ch
  • content of your crab.log as an attachment

CRAB setup

Does CRAB setup conflict with CMSSW setup?

No. The CRAB client runs within the CMSSW environment.
Make sure you always do cmsenv before sourcing /cvmfs/cms.cern.ch/crab3/crab.sh

I need to use an (old?) CMSSW release where the CRAB client fails. What can I do?

You can use the command below to get a fully consistent environment for CRAB, but be aware that cmsRun will not work anymore after that; you will need a separate shell for cmsRun:
  • source /cvmfs/cms.cern.ch/crab3/crab_standalone.sh

CRAB configuration file

Documentation about the CRAB configuration file: CRAB3ConfigurationFile.

What is the maximum memory per job (maxMemoryMB) I can request?

CRAB requests by default a maximum memory of 2000 MB. This is the maximum memory per core that all sites guarantee to provide. Some sites, but not many, offer a bit more (typically 2500 MB); and some sites even offer 4000 MB for special users. Memory limits offered by each site are accessible in the GlideinWMS VO Factory Monitor page, http://glidein.grid.iu.edu/factory/monitor/ (choose "Current Status of the Factory" and click on a site CE listed under "Entry Name" in the table), but this should not be considered authoritative documentation. The best advice we can give is: stick to the default, and if you think you need more, find out first if there are sites (and which ones) that can run such jobs. If you need help, you can write to us.

note.gif Note: In case of a multi-threaded job (config.JobType.numCores > 1) most likely the default memory value is not enough. The user share of computing resources accounts for the requested memory per core.
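
For example, a minimal sketch of the relevant configuration lines for a hypothetical 4-thread job (the numbers are only illustrative, not a recommendation; check first which sites can satisfy such a request):

config.JobType.numCores = 4
# maxMemoryMB is the total memory for the whole job, not per core;
# a common rule of thumb for multi-threaded CMSSW is (1 + 1*num_threads) GB
config.JobType.maxMemoryMB = 5000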

What is the 'Automatic' splitting mode?

The Data.splitting parameter now has a default value: 'Automatic'.
With such a setting the task processing is split into three stages:
  1. A "probe" stage, where some probe jobs are submitted to estimate the event throughput of the CMSSW parameter-set configuration provided by the user in the JobType.psetName parameter and possible further arguments. Probe jobs have a job id of the form 0-[1,2,3,...], they can not be resubmitted and the task will fail if none of the probe jobs complete successfully. The output files transfer is disabled for probe jobs.
  2. A "main" stage, very similar to the conventional stage for other splitting modes, in which a number of main jobs (automatically determined by the probe stage) will process the dataset. These jobs can not be manually resubmitted and have a fixed maximum runtime (specified in the Data.unitsPerJob parameter), after which they gracefully stop processing input data. The remaining data will be processed in the next stage (tail) and their jobs labelled as "rescheduled" in the main stage (in the dashboard they will always appear as "failed").
  3. Some possible "tail" stages. If some main job does not finish successfully ("rescheduled" in the previous stage) or does not completely process the amount of data assigned to it due to the automatically configured maximum job run time, tail jobs are created and submitted in order to fully process the dataset. Tail jobs have a job id of the form n-[1,2,3,...], where n=1,2,... represents the tail stage number. For small tasks, less than 100 jobs, one tail stage is started when all jobs have completed (successfully or failed). For larger tasks, a first tail stage collects all remaining input data from the first 50% of completed jobs, followed by a stage that processes data when 80% of jobs have completed, and finally a stage collecting leftover input data at 100% job completion.
    Failed tail jobs can be manually resubmitted by the users.

Once the probe stage is completed, the plain crab status command shows only the main and tail jobs. For the list of all jobs add the --long option.
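
A minimal sketch of the relevant configuration lines (the runtime value is only illustrative; with 'Automatic' splitting, Data.unitsPerJob specifies the target runtime per main job, in minutes):

config.Data.splitting = 'Automatic'
# target runtime per main job, in minutes (illustrative value)
config.Data.unitsPerJob = 480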

CRAB cache

User quota in the CRAB cache

The CRAB User File Cache, or CRAB cache for short, is the place where:
  • the CRAB client puts the user input sandboxes when submitting a task to the CRAB server;
  • the CRAB client puts the crab.log files when they are uploaded via crab uploadlog;
  • the CRAB server puts the archives produced by the dry run submissions;
  • etc.
Each user has a quota of 4.88GB in the CRAB cache. If this limit is reached, new submissions will fail. Files in the CRAB cache are automatically deleted after 5 days. If a user nevertheless reaches the quota limit, he/she can free some space manually. See how in this FAQ.

What is the maximum allowed size of the user input sandbox?

100 MB.

What are the files CRAB adds to the user input sandbox?

CRAB adds to the user input sandbox the following directories/files; when a directory is included, all the contained subdirectories are also recursively included:
  • The directories $CMSSW_BASE/lib, $CMSSW_BASE/biglib and $CMSSW_BASE/module. One can also tell CRAB to include the directory $CMSSW_BASE/python by setting JobType.sendPythonFolder = True in the CRAB configuration.
  • Any data and interface directory recursively found in $CMSSW_BASE/src.
  • All additional directories/files specified in the CRAB configuration parameter JobType.inputFiles.
  • The original CRAB configuration file (added as debug/crabConfig.py).
  • The original CMSSW parameter-set configuration file (added as debug/originalPSet.py).
  • The tweaked CMSSW parameter-set configuration file in pickle format (added as PSet.pkl) plus a simple PSet.py file to load the pickle file.

How are the inputFiles handled in the user input sandbox?

Depending on whether filenames or directories are used in the config.JobType.inputFiles parameter, the directory structure inside the sandbox may be different and affect where the files are placed in the working directory of the job.
  • Specific file names are always added to the root directory of the sandbox, whether an absolute or relative file name is used. For example, both /afs/cern.ch/user/e/erupeika/supportFiles/foo.root or myfiles/foo.root will appear as foo.root in the sandbox and will be extracted as foo.root to the job's root working directory.
  • The directory structure inside each additional input file directory is maintained in the sandbox. The additional directories themselves will be located in the root directory of the sandbox. For example, if a directory foo with files bar1 and bar2 inside it is specified in the inputFiles parameter, the sandbox will contain foo, foo/bar1 and foo/bar2 (the working directory of the job will therefore also contain a directory foo with files bar1 and bar2).
  • For example, if your application expects to find mydir/file1, you should put config.JobType.inputFiles = ['mydir'] in the CRAB configuration (and of course avoid having extra stuff in that directory), while if you put config.JobType.inputFiles = ['mydir/file1'] your application needs to open file1 (see the sketch below).
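
A minimal sketch putting the two cases above together (file and directory names are hypothetical):

# a file listed by name lands in the root of the sandbox and therefore in the
# job's working directory as foo.root, wherever it came from;
# a directory is shipped whole, so the job will find mydir/file1
config.JobType.inputFiles = ['/afs/cern.ch/user/x/xyz/supportFiles/foo.root', 'mydir']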

How can I clean my area in the CRAB cache?

One can use the crab purge command to delete from the CRAB cache the files associated with a given task. Actually, crab purge deletes only user input sandboxes (because there is no API to delete other files), but since they are supposed to be the main space consumers in the CRAB cache, this should be enough. If for some reason the crab purge command does not work, one can alternatively use the REST interface of the crabcache component. Instructions oriented to CRAB3 operators can be found here. Jordan Tucker has written the following script, based on these instructions, that removes all the input sandboxes from the user's CRAB cache area (a valid proxy and the CRAB environment are required):

#!/usr/bin/env python

import json
import os
import pycurl
from cStringIO import StringIO
from pprint import pprint
from CRABClient.UserUtilities import getUsernameFromSiteDB

class Crab3ToolsException(Exception):
    pass

class UserCacheHelper:
    def __init__(self, proxy=None, user=None):
        if proxy is None:
            proxy = os.getenv('X509_USER_PROXY')
        if not proxy or not os.path.isfile(proxy):
            raise Crab3ToolsException('X509_USER_PROXY is %r, get grid proxy first' % proxy)
        self.proxy = proxy

        if user is None:
            user = getUsernameFromSiteDB()
        if not user:
            raise Crab3ToolsException('could not get username from sitedb, returned %r' % user)
        self.user = user

    def _curl(self, url):
        buf = StringIO()
        c = pycurl.Curl()
        c.setopt(pycurl.URL, str(url))
        c.setopt(pycurl.WRITEFUNCTION, buf.write)
        c.setopt(pycurl.SSL_VERIFYPEER, False)
        c.setopt(pycurl.SSLKEY, self.proxy)
        c.setopt(pycurl.SSLCERT, self.proxy)
        c.perform()
        j = buf.getvalue().replace('\n','')
        try:
            return json.loads(j)['result']
        except ValueError:
            raise Crab3ToolsException('json decoding problem: %r' % j)

    def _only(self, l):
        if len(l) != 1:
            raise Crab3ToolsException('return value was supposed to have one element, but: %r' % l)
        return l[0]

    def listusers(self):
        return self._curl('https://cmsweb.cern.ch/crabcache/info?subresource=listusers')

    def userinfo(self):
        return self._only(self._curl('https://cmsweb.cern.ch/crabcache/info?subresource=userinfo&username=' + self.user))

    def quota(self):
        return self._only(self.userinfo()['used_space'])

    def filelist(self):
        return self.userinfo()['file_list']

    def fileinfo(self, hashkey):
        return self._only(self._curl('https://cmsweb.cern.ch/crabcache/info?subresource=fileinfo&hashkey=' + hashkey))

    def fileinfos(self):
        return [self.fileinfo(x) for x in self.filelist() if '.log' not in x] # why doesn't it work for e.g. '150630_200330:tucker_crab_repubmerge_tau0300um_M0400_TaskWorker.log' (even after quoting the :)?

    def fileremove(self, hashkey):
        x = self._only(self._curl('https://cmsweb.cern.ch/crabcache/info?subresource=fileremove&hashkey=' + hashkey))
        if x:
            raise Crab3ToolsException('fileremove failed: %r' % x)

if __name__ == '__main__':
    h = UserCacheHelper()
    for x in h.filelist():
        if '.log' in x:
            continue
        print 'remove', x
        h.fileremove(x)

note.gif Note: Once a task has been submitted, one can safely delete the input sandbox from the CRAB cache, as the sandbox is transferred to the worker nodes from the schedulers.

Stageout and publication

Documentation about (input/output) data handling in CRAB: Crab3DataHandling.

What are the allowed stageout LFN paths with CRAB?

CRAB allows only the following LFN directory paths for stageout:

  • /store/user/<username>[/<subdirs>] where username is the CERN primary account username;
  • /store/group/<groupname>[/<subdirs>] where groupname can be any already existing directory under /store/group/.

If not publishing, /store/local/<dir>[/<subdirs>] is also allowed.

These are all the allowed paths that can be set in the CRAB configuration parameter Data.outLFNDirBase. If any other path is given, the submission of the task will fail.
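
For example (the username jdoe and the subdirectories are hypothetical):

# jdoe must be the username of your CERN primary account
config.Data.outLFNDirBase = '/store/user/jdoe/myAnalysis'
# or, for an already existing group area under /store/group/:
# config.Data.outLFNDirBase = '/store/group/mygroup/myAnalysis'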

Can I stage out my files into a /store/user/ area that uses a different username than the one of my CERN primary account?

With CRAB3 this should not be any different than with CRAB2. CRAB will look up the user's username registered in SiteDB (which is the username of the CERN primary account), using for the query the user's DN (which in turn is extracted from the user's credentials), and will try to stage out to /store/user/<username>/ (by default). If the store user area uses a different username, it's up to the destination site to remap that (via a symbolic link or something similar). The typical case is Fermilab; to request the mapping of the store user area, FNAL users should follow the directions on the usingEOSatLPC web page to open a ServiceNow ticket to get this fixed.

To prevent stage out failures, and in case the user has provided in the Data.outLFN parameter of the CRAB configuration file an LFN directory path of the kind /store/user/[<some-username>/<subdir>*] (i.e. a store path that starts with /store/user/), CRAB will check if some-username matches with the user's username extracted from SiteDB. If it doesn't, it will give an error message and not submit the task. The error message would be something like this:

Error contacting the server.
Server answered with: Invalid input parameter
Reason is: The parameter Data.outLFN in the CRAB configuration file must start with either '/store/user/<username>/' or '/store/group/<groupname>/' (or '/store/local/<something>/' if publication is off), wher...

Unfortunately the "Reason is:" message is cut at 200 characters. The full message should read:

Reason is: The parameter Data.outLFN in the CRAB configuration file must start with either '/store/user/<username>/' or '/store/group/<groupname>/' (or '/store/local/<something>/' if publication is off), where username is your username as registered in SiteDB (i.e. the username of your CERN primary account).

A similar message should be given by crab checkwrite if the user does crab checkwrite --site=<CMS-site-name> --lfn=/store/user/<some-username>.

Why is CRAB not transferring an output file?

First of all, does CRAB know at all that the job should produce the output file in question? To check that, open one of the job log files linked from the task monitoring pages. Very close to the top there is the list of output files that CRAB expects to see once the job finishes (shown below is the case of job number 1 in the task):

======== HTCONDOR JOB SUMMARY at ... START ========
CRAB ID: 1
Execution site: ...
Current hostname: ...
Destination site: ...
Output files: my_output_file.root=my_output_file_1.root 

If the output file in question doesn't appear in that list, then CRAB doesn't know about it, and of course it will not be transferred. This doesn't mean that the output file was not produced; it is simply that CRAB has to know beforehand which output files the job produces.

If the output file is produced by either PoolOutputModule or TFileService, CRAB will automatically recognize the name of the output file when the user submits the task and it will add the output file name to the list of expected output files. On the other hand, if the output file is produced by any other module, the user has to specify the output file name in the CRAB configuration parameter JobType.outputFiles in order for CRAB to know about it. Note that this parameter takes a python list, so the right way to specify it is:

config.JobType.outputFiles = ['my_output_file.root']

Can I delete a dataset I published in DBS?

Users do not have permissions to delete a dataset or a file from DBS. Instead, what users can do is to change the status of the dataset or of individual files in the dataset. For more details see Changing a dataset or file status in DBS.

Can I send CRAB output to CERNBOX ?

Yes, by doing both the following:

  1. indicating T2_CH_CERNBOX as storage location in CRAB configuration
  2. asking CERNBOX administrators (who are NOT in CMS) to grant proper permission to your DN

Explanation:

The T2_CH_CERNBOX site is not listed among CMS sites, e.g. in https://cms-cric.cern.ch/cms/site/index/, but a trivial file catalog exists for it and it is known to PhEDEx, i.e. it is a known storage location for CMS in https://cmsweb.cern.ch/phedex/prod/Components::Status (and https://cms-cric.cern.ch/cms/storageunit/detail/T2_CH_CERNBOX/ ). This allows CRAB to use the T2_CH_CERNBOX string to map logical file names of the kind /store/user/somename to gsiftp://eosuserftp.cern.ch/eos/user/s/somename, which is the proper endpoint for writing to CERNBOX.

But since CERNBOX is not part of CMS disk, but rather a space which CERN offers to all users, access to it is not controlled by CMS. So in order to be able to write there from a grid node using gsiftp (differently from using e.g. the CERNBOX client or a fuse mount on lxplus), users need to ask for help from CERN, e.g. via the CERN help desk or a SNOW ticket.
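
Once the permissions have been granted, the CRAB configuration side is just the following minimal sketch (the subdirectory is hypothetical):

# write the output to your CERNBOX EOS area
config.Site.storageSite = 'T2_CH_CERNBOX'
config.Data.outLFNDirBase = '/store/user/<username>/crab_outputs'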

Jobs status

My jobs are still idle/pending/queued. How can I know why and what can I do?

If jobs are pending for more than ~12 hours, there is certainly a problem somewhere. The first thing to do is to identify to which site(s) the jobs were submitted and check the site(s) status in the Site Readiness Monitor page, http://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html. For example, the "HammerCloud" row will tell whether analysis jobs are running at the site and their success rate, and the "Maintenance" row will tell whether the site had/has a downtime (clicking on the corresponding date inset in the table will open a new web page where the downtime reason is explained). If everything looks fine with the site(s) status, it may be that the user jobs are not running because they requested more resources (memory per core) than what the site(s) can offer (see What is the maximum memory per job (maxMemoryMB) I can request?).

CRAB commands

crab checkusername fails with "Error: Failed to retrieve username from SiteDB."

crab checkusername uses the following sequence of bash commands, which you should try to execute one by one (make sure you have a valid proxy) to check if they return what is expected.

1) It gets the path to the user's proxy file with the command

which scram >/dev/null 2>&1 && eval `scram unsetenv -sh`; voms-proxy-info -path

which should return something like

/tmp/x509up_u57506

2) It defines the path to the CA certificates directory with the following python command

import os
capath = os.environ['X509_CERT_DIR'] if 'X509_CERT_DIR' in os.environ else "/etc/grid-security/certificates"
print capath

which should be equivalent to the following bash command

if [ "x$X509_CERT_DIR" != "x" ]; then capath=$X509_CERT_DIR; else capath=/etc/grid-security/certificates; fi
echo $capath

and which in lxplus should result in

/etc/grid-security/certificates

3) It uses the proxy file and the capath to query https://cmsweb.cern.ch/sitedb/data/prod/whoami

curl -s --capath <output-from-command-2-above> --cert <output-from-command-1-above> --key <output-from-command-1-above> 'https://cmsweb.cern.ch/sitedb/data/prod/whoami'

which should return something like

{"result": [
 {"dn": "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atanasi/CN=710186/CN=Andres Jorge Tanasijczuk", "login": "atanasi", "method": "X509Proxy", "roles": {"operator": {"group": ["crab3"], "site": []}}, "name": "Andres Jorge Tanasijczuk"}
]}

4) Finally it parses the output from the above query to extract the username from the "login" field (in my case it is atanasi).

When reporting a problem with crab checkusername with "Failed to retrieve username from SiteDB." to the CRAB experts, it would be useful to add the output from the above commands.

note.gif Note: Even if crab checkusername gives an error retrieving the username from SiteDB, this should not stop you from trying to submit jobs with CRAB, because the error might just be a problem with crab checkusername itself and not a real problem with your registration in SiteDB (CRAB uses a different mechanism than the one described above to check the users' registration in SiteDB when attempting to submit jobs to the grid).

crab submit fails with "User quota limit reached; cannot upload the file"

When doing crab submit the user may get this error message:

Error contacting the server.
Server answered with: Invalid input parameter
Reason is: User quota limit reached; cannot upload the file

Error explanation: The user has reached the limit of 4.88GB in its CRAB cache area. Read more in this FAQ.

What to do: Files in the CRAB cache are automatically deleted after 5 days, but the user can clean his/her cache area at any time. See how in this FAQ.

crab (re)submit fails with "Trapped exception in Dagman.Fork"

Typical error in crab status:

Failure message: The CRAB server backend was not able to (re)submit the task, because the Grid scheduler answered with an error. This is probably a temporary glitch. Please try again later. If the error persists send an e-mail to hn-cms-computing-tools@cern.ch. Error reason: Trapped exception in Dagman.Fork: <type 'exceptions.RuntimeError'> Unable to edit jobs matching constraint <traceback object at 0xa113368>
  File "/data/srv/TaskManager/3.3.1512.rc6/slc6_amd64_gcc481/cms/crabtaskworker/3.3.1512.rc6/lib/python2.6/site-packages/TaskWorker/Actions/DagmanResubmitter.py", line 113, in executeInternal
    schedd.edit(rootConst, "HoldKillSig", 'SIGKILL')

As the error message says, this should be a temporary failure. One should just keep trying until it works. But after doing crab resubmit, give it some time to process the resubmission request; it may take a couple of minutes to see the jobs reacting to the resubmission.

crab submit fails with "Task failed to bootstrap on schedd"

After doing crab submit and crab status the user may get this error message:

Task status: UNKNOWN

Error during task injection:    Task failed to bootstrap on schedd

Error explanation: The submission of the task to the scheduler machine has failed.

What to do: Submit again.

crab submit fails with "Failed to contact Schedd"

When doing crab status the user may get one of these error messages:

Error during task injection:        <task-name>: Failed to contact Schedd: Failed to fetch ads from schedd.

Error during task information retrieval:        <task-name>: Failed to contact Schedd: .

Error explanation: This is a temporary communication error with the scheduler machine (submission node), most probably because the scheduler is overloaded.

What to do: Try again after a couple of minutes.

crab submit fails with "Splitting task ... on dataset ... with ... method does not generate any job"

This is not a CRAB error.

This usually happens when there is no lumi to process, i.e. the intersection of

  1. the input lumimask (if any)
  2. the selected run range (if any)
  3. the set of runs and lumis in the input dataset
is empty. Typical reasons are using a golden json lumimask from some data acquisition era on data from a different era or looking for a specific run in a dataset which does not include that run.

You should carefully cross check what you are trying to select, possibly using lumi arithmetic to verify (see the sketch below), and only report this as a problem if you are sure that there is a bug.
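
For example, a minimal sketch (file name and run range are hypothetical) that uses the LumiList tool described in the lumi-mask arithmetics FAQ below to check whether your lumi-mask and run range overlap at all:

from WMCore.DataStructs.LumiList import LumiList

# hypothetical inputs: replace with your actual lumi-mask file and run range
lumiList = LumiList(filename='my_lumi_mask.json')
lumiList.selectRuns([x for x in range(193093, 193999 + 1)])
# an empty result here means the splitting will indeed generate no jobs
print(lumiList.getCompactList())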

crab submit fails with "Block ...  contains more than 100000 lumis and cannot be processed for splitting. For memory/time contraint big blocks are not allowed. Use another dataset as input."

The message is self-explanatory. The CRAB server will die due to lack of memory if it needs to process luminosity lists with millions of entries per block. This can only happen with MC datasets which have been created with improper use of lumisections, since the limit of 100k lumisections in one block would correspond, for real data, to 100 days of continuous data taking. For MC, lumi sections have no relation with luminosity but are used only to allow processing less than one file per job via the split-by-lumi algorithm; in this case it makes no sense to have more lumis than events.

Some more discussion is in this thread: https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/2928.html

There are a few datasets in DBS which do not satisfy this limit. If someone really needs to process those, the only way is to do one job per file using the userInputFiles feature of CRAB. An annotated example of how to do this in python is below; note that you have to disable DBS publication, indicate split by file and provide input file locations. Other configuration parameters can be set as usual:


# this will use CRAB client API
from CRABAPI.RawCommand import crabCommand

# talk to DBS to get list of files in this dataset
from dbs.apis.dbsClient import DbsApi
dbs = DbsApi('https://cmsweb.cern.ch/dbs/prod/global/DBSReader')

dataset = '/BsToJpsiPhiV2_BFilter_TuneZ2star_8TeV-pythia6-evtgen/Summer12_DR53X-PU_RD2_START53_V19F-v3/AODSIM'
fileDictList=dbs.listFiles(dataset=dataset)

print ("dataset %s has %d files" % (dataset, len(fileDictList)))

# DBS client returns a list of dictionaries, but we want a list of Logical File Names
lfnList = [ dic['logical_file_name'] for dic in fileDictList ]

# this is now standard CRAB configuration

from WMCore.Configuration import Configuration
config = Configuration()

config.section_("General")
config.General.transferLogs = False

config.section_("JobType")
config.JobType.pluginName = 'Analysis'

# in following line of course replace with your favorite pset
config.JobType.psetName = 'demoanalyzer.py'
config.section_("Data")

# following 3 lines are the trick to skip DBS data lookup in CRAB Server
config.Data.userInputFiles = lfnList
config.Data.splitting = 'FileBased'
config.Data.unitsPerJob = 1

# since the input will have no metadata information, output can not be put in DBS
config.Data.publication = False

config.section_("User")
# 

config.section_("Site")

# since there is no data discovery and no data location lookup in CRAB
# you have to say where the input files are
config.Site.whitelist = ['T2_CH_CERN']

config.Site.storageSite = 'T2_CH_CERN'

result = crabCommand('submit', config = config)

print (result)

crab submit fails with "Block ...  contains more than 100000 lumis."

The message is self-explanatory. The CRAB server will die due to lack of memory if it needs to process luminosity lists with millions of entries per block. There are two known cases where this can happen:
  • MC datasets which have been created with improper use of lumisections. MC lumi sections have no relation with luminosity but are used only to allow processing less than one file per job via the split-by-lumi algorithm; in this case it makes no sense to have more lumis than events.
  • nanoAOD or similar super-extra-high compact event formats where one year of data fits in a few files
Those datasets can only be processed if CRAB can ignore the lumi-list information, i.e. using config.Data.splitting = 'FileBased' and avoiding any extra request which would eventually result in the need to use lumi information. This means no run range, no lumi mask, and no secondary dataset (since CRAB would need to use lumi info to match input files from the two datasets). Note that useParent is allowed, since in that case CRAB uses parentage information stored in DBS to match input files.

In practice your crabConfig file must have:

config.Data.splitting = 'FileBased'
config.Data.runRange = ''
config.Data.lumiMask  = ''

(the parameters with an assigned null value '' can be omitted, but if present must indicate the null string)

and must NOT contain the following parameter

config.Data.secondaryInputDataset

CRAB fails to resubmit some jobs

It is important that you as a user are prepared for this to happen and know how to remain productive in your physics analysis with the least effort. While there is a long tradition of "resubmit them until they work", this is hardly useful any more. And while we can't prevent users from trying, we cannot guarantee that it will work, nor that resubmitted jobs will succeed.

In the case where the missing data sample is important, the best recommendation we can give to users is to
USE GENERAL/GENERIC RESCUE PROCEDURES, RATHER THAN TRY AT ALL COSTS TO REVIVE A DEAD TASK.
We will always welcome problem reports and will try to improve when resubmission failures can be due to CRAB internals, but surely you do not want to hold your breath in the meanwhile.

The safest path is therefore:

  1. let running jobs die or complete and dust settle
  2. use crab kill to make sure everything stops
  3. take stock of what's published in DBS at that point and make sure that it matches what's on disk
    • if your output is not in DBS, you can use crab report, but while DBS information is available forever, crab commands on a specific task may not remain usable forever
  4. assess whether it is more important to get the last percentage of statistics or go on with other work. Do you really need 100% completion in this task ?
  5. if full statistics is needed, create a recovery task for the missing lumis and run it writing to the same dataset as the original task.

Recovery task is an important concept that can be useful in many circumstances. Please find instructions in this FAQ

At the other extreme there's: forget about this, and resubmit a new task with new output dataset. In between it is a murky land where many recipes may be more efficient according to details, but no general simple rule can be given and there's space for individual creativity and/or desperation.

I get a "Syntax error in CRAB configuration"

When doing crab submit the user may get one of these error messages:

Syntax error in CRAB configuration:
invalid syntax (<CRAB-configuration-file-name>.py, <line-where-error-occurred>)

Syntax error in CRAB configuration:
'Configuration' object has no attribute '<attribute-name>'

Error explanation: The CRAB configuration file could not be loaded, because there is a syntax error somewhere in it.

What to do: Check the CRAB configuration file and fix it. There could be a misspelled parameter or section name, or you could be trying to use a configuration attribute (parameter or section) that was not defined. To get more details on where the error occurred, do:

python
import <CRAB-configuration-file-name> #without the '.py'

which gives:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<CRAB-configuration-file-name>.py", <line-where-error-occurred>
    <error-python-code>
                      ^

or

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<CRAB-configuration-file-name>.py", <line-where-error-occurred>, in <module>
    <error-python-code>
AttributeError: 'Configuration' object has no attribute '<attribute-name>'

For more information about the CRAB configuration file, see CRAB3ConfigurationFile.

Problems with the .requestcache file and/or the CRAB project directory

If a crab command fails with messages like "Cannot find .requestcache file" or "... is not a valid CRAB project directory", or otherwise complains that it cannot find the task you are trying to send the command to, a problem with the local directory where crab submit caches relevant information is likely (maybe the disk got full, or corrupted, or you removed a file unintentionally).

Please find more information about the CRAB project directory and possible recovery actions on your side in CRAB3Commands#CRAB_project_directory

Problems with job execution

Exit code 8001

This indicates that cmsRun encountered a not better specified fatal exception. It usually means a problem in the user code or configuration. You should inspect the stdout of one job to find the exception message and traceback, which may guide you to the solution.

A particular case is when the exception says An exception of category 'DictionaryNotFound' occurred, like in this example:

----- Begin Fatal Exception 08-Jun-2017 18:18:04 CEST-----------------------
An exception of category 'DictionaryNotFound' occurred while
   [0] Constructing the EventProcessor
Exception Message:
No Dictionary for class: 'edm::Wrapper<edm::DetSetVector<CTPPSDiamondDigi> >'
----- End Fatal Exception -------------------------------------------------

In this case, most likely the input data have been produced with a CMSSW version not compatible with the one used in the CRAB job. In general, reading data with a release older than the one they were produced with is not supported.

To find out which release was used to produce a given dataset or file, adapt the following examples to your situation:

belforte@lxplus045/~> dasgoclient --query "release dataset=/DoubleMuon/Run2016C-18Apr2017-v1/AOD"
["CMSSW_8_0_28"]
belforte@lxplus045/~> 

belforte@lxplus045/~> dasgoclient --query "release file=/store/data/Run2016C/DoubleMuon/AOD/18Apr2017-v1/100001/56D1FA6E-D334-E711-9967-0025905A48B2.root" 
["CMSSW_8_0_28"]
belforte@lxplus045/~> 

Exit code 8028

Normally this has the same meaning as 8020: a needed file is not present at the site where it was expected to be and the site needs to fix this. If the problem is reproducible after O(1 day) or is happening in large amounts, write to the support list and it will be followed up.

Only in the special cases in which you intentionally try to read files from a site different from the one where jobs run, more investigation and action are needed. In those cases, keep reading the details below.

Exit code 8028 means "FileOpenError with fallback" (as documented here). That means that some input file could not be opened, neither in the first attempt from the local storage at the execution site nor in the fallback attempt from a remote site using AAA. Note that even if the file is not present at the execution site, the job will still try to find/open it from the local storage, and only in case of failure use the fallback procedure.

Leaving the AAA error aside for a while, the first thing to contemplate here is to understand why the file could not be loaded from the local storage. Was it because the file is not available at the execution site? And if so, was it supposed to be available? If not, can you force CRAB to submit the jobs to the site(s) where the file is hosted? CRAB should do that automatically if the input file is one from the input dataset specified in the CRAB configuration parameter Data.inputDataset, unless you have set Data.ignoreLocality = True, or except in cases like using a (secondary) pile-up dataset. If yours is the last case, please read Using pile-up in this same twiki.

If you intentionally wanted (and had a good reason) to run jobs reading the input files via AAA, then yes, we have to care about why AAA failed. Since AAA should allow access to files at any CMS site, the next thing is to rule out a transient problem: you can submit your jobs again and see if the error persists. Ultimately, you should write to the Computing Tools HyperNews forum (this is a forum for all kinds of issues with CMS computing tools, not only CRAB) following these instructions.

Exit code 50660

Exit code 50660 means "Application terminated by wrapper because using too much RAM (RSS)" (as documented here). The amount of RAM that a job can use on a grid node is always limited, and if the memory need keeps increasing as the job runs (a so-called "memory leak") the job will need to be killed. Grid sites used by CMS guarantee at least 2.5 GB of RAM per core, so allowing for some overhead, the CRAB default is to ask for 2 GB per job. This is usually enough to run full RECO, and user jobs should not normally need more. So the user's first action when getting this error is to make sure that the code is not leaking memory nor allocating useless large structures. If more RAM is really needed, it can be requested via the JobType.maxMemoryMB parameter in the CRAB configuration file. Uselessly requesting too much RAM is very likely to result in wasted CPU (we will run fewer jobs than there are CPU cores available in a node, to spread the available RAM in fewer, larger chunks), so you have to be careful; abuse will be monitored and tasks may get killed.

Each user is responsible for her/his code and needs to make sure that memory usage is under control. Various tools exist to identify and prevent memory leaks in C++ which are not in the CRAB documentation scope. Generally speaking, when investigating memory usage you want to make sure that you run on the same input as a job which resulted in memory problems, as usage can depend on the number, sequence and kind of events processed. Users may also benefit from the crab preparelocal command (https://twiki.cern.ch/twiki/bin/view/CMSPublic/CRAB3Commands#crab_preparelocal) to replay one specific job interactively and monitor memory usage.

An important exception is in case the user runs multi-threaded applications, in particular CMSSW. In that case a single job will use multiple cores and not only can, but must use more than the default 2GB of RAM. It is up to the user to request the proper amount of memory, e.g. after measuring it running the code interactively, or by looking up what Production is using in similar workflows. As a generic rule of thumb, (1+1*num_threads) GB may be a good starting point.

Illegal parameter found in configuration. The parameter is named: 'numberEventsInLuminosityBlock'

The most common reasons for this error are:
  1. The user is trying to analyze an input dataset, but he/she has specified in the CRAB configuration file JobType.pluginName = 'PrivateMC' instead of JobType.pluginName = 'Analysis'.
  2. The user is generating MC events, correctly specifying in the CRAB configuration file JobType.pluginName = 'PrivateMC', but in the CMSSW parameter-set configuration he/she has specified a source of type PoolSource. The solution is to not specify a PoolSource. Note: This doesn't mean to remove process.source completely, as this attribute must be present. One could set process.source = cms.Source("EmptySource") if no input source is used, as sketched below.
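
A minimal sketch of the relevant lines in a CMSSW parameter-set for event generation (the process name is arbitrary and the generator modules themselves are omitted):

import FWCore.ParameterSet.Config as cms

process = cms.Process("GEN")
# no input files when generating events: use EmptySource instead of PoolSource
process.source = cms.Source("EmptySource")
# placeholder; with 'PrivateMC' the number of events per job is driven by the CRAB splitting parameters
process.maxEvents = cms.untracked.PSet(input=cms.untracked.int32(10))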

Segmentation Fault (exit code 11 or 139)

Usually segmentation faults are well reproducible and can be debugged by running locally on the same input files as the CRAB job (e.g. using the crab preparelocal command, https://twiki.cern.ch/twiki/bin/view/CMSPublic/CRAB3Commands#crab_preparelocal). Here are some general hints on how to tackle them from https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/5166/2.html :

A segfault is typically caused by invalid memory access (e.g. reading out of bounds of an array or dereferencing a null or random pointer). A simple step forward is to recompile the offending code with debug symbols, e.g.

  • USER_CXXFLAGS="-g" scram b
and run again. Then the stack trace will show the source file and line number where the segfault occurred. If the cause is not evident, you can add printouts or use gdb.

[ERROR] Operation expired

Some sites' configuration cannot handle remote access of large files (> 10 GB) and XRootD fails with a message like
== CMSSW:    [1] Reading branch EventAuxiliary
== CMSSW:    [2] Calling XrdFile::readv()
== CMSSW:    Additional Info:
                 [a] Original error: '[ERROR] Operation expired' (errno=0, code=206, source=xrootd.echo.stfc.ac.uk:1094 (site T1_UK_RAL)).
As of winter 2019 this almost only happens for files stored at T1_UK_RAL. If you are in this situation, a way out is to submit a new task using CMSSW ≥ 10_4_0 with the following duplicateCheckMode option in the PSet PoolSource
process.source = cms.Source("PoolSource",
   [...]
   duplicateCheckMode = cms.untracked.string("noDuplicateCheck")
)

When that is not an option and the problem is persistent, you may need to ask for a replica of the data at another site.

CRAB Client API

Multiple submission fails with a CMSSW "duplicate process" error

The general problem is that CMSSW parameter-set configurations don't like to be loaded twice. In that respect, each time the CRAB client loads a CMSSW configuration, it saves it in a local (temporary) cache identifying the loaded module with a key constructed out of the following three pieces: the full path to the module and the python variables sys.path and sys.argv.

A problem arises when the CRAB configuration parameter JobType.pyCfgParams is used. The arguments in JobType.pyCfgParams are added by CRAB to sys.argv, affecting the value of the key that identifies a CMSSW parameter-set in the above mentioned cache. And that's in principle fine, as changing the arguments passed to the CMSSW parameter-set may change the event processor. But when a python process has to do more than one submission (like the case of multicrab for multiple submissions), the CMSSW parameter-set is loaded again every time JobType.pyCfgParams is changed, and this may result in "duplicate process" errors. Below are two examples of this kind of error:

CmsRunFailure
CMSSW error message follows.
Fatal Exception
An exception of category 'Configuration' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing module: class=...... label=......
Exception Message:
Duplicate Process The process name ...... was previously used on these products.
Please modify the configuration file to use a distinct process name.

CmsRunFailure
CMSSW error message follows.
Fatal Exception
An exception of category 'Configuration' occurred while
   [0] Constructing the EventProcessor
Exception Message:
MessageLogger PSet:
in vString categories duplication of the string ......
The above are from MessageLogger configuration validation.
In most cases, these involve lines that the logger configuration code
would not process, but which the cfg creator obviously meant to have effect.

One option would be to try to not use JobType.pyCfgParams. But if this is not possible, the more general ad-hoc solution would be to fork the submission into a different python process. For example, if you are doing something like documented in Multicrab using the crabCommand API then we suggest to replace each

submit(config)

by

from multiprocessing import Process
p = Process(target=submit, args=(config,))
p.start()
p.join()

(Of course, from multiprocessing import Process needs to be executed only once, so put it outside any loop.)

Multiple submission produces different PSetDump.py files

If the PSetDump.py file (found in task_directory/inputs) differs for the tasks from a multiple-submission python file, try forking the submission into different python processes, as recommended in the previous FAQ.

More on CRAB tasks

Recovery task

Recovery task: What

Recovery task is an important concept that can be useful in many circumstances.

The general idea is that a CRAB task has run to completion, all re-submission attempts done, but some of the necessary input data was not processed. A recovery task will run the same executable and configuration on the missed input data, and will add results to the same output destination (and DBS dataset) as the original task.

Recovery task: Why

CRAB developers try hard to give you a tool with perfect bookkeeping and full automation which brings each task to 100% success. The operators of the global CMS submission infrastructure (aka HTCondor pool, aka glideIn) and the administrators of the many sites that contribute hardware resources for CMS data analysis strive for the same. Yet at times things can go wrong, and we may not be able to investigate and fix every small glitch, and surely never within hours or days.

It is impossible to guarantee that a given task will always complete to 100% success in a short amount of time. At the same time it is impossible to make sure that all desired input data is available when the task is submitted. Moreover both good sense and experience show that the larger a task is, the larger is the chance it hits some problem. Large workflows therefore benefit from the possibility to run them sort of iteratively, with a short (hopefully one or two at most) succession of smaller and smaller tasks.

Recovery task: When

A partial list of real life events where a recovery task is the user's fastest and simplest way to get work done:
  • Something went wrong in the global infrastructure and some jobs are lost beyond recovery
  • Something went wrong inside CRAB (bugs, hardware...) which can't be fixed by crab resubmit command
  • Some site went down for longer than it makes sense to keep jobs in the queue
  • Some data was not available and had to be retransferred and took longer than... see above
  • More data have been added to the input dataset since the original task ran (pretty much as the above)
  • A new lumimask was prepared where lumis declared bad earlier are now good
  • ... more ...

Recovery task: How

You must of course still have around the original

  • scram project area
  • crab configuration file including the pset and any other file referenced in there
  • original CRAB project directory if you did not publish output in DBS

The procedure to generate a recovery task is based on these simple steps:

  1. issue a crab kill. Killing the current task will guarantee that no change happens anymore
  2. make a list of lumis present in the desired input dataset (listIn)
  3. make a list of lumis successfully processed by original CRAB task A (listA)
  4. submit a new CRAB task B which processes the missing lumis (listB = listIn - listA)

Details are slightly different depending on whether you published output in DBS or not:

output in DBS
follow the procedure in this FAQ
output not in DBS
follow the procedure in this Workbook example

Dealing with a growing input dataset and/or changing lumi-mask

While data taking is progressing, corresponding datasets in DBS and lumi-mask files are growing. Also, data quality is sometimes improved for already existing data, leading to updated lumi-masks which, compared to older lumi-masks, include luminosity sections that were previously filtered out. Both of these situations lead to the common case where one would like to run a task (let's call it task B) over an input dataset partially analyzed already in a previous task (let's call it task A), where task B should skip the data already analyzed in task A.

This can be accomplished with a few lines in the CRAB configuration file, see an annotated example below.

from CRABClient.UserUtilities import config, getLumiListInValidFiles
from WMCore.DataStructs.LumiList import LumiList

config = config()

config.General.requestName = 'TaskB'
...
 # you want to use same Pset as in previous task, in order to publish in same dataset
config.JobType.psetName = <TaskA-psetName>
...
# and of course same input dataset
config.Data.inputDataset = <TaskA-input-dataset-name>
config.Data.inputDBS = 'global'  # but this will work for a dataset in phys03 as well

# now the list of lumis that you successfully processed in Task-A
# it can be done in two ways. Uncomment and edit the appropriate one:
#1. (recommended) when Task-A output was a dataset published in DBS
#taskALumis = getLumiListInValidFiles(dataset=<TaskA-output-dataset-name>, dbsurl='phys03')
# or 2. when output from Task-A was not put in DBS
#taskALumis = LumiList(filename=<the LumiSummary.json file from running crab report on Task-A>)

# now the current list of golden lumis for the data range you are interested in; it can be different from the one used in Task-A
officialLumiMask = LumiList(filename='<some-kosher-name>.json')

# this is the main trick. Mask out also the lumis which you processed already
newLumiMask = officialLumiMask - taskALumis 

# write the new lumiMask file, now you can use it as input to CRAB
newLumiMask.writeJSON('my_lumi_mask.json')
# and there we go: process from the input dataset all the lumis listed in the current officialLumiMask file, skipping the ones you already have.
config.Data.lumiMask = 'my_lumi_mask.json' 
config.Data.outputDatasetTag = <TaskA-outputDatasetTag> #  add to your existing dataset
...

IMPORTANT NOTE: in this way you will add any lumi section in the initial dataset that was turned from bad to good in the golden list after you ran Task-A, but if some of those data evolved the other way around (from good to bad), there is no way to remove those from your published datasets.

Using pile-up

Important Instructions:
Make sure you run your jobs at the site where the pile-up sample is. Not where the signal is.
This requires you to override the location list that CRAB would extract from the inputDataset.

Rationale and details:
The pile-up files have to be specified in the CMSSW parameter-set configuration file. There is no way yet to tell in the CRAB configuration file that one wants to use a pile-up dataset as a secondary input dataset. That means that CRAB doesn't know that the CMSSW code will want to access pile-up files; CRAB only knows about the primary input dataset (if any). This means that, assuming there is a primary input dataset, when CRAB does data discovery to figure out to which sites it should submit the jobs, it will only take into account the input dataset specified in the CRAB configuration file (in the Data.inputDataset parameter) and submit the jobs to sites where this dataset is hosted. If there is no primary input dataset, CRAB will submit the jobs to the less busy sites. In any case, if the pile-up files are not hosted at the execution sites, they will be accessed via AAA (Xrootd). But reading the "signal" events directly from the local storage and the pile-up events via AAA is less efficient than doing it the other way around, since for each "signal" event that is read one needs to read in general many (> 20) pile-up events. Therefore, it is highly recommended that the user forces CRAB to submit the jobs to the sites where the pile-up dataset is hosted, by whitelisting these sites using the parameter Site.whitelist in the CRAB configuration file. Note that one also needs to set Data.ignoreLocality = True in the CRAB configuration file in case a primary input dataset is used, so as to avoid CRAB doing data discovery and eventually complaining (and failing to submit) that the input dataset is not available at the whitelisted sites. One can use DAS to get the list of sites that host a dataset.
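
A minimal sketch of the relevant CRAB configuration lines (the site names are hypothetical; whitelist the sites that actually host your pile-up dataset, which you can find with DAS):

# run at the sites hosting the pile-up dataset, not necessarily where the signal is
config.Site.whitelist = ['T2_DE_DESY', 'T2_US_Wisconsin']
# needed when a primary input dataset is given, so that CRAB does not refuse to submit
# because that dataset is not hosted at the whitelisted sites
config.Data.ignoreLocality = True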

Miscellanea

How CRAB finds data in input datasets from DBS

The following remarks apply to the main input dataset provided to CRAB via the Data.inputDataset configuration parameter:

  • Dataset status: Datasets in DBS can have different status. This is controlled by Production and the norm is to use VALID datasets. Datasets with a different status may occasionally be useful, e.g. for comparison or dedicated study of the problems which led them to be deprecated. In order to run over a dataset whose status is not VALID, one has to set Data.allowNonValidInputDataset = True in the CRAB configuration (see the sketch after this list).
  • File status: Files have a is_file_valid flag in DBS, usually set to False when file is lost or corrupted. CRAB considers only valid files in the dataset. Invalid files are skipped.
  • Data location: A dataset in DBS is divided in blocks. Blocks can be migrated with PhEDEx, and PhEDEx is the only service that knows about the current locations (host sites) of a dataset block. Therefore CRAB queries PhEDEx to retrieve the locations of the blocks in a dataset. Next, a data location (aka PNN = PhEDEx Node Name) is turned into a site where to run (aka PSN = Processing Site Name) using SiteDB. If a block has no valid locations in PhEDEx or no PSN associated in SiteDB, CRAB skips the block.
  • User datasets: For datasets created by users and published in DBS phys03 instance, the above is modified as follows:
    • Dataset status and File status flags are initially set to VALID by CRAB when the dataset is published; then can be changed by the user.
    • Data block location is tracked as origin_site_name in DBS and data are assumed to never move. If datasets are moved, the user can update the origin_site_name. There is no way to have multiple locations.
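
A minimal sketch of two configuration lines related to the points above (Data.inputDBS is the parameter used to point CRAB at the phys03 instance for user datasets; use these lines only when they actually apply to your input dataset):

# only needed if the input dataset status in DBS is not VALID
config.Data.allowNonValidInputDataset = True
# only needed for a user dataset published in the phys03 DBS instance
config.Data.inputDBS = 'phys03'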

How many jobs can I run at the same time ?

CRAB runs jobs on the Grid using a global HTCondor pool created via the glideinWMS machinery; think of it like a global batch system with execution nodes all over the place. The most important thing which controls how many jobs you can run is the overall number of execution slots (CPUs) available for your jobs, i.e. slots that match your requirements of data access, memory and running time. Then HTCondor tries hard to give every user the same share of computing resources, i.e. equal resources to everyone at any given time. You are not penalized for having run more jobs yesterday, and not rewarded either for not having used your share in the past. To assess the user share, HTCondor considers only the number of cores that you are using (until May 2017 also the number of RAM GBs was accounted for).

Beware thus of asking for much more memory per core than you need (see What is the maximum memory per job (maxMemoryMB) I can request?).

How to predict how long my jobs will run for?

Use the --dryrun option when doing crab submit. See crab submit --dryrun.

How to list/copy/remove files/directories in a storage element area?

You can use the gfal-* commands from a machine that has GFAL2 utility tools installed (e.g. lxplus). You have to pass Physical File Names (PFNs) as arguments to the commands. To get the Physical File Name given a Logical File Name and a CMS node name, you can use the lfn2pfn PhEDEx API. LFNs are names like /store/user/mario/myoutput; note that a directory is also a file name.

For example, for the LFN /store/user/username/myfile.root stored at T2_IT_Pisa you can do the following (make sure you did cmsenv before, so as to use a recent version of curl), where you can replace the first two lines with the values which are useful to you and simply copy/paste the long curl command:

site=T2_IT_Pisa
lfn=/store/user/username/myfile.root
curl -ks "https://cmsweb.cern.ch/phedex/datasvc/perl/prod/lfn2pfn?node=${site}&lfn=${lfn}&protocol=srmv2" | grep PFN | cut -d "'" -f4

which returns:

srm://stormfe1.pi.infn.it:8444/srm/managerv2?SFN=/cms/store/user/username/myfile.root

Before executing the gfal commands, make sure to have a valid proxy:

voms-proxy-init -voms cms

Enter GRID pass phrase for this identity:
Contacting voms2.cern.ch:15002 [/DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch] "cms"...
Remote VOMS server contacted succesfully.


Created proxy in /tmp/x509up_u<user-id>.

Your proxy is valid until <some date-time 12 hours in the future>

The most useful gfal commands and their usage syntax for listing/removing/copying files/directories are in the examples below (it is recommended to unset the environment when executing gfal commands, i.e. to add env -i in front of the commands). See also the man entry for each command (man gfal-ls etc.):

List a (remote) path:

env -i X509_USER_PROXY=/tmp/x509up_u$UID gfal-ls <physical-path-name-to-directory>

Remove a (remote) file:

env -i X509_USER_PROXY=/tmp/x509up_u$UID gfal-rm <physical-path-name-to-file>

Recursively remove a (remote) directory and all files in it:

env -i X509_USER_PROXY=/tmp/x509up_u$UID gfal-rm -r <physical-path-name-to-directory>

Copy a (remote) file to a directory in the local machine:

env -i X509_USER_PROXY=/tmp/x509up_u$UID gfal-copy <physical-path-name-to-source-file> file://<absolute-path-to-local-destination-directory>
Note: the <absolute-path-to-local-destination-directory> starts with / therefore there are three consecutive / characters like file:///tmp/somefilename.root

Why are my jobs submitted to a site that I had explicitly blacklisted (not whitelisted)?

There is a site overflow mechanism in place, which takes effect after CRAB submission. Sites are divided in regions of good WAN/xrootd connectivity (e.g. US, Italy, Germany etc.); jobs queued at one site A for too long are then allowed to overflow to a well connected site B which does not host the requested input data but from where the data will be read over xrootd. The rationale is that even if those jobs were to fail due to being unable to read data or a problem at site B, they will be automatically resubmitted, so nothing is lost with respect to keeping those jobs idle in the queue waiting for free slots at site A. The site overflow can be turned off via the Debug.extraJDL CRAB configuration parameter:

config.section_("Debug")
config.Debug.extraJDL = ['+CMS_ALLOW_OVERFLOW=False']

Note: if you change this configuration option for an already-created task (for instance because you noticed a lot of job failures at a particular site and jobs keep going back there even after you blacklisted it), you cannot simply change the option in the configuration and resubmit: the change is not accepted at resubmission. You have to kill the existing task and submit a new one with the option set.
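In practice, a possible sequence is (project directory and configuration file names are illustrative):

crab kill -d <CRAB project directory>
# edit crabConfig.py: add the Debug.extraJDL lines above and pick a new
# General.requestName (or let CRAB generate a new default one)
crab submit -c crabConfig.py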

What is glideinWms Overflow and how can I avoid using it ?

See the previous FAQ: Why are my jobs submitted to a site that I had explicitly blacklisted (not whitelisted)?

Doing lumi-mask arithmetic

There is a python tool called LumiList.py (available in the WMCore library; it is the same code as cmssw/FWCore/PythonUtilities/python/LumiList.py) that can be used to do lumi-mask arithmetic. The arithmetic can even be done inside the CRAB configuration file (that is the advantage of having the configuration file written in python). Below are some examples.

Example 1: A run range selection can be achieved by selecting from the original lumi-mask file the run range of interest.

from WMCore.DataStructs.LumiList import LumiList

lumiList = LumiList(filename='my_original_lumi_mask.json')
lumiList.selectRuns([x for x in range(193093,193999+1)])
lumiList.writeJSON('my_lumi_mask.json')

config.Data.lumiMask = 'my_lumi_mask.json'

Example 2: Use a new lumi-mask file that is the intersection of two other lumi-mask files.

from WMCore.DataStructs.LumiList import LumiList

originalLumiList1 = LumiList(filename='my_original_lumi_mask_1.json')
originalLumiList2 = LumiList(filename='my_original_lumi_mask_2.json')
newLumiList = originalLumiList1 & originalLumiList2
newLumiList.writeJSON('my_lumi_mask.json')

config.Data.lumiMask = 'my_lumi_mask.json'

Example 3: Use a new lumi-mask file that is the union of two other lumi-mask files.

from WMCore.DataStructs.LumiList import LumiList

originalLumiList1 = LumiList(filename='my_original_lumi_mask_1.json')
originalLumiList2 = LumiList(filename='my_original_lumi_mask_2.json')
newLumiList = originalLumiList1 | originalLumiList2
newLumiList.writeJSON('my_lumi_mask.json')

config.Data.lumiMask = 'my_lumi_mask.json'

Example 4: Use a new lumi-mask file that is the subtraction of two other lumi-mask files.

from WMCore.DataStructs.LumiList import LumiList

originalLumiList1 = LumiList(filename='my_original_lumi_mask_1.json')
originalLumiList2 = LumiList(filename='my_original_lumi_mask_2.json')
newLumiList = originalLumiList1 - originalLumiList2
newLumiList.writeJSON('my_lumi_mask.json')

config.Data.lumiMask = 'my_lumi_mask.json'

User quota in the CRAB scheduler machines

Each user has a home directory with 100 GB of disk space on each of the scheduler machines (schedd for short) used by CRAB3 to submit jobs to the Grid. Whenever a task is submitted by the CRAB server to a schedd, a task directory is created in this space containing, among other things, the CRAB libraries and scripts needed to run the jobs. Log files from Condor/DAGMan and from CRAB itself are also placed there. (What is not available on the schedds are the cmsRun log files, except for the snippet included in the CRAB job log file.) As a guidance, a task with 100 jobs uses on average 50 MB of space, but this number depends a lot on the number of resubmissions, since each resubmission produces its own log files. If users reach their quota on a given schedd, they will not be able to submit more jobs via that schedd (they may still be able to submit via another schedd, but since users cannot choose the schedd to which they submit -the choice is made by the CRAB server-, they would have to keep retrying the submission until the task lands on a schedd with non-exhausted quota). To avoid that, task directories are automatically removed from the schedds 30 days after their last modification. If users reach 50% of their quota on a given schedd, an automatic e-mail similar to the one shown below is sent to them.

Subject: WARNING: Reaching your quota

Dear analysis user <username>,

You are using <X>% of your disk quota on the server <schedd-name>. The moment you reach the disk quota of <Y>GB, you will be unable to
run jobs and will experience problems recovering outputs. In order to avoid that, you have to clean up your directory at the server. 
Here are the instructions to do so:
 https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrabFaq#How_to_clean_up_your_directory_i
Here it is a more detailed description of the issue:
 https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrabFaq#Disk_space_for_output_files
If you have any questions, please contact hn-cms-computing-tools(AT)cern.ch
 Regards,
CRAB support

This e-mail has a link (https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrabFaq#How_to_clean_up_your_directory_i) to the instructions on how to clean up space in the user's home directory in a schedd. A user can follow the instructions in that page, or alternatively use the crab purge command.
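For example, to free the space used on the scheduler by a task you no longer need, something like the command below should work (the --schedd option is meant to act on the scheduler machine only; check crab purge --help for the exact options available in your client version):

crab purge --schedd -d <CRAB project directory>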

Obsolete/Deprecated stuff (kept here as permanent documentation just in case)

crab-env-bootstrap.sh script to overcome the CRAB3 and CMSSW environment conflicts

To overcome the CRAB3 vs CMSSW environment conflicts, you can use the following script, available in CVMFS (/cvmfs/cms.cern.ch/crab3/crab-env-bootstrap.sh), without the need to source the CRAB3 environment. You could do something like this:

cmsenv
# DO NOT setup the CRAB3 environment
alias crab='/cvmfs/cms.cern.ch/crab3/crab-env-bootstrap.sh'
crab submit
crab status
...
# check that you can run cmsRun locally

Details:

The usual way to set up CRAB3 is to first source the CMSSW environment using cmsenv and then source the CRAB3 environment using source /cvmfs/cms.cern.ch/crab3/crab.(c)sh. This setup procedure has the disadvantage that, depending on which CMSSW version is used, once the CRAB3 environment is sourced, CMSSW commands like cmsRun may stop working (other useful commands like gfal-copy will also not work). Solving this at the root by making the CRAB client RPM compatible with the CMSSW ones is not possible, because of the way the tools in the COMP repository are built and because cmsweb has its own release cycle, independent from CMSSW.

To overcome this limitation we now provide a wrapper bash script that can be run in place of the usual crab command. This wrapper script takes care of setting up the environment in the correct way before running the usual crab command, and leaves the environment as it was when exiting. The script will soon be available in the CMSSW distribution under the name 'crab', and its usage will be transparent to the user: you will just run the crab commands as you would have done before. In the meantime, the script is available for testing here: /cvmfs/cms.cern.ch/crab3/crab-env-bootstrap.sh.

ERROR: SyntaxError: invalid syntax (Mixins.py, line 714)

The problematic file is FWCore/ParameterSet/Mixins.py from CMSSW:

[line 714] p = tLPTest("MyType",** { "a"+str(x): tLPTestType(x) for x in xrange(0,300) } )

This uses a dictionary comprehension, a feature available starting from python 2.7. While CMSSW (set up via cmsenv) uses python 2.7 or later, CRAB (set up via /cvmfs/cms.cern.ch/crab3/crab.sh) still uses python 2.6.8. To overcome this problem, do not set up the CRAB environment and instead use the crab-env-bootstrap.sh script (see this FAQ).

LeonardoCristella - 2018-04-23
