Go to SWGuideCrab |
Impossible to retrieve proxy from myproxy.cern.ch ...
CRAB needs a valid long-lived credential stored in myproxy. The CRAB client always tries to keep a valid one there, but there are some known edge cases where this fails, e.g. https://github.com/dmwm/CRABServer/issues/5168. In such cases you can remove the stale credential from myproxy and then issue the crab command again.
To remove stale credentials:
grep myproxy-info <CRAB project directory>/crab.log
# example: grep myproxy-info crab_20160308_140433/crab.log

You will get something like

command: myproxy-info -l ec95456d3589ed395dc47d3ada8c94c67ee588f1 -s myproxy.cern.ch

and/or

command : myproxy-info -l belforte_CRAB -s myproxy.cern.ch
# In this case you will see your CERN computer username in place of "belforte", of course

Ideally the humanly-named credential is what matters and can be located easily, but for reasons you do not want to know, at times CRAB needs the horrible hex string. Then simply issue a
myproxy-destroy
command with the same arguments:
# example. In real life replace the long hex string with the one from your crab.log
myproxy-destroy -l ec95456d3589ed395dc47d3ada8c94c67ee588f1 -s myproxy.cern.ch
# example. In real life put your CERN username
myproxy-destroy -l <username>_CRAB -s myproxy.cern.ch

If things still fail after that, send the following additional info in your request for support, replacing the long hex string with the one that you found in crab.log (ec95456d3589ed395dc47d3ada8c94c67ee588f1 in the above example):
voms-proxy-info -all
myproxy-info -d -l <long-hex-string> -s myproxy.cern.ch
myproxy-info -d -l <username>_CRAB -s myproxy.cern.ch
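If you prefer not to copy the hex string by hand, here is a small illustrative Python sketch (not part of CRAB) that scans a crab.log for the myproxy-info lines quoted above and prints the matching myproxy-destroy commands; the default log path is just the example one from above:

# Illustrative sketch only: extract credential names from a crab.log and
# print (not run) the corresponding myproxy-destroy commands for review.
import re
import sys

logfile = sys.argv[1] if len(sys.argv) > 1 else "crab_20160308_140433/crab.log"  # example path

names = set()
with open(logfile) as f:
    for line in f:
        match = re.search(r"myproxy-info\s+-l\s+(\S+)", line)
        if match:
            names.add(match.group(1))

for name in sorted(names):
    print("myproxy-destroy -l %s -s myproxy.cern.ch" % name)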
Error UsernameException: Error contacting CRIC. Details follow:
  Executed command: curl -sS --capath /etc/grid-security/certificates --cert /tmp/x509up_u8516 --key /tmp/x509up_u8516 'https://cms-cric.cern.ch/api/accounts/user/query/?json&preset=whoami'
  Stdout:
  Stderr:
    curl: (35) error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version

This is known to happen with CMSSW_7_* and very likely older releases as well, both in SL6 and SL7. The problem is with the version of the
curl
binary shipped with the CMSSW release, which is not up to date with the security requirements of modern HTTPS servers.
The best solution is of course to use a more recent CMSSW release, but if you are really stuck in the past, a workaround is to issue this command right after cmsenv and before using CRAB:

BEWARE: THIS WILL MAKE CRAB WORK BUT BREAK YOUR gcc ENVIRONMENT
# for SL6
source /cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/curl/7.59.0/etc/profile.d/init.sh
# for SL7
source /cvmfs/cms.cern.ch/slc7_amd64_gcc630/external/curl/7.59.0/etc/profile.d/init.sh
cmssw_cc6

There is an important caveat though: myproxy-* commands are not available inside the container, therefore from there crab commands can not create the long-lived credential in myproxy.cern.ch needed to run CRAB tasks, nor verify it. You need to create/renew your credential in myproxy in a separate SL7 session by executing a crab command at least once a week (e.g. crab createmyproxy). Alternatively, you can create a long-lived proxy yourself and pass it explicitly to crab:
voms-proxy-init -voms cms -rfc -valid 192:00
export X509_USER_PROXY=`voms-proxy-info -file`
crab submit --proxy=$X509_USER_PROXY ....
crab --proxy=$X509_USER_PROXY ...
maxMemoryMB) I can request? If you run a multi-threaded application (config.JobType.numCores > 1), most likely the default memory value is not enough. The user share of computing resources accounts for the requested memory per core.
'Automatic'
splitting mode? Data.splitting
parameter has now 'Automatic'
.JobType.psetName
parameter and possible further arguments. Probe jobs have a job id of the form 0-[1,2,3,...]
, they can not be resubmitted and the task will fail if none of the probe jobs complete successfully. The output files transfer is disabled for probe jobs.
Data.unitsPerJob
parameter), after which they gracefully stop processing input data. The remaining data will be processed in the next stage (tail) and their jobs labelled as "rescheduled" in the main stage (in the dashboard they will always appear as "failed").
n-[1,2,3,...]
, where n=1,2,...
represents the tail stage number. For small tasks, less than 100 jobs, one tail stage is started when all jobs have completed (successfully or failed). For larger tasks, a first tail stage collects all remaining input data from the first 50% of completed jobs, followed by a stage that processes data when 80% of jobs have completed, and finally a stage collecting leftover input data at 100% job completion. crab status
command shows only the main and tail jobs. For the list of all jobs add the --long
option.
In the short format of crab status
command the total number of jobs is computed by removing from the list of main jobs those which "failed" and are being "rescheduled", and adding the current number of "tail" jobs. Note that the total number of submitted jobs will increase at every tail stage and that the grafana dashboard will instead show every job: probes plus main plus tails, therefore the total number of jobs in the dashboard will be different from what is printed by crab status
. As usual: when in doubt, trust crab status
output.
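To make the job-id naming convention above concrete, here is a tiny illustrative sketch (the ids are invented) that classifies ids into probe, main and tail jobs as described:

# Illustrative only: classify Automatic-splitting job ids following the
# convention described above: 0-N are probes, plain numbers are main jobs,
# n-N (n >= 1) are tail-stage jobs.
def classify(job_id):
    if "-" not in job_id:
        return "main"
    stage = job_id.split("-")[0]
    return "probe" if stage == "0" else "tail stage %s" % stage

for jid in ["0-1", "0-2", "7", "42", "1-3", "2-1"]:  # invented ids
    print("%s -> %s" % (jid, classify(jid)))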
--dryrun dates from when we initially introduced CRAB3 and many users are still sticking to it, but there are now other features (crab preparelocal and Automatic Splitting) which are better suited for this:
scope:containerName
) as input to a CRAB job in the Data.inputDataset
configuration parameter:
config.Data.inputDataset = 'user.belforte:/GenericTTbar/belforte-crab_20230306_162534-94ba0e06145abd65ccb1d21786dc7e1d/USER'

Important Notes:
/store/user/username/...
at some site, leaving to the user the creation and filling of the Rucio container. A properly formed Rucio container is also a byproduct of running standard CRAB tasks using Rucio for Asynchronous Stage Out (i.e. sending output to /store/user/rucio/username/...
containerName
needs to follow the CMS DBS dataset naming syntax (like all DIDs in the CMS instance of Rucio). Then CRAB will use Rucio to find the block (in DBS terms) locations and file names and create jobs as usual.
FileBased
splitting, and can not ask for parents or secondary datasets which rely on lumi information to connect primary/secondary files.
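As a minimal sketch, this is how the pieces above fit together in a CRAB configuration (the container name is the example one quoted above; FileBased splitting is the required choice):

from CRABClient.UserUtilities import config
config = config()

config.JobType.pluginName = 'Analysis'
# a user Rucio container (scope:containerName) as input
config.Data.inputDataset = 'user.belforte:/GenericTTbar/belforte-crab_20230306_162534-94ba0e06145abd65ccb1d21786dc7e1d/USER'
# only FileBased splitting is possible: no lumi information is available
config.Data.splitting = 'FileBased'
config.Data.unitsPerJob = 1
# parent or secondary datasets can not be used with a container input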
NEW to TAPERECALL. CRAB will monitor the TAPERECALL task and check Rucio's rules. If the rule changes state from REPLICATING to OK, CRAB will change the task status back to NEW, waiting for the main thread to process it like the usual tasks. Automatic tape recall is only allowed for *AOD* tiers, because other data tiers are usually very large and hardly needed in their entirety.
Still, you can request tape recall for other data tiers if you limit processing to a few blocks via Data.inputBlocks
parameter. In this case the blocks will be automatically recalled from tape, but only if the total size is small (currently 10 TB). Alternatively, you can use the Data.partialDataset = True parameter to skip all tape recall procedures. Example of using the Data.inputBlocks
parameter: let's assume we want to recall /ZeroBias/Commissioning2021-v1/RAW#214fbb11-2d5c-4b8c-8ee5-b68de1d89153
and /ZeroBias/Commissioning2021-v1/RAW#8760f6e6-6c5e-4c65-854d-02d4bc3312c5
from /ZeroBias/Commissioning2021-v1/RAW:

config.JobType.pluginName = 'Analysis'
config.Data.inputDataset = '/ZeroBias/Commissioning2021-v1/RAW'
config.Data.inputBlocks = [
    '/ZeroBias/Commissioning2021-v1/RAW#214fbb11-2d5c-4b8c-8ee5-b68de1d89153',
    '/ZeroBias/Commissioning2021-v1/RAW#8760f6e6-6c5e-4c65-854d-02d4bc3312c5',
]
# or provide only block UUIDs
#config.Data.inputBlocks = [
#    '214fbb11-2d5c-4b8c-8ee5-b68de1d89153',
#    '8760f6e6-6c5e-4c65-854d-02d4bc3312c5',
#]
crab.log
files when they are uploaded via crab uploadlog
;
$CMSSW_BASE/lib
, $CMSSW_BASE/biglib
and $CMSSW_BASE/module
. One can also tell CRAB to include the directory $CMSSW_BASE/python
by setting JobType.sendPythonFolder = True
in the CRAB configuration.
data
, interface
and python
directories recursively found in $CMSSW_BASE/src
.
JobType.inputFiles
.
debug/crabConfig.py
).
debug/originalPSet.py
).
PSet.pkl
) plus a simple PSet.py
file to load the pickle file.
config.JobType.inputFiles
parameter, the directory structure inside the sandbox may be different and affect where the files are placed in the working directory of the job. For example, a file specified as /afs/cern.ch/user/e/erupeika/supportFiles/foo.root
or myfiles/foo.root
will appear as foo.root
in the sandbox and will be extracted as foo.root
to the job's root working directory.
If a directory foo
with files bar1
and bar2
inside it is specified in the inputFiles parameter, the sandbox will contain foo
, foo/bar1
and foo/bar2
(the working directory of the job will therefore also contain a directory foo
with files bar1
and bar2
).
mydir/file1, you should put config.JobType.inputFiles = ['mydir'] in the CRAB configuration, and of course avoid having extra stuff in that directory. If instead you put config.JobType.inputFiles = ['mydir/file1'], your application needs to open file1
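To make the two cases concrete, a minimal configuration sketch (mydir and file1 are the illustrative names used above):

from CRABClient.UserUtilities import config
config = config()

# Case 1: ship the whole directory; the job can then open 'mydir/file1'
config.JobType.inputFiles = ['mydir']

# Case 2 (alternative): ship only the file; the directory structure is lost
# and the job must open plain 'file1'
# config.JobType.inputFiles = ['mydir/file1']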
/store/user/[rucio/]<username>[/<subdirs>]
where username
is the CERN primary account username;
/store/group[/rucio]/<groupname>[/<subdirs>]
where groupname
can be any already existing directory under /store/group/
.
/store/local/<dir>[/<subdirs>]
is also allowed.
If /rucio
is present, CRAB will use Rucio for stageout. The user will need to have Rucio quota at the destination storage site.
These are all the allowed paths that can be set in the CRAB configuration parameter Data.outLFNDirBase
. If any other path is given, the submission of the task will fail.
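For reference, a sketch of the corresponding configuration lines (usernames, group names and subdirectories are placeholders):

from CRABClient.UserUtilities import config
config = config()

# pick ONE of the allowed forms:
config.Data.outLFNDirBase = '/store/user/<username>/<subdirs>'          # user area
# config.Data.outLFNDirBase = '/store/user/rucio/<username>/<subdirs>'  # Rucio-managed stageout
# config.Data.outLFNDirBase = '/store/group/<groupname>/<subdirs>'      # existing group area
# config.Data.outLFNDirBase = '/store/local/<dir>/<subdirs>'            # local area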
/store/user/
area that uses a different username than the one of my CERN primary account? /store/user/<username>/
(by default). If the store user area uses a different username, it's up to the destination site to remap that (via a symbolic link or something similar). The typical case is Fermilab; to request the mapping of the store user area, FNAL users should follow the directions on the usingEOSatLPC web page. If the user specifies in the Data.outLFN parameter of the CRAB configuration file an LFN directory path of the kind /store/user/[<some-username>/<subdir>*]
(i.e. a store path that starts with /store/user/
), CRAB will check if some-username
matches with the globally unique username extracted from the credential. If it doesn't, it will give an error message and not submit the task. The error message would be something like this:
Error contacting the server. Server answered with: Invalid input parameter Reason is: The parameter Data.outLFN in the CRAB configuration file must start with either '/store/user/<username>/' or '/store/group/<groupname>/' (or '/store/local/<something>/' if publication is off), wher...

Unfortunately the "Reason is:" message is cut at 200 characters. The message should read:
Reason is: The parameter Data.outLFN in the CRAB configuration file must start with either '/store/user/<username>/' or '/store/group/<groupname>/' (or '/store/local/<something>/' if publication is off), where username is your username as registered in CMS services (i.e. the username of your CERN primary account).A similar message should be given by crab checkwrite if the user does
crab checkwrite --site=<CMS-site-name> --lfn=/store/user/<some-username>
.
======== HTCONDOR JOB SUMMARY at ... START ========
CRAB ID: 1
Execution site: ...
Current hostname: ...
Destination site: ...
Output files: my_output_file.root=my_output_file_1.root

If the output file in question doesn't appear in that list, then CRAB doesn't know about it, and of course it will not be transferred. This doesn't mean that the output file was not produced; it is simply that CRAB has to know beforehand what are the output files that the job produces. If the output file is produced by either PoolOutputModule or TFileService, CRAB will automatically recognize the name of the output file when the user submits the task and it will add the output file name to the list of expected output files. On the other hand, if the output file is produced by any other module, the user has to specify the output file name in the CRAB configuration parameter
JobType.outputFiles
in order for CRAB to know about it. Note that this parameter takes a python list, so the right way to specify it is:
config.JobType.outputFiles = ['my_output_file.root']
/store/user/somename
to gsiftp://eosuserftp.cern.ch/eos/user/s/somename
which is the proper end point for writing to CERNBOX.
NOTE: since CERNBOX is NOT a CMS storage, files in there can not be listed in DBS nor moved with Rucio, nor transparently accessed by grid jobs.
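A minimal configuration sketch of a typical CERNBOX stageout, assuming the CERNBOX endpoint is exposed to CRAB as the T3_CH_CERNBOX site (this site name is an assumption, verify it e.g. with crab checkwrite):

from CRABClient.UserUtilities import config
config = config()

# assumption: CERNBOX is reachable as the T3_CH_CERNBOX storage site
config.Site.storageSite = 'T3_CH_CERNBOX'
config.Data.outLFNDirBase = '/store/user/<username>/'  # mapped by the site to /eos/user/<u>/<username>
config.Data.publication = False  # CERNBOX is not a CMS storage: output can not go in DBS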
Note about certificates: since June 2021 (at least) there is no longer any need to contact CERN IT support as indicated previously. A certificate which works for CRAB submission already satisfies the requirements for CERNBOX as well. If you have problems, you should verify in a CRAB-independent way via the following commands on lxplus (make sure to replace s/somename
with <the first letter of your CERN username>/<your CERN username>
e.g. b/belforte
)

voms-proxy-init -voms cms
gfal-ls gsiftp://eosuserftp.cern.ch/eos/user/s/somename
maxMemoryMB
) I can request?).
CRAB project directory:        /afs/cern.ch/work/b/belforte/CRAB3/TC3/dbg/zuolo/crab_TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8_1
Task name:                     190823_123943:dzuolo_crab_TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8_1
Grid scheduler - Task Worker:  crab3@vocms0107.cern.ch - crab-prod-tw01
Status on the CRAB server:     SUBMITTED
Task URL to use for HELP:      https://cmsweb.cern.ch/crabserver/ui/task/190823_123943%3Adzuolo_crab_TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8_1
Dashboard monitoring URL:      http://dashb-cms-job.cern.ch/dashboard/templates/task-analysis/#user=dzuolo&refresh=0&table=Jobs&p=1&records=25&activemenu=2&status=&site=&tid=190823_123943%3Adzuolo_crab_TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8_1
New dashboard monitoring URL:  https://monit-grafana.cern.ch/d/cmsTMDetail/cms-task-monitoring-task-view?orgId=11&var-user=dzuolo&var-task=190823_123943%3Adzuolo_crab_TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8_1
In case of issues with the new dashboard, please provide feedback to hn-cms-computing-tools@cern.ch
Status on the scheduler:       FAILED
No publication information (publication has been disabled in the CRAB configuration file)
Log file is /afs/cern.ch/work/b/belforte/CRAB3/TC3/dbg/zuolo/crab_TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8_1/crab.log

In this case you can get more information with
crab status --long
Job State Most Recent Site Runtime Mem (MB) CPU % Retries Restarts Waste Exit Code 0-1 no output T1_US_FNAL 1:00:18 1548 97 0 0 0:00:10 50664 0-2 no output T1_US_FNAL 1:00:15 1509 97 0 0 0:00:10 50664 0-3 no output T1_US_FNAL 1:00:18 1445 100 0 0 0:00:10 50664 0-4 no output T1_US_FNAL 1:00:17 1402 98 0 0 0:00:11 50664 0-5 no output T2_US_MIT 1:00:16 1656 94 0 0 0:00:10 50664
"Error: Failed to retrieve username from CRIC."
crab checkusername
uses the following sequence of bash commands, which you should try to execute one by one (make sure you have a valid proxy) to check if they return what is expected.
1) It gets the path to the users proxy file with the command
which scram >/dev/null 2>&1 && eval `scram unsetenv -sh`; voms-proxy-info -path

which should return something like
/tmp/x509up_u57506

2) It defines the path to the CA certificates directory with the following python command
import os
capath = os.environ['X509_CERT_DIR'] if 'X509_CERT_DIR' in os.environ else "/etc/grid-security/certificates"
print capath

which should be equivalent to the following bash command
if [ "x$X509_CERT_DIR" != "x" ]; then capath=$X509_CERT_DIR; else capath=/etc/grid-security/certificates; fi echo $capathand which in lxplus should result in
/etc/grid-security/certificates

3) It uses the proxy file and the capath to query
https://cmsweb.cern.ch/sitedb/data/prod/whoami
curl -s --capath <output-from-command-2-above> --cert <output-from-command-1-above> --key <output-from-command-1-above> 'https://cmsweb.cern.ch/sitedb/data/prod/whoami'

which should return something like
{"result": [ {"dn": "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atanasi/CN=710186/CN=Andres Jorge Tanasijczuk", "login": "atanasi", "method": "X509Proxy", "roles": {"operator": {"group": ["crab3"], "site": []}}, "name": "Andres Jorge Tanasijczuk"} ]}4) Finally it parses the output from the above query to extract the username from the "login" field (in my case it is
atanasi
).
When reporting a problem with crab checkusername
with "Failed to retrieve username from SiteDB." to the CRAB experts, it would be useful to add the output from the above commands.
crab checkusername
gives an error retrieving the username from SiteDB, this should not stop you from trying to submit jobs with CRAB, because the error might just be a problem with crab checkusername
itself and not a real problem with your registration in SiteDB (CRAB uses a different mechanism than the one described above to check the users' registration in SiteDB when attempting to submit jobs to the grid).
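If you prefer to replay the whole sequence in one go, here is an illustrative Python sketch (it assumes a valid proxy and the requests library, and uses the whoami URL quoted above; adapt the URL if your setup queries CRIC instead):

# Illustrative sketch: replay the crab checkusername steps in one script.
import os
import subprocess
import requests

# 1) path to the user's proxy file
proxy = subprocess.check_output(["voms-proxy-info", "-path"]).strip().decode()

# 2) CA certificates directory
capath = os.environ.get("X509_CERT_DIR", "/etc/grid-security/certificates")

# 3) whoami query (URL as quoted above; a CRIC-based setup would use a different URL)
url = "https://cmsweb.cern.ch/sitedb/data/prod/whoami"
reply = requests.get(url, cert=(proxy, proxy), verify=capath)

# 4) the username is in the "login" field of the returned JSON
print(reply.text)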
"Task could not be submitted because there is no DISK replica"
crab status
command will tell you how to monitor progress. Note that rucio
commands are currently not compatible with the CMSSW environment and you will need a new shell window. You can find documentation for Rucio in the CMS Rucio twiki page
3. If you have a special need which can't be solved by the available automatic procedure, you can contact the CMS Data Transfer team via mail: cms-comp-ops-transfer-team@cern.ch
"User quota limit reached; cannot upload the file"
crab submit
the user may get this error message:
Error contacting the server. Server answered with: Invalid input parameter Reason is: User quota limit reached; cannot upload the file

Error explanation: The user has reached the limit of 4.88GB in his/her CRAB cache area. Read more in this FAQ.
What to do: Files in the CRAB cache are automatically deleted after 5 days, but the user can clean his/her cache area at any time. See how in this FAQ.
"Trapped exception in Dagman.Fork"
crab status
:
Failure message: The CRAB server backend was not able to (re)submit the task, because the Grid scheduler answered with an error. This is probably a temporary glitch. Please try again later. If the error persists send an e-mail to hn-cms-computing-tools@cern.ch.
Error reason: Trapped exception in Dagman.Fork: <type 'exceptions.RuntimeError'> Unable to edit jobs matching constraint <traceback object at 0xa113368>
  File "/data/srv/TaskManager/3.3.1512.rc6/slc6_amd64_gcc481/cms/crabtaskworker/3.3.1512.rc6/lib/python2.6/site-packages/TaskWorker/Actions/DagmanResubmitter.py", line 113, in executeInternal
    schedd.edit(rootConst, "HoldKillSig", 'SIGKILL')

As the error message says, this should be a temporary failure. One should just keep trying until it works. But after doing
crab resubmit
, give it some time to process the resubmission request; it may take a couple of minutes to see the jobs reacting to the resubmission.
"Task failed to bootstrap on schedd"
crab submit
and crab status
the user may get this error message:
Task status: UNKNOWN
Error during task injection: Task failed to bootstrap on schedd

Error explanation: The submission of the task to the scheduler machine has failed.
What to do: Submit again.
"Failed to contact Schedd"
crab status
the user may get one of these error messages:
Error during task injection: <task-name>: Failed to contact Schedd: Failed to fetch ads from schedd.
Error during task information retrieval: <task-name>: Failed to contact Schedd: .

Error explanation: This is a temporary communication error with the scheduler machine (submission node), most probably because the scheduler is overloaded.
What to do: Try again after a couple of minutes.
"Splitting task ... on dataset ... with ... method does not generate any job"
crab status
:
"Block ... contains more than 100000 lumis and cannot be processed for splitting. For memory/time contraint big blocks are not allowed. Use another dataset as input."
userInputFiles
feature of CRAB. An annotated example of how to do this in python is below. Note that you have to disable DBS publication, use file-based splitting and provide the input file locations; other configuration parameters can be set as usual:
# this will use CRAB client API
from CRABAPI.RawCommand import crabCommand

# talk to DBS to get list of files in this dataset
from dbs.apis.dbsClient import DbsApi
dbs = DbsApi('https://cmsweb.cern.ch/dbs/prod/global/DBSReader')
dataset = '/BsToJpsiPhiV2_BFilter_TuneZ2star_8TeV-pythia6-evtgen/Summer12_DR53X-PU_RD2_START53_V19F-v3/AODSIM'
fileDictList = dbs.listFiles(dataset=dataset)
print ("dataset %s has %d files" % (dataset, len(fileDictList)))
# DBS client returns a list of dictionaries, but we want a list of Logical File Names
lfnList = [dic['logical_file_name'] for dic in fileDictList]

# this is now standard CRAB configuration
from WMCore.Configuration import Configuration
config = Configuration()

config.section_("General")
config.General.transferLogs = False

config.section_("JobType")
config.JobType.pluginName = 'Analysis'
# in following line of course replace with your favorite pset
config.JobType.psetName = 'demoanalyzer.py'

config.section_("Data")
# following 3 lines are the trick to skip DBS data lookup in CRAB Server
config.Data.userInputFiles = lfnList
config.Data.splitting = 'FileBased'
config.Data.unitsPerJob = 1
# since the input will have no metadata information, output can not be put in DBS
config.Data.publication = False

config.section_("User")

config.section_("Site")
# since there is no data discovery and no data location lookup in CRAB
# you have to say where the input files are
config.Site.whitelist = ['T2_CH_CERN']
config.Site.storageSite = 'T2_CH_CERN'

result = crabCommand('submit', config = config)
print (result)
"Block ... contains more than 100000 lumis."
useParent
is allowed since in that case CRAB uses parentage information stored in DBS to match input files.
In practice your crabConfig file must have:
config.Data.splitting = 'FileBased'
config.Data.runRange = ''
config.Data.lumiMask = ''

(the parameters with an assigned null value '' can be omitted, but if present must indicate the null string)
and must NOT contain the following parameter
config.Data.secondaryInputDataset
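Putting these requirements together, a minimal sketch (the commented useParent line shows the one exception mentioned above):

from CRABClient.UserUtilities import config
config = config()

config.Data.splitting = 'FileBased'
config.Data.runRange = ''   # may be omitted, but if present must be the null string
config.Data.lumiMask = ''   # same
# config.Data.useParent = True   # allowed: parent matching uses DBS parentage, not lumis
# config.Data.secondaryInputDataset must NOT be set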
crab report
, but while DBS information is available forever, crab commands on a specific task may not be.
"Syntax error in CRAB configuration"
crab submit
the user may get one of these error messages:
Syntax error in CRAB configuration: invalid syntax (<CRAB-configuration-file-name>.py, <line-where-error-occurred>)
Syntax error in CRAB configuration: 'Configuration' object has no attribute '<attribute-name>'

Error explanation: The CRAB configuration file could not be loaded, because there is a syntax error somewhere in it.
What to do: Check the CRAB configuration file and fix it. There could be a misspelled parameter or section name, or you could be trying to use a configuration attribute (parameter or section) that was not defined. To get more details on where the error occurred, do:
python
import <CRAB-configuration-file-name>  # without the '.py'

which gives:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<CRAB-configuration-file-name>.py", <line-where-error-occurred>
    <error-python-code>
    ^

or
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<CRAB-configuration-file-name>.py", <line-where-error-occurred>, in <module>
    <error-python-code>
AttributeError: 'Configuration' object has no attribute '<attribute-name>'

For more information about the CRAB configuration file, see CRAB3ConfigurationFile.
CRAB project directory
crab
command fails with messages like "Cannot find .requestcache file"
or "... is not a valid CRAB project directory"
or otherwise complains that it can not find the tasks you are trying to send the command to, a problem with local directory where crab submit
caches relevant information is likely (maybe disk got full, or corrupted, or you removed a file unintentionally).
Please find more information about the CRAB project directory
and possible recovery action on your site in CRAB3Commands#CRAB_project_directory
An exception of category 'DictionaryNotFound' occurred
, like in this example:
----- Begin Fatal Exception 08-Jun-2017 18:18:04 CEST-----------------------
An exception of category 'DictionaryNotFound' occurred while
  [0] Constructing the EventProcessor
Exception Message:
No Dictionary for class: 'edm::Wrapper<edm::DetSetVector<CTPPSDiamondDigi> >'
----- End Fatal Exception -------------------------------------------------

In this case, most likely the input data have been produced with a CMSSW version not compatible with the one used in the CRAB job. In general, reading data with a release older than the one they were produced with is not supported. To find out which release was used to produce a given dataset or file, adapt the following examples to your situation:
belforte@lxplus045/~> dasgoclient --query "release dataset=/DoubleMuon/Run2016C-18Apr2017-v1/AOD" ["CMSSW_8_0_28"] belforte@lxplus045/~>
belforte@lxplus045/~> dasgoclient --query "release file=/store/data/Run2016C/DoubleMuon/AOD/18Apr2017-v1/100001/56D1FA6E-D334-E711-9967-0025905A48B2.root" ["CMSSW_8_0_28"] belforte@lxplus045/~>
Data.inputDataset
, unless you have set Data.ignoreLocality = True
, or except in cases like using a (secondary) pile-up dataset. If yours is the latter case, please read Using pile-up in this same twiki.
If you intentionally wanted (and had a good reason) to run jobs reading the input files via AAA, then yes, we have to care about why AAA failed. Since AAA must be able to access any CMS site, the next thing is to discard a transient problem: you can submit your jobs again and see if the error persists. Ultimately, you should write to the Computing Tools HyperNews forum (this is a forum for all kinds of issues with CMS computing tools, not only CRAB) following these instructions.
JobType.maxMemoryMB
parameter in the CRAB configuration file. Uselessly requesting too much RAM is very likely to result in wasted CPU (we will run fewer jobs than there are CPU cores available in a node, to spread the available RAM in fewer, larger chunks), so you have to be careful; abuse will be monitored and tasks may get killed.
Each user is responsible for her/his code and needs to make sure that memory usage is under control. Various tools exist to identify and prevent memory leaks in C++ which are outside the scope of the CRAB documentation. Generally speaking, when investigating memory usage you want to make sure that you run on the same input as the job which resulted in memory problems, as usage can depend on the number, sequence and kind of events processed. Users may also benefit from the crab preparelocal command to replay one specific job interactively and monitor memory usage.
An important exception is in case the user runs multi-threaded applications, in particular CMSSW. In that case a single job will use multiple cores and not only can, but must use more than the default 2GB of RAM. It is up to the user to request the proper amount of memory, e.g. after measuring it running the code interactively, or by looking up what Production is using in similar workflows. As a generic rule of thumb, (1+1*num_threads) GB may be a good starting point.
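Following the rule of thumb above ((1 + 1 × num_threads) GB), a sketch for a 4-thread job; treat the numbers as a starting point and tune them to what you actually measure:

from CRABClient.UserUtilities import config
config = config()

config.JobType.numCores = 4
# rule of thumb from above: (1 + 1*num_threads) GB => about 5 GB for 4 threads
config.JobType.maxMemoryMB = 5000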
JobType.pluginName = 'PrivateMC'
instead of JobType.pluginName = 'Analysis'
.
JobType.pluginName = 'PrivateMC'
, but in the CMSSW parameter set configuration he/she has specified a source of type PoolSource
. The solution is to not specify a PoolSource
. Note: This doesn't mean to remove process.source
completely, as this attribute must be present. One could set process.source = cms.Source("EmptySource")
if no input source is used.
USER_CXXFLAGS="-g" scram b
== CMSSW: [1] Reading branch EventAuxiliary
== CMSSW: [2] Calling XrdFile::readv()
== CMSSW: Additional Info:
== CMSSW:   [a] Original error: '[ERROR] Operation expired' (errno=0, code=206, source=xrootd.echo.stfc.ac.uk:1094 (site T1_UK_RAL)).

As of winter 2019 this almost only happens for files stored at T1_UK_RAL. If you are in this situation, a way out is to submit a new task using CMSSW ≥ 10_4_0 with the following
duplicateCheckMode
option in the PSet PoolSource
process.source = cms.Source("PoolSource", [...] duplicateCheckMode = cms.untracked.string("noDuplicateCheck") )When that is not an option and the problem is persistent, you may need to ask for a replica of the data at another site.
sys.path
and sys.argv
.
A problem arises when the CRAB configuration parameter JobType.pyCfgParams
is used. The arguments in JobType.pyCfgParams
are added by CRAB to sys.argv
, affecting the value of the key that identifies a CMSSW parameter-set in the above mentioned cache. And that's in principle fine, as changing the arguments passed to the CMSSW parameter-set may change the event processor. But when a python process has to do more than one submission (like the case of multicrab for multiple submissions), the CMSSW parameter-set is loaded again every time the JobType.pyCfgParams
is changed, and this may result in "duplicate process" errors. Below are two examples of these kinds of errors:
CmsRunFailure CMSSW error message follows. Fatal Exception An exception of category 'Configuration' occurred while [0] Constructing the EventProcessor [1] Constructing module: class=...... label=...... Exception Message: Duplicate Process The process name ...... was previously used on these products. Please modify the configuration file to use a distinct process name.
CmsRunFailure CMSSW error message follows. Fatal Exception An exception of category 'Configuration' occurred while [0] Constructing the EventProcessor Exception Message: MessageLogger PSet: in vString categories duplication of the string ...... The above are from MessageLogger configuration validation. In most cases, these involve lines that the logger configuration code would not process, but which the cfg creator obviously meant to have effect.
FATAL ERROR: A different CMSSW configuration was already cached. Either configuration file name or configuration parameters have been changed. But CMSSW configuration files can't be loaded more than once in memory.

One option would be to try to not use
JobType.pyCfgParams
. But if this is not possible, the more general ad-hoc solution would be to fork the submission into a different python process. For example, if you are doing something like documented in Multicrab using the crabCommand API
then we suggest to replace each
submit(config)

by
from multiprocessing import Process
p = Process(target=submit, args=(config,))
p.start()
p.join()

(Of course,
from multiprocessing import Process
needs to be executed only once, so put it outside any loop.)
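For example, a multi-submission loop along these lines (dataset names, pset and storage site are placeholders) keeps each submission in its own process:

# Sketch of a multi-submission script where each submit runs in a separate
# process, avoiding the "configuration already cached" problem described above.
from multiprocessing import Process
from CRABAPI.RawCommand import crabCommand
from CRABClient.UserUtilities import config as make_config

def submit(cfg):
    crabCommand('submit', config=cfg)

datasets = ['/SomeDataset/EraA/MINIAOD', '/SomeDataset/EraB/MINIAOD']  # placeholders

for i, dataset in enumerate(datasets):
    cfg = make_config()
    cfg.General.requestName = 'task_%d' % i
    cfg.JobType.pluginName = 'Analysis'
    cfg.JobType.psetName = 'pset.py'                    # placeholder
    cfg.JobType.pyCfgParams = ['dataset=%s' % dataset]  # the parameter that forces forking
    cfg.Data.inputDataset = dataset
    cfg.Site.storageSite = 'T2_CH_CERN'                 # placeholder
    p = Process(target=submit, args=(cfg,))
    p.start()
    p.join()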
PSetDump.py
file (found in task_directory/inputs
) differs for the tasks from a multiple-submission python file, try forking the submission into different python processes, as recommended in the previous FAQ.
crab resubmit
command
crab kill
. Killing the current task will guarantee that no change happens anymore
from CRABClient.UserUtilities import config, getLumiListInValidFiles
from WMCore.DataStructs.LumiList import LumiList

config = config()
config.General.requestName = 'TaskB'
...
# you want to use same Pset as in previous task, in order to publish in same dataset
config.JobType.psetName = <TaskA-psetName>
...
# and of course same input dataset
config.Data.inputDataset = <TaskA-input-dataset-name>
config.Data.inputDBS = 'global'  # but this will work for a dataset in phys03 as well

# now the list of lumis that you successfully processed in Task-A
# it can be done in two ways. Uncomment and edit the appropriate one:
# 1. (recommended) when Task-A output was a dataset published in DBS
#taskALumis = getLumiListInValidFiles(dataset=<TaskA-output-dataset-name>, dbsurl='phys03')
# or 2. when output from Task-A was not put in DBS
#taskALumis = LumiList(filename=<the LumiSummary.json file from running crab report on Task-A>)

# now the current list of golden lumis for the data range you are interested in,
# which can be different from the one used in Task-A
officialLumiMask = LumiList(filename='<some-kosher-name>.json')

# this is the main trick. Mask out also the lumis which you processed already
newLumiMask = officialLumiMask - taskALumis

# write the new lumiMask file, now you can use it as input to CRAB
newLumiMask.writeJSON('my_lumi_mask.json')

# and there we go: process from the input dataset all the lumis listed in the current
# officialLumiMask file, skipping the ones you already have
config.Data.lumiMask = 'my_lumi_mask.json'
config.Data.outputDatasetTag = <TaskA-outputDatasetTag>  # add to your existing dataset
...

IMPORTANT NOTE: in this way you will add any lumi section in the initial dataset that was turned from bad to good in the golden list after you ran Task-A, but if some of those data evolved the other way around (from good to bad), there is no way to remove those from your published dataset.
inputDataset
.
Rationale and details: CRAB will do data discovery for the primary input dataset (given in the Data.inputDataset
parameter) and submit the jobs to sites where this dataset is hosted. If there is no primary input dataset, CRAB will submit the jobs to the less busy sites. In any case, if the pile-up files are not hosted in the execution sites, they will be accessed via AAA (Xrootd). But reading the "signal" events directly from the local storage and the pile-up events via AAA is more inefficient than doing the other way around, since for each "signal" event that is read one needs to read in general many (> 20) pile-up events. Therefore, it is highly recommended that the user forces CRAB to submit the jobs to the sites where the pile-up dataset is hosted by whitelisting these sites using the parameter Site.whitelist
in the CRAB configuration file. Note that one also needs to set Data.ignoreLocality = True
in the CRAB configuration file when using a primary input dataset, so as to avoid CRAB doing data discovery and eventually complaining (and failing to submit) that the input dataset is not available in the whitelisted sites. One can use DAS to find out at which sites the pile-up dataset is hosted.

To request a GPU for your jobs, set:

config.section_("Site")
config.Site.requireAccelerator = True

You can view the sites and available GPUs from the CMS Submission Infrastructure: GPUs monitor.
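Returning to the pile-up recommendation above, a minimal sketch of the whitelisting setup (the site names are placeholders for wherever the pile-up dataset is actually hosted):

from CRABClient.UserUtilities import config
config = config()

# force jobs to run where the (large) pile-up dataset is hosted ...
config.Site.whitelist = ['T2_DE_DESY', 'T2_US_Wisconsin']  # placeholders
# ... and skip the data-location check on the primary input dataset
config.Data.ignoreLocality = True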
crab preparelocal
command. Please refer to crab preparelocal help
There is an example of how to use this to submit one CRAB task on the CERN HTCondor batch system. Those instructions will also work for FNAL LPC HTCondor.
Data.inputDataset
configuration parameter:
Data.allowNonValidInputDataset = True
in the CRAB configuration.
is_file_valid
flag in DBS, usually set to False when file is lost or corrupted. CRAB considers only valid files in the dataset. Invalid files are skipped.
phys03
instance, the above is modified as follows: the location is taken from the origin_site_name
in DBS and data are assumed to never move. If datasets are moved, the user can update the origin_site_name
. There is no way to have multiple locations.
maxMemoryMB
) I can request?).
gfal-*
commands from a machine that has GFAL2 utility tools installed (e.g. lxplus). You have to pass Physical File Names (PFNs) as arguments to the commands. To get the Physical File Name given a Logical File Name and a CMS node name, you can use the lfn2pfn
PhEDEx API.
LFNs are names like /store/user/mario/myoutput
; note that a directory is also a file name.
For example, for the LFN /store/user/username/myfile.root
stored in T2_IT_Pisa
you can do the following (make sure you did cmsenv
before, so to use a new version of curl
), where you can replace the first two lines with the values which are useful to you and simply copy/paste the long curl command:
site=T2_IT_Pisa
lfn=/store/user/username/myfile.root
curl -ks "https://cmsweb.cern.ch/phedex/datasvc/perl/prod/lfn2pfn?node=${site}&lfn=${lfn}&protocol=srmv2" | grep PFN | cut -d "'" -f4

which returns:
srm://stormfe1.pi.infn.it:8444/srm/managerv2?SFN=/cms/store/user/username/myfile.root

Before executing the
gfal
commands, make sure to have a valid proxy:
voms-proxy-init -voms cms
Enter GRID pass phrase for this identity:
Contacting voms2.cern.ch:15002 [/DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch] "cms"...
Remote VOMS server contacted succesfully.
Created proxy in /tmp/x509up_u<user-id>.
Your proxy is valid until <some date-time 12 hours in the future>

The most useful
gfal
commands and their usage syntax for listing/removing/copying files/directories are in the examples below (it is recommended to unset the environment when executing gfal
commands, i.e. to add env -i
in front of the commands). See also the man
entry for each command (man gfal-ls
etc.):
List a (remote) path:
env -i X509_USER_PROXY=/tmp/x509up_u$UID gfal-ls <physical-path-name-to-directory>

Remove a (remote) file:

env -i X509_USER_PROXY=/tmp/x509up_u$UID gfal-rm <physical-path-name-to-file>

Recursively remove a (remote) directory and all files in it:

env -i X509_USER_PROXY=/tmp/x509up_u$UID gfal-rm -r <physical-path-name-to-directory>

Copy a (remote) file to a directory in the local machine:

env -i X509_USER_PROXY=/tmp/x509up_u$UID gfal-copy <physical-path-name-to-source-file> file://<absolute-path-to-local-destination-directory>

Note: the
<absolute-path-to-local-destination-directory>
starts with /
therefore there are three consecutive /
characters like file:///tmp/somefilename.root
Debug.extraJDL
CRAB configuration parameter :
config.section_("Debug") config.Debug.extraJDL = ['+CMS_ALLOW_OVERFLOW=False']Note: if you change this configuration option for an already-created task (for instance if you noticed a lot of job failures at a particular site and even after blacklisting the jobs keep going back), you can't simply change the option in the configuration and resubmit. You'll have to kill the existing task and make a new task to get the option to be accepted. You can't simply change it during resubmission.
LumiList.py
(available in the WMCore library):

from WMCore.DataStructs.LumiList import LumiList
lumiList = LumiList(filename='my_original_lumi_mask.json')
lumiList.selectRuns([x for x in range(193093, 193999+1)])
lumiList.writeJSON('my_lumi_mask.json')

config.Data.lumiMask = 'my_lumi_mask.json'

Example 2: Use a new lumi-mask file that is the intersection of two other lumi-mask files.
from WMCore.DataStructs.LumiList import LumiList
originalLumiList1 = LumiList(filename='my_original_lumi_mask_1.json')
originalLumiList2 = LumiList(filename='my_original_lumi_mask_2.json')
newLumiList = originalLumiList1 & originalLumiList2
newLumiList.writeJSON('my_lumi_mask.json')

config.Data.lumiMask = 'my_lumi_mask.json'

Example 3: Use a new lumi-mask file that is the union of two other lumi-mask files.
from WMCore.DataStructs.LumiList import LumiList
originalLumiList1 = LumiList(filename='my_original_lumi_mask_1.json')
originalLumiList2 = LumiList(filename='my_original_lumi_mask_2.json')
newLumiList = originalLumiList1 | originalLumiList2
newLumiList.writeJSON('my_lumi_mask.json')

config.Data.lumiMask = 'my_lumi_mask.json'

Example 4: Use a new lumi-mask file that is the subtraction of two other lumi-mask files.
from WMCore.DataStructs.LumiList import LumiList
originalLumiList1 = LumiList(filename='my_original_lumi_mask_1.json')
originalLumiList2 = LumiList(filename='my_original_lumi_mask_2.json')
newLumiList = originalLumiList1 - originalLumiList2
newLumiList.writeJSON('my_lumi_mask.json')

config.Data.lumiMask = 'my_lumi_mask.json'
Subject: WARNING: Reaching your quota

Dear analysis user <username>,
You are using <X>% of your disk quota on the server <schedd-name>. The moment you reach the disk quota of <Y>GB, you will be unable to run jobs and will experience problems recovering outputs. In order to avoid that, you have to clean up your directory at the server. Here are the instructions to do so:
https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrabFaq#How_to_clean_up_your_directory_i
Here it is a more detailed description of the issue:
https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrabFaq#Disk_space_for_output_files
If you have any questions, please contact hn-cms-computing-tools(AT)cern.ch
Regards,
CRAB support

This e-mail has a link (https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrabFaq#How_to_clean_up_your_directory_i) to the instructions on how to clean up space in the user's home directory in a schedd. A user can follow the instructions in that page, or alternatively use the crab purge command.
crab-env-bootstrap.sh
script to overcome the CRAB3 and CMSSW environment conflicts /cvmfs/cms.cern.ch/crab3/crab-env-bootstrap.sh
) without the need to source the CRAB3 environment. You could do something like this:
cmsenv
# DO NOT setup the CRAB3 environment
alias crab='/cvmfs/cms.cern.ch/crab3/crab-env-bootstrap.sh'
crab submit
crab status
...
# check that you can run cmsRun locally

Details: The usual way to setup CRAB3 is to first source the CMSSW environment using
cmsenv
and then source the CRAB3 environment using source /cvmfs/cms.cern.ch/crab3/crab.(c)sh
. This setup procedure has the disadvantage that, depending on which CMSSW version is used, once the CRAB3 environment is sourced the CMSSW commands like cmsRun will stop working (also other useful commands like gfal-copy
will not work). Solving this at the root and making the CRAB client RPM compatible with the CMSSW ones is not possible because of the way the tools in the COMP repository are built, and because cmsweb has its own release cycle independent from CMSSW.
To overcome this limitation we are now providing a wrapper bash script that can be run in place of the usual crab command. This wrapper script will take care of setting the environment in the correct way before running the usual crab command, and will leave the environment as it was when exiting. The script will be soon available in the CMSSW distribution under the name 'crab' and its usage will be transparent to the user: you will just run the crab commands as you would have done before. In the meantime, the script is available for testing here: /cvmfs/cms.cern.ch/crab3/crab-env-bootstrap.sh
.
[line 714] p = tLPTest("MyType", **{ "a"+str(x): tLPTestType(x) for x in xrange(0,300) })

This uses dictionary comprehensions, a feature only available in python ≥ 2.7. While CMSSW (setup via
cmsenv
) uses python > 2.7, CRAB (setup via /cvmfs/cms.cern.ch/crab3/crab.sh
) still uses python 2.6.8. To overcome this problem, don't setup the CRAB environment and instead use the crab-env-bootstrap.sh
script (see this FAQ).