The tips available via the show links are in increasing order of difficulty. If you get stuck on an exercise, try to solve it by showing only the first tip before looking at the second one.
Exercise 1 - dry run
The following work area was prepared for this exercise: /afs/cern.ch/cms/ccs/wm/scripts/Crab/CRAB3/AdvTutorial/Exercise_1. It contains
the CMSSW parameter set configuration files PSet.py and PSet.pkl;
the CMSSW release area CMSSW_7_2_3_patch1 with additional packages used by the above CMSSW parameter set.
In order to use this already prepared work area with its full environment, execute cmsenv directly inside /afs/cern.ch/cms/ccs/wm/scripts/Crab/CRAB3/AdvTutorial/Exercise_1/CMSSW_7_2_3_patch1/src/. Then you can set up CRAB3 and execute CRAB commands from any other location you wish (for example your home area); you will only need to copy the PSet files to that location.
I am trying to analyze /VBF_HToTauTau_M-125_13TeV-powheg-pythia6/Phys14DR-PU20bx25_tsg_PHYS14_25_V1-v2/MINIAODSIM with the following splitting parameters:
and after waiting one day I am getting the following output with CRAB status:
Error Summary:
1 jobs failed with exit code 50664:
1 jobs failed with following error message: (for example, job 1)
Not retrying job due to wall clock limit (job automatically killed on the worker node)
By default jobs cannot run more than 24 hours on the Grid worker nodes. They are automatically killed if their runtime (called wall clock time) exceeds this limit. That is what happened to these jobs: they were killed because they reached the wall clock limit.
1.B) Prepare a CRAB configuration to analyze this dataset and execute crab submit --dryrun.
Help:
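A minimal configuration sketch for the dry run (not an official solution); the request name is arbitrary and the splitting values are placeholders that you will tune in exercise 1.D:
from CRABClient.UserUtilities import config
config = config()
config.General.requestName = 'AdvTutorial_Exercise1_dryrun'  # any name you like
config.JobType.pluginName = 'Analysis'
config.JobType.psetName = 'PSet.py'  # the parameter set provided in the work area
config.Data.inputDataset = '/VBF_HToTauTau_M-125_13TeV-powheg-pythia6/Phys14DR-PU20bx25_tsg_PHYS14_25_V1-v2/MINIAODSIM'
config.Data.splitting = 'LumiBased'
config.Data.unitsPerJob = 10  # placeholder; revisit it in exercise 1.D
config.Site.storageSite = '<site where you have permission to write>'
Then run 'crab submit --dryrun -c <your-config-file>.py' and look at the splitting and runtime estimates it reports.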
1.D) Try to come up with better splitting parameters in the CRAB configuration: your target for this exercise is jobs that run for about 8 hours (estimate the number of events per lumi from DAS). Use 'crab submit --dryrun' (and please don't submit the task!).
Help:
Looking at the first file in the dataset, it has on average ~100 events per lumi. One way to get this number is to use DAS to get the list of lumis in the file [1] and count them (for example with python), then divide the number of events in the file (which you can also find in DAS) by the number of lumis:
$ python
>>> lumis = [[3967, 3969], [3973, 3973], ...]
>>> sum([y-x+1 for x, y in lumis])
431
>>> 42902 / 431
99
[1] https://cmsweb.cern.ch/das/request?input=lumi%20file%3D/store/mc/Phys14DR/VBF_HToTauTau_M-125_13TeV-powheg-pythia6/MINIAODSIM/PU20bx25_tsg_PHYS14_25_V1-v2/00000/147B369C-9F77-E411-B99D-00266CF9B184.root&instance=prod/global&idx=0&limit=10
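To turn the events-per-lumi estimate into splitting parameters, here is a rough sketch; the time per event is an assumed placeholder that you should replace with your own measurement (for example from a short local cmsRun test with the Timing service, as in exercise 5):
# Back-of-the-envelope estimate of Data.unitsPerJob for ~8-hour jobs.
# The time per event is ASSUMED here purely for illustration.
events_per_lumi = 100          # from the DAS estimate above
time_per_event = 4.0           # seconds per event (assumed placeholder)
target_walltime = 8 * 3600     # 8 hours, in seconds
events_per_job = target_walltime / time_per_event
print "Data.unitsPerJob ~", int(events_per_job / events_per_lumi), "lumis per job"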
2.A) Run a task similar to the one in section Running CMSSW analysis with CRAB on Data of the CRAB3 (basic) tutorial, but with each job analyzing exactly one input file and with the task analyzing in total five input files. Also, don't use any lumi-mask nor run-range.
Help:
from CRABClient.UserUtilities import config
config = config()
config.General.requestName = 'CRAB3_Advanced_Tutorial_June2015_Exercise2A'
config.General.transferOutputs = True
config.General.transferLogs = False
config.JobType.pluginName = 'Analysis'
config.JobType.psetName = 'pset_tutorial_analysis.py'
config.Data.inputDataset = '/SingleMu/Run2012B-13Jul2012-v1/AOD'
config.Data.splitting = 'FileBased'
config.Data.unitsPerJob = 1
config.Data.totalUnits = 5
config.Data.publication = True
config.Data.outputDatasetTag = config.General.requestName
config.Site.storageSite = <site where you have permission to write>
2.B) Look at which five input files were analyzed by task A and create a local text file containing the LFNs of these five files. Submit a task to analyze the same input files as task A, but instead of specifying the input dataset use the local text file you just created. Once task A and task B have finished, check that both have published a similar output dataset. The goal of exercises 2.A and 2.B is to show that it is possible to run CRAB over published files treating them as "user input files" (i.e. using Data.userInputFiles instead of Data.inputDataset), although this is of course not recommended.
Note: When running over user input files, CRAB will not try to find out at which sites the input files are stored. Instead, CRAB will submit the jobs to the least busy sites. If the input files are not stored at those sites, they will be accessed via Xrootd. Since Xrootd access is less efficient than local access, it is recommended to force CRAB to submit the jobs to the sites where the input files are stored by whitelisting those sites (a configuration sketch is shown after the help below).
Help:
How to know which input files were analyzed by task A?
Look for example in the job log files linked from the monitoring pages. At the very beginning of each job log file, it should say which input files were analyzed by the corresponding job.
Answer:
/store/data/Run2012B/SingleMu/AOD/13Jul2012-v1/0000/008DBED0-86D3-E111-AEDF-20CF3019DF17.root
/store/data/Run2012B/SingleMu/AOD/13Jul2012-v1/0000/00F9715A-A1D3-E111-BE6F-E0CB4E29C4D1.root
/store/data/Run2012B/SingleMu/AOD/13Jul2012-v1/0000/00100164-41D4-E111-981B-20CF3027A5AF.root
/store/data/Run2012B/SingleMu/AOD/13Jul2012-v1/0000/0093BEF2-A4D3-E111-A6B9-E0CB4E19F979.root
/store/data/Run2012B/SingleMu/AOD/13Jul2012-v1/0000/0009F1CC-0DD4-E111-974D-20CF305B0584.root
How to know at which sites these input files are stored?
Search for the input dataset /SingleMu/Run2012B-13Jul2012-v1/AOD in DAS.
Answer:
T2_BE_UCL, T2_IT_Legnaro, T2_RU_JINR, T1_US_FNAL, T1_IT_CNAF
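Putting the pieces together, here is a minimal configuration sketch for task B (not an official solution). It assumes the five LFNs were saved to a local file named input_files.txt (a hypothetical name) and uses the site list found in DAS above:
from CRABClient.UserUtilities import config
config = config()
config.General.requestName = 'CRAB3_Advanced_Tutorial_June2015_Exercise2B'  # any name you like
config.General.transferOutputs = True
config.JobType.pluginName = 'Analysis'
config.JobType.psetName = 'pset_tutorial_analysis.py'
# No Data.inputDataset here: the input files are passed directly as a list of LFNs.
config.Data.userInputFiles = open('input_files.txt').read().splitlines()
# With user input files CRAB cannot infer the primary dataset, so it may need to be set by hand for publication.
config.Data.outputPrimaryDataset = 'SingleMu'
config.Data.splitting = 'FileBased'
config.Data.unitsPerJob = 1
config.Data.publication = True
config.Data.outputDatasetTag = config.General.requestName
config.Site.storageSite = '<site where you have permission to write>'
# Whitelist the sites hosting the input files (from DAS) to avoid remote Xrootd reads.
config.Site.whitelist = ['T2_BE_UCL', 'T2_IT_Legnaro', 'T2_RU_JINR', 'T1_US_FNAL', 'T1_IT_CNAF']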
2.C) Run a task as in 2.A, but turning off the publication.
2.D) Run a task over the output files from task C (the output files from task C are not published; so this is a typical case of user input files) using a local text file containing the LFNs of the output files from task C. Whitelist the site where you stored the output files from task C.
Help:
How to get the LFNs of the output files from task C?
Use 'crab getoutput --dump'.
Exercise 3 - recovery task
Assume the following CRAB configuration, which uses the same CMSSW parameter-set configuration as in exercise 2, and produces 213 jobs:
from CRABClient.UserUtilities import config
config = config()
config.General.requestName = 'CRAB3_Advanced_Tutorial_June2015_Exercise3C'
config.General.transferOutputs = True
config.General.transferLogs = False
config.JobType.pluginName = 'Analysis'
config.JobType.psetName = 'pset_tutorial_analysis.py'
# Assume for this exercise that the default job runtime limit is 1 hour.
config.JobType.maxJobRuntimeMin = 60
config.Data.inputDataset = '/SingleMu/Run2012B-13Jul2012-v1/AOD'
config.Data.splitting = 'LumiBased'
config.Data.unitsPerJob = 240
config.Data.lumiMask = 'https://cms-service-dqm.web.cern.ch/cms-service-dqm/CAF/certification/Collisions12/8TeV/Prompt/Cert_190456-208686_8TeV_PromptReco_Collisions12_JSON.txt'
config.Data.publication = True
config.Data.outputDatasetTag = config.General.requestName
config.Site.storageSite = <site where the user has permission to write>
3.A) Imagine the following situation. After submitting the task you did crab status and you got a message saying: Your task failed to bootstrap on the Grid scheduler .... What would you do and why?
The task has not been submitted to the Grid, so there is nothing to resubmit (or to recover). One has to submit the task again.
3.B) Imagine the following situation. You submitted the task and 50 jobs have failed in transferring the output files to the destination storage. You were told that there was a temporary issue with the storage and that it has been fixed now. What would you do and why?
Since the issue was a temporary problem not related to the jobs (which were otherwise finishing without problems), resubmitting the failed
jobs would be the most reasonable choice.
3.C) Imagine the following situation. You submitted the task and 50 jobs were killed at the worker nodes, because they reached the default runtime limit. What would you do and why?
Submit a completely new task from scratch using a finer splitting.
Resubmit the failed jobs of the current task requesting a higher job runtime limit.
Submit a recovery task to analyze the failed jobs, but using a finer splitting.
In this case the problem was with the jobs themselves. Resubmitting without changing the requested job runtime would not help, as the jobs would most probably fail again. Resubmitting with a higher requested job runtime may cause the jobs to be queued for a long time before they start running. Submitting a completely new task would be a waste of resources, since ~80% of the task has already completed. The best choice is to submit a recovery task to analyze only the data assigned to the failed jobs, but with a finer splitting.
3.D) In relation to 3.C, describe step-by-step the process to submit a recovery task in which the jobs are expected to use 1/4 the walltime than the original jobs. How many jobs will the recovery task have (approx.)? Will the output files from the recovery task be stored in the same directory as the output files from the original task? Will they be published in the same output dataset?
Help (which will not work after 26 July 2015): You can use the following existing task, 150626_112217:atanasi_crab_CRAB3_Advanced_Tutorial_June2015_Exercise3C, which represents the exact situation described in 3.C. Doing crab remake --task 150626_112217:atanasi_crab_CRAB3_Advanced_Tutorial_June2015_Exercise3C you will create a CRAB project directory for this task, at which point you will be able to execute crab commands referring to this task. You can then submit your proposed recovery task. Will it publish in the same output dataset?
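For guidance, here is a minimal sketch of one way to build the lumi mask for the recovery task. It assumes 'crab report' was run for the original task and wrote the processed lumis to a JSON file in the results directory (the exact file name, processedLumis.json below, depends on the CRAB client version):
# Build the lumi mask for the recovery task: original mask minus processed lumis.
from FWCore.PythonUtilities.LumiList import LumiList
originalMask = LumiList(filename='Cert_190456-208686_8TeV_PromptReco_Collisions12_JSON.txt')
processed = LumiList(filename='processedLumis.json')  # produced by 'crab report'
(originalMask - processed).writeJSON('recoveryMask.json')
In the recovery-task configuration, point Data.lumiMask to recoveryMask.json and reduce Data.unitsPerJob from 240 to about 60, so that each job processes roughly 1/4 of the lumis (and hence roughly 1/4 of the walltime) of an original job.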
Exercise 4 - user script
The following exercises can all be run in one task if you want to save some time.
4.A) Run a task similar to the one in section Running CRAB to generate MC events of the CRAB3 (basic) tutorial, but wrapping cmsRun in a shell script. Don't forget to tell cmsRun to produce the FrameworkJobReport.xml. In the script, print some messages before cmsRun starts and after cmsRun finishes. Where are these messages printed?
Help:
4.B) Run a task as in 4.A, but save the messages into a text file. Consider this text file as an additional output file that should be transferred to the destination storage. Once transferred, retrieve it with crab getoutput and check that the messages are there.
Help:
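A minimal sketch of the configuration additions for this exercise, assuming the wrapper script from 4.A appends its messages to a file named messages.txt (a hypothetical name):
# Additions to the 4.A configuration (sketch). The script is assumed to append
# its messages to messages.txt, e.g. with: echo "..." >> messages.txt
config.JobType.scriptExe = 'myscript.sh'
config.JobType.outputFiles = ['messages.txt']  # transfer the extra file together with the CMSSW output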
4.C) Run a task as in 4.A, but write the messages in a local text file, include this file in the input sandbox, and make your script read the messages from that file.
Help:
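# The wrapper script below (for 4.C) reads the messages from input.txt, which is
# assumed to be shipped in the sandbox via config.JobType.inputFiles = ['input.txt'].
# Every line that is not one of the "Before"/"After" markers is printed, and cmsRun
# is launched when the "After" marker is read.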
while read line
do
if [ "$line" == "Before" ]; then
continue
elif [ "$line" == "After" ]; then
cmsRun -j FrameworkJobReport.xml -p PSet.py
else
echo $line
fi
done < input.txt
from CRABClient.UserUtilities import config
config = config()
config.General.requestName = 'CRAB3_Advanced_Tutorial_May2015_Exercise4B'
config.JobType.pluginName = 'PrivateMC'
config.JobType.psetName = 'pset_tutorial_MC_generation.py'
config.JobType.scriptExe = 'myscript.sh'
# Arguments have to be in the form param=value, without white spaces, quotation marks nor additional equal signs (=).
config.JobType.scriptArgs = ['Before=CMSRUN-starting', 'After=CMSRUN-finished']
config.Data.outputPrimaryDataset = 'MinBias'
config.Data.splitting = 'EventBased'
config.Data.unitsPerJob = 10
config.Data.totalUnits = 30
config.Data.publication = True
config.Data.outputDatasetTag = config.General.requestName
config.Site.storageSite = <site where the user has permission to write>
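# myscript.sh used with the configuration above: it parses the Before=.../After=...
# arguments passed through config.JobType.scriptArgs and prints them before and after cmsRun.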
beforemsg=""
aftermsg=""
for i in "$@"
do
case $i in
Before=*)
beforemsg="${i#*=}"
;;
After=*)
aftermsg="${i#*=}"
;;
esac
done
echo "================= $beforemsg ===================="
cmsRun -j FrameworkJobReport.xml -p PSet.py
echo "================= $aftermsg ===================="
4.E) Run a task as in 4.A, but defining the exit code of your script as 80500 (pass the exit code to CRAB; don't do exit 80500 in the script). Once the task finishes (it should fail) check the status to see if the jobs are indeed reported as failed with exit code 80500.
Help:
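# Wrapper script for 4.E: run cmsRun, then inject a FrameworkError element carrying
# the desired exit code into FrameworkJobReport.xml, which is what CRAB reads to
# determine the job exit code (do not call 'exit 80500' directly).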
echo "================= CMSRUN starting ===================="
cmsRun -j FrameworkJobReport.xml -p PSet.py
echo "================= CMSRUN finished ===================="
exitCode=80500
exitMessage="This is a test to see if I can pass exit code 80500 to CRAB."
errorType=""
if [ -e FrameworkJobReport.xml ]
then
cat << EOF > FrameworkJobReport.xml.tmp
<FrameworkJobReport>
<FrameworkError ExitStatus="$exitCode" Type="$errorType" >
$exitMessage
</FrameworkError>
EOF
tail -n+2 FrameworkJobReport.xml >> FrameworkJobReport.xml.tmp
mv FrameworkJobReport.xml.tmp FrameworkJobReport.xml
else
cat << EOF > FrameworkJobReport.xml
<FrameworkJobReport>
<FrameworkError ExitStatus="$exitCode" Type="$errorType" >
$exitMessage
</FrameworkError>
</FrameworkJobReport>
EOF
fi
Exercise 5 - LHE
5.A) The objective of this exercise is to run MC generation on LHE files. Using the Running MC generation on LHE files twiki as a pointer, prepare (but do not submit yet!) a CRAB3 configuration to run on the Grid using the pset and the LHE file you can find here on lxplus: /eos/project/c/cmsweb/www/CRAB/AdvTutorial/Exercise_5/ (also available from this page). Use config.JobType.inputFiles to pass the LHE file to the jobs. The target is to have 10 jobs that will run for 8 hours. Run your job locally on 1000 events before submitting it with CRAB (it should take less than a minute), and use the Timing service of CMSSW to get a time-per-event estimate. Everything should use CMSSW_5_3_22. Publication should be enabled in your configuration.
Help:
##### These are the lines you can add to your pset to get the estimates in the FrameworkJobReport file
process.Timing = cms.Service("Timing",
summaryOnly = cms.untracked.bool(True)
)
##### For your information, CRAB3 also adds the following two lines in addition to the previous three:
#process.CPU = cms.Service("CPU")
#process.SimpleMemoryCheck = cms.Service("SimpleMemoryCheck")
Running cmsRun locally to estimate the time per event.
You can open the FrameworkJobReport.xml file now and look for the AvgEventTime, which is what CRAB3 currently uses to estimate the job walltime.
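As a rough guide (with assumed numbers, not the official solution), once you have the AvgEventTime you can derive an EventBased splitting for 10 jobs of ~8 hours each:
# Back-of-the-envelope splitting estimate; avg_event_time is an ASSUMED placeholder,
# to be replaced with the AvgEventTime read from your local FrameworkJobReport.xml.
avg_event_time = 1.5                             # seconds per event (assumed)
events_per_job = int(8 * 3600 / avg_event_time)  # ~8 hours per job
print "Data.unitsPerJob =", events_per_job
print "Data.totalUnits  =", 10 * events_per_job  # 10 jobs in total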
You are probably getting the following error, because passing the LHE file directly in the input sandbox causes the sandbox to exceed the allowed size limit:
Will use CRAB configuration file /afs/cern.ch/user/m/mmascher/tutorial/Exercise_5/crabConfig.py
Error contacting the server.
Server answered with: Invalid input parameter
Reason is: File is bigger then allowed limit of 10485760B
5.C) Copy the dynlo.lhe file to your destination storage under /store/user/<your-username>/dynlo.lhe and try to submit the task again removing the JobType.inputFiles parameter and accessing the file locally. Remember to modify the LHEInputSource and whitelist your site in the CRAB configuration so that your jobs will only run there.
Help:
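A sketch of the configuration-side changes, assuming your storage site is T2_IT_Legnaro (a hypothetical choice; use your own site):
# 5.C sketch: the LHE file is no longer shipped in the sandbox, so the
# config.JobType.inputFiles line is simply removed. Pin the jobs to the site
# hosting /store/user/<your-username>/dynlo.lhe:
config.Site.whitelist = ['T2_IT_Legnaro']  # hypothetical; use the site where you copied the file
In the CMSSW pset, point the LHE source at the copied file instead of the local one.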
5.D) Remove the whitelist and have the jobs run at all the sites in the same country as your storage site. For example, if you are storing files at T2_IT_Legnaro, the jobs should run only at Italian sites (T2_IT_*). Can you identify the differences between the log files of 5.C and this exercise?
Help:
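A sketch, again assuming an Italian storage site; CRAB site lists generally accept wildcards, but if yours does not, list the sites explicitly:
# 5.D sketch: open the whitelist to all sites in the same country as your storage site.
config.Site.whitelist = ['T2_IT_*']  # adapt the prefix to the country of your storage site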
6) Using the same CMSSW parameter-set configuration as in exercise 2, submit four identical tasks to analyze the following four input datasets:
/DoubleMuParked/Run2012A-22Jan2013-v1/AOD
/DoubleMuParked/Run2012B-22Jan2013-v1/AOD
/DoubleMuParked/Run2012C-22Jan2013-v1/AOD
/DoubleMuParked/Run2012D-22Jan2013-v1/AOD
Don't do crab submit four times; instead use the crabCommand API from the CRABAPI library in a script. There is no need to analyze the whole datasets; just a few lumis or files are enough. Use the CRABAPI library to check the status, resubmit failed jobs, get the report, and retrieve the output files from the tasks you submitted.
Create a Python script named "multicrab" (with permissions 744) which you should be able to run in the following way:
./multicrab --crabCmd CMD [--workArea WAD --crabCmdOpts OPTS]
where CMD is the crab command, WAD is a work area directory with many CRAB project directories inside and OPTS are options for the crab
command.
#!/usr/bin/env python
"""
This is a small script that does the equivalent of multicrab.
"""
import os
from optparse import OptionParser
def getOptions():
"""
Parse and return the arguments provided by the user.
"""
usage = ("Usage: %prog --crabCmd CMD [--workArea WAD --crabCmdOpts OPTS]"
"\nThe multicrab command executes 'crab CMD OPTS' for each project directory contained in WAD"
"\nUse multicrab -h for help")
parser = OptionParser(usage=usage)
parser.add_option('-c', '--crabCmd',
dest = 'crabCmd',
default = '',
help = "crab command",
metavar = 'CMD')
parser.add_option('-w', '--workArea',
dest = 'workArea',
default = '',
help = "work area directory (only if CMD != 'submit')",
metavar = 'WAD')
parser.add_option('-o', '--crabCmdOpts',
dest = 'crabCmdOpts',
default = '',
help = "options for crab command CMD",
metavar = 'OPTS')
(options, arguments) = parser.parse_args()
if arguments:
parser.error("Found positional argument(s): %s." % (arguments))
if not options.crabCmd:
parser.error("(-c CMD, --crabCmd=CMD) option not provided.")
if options.crabCmd != 'submit':
if not options.workArea:
parser.error("(-w WAR, --workArea=WAR) option not provided.")
if not os.path.isdir(options.workArea):
parser.error("'%s' is not a valid directory." % (options.workArea))
return options
def main():
options = getOptions()
# Do something.
if __name__ == '__main__':
main()
#!/usr/bin/env python
"""
This is a small script that does the equivalent of multicrab.
"""
import os
from optparse import OptionParser
import CRABClient
from CRABAPI.RawCommand import crabCommand
from CRABClient.ClientExceptions import ClientException
from httplib import HTTPException
def getOptions():
"""
Parse and return the arguments provided by the user.
"""
usage = ("Usage: %prog --crabCmd CMD [--workArea WAD --crabCmdOpts OPTS]"
"\nThe multicrab command executes 'crab CMD OPTS' for each project directory contained in WAD"
"\nUse multicrab -h for help")
parser = OptionParser(usage=usage)
parser.add_option('-c', '--crabCmd',
dest = 'crabCmd',
default = '',
help = "crab command",
metavar = 'CMD')
parser.add_option('-w', '--workArea',
dest = 'workArea',
default = '',
help = "work area directory (only if CMD != 'submit')",
metavar = 'WAD')
parser.add_option('-o', '--crabCmdOpts',
dest = 'crabCmdOpts',
default = '',
help = "options for crab command CMD",
metavar = 'OPTS')
(options, arguments) = parser.parse_args()
if arguments:
parser.error("Found positional argument(s): %s." % (arguments))
if not options.crabCmd:
parser.error("(-c CMD, --crabCmd=CMD) option not provided.")
if options.crabCmd != 'submit':
if not options.workArea:
parser.error("(-w WAR, --workArea=WAR) option not provided.")
if not os.path.isdir(options.workArea):
parser.error("'%s' is not a valid directory." % (options.workArea))
return options
def main():
options = getOptions()
# The submit command needs special treatment.
if options.crabCmd == 'submit':
#--------------------------------------------------------
# This is the base config:
#--------------------------------------------------------
from CRABClient.UserUtilities import config
config = config()
config.General.requestName = None
config.General.workArea = 'CRAB3_Advanced_Tutorial_May2015_Exercise6'
config.JobType.pluginName = 'Analysis'
config.JobType.psetName = 'pset_tutorial_analysis.py'
config.Data.inputDataset = None
config.Data.splitting = 'LumiBased'
config.Data.unitsPerJob = 10
config.Data.totalUnits = 30
config.Data.outputDatasetTag = None
config.Site.storageSite = None # Choose your site.
#--------------------------------------------------------
# Will submit one task for each of these input datasets.
inputDatasets = [
'/DoubleMuParked/Run2012A-22Jan2013-v1/AOD',
'/DoubleMuParked/Run2012B-22Jan2013-v1/AOD',
'/DoubleMuParked/Run2012C-22Jan2013-v1/AOD',
'/DoubleMuParked/Run2012D-22Jan2013-v1/AOD',
]
for inDS in inputDatasets:
# inDS is of the form /A/B/C. Since B is unique for each inDS, use this in the CRAB request name.
config.General.requestName = inDS.split('/')[2]
config.Data.inputDataset = inDS
config.Data.outputDatasetTag = '%s_%s' % (config.General.workArea, config.General.requestName)
# Submit.
try:
print "Submitting for input dataset %s" % (inDS)
crabCommand(options.crabCmd, config = config, *options.crabCmdOpts.split())
except HTTPException as hte:
print "Submission for input dataset %s failed: %s" % (inDS, hte.headers)
except ClientException as cle:
print "Submission for input dataset %s failed: %s" % (inDS, cle)
# All other commands can be simply executed.
elif options.workArea:
for dir in os.listdir(options.workArea):
projDir = os.path.join(options.workArea, dir)
if not os.path.isdir(projDir):
continue
# Execute the crab command.
msg = "Executing (the equivalent of): crab %s --dir %s %s" % (options.crabCmd, projDir, options.crabCmdOpts)
print "-"*len(msg)
print msg
print "-"*len(msg)
try:
crabCommand(options.crabCmd, dir = projDir, *options.crabCmdOpts.split())
except HTTPException as hte:
print "Failed executing command %s for task %s: %s" % (options.crabCmd, projDir, hte.headers)
except ClientException as cle:
print "Failed executing command %s for task %s: %s" % (options.crabCmd, projDir, cle)
if __name__ == '__main__':
main()