Operator Tips and Hints


Helpful tips, tricks, and command-line recipes for operators navigating CRAB3.

Finding what you need

  • Get the full name of the task from the user; this will tell you which schedd to look at.
  • Refer to this page to find the vocmsXXX server and log file locations.
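The schedd hostname is embedded in the task name itself; a minimal shell sketch (using the example task name that appears throughout this page) to pull it out:

```shell
# The piece between the second underscore and the colon (vocms20 here)
# is the submission host, i.e. the schedd to look at.
task="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"
schedd=$(echo "$task" | cut -d: -f1 | cut -d_ -f3)
echo "$schedd"
```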

On the schedd

  • To deny a user with local unix user name cms123, add the following line to .../config.d/99_banned_users.conf (create if it does not exist):
DENY_WRITE = cms123@*, $(DENY_WRITE)
Reload the schedd by sending it a SIGHUP.
  • To list all jobs in a task named 140221_015428_vocms20:bbockelm_crab_lxplus414_user_17 run:
condor_q -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"'
  • To kill a task named 140221_015428_vocms20:bbockelm_crab_lxplus414_user_17 run:
condor_hold -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17" && TaskType=?="ROOT"'
  • To find the HTCondor working (spool) directory (different from the web monitor directory; contains things like the raw logfile, postjob, proxy, input files, etc) either run:
condor_q -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"' -af Iwd
Or go to the schedd task directory (such as "~/140221_015428_vocms20:bbockelm_crab_lxplus414_user_17/") and run ll (ls -l). It will show some files symlinked from the HTCondor spool directory:
... postjob.14.0.txt -> /data/srv/glidecondor/condor_local/spool/9589/0/cluster129589.proc0.subproc0/postjob.14.0.txt
You can simply cd to the shown directory (after stripping the file name, of course), which is the spool directory itself.
  • To find job one in the task, run:
condor_q -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"&&CRAB_Id=?=1'
You can replace condor_q with condor_history for completed jobs. If you want just the history for retry 2,
condor_history -match 1 -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"&&CRAB_Id=?=1&&CRAB_Retry=?=2'
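Since the same ClassAd constraint recurs in several commands, it can help to build it once in a shell variable; a sketch (the condor command at the end is shown but not run here):

```shell
# Build the constraint string once and reuse it with condor_q or
# condor_history; the task name is the example used above.
task="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"
const="CRAB_ReqName=?=\"$task\"&&CRAB_Id=?=1&&CRAB_Retry=?=2"
echo "$const"
# condor_history -match 1 -const "$const"
```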
  • To list all tasks for a user:
condor_q cms293
  • To list a specific task configuration:
condor_q -l 179330.0
or, if the task does not exist anymore, take it from the condor history:
condor_history -l 179330.0
  • To see the list of users that are allowed to write on the schedd:
condor_config_val SCHEDD.ALLOW_WRITE
  • To see the location of Master log or Schedd log:
condor_config_val MASTER_LOG  # or: condor_config_val SCHEDD_LOG
  • To see all the schedds known to the glidein-collector.t2.ucsd.edu collector:
condor_status -pool -schedd
  • To see all running/idle by username:
condor_q -name -name -name -format '%s\n' AccountingGroup | sort | uniq -c
  • To see all running by username:
condor_q -name -name -name -format '%s\n' AccountingGroup -const '(JobStatus=?=2)' | sort | uniq -c
  • To see all running in the production schedds:
condor_q -name -name -name  -const '(JobStatus=?=2)' | grep gWMS-CMSRun | wc -l #to count running jobs
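The sort | uniq -c idiom in the commands above can be previewed on fake AccountingGroup values (the names below are invented) to see the shape of the output:

```shell
# Count occurrences of each value, most frequent first, exactly as the
# condor_q pipelines above do with real AccountingGroup output.
counts=$(printf 'analysis.alice\nanalysis.bob\nanalysis.alice\n' \
  | sort | uniq -c | sort -rn)
echo "$counts"
```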
Job Status explanation:
 0  Unexpanded      U
 1  Idle            I
 2  Running         R
 3  Removed         X
 4  Completed       C
 5  Held            H
 6  Submission_err  E
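If a script needs the letter codes, the table above can be restated as a small helper function (a sketch, not part of any CRAB tooling):

```shell
# Map a numeric HTCondor JobStatus to its one-letter code.
job_status_letter() {
  case "$1" in
    0) echo U ;; 1) echo I ;; 2) echo R ;; 3) echo X ;;
    4) echo C ;; 5) echo H ;; 6) echo E ;; *) echo '?' ;;
  esac
}
job_status_letter 2
```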
  • Users sometimes find that their jobs do not run. There are several possible reasons: failed job or machine constraints, bias due to preferences, insufficient priority, or the preemption "throttle" implemented by the condor_negotiator to prevent thrashing. Many of these can be diagnosed with the -analyze option of condor_q.
condor_q 184439.0 -analyze
or, if that is not enough information:
condor_q 184439.0 -better-analyze
  • To see the list of running pilots grouped by site:
condor_status -format '%s\n' GLIDEIN_CMSSite  | sort | uniq -c
  • To see the list of idle pilots for a CE:
condor_q -const 'GlideinEntryName=?="CMS_T2_CH_CERN_ce408" && JobStatus=?=1' -pool -g
  • To see the number of tasks each schedd is running:
condor_status -schedd -pool -const 'regexp("crab3", Name)' -af:h Name TotalSchedulerJobsRunning
  • To see the number of jobs matching a site in one schedd:
 condor_q  -pool -name -const '(StringListMember("T2_US_Purdue", DESIRED_Sites)=?=True)&&(JobStatus=?=1)' 
  • To see the number of scheduler jobs (dagman) idle and running:
condor_status -const 'CMSGWMS_Type=?="crabschedd"' -schedd -af Name TotalSchedulerJobsRunning TotalSchedulerJobsIdle

On the Collector

Boosting user priority

Boosting user priority can be done from the glideinWMS collector machine in the Global Pool. User priority works in HTCondor such that jobs from users with the smallest priority number run first. The priority number is a combination of two things: a usage-based number, which increases with usage and decays exponentially (over roughly a 24h period), and an overall priority factor (default 100). Therefore, among users with equal priority factors, those with the least recent usage have the best priority.
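The effect of the factor is multiplicative: the effective priority is roughly the usage-based number times the priority factor. A sketch with an invented usage value, comparing the default factor (100) with a HammerCloud-style factor of 1:

```shell
# Illustrative arithmetic only: lower effective number = better priority.
prios=$(awk 'BEGIN {
  usage = 500.0            # made-up usage-based component
  printf "%.0f %.0f", usage * 100, usage * 1
}')
echo "$prios"   # default factor vs boosted factor
```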

To query user priorities (as any user on the collector machine):

condor_userprio -allusers

Commands to change user priorities must be executed as user condor on the collector machine itself. To reset the user priority factor:

condor_userprio -setfactor <user> <val>
To zero a user's recent usage:
condor_userprio -resetusage <user>

For more options please read:

condor_userprio -help

For example, the user under which the HammerCloud jobs are submitted is given a priority factor of 1, so that HammerCloud jobs almost always run before anything else in the Global Pool. This change was made on December 5, 2014, and by the next day over 97% of the jobs started within 2 hours. Before the change, the percentage was only ~80%.

N.B.: If you get an error like condor_userprio: Can't locate negotiator in local pool when doing condor_userprio, you can specify the pool using -pool, for example condor_userprio -pool <pool>. Moreover, with recent changes to the global pool negotiator settings it is no longer guaranteed that this command will run (see

On the REST interface

Find REST logs

Take the X-Error-Id from crab.log, go to vocms022 or vocms055, where the REST logs are copied to /build/srv-logs, then grep for the X-Error-Id, for example:

grep <X-Error-Id> /build/srv-logs/vocms*/crabserver/crabserver-20151015.log -C 30

You will get a stacktrace that helps you understand what is going on:

[17/Aug/2015:20:49:54]    Traceback (most recent call last):
[17/Aug/2015:20:49:54]      File
line 1728, in dbapi_wrapper
[17/Aug/2015:20:49:54]        return handler(*xargs, **xkwargs)
[17/Aug/2015:20:49:54]      File
line 493, in get
[17/Aug/2015:20:49:54]        result =
self.userworkflowmgr.status(workflow, verbose=verbose, userdn=userdn)
[17/Aug/2015:20:49:54]      File
line 143, in status
[17/Aug/2015:20:49:54]        return self.workflow.status(workflow,
userdn, userproxy, verbose=verbose)
[17/Aug/2015:20:49:54]      File
line 186, in throttled_wrapped_function
[17/Aug/2015:20:49:54]  RESTSQL:ghDyYRFibLCT disconnecting
[17/Aug/2015:20:49:54]        return fn(*args, **kw)
[17/Aug/2015:20:49:54]      File
line 144, in wrapped_func
[17/Aug/2015:20:49:54]        return func(*args, **kwargs)
[17/Aug/2015:20:49:54]      File
line 326, in status
[17/Aug/2015:20:49:54]        taskStatus =
self.taskWebStatus(DBResults, verbose=verbose)
[17/Aug/2015:20:49:54]      File
line 563, in taskWebStatus
[17/Aug/2015:20:49:54]        curl.perform()
[17/Aug/2015:20:49:54]    error: (28, 'Connection timed out after
30000 milliseconds')

Monitoring links


Useful gwmsmon APIs

  • users running jobs of a given duration (by number of jobs):
An API for top users is available; the table could also be extended to include the exit code if needed.

API description

(hours) - number of hours from 0 to 999
(TimeFrom) - minimum job runtime. Default value: * (means any)
(TimeTo) - maximum job runtime. Default value: * (means any)

For example:
For the last week how many jobs per user were running less than 1h:

For the last week how many jobs per user were running more than 1h:

For the last week how many jobs per user were running more than 24h:
  • time distribution (percentiles) of used wall clock time by jobs in a task
Same exists for prodview instead of analysisview. In analysis, workflow is a username, subtask is a taskname:

(call) - one of memoryusage, exitcodes, runtime, memorycpu, percentileruntime
(hours) - from 0 to 999
(workflow) - username (i.e. the CRAB_UserHN classAd)
(subtask) - taskname (i.e. the CRAB_ReqName classAd)


Crabcache (UserFileCache)

  • To get the list of all the users in the crabcache:
curl -X GET '' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
{"result": [
  • To get all the files uploaded by a user to the crabcache and the amount of quota (in bytes) they are using:
curl -X GET '' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
{"result": [
 {"file_list": ["/data/state/crabcache/files/m/mmascher/69/697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a", "/data/state/crabcache/files/m/mmascher/14/14571bc71cf2077961408fb1a060b9497a9c7d0cc1dcb47ed0f7fc1ac2e3748d", "/data/state/crabcache/files/m/mmascher/ef/efdedb430f8462c72259fd2e427e684d8e3aedf8d0d811bf7ef8a97f30a47bac", ..., "/data/state/crabcache/files/m/mmascher/05/059b06681025f14ecce5cbdc49c83e1e945a30838981c37b5293781090b07bd7", "/data/state/crabcache/files/m/mmascher/up/uplog.txt"], "used_space": [130170972]}
  • To get more information about one specific file (the file must be owned by the user who makes the query):
curl -X GET '' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
{"result": [
 {"hashkey": "697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a", "created": 1400161415.0, "modified": 1400161415.0, "exists": true, "size": 1303975}
  • To remove a specific file (currently you can only remove your own files; in the future power users should be able to remove everything):
curl -X GET '' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
{"result": [
  • To get the list of power users:
curl -X GET '' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
{"result": [
  • To get the quota each user has in MegaBytes (power users have 10x this value):
curl -X GET '' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
{"result": [
 {"quota_user_limit": 10}
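The quota arithmetic implied above, as a one-line sketch (the base value is the illustrative one from the example output):

```shell
# Power users get 10x the per-user quota returned by the call above.
quota_user_limit=10                          # MB
power_user_limit=$(( quota_user_limit * 10 ))
echo "${power_user_limit} MB"
```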

Development Oriented Tricks I did not know where to put

To run the preDag on the schedd with debugger

  • I only tried this after the preDag had already run once by itself.
  • sketchy notes on what to do there; they may be enough:
sudo su cms1627

#export _CONDOR_JOB_AD=finished_jobs/job.1.0
export _CONDOR_JOB_AD=finished_jobs/job.0-1.0 

#export X509_USER_PROXY=cfa0cb32f2043a8042e1290debf652b8fd62334e
export X509_USER_PROXY=`ls | egrep '[a-z0-9]{40}'`

rm automatic_splitting/processed

vim TaskWorker/Actions/
and put a breakpoint at some proper place:
import pdb; pdb.set_trace()

import TaskWorker.Actions.PreDAG as PreDAG

To run a pre-job on the schedd

  • Go to the spool directory of the task on the schedd
  • get the user identity in order to access files via sudo su username
    • Note: username is the schedd username which the task belongs to, for example, cms9999. You can e.g. do
sudo su `ls -ld ${PWD}|awk '{print $3}'`
  • Copy the relevant arguments that are passed to the pre-job as written in RunJobs.dag (the relevant arguments are those after $RETRY. These look like 1 150818_081011:erupeika_task_name and are in the second line of the text block for each job).
    • this can be done via
export jobId=1  # or whatever job # you want to retry
pjArgs=`cat RunJobs.dag | grep "PRE  Job${jobId}" | awk 'BEGIN { FS="RETRY" }; {print $2}'`
  • Identify the file in the spool directory that contains your proxy (the file name is a sequence of 40 lower-case letters and/or digits) and make sure the proxy is still valid (voms-proxy-info -file <proxy_file_name>).
    • this can be done via export X509_USER_PROXY=`pwd`/`ls|egrep  [a-z,0-9]{40}`
  • You might also need to change the permissions of the prejob_logs directory
  • Run the pre-job. The generic instruction is:
export TEST_DONT_REDIRECT_STDOUT=True; export _CONDOR_JOB_AD=finished_jobs/job.1.0; export X509_USER_PROXY=<proxy_file_name>; sh PREJOB <retry_count> <arguments>
Note: retry_count can be 0.

  • if you used the shortcuts indicated above the simpler command is
export TEST_DONT_REDIRECT_STDOUT=True; export _CONDOR_JOB_AD=finished_jobs/job.1.0;  sh PREJOB 0 $pjArgs
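The proxy-file lookup used in the shortcuts above can be made a bit more robust: quoting the pattern and anchoring it avoids accidental matches, and the comma inside [a-z,0-9] is unnecessary. A sketch against a throwaway fake spool directory (file names invented):

```shell
# The proxy file name is 40 lower-case letters/digits; anchor the pattern
# so nothing else in the spool directory (e.g. RunJobs.dag) matches.
demo=$(mktemp -d)
cd "$demo"
touch a41b46b7b59e9858006416f98d1540f9ddf4d646 RunJobs.dag
proxyfile=$(ls | egrep '^[a-z0-9]{40}$')
export X509_USER_PROXY=$PWD/$proxyfile
echo "$proxyfile"
```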

If none of the jobs have finished yet, you can probably grab the classads from one of the submitted / finished jobs with:

sudo condor_q -l 3891792.0 > /tmp/
# or
sudo condor_history -l 3891792.0 > /tmp/

Otherwise, you can create a fake file and point the _CONDOR_JOB_AD to it:

QDate = 1429778123
CRAB_ReqName = "150423_074637:mmascher_crab_testecmmascher-dev6_158"
CRAB_UserDN = "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=mmascher/CN=720897/CN=Marco Mascheroni"
CRAB_JobSW = "CMSSW_7_1_9"
CRAB_UserHN = "mmascher"
CRAB_InputData = "/DoubleMuParked/Run2012B-22Jan2013-v1/AOD"
DESIRED_CMSDataset = "/DoubleMuParked/Run2012B-22Jan2013-v1/AOD"
CMSGroups = "/cms"
#The following can be enabled to test slow release of jobs for HC
#CRAB_JobReleaseTimeout = 60
#CRAB_TaskSubmitTime = 1464638738

To run a post-job on the schedd

  • instructions updated and checked: June 2017
  • Go to the spool directory of the task in the schedd.
  • Find the HTCondor clusterId of the job for which you want to run the post-job, e.g. with something like jobClusterId=`ls -l finished_jobs/job.2.0 | awk '{print $NF}'|cut -d. -f2`
  • Copy the relevant arguments that are passed to the post-job corresponding to the above cluster id job as written in RunJobs.dag (the relevant arguments are those after $MAX_RETRIES). Example:
PJargs=`cat RunJobs.dag | grep "POST Job1" | awk 'BEGIN { FS="MAX_RETRIES" }; {print $2}'`
which results in something like:
belforte@vocms059/~> echo $PJargs
170620_163137:belforte_crab_20170620_183133 1 /store/temp/user/belforte.be1f4dc5be8664cbd145bf008f5399adf42b086f/prova/doppio/slash/GenericTTbar/CRAB3_test-SB/170620_163137/0000 /store/user/belforte/prova/doppio//slash/GenericTTbar/CRAB3_test-SB/170620_163137/0000 cmsRun_1.log.tar.gz conventional kk_1.root
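The awk extraction above can be tried on a fake RunJobs.dag POST line; the line layout and names below are invented for illustration, not copied from a real DAG:

```shell
# Everything after the literal MAX_RETRIES token is the post-job
# argument list; awk splits on that token and prints the tail.
line='SCRIPT POST Job1 bootstrap.sh POSTJOB $JOBID $RETURN $RETRY $MAX_RETRIES 170620_163137:user_crab_task 1 /store/temp/... /store/user/... cmsRun_1.log.tar.gz'
PJargs=$(echo "$line" | awk 'BEGIN { FS="MAX_RETRIES" }; {print $2}')
echo "$PJargs"
```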
  • Identify the file in the spool directory that contains your proxy and set the X509_USER_PROXY env. var (the file name is a sequence of 40 lower-case letters and/or digits; you could e.g. use export X509_USER_PROXY=`ls | egrep '[a-z0-9]{40}'`) and make sure the proxy is still valid (voms-proxy-info).
  • Modify the file retry_info/job.X.txt so that the post value is equal to 0.
  • Delete the file defer_info/defer_num.X.0.txt.
  • Find the cmsXXXX username of the user the task belongs to (even if you're running your own task) and replace the cmsXXXX with that.
  • Run the post-job. The generic instruction is:
sudo su cmsXXXX sh -c 'export _CONDOR_JOB_AD=finished_jobs/job.X.0; export TEST_DONT_REDIRECT_STDOUT=True; export X509_USER_PROXY=<proxy_file_name>; sh POSTJOB <job_cluster_id> <job_return_code> <retry_count> <max_retries> <arguments>'
where you have to substitute the arguments with what you found in your RunJobs.dag, the proxy file name and the first four numbers after POSTJOB.

One could also add export TEST_POSTJOB_DISABLE_RETRIES=True to disable the job retry handling part, or export TEST_POSTJOB_NO_STATUS_UPDATE=True to avoid sending status report to dashboard and file metadata upload. A specific example is:

sudo su cms1425 sh -c 'export _CONDOR_JOB_AD=finished_jobs/job.2.0; export TEST_DONT_REDIRECT_STDOUT=True; export X509_USER_PROXY=a41b46b7b59e9858006416f98d1540f9ddf4d646; sh POSTJOB 3971322 0 0 10 160314_085828:erupeika_crab_test7 2 /store/temp/user/erupeika.1ae8d366cbf4507a432372f69f456dc0d23cc26d/GenericTTbar/CRAB3_tutorial_May2015_MC_analysis_pub/160314_085828/0000 /store/user/erupeika/GenericTTbar/CRAB3_tutorial_May2015_MC_analysis_pub/160314_085828/0000 cmsRun_2.log.tar.gz output_2.root'
  • Note that there might be some permission issues, about which Python should throw an exception. Changing the problematic files and folders to 775 permissions will likely solve the issue. You can also try to change the owner of the files to cmsXXXX in case they are owned by condor.
  • Note: at this point it is possible to insert import pdb; pdb.set_trace() in TaskWorker/Actions/ and execution will stop there, allowing interactive debugging

To run the job wrapper from lxplus

  • Copy the spool directory of the task you want to run to lxplus, e.g.:
scp -r /data/srv/glidecondor/condor_local/spool/9290/0/cluster379290.proc0.subproc0 mmascher@lxplus:/afs/
    • Note: on most real tasks the spool directory is full of useless log files, so rather do:
tar cjf /tmp/spoolDir.tar --exclude="*job*" --exclude="postJob*" --exclude="*submit" --exclude="defer*" --exclude="task_statistics*"  .
scp /tmp/spoolDir.tar mmascher@lxplus:/afs/<somedir>
# and on lxplus expand with
tar xf spoolDir.tar
  • Go to the directory you just copied to lxplus.
  • Copy the job arguments of the job you want to run from glidemon or dashboard (fourth line of the joblog).
  • Prepare a script to run your job (replace the job arguments with the ones you have just copied. Be careful!!! You need to enclose some parameters in quotes, e.g. inputFile and scriptArgs).
rm -rf jobReport.json cmsRun-stdout.log edmProvDumpOutput.log jobReportExtract.pickle FrameworkJobReport.xml outfile.root PSet.pkl scramOutput.log jobLog.* assa wmcore_initialized debug CMSSW_*
tar xf sandbox.tar.gz
#export PYTHONPATH=~/repos/CRABServer/src/python:$PYTHONPATH # use this if you want to provide your own scripts
export X509_USER_PROXY=/tmp/x509up_u`id -u`
sh -a sandbox.tar.gz --sourceURL= --jobNumber=1 --cmsswVersion=CMSSW_5_3_4 --scramArch=slc5_amd64_gcc462 --inputFile='["/store/mc/HC/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0010/786D8FE8-BBAD-E111-884D-0025901D5DB2.root"]' --runAndLumis=job_lumis_1.json --lheInputFiles=False --firstEvent=None --firstLumi=None --lastEvent=None --firstRun=None --seeding=AutomaticSeeding --scriptExe=assa --scriptArgs='["assa=1", "ouch=2"]' -o '{}'
  • Run the script:

Better way with preparelocal

  • crab preparelocal -d ...
  • cd .../local
  • look at and execute line by line interactively, replacing ${1} with the job number, e.g.
     export SCRAM_ARCH=slc6_amd64_gcc481; export CRAB_RUNTIME_TARBALL=local; export CRAB_TASKMANAGER_TARBALL=local; export _CONDOR_JOB_AD=Job.1.submit
     export CRAB3_RUNTIME_DEBUG=True
     tar xzmf CMSRunAnalysis.tar.gz
     cp  Job.submit Job.1.submit 
  • or make things easier by creating e.g. a new copy of the script, and at the vim prompt (:) type: %s/${1}/99/g
    then write and exit; you then have all the lines to execute
  • pick the argument list, e.g. for a copy/paste with the mouse:
    • cat InputArgs.txt | sed "1q;d"  (replace 1 with the job id)
  • now can run the main wrapper like e.g.
    • ./ -a sandbox.tar.gz --sourceURL= --jobNumber=1 --cmsswVersion=CMSSW_10_1_0 --scramArch=slc6_amd64_gcc630 --inputFile=job_input_file_list_1.txt --runAndLumis=job_lumis_1.json --lheInputFiles=False --firstEvent=None --firstLumi=None --lastEvent=None --firstRun=None --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=None --maxRuntime=-1 --scriptArgs=[] -o {}
  • if you want to debug the python part, edit it and put a call to pdb.set_trace() after __main__, then:
    • export PYTHONPATH=`pwd`:$PYTHONPATH
    • python -r "`pwd`" -a sandbox.tar.gz --sourceURL= --jobNumber=1 --cmsswVersion=CMSSW_10_1_0 --scramArch=slc6_amd64_gcc630 --inputFile=job_input_file_list_1.txt --runAndLumis=job_lumis_1.json --lheInputFiles=False --firstEvent=None --firstLumi=None --lastEvent=None --firstRun=None --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=None --maxRuntime=-1 --scriptArgs=[] -o {} 
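The sed "1q;d" idiom above prints exactly one line of a file; a quick sketch with an invented InputArgs.txt:

```shell
# Print only line $jobid of the args file: d deletes every line,
# but q quits (after auto-printing) when line $jobid is reached.
argsfile=$(mktemp)
printf 'args-for-job-1\nargs-for-job-2\nargs-for-job-3\n' > "$argsfile"
jobid=2
jobargs=$(sed "${jobid}q;d" "$argsfile")
echo "$jobargs"
```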

To run the job wrapper and the stageout wrapper from lxplus

  • Copy the spool directory of the task you want to run to lxplus, e.g.:
scp -r /data/srv/glidecondor/condor_local/spool/2789/0/cluster72789.proc0.subproc0
  • Go to the directory you just copied to lxplus.
  • Copy the job arguments of the job you want to run from glidemon or dashboard (fourth line of the joblog). I will assume that we want to run job number 1 of the task. Note that in order to avoid running the full job and wasting time, one can change --firstEvent and --lastEvent arguments to be equal to 1, for example. This means that only one event will be analysed and the job will finish quickly.
  • Prepare a script to run your job
    • Replace the job arguments with the ones you have just copied. Be careful!!! You need to enclose some parameters in quotes, e.g. inputFile and scriptArgs.
    • Replace the with the name of your proxy file. This file is located in the spool dir you downloaded and has a random name (like a41b46b7b59e9858006416f98d1540f9ddf4d646), it shouldn't be hard to find it with ls.
  • Notice that in the script we have assumed that we will use Job.1.submit as the job ad. For this to work, in the Job.1.submit file, remove the leading + sign from each line that starts with one. Also, change CRAB_Id = $(count) to CRAB_Id = 1. Some variables may also be missing, e.g. CRAB_localOutputFiles and CRAB_Destination; in that case, copy them from the job log file and add them to the Job.1.submit file.
  • Create an empty file in the copied directory.

rm -rf jobReport.json cmsRun-stdout.log edmProvDumpOutput.log jobReportExtract.pickle FrameworkJobReport.xml outfile.root PSet.pkl scramOutput.log jobLog.* wmcore_initialized debug CMSSW_7_0_5
tar xvzf sandbox.tar.gz
echo "======== at $(TZ=GMT date) STARTING ========"
sh -a sandbox.tar.gz --sourceURL= --jobNumber=1 --cmsswVersion=CMSSW_7_5_8_patch3 --scramArch=slc6_amd64_gcc491 --inputFile='job_input_file_list_1.txt' --runAndLumis='job_lumis_1.json' --lheInputFiles=False --firstEvent=1 --firstLumi=None --lastEvent=2 --firstRun=None --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=None --scriptArgs=[] -o {}
echo " complete at $(TZ=GMT date) with exit status $EXIT_STATUS"
echo "======== at $(TZ=GMT date) FINISHING ========"
mv jobReport.json jobReport.json.1
export VO_CMS_SW_DIR=/cvmfs/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$VO_CMS_SW_DIR/COMP/slc5_amd64_gcc434/external/openssl/0.9.7m/lib:$VO_CMS_SW_DIR/COMP/slc5_amd64_gcc434/external/bz2lib/1.0.5/lib
command -v python > /dev/null
echo "======== Stageout at $(TZ=GMT date) STARTING ========"
rm -f wmcore_initialized
export _CONDOR_JOB_AD=Job.1.submit
echo "======== Stageout at $(TZ=GMT date) FINISHING (status $STAGEOUT_EXIT_STATUS) ========"

  • Run the script:

To run the sequential TaskWorker from your virtual machine

This is very useful when debugging the TaskWorker code: it allows you to use pdb (for step-by-step debugging) as well as standard output to see what is going on in real time. Just source the taskworker environment, point python to the class, and provide a TaskWorker configuration, for example:

python ~/private/github/repos/CRABServer/test/python/ /data/srv/TaskManager/current/

This will run a single worker on one thread. Note that the tasks submitted to this worker might not be processed correctly (the task might have the wrong state assigned to it), however, that's usually not a problem.

To run the crabserver in debug mode

It is possible to run the crabserver within a single process. This allows the use of the "pdb" python debugger (similar to the sequential Task Worker) to execute code step by step and see what the crabserver is doing in real time.

First, this requires commenting out some code in the WMCore class which starts the server; in the start_daemon() method, comment out the following:

    def start_daemon(self):
#         """Start the deamon."""
#         # Redirect all output to the logging daemon.
#         devnull = file("/dev/null", "w")
#         if isinstance(self.logfile, list):
#             subproc = Popen(self.logfile, stdin=PIPE, stdout=devnull, stderr=devnull,
#                             bufsize=0, close_fds=True, shell=False)
#             logger = subproc.stdin
#         elif isinstance(self.logfile, str):
#             logger = open(self.logfile, "a+", 0)
#         else:
#             raise TypeError("'logfile' must be a string or array")
#         os.dup2(logger.fileno(), sys.stdout.fileno())
#         os.dup2(logger.fileno(), sys.stderr.fileno())
#         os.dup2(devnull.fileno(), sys.stdin.fileno())
#         logger.close()
#         devnull.close()
#         # First fork. Discard the parent.
#         pid = os.fork()
#         if pid > 0:
#             os._exit(0)

        # Establish as a daemon, set process group / session id.
#         os.setsid()
#         # Second fork. The child does the work, discard the second parent.
#         pid = os.fork()
#         if pid > 0:
#             os._exit(0)
#         # Save process group id to pid file, then run real worker.
#         file(self.pidfile, "w").write("%d\n" % os.getpgid(0))

        error = False
        except Exception as e:
            error = True
            trace = StringIO()
            cherrypy.log("ERROR: terminating due to error, error trace follows")
            for line in trace.getvalue().rstrip().split("\n"):
                cherrypy.log("ERROR:   %s" % line)

        # Remove pid file once we are done.
        try: os.remove(self.pidfile)
        except: pass

        # Exit
        sys.exit((error and 1) or 0)

Then comment out some more code, in the run() method:

    def run(self):
        """Run the server daemon main loop."""
        # Fork.  The child always exits the loop and executes the code below
        # to run the server proper.  The parent monitors the child, and if
        # it exits abnormally, restarts it, otherwise exits completely with
        # the child's exit code.
        cherrypy.log("WATCHDOG: starting server daemon (pid %d)" % os.getpid())
#         while True:
#             serverpid = os.fork()
#             if not serverpid: break
#             signal(SIGINT, SIG_IGN)
#             signal(SIGTERM, SIG_IGN)
#             signal(SIGQUIT, SIG_IGN)
#             (xpid, exitrc) = os.waitpid(serverpid, 0)
#             (exitcode, exitsigno, exitcore) = (exitrc >> 8, exitrc & 127, exitrc & 128)
#             retval = (exitsigno and ("signal %d" % exitsigno)) or str(exitcode)
#             retmsg = retval + ((exitcore and " (core dumped)") or "")
#             restart = (exitsigno > 0 and exitsigno not in (2, 3, 15))
#             cherrypy.log("WATCHDOG: server exited with exit code %s%s"
#                          % (retmsg, (restart and "... restarting") or ""))
#             if not restart:
#                 sys.exit((exitsigno and 1) or exitcode)
#             for pidfile in glob("%s/*/pid" % self.statedir):
#                 if os.path.exists(pidfile):
#                     pid = int(open(pidfile).readline())
#                     os.remove(pidfile)
#                     cherrypy.log("WATCHDOG: killing slave server %d" % pid)
#                     try: os.kill(pid, 9)
#                     except: pass

        # Run. Override signal handlers after CherryPy has itself started and
        # installed its own handlers. To achieve this we need to start the
        # server in non-blocking mode, fiddle with, then ask server to block.
        cherrypy.log("INFO: starting server in %s" % self.statedir)
        signal(SIGHUP, sig_reload)
        signal(SIGUSR1, sig_graceful)
        signal(SIGTERM, sig_terminate)
        signal(SIGQUIT, sig_terminate)
        signal(SIGINT, sig_terminate)

Then, set up the environment and start the server

# Source the script, location will be different depending on version
source /data/srv/HG1612i/sw.erupeika/slc6_amd64_gcc493/cms/crabserver/3.3.1702.rc3/etc/profile.d/

export CONDOR_CONFIG=/data/srv/condor_config
export X509_USER_CERT=$AUTHDIR/dmwm-service-cert.pem
export X509_USER_KEY=$AUTHDIR/dmwm-service-key.pem

# Start the crabcache and frontend
/data/srv/current/config/crabcache/manage start 'I did read documentation'
/data/srv/current/config/frontend/manage start 'I did read documentation'

# Start the server
python /data/srv/HG1612i/sw.erupeika/slc6_amd64_gcc493/cms/crabserver/3.3.1702.rc3/bin/wmc-httpd -r -d /data/srv/state/crabserver/ /data/srv/current/config/crabserver/

Script to cd to your last task's spool dir

A simple script for getting into the spool directory of the newest task. Change the cms pool username to your own.
cd /home/grid/cms1425
cd  "$(\ls -1dt ./*/ | head -n 1)"
cd "$(dirname "$(readlink "jobs_log.txt")")"

How to (re)start a task_process for a single task

In case a certain task has no task_process running, it can be restarted with a simple condor_submit from the schedd. However, it should be done as the user to whom the task belongs, otherwise the task_process will run into a lot of permission issues in the spool directory and ultimately won't work. The procedure is as follows:

# Locate the spool dir of the task, for example by using a known ClusterID of the DAG:
cd `condor_q 9454770.0 -af Iwd`
# Impersonate the user who submitted the original task
sudo su cms1425
# condor_submit the task_process jdl (called daemon.jdl). It should be done from the parent directory because of paths in the jdl.
condor_submit task_process/daemon.jdl
To work on a list of task names for a given user (e.g. in file TL):
# setup as that user
sudo su cms1425
cd /home/grid/cms1425
# put the list of task names in an env. var:
tList=`cat TL`
# check
for t in $tList ; do eval condor_q -con 'crab_reqname==\"$t\"'; done
# start task_process
for t in $tList ; do echo $t; pushd $t/SPOOL_DIR; condor_submit task_process/daemon.jdl; popd; done


HTCondor pools

There are two pools we use: global (i.e. Production) and ITB (Integration Test Bed). These are not correlated with CRAB prod/pre-prod/dev instances, the CMSWEB testbed, etc. The two HTCondor pools are selected by their collector.

List schedds in the pool

#production pool
condor_status -sched -con 'CMSGWMS_Type=="crabschedd"' -pool
# to get a list of host names to use in other commands:
schedds=`condor_status -sched -con 'CMSGWMS_Type=="crabschedd"' -pool -af machine`
for s in $schedds ; do echo $s; ssh $s -p 2222 "df -h /data"; done
# ITB pool
condor_status -sched -con 'CMSGWMS_Type=="crabschedd"' -pool

Submit a CRAB task to the ITB pool

Need to explicitly select the ITB collector and one ITB schedd among the possible ones, e.g.
belforte@lxplus055/TC3> condor_status -sched -con 'CMSGWMS_Type=="crabschedd"' -pool cmsgwms-collector-itb
Name                    Machine           RunningJobs   IdleJobs   HeldJobs
                                                    0          0          0
                                                  113        194        155
                                                    2          0          0

                      TotalRunningJobs      TotalIdleJobs      TotalHeldJobs

               Total               115                194                155
Then put the following in your CRAB configuration:
#send to ITB pool
config.Debug.collector =  ''
config.Debug.scheddName = ''

Testing a new schedd

Requirements: On Schedd:

  • Transfer the user's proxy to the new schedd.
  • Create a file sleep.jdl with the following content:
Universe   = vanilla
Executable =
Arguments  = 1
Log        = sleep.PC.log
Output     = sleep.out.$(Cluster).$(Process)
Error      = sleep.err.$(Cluster).$(Process)
+DESIRED_Sites =  "T3_US_PuertoRico,T2_FI_HIP,T2_UK_SGrid_RALPP,T2_FR_GRIF_LLR,T3_UK_London_QMUL,T3_TW_NTU_HEP,T3_US_Omaha,T2_KR_KNU,T2_RU_SINP,T3_US_UMD,T2_CH_CERN_AI,T3_US_Colorado,T3_US_UB,T1_UK_RAL_Disk,T3_IT_Napoli,T3_NZ_UOA,T2_TH_CUNSTDA,T3_US_Kansas,T3_US_ParrotTest,T3_GR_IASA,T3_US_Parrot,T2_IT_Bari,T2_US_UCSD,T1_RU_JINR,T3_US_Vanderbilt_EC2,T2_RU_IHEP,T2_RU_RRC_KI,T2_CH_CERN,T3_BY_NCPHEP,T3_US_TTU,T3_GR_Demokritos,T3_US_UTENN,T3_US_UCR,T3_TW_NCU,T2_CH_CSCS,T2_UA_KIPT,T3_RU_FIAN,T2_IN_TIFR,T3_UK_London_UCL,T3_US_Brown,T3_US_UCD,T3_CO_Uniandes,T3_KR_KNU,T2_FR_IPHC,T3_US_OSU,T3_US_TAMU,T1_US_FNAL,T2_IT_Rome,T2_UK_London_Brunel,T3_IN_PUHEP,T3_IT_Trieste,T2_EE_Estonia,T3_UK_ScotGrid_ECDF,T2_CN_Beijing,T2_US_Florida,T3_US_Princeton_ICSE,T3_IT_MIB,T3_US_FNALXEN,T1_DE_KIT,T3_IR_IPM,T2_US_Wisconsin,T2_HU_Budapest,T2_DE_RWTH,T2_US_Vanderbilt,T2_BR_SPRACE,T3_UK_SGrid_Oxford,T3_US_NU,T2_BR_UERJ,T3_MX_Cinvestav,T3_US_FNALLPC,T1_US_FNAL_Disk,T3_US_UIowa,T3_IT_Firenze,T3_US_Cornell,T2_ES_IFCA,T3_US_UVA,T3_ES_Oviedo,T3_US_NotreDame,T2_DE_DESY,T1_UK_RAL,T2_US_Caltech,T3_FR_IPNL,T2_TW_Taiwan,T3_US_NEU,T3_UK_London_RHUL,T0_CH_CERN,T1_RU_JINR_Disk,T3_CN_PKU,T2_UK_London_IC,T2_US_Nebraska,T2_ES_CIEMAT,T3_US_Princeton,T2_PK_NCP,T2_CH_CERN_T0,T3_US_FSU,T3_KR_UOS,T3_IT_Perugia,T3_US_Minnesota,T2_TR_METU,T2_AT_Vienna,T2_US_Purdue,T3_US_Rice,T3_HR_IRB,T2_BE_UCL,T3_US_FIT,T2_UK_SGrid_Bristol,T2_PT_NCG_Lisbon,T1_ES_PIC,T3_US_JHU,T2_IT_Legnaro,T2_RU_INR,T3_US_FIU,T3_EU_Parrot,T2_RU_JINR,T2_IT_Pisa,T3_UK_ScotGrid_GLA,T3_US_MIT,T2_CH_CERN_HLT,T2_MY_UPM_BIRUNI,T1_FR_CCIN2P3,T2_FR_GRIF_IRFU,T3_US_UMiss,T2_FR_CCIN2P3,T2_PL_Warsaw,T3_AS_Parrot,T2_US_MIT,T2_BE_IIHE,T2_RU_ITEP,T1_CH_CERN,T3_CH_PSI,T3_IT_Bologna"
requirements = stringListMember(GLIDEIN_CMSSite,DESIRED_Sites)
should_transfer_files = YES
RequestMemory = 2000
when_to_transfer_output = ON_EXIT
x509userproxy = /home/grid/cms1761/temp/user_proxy_file
x509userproxysubject = "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jbalcas/CN=751133/CN=Justas Balcas"
Queue 1
  • Create a script with the following content and make it executable; its name goes in the Executable field of the JDL:
#!/bin/bash
sleep $1
echo "I slept for $1 seconds on:"
  • Run the following command on the schedd:
condor_submit sleep.jdl
  • condor_q -> This command prints a table with information about the submitted jobs; you should be able to identify the ones you just submitted. After some time your job will complete. To see the output, cat the files that were set as Output and Error in the JDL.
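Before submitting, the executable can be sanity-checked locally. The file name sleep.sh below is an assumption for illustration; use whatever name you set in the Executable field:

```shell
# save the job script under a hypothetical name, sleep.sh, and make it executable
cat > sleep.sh <<'EOF'
#!/bin/bash
sleep $1
echo "I slept for $1 seconds on:"
EOF
chmod +x sleep.sh
# run it once by hand to check the output before letting HTCondor run it
out=$(./sleep.sh 1)
echo "$out"
```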
From the TW machine (the machine name must be added to SCHEDD.ALLOW_WRITE):
  • Create the same sleep.jdl file as on the schedd.
  • Create the same script as on the schedd.
  • Execute the following commands:
# note: sourcing the TW environment will change your working directory
 source /data/srv/TaskManager/
 voms-proxy-init -voms cms
 userDN=$(voms-proxy-info -identity)
 proxypath=$(voms-proxy-info -path)
 sed --in-place "s|x509userproxysubject = .*|x509userproxysubject = \"$userDN\"|" sleep.jdl
 sed --in-place "s|x509userproxy = .*|x509userproxy = $proxypath|" sleep.jdl
# fill in the new schedd (-remote) and pool (-pool) information
 condor_submit -debug -pool -remote sleep.jdl
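The two sed lines above rewrite the proxy fields of sleep.jdl in place. They can be exercised without a real proxy by substituting placeholder values for the voms-proxy-info output (the DN and path below are dummies, not real credentials):

```shell
# minimal sleep.jdl fragment containing only the two lines the seds rewrite
cat > sleep.jdl <<'EOF'
x509userproxy = /old/proxy/path
x509userproxysubject = "/DC=old/CN=Old User"
EOF
# placeholder values standing in for $(voms-proxy-info -identity) and -path
userDN="/DC=ch/CN=dummy"
proxypath="/tmp/x509up_u0"
# same substitutions as in the recipe above
sed --in-place "s|x509userproxysubject = .*|x509userproxysubject = \"$userDN\"|" sleep.jdl
sed --in-place "s|x509userproxy = .*|x509userproxy = $proxypath|" sleep.jdl
cat sleep.jdl
```

Note the subject line must be quoted in the JDL while the proxy path must not, which is why the two sed replacements differ.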

To submit a task through CRAB to a specific schedd and/or pool, modify your CRAB config as follows:

# add this line with a full schedd name to set a specific schedd for a task (0155 in this example)
config.Debug.scheddName = ''
# a collector can be set as well, necessary when the schedd to be tested is not in the global pool (0115 in this example corresponds to the ITB collector):
config.Debug.collector = ''

URLs to access the CRAB Task database

  • queries can also be done via curl with:
    • curl --key /tmp/x509up_u8516 --cert /tmp/x509up_u8516 -k -X GET ''
    • e.g.
      belforte@stefanovm2/CRABServer> curl -s --key /tmp/x509up_u8516 --cert /tmp/x509up_u8516 -k  -X GET ''|grep 180821
      ,["SUBMITTED", "180821_092622:jdegens_crab_SingleElectron_Run2017B_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_092838:jdegens_crab_SingleMuon_Run2017C_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_092902:jdegens_crab_SingleMuon_Run2017D_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_084849:jdegens_crab_SingleMuon_Run2017D_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_092643:jdegens_crab_SingleElectron_Run2017C_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_092926:jdegens_crab_SingleMuon_Run2017E_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_092952:jdegens_crab_SingleMuon_Run2017F_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_084912:jdegens_crab_SingleMuon_Run2017E_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_084720:jdegens_crab_SingleElectron_Run2017E_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_084741:jdegens_crab_SingleElectron_Run2017F_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_084637:jdegens_crab_SingleElectron_Run2017C_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_092729:jdegens_crab_SingleElectron_Run2017E_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_092814:jdegens_crab_SingleMuon_Run2017B_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_084826:jdegens_crab_SingleMuon_Run2017C_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_092706:jdegens_crab_SingleElectron_Run2017D_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_084658:jdegens_crab_SingleElectron_Run2017D_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_084936:jdegens_crab_SingleMuon_Run2017F_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_084615:jdegens_crab_SingleElectron_Run2017B_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_084803:jdegens_crab_SingleMuon_Run2017B_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_092751:jdegens_crab_SingleElectron_Run2017F_31Mar2018v1_13TeV_MINIAOD"]
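When scripting on top of such queries, the bare task name can be extracted from an output line with sed. The sample line below reuses the format shown above:

```shell
# one output line from the task database query, as shown above
line=',["SUBMITTED", "180821_092622:jdegens_crab_SingleElectron_Run2017B_31Mar2018v1_13TeV_MINIAOD"]'
# capture the quoted task name (second quoted field)
task=$(echo "$line" | sed 's/.*, "\([^"]*\)".*/\1/')
echo "$task"
```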

URLs to access the transferdb

Kill all tasks belonging to a user

There are two ways of doing this. One is to use the "crab kill" command. This is the recommended way, since everything gets killed properly (messages sent to the dashboard, transfers killed, etc.):

via crab kill from e.g. an lxplus machine (preferred way)

mkdir kill_user
cd kill_user
#get the list of tasks
TASKS=$(condor_q -pool -all -const 'TaskType=="ROOT" && JobStatus==1 && CRAB_UserHN=="<username>"' -af CRAB_ReqName)
echo $TASKS #to check you grabbed the tasks
source /cvmfs/
for task in $TASKS; do
    crab remake --task=$task
    crab kill --killwarning="Your tasks have been killed by a CRAB3 operator, see your inbox" crab_*
    rm -rf crab_*
done
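Before killing anything, it is worth double-checking that every grabbed task really belongs to the target user. The CRAB task name embeds the submission timestamp and the username, so a quick parse works (sketch; the task name below is taken from the examples above):

```shell
task="180821_092622:jdegens_crab_SingleElectron_Run2017B_31Mar2018v1_13TeV_MINIAOD"
# the username sits between the ':' and the first '_crab_'
user=${task#*:}        # strip the timestamp prefix up to ':'
user=${user%%_crab_*}  # strip everything from '_crab_' onwards
echo "$user"
```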

directly on the schedd via condor commands (if you have reasons not to use the preferred way, e.g. you need to do it quickly)

Alternatively, one can go directly to the schedd and kill the tasks using condor (to my knowledge this has to be done schedd by schedd):

#from the schedd
condor_q -all -const 'TaskType=="ROOT" && JobStatus==1 && CRAB_UserHN=="gurpinar"' -af CRAB_ReqName > ~/tasks.txt
cat ~/tasks.txt #to check you grabbed the tasks

for task in `cat ~/tasks.txt`; do
    sudo -u cms683 condor_hold -const CRAB_ReqName==\"$task\"
done

Then, if you want to be nice, you can also update the warning message in Oracle so that the user will see it with crab status (do this from lxplus). Note that the warning message is updated automatically when using crab kill. Beware: this operation requires knowing the password for the production Oracle database.

mkdir kill_user
cd kill_user
#create the change-warning script, then change XXXXX to the right value
cat <<EOF >
source /afs/ -s prod

python << END
import cx_Oracle as db
import sys

taskname = "\$1"
print taskname

conn = db.connect('cms_analysis_reqmgr/XXXXX@cmsr')
cursor = conn.cursor()
res = cursor.execute("update tasks set tm_task_warnings = '[\"Your tasks have been killed by a CRAB3 operator, see your inbox\"]' where tm_taskname='%s'" % taskname)
conn.commit()
END
EOF

for task in `cat ~/tasks.txt`; do
    sh $task
done

directly on all the schedds via condor command (if you really have an emergency)

You need to issue condor commands on all schedds, and you want to make sure you also HOLD the DAGMANs to prevent resubmissions.

The first thing you need is the CERN user name, aka CRAB_UserHN, e.g. belforte.

You can do things from lxplus.

  1. make a list of schedds
SchedList=`condor_status -sched -con 'CMSGWMS_Type=="crabschedd"' -pool -af machine`
echo $SchedList
which yields the list of schedd host names.
  1. set the target username
  1. hold the DAGMANs
 for s in $SchedList; do ssh -p 2222 ${s} "sudo condor_hold -con 'jobuniverse==7 && crab_userhn==\"${userHN}\"'" ; done
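The quoting here is easy to get wrong: the ClassAd string comparison needs literal double quotes around the value, and ${userHN} must be expanded by the local shell before the constraint reaches condor_hold. Building the constraint in a variable first makes it checkable (sketch, with belforte as the example user):

```shell
userHN="belforte"   # example target user
# escaped inner quotes become literal quotes in the ClassAd expression
con="jobuniverse==7 && crab_userhn==\"${userHN}\""
echo "$con"
# each schedd would then get: ssh -p 2222 $s "sudo condor_hold -con '$con'"
```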

Remove old tasks in one schedd

At times things have gone wrong and tasks ended up in a zombie state, using up a running dagman slot well beyond the a priori defined lifetime of a task. Removal of such old tasks can be done one schedd at a time with a small adaptation of the previous recipe. The following example removes tasks older than 2 weeks.

#from the schedd
sudo su
condor_q -const 'JobUniverse==7 && JobStatus==2 && (time() - JobStartDate > 14 * 86400)' -af ClusterId  > ~/tasks.txt
cat ~/tasks.txt #to check you grabbed the tasks

for task in `cat ~/tasks.txt` ; do
  condor_hold $task -reason "Task removed by CRAB operator because too old"
done
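JobStartDate is an epoch timestamp, so the `14 * 86400` term in the constraint is simply the age threshold in seconds; to use a different cutoff, only the day count needs to change:

```shell
days=14
# seconds in $days days, as used in the condor_q constraint above
threshold=$(( days * 86400 ))
echo "$threshold"
```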

That will store a HoldReason as a ClassAd, e.g.

root@submit-4 ~# condor_q 22111709 -af HoldReason
Task removed by CRAB operator because too old (by user condor)

You can also get the list of CRAB task names with e.g.

for task in `cat ~/tasks.txt` ; do
  condor_q $task -af CRAB_ReqName
done
and use the above list of CRAB task names to insert the HoldReason in Oracle so that the user will see the message with crab status (do this from lxplus using the example in the previous section). For the sake of example, here is what I did to remove very old tasks (>20 days) on all schedds:
sudo su
condor_q -const 'JobUniverse==7 && JobStatus==2 && (time() - JobStartDate > 20 * 86400)' -af ClusterId > ~/tasks.txt
cat ~/tasks.txt

for task in `cat ~/tasks.txt` ; do
  condor_hold $task -reason "Task removed by CRAB operator because too old"
done

for task in `cat ~/tasks.txt` ; do   condor_q $task -af CRAB_ReqName >> /afs/; done

so I got the final list of all the tasks I killed in /afs/ and will use that in the Oracle-changing script:

for task in `cat ~/tasks.txt`; do
    sh $task
done


Commands below need to be run as root:
  • run puppet: puppet agent -tv
  • check if puppet is enabled: cat $(puppet config print agent_disabled_lockfile)
  • enable puppet: puppet agent --enable
  • disable puppet: puppet agent --disable "Stefano says: keep stable"
    • it is good to add a short note in the disable command saying why puppet was disabled
  • show the last time puppet ran: grep puppet /var/log/messages|tail
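The enabled/disabled check can be wrapped in a small helper. The function takes the lockfile path (as returned by `puppet config print agent_disabled_lockfile`) as an argument, so the logic can be tried even on a machine without puppet installed (a sketch, not part of the standard tooling):

```shell
# print puppet agent state given the agent_disabled_lockfile path
puppet_state() {
    if [ -f "$1" ]; then
        # the lockfile exists only while the agent is disabled;
        # its contents hold the message given to 'puppet agent --disable'
        echo "disabled: $(cat "$1")"
    else
        echo "enabled"
    fi
}
```

Usage: `puppet_state "$(puppet config print agent_disabled_lockfile)"`.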

There are various ways to tell which puppet environment this machine is in:

  1. it is printed when you log in
  2. cat /etc/motd|grep Puppet
  3. grep env /etc/puppetlabs/puppet/puppet.conf

For more info see

Topic revision: r122 - 2019-08-17 - StefanoBelforte