Difference: Crab3OperatorDebugging (1 vs. 124)

Revision 124 (2019-09-12) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 854 to 854
 

URL's to access transferdb

Changed:
<
<
  • get transfer status for all files in a task
>
>
  • get transfer status for all files in a task (this is what is used to build the transfer info tab table in the server UI)
 
Line: 864 to 864
 
5: "SUBMITTED",
6: "KILL",
7: "KILLED"}
Changed:
<
<
  • get all info on one file to transfer:
>
>
  • get all info on one file to transfer (this is what is displayed in the transfer info tab of the UI when clicking on a file id)
 

Kill all tasks belonging to a user

Revision 123 (2019-09-04) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 1007 to 1007
 

Puppet

Commands below need to be run as root
  • run puppet: puppet agent -tv
Changed:
<
<
  • check if puppet is enabled: cat $(puppet config print  agent_disabled_lockfile)
>
>
  • check if puppet is enabled: cat $(puppet config print  agent_disabled_lockfile); echo ""
 
  • enable puppet: puppet agent --enable
  • disable puppet: puppet agent --disable "Stefano says: keep stable"
    • it is good to set a short note saying why it is disabled in the disable command
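A quick combined health check, as a minimal sketch composed only from the commands above (run as root; the lockfile path comes from puppet config itself):
# is puppet enabled, and when did it last run?
lockfile=$(puppet config print agent_disabled_lockfile)
if [ -f "$lockfile" ]; then echo "puppet is DISABLED:"; cat "$lockfile"; echo ""; else echo "puppet is enabled"; fi
grep puppet /var/log/messages | tail -3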

Revision 122 (2019-08-17) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 721 to 721
 
#production pool
condor_status -sched -con 'CMSGWMS_Type=="crabschedd"' -pool cmsgwms-collector-global.cern.ch
Changed:
<
<
>
>
# to get a list of host names to use in other commands:
schedds=`condor_status -sched -con 'CMSGWMS_Type=="crabschedd"' -pool cmsgwms-collector-global.cern.ch -af machine`
for s in $schedds ; do echo $s; ssh $s -p 2222 "df -h /data"; done
# ITB pool
condor_status -sched -con 'CMSGWMS_Type=="crabschedd"' -pool cmsgwms-collector-itb.cern.ch
Line: 929 to 931
 done
Changed:
<
<

directy on all the schedds via condor command (if you really have an emergency)

>
>

directly on all the schedds via condor command (if you really have an emergency)

  You need to issue condor commands on all schedd's, and you want to make sure you also HOLD the DAGMANs to prevent resubmissions. A consolidated sketch follows the numbered steps below.
Line: 939 to 941
 
  1. make a list of schedds
Changed:
<
<
SchedList=`condor_status -pool cmsgwms-collector-global.cern.ch -sched|grep crab3|cut -d '@' -f2 |cut -d . -f 1`
>
>
SchedList=`condor_status -sched -con 'CMSGWMS_Type=="crabschedd"' -pool cmsgwms-collector-global.cern.ch -af machine`
echo $SchedList
which yields
Changed:
<
<
vocms0106 vocms0107 vocms0119 vocms0120 vocms0121 vocms0122 vocms0137 vocms0144 vocms0155 vocms0194 vocms0195 vocms0196 vocms0197 vocms0198 vocms0199 vocms059
>
>
vocms0106.cern.ch vocms0107.cern.ch vocms0119.cern.ch vocms0120.cern.ch vocms0121.cern.ch vocms0122.cern.ch vocms0144.cern.ch vocms0155.cern.ch vocms0194.cern.ch vocms0195.cern.ch vocms0196.cern.ch vocms0197.cern.ch vocms0198.cern.ch vocms0199.cern.ch vocms059.cern.ch
 
  2. set the target username
Line: 951 to 953
 
  3. hold the DAGMANs
Changed:
<
<
for s in $SchedList; do ssh -p 2222 ${s}.cern.ch sudo condor_hold -con 'jobuniverse==7 && crab_userhn==${userHN}' ; done
>
>
for s in $SchedList; do ssh -p 2222 ${s} sudo condor_hold -con 'jobuniverse==7 && crab_userhn==${userHN}' ; done
 
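Putting the three steps together, a minimal sketch. Note the quoting: crab_userhn is a string classAd, so the constraint must reach the remote condor_hold as a single argument with $userHN already expanded and quoted:
userHN="belforte"
SchedList=`condor_status -sched -con 'CMSGWMS_Type=="crabschedd"' -pool cmsgwms-collector-global.cern.ch -af machine`
for s in $SchedList; do ssh -p 2222 $s "sudo condor_hold -con 'jobuniverse==7 && crab_userhn==\"$userHN\"'"; done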

Remove old tasks in one schedd

Revision 121 (2019-07-02) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 939 to 939
 
  1. make a list of schedds
Changed:
<
<
SchedList=`condor_status -sched|grep crab3|cut -d '@' -f2 |cut -d . -f 1`
>
>
SchedList=`condor_status -pool cmsgwms-collector-global.cern.ch -sched|grep crab3|cut -d '@' -f2 |cut -d . -f 1`
  echo $SchedList which yields

Revision 120 (2019-05-30) - TodorTrendafilovIvanov

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 750 to 750
 config.Debug.scheddName = 'crab3@vocms069NOSPAMPLEASE.cern.ch'
Added:
>
>
 

Testing a new schedd

Requirements: On Schedd:

Revision 119 (2019-05-10) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 354 to 354
 

To run a pre-job on the schedd

Changed:
<
<
  • Go to the spool directory on the task in the schedd.
>
>
  • Go to the spool directory on the task in the schedd
  • get the user identity in order to access files via sudo su username
    • Note: username is the schedd username to which the task belongs, for example, cms9999. You can e.g. do
sudo su `ls -ld ${PWD}|awk '{print $3}'`
 
  • Copy the relevant arguments that are passed to the pre-job as written in RunJobs.dag (the relevant arguments are those after $RETRY. These look like 1 150818_081011:erupeika_task_name my_machine.cern.ch and are in the second line of the text block for each job).
Added:
>
>
    • this can be done via
export jobId=1  # or whatever job # you want to retry
pjArgs=`cat RunJobs.dag | grep "PRE  Job${jobId}" | awk 'BEGIN { FS="RETRY" }; {print $2}'`
 
  • Identify the file in the spool directory that contains your proxy (the file name is a sequence of 40 lower-case letters and/or digits) and make sure the proxy is still valid (voms-proxy-info -file <proxy_file_name>).
Added:
>
>
    • this can be done via export X509_USER_PROXY=`pwd`/`ls|egrep  [a-z,0-9]{40}`
 
  • You might also need to change the permissions of the prejob_logs directory
  • Run the pre-job. The generic instruction is:
Changed:
<
<
sudo -u <username> sh -c 'export TEST_DONT_REDIRECT_STDOUT=True; export _CONDOR_JOB_AD=finished_jobs/job.1.0; export X509_USER_PROXY=<proxy_file_name>; sh dag_bootstrap.sh PREJOB <retry_count> <arguments>'
>
>
export TEST_DONT_REDIRECT_STDOUT=True; export _CONDOR_JOB_AD=finished_jobs/job.1.0; export X509_USER_PROXY=<proxy_file_name>; sh dag_bootstrap.sh PREJOB <retry_count> <arguments>
Note: retry_count can be 0.

  • if you used the shortcuts indicated above the simpler command is
export TEST_DONT_REDIRECT_STDOUT=True; export _CONDOR_JOB_AD=finished_jobs/job.1.0; sh dag_bootstrap.sh PREJOB 0 $pjArgs
 
Deleted:
<
<
Note: username is the schedd username which the task belongs to, for example, cms9999. retry_count can be 0.
 If none of the jobs have finished yet, you can probably grab the classads from one of the submitted / finished jobs with:
sudo condor_q -l 3891792.0 > /tmp/.job.ad
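If you go that way, point the wrapper at the dumped ad before invoking dag_bootstrap.sh, e.g. (a one-line sketch consistent with the commands above):
export _CONDOR_JOB_AD=/tmp/.job.ad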
Line: 406 to 422
 
  • Find the cmsXXXX username of the user the task belongs to (even if you're running your own task) and replace the cmsXXXX with that.
  • Run the post-job. The generic instruction is:
Changed:
<
<
sudo -u cmsXXXX sh -c 'export _CONDOR_JOB_AD=finished_jobs/job.X.0; export TEST_DONT_REDIRECT_STDOUT=True; export X509_USER_PROXY=<proxy_file_name>; sh dag_bootstrap.sh POSTJOB <job_cluster_id> <job_return_code> <retry_count> <max_retries> <arguments>'
>
>
sudo su cmsXXXX sh -c 'export _CONDOR_JOB_AD=finished_jobs/job.X.0; export TEST_DONT_REDIRECT_STDOUT=True; export X509_USER_PROXY=<proxy_file_name>; sh dag_bootstrap.sh POSTJOB <job_cluster_id> <job_return_code> <retry_count> <max_retries> <arguments>'
  where you have to substitute the arguments with what you found in your RunJobs.dag, the proxy file name and the first four numbers after POSTJOB.

One could also add export TEST_POSTJOB_DISABLE_RETRIES=True to disable the job retry handling part, or export TEST_POSTJOB_NO_STATUS_UPDATE=True to avoid sending status report to dashboard and file metadata upload. A specific example is:

Changed:
<
<
sudo -u cms1425 sh -c 'export _CONDOR_JOB_AD=finished_jobs/job.2.0; export TEST_DONT_REDIRECT_STDOUT=True; export X509_USER_PROXY=a41b46b7b59e9858006416f98d1540f9ddf4d646; sh dag_bootstrap.sh POSTJOB 3971322 0 0 10 160314_085828:erupeika_crab_test7 2 /store/temp/user/erupeika.1ae8d366cbf4507a432372f69f456dc0d23cc26d/GenericTTbar/CRAB3_tutorial_May2015_MC_analysis_pub/160314_085828/0000 /store/user/erupeika/GenericTTbar/CRAB3_tutorial_May2015_MC_analysis_pub/160314_085828/0000 cmsRun_2.log.tar.gz output_2.root'
>
>
sudo su cms1425 sh -c 'export _CONDOR_JOB_AD=finished_jobs/job.2.0; export TEST_DONT_REDIRECT_STDOUT=True; export X509_USER_PROXY=a41b46b7b59e9858006416f98d1540f9ddf4d646; sh dag_bootstrap.sh POSTJOB 3971322 0 0 10 160314_085828:erupeika_crab_test7 2 /store/temp/user/erupeika.1ae8d366cbf4507a432372f69f456dc0d23cc26d/GenericTTbar/CRAB3_tutorial_May2015_MC_analysis_pub/160314_085828/0000 /store/user/erupeika/GenericTTbar/CRAB3_tutorial_May2015_MC_analysis_pub/160314_085828/0000 cmsRun_2.log.tar.gz output_2.root'
 
  • Note that there might be some permission issues, which python should throw an exception about. Modifying the files and folders which are causing problems to have 775 permissions will likely solve the issue. Also you can try to change the owner of the files to cmsXXXX in case they're owned by condor.
  • Note: at this point it is possible to insert import pdb; pdb.set_trace() in TaskWorker/Actions/PostJob.py and it will stop there allowing interactive debugging
Line: 877 to 893
 cat ~/tasks.txt #to check you grabbed the tasks

for task in `cat ~/tasks.txt`; do

Changed:
<
<
sudo -u cms683 condor_hold -const CRAB_ReqName==\"$task\"
>
>
sudo su cms683 condor_hold -const CRAB_ReqName==\"$task\"
 done

Revision 118 (2019-04-02) - TodorTrendafilovIvanov

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 782 to 782
  sed --in-place "s|x509userproxy = .*|x509userproxy = $proxypath|" sleep.jdl # Fill the new schedd information _condor_TOOL_DEBUG=D_FULLDEBUG,D_SECURITY
Changed:
<
<
condor_submit -debug -pool glidein-collector-2.t2.ucsd.edu -remote crab3test-2@vocms95NOSPAMPLEASE.cern.ch sleep.jdl
>
>
condor_submit -debug -pool cmsgwms-collector-global.cern.ch -remote crab3@vocms0199NOSPAMPLEASE.cern.ch sleep.jdl
 

To submit a task through CRAB to a specific schedd and / or pool, modify your crab config as follows:
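For reference, the Debug block is the same one shown in the ITB example further down this page (collector and schedd name here are just examples, pick the ones you need):
config.section_("Debug")
config.Debug.collector =  'cmsgwms-collector-itb.cern.ch'
config.Debug.scheddName = 'crab3@vocms069.cern.ch'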

Revision 117 (2019-03-30) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 415 to 415
 sudo -u cms1425 sh -c 'export _CONDOR_JOB_AD=finished_jobs/job.2.0; export TEST_DONT_REDIRECT_STDOUT=True; export X509_USER_PROXY=a41b46b7b59e9858006416f98d1540f9ddf4d646; sh dag_bootstrap.sh POSTJOB 3971322 0 0 10 160314_085828:erupeika_crab_test7 2 /store/temp/user/erupeika.1ae8d366cbf4507a432372f69f456dc0d23cc26d/GenericTTbar/CRAB3_tutorial_May2015_MC_analysis_pub/160314_085828/0000 /store/user/erupeika/GenericTTbar/CRAB3_tutorial_May2015_MC_analysis_pub/160314_085828/0000 cmsRun_2.log.tar.gz output_2.root'
  • Note that there might be some permission issues, which python should throw an exception about. Modifying the files and folders which are causing problems to have 775 permissions will likely solve the issue. Also you can try to change the owner of the files to cmsXXXX in case they're owned by condor.
Added:
>
>
  • Note: at this point it is possible to insert import pdb; pdb.set_trace() in TaskWorker/Actions/PostJob.py and it will stop there allowing interactive debugging
 
Changed:
<
<
  • Here is a script which automates all of the above:
>
>
  • Here is a script which automates all of the above (except the insertion of pdb calls):
 

Revision 116 (2019-01-18) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 991 to 991
 
  • enable puppet: puppet agent --enable
  • disable puppet: puppet agent --disable "Stefano says: keep stable"
    • it is good to set a short note saying why it is disabled in the disable command
Added:
>
>
  • tell when puppet last ran: grep puppet /var/log/messages|tail
  How to tell which puppet environment this machine is in (several ways):
  1. it is printed when you log in
  2. cat /etc/motd|grep Puppet
Changed:
<
<
  1. grep env /etc/puppetlabs/puppet/puppet.con
>
>
  1. grep env /etc/puppetlabs/puppet/puppet.conf
  For more info see https://twiki.cern.ch/twiki/bin/view/CMSPublic/CRABPuppet

Revision 115 (2018-11-14) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 833 to 833
 
Added:
>
>

URL's to access transferdb

 

Kill all tasks belonging to a user

There are two ways of doing this. One is to use the "crab kill" command. This is the recommended way since things get killed properly (messages sent to dashboard, transfers killed, etc.):
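A minimal sketch of this way, assuming the task names are in ~/tasks.txt and you hold a valid proxy for an identity allowed to act on them; crab remake rebuilds a project directory that crab kill can use (picking the directory with ls -dt is a hypothetical shortcut — check what remake actually printed):
for task in `cat ~/tasks.txt`; do
  crab remake --task=$task
  crab kill -d `ls -dt crab_*/ | head -1`
done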

Revision 114 (2018-11-09) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 978 to 978
 
  • disable puppet: puppet agent --disable "Stefano says: keep stable"
    • it is good to set a short note saying why it is disabled in the disable command
Added:
>
>
How to tell which puppet environment this machine is in (several ways):
  1. it is printed when you log in
  2. cat /etc/motd|grep Puppet
  3. grep env /etc/puppetlabs/puppet/puppet.con

For more info see https://twiki.cern.ch/twiki/bin/view/CMSPublic/CRABPuppet

 
META TOPICMOVED by="atanasi" date="1412700166" from="CMS.Crab3OperatorDebugging" to="CMSPublic.Crab3OperatorDebugging"

Revision 113 (2018-10-26) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 335 to 335
 export _CONDOR_JOB_AD=finished_jobs/job.0-1.0

export TEST_DONT_REDIRECT_STDOUT=True

Changed:
<
<
export X509_USER_PROXY=cfa0cb32f2043a8042e1290debf652b8fd62334e
>
>
#export X509_USER_PROXY=cfa0cb32f2043a8042e1290debf652b8fd62334e
export X509_USER_PROXY=`ls|egrep [a-z,0-9]{40}`
 rm automatic_splitting/processed

vim TaskWorker/Actions/PreDAG.py

Revision 112 (2018-09-07) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 286 to 286
 ,"aldo" ]}
Changed:
<
<
  • To get all the files uploaded by a user to the crabcache and the amount of quota he's using:
>
>
  • To get all the files uploaded by a user to the crabcache and the amount of quota (in bytes) he's using:
 
curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=userinfo&username=mmascher' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
{"result": [

Revision 111 (2018-09-02) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 968 to 968
 done
Added:
>
>

Puppet

Commands below need to be run as root
  • run puppet: puppet agent -tv
  • check if puppet is enabled: cat $(puppet config print  agent_disabled_lockfile)
  • enable puppet: puppet agent --enable
  • disable puppet: puppet agent --disable "Stefano says: keep stable"
    • it is good to set a short note saying why it is disabled in the disable command
 
META TOPICMOVED by="atanasi" date="1412700166" from="CMS.Crab3OperatorDebugging" to="CMSPublic.Crab3OperatorDebugging"

Revision 110 (2018-08-29) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 293 to 293
  {"file_list": ["/data/state/crabcache/files/m/mmascher/69/697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a", "/data/state/crabcache/files/m/mmascher/14/14571bc71cf2077961408fb1a060b9497a9c7d0cc1dcb47ed0f7fc1ac2e3748d", "/data/state/crabcache/files/m/mmascher/ef/efdedb430f8462c72259fd2e427e684d8e3aedf8d0d811bf7ef8a97f30a47bac", ..., "/data/state/crabcache/files/m/mmascher/05/059b06681025f14ecce5cbdc49c83e1e945a30838981c37b5293781090b07bd7", "/data/state/crabcache/files/m/mmascher/up/uplog.txt"], "used_space": [130170972]} ]}
Changed:
<
<
  • To get more information about one specific file:
>
>
  • To get more information about one specific file (the file must be owned by the user who makes the query):
 
curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=fileinfo&hashkey=697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
{"result": [

Revision 109 (2018-08-23) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 798 to 798
 
Added:
>
>

  • queries can also be done via curl with :
    • curl --key /tmp/x509up_u8516 --cert /tmp/x509up_u8516 -k -X GET ''
    • e.g.
      belforte@stefanovm2/CRABServer> curl -s --key /tmp/x509up_u8516 --cert /tmp/x509up_u8516 -k  -X GET 'https://cmsweb.cern.ch/crabserver/prod/task?subresource=taskbystatus&username=jdegens&taskstatus=SUBMITTED'|grep 180821
      ,["SUBMITTED", "180821_092622:jdegens_crab_SingleElectron_Run2017B_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_092838:jdegens_crab_SingleMuon_Run2017C_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_092902:jdegens_crab_SingleMuon_Run2017D_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_084849:jdegens_crab_SingleMuon_Run2017D_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_092643:jdegens_crab_SingleElectron_Run2017C_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_092926:jdegens_crab_SingleMuon_Run2017E_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_092952:jdegens_crab_SingleMuon_Run2017F_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_084912:jdegens_crab_SingleMuon_Run2017E_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_084720:jdegens_crab_SingleElectron_Run2017E_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_084741:jdegens_crab_SingleElectron_Run2017F_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_084637:jdegens_crab_SingleElectron_Run2017C_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_092729:jdegens_crab_SingleElectron_Run2017E_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_092814:jdegens_crab_SingleMuon_Run2017B_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_084826:jdegens_crab_SingleMuon_Run2017C_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_092706:jdegens_crab_SingleElectron_Run2017D_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_084658:jdegens_crab_SingleElectron_Run2017D_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_084936:jdegens_crab_SingleMuon_Run2017F_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_084615:jdegens_crab_SingleElectron_Run2017B_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_084803:jdegens_crab_SingleMuon_Run2017B_31Mar2018v1_13TeV_MINIAOD"]
      ,["SUBMITTED", "180821_092751:jdegens_crab_SingleElectron_Run2017F_31Mar2018v1_13TeV_MINIAOD"]
      belforte@stefanovm2/CRABServer> 
      

 

Kill all tasks belonging to a user

Revision 108 (2018-08-22) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 166 to 166
 

on REST interface

Changed:
<
<
take the X-Error-Id in the crab.log, go to vocms022 where the REST logs are copied in /build/srv-logs then grep for the X-Error-Id, for example:
>
>

Find REST logs

take the X-Error-Id in the crab.log, go to vocms022 or vocms055 where the REST logs are copied in /build/srv-logs then grep for the X-Error-Id, for example:

 
grep <X-Error-Id> /build/srv-logs/vocms*/crabserver/crabserver-20151015.log -C 30
Added:
>
>
 you will get a stacktrace that helps you understand what is going on

Revision 107 (2018-08-10) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 452 to 452
export CRAB3_RUNTIME_DEBUG=True
tar xzmf CMSRunAnalysis.tar.gz
cp Job.submit Job.1.submit
Added:
>
>
  • or you can make things easier by creating e.g. a new run_job99.sh via
    cp  run_job.sh run_job99.sh
    vim run_job99.sh
    and at the vim prompt (:) type: %s/${1}/99/g
    then write and exit, and you have all the lines to execute 
 
  • pick the argument list for e.g. a copy/paste with mouse:
    • cat InputArgs.txt | sed "1q;d" (replace 1 with the jobid)
  • now can run the main wrapper CMSRunAnalysis.sh like e.g.

Revision 106 (2018-06-19) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 320 to 320
 

Development Oriented Tricks I did not know where to put

Added:
>
>

To run the preDag on the schedd with debugger

  • I only tried this after it had already run by itself.
  • sketchy notes on what to do there, may be enough:
cd SPOOL_DIR
sudo su cms1627

#export _CONDOR_JOB_AD=finished_jobs/job.1.0
export _CONDOR_JOB_AD=finished_jobs/job.0-1.0 

export TEST_DONT_REDIRECT_STDOUT=True
export X509_USER_PROXY=cfa0cb32f2043a8042e1290debf652b8fd62334e
rm automatic_splitting/processed

vim TaskWorker/Actions/PreDAG.py
and put a breakpoint at some proper place:
import pdb
pdb.set_trace()

python
import TaskWorker.Actions.PreDAG as PreDAG
PreDAG.PreDAG().execute("processing",5,0)

 

To run a pre-job on the schedd

Revision 105 (2018-06-05) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 419 to 419
 sh run_job.sh
Added:
>
>

Better way with preparelocal

  • crab preparelocal -d ...
  • cd .../local
  • look at run_job.sh and execute it line by line interactively, replacing ${1} with the job number, e.g.
     export SCRAM_ARCH=slc6_amd64_gcc481; export CRAB_RUNTIME_TARBALL=local; export CRAB_TASKMANAGER_TARBALL=local; export _CONDOR_JOB_AD=Job.1.submit
     export CRAB3_RUNTIME_DEBUG=True
     tar xzmf CMSRunAnalysis.tar.gz
     cp  Job.submit Job.1.submit 
  • pick the argument list for e.g. a copy/paste with mouse:
    • cat InputArgs.txt | sed "1q;d" (replace 1 with the jobid)
  • now can run the main wrapper CMSRunAnalysis.sh like e.g.
    • ./CMSRunAnalysis.sh -a sandbox.tar.gz --sourceURL=https://stefanovm2.cern.ch/crabcache --jobNumber=1 --cmsswVersion=CMSSW_10_1_0 --scramArch=slc6_amd64_gcc630 --inputFile=job_input_file_list_1.txt --runAndLumis=job_lumis_1.json --lheInputFiles=False --firstEvent=None --firstLumi=None --lastEvent=None --firstRun=None --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=None --maxRuntime=-1 --scriptArgs=[] -o {}
  • if want to debug the python part CMSRunAnalysis.py, edit it and put a call to pdb.set_trace() after __main__, then:
    • export PYTHONPATH=`pwd`/CRAB3.zip:`pwd`/WMCore.zip:$PYTHONPATH
    • python CMSRunAnalysis.py -r "`pwd`" -a sandbox.tar.gz --sourceURL=https://stefanovm2.cern.ch/crabcache --jobNumber=1 --cmsswVersion=CMSSW_10_1_0 --scramArch=slc6_amd64_gcc630 --inputFile=job_input_file_list_1.txt --runAndLumis=job_lumis_1.json --lheInputFiles=False --firstEvent=None --firstLumi=None --lastEvent=None --firstRun=None --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=None --maxRuntime=-1 --scriptArgs=[] -o {} 

 

To run the job wrapper and the stageout wrapper from lxplus

  • Copy the spool directory of the task you want to run to lxplus, e.g.:

Revision 104 (2018-01-10) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 627 to 627
# condor_submit the task_process jdl (called daemon.jdl). It should be done from the parent directory because of paths in the jdl.
condor_submit task_process/daemon.jdl
Added:
>
>
To work on a list of task names for a given user (e.g. in file TL)
# setup as that user
sudo su cms1425
cd /home/grid/cms1425
# put the list of task names in an env. var:
tList=`cat TL`
# check
for t in $tList ; do eval condor_q -con 'crab_reqname==\"$t\"'; done
# start task_process
for t in $tList ; do echo $t; pushd $t/SPOOL_DIR; condor_submit task_process/daemon.jdl; popd; done

# condor_submit the task_process jdl (called daemon.jdl). It should be done from the parent directory because of paths in the jdl.
condor_submit task_process/daemon.jdl

 

HTCondor pools

There are two pools we use: global (i.e. Production) and ITB (Integration Test Bed), which are not correlated to CRAB prod/pre-prod/dev instances, CMSWEB testbed etc. The two HTCondor pools are selected by their collector.

Revision 103 (2017-12-18) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 633 to 633
 

List schedd's in the pool

#production pool
Changed:
<
<
condor_status -sched -con 'CMSGWMS_Type=="crabschedd"' -pool cmsgwms-collector-itb.cern.ch
>
>
condor_status -sched -con 'CMSGWMS_Type=="crabschedd"' -pool cmsgwms-collector-global.cern.ch
  # ITB pool
Changed:
<
<
condor_status -sched -con 'CMSGWMS_Type=="crabschedd"' -pool cmsgwms-collector-global.cern.ch
>
>
condor_status -sched -con 'CMSGWMS_Type=="crabschedd"' -pool cmsgwms-collector-itb.cern.ch
 

Submit a CRAB task to the ITB pool

Revision 102 (2017-10-05) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 14 to 14
  CRAB Logo
Changed:
<
<

Operator Task Management

>
>

Operator Tips and Hints

 
Go to SWGuideCrab
Line: 628 to 628
 condor_submit task_process/daemon.jdl
Added:
>
>

HTCondor pools

There are two pools we use: global (i.e. Production) and ITB (Integration Test Bed), which are not correlated to CRAB prod/pre-prod/dev instances, CMSWEB testbed etc. The two HTCondor pools are selected by their collector.

List schedd's in the pool

#production pool
condor_status -sched -con 'CMSGWMS_Type=="crabschedd"' -pool cmsgwms-collector-itb.cern.ch

# ITB pool
condor_status -sched -con 'CMSGWMS_Type=="crabschedd"' -pool cmsgwms-collector-global.cern.ch

Submit a CRAB task to the ITB pool

Need to explicitly select the ITB collector and one ITB schedd among the possible ones, e.g.
belforte@lxplus055/TC3> condor_status -sched -con 'CMSGWMS_Type=="crabschedd"' -pool cmsgwms-collector-itb
Name                    Machine           RunningJobs   IdleJobs   HeldJobs

crab3@vocms0115.cern.ch vocms0115.cern.ch           0          0          0
crab3@vocms068.cern.ch  vocms068.cern.ch          113        194        155
crab3@vocms069.cern.ch  vocms069.cern.ch            2          0          0

                      TotalRunningJobs      TotalIdleJobs      TotalHeldJobs

                    
               Total               115                194                155
belforte@lxplus055/TC3> 
Then put in crabConfig.py
config.section_("Debug")
#send to ITB pool
config.Debug.collector =  'cmsgwms-collector-itb.cern.ch'
config.Debug.scheddName = 'crab3@vocms069.cern.ch'
 

Testing a new schedd

Revision 101 (2017-08-10) - EmilisAntanasRupeika

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 615 to 615
 cd "$(dirname "$(readlink "jobs_log.txt")")"
Changed:
<
<

Script for checking that each task on the schedd has a running task process

#!/bin/bash 

enable_flag=/etc/enable_task_daemon
[ -f $enable_flag ] || exit 0

log_file=/tmp/`basename $0`
log_file=${log_file%%.sh}.log
:>$log_file
>
>

How to (re)start a task_process for a single task

 
Changed:
<
<
RunningDags=/tmp/tmpRunningDags.txt
RunningTPs=/tmp/tmpRunningTPs.txt
>
>
In case a certain task has no task_process running, it can be restarted with a simple condor_submit from the schedd. However, it should be done as the user to whom the task belongs, otherwise the task_process will run into a lot of permission issues in the spool directory and ultimately won't work. The procedure is as follows:
 
Changed:
<
<
condor_q -const 'tasktype=="ROOT" && jobstatus=?=2' -af: clusterid > $RunningDags
condor_q -const 'jobuniverse==12 && jobstatus==2' -af: clusterid args > $RunningTPs

for clusterid in `cat $RunningDags`
do
  /bin/grep $clusterid $RunningTPs >/dev/null 2>&1 || echo "DAG $clusterid has no TP running on schedd $HOSTNAME" |tee -a $log_file
done

>
>
# Locate the spool dir of the task, for example by using a known ClusterID of the DAG:
cd `condor_q 9454770.0 -af Iwd`
# Impersonate the user who submitted the original task
sudo su cms1425
# condor_submit the task_process jdl (called daemon.jdl). It should be done from the parent directory because of paths in the jdl.
condor_submit task_process/daemon.jdl
 

Revision 100 (2017-07-04) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 356 to 356
 

To run a post-job on the schedd

Changed:
<
<
  • Beware: these instructions seem not to work anymore as of June 20, 2017
>
>
  • instructions updated and checked: June 2017
 
  • Go to the spool directory of the task in the schedd.
  • Find the HTCondor clusterId of the job for which you want to run the post-job, e.g. with something like jobClusterId=`ls -l finished_jobs/job.2.0 | awk '{print $NF}'|cut -d. -f2`
  • Copy the relevant arguments that are passed to the post-job corresponding to the above cluster id job as written in RunJobs.dag (the relevant arguments are those after $MAX_RETRIES). Example:

Revision 99 (2017-06-26) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 386 to 386
 
  • Note that there might be some permission issues, which python should throw an exception about. Modifying the files and folders which are causing problems to have 775 permissions will likely solve the issue. Also you can try to change the owner of the files to cmsXXXX in case they're owned by condor.
Changed:
<
<
  • Here is an example of a script which automates all of the above.
<!--/twistyPlugin twikiMakeVisibleInline-->
#!/bin/bash

#set -x
>
>
 
Deleted:
<
<
if [ $# -ne 2 ]; then
  echo "usage: $0 <taskName> <jobId>"
  echo " will re-run postjob for that job on that task "
fi

echo "BEWARE: THIS SCRIPT NEEDS TO BE source'D NOT EXECUTED"
echo ""

# start from task name
taskName=$1
# and a job number (id in task) for which to rerun PostJob
jobId=$2
# and for which retry number
jobRetry=0

#========== Now the real action
# see https://twiki.cern.ch/twiki/bin/view/CMSPublic/Crab3OperatorDebugging#To_run_a_post_job_on_the_schedd
export _CONDOR_PER_JOB_HISTORY_DIR=`condor_config_val PER_JOB_HISTORY_DIR`

# find the task spool directory
constrain=\'crab_workflow==\"$taskName\"\&\&jobUniverse==7\'
cat > /tmp/myCondorQ.sh <<EOF
condor_q -con $constrain -af iwd
EOF
chmod +x /tmp/myCondorQ.sh
spoolDir=`/tmp/myCondorQ.sh`
rm /tmp/myCondorQ.sh

cd $spoolDir

# find who owns the spool directory
username=`ls -ld ${PWD}|awk '{print $3}'`

# current username is $USER, make sure it matches
if [ $USER != $username ]; then
  echo "need to have access to this directory. do:"
  echo " sudo su $username"
  echo "and then source again"
  exit
fi

# grab the proxy
export X509_USER_PROXY=`ls|egrep [a-z,0-9]{40}`
#voms-proxy-info

# grab args from the dagman config
PJargs=`cat RunJobs.dag | grep "POST Job${jobId}" | awk 'BEGIN { FS="MAX_RETRIES" }; {print $2}'`

# find the condor clusterId for the job
jobClusterId=`ls -l finished_jobs/job.${jobId}.${jobRetry} | awk '{print $NF}'|cut -d. -f2-3`

# reset PJ count
echo '{"pre": 1, "post": 0}' > retry_info/job.${jobId}.txt
rm defer_info/defer_num.${jobId}.0.txt

# these two are mandatory
export _CONDOR_JOB_AD=finished_jobs/job.${jobId}.0
export TEST_DONT_REDIRECT_STDOUT=True

# to disable the job retry handling part
export TEST_POSTJOB_DISABLE_RETRIES=True

# to avoid sending status report to dashboard and file metadata upload
export TEST_POSTJOB_NO_STATUS_UPDATE=True

# set some other args to emulate when calling the bootstrap
jobReturnCode=0
retryCount=0
maxRetries=10

# could run it, but all in all simply print the command and let the user run it
echo "copy/paste and execute this command to re-run the post job"
echo "sh dag_bootstrap.sh POSTJOB ${jobClusterId} ${jobReturnCode} ${retryCount} ${maxRetries} $PJargs"

 
Deleted:
<
<
<!--/twistyPlugin-->
 

To run the job wrapper from lxplus

Revision 98 (2017-06-23) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 386 to 386
 
  • Note that there might be some permission issues, which python should throw an exception about. Modifying the files and folders which are causing problems to have 775 permissions will likely solve the issue. Also you can try to change the owner of the files to cmsXXXX in case they're owned by condor.
Changed:
<
<
  • Here is an example of a script which automates all of the above. But be aware that it is not fully validated yet
>
>
  • Here is an example of a script which automates all of the above.
 
<!--/twistyPlugin twikiMakeVisibleInline-->
Changed:
<
<
# # this initial lines could become arguments:
>
>
#!/bin/bash

#set -x

if [ $# -ne 2 ]; then
  echo "usage: $0 <taskName> <jobId>"
  echo " will re-run postjob for that job on that task "
fi

echo "BEWARE: THIS SCRIPT NEEDS TO BE source'D NOT EXECUTED" echo ""

 # start from task name
Changed:
<
<
taskName="170622_192358:belforte_crab_20170622_212353"
>
>
taskName=$1
 # and a job number (id in task) for which to rerun PostJob
Changed:
<
<
jobId=1
>
>
jobId=$2
# and for which retry number
jobRetry=0
Deleted:
<
<
# need the correct user name for file/dir access
username=cms1627  # this is for belforte
  # #========== Now the real action
Line: 406 to 415
 # export _CONDOR_PER_JOB_HISTORY_DIR=`condor_config_val PER_JOB_HISTORY_DIR`
Changed:
<
<
#
>
>
# find the task spool directory
constrain=\'crab_workflow==\"$taskName\"\&\&jobUniverse==7\'
cat > /tmp/myCondorQ.sh <<EOF
condor_q -con $constrain -af iwd
Line: 418 to 427
#
cd $spoolDir
Changed:
<
<
# sudo su cmsxxx sudo su ${username} #
>
>
# find who owns the spool directory
username=`ls -ld ${PWD}|awk '{print $3}'`

# current username is $USER, make sure it matches
if [ $USER != $username ]; then
  echo "need to have access to this directory. do:"
  echo " sudo su $username"
  echo "and then source again"
  exit
fi

# grab the proxy
export X509_USER_PROXY=`ls|egrep  [a-z,0-9]{40}`
#voms-proxy-info
Changed:
<
<
# want to rerun PJ for job 2
>
>
# grab args from the dagman config
 PJargs=`cat RunJobs.dag | grep "POST Job${jobId}" | awk 'BEGIN { FS="MAX_RETRIES" }; {print $2}'`
Deleted:
<
<
# check
echo "PJargs is :"
echo $PJargs
# find the condor clusterId for the job
jobClusterId=`ls -l finished_jobs/job.${jobId}.${jobRetry} | awk '{print $NF}'|cut -d. -f2-3`
Deleted:
<
<
#
# reset PJ count
echo '{"pre": 1, "post": 0}' > retry_info/job.${jobId}.txt
rm defer_info/defer_num.${jobId}.0.txt
Deleted:
<
<
#
# these two are mandatory
export _CONDOR_JOB_AD=finished_jobs/job.${jobId}.0
export TEST_DONT_REDIRECT_STDOUT=True
Line: 456 to 468
retryCount=0
maxRetries=10
Changed:
<
<
# now run it
>
>
# could run it, but all in all simply print the command and let the user run it
echo "copy/paste and execute this command to re-run the post job"
 echo "sh dag_bootstrap.sh POSTJOB ${jobClusterId} ${jobReturnCode} ${retryCount} ${maxRetries} $PJargs"

Revision 97 (2017-06-22) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 387 to 387
 
  • Note that there might be some permission issues, which python should throw an exception about. Modifying the files and folders which are causing problems to have 775 permissions will likely solve the issue. Also you can try to change the owner of the files to cmsXXXX in case they're owned by condor.

  • Here is an example of a script which automates all of the above. But be aware that it is not fully validated yet
Changed:
<
<
>
>
<!--/twistyPlugin twikiMakeVisibleInline-->
 
#
# these initial lines could become arguments:
# start from task name
Changed:
<
<
taskName="170620_163137:belforte_crab_20170620_183133"
>
>
taskName="170622_192358:belforte_crab_20170622_212353"
 # and a job number (id in task) for which to rerun PostJob
Changed:
<
<
jobId=2
>
>
jobId=1
# and for which retry number
jobRetry=0
# need the correct user name for file/dir access
Line: 460 to 460
 echo "sh dag_bootstrap.sh POSTJOB ${jobClusterId} ${jobReturnCode} ${retryCount} ${maxRetries} $PJargs"

Changed:
<
<
>
>
<!--/twistyPlugin-->
 

To run the job wrapper from lxplus

Revision 96 (2017-06-22) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 357 to 357
 

To run a post-job on the schedd

Added:
>
>
  • Beware: these instructions seem not to work anymore as of June 20, 2017
 
  • Go to the spool directory of the task in the schedd.
Changed:
<
<
  • Look at the task page in GlideMon and copy the cluster id of the job for which you want to run the post-job.
>
>
  • Find the HTCondor clusterId of the job for which you want to run the post-job, e.g. with something like jobClusterId=`ls -l finished_jobs/job.2.0 | awk '{print $NF}'|cut -d. -f2`
 
  • Copy the relevant arguments that are passed to the post-job corresponding to the above cluster id job as written in RunJobs.dag (the relevant arguments are those after $MAX_RETRIES). Example:
PJargs=`cat RunJobs.dag | grep "POST Job1" | awk 'BEGIN { FS="MAX_RETRIES" }; {print $2}'`
Line: 369 to 370
 170620_163137:belforte_crab_20170620_183133 1 /store/temp/user/belforte.be1f4dc5be8664cbd145bf008f5399adf42b086f/prova/doppio/slash/GenericTTbar/CRAB3_test-SB/170620_163137/0000 /store/user/belforte/prova/doppio//slash/GenericTTbar/CRAB3_test-SB/170620_163137/0000 cmsRun_1.log.tar.gz conventional kk_1.root belforte@vocms059/~>
Changed:
<
<
  • Identify the file in the spool directory that contains your proxy (the file name is a sequence of 40 lower-case letters and/or digits) and make sure the proxy is still valid (voms-proxy-info -file <proxy_file_name>).
>
>
  • Identify the file in the spool directory that contains your proxy and set X509_USER_PROXY env. var (the file name is a sequence of 40 lower-case letters and/or digits, could e.g. use export X509_USER_PROXY=`ls|egrep  [a-z,0-9]{40}`) and make sure the proxy is still valid (voms-proxy-info).
 
  • Modify the file retry_info/job.X.txt so that the post value is equal to 0.
  • Delete the file defer_info/defer_num.X.0.txt.
  • Find the cmsXXXX username of the user the task belongs to (even if you're running your own task) and replace the cmsXXXX with that.
Line: 385 to 386
 
  • Note that there might be some permission issues, which python should throw an exception about. Modifying the files and folders which are causing problems to have 775 permissions will likely solve the issue. Also you can try to change the owner of the files to cmsXXXX in case they're owned by condor.
Added:
>
>
  • Here is an example of a script which automates all of the above. But be aware that it is not fully validated yet

#
# these initial lines could become arguments:
# start from task name
taskName="170620_163137:belforte_crab_20170620_183133"
# and a job number (id in task) for which to rerun PostJob
jobId=2
# and for which retry number
jobRetry=0
# need the correct user name for file/dir access
username=cms1627  # this is for belforte

#
#========== Now the real action
# see https://twiki.cern.ch/twiki/bin/view/CMSPublic/Crab3OperatorDebugging#To_run_a_post_job_on_the_schedd
#
export _CONDOR_PER_JOB_HISTORY_DIR=`condor_config_val PER_JOB_HISTORY_DIR`

#
constrain=\'crab_workflow==\"$taskName\"\&\&jobUniverse==7\'
cat > /tmp/myCondorQ.sh <<EOF
condor_q -con $constrain -af iwd
EOF
chmod +x /tmp/myCondorQ.sh
spoolDir=`/tmp/myCondorQ.sh`
rm /tmp/myCondorQ.sh

#
cd $spoolDir

# sudo su cmsxxx
sudo su ${username}
#
# grab the proxy
export X509_USER_PROXY=`ls|egrep  [a-z,0-9]{40}`
#voms-proxy-info

# want to rerun PJ for job 2
PJargs=`cat RunJobs.dag | grep "POST Job${jobId}" | awk 'BEGIN { FS="MAX_RETRIES" }; {print $2}'`
# check
echo "PJargs is :"
echo $PJargs


# find the condor clusterId for the job 
jobClusterId=`ls -l finished_jobs/job.${jobId}.${jobRetry} | awk '{print $NF}'|cut -d. -f2-3`

#
# reset PJ count
echo '{"pre": 1, "post": 0}' > retry_info/job.${jobId}.txt
rm defer_info/defer_num.${jobId}.0.txt

#
# these two are mandatory
export _CONDOR_JOB_AD=finished_jobs/job.${jobId}.0
export TEST_DONT_REDIRECT_STDOUT=True

# to disable the job retry handling part
export TEST_POSTJOB_DISABLE_RETRIES=True

# to avoid sending status report to dashboard and file metadata upload
export TEST_POSTJOB_NO_STATUS_UPDATE=True

# set some other args to emulate when calling the bootstrap
jobReturnCode=0
retryCount=0
maxRetries=10

# now run it
echo "sh dag_bootstrap.sh POSTJOB ${jobClusterId} ${jobReturnCode} ${retryCount} ${maxRetries} $PJargs"

 

To run the job wrapper from lxplus

  • Copy the spool directory of the task you want to run to lxplus, e.g.:

Revision 95 (2017-06-22) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 361 to 361
 
  • Look at the task page in GlideMon and copy the cluster id of the job for which you want to run the post-job.
  • Copy the relevant arguments that are passed to the post-job corresponding to the above cluster id job as written in RunJobs.dag (the relevant arguments are those after $MAX_RETRIES). Example:
Changed:
<
<
args=`cat RunJobs.dag | grep "POST Job1" | awk 'BEGIN { FS="MAX_RETRIES" }; {print $2}'`
>
>
PJargs=`cat RunJobs.dag | grep "POST Job1" | awk 'BEGIN { FS="MAX_RETRIES" }; {print $2}'`
  which results in something like:
Changed:
<
<
belforte@vocms059/~> echo $args
>
>
belforte@vocms059/~> echo $PJargs
 170620_163137:belforte_crab_20170620_183133 1 /store/temp/user/belforte.be1f4dc5be8664cbd145bf008f5399adf42b086f/prova/doppio/slash/GenericTTbar/CRAB3_test-SB/170620_163137/0000 /store/user/belforte/prova/doppio//slash/GenericTTbar/CRAB3_test-SB/170620_163137/0000 cmsRun_1.log.tar.gz conventional kk_1.root belforte@vocms059/~>
Line: 377 to 377
 
sudo -u cmsXXXX sh -c 'export _CONDOR_JOB_AD=finished_jobs/job.X.0; export TEST_DONT_REDIRECT_STDOUT=True; export X509_USER_PROXY=<proxy_file_name>; sh dag_bootstrap.sh POSTJOB <job_cluster_id> <job_return_code> <retry_count> <max_retries> <arguments>'
Changed:
<
<
where you have to substitute the arguments with what you found in your RunJobs.dag, the proxy file name and the first four numbers after POSTJOB. One could also add export TEST_POSTJOB_DISABLE_RETRIES=True to disable the job retry handling part, or export TEST_POSTJOB_NO_STATUS_UPDATE=True to avoid sending status report to dashboard and file metadata upload. A specific example is:
>
>
where you have to substitute the arguments with what you found in your RunJobs.dag, the proxy file name and the first four numbers after POSTJOB.

One could also add export TEST_POSTJOB_DISABLE_RETRIES=True to disable the job retry handling part, or export TEST_POSTJOB_NO_STATUS_UPDATE=True to avoid sending status report to dashboard and file metadata upload. A specific example is:

 
sudo -u cms1425 sh -c 'export _CONDOR_JOB_AD=finished_jobs/job.2.0; export TEST_DONT_REDIRECT_STDOUT=True; export X509_USER_PROXY=a41b46b7b59e9858006416f98d1540f9ddf4d646; sh dag_bootstrap.sh POSTJOB 3971322 0 0 10 160314_085828:erupeika_crab_test7 2 /store/temp/user/erupeika.1ae8d366cbf4507a432372f69f456dc0d23cc26d/GenericTTbar/CRAB3_tutorial_May2015_MC_analysis_pub/160314_085828/0000 /store/user/erupeika/GenericTTbar/CRAB3_tutorial_May2015_MC_analysis_pub/160314_085828/0000 cmsRun_2.log.tar.gz output_2.root'

Revision 94 (2017-06-21) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 359 to 359
 
  • Go to the spool directory of the task in the schedd.
  • Look at the task page in GlideMon and copy the cluster id of the job for which you want to run the post-job.
Changed:
<
<
  • Copy the relevant arguments that are passed to the post-job corresponding to the above cluster id job as written in RunJobs.dag (the relevant arguments are those after $MAX_RETRIES).
>
>
  • Copy the relevant arguments that are passed to the post-job corresponding to the above cluster id job as written in RunJobs.dag (the relevant arguments are those after $MAX_RETRIES). Example:
args=`cat RunJobs.dag | grep "POST Job1" | awk 'BEGIN { FS="MAX_RETRIES" }; {print $2}'`
which results in something like:
belforte@vocms059/~> echo $args
170620_163137:belforte_crab_20170620_183133 1 /store/temp/user/belforte.be1f4dc5be8664cbd145bf008f5399adf42b086f/prova/doppio/slash/GenericTTbar/CRAB3_test-SB/170620_163137/0000 /store/user/belforte/prova/doppio//slash/GenericTTbar/CRAB3_test-SB/170620_163137/0000 cmsRun_1.log.tar.gz conventional kk_1.root
belforte@vocms059/~> 
 
  • Identify the file in the spool directory that contains your proxy (the file name is a sequence of 40 lower-case letters and/or digits) and make sure the proxy is still valid (voms-proxy-info -file <proxy_file_name>).
  • Modify the file retry_info/job.X.txt so that the post value is equal to 0.
  • Delete the file defer_info/defer_num.X.0.txt.

Revision 93 (2017-05-25) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 755 to 755
 done
Added:
>
>

directy on all the schedds via condor command (if you really have an emergency)

You need to issue condor commands on all schedd's, and you want to make sure you also HOLD the DAGMANs to prevent resubmissions.

First thing is to have the CERN user name aka CRAB_userHN e.g. belforte

You can do things from lxplus.

  1. make a list of schedds
 SchedList=`condor_status -sched|grep crab3|cut -d '@' -f2 |cut -d . -f 1`
 echo $SchedList
which yields
vocms0106 vocms0107 vocms0119 vocms0120 vocms0121 vocms0122 vocms0137 vocms0144 vocms0155 vocms0194 vocms0195 vocms0196 vocms0197 vocms0198 vocms0199 vocms059
  2. set the target username
userHN="belforte"
  3. hold the DAGMANs
 for s in $SchedList; do ssh -p 2222 ${s}.cern.ch sudo condor_hold -con 'jobuniverse==7 && crab_userhn==${userHN}' ; done
 

Remove old tasks in one schedd

At times we had some things go wrong and tasks end up in a zombie state using up a running dagman slot well beyond the a priori defined lifetime of a task. Removal of such old tasks can be done one schedd at a time with a small adaptation of the previous recipe. The following example removes tasks older than 2 weeks.
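The example itself is not visible in this diff; a minimal sketch of the idea (QDate is the job submission time in epoch seconds, so "older than 2 weeks" means QDate below now minus 14 days; run on the schedd, holding first to stop resubmissions):
cutoff=`date -d '14 days ago' +%s`
condor_hold -con "jobuniverse==7 && QDate < $cutoff"
condor_rm -con "jobuniverse==7 && QDate < $cutoff"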

Revision 92 (2017-05-18) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 391 to 391
 
  • Copy the job arguments of the job you want to run from glidemon or dashboard (fourth line of the joblog).
  • Prepare a script (I call it run_job.sh) to run your job (replace the job arguments with the one you have just copied. Be careful!!! You need to enclose some parameters with quotes, e.g. inputFile and scriptArgs).
Changed:
<
<
rm -rf jobReport.json cmsRun-stdout.log edmProvDumpOutput.log jobReportExtract.pickle FrameworkJobReport.xml outfile.root PSet.pkl PSet.py scramOutput.log jobLog.* assa wmcore_initialized debug CMSSW_5_3_4 process.id run.sh.old
>
>
rm -rf jobReport.json cmsRun-stdout.log edmProvDumpOutput.log jobReportExtract.pickle FrameworkJobReport.xml outfile.root PSet.pkl PSet.py scramOutput.log jobLog.* assa wmcore_initialized debug CMSSW_* process.id run.sh.old
 tar xf sandbox.tar.gz
Changed:
<
<
export PYTHONPATH=~/repos/CRABServer/src/python:$PYTHONPATH
export _CONDOR_JOB_AD=.job.ad
>
>
#export PYTHONPATH=~/repos/CRABServer/src/python:$PYTHONPATH # use this if you want to provide your own scripts
export X509_USER_PROXY=/tmp/x509up_u`id -u`
export CRAB3_RUNTIME_DEBUG=TRUE
sh CMSRunAnalysis.sh -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=1 --cmsswVersion=CMSSW_5_3_4 --scramArch=slc5_amd64_gcc462 --inputFile='["/store/mc/HC/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0010/786D8FE8-BBAD-E111-884D-0025901D5DB2.root"]' --runAndLumis=job_lumis_1.json --lheInputFiles=False --firstEvent=None --firstLumi=None --lastEvent=None --firstRun=None --seeding=AutomaticSeeding --scriptExe=assa --scriptArgs='["assa=1", "ouch=2"]' -o '{}'

Revision 91 (2017-05-03) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 224 to 224
 
Changed:
<
<
>
>
  Deprecated...:

Revision 90 (2017-05-03) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 214 to 214
 

Monitoring links

Changed:
<
<
>
>
  • CRAB3 monitoring page in MONIT Grafana
  • ASO monitoring page in MONIT : Grafana
 
Line: 224 to 224
 
Added:
>
>
  Deprecated...:

Revision 89 (2017-04-26) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 379 to 379
 
scp -r /data/srv/glidecondor/condor_local/spool/9290/0/cluster379290.proc0.subproc0 mmascher@lxplus:/afs/cern.ch/work/m/mmascher
Added:
>
>
    • Note: on most real tasks the spool directory is full of useless log files; rather do:
tar cjf /tmp/spoolDir.tar --exclude="*job*" --exclude="postJob*" --exclude="*submit" --exclude="defer*" --exclude="task_statistics*"  .
scp /tmp/spoolDir.tar mmascher@lxplus:/afs/cern.ch/work/m/mmascher/<somedir>
# and on lxplus expand with
tar xf spoolDir.tar
 
  • Go to the directory you just copied to lxplus.
  • Copy the job arguments of the job you want to run from glidemon or dashboard (fourth line of the joblog).
  • Prepare a script (I call it run_job.sh) to run your job (replace the job arguments with the one you have just copied. Be careful!!! You need to enclose some parameters with quotes, e.g. inputFile and scriptArgs).
rm -rf jobReport.json cmsRun-stdout.log edmProvDumpOutput.log jobReportExtract.pickle FrameworkJobReport.xml outfile.root PSet.pkl PSet.py scramOutput.log jobLog.* assa wmcore_initialized debug CMSSW_5_3_4 process.id run.sh.old
Changed:
<
<
tar xvzf sandbox.tar.gz
>
>
tar xf sandbox.tar.gz
export PYTHONPATH=~/repos/CRABServer/src/python:$PYTHONPATH
export _CONDOR_JOB_AD=.job.ad
export X509_USER_PROXY=/tmp/x509up_u`id -u`
Changed:
<
<
export CRAB3_RUNTIME_DEBUG=TRUE; sh CMSRunAnalysis.sh -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=1 --cmsswVersion=CMSSW_5_3_4 --scramArch=slc5_amd64_gcc462 --inputFile='["/store/mc/HC/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0010/786D8FE8-BBAD-E111-884D-0025901D5DB2.root"]' --runAndLumis=job_lumis_1.json --lheInputFiles=False --firstEvent=None --firstLumi=None --lastEvent=None --firstRun=None --seeding=AutomaticSeeding --scriptExe=assa --scriptArgs='["assa=1", "ouch=2"]' -o '{}'
>
>
export CRAB3_RUNTIME_DEBUG=TRUE
sh CMSRunAnalysis.sh -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=1 --cmsswVersion=CMSSW_5_3_4 --scramArch=slc5_amd64_gcc462 --inputFile='["/store/mc/HC/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0010/786D8FE8-BBAD-E111-884D-0025901D5DB2.root"]' --runAndLumis=job_lumis_1.json --lheInputFiles=False --firstEvent=None --firstLumi=None --lastEvent=None --firstRun=None --seeding=AutomaticSeeding --scriptExe=assa --scriptArgs='["assa=1", "ouch=2"]' -o '{}'
 
  • Run the script:

Revision 88 (2017-04-20) - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 214 to 214
 

Monitoring links

Changed:
<
<
>
>
 
Changed:
<
<
>
>
  Deprecated...:
Line: 230 to 232
 
Added:
>
>

useful gwmsmon API's

  • users running jobs of a given duration (by number of jobs):
API for topusers is available. One thing I was thinking of is to
extend the table and also have the exit code. Let me know if it is needed.

API description
https://cms-gwmsmon.cern.ch/analysisview/json/historynew/topusers(hours)/(TimeFrom)/(TimeTo)

(hours) - number of hours from 0 to 999
(TimeFrom) - Minimum job runtime from. Default value: * (means any)
(TimeTo) - Job runtime up to. Default value: * (means any)

For example:
For the last week how many jobs per user were running less than 1h:
https://cms-gwmsmon.cern.ch/analysisview/json/historynew/topusers168/0/1

For the last week how many jobs per user were running more than 1h:
https://cms-gwmsmon.cern.ch/analysisview/json/historynew/topusers168/1/

For the last week how many jobs per user were running more than 24h:
https://cms-gwmsmon.cern.ch/analysisview/json/historynew/topusers168/24/
  • time distribution (percentiles) of used wall clock time by jobs in a task
https://cms-gwmsmon.cern.ch/analysisview/json/historynew/(call)(hours)/(workflow)/(subtask)
Same exists for prodview instead of analysisview. In analysis, workflow is a username, subtask is a taskname:

(call) - one of memoryusage, exitcodes, runtime, memorycpu, percentileruntime
(hours) - from 0 to 999
(workflow) - username (i.e. the CRAB_UserHN classAd)
(subtask) - taskname (i.e. the CRAB_ReqName classAd)

examples:

https://cms-gwmsmon.cern.ch/analysisview/json/historynew/percentileruntime720/smitra/170411_132805:smitra_crab_DYJets
https://cms-gwmsmon.cern.ch/analysisview/json/historynew/memoryusage720/smitra/170411_132805:smitra_crab_DYJets
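Since these are plain HTTPS GETs returning JSON they are easy to script; a minimal sketch built from the URLs above (jq is assumed to be available for pretty-printing):
# jobs per user that ran more than 24h in the last week
curl -s 'https://cms-gwmsmon.cern.ch/analysisview/json/historynew/topusers168/24/' | jq .
# wall clock time percentiles for one task (CRAB_UserHN / CRAB_ReqName)
curl -s 'https://cms-gwmsmon.cern.ch/analysisview/json/historynew/percentileruntime720/smitra/170411_132805:smitra_crab_DYJets' | jq .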
 

Crabcache (UserFileCache)

  • To get the list of all the users in the crabcache:

Revision 87 (2017-03-13) - TodorTrendafilovIvanov

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 552 to 552
 

Script for checking that each task on the schedd has a running task process

Changed:
<
<
#!/bin/sh
>
>
#!/bin/bash
 
Changed:
<
<
# grep for lines that begin with the cluster id, don't care about others
condor_q -const 'tasktype=="ROOT" && jobstatus=?=2' | grep '^[0-9]\+\.0' > tmpRunningDags.txt
condor_q -const 'jobuniverse==12 && jobstatus==2' | grep '^[0-9]\+\.0' > tmpRunningTPs.txt

# for each line in the running dagmans file, check that it has a corresponding task process
# in the running TPs file.
while read p; do
    # filter out the cluster id, looks like 12345.0
    clusterId=$(awk '{print $1}' <<< $p)
    # remove the last two characters, ".0" won't match what's in the tmpRunningTPs file
    clusterId=$(sed 's/.\{2\}$//' <<< $clusterId)
    # grep for this dag's cluster id in the running TPs file
    foundTask=$(grep $clusterId tmpRunningTPs.txt)
    if [ -z "$foundTask" ]; then
        echo "DAG $clusterId has no TP running on this schedd"
    else
        echo $foundTask
    fi
done < tmpRunningDags.txt

>
>
enable_flag=/etc/enable_task_daemon
[ -f $enable_flag ] || exit 0
 
Added:
>
>
log_file=/tmp/`basename $0`
log_file=${log_file%%.sh}.log
:>$log_file

RunningDags=/tmp/tmpRunningDags.txt
RunningTPs=/tmp/tmpRunningTPs.txt

condor_q -const 'tasktype=="ROOT" && jobstatus=?=2' -af: clusterid > $RunningDags
condor_q -const 'jobuniverse==12 && jobstatus==2' -af: clusterid args > $RunningTPs

for clusterid in `cat $RunningDags`
do
  /bin/grep $clusterid $RunningTPs >/dev/null 2>&1 || echo "DAG $clusterid has no TP running on schedd $HOSTNAME" |tee -a $log_file
done

 
Added:
>
>
 

Testing a new schedd

Requirements: On Schedd:

Revision 86 (2017-02-23) - EmilisAntanasRupeika

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
Line: 412 to 412
  This will run a single worker on one thread. Note that the tasks submitted to this worker might not be processed correctly (the task might have the wrong state assigned to it), however, that's usually not a problem.
Added:
>
>

To run the crabserver in debug mode

It is possible to run the crabserver within a single process. This allows the use of the "pdb" python debugger (similar to the sequential Task Worker) to execute code step by step and see what the crabserver is doing in real time.

First, this requires removing (commenting out) some code from the WMCore Main.py class which starts the server. Remove the following code in the start_daemon() method:

    def start_daemon(self):
#         """Start the deamon."""
#
#         # Redirect all output to the logging daemon.
#         devnull = file("/dev/null", "w")
#         if isinstance(self.logfile, list):
#             subproc = Popen(self.logfile, stdin=PIPE, stdout=devnull, stderr=devnull,
#                             bufsize=0, close_fds=True, shell=False)
#             logger = subproc.stdin
#         elif isinstance(self.logfile, str):
#             logger = open(self.logfile, "a+", 0)
#         else:
#             raise TypeError("'logfile' must be a string or array")
#         os.dup2(logger.fileno(), sys.stdout.fileno())
#         os.dup2(logger.fileno(), sys.stderr.fileno())
#         os.dup2(devnull.fileno(), sys.stdin.fileno())
#         logger.close()
#         devnull.close()
#
#         # First fork. Discard the parent.
#         pid = os.fork()
#         if pid > 0:
#             os._exit(0)

        # Establish as a daemon, set process group / session id.
        os.chdir(self.statedir)
#         os.setsid()
#
#         # Second fork. The child does the work, discard the second parent.
#         pid = os.fork()
#         if pid > 0:
#             os._exit(0)
#
#         # Save process group id to pid file, then run real worker.
#         file(self.pidfile, "w").write("%d\n" % os.getpgid(0))

        error = False
        try:
            self.run()
        except Exception as e:
            error = True
            trace = StringIO()
            traceback.print_exc(file=trace)
            cherrypy.log("ERROR: terminating due to error, error trace follows")
            for line in trace.getvalue().rstrip().split("\n"):
                cherrypy.log("ERROR:   %s" % line)

        # Remove pid file once we are done.
        try: os.remove(self.pidfile)
        except: pass

        # Exit
        sys.exit((error and 1) or 0)

Then, remove some more code from the run() method:

    def run(self):
        """Run the server daemon main loop."""
        # Fork.  The child always exits the loop and executes the code below
        # to run the server proper.  The parent monitors the child, and if
        # it exits abnormally, restarts it, otherwise exits completely with
        # the child's exit code.
        cherrypy.log("WATCHDOG: starting server daemon (pid %d)" % os.getpid())
#         while True:
#             serverpid = os.fork()
#             if not serverpid: break
#             signal(SIGINT, SIG_IGN)
#             signal(SIGTERM, SIG_IGN)
#             signal(SIGQUIT, SIG_IGN)
#             (xpid, exitrc) = os.waitpid(serverpid, 0)
#             (exitcode, exitsigno, exitcore) = (exitrc >> 8, exitrc & 127, exitrc & 128)
#             retval = (exitsigno and ("signal %d" % exitsigno)) or str(exitcode)
#             retmsg = retval + ((exitcore and " (core dumped)") or "")
#             restart = (exitsigno > 0 and exitsigno not in (2, 3, 15))
#             cherrypy.log("WATCHDOG: server exited with exit code %s%s"
#                          % (retmsg, (restart and "... restarting") or ""))
#
#             if not restart:
#                 sys.exit((exitsigno and 1) or exitcode)
#
#             for pidfile in glob("%s/*/pid" % self.statedir):
#                 if os.path.exists(pidfile):
#                     pid = int(open(pidfile).readline())
#                     os.remove(pidfile)
#                     cherrypy.log("WATCHDOG: killing slave server %d" % pid)
#                     try: os.kill(pid, 9)
#                     except: pass

        # Run. Override signal handlers after CherryPy has itself started and
        # installed its own handlers. To achieve this we need to start the
        # server in non-blocking mode, fiddle with, than ask server to block.
        self.validate_config()
        self.setup_server()
        self.install_application()
        cherrypy.log("INFO: starting server in %s" % self.statedir)
        cherrypy.engine.start()
        signal(SIGHUP, sig_reload)
        signal(SIGUSR1, sig_graceful)
        signal(SIGTERM, sig_terminate)
        signal(SIGQUIT, sig_terminate)
        signal(SIGINT, sig_terminate)
        cherrypy.engine.block()

Then, set up the environment and start the server:

# Source the init.sh script, location will be different depending on version
source /data/srv/HG1612i/sw.erupeika/slc6_amd64_gcc493/cms/crabserver/3.3.1702.rc3/etc/profile.d/init.sh

AUTHDIR=/data/srv/current/auth/crabserver/
PYTHONPATH=$AUTHDIR:$PYTHONPATH
export CONDOR_CONFIG=/data/srv/condor_config
export X509_USER_CERT=$AUTHDIR/dmwm-service-cert.pem
export X509_USER_KEY=$AUTHDIR/dmwm-service-key.pem

# Start the crabcache and frontend
/data/srv/current/config/crabcache/manage start 'I did read documentation'
/data/srv/current/config/frontend/manage start 'I did read documentation'

# Start the server
python /data/srv/HG1612i/sw.erupeika/slc6_amd64_gcc493/cms/crabserver/3.3.1702.rc3/bin/wmc-httpd -r -d /data/srv/state/crabserver/ /data/srv/current/config/crabserver/config.py
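
If you want to step through the code with pdb, a minimal sketch (assuming the same install paths as above; not verified on every release) is to start the single-process server under the debugger and set breakpoints from its prompt:

# run the single-process server under pdb (paths as in the example above)
python -m pdb /data/srv/HG1612i/sw.erupeika/slc6_amd64_gcc493/cms/crabserver/3.3.1702.rc3/bin/wmc-httpd -r -d /data/srv/state/crabserver/ /data/srv/current/config/crabserver/config.py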
 

Script to cd to your last task's spool dir

A simple script for getting into the spool directory of the newest task. Change the cms pool username to your own.
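
A minimal sketch of such a script, built from the condor_q tricks elsewhere on this page (cms1234 is a placeholder for your cms pool username; assumes you run it on the schedd):

# cd into the spool dir of your newest ROOT dag (highest cluster id is listed last)
cd "$(condor_q cms1234 -const 'TaskType=="ROOT"' -af Iwd | tail -n 1)"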

Revision 852017-02-21 - EmilisAntanasRupeika

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
<!-- /ActionTrackerPlugin -->
Line: 420 to 420
 cd "$(dirname "$(readlink "jobs_log.txt")")"
Added:
>
>

Script for checking that each task on the schedd has a running task process

#!/bin/sh

# grep for lines that begin with the cluster id, don't care about others
condor_q -const 'tasktype=="ROOT" && jobstatus=?=2' | grep '^[0-9]\+\.0' > tmpRunningDags.txt
condor_q -const 'jobuniverse==12 && jobstatus==2' | grep '^[0-9]\+\.0' > tmpRunningTPs.txt

# for each line in the running dagmans file, check that it has a corresponding task process
# in the running TPs file.
while read p; do
    # filter out the cluster id, looks like 12345.0
    clusterId=$(awk '{print $1}' <<< $p)
    # remove the last two characters, ".0" won't match what's in the tmpRunningTPs file
    clusterId=$(sed 's/.\{2\}$//' <<< $clusterId)
    # grep for this dag's cluster id in the running TPs file
    foundTask=$(grep $clusterId tmpRunningTPs.txt)
    if [ -z "$foundTask" ]; then
        echo "DAG $clusterId has no TP running on this schedd"
    else
        echo $foundTask
    fi
done < tmpRunningDags.txt

 

Testing a new schedd

Requirements: On Schedd:

Revision 842017-02-15 - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
<!-- /ActionTrackerPlugin -->
Line: 268 to 268
  "mmascher" ]}
Changed:
<
<
  • To get the quota each user has (power users has 10* this value):
>
>
  • To get the quota each user has in MegaBytes (power users have 10x this value):
 
curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=basicquota' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
{"result": [

Revision 832017-02-10 - MarcoMascheroni

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
<!-- /ActionTrackerPlugin -->
Line: 161 to 161
  For example, the user under which the HammerCloud jobs are submitted is given a priority factor of 1, so that HammerCloud jobs almost always run before anything else in the Global Pool. This change was made on December 5, 2014, and by the next day over 97% of the jobs started within 2 hours. Before the change, the percentage was only ~80%.
Added:
>
>
N.B.: If you get an error like condor_userprio: Can't locate negotiator in local pool when doing condor_userprio, you can specify the pool using -pool, for example condor_userprio -pool cmssrv221.fnal.gov. Moreover, with recent changes to the global pool negotiator settings it is no longer guaranteed that this command will run (see https://cms-logbook.cern.ch/elog/GlideInWMS/5018).
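
To set such a priority factor yourself, a sketch (assuming you have administrator rights on the negotiator; user, domain and pool names are placeholders):

# give a user a priority factor of 1 (names are placeholders)
condor_userprio -pool <collector-host> -setfactor <user>@<domain> 1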
 

on REST interface

Take the X-Error-Id from the crab.log, go to vocms022, where the REST logs are copied in /build/srv-logs, and grep for the X-Error-Id, for example:
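
A hypothetical example (the error id is a placeholder; the exact directory layout under /build/srv-logs may differ):

grep -r '<X-Error-Id>' /build/srv-logs/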

Revision 822017-01-09 - TodorTrendafilovIvanov

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
<!-- /ActionTrackerPlugin -->
Line: 97 to 97
  Job Status explanation:
 
Changed:
<
<
0  Unexpanded      U
1  Idle            I
2  Running         R
3  Removed         X
4  Completed       C
5  Held            H
6  Submission_err  E
>
>
0  Unexpanded      U
1  Idle            I
2  Running         R
3  Removed         X
4  Completed       C
5  Held            H
6  Submission_err  E
 
  • Users sometimes find that their jobs do not run. There are several reasons why a specific job does not run. These reasons range from failed job or machine constraints, bias due to preferences, insufficient priority, or the preemption ``throttle'' that is implemented by the condor_negotiator to prevent thrashing. Many of these reasons can be diagnosed by using the -analyze option of condor_q.
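
For example (the job id is illustrative, taken from the condor_q examples further down this page):

condor_q 184439.0 -better-analyze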

Revision 812016-11-29 - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
<!-- /ActionTrackerPlugin -->
Line: 513 to 513
 

 Then if you want to be nice you can also update the warning message in Oracle so that the user will see the message with crab status (do this from lxplus). Notice that the warning message is automatically changed when using crab kill.

Added:
>
>
Beware: this operation requires knowing the password for the production Oracle database.
 
mkdir kill_user
Line: 583 to 584
 for task in `cat ~/tasks.txt` ; do condor_q $task -af CRAB_ReqName >> /afs/cern.ch/user/b/belforte/WORK/CRAB3/TC3/dbg/KILLED/tasks.txt; done
Changed:
<
<
so I got the final list of all tasks I killed in = /afs/cern.ch/user/b/belforte/WORK/CRAB3/TC3/dbg/KILLED/tasks.txt= and will use that in the
>
>
so I got the final list of all tasks I killed in =/afs/cern.ch/user/b/belforte/WORK/CRAB3/TC3/dbg/KILLED/tasks.txt= and will use that in the Oracle changing script:
 
for task in `cat ~/tasks.txt`; do
    sh change_warning.sh $task

Revision 802016-11-29 - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
<!-- /ActionTrackerPlugin -->
Line: 483 to 483
  There are two ways of doing this. One is to use the "crab kill" command. This is the recommended way since things are going to get killed properly (messages sent to dashboard, transfers killed, etc.):
Added:
>
>

via crab kill from e.g. an lxplus machine (preferred way)

 
mkdir kill_user
cd kill_user
Line: 497 to 499
 done
Changed:
<
<
>
>

directly on the schedd via condor commands (if you have reasons not to use the preferred way, e.g. you need to do it really quickly)

 Alternatively one can directly go to the schedd and kill the tasks using condor (to my knowledge you need to do it schedd per schedd):
Line: 538 to 540
for task in `cat ~/tasks.txt`; do
    sh change_warning.sh $task
done
Added:
>
>

Remove old tasks in one schedd

At times things go wrong and tasks end up in a zombie state, using up a running dagman slot well beyond the a priori defined lifetime of a task. Removal of such old tasks can be done one schedd at a time with a small adaptation of the previous recipe. The following example removes tasks older than 2 weeks.

#from the schedd
sudo su
condor_q -const 'JobUniverse==7 && JobStatus==2 && (time() - JobStartDate > 14 * 86400)' -af ClusterId  > ~/tasks.txt
cat ~/tasks.txt #to check you grabbed the tasks

for task in `cat ~/tasks.txt` ; do
  condor_hold $task -reason "Task removed by CRAB operator because too old" 
done

That will store a HoldReason as a ClassAd, e.g.

root@submit-4 ~# condor_q 22111709 -af HoldReason
Task removed by CRAB operator because too old (by user condor)
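
Should you hold a task by mistake, the standard condor_release undoes the hold (a sketch; cluster id as in the example above):

condor_release 22111709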

You can also get the list of CRAB task names with e.g.

for task in `cat ~/tasks.txt` ; do
  condor_q $task -af CRAB_ReqName
done
and use the above list of CRAB task names to insert the HoldReason in Oracle so that the user will see the message with crab status (do this from lxplus using the example in the previous section). For the sake of example, here is what I did to remove very old tasks (>20 days) on all schedd's:
sudo su
condor_q -const 'JobUniverse==7 && JobStatus==2 && (time() - JobStartDate > 20 * 86400)' -af ClusterId > ~/tasks.txt
cat ~/tasks.txt

for task in `cat ~/tasks.txt` ; do
  condor_hold $task -reason "Task removed by CRAB operator because too old" 
done

for task in `cat ~/tasks.txt` ; do   condor_q $task -af CRAB_ReqName >> /afs/cern.ch/user/b/belforte/WORK/CRAB3/TC3/dbg/KILLED/tasks.txt; done

so I got the final list of all tasks I killed in =/afs/cern.ch/user/b/belforte/WORK/CRAB3/TC3/dbg/KILLED/tasks.txt= and will use that in the Oracle changing script:

for task in `cat ~/tasks.txt`; do
    sh change_warning.sh $task
done
 

META TOPICMOVED by="atanasi" date="1412700166" from="CMS.Crab3OperatorDebugging" to="CMSPublic.Crab3OperatorDebugging"

Revision 792016-11-25 - MarcoMascheroni

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
<!-- /ActionTrackerPlugin -->
Line: 478 to 478
 
Added:
>
>

Kill all tasks belonging to a user

There are two ways of doing this. One is to use the "crab kill" command. This is the recommended way since things are going to get killed properly (messages sent to dashboard, transfers killed, etc.):

mkdir kill_user
cd kill_user
#get the list of tasks
TASKS=$(condor_q -pool cmsgwms-collector-global.cern.ch -all -const 'TaskType=="ROOT" && JobStatus==1 && CRAB_UserHN=="<username>"' -af CRAB_ReqName)
echo $TASKS #to check you grabbed the tasks
source /cvmfs/cms.cern.ch/crab3/crab_standalone.sh
for task in $TASKS; do
    crab remake --task=$task
    crab kill --killwarning="Your tasks have been killed by a CRAB3 operator, see your inbox" crab_*
    rm -rf crab_*
done

Alternatively one can directly go to the schedd and kill the tasks using condor (to my knowledge you need to do it schedd per schedd):

#from the schedd
condor_q -all -const 'TaskType=="ROOT" && JobStatus==1 && CRAB_UserHN=="gurpinar"' -af CRAB_ReqName > ~/tasks.txt
cat ~/tasks.txt #to check you grabbed the tasks

for task in `cat ~/tasks.txt`; do
    sudo -u cms683 condor_hold  -const CRAB_ReqName==\"$task\"
done
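
To verify the holds took effect, a quick check (same user as above; JobStatus 5 means Held, see the status table earlier on this page):

condor_q -all -const 'TaskType=="ROOT" && CRAB_UserHN=="gurpinar"' -af CRAB_ReqName JobStatus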

Then if you want to be nice you can also update the warning message in Oracle so that the user will see the message with crab status (do this from lxplus). Notice that the warning message is automatically changed when using crab kill.

mkdir kill_user
cd kill_user
#create the change warning file and then change XXXXX to the right value
cat <<EOF >change_warning.sh
source /afs/cern.ch/project/oracle/script/setoraenv.sh -s prod

python << END
import cx_Oracle as db
import sys

taskname = "\$1"
print taskname

conn = db.connect('cms_analysis_reqmgr/XXXXX@cmsr')
cursor = conn.cursor()
res = cursor.execute("update tasks set tm_task_warnings = '[\"Your tasks have been killed by a CRAB3 operator, see your inbox\"]' where tm_taskname='%s'" % taskname)
cursor.close()
conn.commit()
conn.close()
END
EOF

for task in `cat ~/tasks.txt`; do
    sh change_warning.sh $task
done
 
META TOPICMOVED by="atanasi" date="1412700166" from="CMS.Crab3OperatorDebugging" to="CMSPublic.Crab3OperatorDebugging"

Revision 782016-11-17 - EmilisAntanasRupeika

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
<!-- /ActionTrackerPlugin -->
Line: 411 to 411
 cd "$(dirname "$(readlink "jobs_log.txt")")"
Changed:
<
<

Scripts to test a new schedd

>
>

Testing a new schedd

  Requirements: On Schedd:
  • Transfer users proxy to the new schedd.
Line: 462 to 462
  condor_submit -debug -pool glidein-collector-2.t2.ucsd.edu -remote crab3test-2@vocms95.cern.ch sleep.jdl
Added:
>
>
To submit a task through CRAB to a specific schedd and / or pool, modify your crab config as follows:
# add this line with a full schedd name to set a specific schedd for a task (0155 in this example)
config.Debug.scheddName = 'crab3@vocms0155.cern.ch'
# a collector can be set as well, necessary when the schedd to be tested is not in the global pool (0115 in this example corresponds to the ITB collector):
config.Debug.collector = 'vocms0115.cern.ch'
 

URL's to access CRAB Task database

  • get general info about a task:

Revision 772016-10-23 - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
<!-- /ActionTrackerPlugin -->
Line: 207 to 207
 

Monitoring links

Deleted:
<
<
 
Deleted:
<
<
 

Revision 762016-10-14 - EmilisAntanasRupeika

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
<!-- /ActionTrackerPlugin -->
Line: 356 to 356
 
  • Go to the directory you just copied to lxplus.
  • Copy the job arguments of the job you want to run from glidemon or dashboard (fourth line of the joblog). I will assume that we want to run job number 1 of the task. Note that in order to avoid running the full job and wasting time, one can change --firstEvent and --lastEvent arguments to be equal to 1, for example. This means that only one event will be analysed and the job will finish quickly.
Changed:
<
<
  • Prepare a script (I call it run_job_and_cmscp.sh) to run your job (replace the job arguments with the one you have just copied. Be careful!!! You need to enclose some parameters with quotes, e.g. inputFile and scriptArgs).
>
>
  • Prepare a script (I call it run_job_and_cmscp.sh) to run your job
    • Replace the job arguments with the ones you have just copied. Be careful!!! You need to enclose some parameters with quotes (e.g. inputFile and scriptArgs).
    • Set X509_USER_PROXY in the script to the name of your proxy file. This file is located in the spool dir you downloaded and has a random name (like a41b46b7b59e9858006416f98d1540f9ddf4d646); it shouldn't be hard to find it with ls.
  • Notice that in the script we have assumed that we will use Job.1.submit as the job ad. For this to work, in the Job.1.submit file, for each line that starts with a + sign, remove the +. Also, change CRAB_Id = $(count) to CRAB_Id = 1 (see the sed sketch after this list). It may also be that some variables are missing, e.g. CRAB_localOutputFiles and CRAB_Destination. In that case, copy them from the job log file and add them to the Job.1.submit file.
  • Create an empty startup_environment.sh file in the copied directory.
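
A sketch of those Job.1.submit edits with sed (work on a copy if unsure; the patterns assume GNU sed):

# strip the leading + from classad lines and pin the job id to 1
sed -i 's/^+//' Job.1.submit
sed -i 's/CRAB_Id = $(count)/CRAB_Id = 1/' Job.1.submit

The run_job_and_cmscp.sh script itself follows.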
 
rm -rf jobReport.json cmsRun-stdout.log edmProvDumpOutput.log jobReportExtract.pickle FrameworkJobReport.xml outfile.root PSet.pkl PSet.py scramOutput.log jobLog.* wmcore_initialized debug CMSSW_7_0_5 process.id
tar xvzf sandbox.tar.gz
export PYTHONPATH=/afs/cern.ch/work/a/atanasi/private/github/repos/CRABServer/src/python:$PYTHONPATH
export _CONDOR_JOB_AD=.job.ad
Changed:
<
<
export X509_USER_PROXY=/tmp/x509up_u`id -u`
>
>
export JOBSTARTDIR=$PWD
export X509_USER_PROXY=<proxy_file_name>
export CRAB3_RUNTIME_DEBUG=TRUE
echo "======== CMSRunAnalysis.sh at $(TZ=GMT date) STARTING ========"
Changed:
<
<
sh CMSRunAnalysis.sh -a sandbox.tar.gz --sourceURL=https://cmsweb-testbed.cern.ch/crabcache --jobNumber=1 --cmsswVersion=CMSSW_7_0_5 --scramArch=slc6_amd64_gcc481 --inputFile='["/store/mc/HC/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0010/786D8FE8-BBAD-E111-884D-0025901D5DB2.root"]' --runAndLumis=job_lumis_1.json --lheInputFiles=False --firstEvent=None --firstLumi=None --lastEvent=None --firstRun=None --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=None --scriptArgs=[] -o {}
>
>
sh CMSRunAnalysis.sh -a sandbox.tar.gz --sourceURL=https://erupeikavm.cern.ch/crabcache --jobNumber=1 --cmsswVersion=CMSSW_7_5_8_patch3 --scramArch=slc6_amd64_gcc491 --inputFile='job_input_file_list_1.txt' --runAndLumis='job_lumis_1.json' --lheInputFiles=False --firstEvent=1 --firstLumi=None --lastEvent=2 --firstRun=None --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=None --scriptArgs=[] -o {}
EXIT_STATUS=$?
echo "CMSRunAnalysis.sh complete at $(TZ=GMT date) with exit status $EXIT_STATUS"
echo "======== CMSRunAnalysis.sh at $(TZ=GMT date) FINISHING ========"
Line: 373 to 379
export VO_CMS_SW_DIR=/cvmfs/cms.cern.ch
. $VO_CMS_SW_DIR/cmsset_default.sh
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$VO_CMS_SW_DIR/COMP/slc5_amd64_gcc434/external/openssl/0.9.7m/lib:$VO_CMS_SW_DIR/COMP/slc5_amd64_gcc434/external/bz2lib/1.0.5/lib
Changed:
<
<
command -v python2.6 > /dev/null
>
>
command -v python > /dev/null
echo "======== Stageout at $(TZ=GMT date) STARTING ========"
rm -f wmcore_initialized
export _CONDOR_JOB_AD=Job.1.submit
export TEST_CMSCP_NO_STATUS_UPDATE=True
Changed:
<
<
PYTHONUNBUFFERED=1 ./cmscp.py "JOB_WRAPPER_EXIT_CODE=${EXIT_STATUS}"
>
>
PYTHONUNBUFFERED=1 python cmscp.py
STAGEOUT_EXIT_STATUS=$?
echo "======== Stageout at $(TZ=GMT date) FINISHING (status $STAGEOUT_EXIT_STATUS) ========"
Added:
>
>
 
Changed:
<
<
  • Notice that in the script we have assumed that we will use Job.1.submit as the job ad. For this to work, in the Job.1.submit file, for each line that starts with a + sign, remove the +. Also, change CRAB_Id = $(count) by CRAB_Id = 1. It may also be that some variables are missing, e.g. CRAB_localOutputFiles and CRAB_Destination. In that case, copy them from the job log file and add them to the Job.1.submit file.
>
>
 
  • Run the script:
sh run_job_and_cmscp.sh

Revision 752016-09-16 - StefanoBelforte

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
<!-- /ActionTrackerPlugin -->
Line: 446 to 446
 
  • Create same sleep.sh file as on Schedd.
  • Execute following commands :
Deleted:
<
<
#IF on SLC5
source /afs/cern.ch/cms/LCG/LCG-2/UI/cms_ui_env.sh
#sourcing TW env will move you to current folder
source /data/srv/TaskManager/env.sh
voms-proxy-init -voms cms
Changed:
<
<
userDN=$(voms-proxy-info | grep 'issuer\ *:' | sed 's/issuer *: //')
proxypath=$(voms-proxy-info | grep 'path\ *:' | sed 's/path *: //')
>
>
userDN=$(voms-proxy-info -identity)
proxypath=$(voms-proxy-info -path)
sed --in-place "s|x509userproxysubject = .*|x509userproxysubject = \"$userDN\"|" sleep.jdl
sed --in-place "s|x509userproxy = .*|x509userproxy = $proxypath|" sleep.jdl
# Fill the new schedd information
Line: 460 to 458
  condor_submit -debug -pool glidein-collector-2.t2.ucsd.edu -remote crab3test-2@vocms95.cern.ch sleep.jdl
Added:
>
>

URL's to access CRAB Task database

 
META TOPICMOVED by="atanasi" date="1412700166" from="CMS.Crab3OperatorDebugging" to="CMSPublic.Crab3OperatorDebugging"

Revision 742016-08-09 - JadirSilva

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
<!-- /ActionTrackerPlugin -->
Line: 205 to 205
 

Monitoring links

Changed:
<
<
>
>
 
Deleted:
<
<
 
Deleted:
<
<
 
Line: 221 to 219
  Deprecated...:
Added:
>
>
 

Revision 732016-05-30 - MarcoMascheroni

Line: 1 to 1
 
META TOPICPARENT name="SWGuideCrab"
<!-- /ActionTrackerPlugin -->
Line: 302 to 302
CRAB_UserHN = "mmascher"
CRAB_InputData = "/DoubleMuParked/Run2012B-22Jan2013-v1/AOD"
DESIRED_CMSDataset = "/DoubleMuParked/Run2012B-22Jan2013-v1/AOD"
Added:
>
>
CMSGroups = "/cms"
#The following can be enabled to test slow release of jobs for HC
#CRAB_JobReleaseTimeout = 60
#CRAB_TaskSubmitTime = 1464638738
 

To run a post-job on the schedd

Revision 722016-05-20 - StefanoBelforte

Line: 1 to 1
Changed:
<
<
META TOPICPARENT name="https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrab"
>
>
META TOPICPARENT name="SWGuideCrab"
 
<!-- /ActionTrackerPlugin -->
Changed:
<
<
CRAB Logo
>
>
CRAB Logo
 

Operator Task Management

Line: 24 to 25
 

Finding what you need

  • Get the full name of the task from the user; this will tell you which schedd to look at.
Changed:
<
<
  • Refer to this page to find the vocmsXXX server and log file locations.
>
>
  • Refer to this page to find the vocmsXXX server and log file locations.
 

On the schedd

  • To deny a user with local unix user name cms123, add the following line to .../config.d/99_banned_users.conf (create if it does not exist):
Changed:
<
<
DENY_WRITE = cms123@*, $(DENY_WRITE)
>
>
DENY_WRITE = cms123@*, $(DENY_WRITE)
 Reload the schedd by sending it a SIGHUP.
  • To list all jobs in a task named 140221_015428_vocms20:bbockelm_crab_lxplus414_user_17 run:
Changed:
<
<
condor_q -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"'
>
>
condor_q -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"'
 
  • To kill a task named 140221_015428_vocms20:bbockelm_crab_lxplus414_user_17 run:
Changed:
<
<
condor_hold -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17" && TaskType=?="ROOT"'
>
>
condor_hold -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17" && TaskType=?="ROOT"'
 
  • To find the HTCondor working (spool) directory (different from the web monitor directory; contains things like the raw logfile, postjob, proxy, input files, etc) run:
Changed:
<
<
condor_q -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"' -af Iwd
>
>
condor_q -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"' -af Iwd
 
  • To find job one in the task, run:
Changed:
<
<
condor_q -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"&&CRAB_Id=?=1'
>
>
condor_q -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"&&CRAB_Id=?=1'
 You can replace condor_q with condor_history for completed jobs. If you want just the history for retry 2,
Changed:
<
<
condor_history -match 1 -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"&&CRAB_Id=?=1&&CRAB_Retry=?=2'
>
>
condor_history -match 1 -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"&&CRAB_Id=?=1&&CRAB_Retry=?=2'
 
  • To list all tasks for user
Changed:
<
<
condor_q cms293
>
>
condor_q cms293
 
  • To list a specific task configuration:
Changed:
<
<
condor_q -l 179330.0
>
>
condor_q -l 179330.0
 or if the task does not exist anymore, then it should be taken from the condor history:
Changed:
<
<
condor_history -l 179330.0
>
>
condor_history -l 179330.0
 
  • To see the list of users that are allowed to write on the schedd:
Changed:
<
<
condor_config_val SCHEDD.ALLOW_WRITE
>
>
condor_config_val SCHEDD.ALLOW_WRITE
 
  • To see the location of Master log or Schedd log:
Changed:
<
<
condor_config_val MASTER_LOG OR condor_config_val SCHEDD_LOG
>
>
condor_config_val MASTER_LOG OR condor_config_val SCHEDD_LOG
 
  • To see all the schedds known in the glidein-collector.t2.ucsd.ed collector:
Changed:
<
<
condor_status -pool glidein-collector.t2.ucsd.edu -schedd
>
>
condor_status -pool glidein-collector.t2.ucsd.edu -schedd
 
  • To see all running/idle by username:
Changed:
<
<
condor_q -name crab3test-4@vocms0109.cern.ch -name crab3test-2@vocms95.cern.ch -name crab3test-3@vocms96.cern.ch -format '%s\n' AccountingGroup | sort | uniq -c
>
>
condor_q -name crab3test-4@vocms0109.cern.ch -name crab3test-2@vocms95.cern.ch -name crab3test-3@vocms96.cern.ch -format '%s\n' AccountingGroup | sort | uniq -c
 
  • To see all running by username:
Changed:
<
<
condor_q -name crab3test-4@vocms0109.cern.ch -name crab3test-2@vocms95.cern.ch -name crab3test-3@vocms96.cern.ch -format '%s\n' AccountingGroup -const '(JobStatus=?=2)' | sort | uniq -c
>
>
condor_q -name crab3test-4@vocms0109.cern.ch -name crab3test-2@vocms95.cern.ch -name crab3test-3@vocms96.cern.ch -format '%s\n' AccountingGroup -const '(JobStatus=?=2)' | sort | uniq -c
 
  • To see all running in the production schedds:
Changed:
<
<
condor_q -name crab3test-4@vocms0109.cern.ch -name crab3test-2@vocms95.cern.ch -name crab3test-3@vocms96.cern.ch  -const '(JobStatus=?=2)' | grep gWMS-CMSRun | wc -l #to count running jobs
>
>
condor_q -name crab3test-4@vocms0109.cern.ch -name crab3test-2@vocms95.cern.ch -name crab3test-3@vocms96.cern.ch  -const '(JobStatus=?=2)' | grep gWMS-CMSRun | wc -l #to count running jobs
 Job Status explanation:
Changed:
<
<
 
0  Unexpanded      U
1  Idle            I
2  Running         R
3  Removed         X
4  Completed       C
5  Held            H
6  Submission_err  E
>
>
 
0  Unexpanded      U
1  Idle            I
2  Running         R
3  Removed         X
4  Completed       C
5  Held            H
6  Submission_err  E
 
  • Users sometimes find that their jobs do not run. There are several reasons why a specific job does not run. These reasons range from failed job or machine constraints, bias due to preferences, insufficient priority, or the preemption ``throttle'' that is implemented by the condor_negotiator to prevent thrashing. Many of these reasons can be diagnosed by using the -analyze option of condor_q.
Changed:
<
<
condor_q 184439.0 -analyze
>
>
condor_q 184439.0 -analyze
 or if it is not enough information, do
Changed:
<
<
condor_q 184439.0 -better-analyze
>
>
condor_q 184439.0 -better-analyze
 
  • To see the list of running pilots grouped by site:
Changed:
<
<
condor_status -format '%s\n' GLIDEIN_CMSSite  | sort | uniq -c
>
>
condor_status -format '%s\n' GLIDEIN_CMSSite  | sort | uniq -c
 

Monitoring links

Changed:
<
<
>
>
 
Line: 112 to 123
 

Crabcache (UserFileCache)

  • To get the list of all the users in the crabcache:
Changed:
<
<
>
>
 curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=listusers' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k {"result": [ "mmascher" ,"aldo"
Changed:
<
<
]}
>
>
]}
 
  • To get all the files uploaded by a user to the crabcache and the amount of quota he's using:
Changed:
<
<
>
>
 curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=userinfo&username=mmascher' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k {"result": [ {"file_list": ["/data/state/crabcache/files/m/mmascher/69/697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a", "/data/state/crabcache/files/m/mmascher/14/14571bc71cf2077961408fb1a060b9497a9c7d0cc1dcb47ed0f7fc1ac2e3748d", "/data/state/crabcache/files/m/mmascher/ef/efdedb430f8462c72259fd2e427e684d8e3aedf8d0d811bf7ef8a97f30a47bac", ..., "/data/state/crabcache/files/m/mmascher/05/059b06681025f14ecce5cbdc49c83e1e945a30838981c37b5293781090b07bd7", "/data/state/crabcache/files/m/mmascher/up/uplog.txt"], "used_space": [130170972]}
Changed:
<
<
]}
>
>
]}
 
  • To get more information about one specific file:
Changed:
<
<
>
>
 curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=fileinfo&hashkey=697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k {"result": [ {"hashkey": "697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a", "created": 1400161415.0, "modified": 1400161415.0, "exists": true, "size": 1303975}
Changed:
<
<
]}
>
>
]}
 
  • To remove a specific file (currently you can only remove your files. In the future power users should be able to remove everything):
Changed:
<
<
>
>
 curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=fileremove&hashkey=697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k {"result": [ ""
Changed:
<
<
]}
>
>
]}
 
  • To get the list of power users:
Changed:
<
<
>
>
 curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=powerusers' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k {"result": [ "mmascher"
Changed:
<
<
]}
>
>
]}
 
  • To get the quota each user has (power users has 10* this value):
Changed:
<
<
>
>
 curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=basicquota' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k {"result": [ {"quota_user_limit": 10}
Changed:
<
<
]}
>
>
]}
 

Development Oriented Tricks I did not know where to put

Line: 158 to 175
 
  • Copy the relevant arguments that are passed to the post-job corresponding to the above cluster id job as written in RunJobs.dag (the relevant arguments are those after $MAX_RETRIES).
  • Identify the file in the spool directory that contains your proxy (the file name is a sequence of 40 lower-case letters and/or digits) and make sure the proxy is still valid (voms-proxy-info -file <proxy_file_name>).
  • Run the post-job. The generic instruction is:
Changed:
<
<
sh -c 'export TEST_DONT_REDIRECT_STDOUT=True; export X509_USER_PROXY=<proxy_file_name>; sh dag_bootstrap.sh POSTJOB <job_cluster_id> <job_return_code> <retry_count> <max_retries> <arguments>'
>
>
sh -c 'export TEST_DONT_REDIRECT_STDOUT=True; export X509_USER_PROXY=<proxy_file_name>; sh dag_bootstrap.sh POSTJOB <job_cluster_id> <job_return_code> <retry_count> <max_retries> <arguments>'
 where you have to substitute the arguments with what you found in your RunJobs.dag, the proxy file name and the first four numbers after POSTJOB. One could also add export TEST_POSTJOB_DISABLE_RETRIES=True to disable the job retry handling part, or export TEST_POSTJOB_NO_STATUS_UPDATE=True to avoid sending status report to dashboard and file metadata upload. A specific example is:
Changed:
<
<
sh -c 'export TEST_DONT_REDIRECT_STDOUT=True; export X509_USER_PROXY=eebe07e240237dde878ab66bb56e7794e8d5b39e; sh dag_bootstrap.sh POSTJOB 29649 0 0 10 141118_124153_crab3test-5:atanasi_crab_test_new_cmscp_postjob_9_cern 1 /store/temp/user/atanasi.49f14a3a459fb095e75e64b6cc9e824bb642b6e1/GenericTTbar/test_new_cmscp_postjob_9_cern/141118_124153/0000 /store/user/atanasi/GenericTTbar/test_new_cmscp_postjob_9_cern/141118_124153/0000 cmsRun_1.log.tar.gz tfileservice_1.root pooloutputmodule_1.root'
>
>
sh -c 'export TEST_DONT_REDIRECT_STDOUT=True; export X509_USER_PROXY=eebe07e240237dde878ab66bb56e7794e8d5b39e; sh dag_bootstrap.sh POSTJOB 29649 0 0 10 141118_124153_crab3test-5:atanasi_crab_test_new_cmscp_postjob_9_cern 1 /store/temp/user/atanasi.49f14a3a459fb095e75e64b6cc9e824bb642b6e1/GenericTTbar/test_new_cmscp_postjob_9_cern/141118_124153/0000 /store/user/atanasi/GenericTTbar/test_new_cmscp_postjob_9_cern/141118_124153/0000 cmsRun_1.log.tar.gz tfileservice_1.root pooloutputmodule_1.root'
 
  • If the task is not yours, identify the user's cms code and add sudo -u cms<code> at the beginning of the instruction given above.

To run the job wrapper from lxplus

  • Copy the spool directory of the task you want to run to lxplus, e.g.:
Changed:
<
<
scp -r /data/srv/glidecondor/condor_local/spool/9290/0/cluster379290.proc0.subproc0 mmascher@lxplus:/afs/cern.ch/work/m/mmascher
>
>
scp -r /data/srv/glidecondor/condor_local/spool/9290/0/cluster379290.proc0.subproc0 mmascher@lxplus:/afs/cern.ch/work/m/mmascher
 
  • Go to the directory you just copied to lxplus.
  • Copy the job arguments of the job you want to run from glidemon or dashboard (fourth line of the joblog).
  • Prepare a script (I call it run_job.sh) to run your job (replace the job arguments with the one you have just copied. Be careful!!! You need to enclose some parameters with quotes, e.g. inputFile, scriptArgs, runAndLumis).
Changed:
<
<
>
>
rm -rf jobReport.json cmsRun-stdout.log edmProvDumpOutput.log jobReportExtract.pickle FrameworkJobReport.xml outfile.root PSet.pkl PSet.py scramOutput.log jobLog.* assa wmcore_initialized debug CMSSW_5_3_4 process.id run.sh.old
tar xvzf sandbox.tar.gz
export PYTHONPATH=~/repos/CRABServer/src/python:$PYTHONPATH
export _CONDOR_JOB_AD=.job.ad
export X509_USER_PROXY=/tmp/x509up_u8440
Changed:
<
<
export CRAB3_RUNTIME_DEBUG=TRUE; sh CMSRunAnalysis.sh -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=1 --cmsswVersion=CMSSW_5_3_4 --scramArch=slc5_amd64_gcc462 --inputFile='["/store/mc/HC/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0010/786D8FE8-BBAD-E111-884D-0025901D5DB2.root"]' --runAndLumis='{"1": 666666, 666666}' --lheInputFiles=None --firstEvent=None --firstLumi=None --lastEvent=None --firstRun=None --seeding=AutomaticSeeding --scriptExe=assa --scriptArgs='["assa=1", "ouch=2"]' -o '{}'
>
>
export CRAB3_RUNTIME_DEBUG=TRUE; sh CMSRunAnalysis.sh -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=1 --cmsswVersion=CMSSW_5_3_4 --scramArch=slc5_amd64_gcc462 --inputFile='["/store/mc/HC/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0010/786D8FE8-BBAD-E111-884D-0025901D5DB2.root"]' --runAndLumis='{"1": 666666, 666666}' --lheInputFiles=None --firstEvent=None --firstLumi=None --lastEvent=None --firstRun=None --seeding=AutomaticSeeding --scriptExe=assa --scriptArgs='["assa=1", "ouch=2"]' -o '{}'
 
  • Run the script:
Changed:
<
<
sh run_job.sh
>
>
sh run_job.sh
 

To run the job wrapper and the stageout wrapper from lxplus

  • Copy the spool directory of the task you want to run to lxplus, e.g.:
Changed:
<
<
scp -r /data/srv/glidecondor/condor_local/spool/2789/0/cluster72789.proc0.subproc0 atanasi@lxplus.cern.ch:/afs/cern.ch/work/a/atanasi/private/
>
>
scp -r /data/srv/glidecondor/condor_local/spool/2789/0/cluster72789.proc0.subproc0 atanasi@lxplus.cern.ch:/afs/cern.ch/work/a/atanasi/private/
 
  • Go to the directory you just copied to lxplus.
  • Copy the job arguments of the job you want to run from glidemon or dashboard (fourth line of the joblog). I will assume that we want to run job number 1 of the task.
  • Prepare a script (I call it run_job_and_cmscp.sh) to run your job (replace the job arguments with the one you have just copied. Be careful!!! You need to enclose some parameters with quotes, e.g. inputFile, scriptArgs, runAndLumis).
Changed:
<
<
>
>
rm -rf jobReport.json cmsRun-stdout.log edmProvDumpOutput.log jobReportExtract.pickle FrameworkJobReport.xml outfile.root PSet.pkl PSet.py scramOutput.log jobLog.* wmcore_initialized debug CMSSW_7_0_5 process.id
tar xvzf sandbox.tar.gz
export PYTHONPATH=/afs/cern.ch/work/a/atanasi/private/github/repos/CRABServer/src/python:$PYTHONPATH
Line: 222 to 243
        EXIT_STATUS=$STAGEOUT_EXIT_STATUS
    fi
fi
Changed:
<
<
echo "======== Stageout at $(TZ=GMT date) FINISHING (status $STAGEOUT_EXIT_STATUS) ========"
>
>
echo "======== Stageout at $(TZ=GMT date) FINISHING (status $STAGEOUT_EXIT_STATUS) ========"
 
  • In the Job.1.submit file, for each line that starts with a + sign, remove the +. Also, change CRAB_Id = $(count) to CRAB_Id = 1.
  • Run the script:
Changed:
<
<
sh run_job_and_cmscp.sh
>
>
sh run_job_and_cmscp.sh
 

Scripts to test a new schedd

Requirements: On Schedd:

  • Transfer users proxy to the new schedd.
  • Create a file sleep.jdl with the following content:
Changed:
<
<
>
>
Universe = vanilla
Executable = sleep.sh
Arguments = 1
Line: 247 to 270
 when_to_transfer_output = ON_EXIT x509userproxy = /home/grid/cms1761/temp/user_proxy_file x509userproxysubject = "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jbalcas/CN=751133/CN=Justas Balcas"
Changed:
<
<
Queue 1
>
>
Queue 1
 
  • Create a script sleep.sh with the following content:
Changed:
<
<
>
>
#!/bin/bash
sleep $1
echo "I slept for $1 seconds on:"
hostname
Changed:
<
<
date
>
>
date
 
  • Run command on schedd
Changed:
<
<
condor_submit sleep.jdl
>
>
condor_submit sleep.jdl
 
  • condor_q -> This command will print a table containing information about the jobs that have submitted. You should be able to identify the jobs that you have just submitted. After some time, your job will complete. To see the output, cat the files that we set as the output.
From TW machine (The machine name must be added to SCHEDD.ALLOW_WRITE):
  • Create same sleep.jdl file as on Schedd.
  • Create same sleep.sh file as on Schedd.
  • Execute following commands :
Changed:
<
<
>
>
 #IF on SLC5 source /afs/cern.ch/cms/LCG/LCG-2/UI/cms_ui_env.sh #sourcing TW env will move you to current folder
Line: 275 to 301
sed --in-place "s|x509userproxy = .*|x509userproxy = $proxypath|" sleep.jdl
# Fill the new schedd information
_condor_TOOL_DEBUG=D_FULLDEBUG,D_SECURITY
Changed:
<
<
condor_submit -debug -pool glidein-collector-2.t2.ucsd.edu -remote crab3test-2@vocms95.cern.ch sleep.jdl
>
>
condor_submit -debug -pool glidein-collector-2.t2.ucsd.edu -remote crab3test-2@vocms95.cern.ch sleep.jdl
 
META TOPICMOVED by="atanasi" date="1412700166" from="CMS.Crab3OperatorDebugging" to="CMSPublic.Crab3OperatorDebugging"

Revision 282014-11-28 - FarrukhAftabKhan

Line: 1 to 1
 
META TOPICPARENT name="https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrab"
<!-- /ActionTrackerPlugin -->
Line: 12 to 12
 pre.note {background-color: white;}
Changed:
<
<
CRAB Logo
>
>
CRAB Logo
 

Operator Task Management

Line: 25 to 24
 

Finding what you need

  • Get the full name of the task from the user; this will tell you which schedd to look at.
Changed:
<
<
  • Refer to this page to find the vocmsXXX server and log file locations.
>
>
  • Refer to this page to find the vocmsXXX server and log file locations.
 

On the schedd

  • To deny a user with local unix user name cms123, add the following line to .../config.d/99_banned_users.conf (create if it does not exist):
Changed:
<
<
DENY_WRITE = cms123@*, $(DENY_WRITE)
>
>
DENY_WRITE = cms123@*, $(DENY_WRITE)
 Reload the schedd by sending it a SIGHUP.
  • To list all jobs in a task named 140221_015428_vocms20:bbockelm_crab_lxplus414_user_17 run:
Changed:
<
<
condor_q -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"'
>
>
condor_q -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"'
 
  • To kill a task named 140221_015428_vocms20:bbockelm_crab_lxplus414_user_17 run:
Changed:
<
<
condor_hold -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17" && TaskType=?="ROOT"'
>
>
condor_hold -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17" && TaskType=?="ROOT"'
 
  • To find the HTCondor working (spool) directory (different from the web monitor directory; contains things like the raw logfile, postjob, proxy, input files, etc) run:
Changed:
<
<
condor_q -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"' -af Iwd
>
>
condor_q -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"' -af Iwd
 
  • To find job one in the task, run:
Changed:
<
<
condor_q -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"&&CRAB_Id=?=1'
>
>
condor_q -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"&&CRAB_Id=?=1'
 You can replace condor_q with condor_history for completed jobs. If you want just the history for retry 2,
Changed:
<
<
condor_history -match 1 -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"&&CRAB_Id=?=1&&CRAB_Retry=?=2'
>
>
condor_history -match 1 -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"&&CRAB_Id=?=1&&CRAB_Retry=?=2'
 
  • To list all tasks for user
Changed:
<
<
condor_q cms293
>
>
condor_q cms293
 
  • To list a specific task configuration:
Changed:
<
<
condor_q -l 179330.0
>
>
condor_q -l 179330.0
 or if the task does not exist anymore, then it should be taken from the condor history:
Changed:
<
<
condor_history -l 179330.0
>
>
condor_history -l 179330.0
 
  • To see the list of users that are allowed to write on the schedd:
Changed:
<
<
condor_config_val SCHEDD.ALLOW_WRITE
>
>
condor_config_val SCHEDD.ALLOW_WRITE
 
  • To see the location of Master log or Schedd log:
Changed:
<
<
condor_config_val MASTER_LOG OR condor_config_val SCHEDD_LOG
>
>
condor_config_val MASTER_LOG OR condor_config_val SCHEDD_LOG
 
  • To see all the schedds known in the glidein-collector.t2.ucsd.ed collector:
Changed:
<
<
condor_status -pool glidein-collector.t2.ucsd.edu -schedd
>
>
condor_status -pool glidein-collector.t2.ucsd.edu -schedd
 
  • To see all running/idle by username:
Changed:
<
<
condor_q -name crab3test-4@vocms0109.cern.ch -name crab3test-2@vocms95.cern.ch -name crab3test-3@vocms96.cern.ch -format '%s\n' AccountingGroup | sort | uniq -c
>
>
condor_q -name crab3test-4@vocms0109.cern.ch -name crab3test-2@vocms95.cern.ch -name crab3test-3@vocms96.cern.ch -format '%s\n' AccountingGroup | sort | uniq -c
 
  • To see all running by username:
Changed:
<
<
condor_q -name crab3test-4@vocms0109.cern.ch -name crab3test-2@vocms95.cern.ch -name crab3test-3@vocms96.cern.ch -format '%s\n' AccountingGroup -const '(JobStatus=?=2)' | sort | uniq -c
>
>
condor_q -name crab3test-4@vocms0109.cern.ch -name crab3test-2@vocms95.cern.ch -name crab3test-3@vocms96.cern.ch -format '%s\n' AccountingGroup -const '(JobStatus=?=2)' | sort | uniq -c
 
  • To see all running in the production schedds:
Changed:
<
<
condor_q -name crab3test-4@vocms0109.cern.ch -name crab3test-2@vocms95.cern.ch -name crab3test-3@vocms96.cern.ch  -const '(JobStatus=?=2)' | grep gWMS-CMSRun | wc -l #to count running jobs
>
>
condor_q -name crab3test-4@vocms0109.cern.ch -name crab3test-2@vocms95.cern.ch -name crab3test-3@vocms96.cern.ch  -const '(JobStatus=?=2)' | grep gWMS-CMSRun | wc -l #to count running jobs
 Job Status explanation:
Changed:
<
<
 
0  Unexpanded      U
1  Idle            I
2  Running         R
3  Removed         X
4  Completed       C
5  Held            H
6  Submission_err  E
>
>
 
0  Unexpanded      U
1  Idle            I
2  Running         R
3  Removed         X
4  Completed       C
5  Held            H
6  Submission_err  E
 
  • Users sometimes find that their jobs do not run. There are several reasons why a specific job does not run. These reasons range from failed job or machine constraints, bias due to preferences, insufficient priority, or the preemption ``throttle'' that is implemented by the condor_negotiator to prevent thrashing. Many of these reasons can be diagnosed by using the -analyze option of condor_q.
Changed:
<
<
condor_q 184439.0 -analyze
>
>
condor_q 184439.0 -analyze
 or if it is not enough information, do
Changed:
<
<
condor_q 184439.0 -better-analyze
>
>
condor_q 184439.0 -better-analyze
 
  • To see the list of running pilots grouped by site:
Changed:
<
<
condor_status -format '%s\n' GLIDEIN_CMSSite  | sort | uniq -c
>
>
condor_status -format '%s\n' GLIDEIN_CMSSite  | sort | uniq -c
 

Monitoring links

Changed:
<
<
>
>
 
Line: 123 to 112
 

Crabcache (UserFileCache)

  • To get the list of all the users in the crabcache:
Changed:
<
<
>
>
 curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=listusers' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k {"result": [ "mmascher" ,"aldo"
Changed:
<
<
]}
>
>
]}
 
  • To get all the files uploaded by a user to the crabcache and the amount of quota he's using:
Changed:
<
<
>
>
 curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=userinfo&username=mmascher' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k {"result": [ {"file_list": ["/data/state/crabcache/files/m/mmascher/69/697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a", "/data/state/crabcache/files/m/mmascher/14/14571bc71cf2077961408fb1a060b9497a9c7d0cc1dcb47ed0f7fc1ac2e3748d", "/data/state/crabcache/files/m/mmascher/ef/efdedb430f8462c72259fd2e427e684d8e3aedf8d0d811bf7ef8a97f30a47bac", ..., "/data/state/crabcache/files/m/mmascher/05/059b06681025f14ecce5cbdc49c83e1e945a30838981c37b5293781090b07bd7", "/data/state/crabcache/files/m/mmascher/up/uplog.txt"], "used_space": [130170972]}
Changed:
<
<
]}
>
>
]}
 
  • To get more information about one specific file:
Changed:
<
<
>
>
 curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=fileinfo&hashkey=697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k {"result": [ {"hashkey": "697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a", "created": 1400161415.0, "modified": 1400161415.0, "exists": true, "size": 1303975}
Changed:
<
<
]}
>
>
]}
 
  • To remove a specific file (currently you can only remove your files. In the future power users should be able to remove everything):
Changed:
<
<
>
>
 curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=fileremove&hashkey=697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k {"result": [ ""
Changed:
<
<
]}
>
>
]}
 
  • To get the list of power users:
Changed:
<
<
>
>
 curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=powerusers' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k {"result": [ "mmascher"
Changed:
<
<
]}
>
>
]}
 
  • To get the quota each user has (power users has 10* this value):
Changed:
<
<
>
>
 curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=basicquota' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k {"result": [ {"quota_user_limit": 10}
Changed:
<
<
]}
>
>
]}
 

Development Oriented Tricks I did not know where to put

Line: 175 to 158
 
  • Copy the relevant arguments that are passed to the post-job corresponding to the above cluster id job as written in RunJobs.dag (the relevant arguments are those after $MAX_RETRIES).
  • Identify the file in the spool directory that contains your proxy (the file name is a sequence of 40 lower-case letters and/or digits) and make sure the proxy is still valid (voms-proxy-info -file <proxy_file_name>).
  • Run the post-job. The generic instruction is:
Changed:
<
<
sh -c 'export TEST_DONT_REDIRECT_STDOUT=True; export X509_USER_PROXY=<proxy_file_name>; sh dag_bootstrap.sh POSTJOB <job_cluster_id> <job_return_code> <retry_count> <max_retries> <arguments>'
>
>
sh -c 'export TEST_DONT_REDIRECT_STDOUT=True; export X509_USER_PROXY=<proxy_file_name>; sh dag_bootstrap.sh POSTJOB <job_cluster_id> <job_return_code> <retry_count> <max_retries> <arguments>'
 where you have to substitute the arguments with what you found in your RunJobs.dag, the proxy file name and the first four numbers after POSTJOB. One could also add export TEST_POSTJOB_DISABLE_RETRIES=True to disable the job retry handling part, or export TEST_POSTJOB_NO_STATUS_UPDATE=True to avoid sending status report to dashboard and file metadata upload. A specific example is:
Changed:
<
<
sh -c 'export TEST_DONT_REDIRECT_STDOUT=True; export X509_USER_PROXY=eebe07e240237dde878ab66bb56e7794e8d5b39e; sh dag_bootstrap.sh POSTJOB 29649 0 0 10 141118_124153_crab3test-5:atanasi_crab_test_new_cmscp_postjob_9_cern 1 /store/temp/user/atanasi.49f14a3a459fb095e75e64b6cc9e824bb642b6e1/GenericTTbar/test_new_cmscp_postjob_9_cern/141118_124153/0000 /store/user/atanasi/GenericTTbar/test_new_cmscp_postjob_9_cern/141118_124153/0000 cmsRun_1.log.tar.gz tfileservice_1.root pooloutputmodule_1.root'
>
>
sh -c 'export TEST_DONT_REDIRECT_STDOUT=True; export X509_USER_PROXY=eebe07e240237dde878ab66bb56e7794e8d5b39e; sh dag_bootstrap.sh POSTJOB 29649 0 0 10 141118_124153_crab3test-5:atanasi_crab_test_new_cmscp_postjob_9_cern 1 /store/temp/user/atanasi.49f14a3a459fb095e75e64b6cc9e824bb642b6e1/GenericTTbar/test_new_cmscp_postjob_9_cern/141118_124153/0000 /store/user/atanasi/GenericTTbar/test_new_cmscp_postjob_9_cern/141118_124153/0000 cmsRun_1.log.tar.gz tfileservice_1.root pooloutputmodule_1.root'
 
  • If the task is not yours, identify the user's cms code and add sudo -u cms<code> at the beginning of the instruction given above.

To run the job wrapper from lxplus

  • Copy the spool directory of the task you want to run to lxplus, e.g.:
Changed:
<
<
scp -r /data/srv/glidecondor/condor_local/spool/9290/0/cluster379290.proc0.subproc0 mmascher@lxplus:/afs/cern.ch/work/m/mmascher
>
>
scp -r /data/srv/glidecondor/condor_local/spool/9290/0/cluster379290.proc0.subproc0 mmascher@lxplus:/afs/cern.ch/work/m/mmascher
 
  • Go to the directory you just copied to lxplus.
  • Copy the job arguments of the job you want to run from glidemon or dashboard (fourth line of the joblog).
  • Prepare a script (I call it run_job.sh) to run your job (replace the job arguments with the one you have just copied. Be careful!!! You need to enclose some parameters with quotes, e.g. inputFile, scriptArgs, runAndLumis).
Changed:
<
<
>
>
rm -rf jobReport.json cmsRun-stdout.log edmProvDumpOutput.log jobReportExtract.pickle FrameworkJobReport.xml outfile.root PSet.pkl PSet.py scramOutput.log jobLog.* assa wmcore_initialized debug CMSSW_5_3_4 process.id run.sh.old
tar xvzf sandbox.tar.gz
export PYTHONPATH=~/repos/CRABServer/src/python:$PYTHONPATH
export _CONDOR_JOB_AD=.job.ad
export X509_USER_PROXY=/tmp/x509up_u8440
Changed:
<
<
export CRAB3_RUNTIME_DEBUG=TRUE; sh CMSRunAnalysis.sh -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=1 --cmsswVersion=CMSSW_5_3_4 --scramArch=slc5_amd64_gcc462 --inputFile='["/store/mc/HC/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0010/786D8FE8-BBAD-E111-884D-0025901D5DB2.root"]' --runAndLumis='{"1": 666666, 666666}' --lheInputFiles=None --firstEvent=None --firstLumi=None --lastEvent=None --firstRun=None --seeding=AutomaticSeeding --scriptExe=assa --scriptArgs='["assa=1", "ouch=2"]' -o '{}'
>
>
export CRAB3_RUNTIME_DEBUG=TRUE; sh CMSRunAnalysis.sh -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=1 --cmsswVersion=CMSSW_5_3_4 --scramArch=slc5_amd64_gcc462 --inputFile='["/store/mc/HC/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0010/786D8FE8-BBAD-E111-884D-0025901D5DB2.root"]' --runAndLumis='{"1": 666666, 666666}' --lheInputFiles=None --firstEvent=None --firstLumi=None --lastEvent=None --firstRun=None --seeding=AutomaticSeeding --scriptExe=assa --scriptArgs='["assa=1", "ouch=2"]' -o '{}'
 
  • Run the script:
Changed:
<
<
sh run_job.sh
>
>
sh run_job.sh
 

To run the job wrapper and the stageout wrapper from lxplus

  • Copy the spool directory of the task you want to run to lxplus, e.g.:
Changed:
<
<
scp -r /data/srv/glidecondor/condor_local/spool/2789/0/cluster72789.proc0.subproc0 atanasi@lxplus.cern.ch:/afs/cern.ch/work/a/atanasi/private/
>
>
scp -r /data/srv/glidecondor/condor_local/spool/2789/0/cluster72789.proc0.subproc0 atanasi@lxplus.cern.ch:/afs/cern.ch/work/a/atanasi/private/
 
  • Go to the directory you just copied to lxplus.
  • Copy the job arguments of the job you want to run from glidemon or dashboard (fourth line of the joblog). I will assume that we want to run job number 1 of the task.
  • Prepare a script (I call it run_job_and_cmscp.sh) to run your job (replace the job arguments with the one you have just copied. Be careful!!! You need to enclose some parameters with quotes, e.g. inputFile, scriptArgs, runAndLumis).
Changed:
<
<
>
>
rm -rf jobReport.json cmsRun-stdout.log edmProvDumpOutput.log jobReportExtract.pickle FrameworkJobReport.xml outfile.root PSet.pkl PSet.py scramOutput.log jobLog.* wmcore_initialized debug CMSSW_7_0_5 process.id
tar xvzf sandbox.tar.gz
export PYTHONPATH=/afs/cern.ch/work/a/atanasi/private/github/repos/CRABServer/src/python:$PYTHONPATH
Line: 243 to 222
        EXIT_STATUS=$STAGEOUT_EXIT_STATUS
    fi
fi
Changed:
<
<
echo "======== Stageout at $(TZ=GMT date) FINISHING (status $STAGEOUT_EXIT_STATUS) ========"
>
>
echo "======== Stageout at $(TZ=GMT date) FINISHING (status $STAGEOUT_EXIT_STATUS) ========"
 
  • In the Job.1.submit file, for each line that starts with a + sign, remove the +. Also, change CRAB_Id = $(count) to CRAB_Id = 1.
  • Run the script:
Changed:
<
<
sh run_job_and_cmscp.sh
>
>
sh run_job_and_cmscp.sh
 

Scripts to test a new schedd

Requirements: On Schedd:

  • Transfer users proxy to the new schedd.
  • Create a file sleep.jdl with the following content:
Changed:
<
<
>
>
Universe = vanilla
Executable = sleep.sh
Arguments = 1
Line: 268 to 245
 should_transfer_files = YES RequestMemory = 2000 when_to_transfer_output = ON_EXIT
Deleted:
<
<
Queue 1
 x509userproxy = /home/grid/cms1761/temp/user_proxy_file x509userproxysubject = "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jbalcas/CN=751133/CN=Justas Balcas"
Changed:
<
<
>
>
Queue 1
 
  • Create a script sleep.sh with the following content:
Changed:
<
<
>
>
#!/bin/bash
sleep $1
echo "I slept for $1 seconds on:"
hostname
Changed:
<
<
date
>
>
date
 
  • Run command on schedd
Changed:
<
<
condor_submit sleep.jdl
>
>
condor_submit sleep.jdl
 
  • condor_q -> This command will print a table containing information about the jobs that have submitted. You should be able to identify the jobs that you have just submitted. After some time, your job will complete. To see the output, cat the files that we set as the output.
From TW machine (The machine name must be added to SCHEDD.ALLOW_WRITE):
  • Create same sleep.jdl file as on Schedd.
  • Create same sleep.sh file as on Schedd.
  • Execute following commands :
Changed:
<
<
>
>
 #IF on SLC5 source /afs/cern.ch/cms/LCG/LCG-2/UI/cms_ui_env.sh #sourcing TW env will move you to current folder
Line: 301 to 275
sed --in-place "s|x509userproxy = .*|x509userproxy = $proxypath|" sleep.jdl
# Fill the new schedd information
_condor_TOOL_DEBUG=D_FULLDEBUG,D_SECURITY
Changed:
<
<
condor_submit -debug -pool glidein-collector-2.t2.ucsd.edu -remote crab3test-2@vocms95.cern.ch sleep.jdl
>
>
condor_submit -debug -pool glidein-collector-2.t2.ucsd.edu -remote crab3test-2@vocms95.cern.ch sleep.jdl
 
META TOPICMOVED by="atanasi" date="1412700166" from="CMS.Crab3OperatorDebugging" to="CMSPublic.Crab3OperatorDebugging"

Revision 272014-11-25 - AndresTanasijczuk

Line: 1 to 1
 
META TOPICPARENT name="https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrab"
<!-- /ActionTrackerPlugin -->
Line: 235 to 235
 echo "======== Stageout at $(TZ=GMT date) STARTING ========" rm -f wmcore_initialized export _CONDOR_JOB_AD=Job.1.submit
Added:
>
>
export TEST_CMSCP_NO_STATUS_UPDATE=True
PYTHONUNBUFFERED=1 ./cmscp.py "JOB_EXIT_CODE=${EXIT_STATUS}"
STAGEOUT_EXIT_STATUS=$?
if [ $STAGEOUT_EXIT_STATUS -ne 0 ]; then

Revision 262014-11-23 - AndresTanasijczuk

Line: 1 to 1
 
META TOPICPARENT name="https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrab"
<!-- /ActionTrackerPlugin -->
Line: 210 to 210
 
  • Copy the spool directory of the task you want to run to lxplus, e.g.:
Changed:
<
<
scp -r /data/srv/glidecondor/condor_local/spool/9290/0/cluster379290.proc0.subproc0 mmascher@lxplus:/afs/cern.ch/work/m/mmascher
>
>
scp -r /data/srv/glidecondor/condor_local/spool/2789/0/cluster72789.proc0.subproc0 atanasi@lxplusNOSPAMPLEASE.cern.ch:/afs/cern.ch/work/a/atanasi/private/
 
  • Go to the directory you just copied to lxplus.
Changed:
<
<
  • Copy the job arguments of the job you want to run from glidemon or dashboard (fourth line of the joblog).
>
>
  • Copy the job arguments of the job you want to run from glidemon or dashboard (fourth line of the joblog). I will assume that we want to run job number 1 of the task.
 
  • Prepare a script (I call it run_job_and_cmscp.sh) to run your job (replace the job arguments with the one you have just copied. Be careful!!! You need to enclose some parameters with quotes, e.g. inputFile, scriptArgs, runAndLumis).
Changed:
<
<
rm -rf jobReport.json cmsRun-stdout.log edmProvDumpOutput.log jobReportExtract.pickle FrameworkJobReport.xml outfile.root PSet.pkl PSet.py scramOutput.log jobLog.* wmcore_initialized debug CMSSW_5_3_4 process.id run.sh.old
>
>
rm -rf jobReport.json cmsRun-stdout.log edmProvDumpOutput.log jobReportExtract.pickle FrameworkJobReport.xml outfile.root PSet.pkl PSet.py scramOutput.log jobLog.* wmcore_initialized debug CMSSW_7_0_5 process.id
 tar xvzf sandbox.tar.gz export PYTHONPATH=/afs/cern.ch/work/a/atanasi/private/github/repos/CRABServer/src/python:$PYTHONPATH export _CONDOR_JOB_AD=.job.ad
Changed:
<
<
export X509_USER_PROXY=/tmp/x509up_u57506
>
>
export X509_USER_PROXY=/tmp/x509up_u`id -u`
 export CRAB3_RUNTIME_DEBUG=TRUE echo "======== CMSRunAnalysis.sh at $(TZ=GMT date) STARTING ========"
Changed:
<
<
sh CMSRunAnalysis.sh -a sandbox.tar.gz --sourceURL=https://cmsweb-testbed.cern.ch/crabcache --jobNumber=1 --cmsswVersion=CMSSW_7_0_5 --scramArch=slc6_amd64_gcc481 --inputFile='["/store/mc/HC/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0010/786D8FE8-BBAD-E111-884D-0025901D5DB2.root"]' --runAndLumis='{"1": 666666, 666666}' --lheInputFiles=False --firstEvent=None --firstLumi=None --lastEvent=None --firstRun=None --seeding=AutomaticSeeding --scriptExe=None --scriptArgs='[]' -o '{}'
>
>
sh CMSRunAnalysis.sh -a sandbox.tar.gz --sourceURL=https://cmsweb-testbed.cern.ch/crabcache --jobNumber=1 --cmsswVersion=CMSSW_7_0_5 --scramArch=slc6_amd64_gcc481 --inputFile='["/store/mc/HC/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0010/786D8FE8-BBAD-E111-884D-0025901D5DB2.root"]' --runAndLumis=job_lumis_1.json --lheInputFiles=False --firstEvent=None --firstLumi=None --lastEvent=None --firstRun=None --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=None --scriptArgs=[] -o {}
EXIT_STATUS=$?
echo "CMSRunAnalysis.sh complete at $(TZ=GMT date) with exit status $EXIT_STATUS"
echo "======== CMSRunAnalysis.sh at $(TZ=GMT date) FINISHING ========"
Line: 236 to 236
rm -f wmcore_initialized
export _CONDOR_JOB_AD=Job.1.submit
PYTHONUNBUFFERED=1 ./cmscp.py "JOB_EXIT_CODE=${EXIT_STATUS}"
Added:
>
>
STAGEOUT_EXIT_STATUS=$?
if [ $STAGEOUT_EXIT_STATUS -ne 0 ]; then
    if [ $EXIT_STATUS -eq 0 ]; then
        EXIT_STATUS=$STAGEOUT_EXIT_STATUS
Line: 243 to 244
fi
echo "======== Stageout at $(TZ=GMT date) FINISHING (status $STAGEOUT_EXIT_STATUS) ========"
Added:
>
>
  • In the Job.1.submit file, for each line that starts with a + sign, remove the +. Also, change CRAB_Id = $(count) to CRAB_Id = 1 (a sed sketch of this cleanup follows this list).
 
  • Run the script:
sh run_job_and_cmscp.sh
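A minimal sed sketch of the Job.1.submit cleanup described in the bullet above (GNU sed assumed; a .bak copy is kept):
# strip the leading '+' from classad lines and pin CRAB_Id to job 1
sed -i.bak -e 's/^+//' -e 's/CRAB_Id = \$(count)/CRAB_Id = 1/' Job.1.submit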

Revision 252014-11-23 - AndresTanasijczuk

Line: 1 to 1
 
META TOPICPARENT name="https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrab"
<!-- /ActionTrackerPlugin -->
Line: 192 to 192
 
  • Go to the directory you just copied to lxplus.
  • Copy the job arguments of the job you want to run from glidemon or dashboard (fourth line of the joblog).
Changed:
<
<
  • Prepare a script (I call it run.sh) to run your job (replace the job arguments with the one you have just copied. Be careful!!! You need to enclose some parameters with quotes, e.g. inputFile, scriptArgs, runAndLumis).
>
>
  • Prepare a script (I call it run_job.sh) to run your job (replace the job arguments with the ones you have just copied. Be careful!!! You need to enclose some parameters with quotes, e.g. inputFile, scriptArgs, runAndLumis).
 
rm -rf jobReport.json cmsRun-stdout.log edmProvDumpOutput.log jobReportExtract.pickle FrameworkJobReport.xml outfile.root PSet.pkl PSet.py scramOutput.log jobLog.* assa wmcore_initialized debug CMSSW_5_3_4 process.id run.sh.old
tar xvzf sandbox.tar.gz
Line: 203 to 203
 
  • Run the script:
Changed:
<
<
sh run.sh
>
>
sh run_job.sh

To run the job wrapper and the stageout wrapper from lxplus

  • Copy the spool directory of the task you want to run to lxplus, e.g.:
scp -r /data/srv/glidecondor/condor_local/spool/9290/0/cluster379290.proc0.subproc0 mmascher@lxplus:/afs/cern.ch/work/m/mmascher
  • Go to the directory you just copied to lxplus.
  • Copy the job arguments of the job you want to run from glidemon or dashboard (fourth line of the joblog).
  • Prepare a script (I call it run_job_and_cmscp.sh) to run your job (replace the job arguments with the ones you have just copied. Be careful!!! You need to enclose some parameters with quotes, e.g. inputFile, scriptArgs, runAndLumis).
rm -rf jobReport.json cmsRun-stdout.log edmProvDumpOutput.log jobReportExtract.pickle FrameworkJobReport.xml outfile.root PSet.pkl PSet.py scramOutput.log jobLog.* wmcore_initialized debug CMSSW_5_3_4 process.id run.sh.old
tar xvzf sandbox.tar.gz
export PYTHONPATH=/afs/cern.ch/work/a/atanasi/private/github/repos/CRABServer/src/python:$PYTHONPATH
export _CONDOR_JOB_AD=.job.ad
export X509_USER_PROXY=/tmp/x509up_u57506
export CRAB3_RUNTIME_DEBUG=TRUE
echo "======== CMSRunAnalysis.sh at $(TZ=GMT date) STARTING ========"
sh CMSRunAnalysis.sh -a sandbox.tar.gz --sourceURL=https://cmsweb-testbed.cern.ch/crabcache --jobNumber=1 --cmsswVersion=CMSSW_7_0_5 --scramArch=slc6_amd64_gcc481 --inputFile='["/store/mc/HC/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0010/786D8FE8-BBAD-E111-884D-0025901D5DB2.root"]' --runAndLumis='{"1": [[666666, 666666]]}' --lheInputFiles=False --firstEvent=None --firstLumi=None --lastEvent=None --firstRun=None --seeding=AutomaticSeeding --scriptExe=None --scriptArgs='[]' -o '{}'
EXIT_STATUS=$?
echo "CMSRunAnalysis.sh complete at $(TZ=GMT date) with exit status $EXIT_STATUS"
echo "======== CMSRunAnalsysis.sh at $(TZ=GMT date) FINISHING ========"
mv jobReport.json jobReport.json.1
export VO_CMS_SW_DIR=/cvmfs/cms.cern.ch
. $VO_CMS_SW_DIR/cmsset_default.sh
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$VO_CMS_SW_DIR/COMP/slc5_amd64_gcc434/external/openssl/0.9.7m/lib:$VO_CMS_SW_DIR/COMP/slc5_amd64_gcc434/external/bz2lib/1.0.5/lib
command -v python2.6 > /dev/null
echo "======== Stageout at $(TZ=GMT date) STARTING ========"
rm -f wmcore_initialized
export _CONDOR_JOB_AD=Job.1.submit
PYTHONUNBUFFERED=1 ./cmscp.py "JOB_EXIT_CODE=${EXIT_STATUS}"
if [ $STAGEOUT_EXIT_STATUS -ne 0 ]; then
    if [ $EXIT_STATUS -eq 0 ]; then
        EXIT_STATUS=$STAGEOUT_EXIT_STATUS
    fi
fi
echo "======== Stageout at $(TZ=GMT date) FINISHING (status $STAGEOUT_EXIT_STATUS) ========"
  • Run the script:
sh run_job_and_cmscp.sh
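To re-run a job other than number 1, every job-1 token in the script has to change; a hypothetical helper sketch (note that --inputFile and --runAndLumis must also be replaced with that job's actual arguments from glidemon/dashboard):
N=3   # the job number to re-run
sed -e "s/--jobNumber=1/--jobNumber=$N/" \
    -e "s/Job\.1\.submit/Job.$N.submit/" \
    -e "s/jobReport\.json\.1/jobReport.json.$N/" \
    run_job_and_cmscp.sh > run_job_and_cmscp_$N.sh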
 

Scripts to test a new schedd

Revision 242014-11-19 - AndresTanasijczuk

Line: 1 to 1
 
META TOPICPARENT name="https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrab"
<!-- /ActionTrackerPlugin -->
Line: 178 to 178
 
sh -c 'export TEST_DONT_REDIRECT_STDOUT=True; export X509_USER_PROXY=<proxy_file_name>; sh dag_bootstrap.sh POSTJOB <job_cluster_id> <job_return_code> <retry_count> <max_retries> <arguments>'
Changed:
<
<
where you have to substitute the arguments with what you found in your RunJobs.dag, the proxy file name and the first four numbers after POSTJOB. A specific example is:
>
>
where you have to substitute the arguments with what you find in your RunJobs.dag, the proxy file name, and the first four numbers after POSTJOB. One could also add export TEST_POSTJOB_DISABLE_RETRIES=True to disable the job retry handling part, or export TEST_POSTJOB_NO_STATUS_UPDATE=True to avoid sending the status report to the dashboard and uploading the file metadata. A specific example is:
 
sh -c 'export TEST_DONT_REDIRECT_STDOUT=True; export X509_USER_PROXY=eebe07e240237dde878ab66bb56e7794e8d5b39e; sh dag_bootstrap.sh POSTJOB 29649 0 0 10 141118_124153_crab3test-5:atanasi_crab_test_new_cmscp_postjob_9_cern 1 /store/temp/user/atanasi.49f14a3a459fb095e75e64b6cc9e824bb642b6e1/GenericTTbar/test_new_cmscp_postjob_9_cern/141118_124153/0000 /store/user/atanasi/GenericTTbar/test_new_cmscp_postjob_9_cern/141118_124153/0000 cmsRun_1.log.tar.gz tfileservice_1.root pooloutputmodule_1.root'
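For instance, combining the generic instruction with the optional switches mentioned above gives (placeholders as before):
sh -c 'export TEST_DONT_REDIRECT_STDOUT=True; export TEST_POSTJOB_DISABLE_RETRIES=True; export TEST_POSTJOB_NO_STATUS_UPDATE=True; export X509_USER_PROXY=<proxy_file_name>; sh dag_bootstrap.sh POSTJOB <job_cluster_id> <job_return_code> <retry_count> <max_retries> <arguments>'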

Revision 232014-11-18 - AndresTanasijczuk

Line: 1 to 1
 
META TOPICPARENT name="https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrab"
Added:
>
>
<!-- /ActionTrackerPlugin -->

CRAB Logo

 

Operator Task Management

Changed:
<
<
Helpful tips, tricks, and one-line commands for operators to navigate CRAB3.
>
>
Complete: 5 Go to SWGuideCrab

Helpful tips, tricks and one-line commands for operators to navigate CRAB3.

 

Finding what you need

Line: 11 to 29
 

On the schedd

Changed:
<
<
  • To deny a user with local unix user name cms123, add the following line to .../config.d/99_banned_users.conf (create if it does not exist):
    DENY_WRITE = cms123@*, $(DENY_WRITE) 
    Reload the schedd by sending it SIGHUP.
  • To list all jobs in a task named 140221_015428_vocms20:bbockelm_crab_lxplus414_user_17, run:
    condor_q -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"'
  • To kill a task named 140221_015428_vocms20:bbockelm_crab_lxplus414_user_17, run:
    condor_hold -const 'CRAB_ReqName =?= "140221_015428_vocms20:bbockelm_crab_lxplus414_user_17" && TaskType =?= "ROOT"'
  • To find the HTCondor working (spool) directory (different from the web monitor directory; contains things like the raw logfile, postjob, proxy, input files, etc), run:
    condor_q -const 'CRAB_ReqName =?= "140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"' -af Iwd
  • To find job one in the task, run:
    condor_q -const 'CRAB_ReqName =?= "140221_015428_vocms20:bbockelm_crab_lxplus414_user_17" && CRAB_Id =?= 1'
    You can replace condor_q with condor_history for completed jobs. If you want just the history for retry 2,
    condor_history -match 1 -const 'CRAB_ReqName =?= "140221_015428_vocms20:bbockelm_crab_lxplus414_user_17" && CRAB_Id =?= 1 && CRAB_Retry =?= 2'
  • To list all tasks for user
    condor_q cms293
  • To list specific task configuration
    condor_q -l 179330.0
    or if it does not exists, then it should be taken from history :
    condor_history -l 179330.0 
  • To see list of allowed to write on Schedd :
    condor_config_val SCHEDD.ALLOW_WRITE
  • To see location of Master log or Schedd log :
    condor_config_val MASTER_LOG OR condor_config_val SCHEDD_LOG
  • To see all the schedd known in the glidein-collector.t2.ucsd.ed collector:
    condor_status -pool glidein-collector.t2.ucsd.edu -schedd
  • To see all running/idle by username :
    condor_q -name crab3test-4@vocms0109.cern.ch -name crab3test-2@vocms95.cern.ch -name crab3test-3@vocms96.cern.ch -format '%s\n' AccountingGroup | sort | uniq -c
  • To see all running by username :
    condor_q -name crab3test-4@vocms0109.cern.ch -name crab3test-2@vocms95.cern.ch -name crab3test-3@vocms96.cern.ch -format '%s\n' AccountingGroup -const '(JobStatus=?=2)' | sort | uniq -c
  • To see all running in the production schedds:
     condor_q -name crab3test-4@vocms0109.cern.ch -name crab3test-2@vocms95.cern.ch -name crab3test-3@vocms96.cern.ch  -const '(JobStatus=?=2)' | grep gWMS-CMSRun | wc -l #to count running jobs 
    Job Status explanation :
     0 Unexpanded U 1 Idle I 2 Running R 3 Removed X 4 Completed C 5 Held H 6 Submission_err E 
  • Users sometimes find that their jobs do not run. There are several reasons why a specific job does not run. These reasons range from failed job or machine constraints, bias due to preferences, insufficient priority, or the preemption ``throttle'' that is implemented by the condor_negotiator to prevent thrashing. Many of these reasons can be diagnosed by using the -analyze option of condor_q.
    condor_q 184439.0 -analyze
    or if it is not enough information, do
    condor_q 184439.0 -better-analyze
  • To see the list of running pilots grouped by site:
    condor_status -format '%s\n' GLIDEIN_CMSSite  | sort | uniq -c
>
>
  • To deny a user with local unix user name cms123, add the following line to .../config.d/99_banned_users.conf (create if it does not exist):
DENY_WRITE = cms123@*, $(DENY_WRITE)
Reload the schedd by sending it a SIGHUP (a worked sketch of this flow follows at the end of this list).
  • To list all jobs in a task named 140221_015428_vocms20:bbockelm_crab_lxplus414_user_17 run:
condor_q -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"'
  • To kill a task named 140221_015428_vocms20:bbockelm_crab_lxplus414_user_17 run:
condor_hold -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17" && TaskType=?="ROOT"'
  • To find the HTCondor working (spool) directory (different from the web monitor directory; contains things like the raw logfile, postjob, proxy, input files, etc) run:
condor_q -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"' -af Iwd
  • To find job one in the task, run:
condor_q -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"&&CRAB_Id=?=1'
You can replace condor_q with condor_history for completed jobs. If you want just the history for retry 2,
condor_history -match 1 -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"&&CRAB_Id=?=1&&CRAB_Retry=?=2'
  • To list all tasks for a user:
condor_q cms293
  • To list a specific task configuration:
condor_q -l 179330.0
or if the task does not exist anymore, then it should be taken from the condor history:
condor_history -l 179330.0
  • To see the list of users that are allowed to write on the schedd:
condor_config_val SCHEDD.ALLOW_WRITE
  • To see the location of Master log or Schedd log:
condor_config_val MASTER_LOG or condor_config_val SCHEDD_LOG
  • To see all the schedds known in the glidein-collector.t2.ucsd.edu collector:
condor_status -pool glidein-collector.t2.ucsd.edu -schedd
  • To see all running/idle by username:
condor_q -name crab3test-4@vocms0109.cern.ch -name crab3test-2@vocms95.cern.ch -name crab3test-3@vocms96.cern.ch -format '%s\n' AccountingGroup | sort | uniq -c
  • To see all running by username:
condor_q -name crab3test-4@vocms0109.cern.ch -name crab3test-2@vocms95.cern.ch -name crab3test-3@vocms96.cern.ch -format '%s\n' AccountingGroup -const '(JobStatus=?=2)' | sort | uniq -c
  • To see all running in the production schedds:
condor_q -name crab3test-4@vocms0109.cern.ch -name crab3test-2@vocms95.cern.ch -name crab3test-3@vocms96.cern.ch  -const '(JobStatus=?=2)' | grep gWMS-CMSRun | wc -l #to count running jobs
Job Status explanation:
 
0 Unexpanded U
1 Idle I
2 Running R
3 Removed X
4 Completed C
5 Held H
6 Submission_err E
  • Users sometimes find that their jobs do not run. There are several reasons why a specific job does not run. These reasons range from failed job or machine constraints, bias due to preferences, insufficient priority, or the preemption "throttle" that is implemented by the condor_negotiator to prevent thrashing. Many of these reasons can be diagnosed by using the -analyze option of condor_q.
condor_q 184439.0 -analyze
or if it is not enough information, do
condor_q 184439.0 -better-analyze
  • To see the list of running pilots grouped by site:
condor_status -format '%s\n' GLIDEIN_CMSSite  | sort | uniq -c
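As a worked example of the ban-a-user bullet at the top of this list, a minimal sketch (the config.d location was elided above and is site-specific, so it is left as a placeholder here; pgrep/kill implement the SIGHUP reload):
CONFD=.../config.d                       # set to your schedd's condor config.d directory
echo 'DENY_WRITE = cms123@*, $(DENY_WRITE)' >> $CONFD/99_banned_users.conf
kill -HUP $(pgrep -x condor_schedd)      # reload the schedd via SIGHUP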
 

Monitoring links

Deleted:
<
<
 
Line: 43 to 122
 

Crabcache (UserFileCache)

Changed:
<
<
  • To get the list of all the users in the crabcache:
     curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=listusers' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
    
    
>
>
  • To get the list of all the users in the crabcache:
curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=listusers' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
 {"result": [ "mmascher" ,"aldo"
Changed:
<
<
]}
  • To get all the files uploaded by a user to the crabcache and the amount of quota he's using:
    curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=userinfo&username=mmascher' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
    
    
>
>
]}
  • To get all the files uploaded by a user to the crabcache and the amount of quota he's using:
curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=userinfo&username=mmascher' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
 {"result": [ {"file_list": ["/data/state/crabcache/files/m/mmascher/69/697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a", "/data/state/crabcache/files/m/mmascher/14/14571bc71cf2077961408fb1a060b9497a9c7d0cc1dcb47ed0f7fc1ac2e3748d", "/data/state/crabcache/files/m/mmascher/ef/efdedb430f8462c72259fd2e427e684d8e3aedf8d0d811bf7ef8a97f30a47bac", ..., "/data/state/crabcache/files/m/mmascher/05/059b06681025f14ecce5cbdc49c83e1e945a30838981c37b5293781090b07bd7", "/data/state/crabcache/files/m/mmascher/up/uplog.txt"], "used_space": [130170972]}
Changed:
<
<
]}
  • To get more information about one specific file:
     curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=fileinfo&hashkey=697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
    
    
>
>
]}
  • To get more information about one specific file:
curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=fileinfo&hashkey=697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
 {"result": [ {"hashkey": "697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a", "created": 1400161415.0, "modified": 1400161415.0, "exists": true, "size": 1303975}
Changed:
<
<
]}
  • To remove a specific file (currently you can only remove your files. In the future power users should be able to remove everything):
     curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=fileremove&hashkey=697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
    
    
>
>
]}
  • To remove a specific file (currently you can only remove your files. In the future power users should be able to remove everything):
curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=fileremove&hashkey=697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
 {"result": [ ""
Changed:
<
<
]}
  • To get the list of power users:
     curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=powerusers' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
    
    
>
>
]}
  • To get the list of power users:
curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=powerusers' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
 {"result": [ "mmascher"
Changed:
<
<
]}
  • To get the quota each user has (power users has 10* this value):
     curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=basicquota' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
    
    
>
>
]}
  • To get the quota each user has (power users have 10 times this value):
curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=basicquota' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
 {"result": [ {"quota_user_limit": 10}
Changed:
<
<
]}
>
>
]}
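A small loop over the two endpoints above to report every user's used space in one go (a sketch; jq is an assumption, any JSON parser will do):
CURL="curl -s --key $X509_USER_PROXY --cert $X509_USER_PROXY -k"
for u in $($CURL 'https://cmsweb.cern.ch/crabcache/info?subresource=listusers' | jq -r '.result[]'); do
    used=$($CURL "https://cmsweb.cern.ch/crabcache/info?subresource=userinfo&username=$u" | jq -r '.result[0].used_space[0]')
    echo "$u: $used bytes"
done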
 

Development Oriented Tricks I did not know where to put

Changed:
<
<
  • To run a PostJob on the schedd:
    1. Look at the Glidemon and find out the cluster id you want to run
    2. Go to the spool directory of the task
    3. Copy the relevan part in RunJobs.dag (see point below).
    4. Run it! Don't just copy and paste the part below, but substitute it to what you find in your RunJobs.dag. Remember to change the first four number after POSTJOB to something like this ... POSTJOB 1198863 0 0 10 ... where 1198863 is the cluster id (jobid):
      sudo -u cms1287 sh -c 'export TEST_DONT_REDIRECT_STDOUT=True;export X509_USER_PROXY=602e17194771641967ee6db7e7b3ffe358a54c59;sh dag_bootstrap.sh POSTJOB 1198863 0 0 10 mmascher-poc.cern.ch /crabserver/dev/filemetadata 140327_144908_vocms20:mmascher_crab_test11 1 140327_144908_crab_test11-6c2e61e17ebc64360cffbba7b17d735a CMSSW_5_3_4 T2_IT_Legnaro /store/temp/user/mmascher.a51a1e1d7e1bbd3799d579c676a09fb91de3cc23/GenericTTbar/140327_144908_crab_test11/140327_144908/0000 /store/user/mmascher/GenericTTbar/140327_144908_crab_test11/140327_144908/0000 cmsRun_1.log.tar.gz outfile_1.root'

  • To run the job wrapper from lxplus:
    1. Copy the spool directory of the task you want to run to lxplus, e.g.:
      scp -r /data/srv/glidecondor/condor_local/spool/9290/0/cluster379290.proc0.subproc0 mmascher@lxplus:/afs/cern.ch/work/m/mmascher
    2. Go to the directory you just copied to lxplus
    3. Copy the job arguments of the job you want to run from glidemon or dashboard (fourth line of the joblog)
    4. Prepare a script (I call it run.sh) to run your job (replace the job arguments with the one you have just copied. Be careful!!! You need to enclose some parameters with quotes, e.g. inputFile, scriptArgs, runAndLumis)
      rm -rf jobReport.json cmsRun-stdout.log edmProvDumpOutput.log jobReportExtract.pickle FrameworkJobReport.xml outfile.root PSet.pkl PSet.py scramOutput.log jobLog.* assa wmcore_initialized debug CMSSW_5_3_4 process.id run.sh.old
      
      
>
>

To run a post-job on the schedd

  • Go to the spool directory of the task in the schedd.
  • Look at the task page in GlideMon and copy the cluster id of the job for which you want to run the post-job.
  • Copy the relevant arguments that are passed to the post-job corresponding to the above cluster id job as written in RunJobs.dag (the relevant arguments are those after $MAX_RETRIES).
  • Identify the file in the spool directory that contains your proxy (the file name is a sequence of 40 lower-case letters and/or digits) and make sure the proxy is still valid (voms-proxy-info -file <proxy_file_name>).
  • Run the post-job. The generic instruction is:
sh -c 'export TEST_DONT_REDIRECT_STDOUT=True; export X509_USER_PROXY=<proxy_file_name>; sh dag_bootstrap.sh POSTJOB <job_cluster_id> <job_return_code> <retry_count> <max_retries> <arguments>'
where you have to substitute the arguments with what you find in your RunJobs.dag, the proxy file name, and the first four numbers after POSTJOB. A specific example is:
sh -c 'export TEST_DONT_REDIRECT_STDOUT=True; export X509_USER_PROXY=eebe07e240237dde878ab66bb56e7794e8d5b39e; sh dag_bootstrap.sh POSTJOB 29649 0 0 10 141118_124153_crab3test-5:atanasi_crab_test_new_cmscp_postjob_9_cern 1 /store/temp/user/atanasi.49f14a3a459fb095e75e64b6cc9e824bb642b6e1/GenericTTbar/test_new_cmscp_postjob_9_cern/141118_124153/0000 /store/user/atanasi/GenericTTbar/test_new_cmscp_postjob_9_cern/141118_124153/0000 cmsRun_1.log.tar.gz tfileservice_1.root pooloutputmodule_1.root'
  • If the task is not yours, identify the user's cms code and add sudo -u cms<code> at the beginning of the instruction given above.
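A sketch for the proxy step above, run from the task spool directory (the 40-character name pattern follows the description in this list):
PROXY=$(ls | grep -E '^[0-9a-z]{40}$' | head -n 1)   # the proxy file: 40 lower-case letters/digits
voms-proxy-info -file "$PROXY"                       # check it is still valid
export X509_USER_PROXY=$PROXY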

To run the job wrapper from lxplus

  • Copy the spool directory of the task you want to run to lxplus, e.g.:
scp -r /data/srv/glidecondor/condor_local/spool/9290/0/cluster379290.proc0.subproc0 mmascher@lxplus:/afs/cern.ch/work/m/mmascher
  • Go to the directory you just copied to lxplus.
  • Copy the job arguments of the job you want to run from glidemon or dashboard (fourth line of the joblog).
  • Prepare a script (I call it run.sh) to run your job (replace the job arguments with the ones you have just copied. Be careful!!! You need to enclose some parameters with quotes, e.g. inputFile, scriptArgs, runAndLumis).
rm -rf jobReport.json cmsRun-stdout.log edmProvDumpOutput.log jobReportExtract.pickle FrameworkJobReport.xml outfile.root PSet.pkl PSet.py scramOutput.log jobLog.* assa wmcore_initialized debug CMSSW_5_3_4 process.id run.sh.old
tar xvzf sandbox.tar.gz
export PYTHONPATH=~/repos/CRABServer/src/python:$PYTHONPATH
export _CONDOR_JOB_AD=.job.ad
export X509_USER_PROXY=/tmp/x509up_u8440
Changed:
<
<
export CRAB3_RUNTIME_DEBUG=TRUE; sh CMSRunAnalysis.sh -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=1 --cmsswVersion=CMSSW_5_3_4 --scramArch=slc5_amd64_gcc462 --inputFile='["/store/mc/HC/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0010/786D8FE8-BBAD-E111-884D-0025901D5DB2.root"]' --runAndLumis='{"1": 666666, 666666}' --lheInputFiles=None --firstEvent=None --firstLumi=None --lastEvent=None --firstRun=None --seeding=AutomaticSeeding --scriptExe=assa --scriptArgs='["assa=1", "ouch=2"]' -o '{}'
    1. Run the script
      sh run.sh
>
>
export CRAB3_RUNTIME_DEBUG=TRUE; sh CMSRunAnalysis.sh -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=1 --cmsswVersion=CMSSW_5_3_4 --scramArch=slc5_amd64_gcc462 --inputFile='["/store/mc/HC/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0010/786D8FE8-BBAD-E111-884D-0025901D5DB2.root"]' --runAndLumis='{"1": 666666, 666666}' --lheInputFiles=None --firstEvent=None --firstLumi=None --lastEvent=None --firstRun=None --seeding=AutomaticSeeding --scriptExe=assa --scriptArgs='["assa=1", "ouch=2"]' -o '{}'
  • Run the script:
sh run.sh
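Before running it, a quick pre-flight sketch of the inputs the recipe above assumes:
[ -f sandbox.tar.gz ] || echo "missing sandbox.tar.gz (re-copy the spool directory)"
[ -f .job.ad ]        || echo "missing .job.ad (pointed to by _CONDOR_JOB_AD)"
voms-proxy-info -file /tmp/x509up_u$(id -u) || echo "no valid proxy under /tmp"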
 
Changed:
<
<

Scripts to test new schedd

>
>

Scripts to test a new schedd

Requirements: on the schedd:
Changed:
<
<
  • Transfer users proxy to the new schedd
  • Create sleep.jdl file :
>
>
  • Transfer users proxy to the new schedd.
  • Create a file sleep.jdl with the following content:
Universe   = vanilla
Executable = sleep.sh
Arguments  = 1
Line: 111 to 227
x509userproxy = /home/grid/cms1761/temp/user_proxy_file
x509userproxysubject = "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jbalcas/CN=751133/CN=Justas Balcas"
Changed:
<
<
  • Create sleep.sh
>
>
  • Create a script sleep.sh with the following content:
#!/bin/bash
sleep $1
echo "I slept for $1 seconds on:"
Line: 120 to 236
 date
  • Run command on schedd
Changed:
<
<
>
>
 condor_submit sleep.jdl
  • condor_q -> This command will print a table containing information about the jobs that have submitted. You should be able to identify the jobs that you have just submitted. After some time, your job will complete. To see the output, cat the files that we set as the output.
Line: 128 to 244
 
  • Create same sleep.jdl file as on Schedd.
  • Create same sleep.sh file as on Schedd.
  • Execute following commands :
Changed:
<
<
>
>
 #IF on SLC5 source /afs/cern.ch/cms/LCG/LCG-2/UI/cms_ui_env.sh #sourcing TW env will move you to current folder

Revision 222014-10-28 - JustasBalcas

Line: 1 to 1
 
META TOPICPARENT name="https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrab"

Operator Task Management

Line: 7 to 7
 

Finding what you need

  • Get the full name of the task from the user; this will tell you which schedd to look at.
Changed:
<
<
  • Refer to this page to find the vocmsXXX server and log file locations.
>
>
  • Refer to this page to find the vocmsXXX server and log file locations.
 

On the schedd

Revision 212014-10-24 - MarcoMascheroni

Line: 1 to 1
 
META TOPICPARENT name="https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrab"

Operator Task Management

Line: 78 to 78
 
    1. Copy the relevant part in RunJobs.dag (see point below).
    2. Run it! Don't just copy and paste the part below, but substitute it with what you find in your RunJobs.dag. Remember to change the first four numbers after POSTJOB to something like this ... POSTJOB 1198863 0 0 10 ... where 1198863 is the cluster id (jobid):
      sudo -u cms1287 sh -c 'export TEST_DONT_REDIRECT_STDOUT=True;export X509_USER_PROXY=602e17194771641967ee6db7e7b3ffe358a54c59;sh dag_bootstrap.sh POSTJOB 1198863 0 0 10 mmascher-poc.cern.ch /crabserver/dev/filemetadata 140327_144908_vocms20:mmascher_crab_test11 1 140327_144908_crab_test11-6c2e61e17ebc64360cffbba7b17d735a CMSSW_5_3_4 T2_IT_Legnaro /store/temp/user/mmascher.a51a1e1d7e1bbd3799d579c676a09fb91de3cc23/GenericTTbar/140327_144908_crab_test11/140327_144908/0000 /store/user/mmascher/GenericTTbar/140327_144908_crab_test11/140327_144908/0000 cmsRun_1.log.tar.gz outfile_1.root'
Added:
>
>
  • To run the job wrapper from lxplus:
    1. Copy the spool directory of the task you want to run to lxplus, e.g.:
      scp -r /data/srv/glidecondor/condor_local/spool/9290/0/cluster379290.proc0.subproc0 mmascher@lxplus:/afs/cern.ch/work/m/mmascher
    2. Go to the directory you just copied to lxplus
    3. Copy the job arguments of the job you want to run from glidemon or dashboard (fourth line of the joblog)
    4. Prepare a script (I call it run.sh) to run your job (replace the job arguments with the ones you have just copied. Be careful!!! You need to enclose some parameters with quotes, e.g. inputFile, scriptArgs, runAndLumis)
      rm -rf jobReport.json cmsRun-stdout.log edmProvDumpOutput.log jobReportExtract.pickle FrameworkJobReport.xml outfile.root PSet.pkl PSet.py scramOutput.log jobLog.* assa wmcore_initialized debug CMSSW_5_3_4 process.id run.sh.old
      tar xvzf sandbox.tar.gz
      export PYTHONPATH=~/repos/CRABServer/src/python:$PYTHONPATH
      export _CONDOR_JOB_AD=.job.ad
      export X509_USER_PROXY=/tmp/x509up_u8440
      export CRAB3_RUNTIME_DEBUG=TRUE; sh CMSRunAnalysis.sh -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=1 --cmsswVersion=CMSSW_5_3_4 --scramArch=slc5_amd64_gcc462 --inputFile='["/store/mc/HC/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0010/786D8FE8-BBAD-E111-884D-0025901D5DB2.root"]' --runAndLumis='{"1": [[666666, 666666]]}' --lheInputFiles=None --firstEvent=None --firstLumi=None --lastEvent=None --firstRun=None --seeding=AutomaticSeeding --scriptExe=assa --scriptArgs='["assa=1", "ouch=2"]' -o '{}'
    5. Run the script
      sh run.sh
 

Scripts to test new schedd

Requirements : On Schedd :

Revision 202014-10-07 - AndresTanasijczuk

Line: 1 to 1
Changed:
<
<
META TOPICPARENT name="CRAB"
>
>
META TOPICPARENT name="https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrab"
 

Operator Task Management

Helpful tips, tricks, and one-line commands for operators to navigate CRAB3.

Line: 130 to 130
_condor_TOOL_DEBUG=D_FULLDEBUG,D_SECURITY
condor_submit -debug -pool glidein-collector-2.t2.ucsd.edu -remote crab3test-2@vocms95.cern.ch sleep.jdl
Added:
>
>
META TOPICMOVED by="atanasi" date="1412700166" from="CMS.Crab3OperatorDebugging" to="CMSPublic.Crab3OperatorDebugging"

Revision 192014-07-14 - JustasBalcas

Line: 1 to 1
 
META TOPICPARENT name="CRAB"

Operator Task Management

Line: 21 to 21
 
  • To see list of allowed to write on Schedd :
    condor_config_val SCHEDD.ALLOW_WRITE
  • To see location of Master log or Schedd log :
    condor_config_val MASTER_LOG OR condor_config_val SCHEDD_LOG
  • To see all the schedds known in the glidein-collector.t2.ucsd.edu collector:
    condor_status -pool glidein-collector.t2.ucsd.edu -schedd
Changed:
<
<
  • To see all running by username :
    condor_q -name crab3test-2@vocms95.cern.ch -name crab3test@submit-5.t2.ucsd.edu -name vocms20.cern.ch -format '%s\n' AccountingGroup | sort | uniq -c
  • To see all running in the vocms20 and submit-5 schedds:
     condor_q -name crab3test-2@vocms95.cern.ch -name crab3test@submit-5.t2.ucsd.edu -name vocms20.cern.ch -const '(JobStatus=?=2)' | grep gWMS-CMSRun | wc -l #to count running jobs 
    Job Status explanation :
0 Unexpanded U
1 Idle I
2 Running R
3 Removed X
4 Completed C
5 Held H
6 Submission_err E
>
>
  • To see all running/idle by username :
    condor_q -name crab3test-4@vocms0109.cern.ch -name crab3test-2@vocms95.cern.ch -name crab3test-3@vocms96.cern.ch -format '%s\n' AccountingGroup | sort | uniq -c
  • To see all running by username :
    condor_q -name crab3test-4@vocms0109.cern.ch -name crab3test-2@vocms95.cern.ch -name crab3test-3@vocms96.cern.ch -format '%s\n' AccountingGroup -const '(JobStatus=?=2)' | sort | uniq -c
  • To see all running in the production schedds:
     condor_q -name crab3test-4@vocms0109.cern.ch -name crab3test-2@vocms95.cern.ch -name crab3test-3@vocms96.cern.ch  -const '(JobStatus=?=2)' | grep gWMS-CMSRun | wc -l #to count running jobs 
    Job Status explanation :
0 Unexpanded U
1 Idle I
2 Running R
3 Removed X
4 Completed C
5 Held H
6 Submission_err E
 
  • Users sometimes find that their jobs do not run. There are several reasons why a specific job does not run. These reasons range from failed job or machine constraints, bias due to preferences, insufficient priority, or the preemption "throttle" that is implemented by the condor_negotiator to prevent thrashing. Many of these reasons can be diagnosed by using the -analyze option of condor_q.
    condor_q 184439.0 -analyze
    or if it is not enough information, do
    condor_q 184439.0 -better-analyze
  • To see the list of running pilots grouped by site:
    condor_status -format '%s\n' GLIDEIN_CMSSite  | sort | uniq -c

Revision 182014-07-03 - MarcoMascheroni

Line: 1 to 1
 
META TOPICPARENT name="CRAB"

Operator Task Management

Line: 24 to 24
 
  • To see all running by username :
    condor_q -name crab3test-2@vocms95.cern.ch -name crab3test@submit-5.t2.ucsd.edu -name vocms20.cern.ch -format '%s\n' AccountingGroup | sort | uniq -c
  • To see all running in the vocms20 and submit-5 schedds:
     condor_q -name crab3test-2@vocms95.cern.ch -name crab3test@submit-5.t2.ucsd.edu -name vocms20.cern.ch -const '(JobStatus=?=2)' | grep gWMS-CMSRun | wc -l #to count running jobs 
    Job Status explanation :
     0 Unexpanded U 1 Idle I 2 Running R 3 Removed X 4 Completed C 5 Held H 6 Submission_err E 
  • Users sometimes find that their jobs do not run. There are several reasons why a specific job does not run. These reasons range from failed job or machine constraints, bias due to preferences, insufficient priority, or the preemption ``throttle'' that is implemented by the condor_negotiator to prevent thrashing. Many of these reasons can be diagnosed by using the -analyze option of condor_q.
    condor_q 184439.0 -analyze
    or if it is not enough information, do
    condor_q 184439.0 -better-analyze
Added:
>
>
  • To see the list of running pilots grouped by site:
    condor_status -format '%s\n' GLIDEIN_CMSSite  | sort | uniq -c
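A variant sketch that splits the same pilot census by state (Activity is a standard HTCondor machine-ad attribute; Busy means the slot is running a job):
condor_status -format '%s ' GLIDEIN_CMSSite -format '%s\n' Activity | sort | uniq -c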
 

Monitoring links

Line: 37 to 38
 
Added:
>
>
 

Crabcache (UserFileCache)

Revision 172014-05-21 - MarcoMascheroni

Line: 1 to 1
 
META TOPICPARENT name="CRAB"

Operator Task Management

Line: 40 to 40
 

Crabcache (UserFileCache)

Added:
>
>
  • To get the list of all the users in the crabcache:
     curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=listusers' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
    {"result": [
     "mmascher"
    ,"aldo"
    ]}
 
  • To get all the files uploaded by a user to the crabcache and the amount of quota he's using:
    curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=userinfo&username=mmascher' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
    {"result": [
     {"file_list": ["/data/state/crabcache/files/m/mmascher/69/697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a", "/data/state/crabcache/files/m/mmascher/14/14571bc71cf2077961408fb1a060b9497a9c7d0cc1dcb47ed0f7fc1ac2e3748d", "/data/state/crabcache/files/m/mmascher/ef/efdedb430f8462c72259fd2e427e684d8e3aedf8d0d811bf7ef8a97f30a47bac", ..., "/data/state/crabcache/files/m/mmascher/05/059b06681025f14ecce5cbdc49c83e1e945a30838981c37b5293781090b07bd7", "/data/state/crabcache/files/m/mmascher/up/uplog.txt"], "used_space": [130170972]}
    
    

Revision 162014-05-20 - JustasBalcas

Line: 1 to 1
 
META TOPICPARENT name="CRAB"

Operator Task Management

Line: 40 to 40
 

Crabcache (UserFileCache)

Changed:
<
<
  • To get all the files uploaded by a user to the crabcache and the amount of quota he's using:
    curl -X GET 'https://mmascher-poc/crabcache/info?subresource=userinfo&username=mmascher' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
    
    
>
>
  • To get all the files uploaded by a user to the crabcache and the amount of quota he's using:
    curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=userinfo&username=mmascher' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
    
    
 {"result": [ {"file_list": ["/data/state/crabcache/files/m/mmascher/69/697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a", "/data/state/crabcache/files/m/mmascher/14/14571bc71cf2077961408fb1a060b9497a9c7d0cc1dcb47ed0f7fc1ac2e3748d", "/data/state/crabcache/files/m/mmascher/ef/efdedb430f8462c72259fd2e427e684d8e3aedf8d0d811bf7ef8a97f30a47bac", ..., "/data/state/crabcache/files/m/mmascher/05/059b06681025f14ecce5cbdc49c83e1e945a30838981c37b5293781090b07bd7", "/data/state/crabcache/files/m/mmascher/up/uplog.txt"], "used_space": [130170972]} ]}
Changed:
<
<
  • To get more information about one specific file:
    [lxplus411] /afs/cern.ch/user/m/mmascher > curl -X GET 'https://mmascher-poc/crabcache/info?subresource=fileinfo&hashkey=697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
    
    
>
>
  • To get more information about one specific file:
     curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=fileinfo&hashkey=697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
    
    
 {"result": [ {"hashkey": "697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a", "created": 1400161415.0, "modified": 1400161415.0, "exists": true, "size": 1303975} ]}
Changed:
<
<
  • To remove a specific file (currently you can only remove your files. In the future power users should be able to remove everything):
    [lxplus411] /afs/cern.ch/user/m/mmascher > curl -X GET 'https://mmascher-poc/crabcache/info?subresource=fileremove&hashkey=697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
    
    
>
>
  • To remove a specific file (currently you can only remove your files. In the future power users should be able to remove everything):
     curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=fileremove&hashkey=697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
    
    
 {"result": [ "" ]}
Changed:
<
<
  • To get the list of power users:
    [lxplus411] /afs/cern.ch/user/m/mmascher > curl -X GET 'https://mmascher-poc/crabcache/info?subresource=powerusers' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
    
    
>
>
  • To get the list of power users:
     curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=powerusers' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
    
    
 {"result": [ "mmascher" ]}
Changed:
<
<
  • To get the quota each user has (power users has 10* this value):
    [lxplus411] /afs/cern.ch/user/m/mmascher > curl -X GET 'https://mmascher-poc/crabcache/info?subresource=basicquota' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
    
    
>
>
  • To get the quota each user has (power users has 10* this value):
     curl -X GET 'https://cmsweb.cern.ch/crabcache/info?subresource=basicquota' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
    
    
 {"result": [ {"quota_user_limit": 10} ]}

Revision 152014-05-15 - MarcoMascheroni

Line: 1 to 1
 
META TOPICPARENT name="CRAB"

Operator Task Management

Line: 38 to 38
 
Added:
>
>

Crabcache (UserFileCache)

  • To get all the files uploaded by a user to the crabcache and the amount of quota he's using:
    curl -X GET 'https://mmascher-poc/crabcache/info?subresource=userinfo&username=mmascher' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
    {"result": [
     {"file_list": ["/data/state/crabcache/files/m/mmascher/69/697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a", "/data/state/crabcache/files/m/mmascher/14/14571bc71cf2077961408fb1a060b9497a9c7d0cc1dcb47ed0f7fc1ac2e3748d", "/data/state/crabcache/files/m/mmascher/ef/efdedb430f8462c72259fd2e427e684d8e3aedf8d0d811bf7ef8a97f30a47bac", ..., "/data/state/crabcache/files/m/mmascher/05/059b06681025f14ecce5cbdc49c83e1e945a30838981c37b5293781090b07bd7", "/data/state/crabcache/files/m/mmascher/up/uplog.txt"], "used_space": [130170972]}
    ]}
  • To get more information about one specific file:
    [lxplus411] /afs/cern.ch/user/m/mmascher > curl -X GET 'https://mmascher-poc/crabcache/info?subresource=fileinfo&hashkey=697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
    {"result": [
     {"hashkey": "697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a", "created": 1400161415.0, "modified": 1400161415.0, "exists": true, "size": 1303975}
    ]}
  • To remove a specific file (currently you can only remove your files. In the future power users should be able to remove everything):
    [lxplus411] /afs/cern.ch/user/m/mmascher > curl -X GET 'https://mmascher-poc/crabcache/info?subresource=fileremove&hashkey=697a932e19bd2912710fe0322de3eff41a5553f1f9820117a8262f0ebcd3640a' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
    {"result": [
     ""
    ]}
  • To get the list of power users:
    [lxplus411] /afs/cern.ch/user/m/mmascher > curl -X GET 'https://mmascher-poc/crabcache/info?subresource=powerusers' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
    {"result": [
     "mmascher"
    ]}
  • To get the quota each user has (power users has 10* this value):
    [lxplus411] /afs/cern.ch/user/m/mmascher > curl -X GET 'https://mmascher-poc/crabcache/info?subresource=basicquota' --key $X509_USER_PROXY --cert $X509_USER_PROXY -k
    {"result": [
     {"quota_user_limit": 10}
    ]}
 

Development Oriented Tricks I did not know where to put

Revision 142014-05-12 - MohdAdliBinMDAli

Line: 1 to 1
 
META TOPICPARENT name="CRAB"

Operator Task Management

Line: 14 to 14
 
  • To deny a user with local unix user name cms123, add the following line to .../config.d/99_banned_users.conf (create if it does not exist):
    DENY_WRITE = cms123@*, $(DENY_WRITE) 
    Reload the schedd by sending it SIGHUP.
  • To list all jobs in a task named 140221_015428_vocms20:bbockelm_crab_lxplus414_user_17, run:
    condor_q -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"'
  • To kill a task named 140221_015428_vocms20:bbockelm_crab_lxplus414_user_17, run:
    condor_hold -const 'CRAB_ReqName =?= "140221_015428_vocms20:bbockelm_crab_lxplus414_user_17" && TaskType =?= "ROOT"'
Changed:
<
<
  • To find the HTCondor working directory (different from the web monitor directory; contains things like the raw logfile, postjob, proxy, input files, etc), run:
    condor_q -const 'CRAB_ReqName =?= "140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"' -af Iwd
>
>
  • To find the HTCondor working (spool) directory (different from the web monitor directory; contains things like the raw logfile, postjob, proxy, input files, etc), run:
    condor_q -const 'CRAB_ReqName =?= "140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"' -af Iwd
 
  • To find job one in the task, run:
    condor_q -const 'CRAB_ReqName =?= "140221_015428_vocms20:bbockelm_crab_lxplus414_user_17" && CRAB_Id =?= 1'
    You can replace condor_q with condor_history for completed jobs. If you want just the history for retry 2,
    condor_history -match 1 -const 'CRAB_ReqName =?= "140221_015428_vocms20:bbockelm_crab_lxplus414_user_17" && CRAB_Id =?= 1 && CRAB_Retry =?= 2'
  • To list all tasks for user
    condor_q cms293
  • To list specific task configuration
    condor_q -l 179330.0
    or if it does not exist, then it should be taken from history:
    condor_history -l 179330.0 

Revision 132014-05-08 - JustasBalcas

Line: 1 to 1
 
META TOPICPARENT name="CRAB"

Operator Task Management

Line: 22 to 22
 
  • To see location of Master log or Schedd log :
    condor_config_val MASTER_LOG    OR      condor_config_val SCHEDD_LOG
  • To see all the schedds known in the glidein-collector.t2.ucsd.edu collector:
    condor_status -pool glidein-collector.t2.ucsd.edu -schedd
  • To see all running by username :
    condor_q -name crab3test-2@vocms95.cern.ch -name crab3test@submit-5.t2.ucsd.edu  -name vocms20.cern.ch -format '%s\n' AccountingGroup | sort | uniq -c
Changed:
<
<
  • To see all running in the vocms20 and submit-5 schedds:
     condor_q -name crab3test-2@vocms95.cern.ch -name crab3test@submit-5.t2.ucsd.edu  -name vocms20.cern.ch -const '(JobStatus=?=2)' | grep gWMS-CMSRun | wc -l #to count running jobs 
    Job Status explanation :
          0	Unexpanded	U
          1	Idle	I
          2	Running	R
          3	Removed	X
          4	Completed	C
          5	Held	H
          6	Submission_err	E
    
>
>
  • To see all running in the vocms20 and submit-5 schedds:
     condor_q -name crab3test-2@vocms95.cern.ch -name crab3test@submit-5.t2.ucsd.edu -name vocms20.cern.ch -const '(JobStatus=?=2)' | grep gWMS-CMSRun | wc -l #to count running jobs 
    Job Status explanation :
0 Unexpanded U
1 Idle I
2 Running R
3 Removed X
4 Completed C
5 Held H
6 Submission_err E
 
  • Users sometimes find that their jobs do not run. There are several reasons why a specific job does not run. These reasons range from failed job or machine constraints, bias due to preferences, insufficient priority, or the preemption ``throttle'' that is implemented by the condor_negotiator to prevent thrashing. Many of these reasons can be diagnosed by using the -analyze option of condor_q.
    condor_q 184439.0 -analyze
    or if it is not enough information, do
    condor_q 184439.0 -better-analyze

Monitoring links

Changed:
<
<
>
>
 
Changed:
<
<
>
>
 

Development Oriented Tricks I did not know where to put

Changed:
<
<
    1. Look at the Glidemon and find out the cluster id you want to run
    2. Go to the spool directory of the task
    3. Copy the relevant part in RunJobs.dag (see point below).
    4. Run it! Don't just copy and paste the part below, but substitute it to what you find in your RunJobs.dag. Remember to change the first four number after POSTJOB to something like this ... POSTJOB 1198863 0 0 10 ... where 1198863 is the cluster id (jobid):
      sudo -u cms1287 sh -c 'export TEST_DONT_REDIRECT_STDOUT=True;export X509_USER_PROXY=602e17194771641967ee6db7e7b3ffe358a54c59;sh dag_bootstrap.sh POSTJOB 1198863 0 0 10 mmascher-poc.cern.ch /crabserver/dev/filemetadata 140327_144908_vocms20:mmascher_crab_test11 1 140327_144908_crab_test11-6c2e61e17ebc64360cffbba7b17d735a CMSSW_5_3_4 T2_IT_Legnaro /store/temp/user/mmascher.a51a1e1d7e1bbd3799d579c676a09fb91de3cc23/GenericTTbar/140327_144908_crab_test11/140327_144908/0000 /store/user/mmascher/GenericTTbar/140327_144908_crab_test11/140327_144908/0000 cmsRun_1.log.tar.gz outfile_1.root'
>
>
    1. Look at the Glidemon and find out the cluster id you want to run
    2. Go to the spool directory of the task
    3. Copy the relevant part in RunJobs.dag (see point below).
    4. Run it! Don't just copy and paste the part below, but substitute it to what you find in your RunJobs.dag. Remember to change the first four number after POSTJOB to something like this ... POSTJOB 1198863 0 0 10 ... where 1198863 is the cluster id (jobid):
      sudo -u cms1287 sh -c 'export TEST_DONT_REDIRECT_STDOUT=True;export X509_USER_PROXY=602e17194771641967ee6db7e7b3ffe358a54c59;sh dag_bootstrap.sh POSTJOB 1198863 0 0 10 mmascher-poc.cern.ch /crabserver/dev/filemetadata 140327_144908_vocms20:mmascher_crab_test11 1 140327_144908_crab_test11-6c2e61e17ebc64360cffbba7b17d735a CMSSW_5_3_4 T2_IT_Legnaro /store/temp/user/mmascher.a51a1e1d7e1bbd3799d579c676a09fb91de3cc23/GenericTTbar/140327_144908_crab_test11/140327_144908/0000 /store/user/mmascher/GenericTTbar/140327_144908_crab_test11/140327_144908/0000 cmsRun_1.log.tar.gz outfile_1.root'
 

Scripts to test new schedd

Changed:
<
<
Requirements : On Schedd :
>
>
Requirements : On Schedd :
 
  • Transfer users proxy to the new schedd
  • Create sleep.jdl file :
Line: 83 to 79
 
condor_submit sleep.jdl
Changed:
<
<
  • condor_q -> This command will print a table containing information about the jobs that have submitted. You should be able to identify the jobs that you have just submitted. After some time, your job will complete. To see the output, cat the files that we set as the output.
>
>
  • condor_q -> This command will print a table containing information about the jobs that have submitted. You should be able to identify the jobs that you have just submitted. After some time, your job will complete. To see the output, cat the files that we set as the output.
 From TW machine (The machine name must be added to SCHEDD.ALLOW_WRITE):
  • Create same sleep.jdl file as on Schedd.
  • Create same sleep.sh file as on Schedd.

Revision 122014-04-23 - JustasBalcas

Line: 1 to 1
 
META TOPICPARENT name="CRAB"

Operator Task Management

Line: 65 to 65
 +DESIRED_Sites = "T3_US_PuertoRico,T2_FI_HIP,T2_UK_SGrid_RALPP,T2_FR_GRIF_LLR,T3_UK_London_QMUL,T3_TW_NTU_HEP,T3_US_Omaha,T2_KR_KNU,T2_RU_SINP,T3_US_UMD,T2_CH_CERN_AI,T3_US_Colorado,T3_US_UB,T1_UK_RAL_Disk,T3_IT_Napoli,T3_NZ_UOA,T2_TH_CUNSTDA,T3_US_Kansas,T3_US_ParrotTest,T3_GR_IASA,T3_US_Parrot,T2_IT_Bari,T2_US_UCSD,T1_RU_JINR,T3_US_Vanderbilt_EC2,T2_RU_IHEP,T2_RU_RRC_KI,T2_CH_CERN,T3_BY_NCPHEP,T3_US_TTU,T3_GR_Demokritos,T3_US_UTENN,T3_US_UCR,T3_TW_NCU,T2_CH_CSCS,T2_UA_KIPT,T3_RU_FIAN,T2_IN_TIFR,T3_UK_London_UCL,T3_US_Brown,T3_US_UCD,T3_CO_Uniandes,T3_KR_KNU,T2_FR_IPHC,T3_US_OSU,T3_US_TAMU,T1_US_FNAL,T2_IT_Rome,T2_UK_London_Brunel,T3_IN_PUHEP,T3_IT_Trieste,T2_EE_Estonia,T3_UK_ScotGrid_ECDF,T2_CN_Beijing,T2_US_Florida,T3_US_Princeton_ICSE,T3_IT_MIB,T3_US_FNALXEN,T1_DE_KIT,T3_IR_IPM,T2_US_Wisconsin,T2_HU_Budapest,T2_DE_RWTH,T2_US_Vanderbilt,T2_BR_SPRACE,T3_UK_SGrid_Oxford,T3_US_NU,T2_BR_UERJ,T3_MX_Cinvestav,T3_US_FNALLPC,T1_US_FNAL_Disk,T3_US_UIowa,T3_IT_Firenze,T3_US_Cornell,T2_ES_IFCA,T3_US_UVA,T3_ES_Oviedo,T3_US_NotreDame,T2_DE_DESY,T1_UK_RAL,T2_US_Caltech,T3_FR_IPNL,T2_TW_Taiwan,T3_US_NEU,T3_UK_London_RHUL,T0_CH_CERN,T1_RU_JINR_Disk,T3_CN_PKU,T2_UK_London_IC,T2_US_Nebraska,T2_ES_CIEMAT,T3_US_Princeton,T2_PK_NCP,T2_CH_CERN_T0,T3_US_FSU,T3_KR_UOS,T3_IT_Perugia,T3_US_Minnesota,T2_TR_METU,T2_AT_Vienna,T2_US_Purdue,T3_US_Rice,T3_HR_IRB,T2_BE_UCL,T3_US_FIT,T2_UK_SGrid_Bristol,T2_PT_NCG_Lisbon,T1_ES_PIC,T3_US_JHU,T2_IT_Legnaro,T2_RU_INR,T3_US_FIU,T3_EU_Parrot,T2_RU_JINR,T2_IT_Pisa,T3_UK_ScotGrid_GLA,T3_US_MIT,T2_CH_CERN_HLT,T2_MY_UPM_BIRUNI,T1_FR_CCIN2P3,T2_FR_GRIF_IRFU,T3_US_UMiss,T2_FR_CCIN2P3,T2_PL_Warsaw,T3_AS_Parrot,T2_US_MIT,T2_BE_IIHE,T2_RU_ITEP,T1_CH_CERN,T3_CH_PSI,T3_IT_Bologna" requirements = stringListMember(GLIDEIN_CMSSite,DESIRED_Sites) should_transfer_files = YES
Added:
>
>
RequestMemory = 2000
when_to_transfer_output = ON_EXIT
Queue 1
x509userproxy = /home/grid/cms1761/temp/user_proxy_file

Revision 112014-04-21 - BrianBockelman

Line: 1 to 1
 
META TOPICPARENT name="CRAB"

Operator Task Management

Line: 99 to 99
  sed --in-place "s|x509userproxysubject = .*|x509userproxysubject = \"$userDN\"|" sleep.jdl sed --in-place "s|x509userproxy = .*|x509userproxy = $proxypath|" sleep.jdl # Fill the new schedd information
Added:
>
>
_condor_TOOL_DEBUG=D_FULLDEBUG,D_SECURITY
condor_submit -debug -pool glidein-collector-2.t2.ucsd.edu -remote crab3test-2@vocms95.cern.ch sleep.jdl

Revision 102014-04-11 - JustasBalcas

Line: 1 to 1
 
META TOPICPARENT name="CRAB"

Operator Task Management

Line: 90 to 90
 
  • Execute following commands :
#IF on SLC5
Changed:
<
<
> source /afs/cern.ch/cms/LCG/LCG-2/UI/cms_ui_env.sh
>
>
source /afs/cern.ch/cms/LCG/LCG-2/UI/cms_ui_env.sh
 #sourcing TW env will move you to current folder
Changed:
<
<
> source /data/srv/TaskManager/env.sh
> voms-proxy-init -voms cms
> userDN=$(voms-proxy-info | grep 'issuer\ *:' | sed 's/issuer *: //')
> proxypath=$(voms-proxy-info | grep 'path\ *:' | sed 's/path *: //')
> sed --in-place "s|x509userproxysubject = .*|x509userproxysubject = \"$userDN\"|" sleep.jdl
> sed --in-place "s|x509userproxy = .*|x509userproxy = $proxypath|" sleep.jdl
> condor_submit -debug -pool glidein-collector-2.t2.ucsd.edu -remote crab3test-2@vocms95NOSPAMPLEASE.cern.ch sleep.jdl
>
>
source /data/srv/TaskManager/env.sh
voms-proxy-init -voms cms
userDN=$(voms-proxy-info | grep 'issuer\ *:' | sed 's/issuer *: //')
proxypath=$(voms-proxy-info | grep 'path\ *:' | sed 's/path *: //')
sed --in-place "s|x509userproxysubject = .*|x509userproxysubject = \"$userDN\"|" sleep.jdl
sed --in-place "s|x509userproxy = .*|x509userproxy = $proxypath|" sleep.jdl
# Fill the new schedd information
condor_submit -debug -pool glidein-collector-2.t2.ucsd.edu -remote crab3test-2@vocms95.cern.ch sleep.jdl
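To check that the remote submission actually reached the target schedd, one can query it back through the same collector (pool and schedd names as in the command above):
condor_q -pool glidein-collector-2.t2.ucsd.edu -name crab3test-2@vocms95.cern.ch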
 

Revision 92014-04-04 - JustasBalcas

Line: 1 to 1
 
META TOPICPARENT name="CRAB"

Operator Task Management

Line: 22 to 22
 
  • To see location of Master log or Schedd log :
    condor_config_val MASTER_LOG    OR      condor_config_val SCHEDD_LOG
  • To see all the schedds known in the glidein-collector.t2.ucsd.edu collector:
    condor_status -pool glidein-collector.t2.ucsd.edu -schedd
  • To see all running by username :
    condor_q -name crab3test-2@vocms95.cern.ch -name crab3test@submit-5.t2.ucsd.edu  -name vocms20.cern.ch -format '%s\n' AccountingGroup | sort | uniq -c
Changed:
<
<
>
>
  • To see all running in the vocms20 and submit-5 schedds:
     condor_q -name crab3test-2@vocms95.cern.ch -name crab3test@submit-5.t2.ucsd.edu  -name vocms20.cern.ch -const '(JobStatus=?=2)' | grep gWMS-CMSRun | wc -l #to count running jobs 
    Job Status explanation :
          0	Unexpanded	U
          1	Idle	I
          2	Running	R
          3	Removed	X
          4	Completed	C
          5	Held	H
          6	Submission_err	E
    
  • Users sometimes find that their jobs do not run. There are several reasons why a specific job does not run. These reasons range from failed job or machine constraints, bias due to preferences, insufficient priority, or the preemption "throttle" that is implemented by the condor_negotiator to prevent thrashing. Many of these reasons can be diagnosed by using the -analyze option of condor_q.
    condor_q 184439.0 -analyze
    or if it is not enough information, do
    condor_q 184439.0 -better-analyze
 

Monitoring links

Line: 74 to 84
 
  • condor_q -> This command will print a table containing information about the jobs that have submitted. You should be able to identify the jobs that you have just submitted. After some time, your job will complete. To see the output, cat the files that we set as the output.
Added:
>
>
From TW machine (The machine name must be added to SCHEDD.ALLOW_WRITE):
  • Create same sleep.jdl file as on Schedd.
  • Create same sleep.sh file as on Schedd.
  • Execute following commands :
#IF on SLC5
> source /afs/cern.ch/cms/LCG/LCG-2/UI/cms_ui_env.sh
#sourcing TW env will move you to current folder
> source /data/srv/TaskManager/env.sh
> voms-proxy-init -voms cms
> userDN=$(voms-proxy-info | grep 'issuer\ *:' | sed 's/issuer *: //')
> proxypath=$(voms-proxy-info | grep 'path\ *:' | sed 's/path *: //')
> sed --in-place "s|x509userproxysubject = .*|x509userproxysubject = \"$userDN\"|" sleep.jdl
> sed --in-place "s|x509userproxy = .*|x509userproxy = $proxypath|" sleep.jdl
> condor_submit -debug -pool glidein-collector-2.t2.ucsd.edu -remote crab3test-2@vocms95.cern.ch sleep.jdl

Revision 82014-04-04 - JustasBalcas

Line: 1 to 1
 
META TOPICPARENT name="CRAB"

Operator Task Management

Line: 38 to 38
 
    1. Go to the spool directory of the task
    2. Copy the relevant part in RunJobs.dag (see point below).
    3. Run it! Don't just copy and paste the part below, but substitute it to what you find in your RunJobs.dag. Remember to change the first four number after POSTJOB to something like this ... POSTJOB 1198863 0 0 10 ... where 1198863 is the cluster id (jobid):
      sudo -u cms1287 sh -c 'export TEST_DONT_REDIRECT_STDOUT=True;export X509_USER_PROXY=602e17194771641967ee6db7e7b3ffe358a54c59;sh dag_bootstrap.sh POSTJOB 1198863 0 0 10 mmascher-poc.cern.ch /crabserver/dev/filemetadata 140327_144908_vocms20:mmascher_crab_test11 1 140327_144908_crab_test11-6c2e61e17ebc64360cffbba7b17d735a CMSSW_5_3_4 T2_IT_Legnaro /store/temp/user/mmascher.a51a1e1d7e1bbd3799d579c676a09fb91de3cc23/GenericTTbar/140327_144908_crab_test11/140327_144908/0000 /store/user/mmascher/GenericTTbar/140327_144908_crab_test11/140327_144908/0000 cmsRun_1.log.tar.gz outfile_1.root'
Added:
>
>

Scripts to test new schedd

Requirements : On Schedd :

  • Transfer users proxy to the new schedd
  • Create sleep.jdl file :
Universe   = vanilla
Executable = sleep.sh
Arguments  = 1
Log        = sleep.PC.log
Output     = sleep.out.$(Cluster).$(Process)
Error      = sleep.err.$(Cluster).$(Process)
+DESIRED_Sites =  "T3_US_PuertoRico,T2_FI_HIP,T2_UK_SGrid_RALPP,T2_FR_GRIF_LLR,T3_UK_London_QMUL,T3_TW_NTU_HEP,T3_US_Omaha,T2_KR_KNU,T2_RU_SINP,T3_US_UMD,T2_CH_CERN_AI,T3_US_Colorado,T3_US_UB,T1_UK_RAL_Disk,T3_IT_Napoli,T3_NZ_UOA,T2_TH_CUNSTDA,T3_US_Kansas,T3_US_ParrotTest,T3_GR_IASA,T3_US_Parrot,T2_IT_Bari,T2_US_UCSD,T1_RU_JINR,T3_US_Vanderbilt_EC2,T2_RU_IHEP,T2_RU_RRC_KI,T2_CH_CERN,T3_BY_NCPHEP,T3_US_TTU,T3_GR_Demokritos,T3_US_UTENN,T3_US_UCR,T3_TW_NCU,T2_CH_CSCS,T2_UA_KIPT,T3_RU_FIAN,T2_IN_TIFR,T3_UK_London_UCL,T3_US_Brown,T3_US_UCD,T3_CO_Uniandes,T3_KR_KNU,T2_FR_IPHC,T3_US_OSU,T3_US_TAMU,T1_US_FNAL,T2_IT_Rome,T2_UK_London_Brunel,T3_IN_PUHEP,T3_IT_Trieste,T2_EE_Estonia,T3_UK_ScotGrid_ECDF,T2_CN_Beijing,T2_US_Florida,T3_US_Princeton_ICSE,T3_IT_MIB,T3_US_FNALXEN,T1_DE_KIT,T3_IR_IPM,T2_US_Wisconsin,T2_HU_Budapest,T2_DE_RWTH,T2_US_Vanderbilt,T2_BR_SPRACE,T3_UK_SGrid_Oxford,T3_US_NU,T2_BR_UERJ,T3_MX_Cinvestav,T3_US_FNALLPC,T1_US_FNAL_Disk,T3_US_UIowa,T3_IT_Firenze,T3_US_Cornell,T2_ES_IFCA,T3_US_UVA,T3_ES_Oviedo,T3_US_NotreDame,T2_DE_DESY,T1_UK_RAL,T2_US_Caltech,T3_FR_IPNL,T2_TW_Taiwan,T3_US_NEU,T3_UK_London_RHUL,T0_CH_CERN,T1_RU_JINR_Disk,T3_CN_PKU,T2_UK_London_IC,T2_US_Nebraska,T2_ES_CIEMAT,T3_US_Princeton,T2_PK_NCP,T2_CH_CERN_T0,T3_US_FSU,T3_KR_UOS,T3_IT_Perugia,T3_US_Minnesota,T2_TR_METU,T2_AT_Vienna,T2_US_Purdue,T3_US_Rice,T3_HR_IRB,T2_BE_UCL,T3_US_FIT,T2_UK_SGrid_Bristol,T2_PT_NCG_Lisbon,T1_ES_PIC,T3_US_JHU,T2_IT_Legnaro,T2_RU_INR,T3_US_FIU,T3_EU_Parrot,T2_RU_JINR,T2_IT_Pisa,T3_UK_ScotGrid_GLA,T3_US_MIT,T2_CH_CERN_HLT,T2_MY_UPM_BIRUNI,T1_FR_CCIN2P3,T2_FR_GRIF_IRFU,T3_US_UMiss,T2_FR_CCIN2P3,T2_PL_Warsaw,T3_AS_Parrot,T2_US_MIT,T2_BE_IIHE,T2_RU_ITEP,T1_CH_CERN,T3_CH_PSI,T3_IT_Bologna"
requirements = stringListMember(GLIDEIN_CMSSite,DESIRED_Sites)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Queue 1
x509userproxy = /home/grid/cms1761/temp/user_proxy_file
x509userproxysubject = "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jbalcas/CN=751133/CN=Justas Balcas"
  • Create sleep.sh
#!/bin/bash
sleep $1
echo "I slept for $1 seconds on:"
hostname
date
  • Run command on schedd
condor_submit sleep.jdl
  • condor_q -> This command will print a table containing information about the jobs that have submitted. You should be able to identify the jobs that you have just submitted. After some time, your job will complete. To see the output, cat the files that we set as the output.
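Once the jobs complete, a quick sketch to inspect the results (sleep.sh above prints the hostname; LastRemoteHost is a standard job attribute):
cat sleep.out.*                                            # "I slept for ...", hostname, date
condor_history -format '%d ' ClusterId -format '%s\n' LastRemoteHost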

Revision 72014-04-04 - JustasBalcas

Line: 1 to 1
 
META TOPICPARENT name="CRAB"

Operator Task Management

Line: 24 to 24
 
  • To see all running jobs by username:
    condor_q -name crab3test-2@vocms95.cern.ch -name crab3test@submit-5.t2.ucsd.edu  -name vocms20.cern.ch -format '%s\n' AccountingGroup | sort | uniq -c
  • To see all running jobs in the vocms20 and submit-5 schedds: condor_q -name vocms20.cern.ch -name crab3test@submit-5.t2.ucsd.edu -const '(JobStatus=?=2)' | grep gWMS-CMSRun | wc -l #to count running jobs
Added:
>
>

Monitoring links

 

Development Oriented Tricks I did not know where to put

Revision 62014-04-04 - MarcoMascheroni

Line: 1 to 1
 
META TOPICPARENT name="CRAB"

Operator Task Management

Line: 20 to 20
 
  • To list a specific task's configuration:
    condor_q -l 179330.0
    or, if it no longer exists, take it from the history:
    condor_history -l 179330.0 
  • To see the list of users allowed to write to the schedd:
    condor_config_val SCHEDD.ALLOW_WRITE
  • To see the location of the Master log or the Schedd log:
    condor_config_val MASTER_LOG    OR      condor_config_val SCHEDD_LOG
Changed:
<
<
  • To see condor status :
    condor_status -pool glidein-collector.t2.ucsd.edu -schedd
>
>
  • To see all the schedds known to the glidein-collector.t2.ucsd.edu collector:
    condor_status -pool glidein-collector.t2.ucsd.edu -schedd
 
  • To see all running jobs by username:
    condor_q -name crab3test-2@vocms95.cern.ch -name crab3test@submit-5.t2.ucsd.edu  -name vocms20.cern.ch -format '%s\n' AccountingGroup | sort | uniq -c
Added:
>
>
 

Development Oriented Tricks I did not know where to put

Revision 52014-04-04 - JustasBalcas

Line: 1 to 1
 
META TOPICPARENT name="CRAB"

Operator Task Management

Line: 16 to 16
 
  • To kill a task named 140221_015428_vocms20:bbockelm_crab_lxplus414_user_17, run:
    condor_hold -const 'CRAB_ReqName =?= "140221_015428_vocms20:bbockelm_crab_lxplus414_user_17" && TaskType =?= "ROOT"'
  • To find the HTCondor working directory (different from the web monitor directory; contains things like the raw logfile, postjob, proxy, input files, etc), run:
    condor_q -const 'CRAB_ReqName =?= "140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"' -af Iwd
  • To find job one in the task, run:
    condor_q -const 'CRAB_ReqName =?= "140221_015428_vocms20:bbockelm_crab_lxplus414_user_17" && CRAB_Id =?= 1'
    You can replace condor_q with condor_history for completed jobs. If you want just the history for retry 2,
    condor_history -match 1 -const 'CRAB_ReqName =?= "140221_015428_vocms20:bbockelm_crab_lxplus414_user_17" && CRAB_Id =?= 1 && CRAB_Retry =?= 2'
Added:
>
>
  • To list all tasks for a user:
    condor_q cms293
  • To list a specific task's configuration:
    condor_q -l 179330.0
    or, if it no longer exists, take it from the history:
    condor_history -l 179330.0 
  • To see the list of users allowed to write to the schedd:
    condor_config_val SCHEDD.ALLOW_WRITE
  • To see the location of the Master log or the Schedd log:
    condor_config_val MASTER_LOG    OR      condor_config_val SCHEDD_LOG
  • To see condor status :
    condor_status -pool glidein-collector.t2.ucsd.edu -schedd
  • To see all running jobs by username:
    condor_q -name crab3test-2@vocms95.cern.ch -name crab3test@submit-5.t2.ucsd.edu  -name vocms20.cern.ch -format '%s\n' AccountingGroup | sort | uniq -c
 

Development Oriented Tricks I did not know where to put

Revision 42014-03-28 - MarcoMascheroni

Line: 1 to 1
 
META TOPICPARENT name="CRAB"

Operator Task Management

Line: 23 to 23
 
    1. Look at the Glidemon and find out the cluster id you want to run
    2. Go to the spool directory of the task
    3. Copy the relevant part of RunJobs.dag (see the point below).
Deleted:
<
<
    1. Run it! Don't just copy and paste the part below, but substitute it to what you find in your RunJobs.dag. Remember to change this ... POSTJOB 1198863 0 0 10 ... where 1198863 is the cluster id (jobid):
      sudo -u cms1287 sh -c 'export TEST_DONT_REDIRECT_STDOUT=True;export X509_USER_PROXY=602e17194771641967ee6db7e7b3ffe358a54c59;sh dag_bootstrap.sh POSTJOB 1198863 0 0 10 mmascher-poc.cern.ch /crabserver/dev/filemetadata 140327_144908_vocms20:mmascher_crab_test11 1 140327_144908_crab_test11-6c2e61e17ebc64360cffbba7b17d735a CMSSW_5_3_4 T2_IT_Legnaro /store/temp/user/mmascher.a51a1e1d7e1bbd3799d579c676a09fb91de3cc23/GenericTTbar/140327_144908_crab_test11/140327_144908/0000 /store/user/mmascher/GenericTTbar/140327_144908_crab_test11/140327_144908/0000 cmsRun_1.log.tar.gz outfile_1.root'
Added:
>
>
    1. Run it! Don't just copy and paste the part below, but substitute it with what you find in your RunJobs.dag. Remember to change the first four numbers after POSTJOB, i.e. something like ... POSTJOB 1198863 0 0 10 ... where 1198863 is the cluster id (jobid):
      sudo -u cms1287 sh -c 'export TEST_DONT_REDIRECT_STDOUT=True;export X509_USER_PROXY=602e17194771641967ee6db7e7b3ffe358a54c59;sh dag_bootstrap.sh POSTJOB 1198863 0 0 10 mmascher-poc.cern.ch /crabserver/dev/filemetadata 140327_144908_vocms20:mmascher_crab_test11 1 140327_144908_crab_test11-6c2e61e17ebc64360cffbba7b17d735a CMSSW_5_3_4 T2_IT_Legnaro /store/temp/user/mmascher.a51a1e1d7e1bbd3799d579c676a09fb91de3cc23/GenericTTbar/140327_144908_crab_test11/140327_144908/0000 /store/user/mmascher/GenericTTbar/140327_144908_crab_test11/140327_144908/0000 cmsRun_1.log.tar.gz outfile_1.root'

Revision 32014-03-28 - MarcoMascheroni

Line: 1 to 1
 
META TOPICPARENT name="CRAB"

Operator Task Management

Line: 16 to 16
 
  • To kill a task named 140221_015428_vocms20:bbockelm_crab_lxplus414_user_17, run:
    condor_hold -const 'CRAB_ReqName =?= "140221_015428_vocms20:bbockelm_crab_lxplus414_user_17" && TaskType =?= "ROOT"'
  • To find the HTCondor working directory (different from the web monitor directory; contains things like the raw logfile, postjob, proxy, input files, etc), run:
    condor_q -const 'CRAB_ReqName =?= "140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"' -af Iwd
  • To find job one in the task, run:
    condor_q -const 'CRAB_ReqName =?= "140221_015428_vocms20:bbockelm_crab_lxplus414_user_17" && CRAB_Id =?= 1'
    You can replace condor_q with condor_history for completed jobs. If you want just the history for retry 2,
    condor_history -match 1 -const 'CRAB_ReqName =?= "140221_015428_vocms20:bbockelm_crab_lxplus414_user_17" && CRAB_Id =?= 1 && CRAB_Retry =?= 2'
Added:
>
>

Development Oriented Tricks I did not know where to put

  • To run a PostJob on the schedd:
    1. Look at the Glidemon and find out the cluster id you want to run
    2. Go to the spool directory of the task
    3. Copy the relevant part of RunJobs.dag (see the point below).
    4. Run it! Don't just copy and paste the part below, but substitute it to what you find in your RunJobs.dag. Remember to change this ... POSTJOB 1198863 0 0 10 ... where 1198863 is the cluster id (jobid):
      sudo -u cms1287 sh -c 'export TEST_DONT_REDIRECT_STDOUT=True;export X509_USER_PROXY=602e17194771641967ee6db7e7b3ffe358a54c59;sh dag_bootstrap.sh POSTJOB 1198863 0 0 10 mmascher-poc.cern.ch /crabserver/dev/filemetadata 140327_144908_vocms20:mmascher_crab_test11 1 140327_144908_crab_test11-6c2e61e17ebc64360cffbba7b17d735a CMSSW_5_3_4 T2_IT_Legnaro /store/temp/user/mmascher.a51a1e1d7e1bbd3799d579c676a09fb91de3cc23/GenericTTbar/140327_144908_crab_test11/140327_144908/0000 /store/user/mmascher/GenericTTbar/140327_144908_crab_test11/140327_144908/0000 cmsRun_1.log.tar.gz outfile_1.root'

Revision 22014-03-25 - MarcoMascheroni

Line: 1 to 1
 
META TOPICPARENT name="CRAB"
Deleted:
<
<
 

Operator Task Management

Helpful tips, tricks, and one-line commands for operators to navigate CRAB3.

Line: 15 to 14
 
  • To deny a user with local unix user name cms123, add the following line to .../config.d/99_banned_users.conf (create if it does not exist):
    DENY_WRITE = cms123@*, $(DENY_WRITE) 
    Reload the schedd by sending it SIGHUP.
  • To list all jobs in a task named 140221_015428_vocms20:bbockelm_crab_lxplus414_user_17, run:
    condor_q -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"'
  • To kill a task named 140221_015428_vocms20:bbockelm_crab_lxplus414_user_17, run:
    condor_hold -const 'CRAB_ReqName =?= "140221_015428_vocms20:bbockelm_crab_lxplus414_user_17" && TaskType =?= "ROOT"'
Changed:
<
<
  • To find the HTCondor working directory (different from the web monitor directory; contains things like the raw logfile, postjob, proxy, input files, etc), run:
    condor_q -const 'CRAB_ReqName =?= "140221_015428_vocms20:bbockelm_crab_lxplus414_user_17" -af Iwd'
>
>
  • To find the HTCondor working directory (different from the web monitor directory; contains things like the raw logfile, postjob, proxy, input files, etc), run:
    condor_q -const 'CRAB_ReqName =?= "140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"' -af Iwd
 
  • To find job one in the task, run:
    condor_q -const 'CRAB_ReqName =?= "140221_015428_vocms20:bbockelm_crab_lxplus414_user_17" && CRAB_Id =?= 1'
    You can replace condor_q with condor_history for completed jobs. If you want just the history for retry 2,
    condor_history -match 1 -const 'CRAB_ReqName =?= "140221_015428_vocms20:bbockelm_crab_lxplus414_user_17" && CRAB_Id =?= 1 && CRAB_Retry =?= 2'

Revision 12014-02-21 - BrianBockelman

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="CRAB"

Operator Task Management

Helpful tips, tricks, and one-line commands for operators to navigate CRAB3.

Finding what you need

  • Get the full name of the task from the user; this will tell you which schedd to look at.
  • Refer to this page to find the vocmsXXX server and log file locations.

On the schedd

  • To deny a user with local unix user name cms123, add the following line to .../config.d/99_banned_users.conf (create if it does not exist):
    DENY_WRITE = cms123@*, $(DENY_WRITE) 
    Reload the schedd by sending it SIGHUP.
  • To list all jobs in a task named 140221_015428_vocms20:bbockelm_crab_lxplus414_user_17, run:
    condor_q -const 'CRAB_ReqName=?="140221_015428_vocms20:bbockelm_crab_lxplus414_user_17"'
  • To kill a task named 140221_015428_vocms20:bbockelm_crab_lxplus414_user_17, run:
    condor_hold -const 'CRAB_ReqName =?= "140221_015428_vocms20:bbockelm_crab_lxplus414_user_17" && TaskType =?= "ROOT"'
  • To find the HTCondor working directory (different from the web monitor directory; contains things like the raw logfile, postjob, proxy, input files, etc), run:
    condor_q -const 'CRAB_ReqName =?= "140221_015428_vocms20:bbockelm_crab_lxplus414_user_17" -af Iwd'
  • To find job one in the task, run:
    condor_q -const 'CRAB_ReqName =?= "140221_015428_vocms20:bbockelm_crab_lxplus414_user_17" && CRAB_Id =?= 1'
    You can replace condor_q with condor_history for completed jobs. If you want just the history for retry 2,
    condor_history -match 1 -const 'CRAB_ReqName =?= "140221_015428_vocms20:bbockelm_crab_lxplus414_user_17" && CRAB_Id =?= 1 && CRAB_Retry =?= 2'
 