Condor Commands
Is Condor running?
To check whether HTCondor or any of its daemons (condor_master, condor_schedd, condor_collector, condor_negotiator) is running, search the process list for the daemon name. For example:
ps aux | grep condor_master
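The same check can be wrapped in a small helper that reports on each daemon in turn. This is only a sketch: the function name check_condor_daemons is hypothetical (it is not an HTCondor command), and it assumes pgrep is installed on the machine.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether each HTCondor daemon has a running process.
# Assumes pgrep is installed. -f matches against the full command line, which
# avoids the 15-character process-name limit that would truncate
# "condor_negotiator".
check_condor_daemons() {
    local daemon
    for daemon in condor_master condor_schedd condor_collector condor_negotiator; do
        if pgrep -f "$daemon" > /dev/null 2>&1; then
            echo "$daemon: running"
        else
            echo "$daemon: not running"
        fi
    done
}
```

Calling check_condor_daemons prints one line per daemon, so a missing condor_master is visible at a glance.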
Checking pool status
condor_status
| command | description | example |
| condor_status -master | List machines, but only names (status and slots are not shown) | |
| condor_status -avail | List the slots that are not busy and could run HTCondor jobs at this moment | |
| condor_status -run | List the slots that are currently running jobs and show related information (owner of each job, machine it was submitted from, etc.) | |
| condor_status -state -total | Show a summary according to the state of each slot | |
| condor_status -submitters | Show the current general status per submitter, such as the number of running, idle and held jobs | |
| condor_status machine | Show the status of a specific machine | |
| condor_status -sort Memory | Sort slots by Memory; other attributes can be used as well | |
| condor_status -server | List attributes of slots, such as memory, disk, load, flops, etc. | |
| condor_status -schedd | Show the list of schedds attached to the collector; can be run from both a schedd and a collector | NOTE: if more than one collector is attached, a random collector is chosen for the query! |
| condor_status -schedd -pool (collector) | Show the schedds attached to that collector only | condor_status -schedd -pool neut.cern.ch |
| condor_status -neg | Show the negotiator currently in use; can be used to identify which pool you are in | |
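Several of the condor_status options can be combined in one query. The sketch below lists free slots that have at least a given amount of memory. The function name free_slots_with_memory is hypothetical, and the example assumes that the Memory attribute is reported in MB and that -avail, -constraint, -sort and -format can be used together; it has not been checked against a live pool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list free slots with at least a given amount of memory.
# Assumption: the slot attribute Memory is an integer number of MB.
free_slots_with_memory() {
    local min_mb="$1"
    condor_status -avail \
        -constraint "Memory >= ${min_mb}" \
        -sort Memory \
        -format '%s ' Name \
        -format '%d MB\n' Memory
}
```

For example, free_slots_with_memory 2048 would print the name and memory of every available slot with 2048 MB or more.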
Submitting Jobs
- The submit directory has to be accessible from all machines.
- Make sure you run condor_submit from a directory found on all machines (such as your own home directory tree).
- Once you find your job is running on a particular machine, it is tempting to log in and see how it is doing (with ps, top, or vmstat). Resist! If Condor sees a user log in to its machine, it will suspend your Condor job there and won't restart it until 15 minutes after you have logged out.
condor_submit
| command | description | example |
| condor_submit submit_file -dry-run dest_file | Parse the submit file and save all the related info (names and locations of input and output files after expanding all variables, value of requirements, etc.) to dest_file without submitting any real jobs. Highly recommended when debugging, or before a real submission if you have modified your submit file and are not sure the changes will work. | |
| condor_submit submit_file 'var=value' | Add or modify variable(s) at submission time, without changing the submit file. For instance, if your submit file uses queue $(N), then condor_submit submit_file 'N = 10' will submit 10 jobs. Several var=value pairs can be given. | |
For more information about submitted jobs, see: Checking and managing submitted jobs
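As a sketch of how the two condor_submit forms fit together, the helper below writes a minimal submit file that ends in queue $(N), and a second helper does a dry run before the real submission. All names here (write_example_submit_file, dry_run_then_submit, the example.* output files) are hypothetical, /bin/echo is just a stand-in executable, and combining 'N = 3' with -dry-run on one command line is an assumption.

```shell
#!/usr/bin/env bash
# Write a minimal submit file to the path given as $1.
# 'EOF' is quoted so the shell leaves $(Process), $(Cluster) and $(N) untouched.
# N is deliberately not defined in the file: it is supplied at submission time.
write_example_submit_file() {
    cat > "$1" <<'EOF'
executable = /bin/echo
arguments  = $(Process)
output     = example.$(Cluster).$(Process).out
error      = example.$(Cluster).$(Process).err
log        = example.$(Cluster).log
queue $(N)
EOF
}

# Check the file with a dry run first; submit 3 jobs only if the dry run succeeds.
dry_run_then_submit() {
    local submit_file="$1"
    condor_submit "$submit_file" 'N = 3' -dry-run "${submit_file}.dryrun" || return 1
    condor_submit "$submit_file" 'N = 3'
}
```

A typical session would be write_example_submit_file example.sub followed by dry_run_then_submit example.sub, inspecting example.sub.dryrun if anything looks wrong.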
Checking and managing submitted jobs
condor_q
| command | description | example |
| condor_q | Show the current queue on this machine (schedd): my jobs in the HTCondor queue with their IDs (cluster.process), info and status. Status codes: I: idle (waiting for a machine to execute on), R: running, H: on hold (there was an error, waiting for the user's action), S: suspended, C: completed, X: removed, <: transferring input, >: transferring output | |
| condor_q -name (schedd_name) | Run on a collector: show the queue of that schedd | condor_q -name neut.cern.ch |
| condor_q -global | Show all users' jobs in the queue | |
| condor_q -analyze | Analyse a specific job and show the reason why it is in its current state (useful for jobs in Idle status: Condor shows how many slots match our restrictions and may give suggestions) | |
| condor_q -better-analyze | Analyse a specific job and show the reason why it is in its current state, giving extended info | |
| condor_q -run | Show your running jobs and related info, such as how long they have been running, on which machine, etc. | |
| condor_q -currentrun | Show the time consumed in the current run only; the cumulative time from previous executions is not counted (can be combined with -run to see only the processes running at this moment) | |
| condor_q -hold | Show only jobs in the "on hold" state and the reason for it. Held jobs are those that hit an error and could not finish. The user is expected to solve the problem and then use condor_release so that the job is tried again | |
| condor_q -l (job #) | List the ClassAds of that job. Useful for grepping | see the example below the table |
| condor_q -const 'condor_var=?="string"' | Show only the jobs matching the constraint, where condor_var equals a string | condor_q -const 'DESIRED_Sites=?="T1_US_FNAL"' shows only jobs that ask to run at FNAL |
| condor_q -const 'condor_var==5' | Show only the jobs matching the constraint, where condor_var equals a number | condor_q -const 'jobstatus==5' shows all held jobs (jobstatus 5 means held); then grep for "^holdreason = " |
| condor_q -format '\n%s' condor_var | Print condor_var for all jobs in the queue, in the given format (such as "\n%s") | |
Example for condor_q -l: condor_q -l 413.1 | grep -i glideinentryname
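The constraint, format and grep techniques above can be combined into small helpers. This is a sketch only: the function names list_held_jobs and job_attribute are hypothetical, while JobStatus, ClusterId, ProcId and HoldReason are standard job ClassAd attributes.

```shell
#!/usr/bin/env bash
# Hypothetical helpers built on the condor_q options described above.

# Print "cluster.process: reason" for every held job (JobStatus 5 means held).
list_held_jobs() {
    condor_q -constraint 'JobStatus == 5' \
        -format '%d.' ClusterId \
        -format '%d: ' ProcId \
        -format '%s\n' HoldReason
}

# Print the ClassAd lines of one job that match a pattern, ignoring case.
job_attribute() {
    local job_id="$1" pattern="$2"
    condor_q -l "$job_id" | grep -i "$pattern"
}
```

For example, job_attribute 413.1 glideinentryname is equivalent to the condor_q -l pipeline shown above.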
condor_tail, condor_release and condor_hold
| command | description | example |
| condor_tail | Display the last lines of the stdout (screen output) of a running job on a remote machine. Use it to check whether your job is working properly; it can also show the error output (stderr) or output files created by your program | |
| condor_tail -f | Keep displaying new content until interrupted with Ctrl+C | |
| condor_tail -no-stdout output_file | Show the content of an output file (output_file has to be listed in the transfer_output_files command in the submit file) | |
| condor_release -constraint constraint | Release all my held jobs that satisfy the constraint | |
| condor_release -all | Release all my held jobs | |
| condor_hold cluster_id | Hold all jobs of a specific submission | |
| condor_hold -constraint constraint | Hold all jobs that satisfy the constraint | |
| condor_hold -all | Hold all my jobs in the queue | |
Note: Jobs in the on-hold state are those that HTCondor was not able to execute properly, usually due to problems with the executable, paths, etc. If you can solve the problem by changing the input files and/or the executable, you can use condor_release to run your program again, since it will send all files to the remote machines again. If you need to change the submit file to solve the problem, condor_release will NOT work, because it does not evaluate the submit file again. In that case, use condor_qedit, or cancel all held jobs and resubmit them.
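When the problem can be fixed by changing a job attribute rather than the submit file, condor_qedit and condor_release can be chained. The sketch below uses the condor_qedit form "job ID, attribute name, new value"; the function name fix_and_release is hypothetical, and RequestMemory in the usage note is only an illustrative attribute.

```shell
#!/usr/bin/env bash
# Hypothetical helper: change one attribute of a held job, then release it.
# The job is released only if condor_qedit succeeded.
fix_and_release() {
    local job_id="$1" attribute="$2" value="$3"
    condor_qedit "$job_id" "$attribute" "$value" && condor_release "$job_id"
}
```

For example, fix_and_release 413.1 RequestMemory 2048 would raise the memory request of job 413.1 and release it. String values need ClassAd quoting, for instance '"some string"'.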
condor_rm
| command | description | example |
| condor_rm (job #) | Remove job # | condor_rm 417.0, or condor_rm 417 for the whole cluster |
| condor_rm (cluster #) | Remove all jobs of a specific submission | |
| condor_rm -const (expr) | Remove all jobs matching the constraint expr | condor_rm -const 'jobstatus==5' removes all held jobs; condor_rm -const 'desired_sites=?="T2_CH_CERN"' removes jobs asking to go to CERN (only) |
| condor_rm -all | Remove all my jobs from the queue | |
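Because condor_rm with a constraint can remove many jobs at once, it may be worth listing the matching jobs first. The helper below is a sketch of that idea; the name preview_removal is hypothetical, and it only reads the queue with condor_q.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the IDs (cluster.process) of all jobs matching a
# constraint, so the list can be checked before running condor_rm.
preview_removal() {
    local constraint="$1"
    condor_q -constraint "$constraint" \
        -format '%d.' ClusterId \
        -format '%d\n' ProcId
}
```

For example, run preview_removal 'JobStatus == 5' and, if the list looks right, follow it with condor_rm -const 'JobStatus == 5'.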
Getting info from logs
condor_history
| command | description | example |
| condor_history | Show all completed jobs to date (it has to be run on the same machine where the submission was done). Can be used like condor_q (-l, -const, -format...) | |
| condor_history -userlog file.log | List the basic information recorded in the log files (use condor_logview file.log to see the information graphically). Accepts -l, -const, -format... | |
| condor_userlog file.log | Show and summarize job statistics from job log files (those created when using the log command in the submit file) | |
| condor_logview file.log | This is not an original HTCondor command: we have created this link to a script that displays graphically the information contained in the log of your executions | |
condor_history -long XXX.YYY | grep LastRemoteHost: show the machine where job XXX.YYY was executed
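The LastRemoteHost lookup can be kept as a one-line helper. This is a sketch; the function name last_remote_host is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the LastRemoteHost line of a finished job,
# i.e. the machine on which the job was last executed.
last_remote_host() {
    local job_id="$1"
    condor_history -long "$job_id" | grep '^LastRemoteHost'
}
```

For example, last_remote_host 413.1 prints the LastRemoteHost attribute of job 413.1, provided the job is still in the history of the machine it was submitted from.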
Note: There is also an online tool to analyze your log files and get more information: HTCondor Log Analyzer (http://condorlog.cse.nd.edu/)
Other Commands
condor_submit_dag dag_file: Submit a DAG file, used to describe jobs with dependencies
condor_version: Print the version of HTCondor.
condor_qedit: use this command to modify the attributes of a job placed on the queue. This may be useful when you need to change some of the parameters specified in the submit file without re-submitting jobs.
condor_compile: Relink a program with HTCondor libraries so it can be used in the standard universe, where checkpoints are enabled. Relinked programs can also be executed as a standalone checkpointing executable, which means that you can run them directly in your shell (no HTCondor submission is needed) and create specific or periodic checkpoints that allow you to recover the execution in case of problems.
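To illustrate the kind of file condor_submit_dag expects, the sketch below writes a two-node DAG in which job B depends on job A. The helper name write_example_dag and the submit file names a.sub and b.sub are placeholders; JOB and PARENT ... CHILD are DAGMan keywords.

```shell
#!/usr/bin/env bash
# Write a two-node DAG to the path given as $1.
# Job B starts only after job A has finished successfully.
# a.sub and b.sub are placeholder submit files that must already exist.
write_example_dag() {
    cat > "$1" <<'EOF'
JOB A a.sub
JOB B b.sub
PARENT A CHILD B
EOF
}
```

After write_example_dag example.dag, the workflow would be submitted with condor_submit_dag example.dag.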
Changing Priorities
Note: condor_userprio shows a list of machine priorities; it is not an actual representation of which machines are in the pool or even exist. condor_userprio will hold on to old entries of machines that are no longer needed (in another pool, another user, decommissioned, etc.). To remove an old machine's priority listing, use condor_userprio -delete (user@old_machine).
HTCondor manual
HTCondor manual link
Topic revision: r7 - 2017-03-28
- NectarB