If you download and install Ganga locally and want to use the local batch system at your site, you may need to tune the configuration of the batch handlers provided by Ganga.
The standard batch backends in Ganga are all very similar and differ only in their configuration. The configuration specifies the batch commands to run and how to interpret their output. The output of these commands may vary depending on how the batch system is set up at your site, so this is usually what you have to tune.
Technically, backends such as LSF, PBS and SGE are all derived from the same Batch base class.
Default configuration values
The predefined configuration of the batch handlers works out of the box at CERN and is shown below.
config.LSF
LSF : internal LSF command line interface
jobid_name = 'LSB_BATCH_JID'
jobnameopt = 'J'
kill_res_pattern = '(^Job <\\d+> is being terminated)|(Job <\\d+>: Job has already finished)|(Job <\\d+>: No matching job found)'
kill_str = 'bkill %s'
postexecute = "\ndef filefilter(fn):\n # FILTER OUT Batch INTERNAL INPUT/OUTPUT FILES: \n # 10 digits . any number of digits . err or out\n import re\n internals = re.compile(r'\\d{10}\\.\\d+.(out|err)')\n return internals.match(fn) or fn == '.Batch.start'\n"
preexecute = '\n'
queue_name = 'LSB_QUEUE'
shared_python_executable = False
submit_res_pattern = '^Job <(?P<id>\\d*)> is submitted to .*queue <(?P<queue>\\S*)>'
submit_str = 'cd %s; bsub %s %s %s %s'
timeout = 300
config.PBS
PBS : internal PBS command line interface
jobid_name = 'PBS_JOBID'
jobnameopt = 'N'
kill_res_pattern = '(^$)|(qdel: Unknown Job Id)'
kill_str = 'qdel %s'
postexecute = '\nenv = os.environ\njobnumid = env["PBS_JOBID"]\nos.chdir("/tmp/")\nos.system("rm -rf /tmp/%s/" %jobnumid) \n'
preexecute = '\nenv = os.environ\njobnumid = env["PBS_JOBID"]\nos.system("mkdir /tmp/%s/" %jobnumid)\nos.chdir("/tmp/%s/" %jobnumid)\nos.environ["PATH"]+=":."\n'
queue_name = 'PBS_QUEUE'
shared_python_executable = False
submit_res_pattern = '^(?P<id>\\d*)\\.pbs\\s*'
submit_str = 'cd %s; qsub %s %s %s %s'
timeout = 300
config.SGE
SGE : internal SGE command line interface
jobid_name = 'JOB_ID'
jobnameopt = 'N'
kill_res_pattern = '(has registered the job +\\d+ +for deletion)|(denied: job +"\\d+" +does not exist)'
kill_str = 'qdel %s'
postexecute = ''
preexecute = 'os.chdir(os.environ["TMPDIR"])\nos.environ["PATH"]+=":."'
queue_name = 'QUEUE'
shared_python_executable = False
submit_res_pattern = 'Your job (?P<id>\\d+) (.+)'
submit_str = 'cd %s; qsub -cwd -V %s %s %s %s'
timeout = 300
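To change any of these values at your site, override them in the corresponding section of your .gangarc (or the site-wide configuration file). As a sketch, assuming a hypothetical PBS farm whose qsub reports job ids ending in .myfarm.example.org, the override could look like this (note the single backslashes and the lack of quotes, as discussed in the pitfalls section below):
[PBS]
submit_res_pattern = ^(?P<id>\d*)\.myfarm\.example\.org\s*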
Meaning of batch configuration parameters
Some of the configuration parameters are regular expression (regex) patterns. See the standard Python re module: http://docs.python.org/lib/module-re.html
- submit_str
- Command to submit a job. The occurrences of %s are replaced, in order, by: temporary directory, queue option, stderr, stdout, script command (illustrated in the sketch after this list).
- submit_res_pattern
- Regex pattern used to parse the output of the submit command. The (?P<id>AAA) group captures the batch job id, where AAA is usually \d+, meaning one or more digits. The pattern is also used to extract other parameters, such as the queue.
- kill_str
- Command to kill a job. The %s is replaced by the job batch id.
- kill_res_pattern
- Regex pattern used to parse the output of the kill command. The (?P<id>\d+) pattern is used for the job id number. The kill_res_pattern must be an empty string if your batch system kills the job silently.
- preexecute
- Python code executed before the job wrapper script runs on the worker node.
- postexecute
- Python code executed after the job wrapper script has run on the worker node.
- queue_name
- Name of the environment variable which contains the queue name.
- jobid_name
- Name of the environment variable which contains the batch job id.
- shared_python_executable
- Do not touch this unless you understand what it means. If True, the Python interpreter used by the Ganga client is also used by the wrapper script on the worker node. If False, the default Python interpreter defined on the worker node is picked up.
- timeout
- Timeout in seconds after which a job is declared killed if it has not touched its heartbeat file. The heartbeat file is touched every 30 s, so do not set this below 120 or so.
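As an illustration of how these parameters fit together, here is a minimal Python sketch of the kind of substitution and parsing that takes place. This is not Ganga's actual code; the directory, file names, queue option and answer string below are made up:
import re

# PBS defaults from the configuration shown above
submit_str = 'cd %s; qsub %s %s %s %s'
submit_res_pattern = r'^(?P<id>\d*)\.pbs\s*'

# The five %s placeholders are filled, in order, with the temporary
# directory, the queue option, stderr, stdout and the script command
# (the concrete values below are hypothetical).
command = submit_str % ('/tmp/workdir', '-q short', '-e stderr.txt',
                        '-o stdout.txt', './myscript.sh')
print(command)

# Ganga runs the command and parses its output with submit_res_pattern;
# here a typical PBS answer is parsed by hand instead of submitting.
answer = '13579.pbs\n'
match = re.compile(submit_res_pattern, re.M).search(answer)
print(match.group('id'))   # prints 13579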
Configuring your batch backend
This is a recipe for getting started with the configuration of your batch backend.
Create a simple shell script, for example test_batch.sh:
echo 'Current dir is ' `pwd`
python -V
env
Submit this shell script to your batch system:
- LSF
- bsub test_batch.sh
- PBS
- qsub test_batch.sh
- SGE
- qsub test_batch.sh
The answer from the batch system varies depending on the type of batch system and its settings.
Let us suppose you have an answer like this:
Job 13579 is submitted to queue SHORT at the.best.farm.uk
You need to define submit_res_pattern so that it extracts the id number and queue name from this answer string:
submit_res_pattern = ^Job (?P<id>\d+) is submitted to queue (?P<queue>\S+) at .*
This means:
- ^
- match only at the beginning of the string
- Job
- literal part of the string
- (?P<id>\d+)
- 1 or more decimal digits, which will be used as id number of job
- is submitted to queue
- literal part of the string
- (?P<queue>\S+)
- 1 or more non-whitespace characters, which will be used as queue name of job
- at
- literal part of the string
- .*
- any remaining characters
For more information see the standard Python re module: http://docs.python.org/lib/module-re.html
To check the correctness of the pattern, run in Python:
>>> import re
>>> answer = 'Job 13579 is submitted to queue SHORT at the.best.farm.uk'
>>> pattern = '^Job (?P<id>\d+) is submitted to queue (?P<queue>\S+) at .*'
>>> re.compile(pattern,re.M).search(answer).groups()
You should get ('13579', 'SHORT') as the result.
After that you need to define kill_res_pattern. Add sleep 360 to the end of test_batch.sh. Submit this file to the batch system and then kill the batch job with:
- LSF
- bkill 13579
- PBS
- qdel 13579
- SGE
- qdel 13579
where 13579 is the batch job number.
If the batch system prints something after killing your job, you need to define kill_res_pattern accordingly. Otherwise, it must be an empty string.
Repeat the bkill (or qdel) command without submitting a job. The batch system's answer to this must also be added to kill_res_pattern:
kill_res_pattern = (Answer after first kill of job \d+)|(Answer after second kill of job \d+)
where \d+ replaces the job id.
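As with submit_res_pattern, you can check your kill pattern in Python before putting it into the configuration (the answer strings below are made up for illustration):
>>> import re
>>> pattern = r'(job \d+ has been killed)|(job \d+ does not exist)'
>>> re.compile(pattern, re.M).search('job 13579 has been killed') is not None
True
>>> re.compile(pattern, re.M).search('job 13579 does not exist') is not None
True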
Some pitfalls
- When printing the config values, each backslash ("\") is displayed as "\\". The stored value actually contains only one backslash for each "\\" shown. Writing two backslashes ("\\") means escaping the backslash character itself, so that it no longer has a special meaning to the re module.
- Now consider the fact that in Python, as in C/C++, you need to escape the backslash: "\n" means a new line, while "\\n" means a backslash followed by "n". Read more about how to handle this using raw strings at http://www.python.org/doc/2.3.5/tut/node5.html (see the short example after this list).
- When you are entering data into your site-wide configuration file, take care that strings should be entered without single quotes. See your private .gangarc file for examples.
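For example, the regex token \d+ can be written in a Python string either with an escaped backslash or with a raw string; both produce the same characters:
>>> print('\\d+')
\d+
>>> print(r'\d+')
\d+
>>> '\\d+' == r'\d+'
True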