SetupBatchBackends

If you download and install Ganga locally and want to use the local batch system at your site, you may need to tune the configuration of the batch handlers provided by Ganga.

The standard batch backends in Ganga are all very similar and differ only in their configuration. The configuration specifies the batch commands and the patterns used to parse their output. Since the output of these commands depends on how your batch system is set up, this is the part you will most likely have to tune.

Technically, backends such as LSF, PBS and SGE are all derived from the same Batch base class.

Default configuration values

The predefined configuration of the batch handlers works out of the box at CERN and is presented below.

config.LSF
LSF : internal LSF command line interface
     jobid_name = 'LSB_BATCH_JID'
     jobnameopt = 'J'
     kill_res_pattern = '(^Job <\\d+> is being terminated)|(Job <\\d+>: Job has already finished)|(Job <\\d+>: No matching job found)'
     kill_str = 'bkill %s'
     postexecute = "\ndef filefilter(fn):\n  # FILTER OUT Batch INTERNAL INPUT/OUTPUT FILES: \n  # 10 digits . any number of digits . err or out\n  import re\n  internals = re.compile(r'\\d{10}\\.\\d+.(out|err)')\n  return internals.match(fn) or fn == '.Batch.start'\n"
     preexecute = '\n'
     queue_name = 'LSB_QUEUE'
     shared_python_executable = False
     submit_res_pattern = '^Job <(?P<id>\\d*)> is submitted to .*queue <(?P<queue>\\S*)>'
     submit_str = 'cd %s; bsub %s %s %s %s'
     timeout = 300

config.PBS
PBS : internal PBS command line interface
     jobid_name = 'PBS_JOBID'
     jobnameopt = 'N'
     kill_res_pattern = '(^$)|(qdel: Unknown Job Id)'
     kill_str = 'qdel %s'
     postexecute = '\nenv = os.environ\njobnumid = env["PBS_JOBID"]\nos.chdir("/tmp/")\nos.system("rm -rf /tmp/%s/" %jobnumid) \n'
     preexecute = '\nenv = os.environ\njobnumid = env["PBS_JOBID"]\nos.system("mkdir /tmp/%s/" %jobnumid)\nos.chdir("/tmp/%s/" %jobnumid)\nos.environ["PATH"]+=":."\n'
     queue_name = 'PBS_QUEUE'
     shared_python_executable = False
     submit_res_pattern = '^(?P<id>\\d*)\\.pbs\\s*'
     submit_str = 'cd %s; qsub %s %s %s %s'
     timeout = 300

config.SGE
SGE : internal SGE command line interface
     jobid_name = 'JOB_ID'
     jobnameopt = 'N'
     kill_res_pattern = '(has registered the job +\\d+ +for deletion)|(denied: job +"\\d+" +does not exist)'
     kill_str = 'qdel %s'
     postexecute = ''
     preexecute = 'os.chdir(os.environ["TMPDIR"])\nos.environ["PATH"]+=":."'
     queue_name = 'QUEUE'
     shared_python_executable = False
     submit_res_pattern = 'Your job (?P<id>\\d+) (.+)'
     submit_str = 'cd %s; qsub -cwd -V %s %s %s %s'
     timeout = 300
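
The preexecute and postexecute strings in the PBS configuration above are small Python snippets run on the worker node around the job wrapper script (PBS_JOBID is set by the batch system on the worker node). Unfolded into ordinary Python with comments, they amount to something like this:

import os

# preexecute: create a per-job scratch directory under /tmp and work from there
env = os.environ
jobnumid = env["PBS_JOBID"]             # batch job id set by PBS on the worker node
os.system("mkdir /tmp/%s/" % jobnumid)  # scratch directory named after the job id
os.chdir("/tmp/%s/" % jobnumid)         # the job wrapper runs from the scratch area
os.environ["PATH"] += ":."              # allow scripts in the current directory to be found

# ... the job wrapper script runs here ...

# postexecute: leave the scratch directory and clean it up
os.chdir("/tmp/")
os.system("rm -rf /tmp/%s/" % jobnumid)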

Meaning of batch configuration parameters

Some of the configuration parameters are regular expression (regex) patterns. See the standard Python re module: http://docs.python.org/lib/module-re.html

submit_str
Command used to submit a job. The occurrences of %s are replaced by: temporary directory, queue option, stderr, stdout, script command (see the sketch after this list).
submit_res_pattern
Regex pattern used to parse the output of the submit command. (?P<id>AAA) is the pattern for the job id number, where AAA is usually \d+, meaning one or more digits. This pattern is used to extract the batch id of the job and other parameters such as the queue.
kill_str
Command used to kill a job. The %s is replaced by the batch id of the job.
kill_res_pattern
Regex pattern used to parse the output of the kill command. The (?P<id>\d+) pattern is used for the job id number. kill_res_pattern must be an empty string if your batch system kills the job silently.
preexecute
Python script executed before the job wrapper script runs.
postexecute
Python script executed after the job wrapper script has run.
queue_name
Name of the environment variable that contains the queue name.
jobid_name
Name of the environment variable that contains the batch job id.
shared_python_executable
DO NOT touch this unless you understand what it means. If True, the python interpreter used by the Ganga client is also used by the wrapper script on the worker node. If False, the default python interpreter defined on the worker node is picked up.
timeout
Timeout in seconds after which a job is declared killed if it has not touched its heartbeat file. The heartbeat file is touched every 30 s, so do not set this below about 120.
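
As an illustration of how submit_str and submit_res_pattern work together, here is a minimal sketch in plain Python. It uses the default LSF values from above; the argument values and the bsub answer string are made up for the example:

import re

# submit_str: the %s placeholders are filled, in order, with the
# temporary directory, queue option, stderr, stdout and script command
submit_str = 'cd %s; bsub %s %s %s %s'
cmd = submit_str % ('/tmp/job1', '-q 8nm', '-e stderr.txt', '-o stdout.txt', './jobwrapper.sh')
print(cmd)  # cd /tmp/job1; bsub -q 8nm -e stderr.txt -o stdout.txt ./jobwrapper.sh

# submit_res_pattern: extracts the batch id (and here also the queue)
# from whatever the submit command prints
submit_res_pattern = r'^Job <(?P<id>\d*)> is submitted to .*queue <(?P<queue>\S*)>'
answer = 'Job <13579> is submitted to default queue <8nm>.'
match = re.search(submit_res_pattern, answer, re.M)
print(match.groupdict())  # {'id': '13579', 'queue': '8nm'}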

Configuring your batch backend

This is a recipe for getting the configuration of your batch backend working.

Create a simple shell script, for example test_batch.sh:

echo 'Current dir is ' `pwd`
python -V
env

Submit this shell script to your batch system:

LSF
bsub test_batch.sh
PBS
qsub test_batch.sh
SGE
qsub test_batch.sh

The answer from the batch system varies depending on the type of batch system and its settings. Let us suppose you get an answer like this:

Job 13579 is submitted to queue SHORT at the.best.farm.uk 

You need to define a submit_res_pattern which extracts the id number and the queue name from this answer string.

submit_res_pattern = ^Job (?P<id>\d+) is submitted to queue (?P<queue>\S+) at .*

This means:

^
match only at the beginning of the string
Job
literal part of the string
(?P<id>\d+)
one or more decimal digits, used as the id number of the job
is submitted to queue
literal part of the string
(?P<queue>\S+)
one or more non-whitespace characters, used as the queue name of the job
at
literal part of the string
.*
any remaining characters

For more information see the standard Python re module: http://docs.python.org/lib/module-re.html

To check that the pattern is correct, run in Python:

>>> import re
>>> answer = 'Job 13579 is submitted to queue SHORT at the.best.farm.uk'
>>> pattern = '^Job (?P<id>\d+) is submitted to queue (?P<queue>\S+) at .*'
>>> re.compile(pattern,re.M).search(answer).groups()
('13579', 'SHORT')

You should get ('13579', 'SHORT') as the result; if not, adjust the pattern until you do.

After that you need to define kill_res_pattern. Add

sleep 360 
to the end of test_batch.sh. Submit this file to the batch system and then kill the batch job with:
LSF
bkill 13579
PBS
qdel 13579
SGE
qdel 13579
where 13579 is the batch job number. If the batch system prints something after killing your job, you need to define kill_res_pattern. Otherwise, it must be an empty string.

Repeat the bkill (qdel) command without submitting a new job. The answer from the batch system must also be added to kill_res_pattern:

kill_res_pattern = (Answer after first kill of job \d+)|(Answer after second kill of job \d+)

where \d+ replaces the job id.
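
As with submit_res_pattern, you can check the combined pattern in plain Python before putting it into the configuration. The answer strings below are just placeholders; substitute whatever your batch system actually printed after the first and second kill:

import re

# placeholder answers recorded from the two kill attempts
first_kill = 'Answer after first kill of job 13579'
second_kill = 'Answer after second kill of job 13579'

kill_res_pattern = r'(Answer after first kill of job \d+)|(Answer after second kill of job \d+)'

for answer in (first_kill, second_kill):
    assert re.compile(kill_res_pattern, re.M).search(answer), answer
print('kill_res_pattern matches both answers')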

Some pitfalls

  • When printing the config values, each "\" is printed as "\\". In the configuration you should use only one "\" for each "\\" that is printed. Using two backslashes ("\\") means escaping the backslash ("\") character itself, so that it has no special meaning to the re module.
  • Now consider the fact that in Python, as in C/C++, you need to escape the backslash: "\n" means a new line, while "\\n" means a backslash followed by "n". Read more about how to handle this using raw strings at http://www.python.org/doc/2.3.5/tut/node5.html
  • When entering data into your site-wide configuration file, take care that strings are entered without single quotes. See your private .gangarc file for examples.
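
Putting the last two points together, an entry in a site-wide configuration file or a private .gangarc could look like the sketch below: one backslash per backslash and no quotes around the string values. The section name and the patterns are just the examples used on this page, not values to copy blindly.

[PBS]
submit_res_pattern = ^Job (?P<id>\d+) is submitted to queue (?P<queue>\S+) at .*
kill_res_pattern = (Answer after first kill of job \d+)|(Answer after second kill of job \d+)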

