Batch, Signals and gLite WMS
Most batch systems send one or more warning signals before killing a job, whatever the reason for the kill (for example, exceeding the requested resources).
Since the gLite 3.1 WMS was released, various signals are caught by the JobWrapper to try to provide a log to the user.
Batch Systems
| Batch | Signals | Notes |
| Torque | SIGTERM sleep 2s SIGKILL | The time can be increased with a server kill_delay configuration. |
| LSF | SIGINT sleep 10s SIGTERM sleep 10s SIGKILL | Can be configured. |
| SGE | SIGKILL | Anything can be introduced, such as a SIGTERM and delay, via a termination_method configuration. |
| Condor | SIGTERM sleep ? SIGKILL | There seems to be a configuration, but the default is not known. |
| SLURM | SIGTERM sleep 30s SIGKILL | KillWait is configurable in slurm.conf. |
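The Torque and SLURM rows describe the same pattern: SIGTERM, a grace period, then SIGKILL. The following is a minimal Python sketch of that pattern, not code from any batch system. The names terminate_with_grace and run_demo are made up for illustration, the child process is a hypothetical stand-in for a job, and the sketch returns as soon as the child exits rather than always sleeping the full delay.

```python
import signal
import subprocess
import sys


def terminate_with_grace(proc, kill_delay):
    """Send SIGTERM, wait up to kill_delay seconds, then send SIGKILL.

    Returns the child's return code; on POSIX this is the negative
    signal number when the child was killed by a signal.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=kill_delay)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored
        return proc.wait()


def run_demo(ignore_sigterm, kill_delay=1.0):
    # Hypothetical stand-in for a job: sleeps, optionally ignoring SIGTERM.
    code = "import signal, time\n"
    if ignore_sigterm:
        code += "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    code += "print('ready', flush=True)\ntime.sleep(60)\n"
    proc = subprocess.Popen([sys.executable, "-c", code],
                            stdout=subprocess.PIPE)
    proc.stdout.readline()  # wait until the child has finished its setup
    rc = terminate_with_grace(proc, kill_delay)
    proc.stdout.close()
    return rc


if __name__ == "__main__":
    print("default SIGTERM handling:", run_demo(False))
    print("SIGTERM ignored:", run_demo(True))
```

A job that does nothing special dies on the SIGTERM; a job that ignores or outlasts it is removed by the SIGKILL once the delay has passed.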
gLite WMS jobWrapper
From WMS 3.1, traps were added to the job wrapper scripts to do sensible things when various signals are received. See BUG:17509 and, currently, PATCH:2597.
| Signal | gLite 3.1 | gLite 3.2 |
| SIGTERM | fatal_error | fatal_error |
| SIGXCPU | fatal_error | fatal_error |
| SIGINT | fatal_error | fatal_error |
| SIGQUIT | | fatal_error |
When a fatal_error is executed:
- "Job has been terminated by the batch system (SIGTERM)" is written to the maradona log.
- "Job has been terminated by the batch system (SIGTERM)" is submitted to the L&B service.
- In the case of WMS 3.2 an attempt is made to return the Output Sandbox.
- An attempt is made to return the maradona output.
- It exits so as to return an "Abort" to the user as the job status.
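The trap-and-report behaviour listed above can be sketched in Python as follows. This is an illustration only: the real job wrapper is a script shipped with the WMS and is not reproduced here. The names TRAPPED, termination_message, make_handler and install_traps are hypothetical, the report callback stands in for the maradona log and the L&B service, and substituting the name of the received signal into the message is an assumption.

```python
import signal
import sys

# Signals listed for gLite 3.2 in the table above.
TRAPPED = (signal.SIGTERM, signal.SIGXCPU, signal.SIGINT, signal.SIGQUIT)


def termination_message(signum):
    # Assumption: the signal name in the message follows the signal received.
    name = signal.Signals(signum).name
    return "Job has been terminated by the batch system (%s)" % name


def make_handler(report):
    """Build a fatal_error-style handler that reports and then exits."""
    def fatal_error(signum, frame):
        # Stand-in for writing to the maradona log and notifying L&B.
        report(termination_message(signum))
        # Exit non-zero; the real wrapper arranges for an "Abort" status.
        sys.exit(1)
    return fatal_error


def install_traps(report=print):
    """Install the handler for every trapped signal (main thread only)."""
    handler = make_handler(report)
    for sig in TRAPPED:
        signal.signal(sig, handler)
    return handler
```

Everything the handler does has to finish before the batch system follows up with SIGKILL, which cannot be trapped.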
Experimentation
20 jobs, each containing a 45-minute sleep, were submitted to a 30-minute queue at Glasgow.
- All the jobs ended up in state "Aborted"; they appear to have been put there at 30 minutes by the wrapper script.
- All the jobs failed to get the maradona log back.
- All but two of the jobs had a "Job has been terminated by the batch system (SIGTERM)" record in Logging and Bookkeeping (L&B).
Presumably 2 seconds is long enough, most of the time, to send an L&B message. After setting kill_delay to 10s at Glasgow,
all 20 jobs came back with an Abort and all had the "Job has been terminated by the batch system (SIGTERM)" message. Arranging for all sites to increase this parameter
makes sense, and it is a very simple change to make.
PATCH:2935 requests this change for YAIM.
Even with the 10-second delay, however, the maradona output still failed to come back.
-- SteveTraylen - 23 Mar 2009