Batch, Signals and gLite WMS

Most batch systems send warning signals before killing a job, whatever the reason (including exceeding the requested resources). Since the release of the gLite 3.1 WMS, various signals are caught by the JobWrapper to try to provide a log to the user.

Batch Systems

| Batch | Signals | Notes |
| Torque | SIGTERM, sleep 2s, SIGKILL | The delay can be increased with the server kill_delay configuration. |
| LSF | SIGINT, sleep 10s, SIGTERM, sleep 10s, SIGKILL | Can be configured. |
| SGE | SIGKILL | Anything can be introduced, such as a SIGTERM and a delay, via a termination_method configuration. |
| Condor | SIGTERM, sleep ?, SIGKILL | There seems to be a configuration option, but the default is unknown. |
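
The signal sequences listed above can be reproduced locally to see how a job reacts before it is submitted to a real queue. The following is a minimal sketch, assuming a POSIX system. The names `SEQUENCES`, `escalate` and `scale` are hypothetical and exist only for this illustration; they are not part of any batch system. Condor is left out because its delay is not known here.

```python
import signal
import subprocess

# Signal sequences as listed above: (signal, seconds to wait afterwards).
# These are the defaults described in this page; sites may configure others.
SEQUENCES = {
    "torque": [(signal.SIGTERM, 2), (signal.SIGKILL, 0)],
    "lsf": [(signal.SIGINT, 10), (signal.SIGTERM, 10), (signal.SIGKILL, 0)],
    "sge": [(signal.SIGKILL, 0)],
}


def escalate(proc, sequence, scale=1.0):
    """Send each signal in turn, waiting up to its delay for the process to exit.

    `proc` is a subprocess.Popen object. `scale` shortens or lengthens the
    delays (hypothetical convenience for quick tests). Every sequence above
    ends in SIGKILL, so the final wait() is expected to return.
    Returns the list of signals that were actually sent.
    """
    sent = []
    for sig, delay in sequence:
        if proc.poll() is not None:
            break  # the process has already exited
        proc.send_signal(sig)
        sent.append(sig)
        try:
            proc.wait(timeout=delay * scale)
        except subprocess.TimeoutExpired:
            pass  # still running, move on to the next signal
    proc.wait()  # reap the process
    return sent
```

For example, `escalate(subprocess.Popen(["sleep", "60"]), SEQUENCES["torque"])` sends SIGTERM and returns as soon as the process has exited; a process that ignores SIGTERM receives SIGKILL after the 2 second delay.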

gLite WMS jobWrapper

From WMS 3.1 onwards, traps were added to the job wrapper scripts to do sensible things when various signals are received. See BUG:17509 and, currently, PATCH:2597.

| Signal | gLite 3.1 | gLite 3.2 | HEAD |
| SIGTERM | fatal_error | fatal_error | fatal_error |
| SIGXCPU | fatal_error | fatal_error | fatal_error |
| SIGINT | fatal_error | fatal_error | fatal_error |
| SIGQUIT | | fatal_error | fatal_error |

When a fatal_error is executed:

  1. "Job has been terminated by the batch system (SIGTERM)" is written to the maradona log.
  2. "Job has been terminated by the batch system (SIGTERM)" is logged to the L&B service.
  3. An attempt is made to return the maradona output.
  4. The wrapper exits so that "Aborted" is returned to the user as the job status.
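
The trap behaviour described above can be sketched as follows. This is a Python analogue for illustration only, not the wrapper's actual code (the real job wrapper is a generated script). It assumes a POSIX system and assumes that the message names whichever signal was received. `TRAPPED`, `termination_message` and `install_traps` are hypothetical names, and standard error stands in for the maradona log.

```python
import signal
import sys

# Signals listed above as trapped in gLite 3.2 and HEAD.
TRAPPED = (signal.SIGTERM, signal.SIGXCPU, signal.SIGINT, signal.SIGQUIT)


def termination_message(signum):
    """Build the message reported for a given signal number."""
    name = signal.Signals(signum).name
    return f"Job has been terminated by the batch system ({name})"


def fatal_error(signum, frame):
    """Hypothetical handler: report the signal, then exit with a failure status."""
    message = termination_message(signum)
    # Step 1: write to a local log (stderr stands in for the maradona log).
    print(message, file=sys.stderr, flush=True)
    # Steps 2 and 3 (logging to L&B, returning the maradona output) need
    # network calls and are deliberately omitted from this sketch.
    # Step 4: exit non-zero so that the job is reported as failed.
    sys.exit(1)


def install_traps():
    """Register fatal_error for every trapped signal (main thread only)."""
    for sig in TRAPPED:
        signal.signal(sig, fatal_error)
```

Calling `install_traps()` at the start of a script makes any of the four signals end the process through `fatal_error`. Whatever the handler does has to finish before the batch system follows up with SIGKILL, which cannot be trapped.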

Experimentation

20 jobs, each containing a 45-minute sleep, were submitted to a 30-minute queue at Glasgow.
  • All the jobs are now in state "Aborted"; they appear to have been put there at the 30-minute mark by the wrapper script.
  • None of the jobs returned Maradona's log.
  • All but two of the jobs had a "Job has been terminated by the batch system (SIGTERM)" record in the Logging and Bookkeeping (L&B) service.

Presumably 2 seconds is long enough most of the time to send an L&B message but not to return Maradona's logs.

-- SteveTraylen - 23 Mar 2009

Topic revision: r3 - 2009-03-24 - SteveTraylen
 