Batch Systems, Signals and gLite WMS

Most batch systems send some warning signals before killing a job for whatever reason, including exceeding the requested resources. Since the release of the gLite 3.1 WMS, various signals are caught by the JobWrapper to try to provide a log to the user.

Batch Systems

| Batch system | Signals | Notes |
| LSF | SIGINT, sleep 10s, SIGTERM, sleep 10s, SIGKILL | Can be configured. |
| SGE | SIGKILL | Anything can be introduced, such as a SIGTERM and a delay, via a terminate_method configuration. |
| Torque | SIGTERM, sleep 2s, SIGKILL | The time can be increased with a server kill_delay configuration. |
| SLURM | SIGTERM, sleep 30s, SIGKILL | KillWait is configurable in slurm.conf. |
| Condor | SIGTERM, sleep ?, SIGKILL | There seems to be a configuration option, but the default is not known. |
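
The escalation sequences in the table can be reproduced against a local process to see how a job reacts. The following is only a minimal Python sketch for a POSIX system, not what any batch system actually runs: the `SEQUENCES` dictionary simply transcribes the table's default values, and `escalate` and its `scale` parameter are hypothetical names introduced here for illustration.

```python
import signal
import subprocess

# (signal, seconds to wait afterwards), transcribed from the table above.
# SGE (plain SIGKILL) and Condor (unknown delay) are left out.
SEQUENCES = {
    "LSF": [(signal.SIGINT, 10), (signal.SIGTERM, 10), (signal.SIGKILL, 0)],
    "Torque": [(signal.SIGTERM, 2), (signal.SIGKILL, 0)],
    "SLURM": [(signal.SIGTERM, 30), (signal.SIGKILL, 0)],
}


def escalate(proc, sequence, scale=1.0):
    """Send each signal in turn until the process exits.

    Returns the list of signals that were actually sent. `scale`
    shrinks or stretches the delays, which is handy for quick tests.
    """
    sent = []
    for sig, delay in sequence:
        if proc.poll() is not None:
            break  # the process has already gone away
        proc.send_signal(sig)
        sent.append(sig)
        try:
            # Returns early if the process exits within the grace period.
            proc.wait(timeout=delay * scale)
        except subprocess.TimeoutExpired:
            pass  # still alive: move on to the next, harsher signal
    proc.wait()  # reap the process after the final signal
    return sent
```

A process that does not trap SIGTERM dies on the first signal of the Torque or SLURM sequence, so the length of the delay only matters for jobs (or wrappers) that install a trap.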

gLite WMS jobWrapper

From WMS 3.1 onwards, traps were added to the job wrapper scripts so that various signals are handled gracefully when they are received. See BUG:17509 and, currently, PATCH:2597.

| Signal | gLite 3.1 | gLite 3.2 |
| SIGTERM | fatal_error | fatal_error |
| SIGXCPU | fatal_error | fatal_error |
| SIGINT | fatal_error | fatal_error |
| SIGQUIT | - | fatal_error |

When a fatal_error is executed:

  1. "Job has been terminated by the batch system (SIGTERM)" is written to the maradona log.
  2. "Job has been terminated by the batch system (SIGTERM)" is sent to the L&B service.
  3. In the case of WMS 3.2, an attempt is made to return the Output Sandbox.
  4. An attempt is made to return the maradona output.
  5. The wrapper exits so as to return an "Abort" to the user as the job status.
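
The trap-and-report pattern described in the steps above can be sketched as follows. This is an illustrative Python analogue only, not the actual job wrapper code: `make_fatal_error`, `install`, and the `log` and `report` callables are hypothetical stand-ins for the maradona log and the L&B service, the exit code of 1 is an assumption, and so is the idea that the signal name in the message follows the signal received.

```python
import signal
import sys

# Signals trapped according to the table above (SIGQUIT only in gLite 3.2).
TRAPPED = [signal.SIGTERM, signal.SIGXCPU, signal.SIGINT, signal.SIGQUIT]


def make_fatal_error(log, report):
    """Build a signal handler that mimics the fatal_error steps.

    `log` and `report` are hypothetical callables standing in for
    writing to the maradona log and sending an L&B event.
    """
    def fatal_error(signum, frame):
        name = signal.Signals(signum).name
        message = f"Job has been terminated by the batch system ({name})"
        log(message)     # step 1: maradona log
        report(message)  # step 2: L&B service
        # Steps 3 and 4 (returning the sandbox and maradona output)
        # are omitted from this sketch.
        sys.exit(1)      # step 5: exit code is an assumption
    return fatal_error


def install(handler):
    """Register the same handler for every trapped signal."""
    for sig in TRAPPED:
        signal.signal(sig, handler)
```

Everything the handler does has to finish before the batch system follows up with SIGKILL, which cannot be trapped.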

Experimentation

20 jobs, each containing a 45-minute sleep, were submitted to a 30-minute queue at Glasgow.
  • All the jobs ended up in state "Aborted" and were put there at 30 minutes, it seems, by the wrapper script.
  • All the jobs failed to get the maradona log back.
  • All the jobs but two contained a "Job has been terminated by the batch system (SIGTERM)" record in the Logging and Bookkeeping (L&B) service.

Presumably 2 seconds is long enough most of the time to send an L&B message, but not always. After setting the kill_delay to 10s at Glasgow, all 20 jobs came back with an Abort and all had the "Job termination" message. Arranging for all sites to increase this parameter makes sense, and it is an incredibly simple change to make. PATCH:2935 requests this change for YAIM.
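
The race between the wrapper's report and the batch system's SIGKILL can be illustrated with a small local experiment. The sketch below is a hedged Python model for a POSIX system, not a measurement of the real services: the child script, the `message_sent` helper, and the sleep standing in for the L&B call are all hypothetical, and real reporting times are not known here.

```python
import signal
import subprocess
import sys

# Child: trap SIGTERM, spend some time "sending the L&B message",
# then exit cleanly. The duration is passed as the first argument.
CHILD = """
import signal, sys, time
def on_term(signum, frame):
    time.sleep(float(sys.argv[1]))  # stand-in for the L&B call
    sys.exit(0)
signal.signal(signal.SIGTERM, on_term)
print("ready", flush=True)
time.sleep(60)
"""


def message_sent(report_time, kill_delay):
    """Return True if the child finished reporting before SIGKILL."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, str(report_time)],
        stdout=subprocess.PIPE,
        text=True,
    )
    proc.stdout.readline()  # wait until the trap is installed
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=kill_delay)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL once the grace period has expired
        proc.wait()
    proc.stdout.close()
    return proc.returncode == 0
```

For example, `message_sent(0.2, 2.0)` models a report that fits inside the grace period, while `message_sent(1.0, 0.2)` models one that is cut off by SIGKILL.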

Even with the 10-second delay, though, the maradona output still failed to come back.

-- SteveTraylen - 23 Mar 2009

Topic revision: r6 - 2014-02-03 - AndrejFilipcic
 