Job Recovery

Introduction

The task of the Panda pilot job recovery is to scan the disk for old 'lost' job directories and log files. A lost job is identified by a job state file that has not been modified recently. When the pilot finds one, it tries to recover the job and updates the Panda server accordingly.

Algorithm

The Panda pilot job recovery works by scanning all work directories found on the local disk for existing job state files. When such a file is found, the algorithm first checks when it was last updated. If that was less than two heartbeats ago (i.e. 60 minutes), the file is skipped for now; it will be picked up again later by the next pilot arriving at the node. If the last update was more than two heartbeats ago, the algorithm tries to read the file.

If the work directory of the lost job still exists, the algorithm tries to create the log file. If the work directory does not exist, it looks for the compressed work directory, i.e. the log. If it finds anything, it registers the log. If there is neither a work directory nor a log file, the job state is set to failed with a pilot exit code of 1153 (lost job did not finish). Any remaining data files (from a failed DQ2 transfer) are moved to the local SE and registered.

Before the iteration of the job state file loop ends, the Panda server is updated with the now known fate of the lost job. The work directory is cleaned up after a successful recovery, and also when the lost job can never be recovered. Full implementation details can be seen in the activity diagram below.
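The staleness check and the main decision described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the pilot's actual code: the function names, the "jobState" file name prefix, and the action labels are hypothetical, and a heartbeat of 30 minutes is inferred from "two heartbeats, i.e. 60 minutes".

```python
import os
import time

# Assumption: one heartbeat is 30 minutes, since the text equates
# two heartbeats with 60 minutes.
HEARTBEAT_SECONDS = 30 * 60
STALE_AFTER = 2 * HEARTBEAT_SECONDS


def is_lost(state_file, now=None):
    """Return True if the job state file was last modified more than
    two heartbeats ago. A file updated more recently is skipped and
    left for a later pilot."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(state_file) > STALE_AFTER


def find_lost_jobs(root, prefix="jobState", now=None):
    """Walk all directories under root and collect stale job state files.
    The file name prefix is a hypothetical placeholder."""
    lost = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.startswith(prefix):
                path = os.path.join(dirpath, name)
                if is_lost(path, now):
                    lost.append(path)
    return sorted(lost)


def choose_action(workdir_exists, log_exists):
    """Pick the recovery step for one lost job (hypothetical labels)."""
    if workdir_exists:
        return "create_log"    # build the log from the work directory
    if log_exists:
        return "register_log"  # compressed work directory was found
    return "fail_1153"         # nothing left: job is lost forever
```

The sketch stops at the decision. Moving leftover data files to the local SE, updating the Panda server, and cleaning the work directory would follow for every lost job, whichever branch is taken.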

The algorithm is executed at pilot startup.

The lost job recovery can be switched on or off with the "-j True/False" pilot flag. Job state files are still maintained when recovery is switched off.
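A flag that takes the literal strings True or False needs explicit conversion, because any non-empty string (including "False") is truthy in Python. The sketch below shows one way to parse such a flag with argparse. It is not the pilot's own option handling: the function names, the `job_recovery` attribute, and the default of True are assumptions made for illustration.

```python
import argparse


def str_to_bool(value):
    """Convert the strings 'True'/'False' (any case) to a bool."""
    lowered = value.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise argparse.ArgumentTypeError("expected True or False, got %r" % value)


def parse_args(argv):
    """Parse a hypothetical pilot command line containing the -j flag."""
    parser = argparse.ArgumentParser(description="job recovery flag sketch")
    parser.add_argument(
        "-j",
        dest="job_recovery",
        type=str_to_bool,
        default=True,  # assumption: the source does not state the default
        help="switch lost job recovery on (True) or off (False)",
    )
    return parser.parse_args(argv)
```

For example, `parse_args(["-j", "False"]).job_recovery` evaluates to `False`, whereas `type=bool` would have turned the same input into `True`.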

diagram

Activity diagram for the job recovery algorithm. [Draft 12, implemented in pilot v SPOCK 1f]

Job Recovery Error Codes

Error Code   Description
1153         Job Recovery: Lost job did not finish - lost forever
1154         Job Recovery: Failed to re-register log file
1155         Job Recovery: Failed to move output files
1156         Job Recovery: Missing log and work dir
1157         Job Recovery: Could not create log
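For scripts that inspect pilot exit codes, the table above can be mirrored as a small lookup. This is a convenience sketch only: the codes and descriptions are taken from the table, while the enum, its member names, and the `describe` helper are hypothetical and not part of the pilot.

```python
from enum import IntEnum


class JobRecoveryError(IntEnum):
    """Job recovery exit codes; member names are illustrative."""
    LOST_JOB_NOT_FINISHED = 1153
    LOG_REREGISTRATION_FAILED = 1154
    OUTPUT_MOVE_FAILED = 1155
    MISSING_LOG_AND_WORKDIR = 1156
    LOG_CREATION_FAILED = 1157


DESCRIPTIONS = {
    JobRecoveryError.LOST_JOB_NOT_FINISHED:
        "Job Recovery: Lost job did not finish - lost forever",
    JobRecoveryError.LOG_REREGISTRATION_FAILED:
        "Job Recovery: Failed to re-register log file",
    JobRecoveryError.OUTPUT_MOVE_FAILED:
        "Job Recovery: Failed to move output files",
    JobRecoveryError.MISSING_LOG_AND_WORKDIR:
        "Job Recovery: Missing log and work dir",
    JobRecoveryError.LOG_CREATION_FAILED:
        "Job Recovery: Could not create log",
}


def describe(code):
    """Return the description for a job recovery error code."""
    try:
        return DESCRIPTIONS[JobRecoveryError(code)]
    except ValueError:
        # Raised by the enum for codes outside 1153-1157.
        return "Unknown job recovery error code: %d" % code
```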


Major updates:
-- PaulNilsson - 06 Oct 2006



Responsible: PaulNilsson

Topic revision: r12 - 2006-12-21 - PaulNilssonSecondary
 