Job Recovery

Introduction

The Panda pilot job recovery scans the local disk for old, 'lost' job directories and log files. A lost job is identified by a job state file that has not been modified recently. When the pilot finds one, it tries to recover the job and updates the Panda server accordingly.

Algorithm

[will be updated] The Panda pilot job recovery scans all work directories found on the local disk for existing job state files. When it finds such a file, it first checks when the file was last updated. If the last update was less than two heartbeats ago (i.e. 60 minutes), the pilot skips the file for now; the next pilot arriving at the node will pick it up again later. If the last update was more than two heartbeats ago, the pilot tries to read the file. If the work directory of the lost job still exists, the pilot tries to create the log file. If the work directory does not exist, the pilot looks for the compressed work directory, i.e. the log, and registers the log if it finds one. If there are no work directories or log files, the job state is set to failed with a pilot exit code of 1153 (lost job never finished). Before ending the iteration of the job state file loop, the pilot updates the Panda server with the now known fate of the lost job and cleans up the old work directory and job state file (stored in the site work directory), but only if the log or data files were transferred to the local DDM. The algorithm is executed at pilot startup.
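
The decision made for each job state file can be sketched roughly as follows. This is a minimal illustration and not the pilot's actual code: the function names, the action labels, and the 30-minute heartbeat constant (inferred from "two heartbeats ... i.e. 60 minutes") are assumptions made for this example only.

```python
import os
import time

# Assumption: one heartbeat is 30 minutes, since the text equates
# two heartbeats with 60 minutes.
HEARTBEAT_SECONDS = 30 * 60
STALE_AFTER_SECONDS = 2 * HEARTBEAT_SECONDS


def is_stale(state_file, now=None):
    """True if the job state file was last modified more than two heartbeats ago."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(state_file) > STALE_AFTER_SECONDS


def decide_action(state_file, work_dir, log_file, now=None):
    """Return a hypothetical action label for one job state file.

    'skip'         : updated recently, leave it for a later pilot
    'create_log'   : work directory still exists, build the log from it
    'register_log' : only the compressed work directory (the log) is left
    'fail'         : nothing left, job is failed (exit code 1153 in the text)
    """
    if not is_stale(state_file, now):
        return "skip"
    if os.path.isdir(work_dir):
        return "create_log"
    if os.path.isfile(log_file):
        return "register_log"
    return "fail"
```

In a real recovery loop, the returned label would be followed by the server update and the conditional cleanup described above; those steps are left out here because the source does not specify their interfaces.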

Activity diagram for the job recovery algorithm. [Draft 9, implemented in pilot v SPOCK beta]

Job Recovery Error Codes

Error Code  Description
1153        Job Recovery: Lost job did not finish - lost forever
1154        Job Recovery: Failed to re-register log file
1155        Job Recovery: Failed to move output files
1156        Job Recovery: Missing log and work dir
1157        Job Recovery: Could not create log
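
For scripts that need to interpret these codes, the table can be held in a simple lookup. The sketch below is only an illustration: the names JOB_RECOVERY_ERRORS and describe are hypothetical and are not taken from the pilot source.

```python
# Job recovery exit codes and descriptions, copied from the table above.
JOB_RECOVERY_ERRORS = {
    1153: "Lost job did not finish - lost forever",
    1154: "Failed to re-register log file",
    1155: "Failed to move output files",
    1156: "Missing log and work dir",
    1157: "Could not create log",
}


def describe(code):
    """Return the table text for a job recovery code, or a fallback message."""
    text = JOB_RECOVERY_ERRORS.get(code)
    if text is None:
        return "Unknown error code %d" % code
    return "Job Recovery: " + text
```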


Major updates:
-- PaulNilsson - 06 Oct 2006



Responsible:


Topic revision: r7 - 2006-11-16 - PaulNilssonSecondary
 