Job Recovery
Introduction
The Panda pilot job recovery scans the local disk for old, 'lost' job directories
and log files. A lost job is identified by a job state file that has not been
modified recently. When the pilot finds one, it tries to recover the job and updates
the Panda server with the outcome.
Algorithm
The Panda pilot job recovery scans all work directories found on the local disk for
existing job state files. When such a file is found, the algorithm first checks when
it was last updated. If that was less than two heartbeats ago (i.e. 60 minutes), the
file is skipped for now; it will be picked up later by the next pilot arriving at the
node. If the last update was more than two heartbeats ago, the algorithm tries to
read the file.
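The staleness check described above can be sketched in Python as follows. This is only an illustration, not the pilot's actual code: the function names, the `jobState-` file name prefix and the 30 minute heartbeat constant (inferred from "two heartbeats, i.e. 60 minutes") are assumptions made for the example.

```python
import os
import time

# Assumption: one heartbeat is 30 minutes, so two heartbeats are 60 minutes.
HEARTBEAT_SECONDS = 30 * 60
STALE_AFTER_SECONDS = 2 * HEARTBEAT_SECONDS


def is_stale(path, now=None):
    """Return True if the file was last modified more than two heartbeats ago."""
    if now is None:
        now = time.time()
    return (now - os.path.getmtime(path)) > STALE_AFTER_SECONDS


def find_stale_job_state_files(scan_root, prefix="jobState-", now=None):
    """Walk scan_root and collect job state files that look abandoned.

    The file name prefix is hypothetical; recently updated files are skipped
    so that a later pilot can pick them up.
    """
    stale = []
    for dirpath, _dirnames, filenames in os.walk(scan_root):
        for name in filenames:
            if not name.startswith(prefix):
                continue
            path = os.path.join(dirpath, name)
            if is_stale(path, now):
                stale.append(path)
    return sorted(stale)
```

Passing `now` explicitly makes the check easy to test without waiting for real time to pass.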
If the work directory of the lost job still exists, the algorithm tries to create the
log file from it. If the work directory does not exist, it looks for the compressed
work directory, i.e. the log. If either is found, the log is registered. If there is
neither a work directory nor a log file, the job state is set to failed with a pilot
exit code of 1153 (lost job did not finish). Any remaining data files (from a failed
DQ2 transfer) are moved to the local SE and registered. Before the iteration over the
job state file ends, the Panda server is updated with the now known fate of the lost
job. The work directory is cleaned after a successful recovery, and also when the
lost job can never be recovered. Full implementation details can be seen in the
activity diagram below.
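As a rough sketch of the decision flow for a single stale job state file, the steps can be expressed as a function that returns a plan of actions. The `RecoveryPlan` class, the function name and the action labels are hypothetical and chosen only for illustration; the activity diagram remains the reference for the real behaviour.

```python
import os
from dataclasses import dataclass, field
from typing import List, Optional

ERR_LOST_JOB_NOT_FINISHED = 1153  # exit code named in the text


@dataclass
class RecoveryPlan:
    """Hypothetical container for the steps chosen for one lost job."""
    actions: List[str] = field(default_factory=list)
    state: Optional[str] = None       # only set when the job is declared failed
    exit_code: Optional[int] = None


def plan_recovery(workdir, log_tarball, leftover_outputs):
    """Decide what to do for one lost job, following the prose description."""
    plan = RecoveryPlan()
    if os.path.isdir(workdir):
        # Work directory survived: build the log from it, then register it.
        plan.actions += ["create_log", "register_log"]
    elif os.path.isfile(log_tarball):
        # Only the compressed work directory (the log) is left.
        plan.actions.append("register_log")
    else:
        # Nothing left on disk: the job is lost.
        plan.state = "failed"
        plan.exit_code = ERR_LOST_JOB_NOT_FINISHED
    if leftover_outputs:
        # Data files left behind by a failed transfer.
        plan.actions.append("move_outputs_to_local_se")
    # The server is always told the outcome, and the work area is cleaned
    # whether the recovery succeeded or the job can never be recovered.
    plan.actions += ["update_server", "clean_workdir"]
    return plan
```

Returning a plan rather than performing the steps directly keeps the decision logic separate from file transfers and server calls, which is convenient for testing.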
The algorithm is executed at pilot startup.
The lost job recovery can be switched on or off with the "-j True/False" pilot flag.
Job state files are still maintained when recovery is switched off.
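A minimal sketch of how a "-j True/False" style flag could be parsed is shown below. It uses Python's standard `argparse` module purely for illustration; the pilot's real option handling may differ, and the default value and the names `parse_pilot_args` and `job_recovery` are assumptions.

```python
import argparse


def str_to_bool(value):
    """Convert the literal strings 'True' / 'False' (any case) to a bool."""
    text = value.strip().lower()
    if text == "true":
        return True
    if text == "false":
        return False
    raise argparse.ArgumentTypeError("expected True or False, got %r" % value)


def parse_pilot_args(argv):
    """Parse only the job recovery switch; other pilot options are ignored."""
    parser = argparse.ArgumentParser(add_help=False)
    # Assumption: recovery is enabled unless "-j False" is given.
    parser.add_argument("-j", dest="job_recovery", type=str_to_bool, default=True)
    args, _unknown = parser.parse_known_args(argv)
    return args
```

For example, `parse_pilot_args(["-j", "False"]).job_recovery` evaluates to `False`, in which case the startup scan would be skipped while job state files continue to be written.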
Activity diagram for the job recovery algorithm. [Draft 12, implemented in pilot v SPOCK 1f]
Job Recovery Error Codes
| Error Code | Description |
| 1153 | Job Recovery: Lost job did not finish - lost forever |
| 1154 | Job Recovery: Failed to re-register log file |
| 1155 | Job Recovery: Failed to move output files |
| 1156 | Job Recovery: Missing log and work dir |
| 1157 | Job Recovery: Could not create log |
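For scripts that need to interpret these codes, the table can be mirrored as a simple lookup. The dictionary and helper names are hypothetical; only the codes and descriptions come from the table.

```python
# Codes and descriptions copied from the table above.
JOB_RECOVERY_ERRORS = {
    1153: "Job Recovery: Lost job did not finish - lost forever",
    1154: "Job Recovery: Failed to re-register log file",
    1155: "Job Recovery: Failed to move output files",
    1156: "Job Recovery: Missing log and work dir",
    1157: "Job Recovery: Could not create log",
}


def describe_error(code):
    """Return the description for a job recovery error code."""
    return JOB_RECOVERY_ERRORS.get(code, "Unknown error code %d" % code)


def is_job_recovery_error(code):
    """Return True if the code belongs to the job recovery range in the table."""
    return code in JOB_RECOVERY_ERRORS
```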
Major updates:
-- PaulNilsson - 06 Oct 2006
Responsible: PaulNilsson
Topic attachments
| Type | Attachment | History | Size | Date | Who | Comment |
| jpg | PandaPilotJobRecovery2.jpg | r1 | 56.3 K | 2006-10-08 - 17:55 | UnknownUser | Job recovery algorithm, UML activity diagram, draft 2 |
| jpg | PandaPilotJobRecovery3.jpg | r1 | 64.9 K | 2006-10-08 - 21:25 | UnknownUser | Job recovery algorithm, UML activity diagram, draft 3 (implemented in pilot v SPOCK beta) |
| jpg | jobRecovery-v4.jpg | r1 | 119.3 K | 2006-11-14 - 11:08 | UnknownUser | Job recovery algorithm (Draft 8, implemented in pilot v SPOCK beta) |
| png | jobRecovery.png | r1 | 36.7 K | 2006-10-06 - 19:02 | UnknownUser | Job recovery algorithm |
| png | jobRecovery10.png | r1 | 81.2 K | 2006-11-20 - 12:18 | UnknownUser | Draft 10 |
| png | jobRecovery11.png | r1 | 85.9 K | 2006-11-20 - 16:02 | UnknownUser | Draft 11 |
| png | jobRecovery12.png | r1 | 97.4 K | 2006-12-12 - 17:59 | UnknownUser | Current implementation (SPOCK 1e) |
| png | jobRecovery5.png | r1 | 49.1 K | 2006-10-09 - 12:05 | UnknownUser | Job recovery algorithm (Draft 5, implemented in pilot v SPOCK beta) |
| png | jobRecovery5b.png | r1 | 47.5 K | 2006-10-09 - 12:15 | UnknownUser | Job recovery algorithm (Draft 5, implemented in pilot v SPOCK beta) [cropped] |
| png | jobRecovery6.png | r1 | 51.8 K | 2006-10-09 - 12:31 | UnknownUser | Job recovery algorithm (Draft 6, implemented in pilot v SPOCK beta) |
| png | jobRecovery7.png | r1 | 54.4 K | 2006-10-09 - 12:43 | UnknownUser | Job recovery algorithm (Draft 7, implemented in pilot v SPOCK beta) |
| jpg | jobRecovery8.jpg | r1 | 72.3 K | 2006-11-14 - 20:13 | UnknownUser | Job recovery algorithm (Draft 8b, implemented in pilot v SPOCK beta) |
| png | jobRecovery9.png | r1 | 81.3 K | 2006-11-16 - 12:24 | UnknownUser | Job recovery algorithm (Draft 9, implemented in pilot v SPOCK beta) |