Job Recovery
Introduction
The Panda pilot job recovery scans the local disk for old, 'lost' job directories
and log files. When it finds a lost job, identified by a job state file that has not
been modified recently, it tries to recover the job and updates the Panda server
with the result.
Algorithm
[will be updated]
The Panda pilot job recovery scans all work directories found on the local disk for
existing job state files. When it finds such a file, it first checks when the file was
last updated. If that was less than two heartbeats ago (i.e. 60 minutes), the file is
skipped for now; the next pilot to arrive on the node will pick it up again later. If
the last update was more than two heartbeats ago, the pilot tries to read the file.
If the work directory of the lost job still exists, the pilot tries to create the log
file from it. If the work directory no longer exists, the pilot looks for the compressed
work directory, i.e. the log, and registers the log if it finds one. If there is neither
a work directory nor a log file, the job state is set to failed with a pilot exit
code of 1153 (lost job never finished). Before ending the current iteration of the job
state file loop, the pilot updates the Panda server with the now known fate of the lost
job and cleans up the old work directory and the job state file (which is stored in the
site work directory). The cleanup is done only if the log or data files were transferred
to the local DDM.
The algorithm is executed at pilot startup.
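The scan and the per-job decision described above can be sketched in Python as follows. This is a minimal illustration, not the pilot's actual code: the `jobState-` file name prefix, the function names and the action labels are hypothetical, and a heartbeat is assumed to be 30 minutes (so that two heartbeats equal the 60 minutes stated above). A file whose age is exactly two heartbeats is treated as not yet stale, since the text does not specify that case.

```python
import os
import time

# Assumption: one heartbeat is 30 minutes, so two heartbeats are 60 minutes.
HEARTBEAT_SECONDS = 30 * 60
STALE_AFTER_SECONDS = 2 * HEARTBEAT_SECONDS


def is_stale(state_file, now=None):
    """Return True if the file was last modified more than two heartbeats ago."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(state_file) > STALE_AFTER_SECONDS


def find_stale_state_files(site_workdir, prefix="jobState-", now=None):
    """Yield stale job state files found directly under site_workdir.

    The 'jobState-' prefix is a hypothetical naming convention.
    """
    for name in sorted(os.listdir(site_workdir)):
        path = os.path.join(site_workdir, name)
        if name.startswith(prefix) and os.path.isfile(path) and is_stale(path, now):
            yield path


def decide_recovery_action(workdir_exists, log_exists):
    """Map what is left on disk to the next recovery step (hypothetical labels)."""
    if workdir_exists:
        return "create_log"       # build the log from the surviving work directory
    if log_exists:
        return "register_log"     # only the compressed work directory (log) is left
    return "fail_lost_job"        # nothing left: job fails with exit code 1153
```

Recently updated job state files are simply not yielded, which corresponds to leaving them for the next pilot that arrives on the node.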
Activity diagram for the job recovery algorithm. [Draft 9, implemented in pilot v SPOCK beta]
Job Recovery Error Codes
Error Code | Description
1153 | Job Recovery: Lost job did not finish - lost forever
1154 | Job Recovery: Failed to re-register log file
1155 | Job Recovery: Failed to move output files
1156 | Job Recovery: Missing log and work dir
1157 | Job Recovery: Could not create log
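For scripts that need to turn these codes into readable messages, the table can be expressed as a simple lookup. The sketch below is illustrative only; the names `JOB_RECOVERY_ERRORS` and `describe_error` are hypothetical and do not come from the pilot source.

```python
# Codes and descriptions copied from the table above.
JOB_RECOVERY_ERRORS = {
    1153: "Job Recovery: Lost job did not finish - lost forever",
    1154: "Job Recovery: Failed to re-register log file",
    1155: "Job Recovery: Failed to move output files",
    1156: "Job Recovery: Missing log and work dir",
    1157: "Job Recovery: Could not create log",
}


def describe_error(code):
    """Return the description for a job recovery error code (hypothetical helper)."""
    return JOB_RECOVERY_ERRORS.get(code, "Unknown error code %d" % code)
```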
Major updates:
-- PaulNilsson - 06 Oct 2006
Responsible:
Topic attachments
Type | Attachment | History | Size | Date | Who | Comment
jpg | PandaPilotJobRecovery2.jpg | r1 | 56.3 K | 2006-10-08 - 17:55 | UnknownUser | Job recovery algorithm, UML activity diagram, draft 2
jpg | PandaPilotJobRecovery3.jpg | r1 | 64.9 K | 2006-10-08 - 21:25 | UnknownUser | Job recovery algorithm, UML activity diagram, draft 3 (implemented in pilot v SPOCK beta)
jpg | jobRecovery-v4.jpg | r1 | 119.3 K | 2006-11-14 - 11:08 | UnknownUser | Job recovery algorithm (Draft 8, implemented in pilot v SPOCK beta)
png | jobRecovery.png | r1 | 36.7 K | 2006-10-06 - 19:02 | UnknownUser | Job recovery algorithm
png | jobRecovery5.png | r1 | 49.1 K | 2006-10-09 - 12:05 | UnknownUser | Job recovery algorithm (Draft 5, implemented in pilot v SPOCK beta)
png | jobRecovery5b.png | r1 | 47.5 K | 2006-10-09 - 12:15 | UnknownUser | Job recovery algorithm (Draft 5, implemented in pilot v SPOCK beta) [cropped]
png | jobRecovery6.png | r1 | 51.8 K | 2006-10-09 - 12:31 | UnknownUser | Job recovery algorithm (Draft 6, implemented in pilot v SPOCK beta)
png | jobRecovery7.png | r1 | 54.4 K | 2006-10-09 - 12:43 | UnknownUser | Job recovery algorithm (Draft 7, implemented in pilot v SPOCK beta)
jpg | jobRecovery8.jpg | r1 | 72.3 K | 2006-11-14 - 20:13 | UnknownUser | Job recovery algorithm (Draft 8b, implemented in pilot v SPOCK beta)
png | jobRecovery9.png | r1 | 81.3 K | 2006-11-16 - 12:24 | UnknownUser | Job recovery algorithm (Draft 9, implemented in pilot v SPOCK beta)