Job State
Introduction
The job state algorithm provides a job recovery mechanism for the Panda pilot.
Whenever the pilot updates the Panda server with the progress of a job, it also creates or updates
a job state file containing all information necessary for later job recovery. If the pilot crashes and the
job is lost, the job state file remains in the job work directory for later recovery by a different pilot.
This new pilot will find the job state file and attempt to recover the lost job. If the payload of the
lost job finished (but e.g. failed to transfer the data and log files), the job recovery algorithm will move
the log to the local SE and register the data file.
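The recovery decision described above can be sketched as follows. This is a minimal illustration, not the pilot's actual code: the keys `payload_finished`, `log_transferred` and `data_registered` and the action labels are hypothetical, and this page does not say what happens when the payload did not finish.

```python
def plan_recovery(state):
    """Return the recovery actions for a lost job, given its stored state.

    `state` is a plain dict with hypothetical boolean keys; the real job
    state file stores pickled pilot objects instead.
    """
    actions = []
    if not state.get("payload_finished", False):
        # This case is not described on this page, so nothing is planned.
        return actions
    if not state.get("log_transferred", False):
        actions.append("move log to local SE")
    if not state.get("data_registered", False):
        actions.append("register data file")
    return actions
```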
Algorithm
Activity diagram for the job state algorithm. [Correction: the job state file is in fact re-created
on every Panda server update. There is no check whether the file already exists.]
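The behaviour in the correction note, re-creating the file on every update without an existence check, could look roughly like the sketch below. The function names, the `directory` argument and the use of a plain dict as the state are assumptions made for illustration only.

```python
import os
import pickle


def write_job_state(directory, job_id, state):
    """Write the job state file, replacing any earlier version."""
    path = os.path.join(directory, "jobState-%s.pickle" % job_id)
    # Mode "wb" truncates an existing file, so no existence check is needed.
    with open(path, "wb") as handle:
        pickle.dump(state, handle)
    return path


def read_job_state(path):
    """Read a job state file back into a Python object."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Calling `write_job_state` once per server update keeps only the most recent state on disk, which is what a recovering pilot would read.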
Additional details
The job state object knows how to store itself and read itself back, and how to delete both itself and the old work directory. If no
job state files remain in the site work directory, that directory is deleted as well. When a job finishes successfully and all output files and the log have been copied to the SE, the job state file is removed from the site work directory.
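A minimal sketch of such a self-managing object is shown below. The class layout, the method names `put`, `get` and `remove`, and the assumption that the state file sits directly in the site work directory are all hypothetical; the actual pilot class may differ.

```python
import glob
import os
import pickle
import shutil


class JobState:
    """Hypothetical job state object that can store, load and remove itself."""

    def __init__(self, site_workdir, job_id, job_workdir=None, info=None):
        self.site_workdir = site_workdir
        self.job_id = job_id
        self.job_workdir = job_workdir  # old job work directory, if any
        self.info = info or {}

    @property
    def filename(self):
        return os.path.join(self.site_workdir, "jobState-%s.pickle" % self.job_id)

    def put(self):
        """Store the object's attributes in its job state file."""
        with open(self.filename, "wb") as handle:
            pickle.dump(self.__dict__, handle)

    def get(self):
        """Read the stored attributes back into this object."""
        with open(self.filename, "rb") as handle:
            self.__dict__.update(pickle.load(handle))

    def remove(self):
        """Delete the state file, the old work directory and, if no other
        job state files remain, the site work directory."""
        if os.path.exists(self.filename):
            os.remove(self.filename)
        if self.job_workdir and os.path.isdir(self.job_workdir):
            shutil.rmtree(self.job_workdir)
        pattern = os.path.join(self.site_workdir, "jobState-*.pickle")
        if not glob.glob(pattern) and os.path.isdir(self.site_workdir):
            shutil.rmtree(self.site_workdir)
```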
File format
File name:
jobState-<jobId>.pickle
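As an illustration of the naming scheme, the helpers below build a file name from a job id and extract the id again. Both function names are hypothetical, and the pattern accepts any non-empty job id because this page does not specify its format.

```python
import re

# Matches the documented pattern jobState-<jobId>.pickle.
_JOB_STATE_RE = re.compile(r"^jobState-(.+)\.pickle$")


def job_state_filename(job_id):
    """Build the job state file name for a job id."""
    return "jobState-%s.pickle" % job_id


def job_id_from_filename(filename):
    """Return the job id as a string, or None if the name does not match."""
    match = _JOB_STATE_RE.match(filename)
    return match.group(1) if match else None
```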
File object information
The job state file is a container for all information necessary for job recovery. The file contains the following objects stored in pickle format:
There are a few overlaps between these objects, but they are insignificant. A typical job state file is about 4 kB in size.
Major updates:
-- PaulNilsson - 06 Oct 2006
Responsible: PaulNilsson