Environment variables for multi-core jobs - Job to Machine channel
DRAFT 0.1 (16/12/2013)
Abstract
This is a placeholder page only for now; please disregard any content.
This document provides a proposal for the definition of a communication channel from the Job to the Machine.
The objective of this communication channel is to provide the machine owner with enough information to pick "the best job" to vacate when needed.
The specification is optimized with the pilot use case in mind, but checkpointable user jobs may be a good fit as well.
Introduction
The proposed schema builds on the work done for the Machine to Job communication channel.
Definitions
Environment variables
For each job, one environment variable has to be set, with the following name:
JOBSTATUS
This environment variable is the base interface for the user payload. It must be set inside the job environment.
Directories
The directory to which the environment variable points contains job-specific information. Each file name is a key, and the contents of that file are the value.
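As an illustration of the "file name is the key, contents are the value" convention, the sketch below writes and reads one such file. The helper names `write_key` and `read_key` are hypothetical, and storing the value as plain text followed by a newline is an assumption; the proposal does not define an encoding.

```python
from pathlib import Path


def write_key(status_dir, key, value):
    """Store `value` in the file named `key` inside the status directory."""
    # Assumption: values are plain text terminated by a newline.
    (Path(status_dir) / key).write_text(f"{value}\n")


def read_key(status_dir, key):
    """Return the stripped contents of the file named `key`, or None if it is absent."""
    try:
        return (Path(status_dir) / key).read_text().strip()
    except FileNotFoundError:
        return None
```

A job would typically pass the directory named by the environment variable, for example `write_key(os.environ["JOBSTATUS"], "used_CPU", 4)`.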
Requirements
- The proposed schema must be unique and leave no room for interpretation of the values provided.
- For this reason, only basic information that is well defined across sites is used.
List of requirements
Job-specific information, which is:
- found in the directory pointed to by $JOBSTATUS
- owned by the user who is executing the original job. In the case of pilots this would be the pilot user at the site.
- created by the job, and will be updated several times during its lifetime
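Because the job rewrites these files during its lifetime while the machine owner may read them at any moment, one possible approach is to update each file atomically. The sketch below is not part of the proposal. It assumes a POSIX file system, a hypothetical helper name `update_key_atomically`, and that the files should be world-readable so that another account can read them.

```python
import os
import tempfile


def update_key_atomically(status_dir, key, value):
    """Replace the file named `key` so that readers never see a partial write."""
    # Write to a hidden temporary file in the same directory, so that the
    # final rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=status_dir, prefix=f".{key}.")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(f"{value}\n")
        # mkstemp creates the file with mode 0600; widening it to 0644 is an
        # assumption about how the machine owner reads the directory.
        os.chmod(tmp_path, 0o644)
        # os.replace renames the file atomically on POSIX systems.
        os.replace(tmp_path, os.path.join(status_dir, key))
    except BaseException:
        os.unlink(tmp_path)
        raise
```

With this approach a reader that lists the directory should skip names starting with a dot, since a temporary file may be visible for a short time.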
| Identifier | File Name (key) | Originating use cases | Value | (Optional) Comments |
| 3.1 | used_CPU | NA | Number of cores used by the job. | Must be less than or equal to allocated_CPU. |
| 3.2 | can_postpone_deadline | NA | If set to 0 (interpreted as False), job_deadline_secs is guaranteed not to increase. | The job can change it back to True (i.e. non-0), but must not change the deadline for at least 10 minutes after the change. |
| 3.3 | job_deadline_secs | NA | UNIX time when the job is guaranteed to terminate. | Unless it has promised not to change it, the job is allowed to postpone this as needed. However, the host system is allowed to kill the job if it hits the deadline. |
| 3.4 | job_termination_secs | NA | UNIX time when the job is expected to terminate. | Optional. If set, must be before the deadline. This is only an estimate, and thus likely to change. |
| 3.x | last_jobstart_secs | NA | Last time a job was started. | |
| 3.x | first_exp_job_end | | | |
| 3.x | add_uncom_time_1k | NA | | |
| 3.x | add_final_exp_waste_1k | NA | | |
| 3.x | priority_factor | NA | Relative priority of this job among all jobs of this user. | Integer, higher is better. |
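The constraints stated in the table could be checked by a consumer along the following lines. This is a minimal sketch, not a reference implementation: the function names are hypothetical, `allocated_cpu` is assumed to be known to the caller from elsewhere, "before the deadline" is read as strictly earlier, and a missing can_postpone_deadline file is assumed to mean that the job made no promise.

```python
def check_job_status(values, allocated_cpu):
    """Return a list of violations of the constraints listed in the table.

    `values` maps file names (keys) to their raw string contents.
    """
    problems = []

    used = values.get("used_CPU")
    if used is not None and int(used) > allocated_cpu:
        problems.append("used_CPU exceeds allocated_CPU")

    deadline = values.get("job_deadline_secs")
    termination = values.get("job_termination_secs")
    # job_termination_secs is optional; compare only when both are present.
    if deadline is not None and termination is not None:
        if int(termination) >= int(deadline):
            problems.append("job_termination_secs is not before job_deadline_secs")

    return problems


def deadline_is_fixed(values):
    """True when can_postpone_deadline is 0, i.e. the deadline will not increase."""
    flag = values.get("can_postpone_deadline")
    # Assumption: an absent file means no promise was made.
    return flag is not None and int(flag) == 0
```

For example, `check_job_status({"used_CPU": "9"}, 8)` reports one violation, while `deadline_is_fixed({"can_postpone_deadline": "0"})` is true.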
Notes:
External Resources
TBD
Impact
TBD
Recommendations
Conclusions
The new mechanism allows basic information to be propagated from the user payload to the machine owner. The interface is independent of the batch system in use.
References
Igor's presentation at CHEP 2013
--
IgorSfiligoi - 16 Dec 2013