Environment variables for multi-core jobs - Job to Machine channel

DRAFT 0.1 (16/12/2013)

Abstract

This is only a placeholder page for now; please disregard any content.

This document proposes the definition of a communication channel from the Job to the Machine. The objective of this communication channel is to give the machine owner enough information to pick "the best job" to vacate when needed.

The specification is optimized with the pilot use case in mind, but checkpointable user jobs may be a good fit as well.

Introduction

The proposed schema builds on the work done for the Machine to Job communication channel.

Definitions

Environment variables

For each job, one environment variable has to be set, with the following name:

| Variable | Contents | Comments |
| JOBSTATUS | Path to a directory | Job-specific information |

This environment variable is the base interface for the user payload. It must be set inside the job environment.

Directories

The directory to which the environment variable points contains job-specific information. Each file name is a key, and the contents of that file are the corresponding value.
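
As a minimal sketch of the key/value layout described above, the following Python function reads such a directory into a dictionary. The function name read_job_status, the stripping of trailing whitespace, and the skipping of hidden files are assumptions made for illustration; none of them is part of the proposal.

```python
import os
from pathlib import Path


def read_job_status(status_dir=None):
    """Return {file name: file contents} for a JOBSTATUS-style directory."""
    # Fall back to the JOBSTATUS environment variable when no path is given.
    if status_dir is None:
        status_dir = os.environ["JOBSTATUS"]
    values = {}
    for entry in sorted(Path(status_dir).iterdir()):
        # Each regular file is one key. Skipping hidden files is an
        # assumption, meant to ignore temporary files left by a writer.
        if entry.is_file() and not entry.name.startswith("."):
            values[entry.name] = entry.read_text().strip()
    return values
```

A reader on the machine side could call read_job_status("/path/to/job/status/dir") once per job; the values are returned as strings and left uninterpreted.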

Requirements

  • The proposed schema must be unique and leave no room for interpretation of the values provided.
  • For this reason, only basic information that is well defined across sites is used.

List of requirements

Job-specific information is stored in files that are:
  • found in the directory pointed to by $JOBSTATUS
  • owned by the user who is executing the original job. In the case of pilots this would be the pilot user at the site.
  • created by the job and updated several times during its lifetime
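
Because the job updates these files repeatedly while the machine owner may be reading them, a job might publish each value with a write-then-rename step. The sketch below is one possible approach, not something the proposal prescribes: the function name publish_value, the ".tmp_" prefix, and the 0644 file mode are all assumptions.

```python
import os
import tempfile


def publish_value(status_dir, key, value):
    """Write str(value) into the file named `key`, replacing any old value."""
    # Write to a hidden temporary file in the same directory first, so a
    # reader never observes a half-written value.
    fd, tmp_path = tempfile.mkstemp(dir=status_dir, prefix=".tmp_")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(f"{value}\n")
        # Assumption: the reader may run as a different user, so make the
        # file world-readable (mkstemp creates it with mode 0600).
        os.chmod(tmp_path, 0o644)
        # os.replace renames atomically on POSIX when both paths are on
        # the same filesystem.
        os.replace(tmp_path, os.path.join(status_dir, key))
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A job could then call, for example, publish_value(os.environ["JOBSTATUS"], "used_CPU", 4) whenever its core usage changes.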

| Identifier | File Name (key) | Originating use cases | Value | (Optional) Comments |
| 3.1 | used_CPU | NA | Number of cores used by the job. | Must be less than or equal to allocated_CPU. |
| 3.2 | can_postpone_deadline | NA | If set to 0 (interpreted as False), job_deadline_secs is guaranteed not to increase. | The job can change it back to true (i.e. non-0), but must not change the deadline for at least 10 minutes after the change. |
| 3.3 | job_deadline_secs | NA | UNIX time when the job is guaranteed to terminate. | Unless it has promised not to change it, the job is allowed to postpone this as needed. However, the host system is allowed to kill the job if it hits the deadline. |
| 3.4 | job_termination_secs | NA | UNIX time when the job is expected to terminate. | Optional. If set, must be before the deadline. This is only an estimate, and thus likely to change. |
| 3.x | last_jobstart_secs | NA | Last time a job was started. | |
| 3.x | first_exp_job_end | | | |
| 3.x | add_uncom_time_1k | NA | | |
| 3.x | add_final_exp_waste_1k | NA | | |
| 3.x | priority_factor | NA | Relative priority of this job among all jobs of this user. | Integer, higher is better. |
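
The constraints stated for used_CPU, job_termination_secs and can_postpone_deadline can be checked mechanically. The following Python sketch illustrates one way to do so. It assumes the values have already been read into a dictionary of strings keyed by file name, and that allocated_CPU is supplied by the caller (for instance from the Machine to Job channel). The function names are hypothetical, and two choices are assumptions because the draft does not settle them: a termination time equal to the deadline is tolerated, and a missing can_postpone_deadline file is treated as "no promise made".

```python
def check_job_status(values, allocated_cpu):
    """Return a list of constraint violations; an empty list means none found."""
    problems = []

    # 3.1: used_CPU must be less than or equal to allocated_CPU.
    used = values.get("used_CPU")
    if used is not None and int(used) > allocated_cpu:
        problems.append("used_CPU exceeds allocated_CPU")

    # 3.4: job_termination_secs is optional; if set, it must not lie
    # after job_deadline_secs (equality is tolerated here by assumption).
    deadline = values.get("job_deadline_secs")
    termination = values.get("job_termination_secs")
    if deadline is not None and termination is not None:
        if int(termination) > int(deadline):
            problems.append("job_termination_secs is after job_deadline_secs")

    return problems


def deadline_is_firm(values):
    """True when the job has promised (value 0) not to postpone its deadline."""
    raw = values.get("can_postpone_deadline")
    # Assumption: a missing file means no promise has been made.
    return raw is not None and int(raw) == 0
```

For example, check_job_status({"used_CPU": "4"}, allocated_cpu=2) reports a single violation, since the job claims more cores than it was allocated.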

Notes:

External Resources

TBD

Impact

TBD

Recommendations

Conclusions

The new mechanism allows basic information to be propagated from the user payload to the machine owner. The interface is independent of the batch system in use.

References

Igor's presentation at CHEP 2013

-- IgorSfiligoi - 16 Dec 2013

Topic revision: r4 - 2014-01-16 - IgorSfiligoi
 