Environment variables for multi-core jobs

DRAFT 0.98 (11/02/2013)

Abstract

Within the HEPiX virtualization group, a mechanism was discussed that makes detailed information about the current host and the current job available in a unified place on the worker node. This allows the user payload to access meta information, independent of the batch system in use, such as the performance of the node, or to calculate the remaining run time of the current job.

The schema is designed to be extensible, so that it can be used to add further information such as the number of CPUs allocated to the current job. This document specifies the schema and its intended use.

Introduction

The proposed schema is designed to be extensible, so that it can be used to add further information such as the number of CPUs allocated to the current job. The purpose of this document is to define the specification and the use cases of this schema. It should be seen as the source of information for the actual implementation of the required scripts by the sites.

Definitions

Environment variables

For each job, two environment variables have to be set, with the following names:

| Variable | Contents | Comments |
| MACHINEFEATURES | Path to a directory | Execution (host) specific information |
| JOBFEATURES | Path to a directory | Job specific information |

These environment variables are the base interface for the user payload. They must be set inside the job environment.

Directories

The directories to which the two environment variables point contain job or host specific information, with one file per item: the file name is the key and the file content is the value.
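For illustration only, a minimal Python sketch (not part of the specification) that reads all keys from the two directories into dictionaries, assuming both environment variables are set:

  import os

  def read_features(env_var):
      """Read all key/value files from the directory pointed to by env_var."""
      directory = os.environ.get(env_var)
      features = {}
      if directory and os.path.isdir(directory):
          for key in os.listdir(directory):
              path = os.path.join(directory, key)
              if os.path.isfile(path):
                  with open(path) as f:
                      features[key] = f.read().strip()
      return features

  machine = read_features("MACHINEFEATURES")
  job = read_features("JOBFEATURES")
  print(machine.get("hs06"), job.get("jobstart_secs"))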

Use cases

The use cases to be covered are:

| Identifier | Actors | Pre-conditions | Scenario | Outcome | (Optional) What to avoid |
| 1. | user | job starter script | The job needs to calculate the remaining time it is allowed to run | | |
| 2. | user | job starter script | The job needs to know how long it has already been running | | |
| 3. | user | host setup | The job wants to know the performance of the host in order to calculate the remaining time it will need to complete (for CPU-intensive jobs) | | |
| 4. | site | host setup | A host needs to be drained. The payload needs to be informed of the planned shutdown time | | |
| 5. | site | job starter script | A multi-core user job on a non-exclusive node needs to know how many threads it is allowed to start. This is of particular interest in a late-binding scenario where the pilot reserved the cores and the user payload needs to know about this | | |
| 6. | site | job starter script | A user job wants to know how many job slots are allocated to the current job | | |
| 7. | site | job starter script | A user job wants to know the maximum amount of disk space it is allowed to use | | |
| 8. | site | job starter script | A user job wants to set up memory limits to protect itself from being killed automatically by the batch system | | |

Requirements

  • The proposed schema must be unambiguous and leave no room for interpretation of the values provided.
  • For this reason, only basic information is used which is well defined across sites.
  • Host and job information can be both static (like the HS06 rating) and dynamic (e.g. the shutdown time may be set at any time by the site).
  • Job specific files will be owned by the user and reside in a /tmp-like area.

General explanations

The implementation, that is, the creation of the files and their contents, can be highly site specific. A sample implementation can be provided per batch system in use, but it is understood that sites are allowed to change the implementation, provided that the numbers they produce match the definitions given in this document.
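As a purely hypothetical sketch of the site side, the following Python fragment creates a job features directory and fills in a few of the keys defined further below; the numbers and the way they are obtained are placeholders, since a real job starter would query the local batch system:

  import os
  import tempfile
  import time

  # Hypothetical numbers; a real job starter would query the local batch system.
  cpu_factor = 1.2
  wall_limit_normalized = 172800          # normalized run time limit of the queue, in seconds

  job_dir = tempfile.mkdtemp(prefix="jobfeatures-")   # /tmp-like area, owned by the job user
  os.environ["JOBFEATURES"] = job_dir                 # visible to the user payload started later

  def write_key(directory, key, value):
      """One file per key; the file content is the value."""
      with open(os.path.join(directory, key), "w") as f:
          f.write(str(value) + "\n")

  write_key(job_dir, "cpufactor_lrms", cpu_factor)
  write_key(job_dir, "wall_limit_secs_lrms", wall_limit_normalized)
  write_key(job_dir, "wall_limit_secs", int(wall_limit_normalized / cpu_factor))
  write_key(job_dir, "jobstart_secs", int(time.time()))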

Clarification on terms

Normalization and CPU factors

At many sites the batch resources consist of a mixture of different hardware types with different performance. When a user submits a job to a queue, this queue typically sets limits on the CPU time and the wall clock time of the job. If such a job ends up on a faster node, it will terminate more quickly. To avoid jobs which run on slower nodes being terminated prematurely, CPU and wall clock times are usually scaled with a factor which depends on the performance of the machine. This factor is called the CPU factor. A reference machine has a CPU factor of 1; for such a machine, normalized and real time values are the same. A CPU factor below 1 means that the worker node is slower than the reference. In this case the normalized limits are smaller than the corresponding real time values, and jobs are allowed to run longer in real time in order to terminate. A CPU factor above 1 means that the worker node is faster than the reference; in this case the normalized limits are larger than the corresponding real time values.
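For example, with the convention used in the tables below (divide a normalized value by the CPU factor to obtain real seconds), the conversion looks as follows; the numbers are invented for the example:

  def real_seconds(normalized_seconds, cpu_factor):
      """Convert a normalized limit into real (elapsed) seconds: divide by the CPU factor."""
      return normalized_seconds / cpu_factor

  # A slow node (CPU factor 0.8) grants more real time for the same normalized limit,
  # a fast node (CPU factor 2.0) grants less.
  print(real_seconds(80000, 0.8))   # 100000.0
  print(real_seconds(80000, 2.0))   # 40000.0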

List of requirements

Job specific information. These files are:
  • found in the directory pointed to by $JOBFEATURES
  • owned by the user who is executing the original job; in the case of pilots this would be the pilot user at the site
  • created before the user job starts, e.g. by a job starter script.

| Identifier | File Name (key) | Originating use cases | Value | (Optional) Comments |
| 1.1 | cpufactor_lrms | 1, 3 | Normalization factor as used by the batch system | Can be site specific |
| 1.2.1 | cpu_limit_secs_lrms | 1 | CPU limit in seconds, normalized | Divide by cpufactor_lrms to retrieve the real time seconds. For multi-core jobs this is the total |
| 1.2.2 | cpu_limit_secs | 1 | CPU limit in seconds, real time (not normalized) | For multi-core jobs this is the total |
| 1.3.1 | wall_limit_secs_lrms | 1 | Run time limit in seconds, normalized | Divide by cpufactor_lrms to retrieve the real time seconds |
| 1.3.2 | wall_limit_secs | 1 | Run time limit in seconds, real time (not normalized) | |
| 1.4 | disk_limit_GB | 7 | Scratch space limit in GB (if any) | If no quotas are used on a shared system, this corresponds to the full scratch space available to all jobs which run on the host. Counting is 1 GB = 1000 MB = 1000^2 kB |
| 1.5 | jobstart_secs | 2 | Unix time stamp (in seconds) of the time when the job started in the batch farm | This is what the batch system sees, not when the user payload started to work |
| 1.6 | mem_limit_MB | 8 | Memory limit (if any) in MB. Total memory | Count with 1000, not 1024; that is, 4 GB corresponds to 4000 |
| 1.7 | allocated_CPU | 5 | Number of cores allocated to the current job | Allocated cores can be physical or logical |
| 1.8 | shutdowntime_job | 1 | Dynamic value; shutdown time as a UNIX time stamp (in seconds) | Optional; if the file is missing, no job shutdown is foreseen. The job needs to have finished all its processing when the shutdown time arrives |
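A minimal sketch of use cases 1 and 2 (remaining and elapsed run time), assuming $JOBFEATURES is set and the files above are present:

  import os
  import time

  jobfeatures = os.environ["JOBFEATURES"]

  def read_value(key):
      """Return the content of $JOBFEATURES/<key> as a float."""
      with open(os.path.join(jobfeatures, key)) as f:
          return float(f.read().strip())

  start = read_value("jobstart_secs")
  wall_limit = read_value("wall_limit_secs")   # real time, not normalized

  elapsed = time.time() - start                # use case 2: time already spent
  remaining = wall_limit - elapsed             # use case 1: time still allowed
  print("elapsed: %.0f s, remaining: %.0f s" % (elapsed, remaining))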

Host specific information, found in the directory pointed to by $MACHINEFEATURES:

| Identifier | File Name (key) | Originating use cases | Value | (Optional) Comments |
| 2.1 | hs06 | 3 | HS06 rating of the full machine in its current setup | Static value. HS06 is measured following the HEPiX recommendations. If hyperthreading is enabled, the additional cores are treated as if they were full cores |
| 2.2 | shutdowntime | 4 | Dynamic value; shutdown time as a UNIX time stamp (in seconds) | Dynamic. If the file is missing, no shutdown is foreseen. The value is in real time and must be in the future. Must be removed once the shutdown time has passed |
| 2.3 | jobslots | 7 | Number of job slots for the host | Dynamic value; can change with batch reconfigurations |
| 2.4 | phys_cores | 3 | Number of physical cores | - |
| 2.5 | log_cores | 3 | Number of logical cores | Can be zero if hyperthreading is off |
| 2.6 | shutdown_command | 4 | Path to a command on the machine | Optional; only relevant for virtual machines. A command provided by the site which gives the user a hook to properly destroy the virtual machine and unregister it |
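For use case 3, a job may estimate its share of the host performance. How the HS06 rating of the full machine is split is not prescribed by this document; the sketch below simply assumes an even split across the job slots:

  import os

  def read_value(env_var, key, default=None):
      """Read $<env_var>/<key>; return the default if the file is missing or unreadable."""
      try:
          with open(os.path.join(os.environ[env_var], key)) as f:
              return float(f.read().strip())
      except (KeyError, OSError, ValueError):
          return default

  hs06 = read_value("MACHINEFEATURES", "hs06")
  jobslots = read_value("MACHINEFEATURES", "jobslots")
  allocated = read_value("JOBFEATURES", "allocated_CPU", default=1)

  # Assumption of this sketch only: the HS06 rating of the full machine is shared
  # evenly among the job slots, so the job's share scales with its allocated cores.
  if hs06 and jobslots:
      print("estimated HS06 share of this job: %.1f" % (hs06 / jobslots * allocated))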

Notes:

External Resources

A prototype for LSF exists and is installed at CERN. It needs to be adapted to follow the definitions in this document.

Impact

User jobs and pilot frameworks will have to be updated in order to profit from the above declarations.

Proposed extensions

There has been a proposal to allow the job to give information back to the machine owner. The proposed syntax and semantics are available on the Job to Machine Web page.

Recommendations

Conclusions

The new mechanism allows basic information to be propagated to the user payload. The interface is independent of the batch system in use. The information provided is designed to be sufficient to cover all of the use cases mentioned above.

References

  • CERN prototype implementation for LSF (being implemented). This prototype will be CERN specific. It is likely that it will have to be adapted for use at other sites.

-- UlrichSchwickerath - 10-Jul-2012
