Environment variables for multi-core jobs
DRAFT 0.98 (11/02/2013)
Abstract
Within the HEPiX virtualization group a mechanism was discussed which provides access to detailed information about the current host and the current job from a unified place on the worker node. This allows the user payload to access meta information, independent of the batch system in use, for example to query the performance of the node or to calculate the remaining run time of the current job.
The schema is designed to be extensible, so that additional information, like the number of CPUs allocated to the current job, can be added. This document specifies this schema in detail.
Introduction
The proposed schema is designed to be extensible, so that additional information, like the number of CPUs allocated to the current job, can be added. The purpose of this document is to define the specification and use cases of this schema. It should be seen as the source of information for the actual implementation of the required scripts by the sites.
Definitions
Environment variables
For each job, two environment variables have to be set, with the following names:
| Variable | Contents | Comments |
| MACHINEFEATURES | Path to a directory | Host specific information |
| JOBFEATURES | Path to a directory | Job specific information |
These environment variables are the base interface for the user payload. They must be set inside the job environment.
Directories
The directories to which the two environment variables point contain job or host specific information. Within these directories, each file name is a key and the file contents are the corresponding value.
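As an illustration, a minimal sketch in Python of how a user payload might read such a key; the helper name read_feature is illustrative, and jobstart_secs is one of the keys defined later in this document:

    import os

    def read_feature(base_var, key):
        # The environment variable points to a directory; the file name
        # is the key, the file contents are the value.
        directory = os.environ[base_var]
        with open(os.path.join(directory, key)) as f:
            return f.read().strip()

    # Example: Unix time stamp at which the batch system started the job
    start = int(read_feature("JOBFEATURES", "jobstart_secs"))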
Use cases
The use cases to be covered are:
| Identifier | Actors | Pre-conditions | Scenario | Outcome | (Optional) What to avoid |
| 1. | user | job starter script | The job needs to calculate the remaining time it is allowed to run | | |
| 2. | user | job starter script | The job needs to know how long it has already been running | | |
| 3. | user | host setup | The job wants to know the performance of the host in order to calculate the remaining time it will need to complete (for CPU intensive jobs) | | |
| 4. | site | host setup | A host needs to be drained. The payload needs to be informed of the planned shutdown time | | |
| 5. | site | job starter script | A multi-core user job on a non-exclusive node needs to know how many threads it is allowed to start. This is specifically of interest in a late-binding scenario where the pilot reserved the cores and the user payload needs to know about this. | | |
| 6. | site | job starter script | A user job wants to know how many job slots are allocated to the current job | | |
| 7. | site | job starter script | A user job wants to know the maximum amount of disk space it is allowed to use | | |
| 8. | site | job starter script | A user job wants to set up memory limits to protect itself from being killed automatically by the batch system (see the sketch after this table) | | |
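As an illustration of use case 8, a payload could translate the memory limit into a process resource limit so that it can fail gracefully instead of being killed by the batch system. A minimal sketch, assuming the read_feature helper from the Directories section and the mem_limit_MB key defined later; note that whether the batch system enforces resident memory or address space is site specific, so limiting the address space here is an assumption:

    import resource

    # mem_limit_MB counts with 1000, not 1024 (4 GB corresponds to 4000)
    limit_bytes = int(read_feature("JOBFEATURES", "mem_limit_MB")) * 1000 * 1000
    # Limit the address space of this process and its children
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))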
Requirements
- The proposed schema must be unambiguous and leave no room for interpretation of the values provided.
- For this reason, basic information is used which is well defined across sites.
- Host information can be both static (like the HS06 rating) and dynamic (e.g. the shutdown time may be set at any time by the site).
- Job specific limits need to be set at the startup of the job. Usually, the related files will be owned by the user and reside in a /tmp-like area.
General remarks
The implementation, that is, the creation of the files and their contents, can be highly site specific. A sample implementation can be provided per batch system in use, but it is understood that sites are allowed to change the implementation, provided that the resulting values match the definitions given in this document.
Clarification on terms
Normalization and CPU factors
At many sites, batch resources consist of a mixture of different hardware types with different performance. When a user submits a job to a queue, the queue typically sets limits on the CPU time and the wall clock time of the job. If such a job ends up on a faster node, it will terminate more quickly. To avoid jobs on slower nodes being terminated prematurely, CPU and wall clock times are usually scaled by a factor which depends on the performance of the machine. This factor is called the CPU factor.
A reference machine has a CPU factor of 1. For such a machine, normalized and real time values for CPU are the same.
A CPU factor below 1 means that the worker node is slower than the reference. In this case the real time values are larger than the normalized values, and jobs are given more real time in order to finish.
A CPU factor above 1 means that the worker node is faster than the reference. In this case the real time values are smaller than the normalized values.
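A worked example with illustrative numbers: if a queue grants a normalized CPU limit of 80000 seconds, a job landing on a node with CPU factor 0.8 may use 80000 / 0.8 = 100000 real seconds, while on a node with CPU factor 2.0 it may use only 80000 / 2.0 = 40000 real seconds. In general, real time = normalized time / CPU factor, which matches the "divide by cpufactor_lrms" rule in the key tables below.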
List of keys
Job specific information. The corresponding files are:
- found in the directory pointed to by $JOBFEATURES
- owned by the user who is executing the original job; in the case of pilots this would be the pilot user at the site
- created before the user job starts, e.g. by the job starter script
| Identifier | File Name (key) | Originating use cases | Value | (Optional) Comments |
| 1.1 | cpufactor_lrms | 1, 3 | Normalization factor as used by the batch system | Can be site specific |
| 1.2.1 | cpu_limit_secs_lrms | 1 | CPU limit in seconds, normalized | Divide by cpufactor_lrms to retrieve the real time seconds. For multi-core jobs it is the total. |
| 1.2.2 | cpu_limit_secs | 1 | CPU limit in seconds, real time (not normalized) | For multi-core jobs it is the total. |
| 1.3.1 | wall_limit_secs_lrms | 1 | Run time limit in seconds, normalized | Divide by cpufactor_lrms to retrieve the real time seconds |
| 1.3.2 | wall_limit_secs | 1 | Run time limit in seconds, real time (not normalized) | |
| 1.4 | disk_limit_GB | 7 | Scratch space limit in GB (if any) | If no quotas are used on a shared system, this corresponds to the full scratch space available to all jobs which run on the host. Counting is 1 GB = 1000 MB = 1000^2 kB. |
| 1.5 | jobstart_secs | 2 | Unix time stamp (in seconds) of the time when the job started in the batch farm | This is what the batch system sees, not when the user payload started to work. |
| 1.6 | mem_limit_MB | 8 | Memory limit (if any) in MB | Total memory. Count with 1000, not 1024; that is, 4 GB corresponds to 4000. |
| 1.7 | allocated_CPU | 5 | Number of cores allocated to the current job | Allocated cores can be physical or logical |
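A sketch of how a payload might combine these keys to cover use cases 1 and 2 (elapsed and remaining run time), again assuming the read_feature helper from the Directories section; the fallback to the normalized limit is illustrative:

    import time

    elapsed = int(time.time()) - int(read_feature("JOBFEATURES", "jobstart_secs"))

    # Prefer the real-time limit; fall back to dividing the normalized
    # limit by the CPU factor if only the normalized value is published.
    try:
        wall_limit = int(read_feature("JOBFEATURES", "wall_limit_secs"))
    except IOError:
        wall_limit = int(int(read_feature("JOBFEATURES", "wall_limit_secs_lrms"))
                         / float(read_feature("JOBFEATURES", "cpufactor_lrms")))

    remaining = wall_limit - elapsed  # real-time seconds the job may still run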
Host specific information, found in the directory pointed to by $MACHINEFEATURES:
| Identifier | File Name (key) | Originating use cases | Value | (Optional) Comments |
| 2.1 | hs06 | 3 | HS06 rating of the full machine in its current setup | Static value. HS06 is measured following the HEPiX recommendations. If hyperthreading is enabled, the additional cores are treated as if they were full cores. |
| 2.2 | shutdowntime | 4 | Shutdown time as a UNIX time stamp (in seconds) | Dynamic value. If the file is missing, no shutdown is foreseen. The value is in real time and must be in the future. The file must be removed once the shutdown time has passed. |
| 2.3 | jobslots | 7 | Number of job slots for the host | Dynamic value; can change with batch system reconfigurations |
| 2.4 | phys_cores | 3 | Number of physical cores | - |
| 2.5 | log_cores | 3 | Number of logical cores | Can be zero if hyperthreading is off |
| 2.6 | shutdown_command | 4 | Path to a command on the machine | Optional; only relevant for virtual machines. A command provided by the site which gives the user a hook to properly destroy the virtual machine and unregister it. |
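For use case 4, a sketch of how a payload might check for a planned shutdown; since a missing shutdowntime file means that no shutdown is foreseen, the absence of the file must be treated as the normal case (the helper name is again illustrative):

    import os
    import time

    def planned_shutdown():
        # Returns the shutdown time stamp, or None if no shutdown is foreseen
        path = os.path.join(os.environ["MACHINEFEATURES"], "shutdowntime")
        try:
            with open(path) as f:
                return int(f.read().strip())
        except IOError:
            return None

    t = planned_shutdown()
    if t is not None:
        seconds_left = t - int(time.time())  # real time until the host is drained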
Notes:
External Resources
A prototype for LSF exists and is installed at CERN. It needs to be adapted to follow the definitions in this document.
Impact
User jobs and pilot frameworks will have to be updated in order to profit from the information described above.
Proposed extensions
There has been a proposal to allow the job to give information back to the machine owner. The proposed syntax and semantics are available on the Job to Machine Web page.
Recommendations
Conclusions
The new mechanism allows basic information to be propagated to the user payload. The interface is independent of the batch system in use. The information provided is designed to be sufficient to cover all use cases mentioned above.
References
- CERN prototype implementation for LSF (being implemented). This prototype will be CERN specific. It is likely that it will have to be adapted for use at other sites.
--
UlrichSchwickerath - 10-Jul-2012