I/O-bound and CPU-bound tagging

Introduction

Many sites have also expressed interest in scheduling based on Resource Constraints. Grid worker nodes can support more CPU-bound jobs than I/O-bound jobs, because concurrent I/O-bound jobs compete for the node's disk and network bandwidth. DESY have found it beneficial to limit the number of user jobs per node and to fill the remaining slots with Monte-Carlo jobs.
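The DESY-style policy can be pictured as a greedy slot-filling loop. The sketch below is only an illustration, assuming each node advertises a number of free slots and that user and Monte-Carlo jobs wait in two separate queues; `fill_nodes` and `max_user_jobs` are hypothetical names, not part of any batch system.

```python
from collections import deque
from typing import Deque, Dict, List


def fill_nodes(
    node_slots: Dict[str, int],
    user_jobs: Deque[str],
    mc_jobs: Deque[str],
    max_user_jobs: int = 1,
) -> Dict[str, List[str]]:
    """Greedy slot filling: cap user jobs per node, then top up with Monte-Carlo.

    `max_user_jobs` is a hypothetical site policy knob. Jobs are taken from the
    front of each queue, so the queues are consumed as nodes are filled.
    """
    plan: Dict[str, List[str]] = {}
    for node, slots in node_slots.items():
        assigned: List[str] = []
        # At most `max_user_jobs` user jobs on this node (never more than its slots).
        while user_jobs and len(assigned) < min(max_user_jobs, slots):
            assigned.append(user_jobs.popleft())
        # Every remaining slot goes to a Monte-Carlo job, while any are queued.
        while mc_jobs and len(assigned) < slots:
            assigned.append(mc_jobs.popleft())
        plan[node] = assigned
    return plan
```

For two 4-slot nodes, user jobs `u1`..`u3` and Monte-Carlo jobs `m1`..`m5`, this gives `wn01` the jobs `u1, m1, m2, m3` and `wn02` the jobs `u2, m4, m5`, leaving `u3` queued. Without a tag, however, the site has to guess which queue a job belongs in.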

Site storage systems have an optimum I/O load: once the load rises above a certain threshold, global throughput can drop. Excessive storage load can also make the storage system unstable.
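If jobs carry a tag, a site could keep storage load below its threshold by capping only the I/O-bound work rather than all jobs. A minimal sketch, assuming the threshold can be approximated as a maximum number of concurrently running I/O-bound jobs; `StorageGuard` and `max_io_jobs` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StorageGuard:
    """Toy admission control for I/O-heavy work (hypothetical site policy).

    `max_io_jobs` stands in for the storage-load threshold, expressed as the
    number of concurrently running I/O-bound jobs the storage can sustain.
    """
    max_io_jobs: int
    running_io_jobs: int = 0

    def try_start(self, io_bound: bool) -> bool:
        if not io_bound:
            return True  # CPU-bound jobs are assumed to add little storage load.
        if self.running_io_jobs >= self.max_io_jobs:
            return False  # Starting it would push storage load past the threshold.
        self.running_io_jobs += 1
        return True

    def finish(self, io_bound: bool) -> None:
        if io_bound and self.running_io_jobs > 0:
            self.running_io_jobs -= 1
```

With `max_io_jobs=2`, a third I/O-bound job is held back until one of the first two finishes, while CPU-bound jobs keep starting.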

The question was raised whether cpusets or affinities (see the relevant part of the WM TEG wiki) could solve this issue at site level. It turns out these are quota tools, which prevent one job from starving another of resources, and each works only at the level of a single worker node. We believe these techniques are complementary to job tagging and do not remove the use case for it.

Proposal

The original proposal was to tag jobs with a boolean list of constraints, i.e. a job could be flagged as either CPU-bound or I/O-bound. The WLMTEG decided that the proposed changes to the JDL should allow extra flexibility: each job carries a weighted value between 0 and 1, with 0 meaning CPU-bound and 1 meaning I/O-bound. Initially, just the values 0 and 1 will be used, to gain experience with the idea. Further tags may be added at a later date.
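On the receiving side, a site might read the weighted value as follows. This is a sketch under stated assumptions: the 0.5 cut between the two classes and the `strict_initial_phase` switch are hypothetical choices, not part of the proposal.

```python
from enum import Enum


class Boundness(Enum):
    CPU = "cpu-bound"
    IO = "io-bound"


# During the initial phase only the end points of the scale are used.
INITIAL_PHASE_VALUES = (0.0, 1.0)


def classify(weight: float, strict_initial_phase: bool = True) -> Boundness:
    """Map the 0..1 weight (0 = CPU-bound, 1 = I/O-bound) to a coarse class."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError(f"weight must lie in [0, 1], got {weight}")
    if strict_initial_phase and weight not in INITIAL_PHASE_VALUES:
        raise ValueError(f"only 0 or 1 are expected initially, got {weight}")
    # Hypothetical cut: anything at or above 0.5 is treated as I/O-bound.
    return Boundness.IO if weight >= 0.5 else Boundness.CPU
```

Switching `strict_initial_phase` off lets the same code accept intermediate values once VOs start using them.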

As knowledge of job characteristics improves, a VO can specify other values; at a later stage we may decide to define an exact metric. This parameter does not translate into a specific resource to be allocated to the job, but may help sites distribute jobs in a convenient way. It is up to the experiments to decide which of their jobs will be tagged one way or the other.

There was a long discussion on the accuracy of constraint tagging. A first conclusion was that having individual users add constraint tags should be avoided, since users are commonly unaware of such details and inconsistent in their actions. In general, accuracy of the scalar value was not seen as important, and sites must expect that user communities will not be consistent in the weighted values they assign. On the other hand, experiments do know that some jobs will be CPU-bound or I/O-bound, and tagging these jobs as such will help some sites schedule them more efficiently. It is recognized that jobs often change their constraints during execution, and that precise measurement is not easy.

We therefore propose that Resource Constraint Tags be agreed and added to the JDL as a set of weighted values between 0 and 1; we expect that Resource Constraint Tags will not necessarily be limited to I/O-bound and CPU-bound constraints.
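Because the set of tags is meant to grow, a consumer such as a CE could validate the weights without hard-coding the tag names. A minimal sketch, assuming the tags arrive as a name-to-weight mapping; the tag names `io` and `memory` are hypothetical examples, not agreed JDL attributes.

```python
from typing import Dict, Mapping


def parse_constraint_tags(raw: Mapping[str, object]) -> Dict[str, float]:
    """Validate a set of Resource Constraint Tags.

    `raw` maps a tag name (e.g. a hypothetical "io" tag, where 0 means
    CPU-bound and 1 means I/O-bound) to a weight. Unknown tag names are kept,
    so new constraints can be added later without changing this code.
    """
    tags: Dict[str, float] = {}
    for name, value in raw.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"tag {name!r}: weight must be a number, got {value!r}")
        weight = float(value)
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"tag {name!r}: weight {weight} outside [0, 1]")
        tags[name] = weight
    return tags
```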

These Constraint Tags should be honored by the CE and passed to the LRMS in an agreed way, allowing sites to customize scheduling if they so desire. Sites that do not want to make use of Resource Constraint Tags will not need to. Pilot frameworks should honor the Resource Constraints between the pilot and the jobs the pilot executes.
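Once the tag reaches the LRMS, how it is used is left to each site. The sketch below shows one possible heuristic plus a matching rule a pilot could apply to its payloads; `WorkerNode`, `place_job`, `pilot_accepts`, the 0.5 default for untagged jobs and the tolerance are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical default for jobs submitted without a tag: treat them as mixed.
UNTAGGED_IO_WEIGHT = 0.5


@dataclass
class WorkerNode:
    name: str
    slots: int
    # I/O weight of each job currently running on the node.
    io_weights: List[float] = field(default_factory=list)

    def io_load(self) -> float:
        return sum(self.io_weights)

    def has_free_slot(self) -> bool:
        return len(self.io_weights) < self.slots


def place_job(nodes: List[WorkerNode], io_weight: Optional[float]) -> Optional[WorkerNode]:
    """One possible site heuristic for the LRMS.

    I/O-leaning jobs go to the free node with the lowest summed I/O weight, so
    they are spread out; CPU-leaning jobs go to the free node with the highest
    summed I/O weight, so they fill the slots next to I/O-bound jobs.
    Returns the chosen node, or None if every node is full.
    """
    weight = UNTAGGED_IO_WEIGHT if io_weight is None else io_weight
    free = [n for n in nodes if n.has_free_slot()]
    if not free:
        return None
    if weight >= 0.5:
        chosen = min(free, key=lambda n: n.io_load())
    else:
        chosen = max(free, key=lambda n: n.io_load())
    chosen.io_weights.append(weight)
    return chosen


def pilot_accepts(pilot_weight: float, payload_weight: Optional[float], tolerance: float = 0.5) -> bool:
    """A pilot submitted with a given tag only runs payloads with a similar tag.

    Untagged payloads get the neutral default, so with the default tolerance
    any pilot may run them.
    """
    weight = UNTAGGED_IO_WEIGHT if payload_weight is None else payload_weight
    return abs(pilot_weight - weight) <= tolerance
```

A site that ignores the tags simply never calls such code; the untagged default keeps behavior neutral when a VO does not set a value.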

-- DavideSalomoni - 03-Feb-2012

Topic revision: r3 - 2012-02-06 - OwenMillingtonSyngeExCern