Advanced scheduling with job requirements: matching task and worker capabilities

The default scheduling mode in DIANE is first-in, first-out: a free worker picks the first free task in the queue. However, it is sometimes useful to make sure that certain tasks are processed only by certain workers. This may be achieved by a scheduling algorithm with requirement-capability matching.

For example, dataset A is available at site X, and dataset B at site Y. One task processes one dataset. Workers running at site X should only pick the tasks which process dataset A (otherwise the task would fail or be inefficient). To achieve this, worker agents running at site X should advertise their capability of processing dataset A, and the relevant tasks should define their requirement as the processing of dataset A. In the simplest case, the requirement and capability could each be represented as the string "A". The matching algorithm would then just compare the two strings and, if they are equal, allow the worker to pick the task.
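As an illustration, such string-based matching boils down to a single comparison (the function below is a hypothetical helper for this page, not part of the DIANE API):

  # minimal sketch of string-based matching (hypothetical helper, not a DIANE API)
  def matches(requirement, capability):
      # the worker may take the task only if the two strings are identical
      return requirement == capability

  print(matches("A", "A"))   # True:  a worker at site X may process dataset A
  print(matches("A", "B"))   # False: a worker at site Y must not pick this task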

Adding requirement-capability matching to your application

You may refer to the reference manual for a description of the DIANE classes used below.

Outline

You may add the matching functionality to an existing scheduler, such as the SimpleTaskScheduler, without much hassle by replacing the standard task queue. By default the SimpleTaskScheduler uses the TodoTaskQueue class, which manages the task queue with a standard Python Queue object. When the master selects a task for a worker it calls the queue.get(w) method, which returns a task object. The argument w is a WorkerEntry object, a hint telling the queue which worker is about to receive the task. The get() method is thus the hook where you may add the matching functionality, as sketched below.
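The sketch below illustrates the idea. Only the get(w) hook and the WorkerEntry hint follow from the description above; how the pending tasks are stored internally (the self.pending list) is an assumption for illustration, and imports of the DIANE classes are omitted because the exact module paths are not reproduced on this page.

  # illustrative sketch of a matching task queue -- internal storage is an assumption
  class MatchingTaskQueue(TodoTaskQueue):
      def get(self, w):
          # w is the WorkerEntry of the worker asking for work: scan the
          # pending tasks and hand out the first one this worker may process
          for task in list(self.pending):      # hypothetical attribute
              if self.matches(task, w):
                  self.pending.remove(task)
                  return task
          return None                          # nothing suitable: the worker stays idle

      def matches(self, task, w):
          # the requirement/capability comparison, shown in the example below
          raise NotImplementedError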

AtlasPilotJobs example

This example builds on the basic ExecutableApplication example. The source code is available in the repository: http://svnweb.cern.ch/world/wsvn/diane/trunk/apps/AtlasPilotJobs

Describing and implementing capability matching

Instead of a simple string we use Python's frozenset to describe capabilities and requirements. A capability matches a requirement if requirements.issubset(capabilities). For example, if a worker is capable of processing datasets ('A','B','C') and a task requires ('A','C'), then they match.
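A self-contained illustration of this rule:

  # requirement/capability matching with frozensets
  capabilities = frozenset(['A', 'B', 'C'])        # what the worker can process
  requirements = frozenset(['A', 'C'])             # what the task needs

  print(requirements.issubset(capabilities))       # True  -> the worker may take the task
  print(frozenset(['D']).issubset(capabilities))   # False -> no match, the task stays queued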

The PilotTodoTaskQueue class defines this matching algorithm. The AtlasPilotScheduler class is used simply to replace the queue implementation.

Defining worker capability

Worker capability is defined at worker startup in the initialize() method and is hardcoded in the AtlasPilotWorker class. Obviously, in real life the capability would be calculated dynamically by some other means (a hostname, an environment variable, a query to the data management system, etc.). The capability data structure is passed back as the result of the worker.initialize() call. This data is then transferred back to the master and kept as WorkerEntry.init_output. It is later made available to the task queue's get(w) method via the hint argument w.
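The sketch below only illustrates this flow; the base class name and the exact signature of initialize() are assumptions (the worker extends the ExecutableApplication worker), so consult the repository for the real AtlasPilotWorker.

  # illustrative sketch only -- see the AtlasPilotJobs repository for the real worker
  class AtlasPilotWorker(ExecutableWorker):        # base class name is an assumption
      def initialize(self, init_input):            # signature is an assumption
          # hardcoded here; in real life derived from the hostname, environment
          # variables or a query to the data management system
          capabilities = frozenset(['A', 'B', 'C'])
          # whatever is returned here ends up as WorkerEntry.init_output on the
          # master and is later visible to the task queue's get(w) hook
          return capabilities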

Our scheduler is derived from SimpleTaskScheduler and therefore inherits all of its other functionality and policies. In particular, by default it cleans up the data stored in WorkerEntry.init_output unless INIT_DATA_CLEANUP = False, so we change this default policy value in our AtlasPilotScheduler (see the sketch below).
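A sketch of the scheduler, assuming the queue instance is kept in a todo_tasks attribute and the policy object is reachable as self.policy (both names are assumptions; the real AtlasPilotScheduler is in the repository):

  # illustrative sketch only -- attribute names are assumptions
  class AtlasPilotScheduler(SimpleTaskScheduler):
      def __init__(self, *args, **kwds):
          SimpleTaskScheduler.__init__(self, *args, **kwds)
          # replace the standard TodoTaskQueue with the matching queue
          self.todo_tasks = PilotTodoTaskQueue()
          # keep WorkerEntry.init_output so the queue can use it for matching
          self.policy.INIT_DATA_CLEANUP = False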

Defining task requirements

Task requirements are simply set as the task.requirement attribute of the task data object (AtlasPilotTaskData). In our example we extend the ExecutableTaskData class and create an empty requirement in the constructor.
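A minimal sketch of such a task data class (the constructor of ExecutableTaskData is assumed to take no arguments):

  # sketch of the task data object carrying the requirement
  class AtlasPilotTaskData(ExecutableTaskData):
      def __init__(self):
          ExecutableTaskData.__init__(self)
          # empty by default; the run file fills it in for each task
          self.requirement = frozenset()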

The requirements are then set in the run file test/hello.run. For all tasks except one, the requirements are set to match the hardcoded worker capabilities.
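The fragment below only illustrates assigning the requirements; how tasks are created and registered inside the run file is not reproduced here, so refer to test/hello.run for the working version.

  # illustrative fragment only -- see test/hello.run for the real run file
  matching_task = AtlasPilotTaskData()
  matching_task.requirement = frozenset(['A', 'C'])   # matches the hardcoded capabilities

  unmatched_task = AtlasPilotTaskData()
  unmatched_task.requirement = frozenset(['D'])       # no worker advertises 'D', so it is never dispatched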

Running it

The example may be run out of the box:

  1. check out the code from SVN to ~/diane/apps
  2. set up the DIANE environment with diane-env -d bash
  3. cd ~/diane/apps/AtlasPilotJobs/test
  4. run the master: diane-run hello.run
  5. submit a local worker: ganga LocalSubmitter.py

The full log file is available in ~/diane/runs/.../master.log.

You will see that the master successfully dispatches all tasks except for one, which defines requirements that do not match the default capabilities advertised by the worker. To see what happens, search for "match" in the full log file. The master will process all but one task and will then wait for a worker with a matching capability. When you reach this stage, you may terminate the master (diane-master-ping kill).

-- JakubMoscicki - 14-Jan-2011
