
Scheduling for Responsive Grids

User-level scheduling

User-level (or application-level) scheduling is a virtualization layer on the application side: instead of being executed directly, the application is executed via an overlay scheduling layer (the user-level scheduler). The overlay scheduling layer runs as a set of regular user jobs and therefore operates entirely in user space (Fig. 1).


User-level scheduling requires neither modifications to the Grid middleware and infrastructure nor the deployment of special services at the Grid sites. It is therefore much easier to set up and operate a user-level scheduling system that exploits the full range of Grid sites available to a given user.


The user-level scheduling approach has the following constraints:

  • user jobs must be instrumented with the scheduling functionality;
  • jobs with user-level scheduling must compete on the same basis with all other jobs on the Grid, and therefore it is not possible to obtain stronger guarantees of resource availability than those provided by the underlying middleware.

A user-level scheduler may be embedded into the application or external to it. A scheduler embedded into the application is developed and optimized specifically for that application, typically by refactoring and instrumenting the original application code. This makes it possible to fine-tune and customize the scheduling according to the specific execution patterns of the application. Such a scheduler is intrusive at the source code level, which means that the scheduler code is hard to reuse and the development effort is high for each application.

A scheduler external to the application relies on general properties of the application, such as a particular parallel decomposition pattern (e.g. iterative decomposition, geometric decomposition or divide-and-conquer). An application adapter connects the external scheduler to the application at runtime. Depending on the decomposition pattern, refactoring of the application at the source code level may or may not be required. The disadvantage of external schedulers is that it may be very hard to generalize execution patterns for irregular or speculative parallelism. In such cases the development of a specialized embedded scheduler may be necessary.

In general, it is not possible to guarantee the availability of Grid resources with user-level scheduling techniques. Jobs instrumented with user-level scheduling obey the same resource allocation rules as regular jobs. Unless the middleware provides mechanisms for resource reservation, pre-emption or general processor scheduling with fast queues, hard QoS requirements can only be met by dedicating resources. The problem may be partially solved by a user-level scheduler that delays the execution of jobs until the requested number of resources is available. However, this strategy violates fair share in a multi-user environment.


On the other hand, user-level scheduling may improve the Quality of Service on the Grid in the following ways:

  • reduce the job turnaround time (makespan);
  • provide a sustained job output rate;
  • optimize the failure recovery.

In the next sections we examine two user-level schedulers: an external scheduler for generic master-worker applications (DIANE) and an embedded scheduler for medical image processing (gPTM3D).

DIANE: a generic, external scheduler


DIANE (DIstributed ANalysis Environment) is an R&D project developed in the Information Technology Department at CERN, Geneva. It is a generic user-level scheduler based on extended task farming (master/slave processing) [ref!]. The runtime behaviour of the framework, such as failure recovery or task dispatching, may be customized with a set of hot-pluggable policy functions. This enables fine-tuning of the scheduler to the needs of a particular application and also provides support for other parallel decomposition patterns (e.g. divide-and-conquer).
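
The idea of a hot-pluggable policy function can be illustrated with a minimal Python sketch. It is not DIANE's actual API: the names `Scheduler`, `set_policy`, `fifo_policy` and `lifo_policy`, as well as the policy signature, are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Optional

# Hypothetical policy signature (an assumption, not DIANE's interface):
# given the pending task ids and the id of a free worker, return the task
# to dispatch next, or None to leave the worker idle.
DispatchPolicy = Callable[[List[int], str], Optional[int]]


def fifo_policy(pending: List[int], worker_id: str) -> Optional[int]:
    """Dispatch tasks in submission order."""
    return pending[0] if pending else None


def lifo_policy(pending: List[int], worker_id: str) -> Optional[int]:
    """Dispatch the most recently submitted task first."""
    return pending[-1] if pending else None


class Scheduler:
    """Toy scheduler whose dispatching behaviour is a replaceable function."""

    def __init__(self, tasks: List[int], policy: DispatchPolicy = fifo_policy):
        self.pending = list(tasks)
        self.policy = policy

    def set_policy(self, policy: DispatchPolicy) -> None:
        """Swap the dispatching policy while the job is running."""
        self.policy = policy

    def next_task(self, worker_id: str) -> Optional[int]:
        """Ask the current policy which task the given worker should get."""
        task = self.policy(self.pending, worker_id)
        if task is not None:
            self.pending.remove(task)
        return task
```

Because the policy is an ordinary function object, it can be replaced between two dispatch decisions without restarting the scheduler; a failure recovery policy could be plugged in the same way.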


DIANE provides a Python-based framework and enables rapid integration with existing applications. Both transparent and intrusive application integration have been demonstrated. Data analysis in the Athena framework for the ATLAS experiment [ref], the Autodock-based Avian Flu drug search [ref] and the frequency compatibility analysis for the International Telecommunication Union RRC06 [ref] are all examples of transparent application integration, i.e. the application adapters were developed as Python packages without modifying the original application code. Examples of intrusive integration include particle simulation in medical physics using the Geant 4 toolkit [ref] and BLAST... The parallelization of these applications is based on iterative decomposition and the master/worker processing model with fully independent tasks.

Execution model

In the DIANE execution model, a temporary virtual Master/Worker overlay network is created for each user job and is destroyed when the job terminates. This is compatible with the multi-user fair-share scheduling on the Grid and guarantees that the resources are not monopolized by a single user.

The job is split into a number of tasks which are executed by a number of Worker agents in the Grid. The Worker agents run as regular Grid jobs. Each task is defined by a set of application-specific parameters. Dispatching is the process of allocating tasks to workers by sending the appropriate parameters to the Worker agents. The communication overhead is typically much smaller than in systems based on checkpointing and task migration, which makes a high message rate (incoming and outgoing tasks) achievable. For the ITU frequency analysis application, the combined dispatching/receiving rate reached peaks of 115 Hz.

Note for my paper: In the DIANE execution model, a virtual Master/Worker overlay network is created separately for each user job. The Master agent is started on a computer that accepts incoming external connections. The user job is activated and the Master agent splits the job into tasks. Independently, a set of Worker agents is started via a Grid User Interface. The address of the Master agent is shipped to each Worker agent via the Grid sandbox. When a Worker agent starts on a Grid Worker Node, it opens a persistent TCP/IP connection to the Master. The Master allocates tasks to free Workers by sending the task parameters (such as a loop index or configuration parameters). The exchanged messages are therefore typically small.
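
The pull-style dispatching described in this section can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: threads and an in-process queue stand in for Worker agents and TCP/IP connections, and the function name `run_master_worker` and its parameters are hypothetical rather than part of DIANE.

```python
import queue
import threading


def run_master_worker(task_params, work_fn, n_workers=4):
    """Run work_fn on every parameter set using a pool of pulling workers.

    The 'master' side only enqueues small parameter sets; each worker
    repeatedly pulls the next one, so faster workers process more tasks.
    """
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    # Master: a task is nothing more than an id plus its parameters.
    for task_id, params in enumerate(task_params):
        tasks.put((task_id, params))

    def worker():
        while True:
            try:
                task_id, params = tasks.get_nowait()
            except queue.Empty:
                return  # no tasks left, the worker agent terminates
            output = work_fn(params)
            with lock:
                results[task_id] = output

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Return the outputs in task order, independent of completion order.
    return [results[i] for i in range(len(task_params))]
```

For example, `run_master_worker(range(10), lambda i: i * i, n_workers=3)` uses a loop index as the only task parameter, which mirrors the small messages mentioned above.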

High-granularity splitting

DIANE supports high-granularity job splitting, i.e. partitioning a job into a large number of short or very short tasks. For example, the ITU frequency compatibility analysis jobs were split into approximately 50 thousand tasks processed by around 300 worker agents running simultaneously at 6 EGEE Grid sites across Europe (Fig. 2). Task duration was highly variable: from a few seconds (the majority of the tasks) to 30 minutes (a few individual tasks). The exact distribution of the task durations was not known until the job had been fully executed, so it was not possible to agglomerate short tasks and isolate long tasks a priori. Without dynamic load balancing the total job turnaround time would have been orders of magnitude higher.
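
The effect of dynamic load balancing on the turnaround time can be illustrated with a toy Python model. It is a sketch under strong assumptions (no queuing, communication or failure overheads, and made-up task durations), and the function names are hypothetical; it does not reproduce the ITU measurements.

```python
import heapq


def static_makespan(durations, n_workers):
    """Makespan when tasks are cut into equal-count contiguous chunks
    before execution, one chunk per worker."""
    chunk = -(-len(durations) // n_workers)  # ceiling division
    return max(
        sum(durations[i:i + chunk]) for i in range(0, len(durations), chunk)
    )


def dynamic_makespan(durations, n_workers):
    """Makespan when each worker pulls the next task as soon as it is free."""
    free_at = [0.0] * n_workers  # time at which each worker becomes idle
    heapq.heapify(free_at)
    for d in durations:
        earliest = heapq.heappop(free_at)  # next worker to become free
        heapq.heappush(free_at, earliest + d)
    return max(free_at)


# Made-up workload: four 30-minute tasks followed by 96 five-second tasks.
durations = [1800] * 4 + [5] * 96
print(static_makespan(durations, 4))   # all long tasks land on one worker
print(dynamic_makespan(durations, 4))  # long tasks spread over the workers
```

In this toy workload the static split puts all four long tasks in the same chunk, while the pull model spreads them over the workers without knowing the durations in advance. The gap grows with the number of tasks and the skew of the duration distribution.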


Output rate

User-level scheduling provides a more sustained job output rate. Fig. 3 shows the number of completed tasks as a function of time with (red) and without (green) user-level scheduling for a Geant 4 release validation application. The job was split into 207 tasks and the average task duration was around 400 seconds. In the Grid, the load on the Computing Elements (queuing time) and the load on the Resource Broker (efficiency of matchmaking) may change dynamically over short periods of time. The user-level scheduler ensures that the job output rate remains stable even if the number of effectively available resources is low and varying, provided that the job is split with high granularity.


Error recovery

Efficient and accurate failure recovery is an important factor in the Quality of Service. Large distributed systems such as the Grid are prone to diverse configuration and system errors. No generic error-handling strategy exists: the appropriate strategy depends on the application as well as on the environment. An application-oriented scheduler such as DIANE is capable of distinguishing application errors from system errors and of reacting appropriately via customizable error recovery methods. Crashed worker agents are automatically removed from the worker pool. Transient connectivity problems in the WAN are detected. Failed tasks are automatically re-dispatched to other worker agents. The mechanism uses the direct, highly efficient communication links of the virtual Master/Worker network and is much more efficient than the standard metascheduling techniques implemented in the middleware (the JDL RetryCount parameter), which involve the full submission cycle.
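
The recovery behaviour described above can be sketched as follows. This is a simplified, sequential illustration rather than DIANE's implementation: the exception classes, the function `run_with_recovery` and the `execute` callback are hypothetical names, and the choice not to retry application errors is an assumed policy.

```python
from collections import deque


class ApplicationError(Exception):
    """The task itself is faulty; running it elsewhere would not help."""


class WorkerFailure(Exception):
    """The worker crashed or lost its connection to the master."""


def run_with_recovery(tasks, workers, execute):
    """Dispatch tasks to workers, recovering from worker failures.

    execute(worker, task) returns the task output or raises one of the
    two exceptions above. Returns (done, failed, not_run).
    """
    pending = deque(tasks)
    pool = deque(workers)
    done, failed = {}, {}

    while pending and pool:
        task = pending.popleft()
        worker = pool.popleft()
        try:
            done[task] = execute(worker, task)
            pool.append(worker)  # worker is healthy, return it to the pool
        except WorkerFailure:
            # System error: drop the worker and re-dispatch the task.
            pending.appendleft(task)
        except ApplicationError as exc:
            # Application error: record it and keep the worker.
            failed[task] = str(exc)
            pool.append(worker)

    # Tasks left over if every worker has failed.
    return done, failed, list(pending)
```

A failed task goes straight back to the head of the pending queue and is handed to the next free worker, so recovery costs one extra dispatch message instead of a full Grid resubmission.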

Part of the recent Avian Flu drug search was performed using the DIANE scheduler. A Master agent running for several weeks took care of efficient error recovery, and the system was operated by a single person. Because of the long duration of the job, worker agents were periodically aborted when they exceeded the time limits of the queues at the Computing Elements. The operator kept adding new worker agents to the system so that at least 200 were available at any time. DIANE was able to dynamically reconfigure the virtual Master/Worker network to accommodate the new worker agents.

Topic revision: r4 - 2006-06-26 - JakubMoscicki