User Level Scheduling in the Grid: an outline of the technology potential and research directions

User Level Scheduling (ULS) techniques have been applied very successfully in a
number of application areas such as bio-informatics, image processing,
telecommunications and physics simulation. ULS has helped to improve the
Quality of Service (QoS) in the Grid, experienced as reduced job turnaround
time, more efficient usage of resources, and more predictable, reliable and
stable application execution. To be universally adopted, however, ULS
techniques must be shown to be compatible with the fundamental assumptions of
the Grid computing model, such as respect for resource usage policies, fair
share towards other users, traceability of users' activities and so on. In this
talk we will outline the main benefits and possible pitfalls of ULS techniques.
We will also introduce initial research ideas for modeling and measuring the
Quality of Service on the Grid and for analyzing the impact of ULS on other
users (fair share). Finally, we will present ideas for enhanced support of
certain applications such as iterative algorithms or parameter sweeps.

This presentation builds on "The Quality of Service on the Grid with
user-level scheduling", presented at UvA on September 1st, 2006.


Slide 1: Table of Contents

User Level Scheduling in the Grid

an outline of the technology potential and research directions
  • Introduction
    • the context of my work
    • summary of the activities so far
    • short description of what ULS is
  • Research ideas
    • definition of QoS metrics
    • analysis of the impact: fair share
  • Development ideas
    • enhanced support for iterative and parameter-sweep applications

Slide 2: Recall of the current activities

Slide 3: Placeholders and late binding

  • the technology is also called: placeholder, late binding, pilot jobs
  • you do not send a specific job to the resource: you acquire the resource, assign the job at runtime, and free the resource when you are done
  • some examples:
    • HEP production systems (centralized task queue, server acts on behalf of the user): AliEn (ALICE), DIRAC (LHCb), PanDA (ATLAS)
    • Condor glide-ins (build a virtual Condor pool from Globus resources)
    • BOINC (CPU cycle scavenging)

Slide 4: User Level Scheduling

  • it is a late-binding technology
  • the scheduler has the application knowledge (may make better error recovery or load balancing decisions)
  • it runs in the user space (resource usage remains accountable and traceability is not compromised)
  • it is capable of creating transient/volatile overlays on top of the regular infrastructure ("virtual clusters")
  • DIANE implementation:
    • Master/Worker architecture with customizable task dependencies, constraints and fault tolerance (see the sketch below)
    • not specific to any particular technology or infrastructure (Grid, LSF/PBS, explicit IP host list + mixing of the resources)
    • portable (Python and CORBA)
    • self-contained and small distribution with fully automatic installation
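
As an illustration of the Master/Worker pattern above, here is a minimal Python
sketch (this is not the DIANE API; the class and function names are
hypothetical). Workers repeatedly pull tasks from the master until the task
queue is empty, so the task-to-resource binding happens only at run time:

    # Minimal Master/Worker sketch; threads stand in for remote worker agents.
    import queue
    import threading

    class Master:
        def __init__(self, tasks):
            self.todo = queue.Queue()
            for t in tasks:
                self.todo.put(t)
            self.results = []
            self.lock = threading.Lock()

        def next_task(self):
            try:
                return self.todo.get_nowait()   # late binding: tasks are assigned at run time
            except queue.Empty:
                return None

        def put_result(self, result):
            with self.lock:
                self.results.append(result)

    def worker(master, apply_fn):
        # a worker keeps pulling tasks until the master has none left
        while True:
            task = master.next_task()
            if task is None:
                break
            master.put_result(apply_fn(task))

    if __name__ == "__main__":
        m = Master(tasks=range(20))
        threads = [threading.Thread(target=worker, args=(m, lambda x: x * x))
                   for _ in range(4)]
        for t in threads: t.start()
        for t in threads: t.join()
        print(sorted(m.results))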

Slide 5: Outstanding issues of User Level Scheduling

  • Improvement of QoS characteristics
    • extra reliability (fail-safety and application-specific fine tuning)
    • reduction of turnaround time
    • predictability of job execution (and more stable output inter-arrival rate)
  • Potential flaws
    • effect on fair-share: would other users be penalized by ULS jobs?
    • potential harmfulness of redundant batch requests

Slide 6: Area of applicability

  • communication non-intensive applications:
    • GOOD: data analysis, physics simulation (independent tasks), image processing (iterative), docking (parameter sweep)
    • BAD: FEM-style simulations with thousands of cells communicating every few compute cycles
  • the problem must be partitioned into parametric tasks - units of execution
    • parameters are passed around to start an application execution unit, so there is no classic task migration

Slide 7: Research Directions

Slide 8: Grid slot model

  • slot i is defined by: tq(i) = queuing time, ta(i) = available computing time (wallclock), p(i) = weight (power) of the resource
  • W is the total workload of the job
  • (figure: DIANE-job-anatomy.png)

Slide 9: Estimation of the number of needed slots

  • We can derive N (the minimal number of slots needed) from this equation:
    \[ W + \mathrm{overhead} = \sum_{i=1}^{N} ta_i \cdot p_i \]
  • overhead represents the ULS overhead (networking, scheduling) and the adjustment for the unit of execution
  • additional complication: p(i) may change with time (time-sharing on the worker node with other jobs)
  • either a largely redundant slot acquisition (the current approach) or a capability to acquire more resources on demand while the job is running (in the future)
  • currently we make a rough estimate of the total CPU demand and then request roughly double the number of slots, assuming some average processor power (a largely fictitious figure); see the sketch after this list
    • initial estimation could be done from the Grid information system
    • the application-specific benchmarks (so the measured w) could be stored externally and reused in subsequent runs
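
A minimal sketch of this estimate, assuming we are given rough per-slot
estimates of the available wallclock time ta and the relative power p, plus the
workload W and the ULS overhead; taking the "largest" slots first is an
illustrative choice, not the current DIANE behaviour:

    # Minimal N such that sum(ta_i * p_i) over the chosen slots covers W + overhead.
    def minimal_slots(workload, overhead, slots):
        """slots: iterable of (ta, p) pairs - available wallclock time and relative power."""
        needed = workload + overhead
        acquired = 0.0
        # greedily take the most "capable" slots first so that N is minimal
        for n, (ta, p) in enumerate(sorted(slots, key=lambda s: s[0] * s[1], reverse=True), 1):
            acquired += ta * p
            if acquired >= needed:
                return n
        return None   # the available slots cannot cover the workload

    if __name__ == "__main__":
        # made-up slot estimates, e.g. taken from the information system
        candidate_slots = [(3600, 1.0), (7200, 0.5), (1800, 2.0), (3600, 1.5)]
        print(minimal_slots(workload=10000, overhead=500, slots=candidate_slots))   # -> 3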

Slide 10: Estimation of the turnaround time

  • Currently we do not predict tq, the queuing time; however:
    • promising techniques exist (e.g. BMBP Binomial Method Batch Predictor)
      • relying on long traces from batch-queue logs + parallel workload archives
      • using order-based statistics (quantiles) rather than the moment-based statistics (mean, stddev)
      • compared to model-based predictors (which assume a certain model of the queue), BMBP arguably:
        • has a similar success rate but more accurate predictions (less "over-prediction")
    • we have a wealth of monitoring data from user activities (ARDA Dashboard e.g. http://lxarda02.cern.ch:8088/atlas)
    • additionally we try to capture the 'background' by sending very short jobs a few times daily to monitor the responsiveness of the system (in 3 different VOs)
  • Assuming that we can get reliable estimates of tq on the Grid (is it possible? has it been done yet?)
  • In real applications the user may also be interested in partial output (e.g. histograms)
  • The main quantity: P(k,t) = the probability of completing k% of the run in time not greater than t (similar to modelling the queuing time in BMBP); a simple empirical sketch follows below
  • P depends on how k is defined:
    • whether all units of execution are equal and, if not, what their size distribution is
    • whether there are synchronization patterns or the units are independent
  • P depends also on other parameters:
    • changes in the load of the worker nodes and the failure rate of the worker nodes
    • whether there are constraints: may any worker node process any task, or are some restricted?
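
A much simplified sketch of the order-statistics idea, assuming only raw
historical samples are available: an empirical quantile bound on tq and an
empirical estimate of P(k,t) from past runs. The real BMBP additionally puts
binomial confidence bounds on the quantile; all numbers below are made up.

    import math

    def empirical_quantile(samples, q):
        """Return the q-quantile (0 < q < 1) of observed queue waiting times."""
        xs = sorted(samples)
        k = min(len(xs) - 1, max(0, math.ceil(q * len(xs)) - 1))
        return xs[k]

    def completion_probability(runs, k_fraction, t):
        """Empirical P(k, t): fraction of past runs in which at least k_fraction
        of the tasks finished within time t.  runs is a list of per-run lists of
        task completion times (seconds since the run started)."""
        ok = sum(1 for run in runs
                 if sum(1 for ft in run if ft <= t) >= k_fraction * len(run))
        return ok / len(runs)

    if __name__ == "__main__":
        waits = [30, 45, 600, 90, 120, 3600, 60, 75, 200, 15]
        print("95% quantile bound on tq:", empirical_quantile(waits, 0.95), "s")
        runs = [[100, 200, 300, 400], [150, 250, 900, 1200], [80, 90, 100, 110]]
        print("P(k=75%, t=400s) =", completion_probability(runs, 0.75, 400))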

Slide 11: Predictability of the job execution

  • the coefficient of variation of the job output rate (see the sketch below)
  • illustrate with the Geant4 (G4) graphs
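
A minimal sketch of this predictability measure, assuming the output arrival
times are available from monitoring; the timestamps below are made up:

    # Coefficient of variation (stddev / mean) of the output inter-arrival times.
    import statistics

    def coefficient_of_variation(timestamps):
        """timestamps: times at which partial outputs (e.g. histograms) arrived."""
        gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
        return statistics.stdev(gaps) / statistics.mean(gaps)

    if __name__ == "__main__":
        arrivals = [0, 55, 130, 180, 260, 310]
        print("CV of output inter-arrival times:", round(coefficient_of_variation(arrivals), 3))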

Slide 12: Fault-tolerance and measure of the reliability

  • Reliability should be measured "on the application's reliability background": minimize the infrastructure faults which have no relation to application problems
    • If the application has an intrinsic problem then there is not much we can do (e.g. a segfault on some input data)
    • If there is a configuration problem at a site, a middleware inefficiency or a worker node crash, then we can enhance the reliability of the system as observed by the user by providing fault tolerance
    • Additionally, we can customize the fault-tolerance behaviour
  • How to measure it?
    • Classify the faults, disregard the intrinsic application faults; the ratio of the remaining failures gives the reliability measure (see the sketch below)
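
A minimal sketch of this measure, assuming faults can be classified into
application-intrinsic and infrastructure categories; the category names below
are illustrative, not a DIANE classification:

    # Measure reliability "on the application's reliability background":
    # intrinsic application faults are excluded, the remaining infrastructure
    # faults define the observed (un)reliability.
    APPLICATION_FAULTS = {"segfault", "bad_input"}
    INFRASTRUCTURE_FAULTS = {"site_misconfigured", "middleware_error", "worker_node_crash"}

    def reliability(task_outcomes):
        """task_outcomes: list of None (success) or a fault label (failure)."""
        relevant = [o for o in task_outcomes if o not in APPLICATION_FAULTS]
        failures = sum(1 for o in relevant if o in INFRASTRUCTURE_FAULTS)
        return 1.0 - failures / len(relevant)

    if __name__ == "__main__":
        outcomes = [None, None, "worker_node_crash", None, "segfault", "middleware_error", None]
        print("reliability w.r.t. infrastructure faults:", round(reliability(outcomes), 2))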

Slide 13: Fair share

  • would other users be penalized by ULS jobs?
  • would fair share policies defined in the CE be compromised?
  • effect on fair-share:
    • fair-share can be measured (find the paper)
    • can be modeled and simulated
  • potential harmfulness of redundant batch requests
    • purely redundant requests (submit n, execute 1, cancel n-1) have been studied (-> paper)
      • jobs which do not use redundant requests are penalized (the stretch increases linearly with the number of jobs using redundant requests)
      • the load on the middleware may be a problem
    • ULS has a certain degree of redundancy (submit n, execute k, cancel n-k)
      • measure the harmfulness in this case
      • how to cope with it: a meta-submitter would steadily increase the number of submissions according to need (see the sketch below)
      • this is clearly in conflict with minimizing the global turnaround time (QoS); what should the balance be?
    • estimate the level of redundancy
    • what can be done: artificial restriction of the slots (either static or dynamic)
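
A minimal sketch of the adaptive meta-submitter policy mentioned above,
assuming the acceptable redundancy is expressed as a single factor; the policy
and the numbers are illustrative, not an existing implementation:

    # Keep the number of outstanding pilot requests proportional to the
    # remaining work instead of submitting a large redundant batch up front.
    def pilots_to_submit(tasks_remaining, pilots_running, pilots_queued, redundancy=1.2):
        """Return how many additional pilot jobs to submit right now."""
        target = int(redundancy * tasks_remaining)      # allow a limited redundancy
        outstanding = pilots_running + pilots_queued
        return max(0, target - outstanding)

    if __name__ == "__main__":
        # e.g. 50 tasks left, 30 pilots running, 10 still waiting in the batch queues
        print(pilots_to_submit(tasks_remaining=50, pilots_running=30, pilots_queued=10))   # -> 20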

Slide 14: Other potential pitfalls

  • Very large activities using ULS could potentially block one another from using the Grid for a certain time.

Slide 15: Other ideas

  • Prediction of the availability of the worker nodes based on (semi) Markov Processes
    • used quite successfully with the FGCS (Fine-Grained Cycle Scavenging) system
    • tested with iShare: a peer-to-peer decentralized internet sharing system
    • host availability modeled as the states of a Markov process, subject to interactive (and chaotic) user activity (a minimal sketch follows below)
  • Could this be used?
    • jobs on the grid have identifiable states
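
A minimal sketch of the idea, assuming a plain two-state Markov chain for host
availability (the cited cycle-scavenging work uses richer semi-Markov models);
the transition probabilities are made up:

    # Host availability as a two-state Markov chain (AVAILABLE / BUSY).
    AVAILABLE, BUSY = 0, 1
    # P[i][j] = probability of moving from state i to state j in one time step
    P = [[0.9, 0.1],
         [0.3, 0.7]]

    def availability_after(steps, start=AVAILABLE):
        """Probability that the host is AVAILABLE after `steps` transitions."""
        dist = [0.0, 0.0]
        dist[start] = 1.0
        for _ in range(steps):
            dist = [sum(dist[i] * P[i][j] for i in range(2)) for j in range(2)]
        return dist[AVAILABLE]

    if __name__ == "__main__":
        for n in (1, 5, 20):
            print(f"P(available after {n} steps) = {availability_after(n):.3f}")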

Slide 16: Iterative applications

  • A new application has recently been deployed with DIANE
    • xmipp - X-Window-based Microscopy Image Processing Package
    • image alignment on the Grid
    • one step (iteration) is a full Master/Worker job
    • the stopping criterion is defined by the Master
    • the workers are reused across many iterations, giving a great speed-up (see the sketch below)
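
A minimal sketch of the iterative pattern, with the alignment step replaced by
a dummy computation; this is not the Xmipp or DIANE code, only an illustration
of worker reuse under a master-side stopping criterion:

    import random

    def refine(params):
        # hypothetical per-worker task: slightly improves the current estimate
        return params * 0.5 + random.uniform(-0.001, 0.001)

    def run_iteration(n_workers, params):
        # in DIANE these tasks would run on the (reused) remote workers;
        # here they simply run in-process
        return [refine(params) for _ in range(n_workers)]

    if __name__ == "__main__":
        random.seed(1)
        estimate, previous, tol = 1.0, None, 1e-2
        # the stopping criterion is evaluated by the master after each iteration
        while previous is None or abs(estimate - previous) >= tol:
            results = run_iteration(n_workers=4, params=estimate)
            previous, estimate = estimate, sum(results) / len(results)
        print("converged estimate:", round(estimate, 4))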

Slide 17: White-board for parameter sweep

  • Autodock application: searching for drug candidates in a huge compound/ligand matrix
  • Do a smarter search than just all combinations (or at least do all combinations in a smarter order)
  • Some rows or columns of the matrix may be more interesting than others
  • A collaboration whiteboard: select interesting areas and assign a master for the job
  • If one master terminates, its workers may join another master (a sketch of a prioritized sweep order follows below)
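
A minimal sketch of a "smarter order" for the sweep, assuming the whiteboard
boils down to priority scores attached to rows (compounds) and columns
(ligands); the scoring scheme and the names are illustrative:

    import itertools

    def ordered_sweep(compounds, ligands, interest):
        """interest: dict mapping a compound or ligand name to a priority score."""
        pairs = itertools.product(compounds, ligands)
        # highest combined interest first, instead of plain row-by-row order
        return sorted(pairs,
                      key=lambda cl: interest.get(cl[0], 0) + interest.get(cl[1], 0),
                      reverse=True)

    if __name__ == "__main__":
        compounds = ["c1", "c2", "c3"]
        ligands = ["l1", "l2"]
        interest = {"c2": 5, "l1": 2}          # marked as promising on the whiteboard
        for pair in ordered_sweep(compounds, ligands, interest):
            print(pair)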

Slide 18: Development Directions

  • sending additional information (benchmarks, constraints, worker group identification)
  • dynamic change of connection-oriented / connection-less policy
  • support of the constraints layer
  • dynamic task decomposition
    • managing the message rate at the master by grouping work into larger tasks which may be dynamically decomposed into smaller units (see the sketch below)
  • a directory service: late master binding, switching to multiple masters at runtime
  • checkpointing of the jobs to be able to resume an interrupted master
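
A minimal sketch of dynamic task decomposition, assuming coarse chunks are
split in half whenever there are more idle workers than queued chunks; the
split rule and the chunk sizes are illustrative assumptions:

    def make_chunks(units, chunk_size):
        """Group fine-grained units of execution into coarse tasks (low message rate)."""
        return [units[i:i + chunk_size] for i in range(0, len(units), chunk_size)]

    def split(chunk):
        """Decompose one coarse task into two smaller ones."""
        mid = len(chunk) // 2
        return [chunk[:mid], chunk[mid:]] if mid else [chunk]

    if __name__ == "__main__":
        pending = make_chunks(list(range(20)), chunk_size=8)   # [0..7], [8..15], [16..19]
        idle_workers = 5
        # split the largest chunk until every idle worker can get something to do
        while len(pending) < idle_workers and any(len(c) > 1 for c in pending):
            pending.sort(key=len, reverse=True)
            pending = split(pending.pop(0)) + pending
        print([len(c) for c in pending])   # -> [4, 4, 4, 4, 4]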

-- JakubMoscicki - 06 Dec 2006
