Difference: AmsterdamSeminarDecember2006 (4 vs. 5)

Revision 5 2006-12-07 - JakubMoscicki

Line: 1 to 1
 
META TOPICPARENT name="JakubMoscicki"

User Level Scheduling in the Grid: an outline of the technology potential and the research directions

Line: 24 to 24
 user-level scheduling" presented at UvA on September 1st 2006.
Added:
>
>
Start Presentation PhDSlideTemplate SlideShowPlugin

Slide 1: Table of Contents

User Level Scheduling in the Grid

an outline of the technology potential and the research directions
  • Introduction
    • the context of my work
    • summary of the activities so far
    • short description of what ULS is
  • Research ideas
    • definition of QoS metrics
    • analysis of the impact: fair share
  • Development ideas
    • enhanced support for iterative and parameter-sweep applications

 

Recall of the current activities

Placeholders and late binding

Line: 40 to 54
 
  • it runs in user space (resources remain accountable and traceability is not compromised)
  • it is capable of creating transient/volatile overlays on top of the regular infrastructure ("virtual clusters")
  • DIANE implementation:
Added:
>
>
    • Master/Worker architecture with customizable task dependencies, constraints and fault tolerance (a toy sketch follows this list)
 
    • not specific to any particular technology or infrastructure (Grid, LSF/PBS, explicit IP host list + mixing of the resources)
    • portable (python and CORBA)
    • self-contained and small distribution with fully automatic installation
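A minimal sketch of the late-binding idea behind such placeholders, in plain Python with threads standing in for Grid worker agents; the names Master and worker are illustrative only, not the DIANE API:

<verbatim>
# Toy late-binding master/worker: threads stand in for Grid placeholder
# jobs; a task is bound to a worker only when the worker pulls it.
import queue
import threading

class Master:
    def __init__(self, tasks):
        self.todo = queue.Queue()
        for t in tasks:
            self.todo.put(t)
        self.results = []
        self.lock = threading.Lock()

    def next_task(self):
        try:
            return self.todo.get_nowait()  # late binding happens here
        except queue.Empty:
            return None  # no more work: the placeholder terminates

    def put_result(self, result):
        with self.lock:
            self.results.append(result)

def worker(master, apply_fn):
    # a placeholder: once started on a node, it keeps pulling tasks
    while True:
        task = master.next_task()
        if task is None:
            return
        master.put_result(apply_fn(task))

master = Master(tasks=range(100))
threads = [threading.Thread(target=worker, args=(master, lambda x: x * x))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(master.results), "results collected")
</verbatim>

The point of the pattern is that a task is bound to a concrete resource only when a placeholder has actually started and pulls work, not at submission time.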
Line: 48 to 63
 
  • Improvement of QoS characteristics
    • extra reliability (fail-safety and application-specific fine tuning)
    • reduction of turnaround time
Changed:
<
<
    • stabilization of the output inter-arrival rate (which is also more predictable)
>
>
    • predictability of job execution (and more stable output inter-arrival rate)
 
  • Potential flaws
    • effect on fair-share: would other users be penalized by ULS jobs?
    • potential harmfulness of the redundant batch requests

Area of applicability

Added:
>
>
  • communication non-intensive applications:
    • GOOD: data analysis, physics simulation (independent tasks), image processing (iterative), docking (parameter sweep)
    • BAD: FEM with thousands of cells communicating every few computing cycles
  • ability to partition the problem into parametric tasks - units of execution
    • parameters are passed around to start an application execution unit, so there is no classic task migration
 

Research Directions

Changed:
<
<

Grid slot model

>
>

Grid slot model

 
  • slot i is defined by: tq(i) = queuing time, ta(i) = available computing time (wallclock), p(i) = weight (power) of the resource
  • W is the total work-load of the job
Changed:
<
<
Estimation of the number of needed slots
>
>
  • DIANE-job-anatomy.png

Estimation of the number of needed slots

 
  • We can derive N (the minimal number of slots needed) from this equation:
    \[W + \mathrm{overhead} = \sum_{i=1}^{N} ta_{i} \cdot p_{i}\]
  • overhead represents the ULS overhead (networking, scheduling) and the adjustment for the unit of execution
  • additional complication: p(i) may change with time (time-sharing on the worker node with other jobs)
  • either a largely redundant slot acquisition (the case now) or adding a capability to acquire more resources on demand while the jobs are running (in the future)
  • currently we make a rough estimate of the total CPU demand and then request roughly double that number of slots, assuming some average processor power (largely fictitious); a sketch of the slot-count estimate follows this list
Added:
>
>
    • initial estimation could be done from the Grid information system
    • the application-specific benchmarks (so the measured w) could be stored externally and reused in subsequent runs
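A sketch of the slot-count estimate implied by the equation above, assuming per-slot estimates (ta_i, p_i) are available; the slot values below are invented for illustration:

<verbatim>
# Derive N, the minimal number of slots, from
# W + overhead <= sum_{i=1..N} ta_i * p_i.
def minimal_slots(W, overhead, slots):
    """slots: list of (ta, p) pairs; ta = available wallclock time,
    p = relative power of the resource."""
    capacity = 0.0
    # greedily take the highest-capacity slots first to minimize N
    for n, (ta, p) in enumerate(sorted(slots, key=lambda s: -s[0] * s[1]), 1):
        capacity += ta * p
        if capacity >= W + overhead:
            return n
    return None  # the available slots cannot cover the workload

slots = [(3600, 1.0), (7200, 0.5), (1800, 2.0), (3600, 1.2)]
print(minimal_slots(W=10000, overhead=500, slots=slots))  # -> 3
</verbatim>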
 
Changed:
<
<
Estimation of the turnaround time
>
>

Estimation of the turnaround time

 
  • Currently we do not predict tq, the queuing time; however:
Changed:
<
<
    • promising techniques exist (e.g. BMBP Binomial Method Batch Predictor) -> relying on long traces from batch-queue logs + parallel workload archives
    • we have a wealth of monitoring data (Dashboard)
    • we try to capture the 'background' by sending very short jobs a few times daily to monitor the responsiveness of the system (in 3 different VOs)
  • Provided that we can get reliable estimates on tq on the Grid (which has not been tried yet, AFAIK)
>
>
    • promising techniques exist (e.g. BMBP Binomial Method Batch Predictor)
      • relying on long traces from batch-queue logs + parallel workload archives
      • using order-based statistics (quantiles) rather than the moment-based statistics (mean, stddev)
      • compared to model-based predictors (which assume a certain model of the queue), BMBP arguably:
        • has a similar success rate but more accurate predictions (less "over-prediction")
    • we have a wealth of monitoring data from user activities (ARDA Dashboard e.g. http://lxarda02.cern.ch:8088/atlas)
    • additionally we try to capture the 'background' by sending very short jobs a few times daily to monitor the responsiveness of the system (in 3 different VOs)
  • Assuming that we can get reliable estimates of tq on the Grid (is it possible? has it been done yet?)
 
  • In real applications the user may also be interested in partial output (e.g. histograms)
Added:
>
>
  • The main parameter: P(k,t) = the probability of completing k% of the run in time not greater than t (similar to modelling the queuing time in BMBP); a Monte Carlo sketch follows this list
  • P depends on how k is defined:
    • whether all units of execution are equal, or on their distribution if they are not
    • whether there are synchronization patterns or the units are independent
  • P also depends on other parameters:
    • changes in the load of the worker nodes and their failure rate
    • whether there are constraints: may any worker node process any task, or are some restricted
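A Monte Carlo sketch of estimating P(k,t), under strong simplifying assumptions: independent equal tasks with exponential durations, a fixed per-task failure probability, and invented parameter values:

<verbatim>
# Estimate P(k, t): the probability of completing k% of the tasks
# within time t, for n_workers pulling tasks in parallel.
import random

def p_k_t(k, t, n_tasks=100, n_workers=10, mean_task=60.0,
          fail_prob=0.05, trials=2000):
    need = max(1, int(k / 100.0 * n_tasks))
    hits = 0
    for _ in range(trials):
        finish = [0.0] * n_workers     # next free time of each worker
        done_times = []
        for _ in range(n_tasks):
            w = finish.index(min(finish))  # task pulled by first free worker
            dt = random.expovariate(1.0 / mean_task)
            while random.random() < fail_prob:
                dt += random.expovariate(1.0 / mean_task)  # retry on failure
            finish[w] += dt
            done_times.append(finish[w])
        done_times.sort()
        if done_times[need - 1] <= t:  # time when the need-th task finished
            hits += 1
    return hits / trials

print(p_k_t(k=90, t=900))
</verbatim>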

Predictability of the job execution

  • the coefficient of variation of the job output as a measure of predictability (a toy sketch follows below)
  • illustrate with the Geant4 graphs
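For reference, a minimal sketch of the metric, with invented arrival times rather than real Geant4 data:

<verbatim>
# Coefficient of variation (stddev / mean) of the output
# inter-arrival times: the lower, the more predictable the output rate.
import statistics

def coefficient_of_variation(arrival_times):
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    return statistics.stdev(gaps) / statistics.mean(gaps)

print(coefficient_of_variation([0, 55, 118, 171, 240, 296]))  # ~0.11
</verbatim>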
 
Changed:
<
<
Fault-tolerance and measure of the reliability
>
>

Fault-tolerance and measure of the reliability

 
  • Reliability should be measured against the application's own reliability background: minimize the infrastructure faults which have no relation to application problems
Changed:
<
<
    • If application has an intrinsic problem then there is not so much we can do
    • If there is a configuration problem on the sites, then we can enhance reliability of the system as observed by the user by providing fault-tolerance
>
>
    • If the application has an intrinsic problem then there is not much we can do (e.g. a segfault on some data)
    • If there is a configuration problem on a site, a middleware inefficiency, or a worker node crash, then we can enhance the reliability of the system as observed by the user by providing fault-tolerance
 
    • Additionally, we can customize the fault-tolerance behaviour
  • How to measure it?
    • Classify the faults and disregard the intrinsic application faults; the ratio of the remaining failures is the reliability measure (a sketch follows below)
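A sketch of such a measure, with hypothetical fault categories:

<verbatim>
# Measure reliability against the application's own fault background:
# intrinsic application faults are excluded before computing the ratio.
INTRINSIC = {"application-segfault", "bad-input-data"}

def reliability(outcomes):
    """outcomes: list of None (success) or a fault-category string."""
    relevant = [o for o in outcomes if o not in INTRINSIC]
    failures = sum(1 for o in relevant if o is not None)
    return 1.0 - failures / len(relevant)

outcomes = [None, None, "site-misconfiguration", None,
            "application-segfault", "worker-node-crash", None]
print(reliability(outcomes))  # 2 infrastructure faults out of 6 relevant
</verbatim>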
Changed:
<
<

Fair share

>
>

Fair share

 
  • would other users be penalized by ULS jobs?
  • would fair share policies defined in the CE be compromised?
  • effect on fair-share:
    • fair-share can be measured (find the paper)
    • can be modeled and simulated
  • potential harmfulness of the redundant batch requests
Changed:
<
<
    • pure redundant requests (submit n, execute 1, cancel n-1) have been studied (->):
>
>
    • pure redundant requests (submit n, execute 1, cancel n-1) have been studied (-> paper)
 
      • jobs which do not use redundant requests are penalized (stretch increases linearly with the number of jobs using redundant requests)
      • load on middleware may be a problem
    • ULS has a certain degree of redundancy (submit n, execute k, cancel n-k)
Line: 98 to 136
 
      • how to cope with it: a meta-submitter would steadily increase the number of submissions according to need (a toy sketch follows below)
      • this is clearly in conflict with minimizing the global turnaround time (QoS); what should the balance be?
    • estimate the level of redundancy
Added:
>
>
    • what can be done: artificial restriction of the slot (either static or dynamic)
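A toy sketch of the incremental meta-submitter idea, using an invented queue model in which each pending request starts in a given round with a fixed probability:

<verbatim>
# Submit in small increments until enough slots are running, instead of
# one large redundant batch up front; leftover pending requests are the
# redundancy that would be cancelled.
import random

def adaptive_submit(needed, batch=2, start_prob=0.3, max_rounds=100):
    pending = running = submitted = rounds = 0
    while running < needed and rounds < max_rounds:
        pending += batch
        submitted += batch
        started = sum(1 for _ in range(pending)
                      if random.random() < start_prob)
        pending -= started
        running += started
        rounds += 1
    return {"submitted": submitted, "running": running,
            "cancelled": pending, "rounds": rounds}

print(adaptive_submit(needed=10))
</verbatim>

Smaller batches reduce the redundancy (and the load on the middleware) at the cost of a longer ramp-up, which is exactly the conflict with turnaround time noted above.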
 
Changed:
<
<

Engineering Directions

Use-Cases

>
>

Other potential pitfalls

  • Very large activities using ULS could potentially block one another from using the Grid for some time.
 
Added:
>
>

Other ideas

  • Prediction of the availability of the worker nodes based on (semi-)Markov processes
    • used quite successfully with the FGCS (Fine-Grained Cycle Scavenging) system
    • tested with iShare: a peer-to-peer decentralized internet sharing system
    • host availability modelled as states of a Markov process, subject to interactive (and chaotic) user activity
  • Could this be used?
    • jobs on the Grid have identifiable states (a toy sketch follows below)
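A toy sketch of the two-state Markov-chain view of host availability; the transition probabilities are invented for illustration:

<verbatim>
# Host availability as a two-state Markov chain (available / busy with
# interactive user activity), in the spirit of the FGCS-style predictors.
import random

# P[s] = probability of being "available" in the next step, given state s
P = {"available": 0.9, "busy": 0.4}

def prob_stays_available(steps, trials=10000):
    """Estimate the probability a host stays available for `steps` steps."""
    ok = 0
    for _ in range(trials):
        state = "available"
        for _ in range(steps):
            state = "available" if random.random() < P[state] else "busy"
            if state == "busy":
                break
        else:
            ok += 1  # loop completed without ever going busy
    return ok / trials

print(prob_stays_available(10))  # ~0.9**10 ~= 0.35
</verbatim>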

Iterative applications

  • A new application has recently been deployed with DIANE
    • xmipp - X-Window-based Microscopy Image Processing Package
    • image alignment on the Grid
    • one step (iteration) is a full Master/Worker job
    • the stopping criterion is defined by the Master
    • the workers are reused across many iterations, giving a great speed-up (a sketch follows below)
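A sketch of the iterative pattern; the names and the convergence test are illustrative, not the xmipp or DIANE interfaces:

<verbatim>
# Each iteration is a full master/worker pass over all images, and the
# master applies the stopping criterion; in DIANE the align_one calls
# would be farmed out to the (reused) workers.
def run_iterative(images, align_one, max_iter=50, tol=1e-3):
    reference = None
    for it in range(max_iter):
        results = [align_one(img, reference) for img in images]
        new_reference = sum(results) / len(results)
        if reference is not None and abs(new_reference - reference) < tol:
            return new_reference, it  # stopping criterion met by the master
        reference = new_reference
    return reference, max_iter

aligned, iters = run_iterative(
    [1.0, 1.2, 0.9, 1.1],
    align_one=lambda x, ref: x if ref is None else (x + ref) / 2)
print(aligned, iters)
</verbatim>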

White-board for parameter sweep

  • Autodock application: searching for drug candidates in a huge compound/ligand matrix
  • Do a smarter search than just all combinations (or at least do all combinations in a smarter order; a toy sketch follows this list)
  • Some rows or columns of the matrix may be more interesting than others
  • A collaboration whiteboard: select interesting areas and assign a master for the job
  • If one master terminates, its workers may join another master
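A sketch of one possible "smarter order": score rows and columns by interest, as the collaboration whiteboard would, and emit the most promising cells first; the scoring is invented:

<verbatim>
# Emit (row, col) cells of the compound/ligand matrix in order of
# decreasing interest, so promising combinations are tried first.
import heapq

def sweep_order(row_scores, col_scores):
    heap = [(-(r + c), i, j)
            for i, r in enumerate(row_scores)
            for j, c in enumerate(col_scores)]
    heapq.heapify(heap)
    while heap:
        _, i, j = heapq.heappop(heap)
        yield i, j

for cell in sweep_order(row_scores=[2.0, 0.5], col_scores=[1.0, 3.0]):
    print(cell)  # (0, 1), (1, 1), (0, 0), (1, 0)
</verbatim>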

Development Directions

  • sending additional information (benchmarks, constraints, worker group identification)
  • dynamic change of connection-oriented / connection-less policy
  • support of the constraints layer
  • dynamic task decomposition
    • managing the message rate on the master by splitting the work into larger tasks which may be dynamically decomposed into smaller units (a sketch follows this list)
  • a directory service: late master binding, switching to multiple masters at runtime
  • checkpointing of the jobs to be able to resume an interrupted master
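A sketch of the dynamic decomposition idea from the list above (illustrative names only):

<verbatim>
# The master hands out coarse tasks (keeping its message rate low); a
# coarse task is split into smaller execution units only when needed.
def decompose(task, max_unit):
    """Split a (start, end) range task into units of at most max_unit."""
    start, end = task
    while start < end:
        yield (start, min(start + max_unit, end))
        start += max_unit

coarse = (0, 1000)          # one coarse task = one master message
units = list(decompose(coarse, max_unit=128))
print(len(units), units[0], units[-1])  # 8 units: (0, 128) ... (896, 1000)
</verbatim>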

%SLIDESHOWEND% The slide template:


<!-- TWiki Slide Show -->
%SLIDETITLE% %SLIDENAVNEXT%

%SLIDETEXT%
%SLIDENAVALL% Slide %SLIDENUM% of %SLIDEMAX% Jakub Moscicki 2019


  -- JakubMoscicki - 06 Dec 2006
Changed:
<
<
META FILEATTACHMENT attr="" autoattached="1" comment="" date="1165495666" name="7aa85fd5751efebe325628a40604689d.png" path="7aa85fd5751efebe325628a40604689d.png" size="1258" user="UnknownUser" version=""
META FILEATTACHMENT attr="" autoattached="1" comment="" date="1165492720" name="bf17d2b6259bb6065011f86e0bfa5ee0.png" path="bf17d2b6259bb6065011f86e0bfa5ee0.png" size="1017" user="UnknownUser" version=""
>
>

META FILEATTACHMENT attr="" autoattached="1" comment="" date="1165510802" name="6602105e0231508f1e395ad1e277e5fa.png" path="6602105e0231508f1e395ad1e277e5fa.png" size="1716" user="UnknownUser" version=""
META FILEATTACHMENT attr="" autoattached="1" comment="" date="1165517988" name="DIANE-job-anatomy.png" path="DIANE-job-anatomy.png" size="223822" user="Main.JakubMoscicki" version="1"
 