Difference: AmsterdamSeminarDecember2006 (5 vs. 6)

Revision 6 - 2006-12-08 - JakubMoscicki

Line: 1 to 1
 
META TOPICPARENT name="JakubMoscicki"
Changed:
<
<

User Level Scheduling in the Grid: the outline of the technology potential and the research directions

>
>

User Level Scheduling in the Grid

the outline of the technology potential and the research directions

 
The User Level Scheduling (ULS) techniques have been very successfully
applied in a number of application areas such as bio-informatics,
Line: 24 to 25
 user-level scheduling" presented at UvA on September 1st 2006.
Changed:
<
<
Start Presentation PhDSlideTemplate SlideShowPlugin
>
>
Start Presentation PhDSlideTemplate SlideShowPlugin
 

Table of Contents

User Level Scheduling in the Grid

the outline of the technology potential and the research directions
Line: 37 to 38
 
    • analysis of the impact: fair share
  • Development ideas
    • enhanced support for iterative and parameter-sweep applications
Added:
>
>
  • Summary
 
Changed:
<
<

recall of the current activities

>
>

Context of my work

  • ARDA group at CERN arda.png
  • enabling applications and users on the Grid:
    • High Energy Physics: LHC experiments
    • EGEE: biomed, special activities
  • Grid
    • LCG and EGEE Grid
    • the largest Grid infrastructure to date
    • over 200 sites
    • over 20K worker nodes
    • over 5 PB of storage

Activities

 

Placeholders and late binding

  • the technology is also called: placeholder, late binding, pilot jobs
Changed:
<
<
  • you do not send specific job to the resource, you acquire the resource and assign the job at runtime, you free the resource when you are done
>
>
  • you do not send a specific job to the resource
  • you acquire the resource and assign the job to it dynamically
  • you free the resource when you are done
 
  • some examples:
Changed:
<
<
    • HEP production systems (centralized task queue, server acts on behalf of the user): Alien (Atlas), DIRAC (LHCb), PANDA (Atlas)
>
>
    • HEP production systems (centralized task queue, server acts on behalf of the user):
      • AliEn (ALICE), DIRAC (LHCb), PanDA (ATLAS)
 
    • Condor glide-ins (build a virtual Condor pool from Globus resources)
    • BOINC (CPU cycle scavenging)
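The core of the late-binding idea can be illustrated with a few lines of Python (the language DIANE itself is written in). This is only a hedged sketch, not the DIANE or pilot-framework interface: the placeholder job seen by the batch system is generic, and the real tasks are bound to the acquired slot only at run time by pulling them from a user-level task queue (a local queue.Queue stands in for the remote task queue here).
<verbatim>
import queue, time

def pilot_worker(task_queue, wallclock_limit):
    """Runs inside the batch slot acquired by the generic placeholder job."""
    start = time.time()
    while time.time() - start < wallclock_limit:
        try:
            task = task_queue.get(timeout=5)    # late binding: the task is chosen only now
        except queue.Empty:
            break                               # no more work: free the slot early
        task()                                  # run the user payload in the acquired slot
    # returning ends the placeholder job, i.e. the resource is released

if __name__ == "__main__":
    q = queue.Queue()
    for i in range(3):
        q.put(lambda i=i: print("task", i, "done"))
    pilot_worker(q, wallclock_limit=3600)
</verbatim>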

User Level Scheduling

  • it's the late binding technology
Changed:
<
<
  • the scheduler has the application knowledge (may make better error recovery or load balancing decisions)
  • it runs in the user space (resources are accountable for and tracability is not compromized)
  • it is capable of creating transient/volatile overlays on top of the regular infrastructure ("virtual clusters")
  • DIANE implementation:
>
>
  • the scheduler has the application knowledge
    • may make better error recovery or load balancing decisions
  • it runs in the user space
    • resources remain accountable and traceability is not compromised
  • it is capable of creating transient/volatile overlays on top of the regular infrastructure
    • "virtual clusters"

DIANE prototype

  • DIANE prototype DIANEVirtualCluster.png
 
    • Master/Worker architecture with customizable task dependencies, constraints and fault tolerance
Changed:
<
<
    • not specific to any particular technology or infrastructure (Grid, LSF/PBS, explicit IP host list + mixing of the resources)
>
>
    • not specific to any particular technology or infrastructure:
      • Grid, LSF/PBS, explicit IP host list + mixing of the resources
 
    • portable (Python and CORBA)
    • self-contained and small distribution with fully automatic installation
Deleted:
<
<

Outstanding issues of User Level Scheduling

  • Improvement of QoS characteristics
    • extra reliability (fail-safety and application-specific fine tuning)
    • reduction of turnaround time
    • predictability of job execution (and more stable output inter-arrival rate)
  • Potential flaws
    • effect on fair-share: would other users be penalized by ULS jobs?
    • potential harmfullness of the redundant batch requests
 

Area of applicability

  • applications which are not communication-intensive:
Changed:
<
<
    • GOOD: data analysis, physics simulation (independent tasks), image processing (iterative), docking (parameter sweep)
    • BAD: FME with thousands of cells communicating every few computing cycles
>
>
    • GOOD examples
      • data analysis
      • physics simulation (independent tasks)
      • image processing (iterative)
      • docking (bioinformatics parameter sweep)
    • BAD example:
      • FME with thousands of cells communicating every few computing cycles
 
  • ability to partition the problem into parametric tasks - units of execution
    • parameters are passed around to start the application's execution units - so there is no classic task migration
Changed:
<
<

Research Directions

>
>

Review of User Level Scheduling

  • Improvement of QoS characteristics
    • extra reliability: fail-safety and application-specific fine tuning
    • reduction of turnaround time
    • predictability of job execution (and more stable output inter-arrival rate)
  • Potential flaws
    • effects on fair-share: would other users be penalized by ULS jobs?
    • potential harmfulness of the redundant batch requests
 
Changed:
<
<

Grid slot model

  • slot i is defined by: tq(i) = queuing time, ta(i) = available computing time (wallclock), p(i) = weight (power) of the resource
>
>

Research Ideas

  • Quantify what we have already experimented with and successfully applied
    • what does "a better service" from the Grid mean?
      • formalization of how Grid resources are acquired (slot model)
      • measuring the usefulness of ULS
        • definition of quantifiable QoS metrics
        • identification and analysis of potential flaws
  • Can we provide useful estimates of the QoS and fulfil them (in a statistical sense)?
    • given that reliable methods of predicting Grid queuing time exist
  • Can we assure the QoS using mixed (Grid and dedicated) resources?
  • We want to do simulation (if required) and practical verification using real applications

Slot model for Grid resources

  • slot i is defined by: DIANE-job-anatomy.png
    • tq(i) = queuing time
    • ta(i) = available computing time (wallclock)
    • p(i) = weight (power) of the resource
 
  • W is the total workload of the job
Changed:
<
<
  • DIANE-job-anatomy.png
>
>
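As an illustration of the slot model above, a minimal sketch in Python: the field names tq, ta, p and W follow the definitions given on this slide, while the classes themselves are hypothetical and not part of DIANE.
<verbatim>
from dataclasses import dataclass

@dataclass
class Slot:
    tq: float   # queuing time before the slot becomes active (seconds)
    ta: float   # available computing time once the slot is active (wallclock seconds)
    p: float    # weight / relative power of the resource (1.0 = reference CPU)

@dataclass
class Job:
    W: float    # total workload, expressed in reference-CPU seconds
</verbatim>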
 

Estimation of the number of needed slots

Changed:
<
<
  • We can derive N (the minimal number of slots needed) from this equation:
    \[ W + overhead = \sum_{i=1}^{N}ta_{i}*p_{i} \]
  • overhead represents the ULS overhead (networking, scheduling) and the adjustment for the unit of execution
  • additional complication: p(i) may change with time (time-sharing on the worker node with other jobs)
  • either a largely redundant slot acquisition (the case now) or adding a capability to acquire more resources on demand while the jobs are running (in the future)
  • currently we do a rough estimation of the total CPU demand and then request a double or so slots assuming some average processor power (largely ficticious)
>
>
  • N = minimal number of slots needed:
    \[ W + \mathrm{overhead} = \sum_{i=1}^{N} ta_{i} \cdot p_{i} \]
  • overhead represents:
    • the ULS overhead (networking, scheduling)
    • how well we can tile the slot with units of execution ("the tails")
  • additional complication: p(i) may change with time
    • in case of time-sharing on the worker node with other jobs
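A sketch of how the minimal N could be computed from the equation above. The slot capacities ta_i * p_i are passed in directly so the snippet is self-contained, and the greedy largest-first ordering is an illustrative choice rather than part of the model.
<verbatim>
def minimal_slots(W, overhead, slot_capacities):
    """slot_capacities: the values ta_i * p_i of the candidate slots."""
    acquired, N = 0.0, 0
    for c in sorted(slot_capacities, reverse=True):   # take the biggest slots first
        if acquired >= W + overhead:
            break
        acquired += c
        N += 1
    if acquired < W + overhead:
        raise ValueError("the available slots cannot cover the workload")
    return N

# e.g. minimal_slots(W=10000, overhead=500, slot_capacities=[4000, 4000, 3000, 1000]) -> 3
</verbatim>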

Estimation of the number of needed slots (2)

  • Main options:
    • a largely redundant slot acquisition at the beginning
    • adding a capability to acquire more resources on demand while the jobs are running
  • Current practice:
    • we do a rough estimation of the total CPU demand and then request about double that number of slots, assuming some average processor power (which is largely fictitious)
  • Future extensions:
 
    • initial estimation could be done from the Grid information system
Changed:
<
<
    • the application-specific benchmarks (so the measured w) could be stored externally and reused in subsequent runs
>
>
    • the application-specific benchmarks (i.e. the measured p) could be stored externally and reused in subsequent runs
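The current rule of thumb can be written down explicitly; a hedged sketch with illustrative parameter names (avg_power is the "largely fictitious" average processor power, redundancy the over-request factor of about two).
<verbatim>
import math

def slots_to_request(total_cpu_demand, avg_slot_wallclock, avg_power=1.0, redundancy=2.0):
    """Rough estimate: CPU demand / assumed slot capacity, then over-request."""
    needed = total_cpu_demand / (avg_slot_wallclock * avg_power)
    return math.ceil(redundancy * needed)

# e.g. 200 CPU-hours of work and 4-hour slots of average power 1.0:
# slots_to_request(200 * 3600, 4 * 3600) -> 100 slots requested for ~50 actually needed
</verbatim>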
 

Estimation of the turnaround time

  • Currently we do not predict tq - the queuing time - however:
    • promising techniques exist (e.g. BMBP, the Binomial Method Batch Predictor)
      • relying on long traces from batch-queue logs + parallel workload archives
Changed:
<
<
      • using order-based statistics (quantiles) rather than the moment-based statistics (mean, stddev)
      • comparing to model-based predictors (which assume certain model of the queue) arguably BMBP:
        • has a similar success rate but has more accurate predictions (less "over-prediction")
>
>
      • using order-based statistics (quantiles) rather than moment-based statistics (mean, stddev)
      • arguably, in comparison to model-based predictors (which assume a certain model of the queue):
        • BMBP has a similar success rate but more accurate predictions (less "over-prediction")
 
Changed:
<
<
    • additionally we try to capture the 'background' by sending very short jobs a few times daily to monitor the responsiveness of the system (in 3 different VOs)
>
>
    • additionally we try to capture the Grid 'background'
      • by sending very short jobs a few times daily to monitor the responsiveness of the system (in 3 different VOs)
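The order-statistics flavour of such predictions can be sketched in a few lines. This is a much-simplified illustration in the spirit of BMBP, not its actual algorithm: the predicted bound on tq is taken as an empirical quantile of recently observed queuing times rather than a mean-plus-stddev estimate.
<verbatim>
import math

def wait_time_bound(observed_waits, confidence=0.95):
    """Empirical quantile of past queuing times, used as an upper-bound prediction."""
    if not observed_waits:
        raise ValueError("need some historical queuing times")
    waits = sorted(observed_waits)
    k = min(len(waits) - 1, math.ceil(confidence * len(waits)) - 1)
    return waits[k]

# e.g. wait_time_bound([30, 45, 50, 60, 600, 70, 80, 90, 100, 120]) -> 600 seconds
</verbatim>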

Estimation of the turnaround time (2)

 
  • Assuming that we can get reliable estimates of tq on the Grid (is it possible / has it been done yet?)
Changed:
<
<
  • In real applications the user may also be interested in partial output (e.g. histograms)
  • The main parameter: P(k,t) = probability to complete the k% of the run in the time not greater than t (similar to modelling the queuing time in BMBP)
>
>
  • Let P(k,t) = the probability of completing k% of the run in time not greater than t
    • in real applications the user may also be interested in partial output (e.g. histograms)
    • similar to modelling the queuing time in BMBP
 
  • P depends on how k is defined:
Changed:
<
<
    • if all units of execution are equal or their distribution if they are not equal
    • if there are synchronization patterns or units are independent
>
>
    • whether all execution units are equal, or their distribution if they are not
    • whether there are synchronization patterns or the execution units are independent
 
  • P depends also on other parameters:
    • changes in the load of the worker nodes and the failure rate of the worker nodes
Changed:
<
<
    • if there are contraints: may any worker node process any task or some are restricted
>
>
    • the constraints: may any worker node process any task, or are some restricted (for example by data access)
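How such a P(k,t) could be estimated empirically from previous runs (a hypothetical helper, not an existing DIANE facility): P(k,t) is taken as the fraction of past runs in which at least k% of the tasks finished within time t.
<verbatim>
def P(k, t, runs):
    """runs: one list of task completion times (seconds since job start) per past run."""
    def fraction_done(times):
        return 100.0 * sum(1 for x in times if x <= t) / len(times)
    return sum(1 for times in runs if fraction_done(times) >= k) / len(runs)

# e.g. P(50, 3600, runs) estimates the chance of getting half of the output within one hour
</verbatim>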
 

Predictability of the job execution

Changed:
<
<
  • the coefficient of the variation of the job output
  • illustrate with the G4 graphs

Fault-tolerance and measure of the reliability

  • Reliability should be measured "on the application's reliability background": minimize the infrastrucure faults which have no relation to application problems
    • If application has an intrinsic problem then there is not so much we can do (e.g. a segfault on some data)
    • If there is a configuration problem on the sites, a middleware inefficiency, worker node crash then we can enhance reliability of the system as observed by the user by providing fault-tolerance
    • Additionally, we can customize the fault-tolerance behaviour
  • How to measure it?
    • Classify the faults, disregard the intrinsic application faults, the ratio of failures is the reliability measure
>
>
  • Define the coefficient of variation of the job output arrival times
    • it depends on k, i.e. on the performance model
      • if all execution units are equal and independent then the expected number of outputs vs. time is a straight line
  • we can fit a known performance model to the measured data points
G4Prod_output_rate.png
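A minimal sketch of that measure, assuming the output arrival times are recorded by the master: the coefficient of variation (stddev over mean) of the inter-arrival gaps; a perfectly steady output stream gives a value close to zero.
<verbatim>
from statistics import mean, stdev

def output_cv(arrival_times):
    """arrival_times: sorted wallclock times at which task outputs arrived at the master."""
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    return stdev(gaps) / mean(gaps)

# e.g. output_cv([10, 20, 31, 39, 50, 61]) is small: a nearly constant output rate
</verbatim>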

Fault-tolerance and reliability

  • Reliability should be measured "on the application's reliability background":
    • the goal: minimize the infrastructure faults which have no relation to intrinsic application problems
      • example of an intrinsic application problem: a segfault on particular data or parameters
    • We can enhance the reliability of the system as observed by the user by providing fault-tolerance:
      • for configuration problems at the sites, middleware inefficiencies, worker node crashes, etc.
      • we can customize the fault-tolerance behaviour
  • How to measure the reliability?
    • Classify the faults, disregard the intrinsic application faults; the ratio of the remaining failures is the reliability measure
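A sketch of that measurement (the fault categories below are illustrative): intrinsic application faults are left out, and the ratio of the remaining failures over the remaining tasks is reported.
<verbatim>
APPLICATION_FAULTS = {"segfault", "bad_input"}     # intrinsic: not counted against the system

def infrastructure_failure_ratio(task_results):
    """task_results: one fault label per failed task (e.g. 'worker_crash'), None if it succeeded."""
    relevant = [r for r in task_results if r not in APPLICATION_FAULTS]
    failed = [r for r in relevant if r is not None]
    return len(failed) / len(relevant) if relevant else 0.0

# e.g. [None, None, "worker_crash", "segfault", None] -> 0.25 (one system failure in four relevant tasks)
</verbatim>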
 

Fair share

  • would other users be penalized by ULS jobs?
Deleted:
<
<
  • would fair share policies defined in the CE be compromised?
  • effect on fair-share:
    • fair-share can be measured (find the paper)
    • can be modeled and simulated
 
  • potential harmfulness of the redundant batch requests
    • pure redundant requests (submit n, execute 1, cancel n-1) have been studied (-> paper)
Changed:
<
<
      • jobs which do not use redundant requests are penalized (stretch increses linearly wrt the number of jobs using redundant requests)
>
>
      • jobs which do not use redundant requests are penalized
        • the stretch increases linearly with the number of jobs using redundant requests (the stretch metric is sketched below)
 
      • load on middleware may be a problem
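The stretch used in such studies is commonly defined as the ratio of a job's turnaround time (waiting plus running) to its run time alone; a one-line sketch for reference.
<verbatim>
def stretch(wait_time, run_time):
    """Turnaround time relative to the pure run time; 1.0 means no waiting at all."""
    return (wait_time + run_time) / run_time

# e.g. stretch(wait_time=3 * 3600, run_time=1 * 3600) -> 4.0
</verbatim>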
Added:
>
>

Fair share (2)

 
    • ULS has a certain degree of redundancy (submit n, execute k, cancel n-k)
Deleted:
<
<
      • measure the harmfullness in this case
      • how to cope with it: meta-submitter would steadily increase the number of submission according to needs
      • this is clearly in conflict with minimizing the global turnaround time (QoS), what should the balance be?
 
    • estimate the level of redundancy
Changed:
<
<
    • what can be done: artificial restriction of the slot (either static or dynamic)
>
>
      • measure the harmfulness in this case
    • Possible solution:
      • a meta-submitter would steadily increase the number of submissions according to need (see the sketch below)
      • this is clearly in conflict with minimizing the global turnaround time (QoS)
      • what should the balance be?
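A minimal sketch of that meta-submitter policy (a hypothetical helper, not an existing DIANE component): rather than submitting all redundant placeholder jobs up front, the submissions are topped up periodically while tasks remain.
<verbatim>
def top_up(pending_tasks, running_workers, queued_pilots, max_new_per_round=5):
    """How many new placeholder jobs to submit in this round."""
    still_needed = pending_tasks - running_workers - queued_pilots
    return max(0, min(max_new_per_round, still_needed))

# called periodically; max_new_per_round bounds the redundancy, at the price of a
# potentially longer turnaround -- exactly the QoS / fair-share balance discussed above
</verbatim>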
 
Changed:
<
<

Other potential pitfalls

>
>

Fair share (3)

  • would fair share policies defined in the CE be compromised?
    • heterogeneity: fair share algorithms are different at different sites
    • concepts of window, depth and decay in the accounting data are applied to VOs, groups and users
  • fair-share may be modeled and simulated
    • e.g. with the SimGrid toolkit for sites, clusters, batch schedulers and streams of jobs
  • Other pitfalls:
 
  • Very large activities using ULS could potentially block one another from using the Grid for a certain time.
Added:
>
>
  • Possible solution:
    • artificial restriction of the slot (either static or dynamic)
    • in conflict with QoS
 

Other ideas

  • Prediction of the availability of the worker nodes based on (semi) Markov Processes
Line: 150 to 226
 
    • jobs on the grid have identifiable states
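A toy sketch of that idea with a two-state discrete-time Markov chain; the transition probabilities are purely illustrative, and a real model would be estimated from monitoring data (and could use semi-Markov holding times).
<verbatim>
# transition probabilities between the "up" and "down" states of a worker node
TRANSITIONS = {"up":   {"up": 0.95, "down": 0.05},
               "down": {"up": 0.30, "down": 0.70}}

def availability_after(steps, state="up"):
    """Probability that the node is 'up' after the given number of time steps."""
    dist = {state: 1.0}
    for _ in range(steps):
        new = {"up": 0.0, "down": 0.0}
        for s, p_s in dist.items():
            for t, p_t in TRANSITIONS[s].items():
                new[t] += p_s * p_t
        dist = new
    return dist.get("up", 0.0)

# e.g. availability_after(20) approaches the long-run availability 0.30 / (0.30 + 0.05) ~ 0.86
</verbatim>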

Iterative applications

Added:
>
>

Iterative applications

 
  • A new application has been recently deployed with DIANE
    • xmipp - X-Window-based Microscopy Image Processing Package
    • image alignment on the Grid
Line: 158 to 235
 
    • the workers are reused in many iterations, giving a great speed-up

White-board for parameter sweep

Added:
>
>

White-board for parameter sweep

 
  • The AutoDock application: searching for drug candidates in a huge compound/ligand matrix
  • Do a smarter search than just all combinations (or at least do all combinations in a smarter order)
  • Some rows or columns of the matrix may be more interesting than others
  • A collaboration whiteboard: select interesting areas and assign a master for the job
  • If one master terminates, its workers may join another master
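A toy sketch of such a collaboration whiteboard (all names are hypothetical): cells of the compound/ligand matrix carry an interest score set collaboratively, and tasks are handed out from the most interesting cells first.
<verbatim>
import heapq

class Whiteboard:
    def __init__(self):
        self._cells = []                              # max-heap via negated interest

    def mark(self, compound, ligand, interest):
        """Record (or raise) the interest in docking this compound/ligand pair."""
        heapq.heappush(self._cells, (-interest, compound, ligand))

    def next_task(self):
        """The most interesting pair still to be docked, or None if the board is empty."""
        if not self._cells:
            return None
        _, compound, ligand = heapq.heappop(self._cells)
        return compound, ligand

# wb = Whiteboard(); wb.mark("c42", "l7", 0.9); wb.mark("c13", "l2", 0.1); wb.next_task() -> ("c42", "l7")
</verbatim>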
Changed:
<
<

Development Directions

>
>

Development Ideas

 
  • sending additional information (benchmarks, constraints, worker group identification)
  • dynamic change of connection-oriented / connection-less policy
  • support of the constraints layer
  • dynamic task decomposition
Added:
>
>
    • so far we have a rather primitive self-balancing with static decomposition
 
    • managing the message rate on the master by splitting the work into larger tasks which may be dynamically decomposed into smaller units
  • a directory service: late master binding, switching to multiple masters at runtime
  • checkpointing of the jobs to be able to resume an interrupted master
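A sketch of what the dynamic decomposition mentioned above could look like (hypothetical interface): the master hands out coarse parameter ranges to keep its message rate low, and a range is split further when it does not fit the remaining wallclock of the slot.
<verbatim>
def decompose(task_range, remaining_wallclock, est_seconds_per_item):
    """Split the half-open range [start, end) so the first chunk fits the remaining wallclock."""
    start, end = task_range
    fits = max(1, int(remaining_wallclock // est_seconds_per_item))
    cut = min(end, start + fits)
    chunk = (start, cut)
    leftover = (cut, end) if cut < end else None      # goes back to the master's task queue
    return chunk, leftover

# e.g. decompose((0, 1000), remaining_wallclock=600, est_seconds_per_item=2) -> ((0, 300), (300, 1000))
</verbatim>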
Changed:
<
<
%SLIDESHOWEND% The slide template:
>
>

Summary

  • we have developed a useful technique which has been successfully used
  • first we want to quantify the improvement in the QoS we observed
  • then we want to analyse possible pitfalls (fair-share)
  • we also want to make the system more automatic
    • it should compute the number of needed resources to fulfil the QoS requirements
    • dynamic decomposition should relieve the user from making manual splitting decisions
  • finally we want to extend our ULS prototype to cover more use-cases
 
Changed:
<
<

<!-- TWiki Slide Show -->
%SLIDETITLE% %SLIDENAVNEXT%

%SLIDETEXT%
%SLIDENAVALL% Slide %SLIDENUM% of %SLIDEMAX% Jakub Moscicki 2019


>
>
%SLIDESHOWEND%
  -- JakubMoscicki - 06 Dec 2006

Changed:
<
<
>
>
META FILEATTACHMENT attr="" autoattached="1" comment="" date="1165526168" name="arda.png" path="arda.png" size="2177" user="Main.JakubMoscicki" version="1"
META FILEATTACHMENT attr="" autoattached="1" comment="" date="1165531052" name="G4Prod_output_rate.png" path="G4Prod_output_rate.png" size="8225" user="Main.JakubMoscicki" version="1"
 
META FILEATTACHMENT attr="" autoattached="1" comment="" date="1165510802" name="6602105e0231508f1e395ad1e277e5fa.png" path="6602105e0231508f1e395ad1e277e5fa.png" size="1716" user="UnknownUser" version=""
META FILEATTACHMENT attr="" autoattached="1" comment="" date="1165517988" name="DIANE-job-anatomy.png" path="DIANE-job-anatomy.png" size="223822" user="Main.JakubMoscicki" version="1"
Added:
>
>
META FILEATTACHMENT attr="" autoattached="1" comment="" date="1165533185" name="DIANEVirtualCluster.png" path="DIANEVirtualCluster.png" size="14173" user="Main.JakubMoscicki" version="1"
 