Difference: AmsterdamSeminarDecember2006 (3 vs. 4)

Revision 4 2006-12-07 - JakubMoscicki

Line: 1 to 1
 
META TOPICPARENT name="JakubMoscicki"

User Level Scheduling in the Grid: the outline of the technology potential and the research directions

Line: 48 to 47
 

Outstanding issues of User Level Scheduling

  • Improvement of QoS characteristics
    • extra reliability (fail-safety and application-specific fine tuning)
Changed:
<
<
    • reduction of stretch (aka makespan, turnaround time)
>
>
    • reduction of turnaround time
 
    • stabilization of the output inter-arrival rate (which is also more predictable)
  • Potential flaws
    • effect on fair-share: would other users be penalized by ULS jobs?
Changed:
<
<
    • potential harmfulness of the redundant batch requests, estimate the level of redundancy
>
>
    • potential harmfulness of the redundant batch requests
 

Area of applicability

Research Directions

Changed:
<
<
  • effect on fair-share: would other users be penalized by ULS jobs?
>
>

Grid slot model

  • slot i is defined by: tq(i) = queuing time, ta(i) = available computing time (wallclock), p(i) = weight (power) of the resource
  • W is the total work-load of the job

Estimation of the number of needed slots
  • We can derive N (the minimal number of slots needed) from this equation:
    \[W + \mathrm{overhead} = \sum_{i=1}^{N} ta_{i} \cdot p_{i}\]
  • overhead represents the ULS overhead (networking, scheduling) and the adjustment for the unit of execution
  • additional complication: p(i) may change with time (time-sharing on the worker node with other jobs)
  • two options: either a largely redundant slot acquisition (the case now) or a capability to acquire more resources on demand while the jobs are running (in the future)
  • currently we make a rough estimate of the total CPU demand and then request roughly twice as many slots, assuming some average processor power (largely fictitious)
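
A minimal sketch of the slot estimate above, assuming constant p(i) for each slot (the time-sharing complication is ignored) and slots considered in the order given. The names Slot, min_slots and rough_request are hypothetical and only illustrate the equation and the current "request roughly double" practice:

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Slot:
    ta: float  # available computing time (wallclock)
    p: float   # weight (power) of the resource, assumed constant here


def min_slots(workload: float, overhead: float,
              slots: Sequence[Slot]) -> Optional[int]:
    """Smallest N with sum_{i=1..N} ta(i)*p(i) >= W + overhead.

    Returns None if all the given slots together are not enough.
    """
    needed = workload + overhead
    if needed <= 0:
        return 0
    capacity = 0.0
    for n, slot in enumerate(slots, start=1):
        capacity += slot.ta * slot.p
        if capacity >= needed:
            return n
    return None


def rough_request(workload: float, overhead: float,
                  avg_ta: float, avg_p: float,
                  redundancy: float = 2.0) -> int:
    """Rough estimate: slots needed at average power, times a redundancy factor.

    The default factor of 2.0 mirrors the 'roughly twice as many' practice;
    avg_ta and avg_p are assumed averages, not measured values.
    """
    n = math.ceil((workload + overhead) / (avg_ta * avg_p))
    return math.ceil(redundancy * n)
```

For example, min_slots(25, 2, [Slot(10, 1), Slot(10, 2), Slot(5, 1)]) stops at the second slot, because the first two slots already provide 30 units of capacity against 27 needed.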

Estimation of the turnaround time
  • Currently we do not predict tq, the queuing time, however:
    • promising techniques exist (e.g. BMBP Binomial Method Batch Predictor) -> relying on long traces from batch-queue logs + parallel workload archives
    • we have a wealth of monitoring data (Dashboard)
    • we try to capture the 'background' by sending very short jobs a few times daily to monitor the responsiveness of the system (in 3 different VOs)
  • Provided that we can get reliable estimates on tq on the Grid (which has not been tried yet, AFAIK)
  • In real applications the user may also be interested in partial output (e.g. histograms)
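
To illustrate the kind of prediction mentioned above, here is a simplified sketch of a binomial (order-statistic) upper bound on a queuing-time quantile computed from a trace of past waits. It only shows the underlying idea and is not the BMBP implementation; the function name and the default parameters are assumptions:

```python
from math import comb
from typing import Optional, Sequence


def queue_time_upper_bound(wait_times: Sequence[float],
                           quantile: float = 0.95,
                           confidence: float = 0.95) -> Optional[float]:
    """Upper confidence bound on the given quantile of the queuing time.

    Treats the trace as independent samples. The k-th smallest observation
    is at least the true quantile with probability P(Bin(n, quantile) <= k-1),
    so we return the smallest observation for which that probability
    reaches `confidence`. Returns None if the trace is too short.
    Note: intended for modest trace lengths; very large n can overflow
    the float conversion of the binomial coefficient.
    """
    xs = sorted(wait_times)
    n = len(xs)
    cdf = 0.0
    for k in range(1, n + 1):
        # accumulate P(Bin(n, quantile) = k - 1)
        cdf += comb(n, k - 1) * quantile ** (k - 1) * (1 - quantile) ** (n - k + 1)
        if cdf >= confidence:
            return xs[k - 1]
    return None
```

With too few observations no bound exists at the requested confidence (for instance three samples cannot bound the 0.95 quantile with 95% confidence), which is one reason long traces are needed.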

Fault-tolerance and measure of the reliability
  • Reliability should be measured "on the application's reliability background": minimize the infrastructure faults which have no relation to application problems
    • If the application has an intrinsic problem then there is not much we can do
    • If there is a configuration problem on the sites, then we can enhance reliability of the system as observed by the user by providing fault-tolerance
    • Additionally, we can customize the fault-tolerance behaviour
  • How to measure it?
    • Classify the faults and disregard the intrinsic application faults; the ratio of the remaining failures is the reliability measure
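
The measurement could be sketched as follows, assuming every task outcome has already been classified as a success, an intrinsic application fault, or an infrastructure fault. The labels and the function name are hypothetical, and taking the ratio over the non-application outcomes is one possible reading of the measure:

```python
from collections import Counter
from typing import Iterable, Optional

APPLICATION = "application"        # intrinsic application fault (disregarded)
INFRASTRUCTURE = "infrastructure"  # site, middleware or configuration fault


def infrastructure_failure_ratio(outcomes: Iterable[Optional[str]]) -> Optional[float]:
    """Ratio of infrastructure failures among outcomes not caused by the application.

    Each outcome is None (success), APPLICATION or INFRASTRUCTURE.
    Returns None when nothing is left after disregarding application faults.
    """
    counts = Counter(outcomes)
    successes = counts[None]
    infra_failures = counts[INFRASTRUCTURE]
    considered = successes + infra_failures
    if considered == 0:
        return None
    return infra_failures / considered
```

For three successes, one infrastructure fault and one application fault the ratio is 1/4: the application fault is left out of both the numerator and the denominator.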

Fair share

  • would other users be penalized by ULS jobs?
  • would fair share policies defined in the CE be compromised?
  • effect on fair-share:
 
    • fair-share can be measured (find the paper)
    • can be modeled and simulated
Added:
>
>
  • potential harmfulness of the redundant batch requests
    • pure redundant requests (submit n, execute 1, cancel n-1) have been studied (->):
      • jobs which do not use redundant requests are penalized (stretch increases linearly with respect to the number of jobs using redundant requests)
      • load on middleware may be a problem
    • ULS has a certain degree of redundancy (submit n, execute k, cancel n-k)
      • measure the harmfulness in this case
      • how to cope with it: the meta-submitter would steadily increase the number of submissions according to needs
      • this is clearly in conflict with minimizing the global turnaround time (QoS), what should the balance be?
    • estimate the level of redundancy
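
As a starting point for the discussion, a small sketch of one possible redundancy measure and of a meta-submitter step that raises the number of submissions gradually. Both functions, the definition (n-k)/n and the max_step cap are assumptions made for illustration, not an agreed design:

```python
import math


def redundancy_level(submitted: int, executed: int) -> float:
    """Fraction of submitted requests that were cancelled: (n - k) / n."""
    if submitted <= 0:
        raise ValueError("submitted must be positive")
    if not 0 <= executed <= submitted:
        raise ValueError("executed must be between 0 and submitted")
    return (submitted - executed) / submitted


def additional_submissions(remaining_work: float,
                           acquired_capacity: float,
                           avg_slot_capacity: float,
                           max_step: int) -> int:
    """Number of extra requests to submit in this round.

    Covers the capacity deficit at an assumed average slot capacity,
    but never more than max_step requests at once, so that the number
    of submissions grows steadily rather than in one redundant burst.
    """
    deficit = remaining_work - acquired_capacity
    if deficit <= 0:
        return 0
    return min(max_step, math.ceil(deficit / avg_slot_capacity))
```

Pure redundant requests (submit 10, execute 1) give a level of 0.9 under this definition, while a ULS run that executes 4 of 10 gives 0.6. A smaller max_step lowers the redundancy but slows down slot acquisition, which is the conflict with turnaround time noted above.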
 

Engineering Directions

Line: 67 to 105
 

-- JakubMoscicki - 06 Dec 2006

Added:
>
>
META FILEATTACHMENT attr="" autoattached="1" comment="" date="1165495666" name="7aa85fd5751efebe325628a40604689d.png" path="7aa85fd5751efebe325628a40604689d.png" size="1258" user="UnknownUser" version=""
META FILEATTACHMENT attr="" autoattached="1" comment="" date="1165492720" name="bf17d2b6259bb6065011f86e0bfa5ee0.png" path="bf17d2b6259bb6065011f86e0bfa5ee0.png" size="1017" user="UnknownUser" version=""
 