AgentFactory (revision 2)

presentation

  • introduction:
    • latticeQCD: long-term theoretical physics computation, 3 months so far, ~1000 workers, 1.2M iterations, ...
    • problem at hand: automatic worker submission, keeping a specified number of workers alive, etc.
    • issues: jobs do not live forever, architecture issues, various failures (more in error analysis)
  • meta-algorithm:
    • dynamic algorithm that can adapt to changes on the grid
    • good computing elements and bad computing elements - where to draw the line?
      • positive feedback: running jobs, jobs that finished running without any errors
      • negative feedback: pending jobs (to avoid over-submitting), failed jobs, all other jobs without a clear status
    • fitness: a measure of performance; the ratio of running plus completed (error-free) jobs to the total number of jobs
    • how it works: maintain a list of known computing elements with their fitness ratios and randomly choose a computing element with probability proportional to its fitness; alternatively, submit a job to the grid without specifying any computing element
    • characteristics:
      • once the fitness hits 0, no job is submitted to the computing element explicitly (jobs might still reach it through the generic grid slot)
      • forgetting old data: discarding old data influences the decision process and reflects the dynamic character of the grid (conditions change)
      • forgetting unpromising computing elements: a consequence of removing old data; once a computing element's data has been removed and the script is restarted, the computing element is no longer known
  • examples:
    • good computing element (fitness ~1), bad computing element (fitness ~0)
    • handling jobs stuck in the queue, effects on the fitness
  • generic Grid slot:
    • balancing element that offloads part of the decision process to the grid
    • fitness=1
    • used for discovery of new computing elements
    • a chance for a computing element with fitness 0 to rehabilitate itself (instead of waiting for the old data to be removed)
    • examples: starting up the script
  • usage scenarios + experience
    • acrontab (start up periodically and run for a specified number of hours: 3h, 6h, etc.); good for use with lxplus
    • live process
    • only one instance per workspace is allowed (the active monitoring does not tolerate more than one process)
    • file lock system to prevent multiple instances from running in parallel
    • simple kill mechanism: when running on lxplus via acrontab it is unclear which machine the process is running on, so an alternative way to kill it is very useful
  • observation results:
    • does not over-submit jobs to any particular computing element
    • the fitness of bad computing elements quickly tends to 0
    • handles temporarily unavailable computing elements well (i.e. readjusts fitness over time)
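
The meta-algorithm outlined above can be illustrated with a minimal Python sketch. It is not the actual AgentFactory code: the status labels, the (timestamp, ce, status) record layout, the max_age parameter and all function names are assumptions made for illustration. It shows the fitness ratio, the removal of old data, the generic grid slot held at fitness 1, and fitness-proportional selection.

```python
import random

# Hypothetical status labels; real job states come from the grid backend.
POSITIVE = {"running", "completed"}

# Stands for "submit without naming a computing element" (the generic grid slot).
GENERIC_SLOT = None


def fitness(statuses):
    """Ratio of running + error-free completed jobs to all known jobs."""
    if not statuses:
        return 0.0
    good = sum(1 for s in statuses if s in POSITIVE)
    return good / len(statuses)


def forget_old(records, now, max_age):
    """Keep only (timestamp, ce, status) records no older than max_age seconds."""
    return [r for r in records if now - r[0] <= max_age]


def fitness_table(records):
    """Map each known computing element to its fitness; the generic slot stays at 1."""
    by_ce = {}
    for _, ce, status in records:
        by_ce.setdefault(ce, []).append(status)
    table = {ce: fitness(statuses) for ce, statuses in by_ce.items()}
    table[GENERIC_SLOT] = 1.0
    return table


def choose_target(table, rng=random):
    """Pick a submission target with probability proportional to its fitness."""
    targets = list(table)
    weights = [table[t] for t in targets]
    # The generic slot guarantees a positive total weight, so choices() never fails.
    return rng.choices(targets, weights=weights, k=1)[0]
```

In this sketch a computing element with fitness 0 gets weight 0 and is never chosen explicitly, while the generic slot always keeps a non-zero chance of being picked, which is how new or rehabilitated computing elements can enter the table.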

error analysis

  • save error logs (stderr/stdout), print the job to a file, and record j.backend.loginfo, for both pending and failed jobs
  • gangadir/factory/....
  • job longevity analysis (per CE)
  • detection of false positives
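
As an illustration of the per-CE job longevity analysis, here is a minimal Python sketch. The (ce, start, end) record layout and the function name are assumptions; the data actually saved under gangadir/factory/.... may look different. It groups job lifetimes by computing element and reports the median for each.

```python
from collections import defaultdict
from statistics import median


def longevity_per_ce(jobs):
    """Return {ce: median job lifetime in seconds}.

    jobs is an iterable of hypothetical (ce, start, end) tuples,
    with start and end given as timestamps in seconds.
    """
    lifetimes = defaultdict(list)
    for ce, start, end in jobs:
        lifetimes[ce].append(end - start)
    return {ce: median(values) for ce, values in lifetimes.items()}
```

The median is used here rather than the mean so that a single very short or very long job does not dominate the figure for a computing element; that choice is part of the sketch, not something stated in the notes above.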

-- MaciejWos - 06 Aug 2008

Topic revision: r2 - 2008-08-18 - MaciejWos
 