presentation
- introduction:
- latticeQCD: long term theoretical physics computation, 3 months so far, ~1000 workers, 1.2M iterations, ...
- computational model, snapshots, maximum number of workers, snapshots based on the beta parameter (physics)
- problem at hand: automatic worker submission, keeping a specified number of workers alive, etc.
- issues: jobs do not live forever, architecture issues, various failures (more in error analysis)
- meta-algorithm:
- dynamic algorithm that can adapt to changes on the grid
- heuristic approach independent of underlying application (relies only on the application exit code); no glue (JDL) requirements
- good computing elements and bad computing elements - where to draw the line?
- positive feedback: running jobs, jobs that finished running without any errors (based on diane stderr, application/framework error)
- negative feedback: pending jobs (to avoid over submitting), failed jobs, all other jobs without clear status (stuck)
- fitness: a measure of reliability; the ratio of running + completed (error free) jobs to the total number of jobs
- how does it work: maintain a list of known computing elements and their corresponding fitness ratios and non-deterministically choose a computing element with the probability proportional to its fitness; alternatively, submit a job to the grid without specifying any computing element at all
- characteristics:
- once the fitness hits 0 no job will be submitted to the computing element explicitly (might be submitted through generic grid slot)
- forgetting about the old data - influences the decision process and reflects the dynamic character of the grid (conditions change)
- forgetting about unpromising computing elements: a result of removing the old data; once the data is removed and the script is restarted, the information about that computing element is forgotten
- examples:
- good computing element (fitness ~1), bad computing element (fitness ~0)
- handling jobs stuck in the queue, effects on the fitness
- generic Grid slot:
- balancing element, offloading part of the decision process to the grid
- fitness=1
- used for discovery of new computing elements
- a chance for a computing element with fitness 0 to rehabilitate itself (instead of waiting for the old data to be removed)
- examples: starting up the script
- implementation:
- Ganga script (stores data in gangadir directory)
- information from the master (#workers)
- information from the Ganga job repository
- usage parameters
- usage scenarios + experience
- acrontab (periodically start up and run for a specified number of hours - 3h, 6h, etc); good for use with lxplus
- live process
- only one instance per workspace allowed (due to the active monitoring which doesn't like more than one process)
- file lock system to prevent many instances running in parallel
- simple kill mechanism: when running on lxplus via acrontab it is unclear on which computer the process is running so the alternative way to kill it is very useful
- observation results:
- does not over submit jobs to a particular computing element
- bad computing elements quickly tend to 0 fitness
- handles temporarily unavailable computing elements well (i.e. readjusts fitness over time)
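
The meta-algorithm outlined above (fitness as the ratio of running + error-free completed jobs to all jobs, a choice with probability proportional to fitness, and a generic grid slot with fitness 1) can be sketched roughly as follows. This is only an illustration, not the actual Ganga script: the status labels, the names fitness, choose_target and GENERIC_SLOT, and the decision to treat a computing element with no recorded jobs as fitness 0 are all assumptions made for the example.

```python
import random

# Hypothetical marker meaning "submit without naming a computing element".
GENERIC_SLOT = None

# Assumed status labels; positive feedback comes from these two only.
POSITIVE = {"running", "completed"}


def fitness(statuses):
    """Ratio of running + error-free completed jobs to all recorded jobs.

    Pending, failed and stuck jobs count only in the denominator, so they
    act as negative feedback. An empty record gives 0.0 (an assumption).
    """
    if not statuses:
        return 0.0
    good = sum(1 for status in statuses if status in POSITIVE)
    return good / len(statuses)


def choose_target(jobs_by_ce, rng=random):
    """Pick a submission target with probability proportional to fitness.

    jobs_by_ce maps a computing element name to the list of statuses of
    the jobs recorded for it. The generic grid slot always takes part
    with fitness 1, so the total weight is never zero.
    """
    weights = {GENERIC_SLOT: 1.0}
    for ce, statuses in jobs_by_ce.items():
        weights[ce] = fitness(statuses)
    targets = list(weights)
    return rng.choices(targets, weights=[weights[t] for t in targets], k=1)[0]
```

With this weighting a computing element whose fitness has dropped to 0 is never chosen explicitly, but it can still receive a job through the generic slot and recover from there.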
error analysis
- jobs with stderr; types of errors (mostly due to incompatible architecture?)
- no stderr, but has loginfo - check how long it was running on workernode
- save error logs (stderr/stdout), print job to file, j.backend.loginfo -> both pending and failed jobs
- gangadir/agent_factory/....
- job longevity analysis (per CE)
- detection of false-positives
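
The per-CE job longevity analysis mentioned above could look roughly like the sketch below. It assumes that start and end times (in seconds) have already been extracted for each job, for example from the saved loginfo; that extraction is not shown here, and the record layout and the names longevity_by_ce and short_lived_ces are hypothetical.

```python
from collections import defaultdict
from statistics import median


def longevity_by_ce(records):
    """Median job runtime in seconds for each computing element.

    records is an iterable of (ce_name, start_seconds, end_seconds)
    tuples; this layout is an assumption made for the example.
    """
    durations = defaultdict(list)
    for ce, start, end in records:
        durations[ce].append(end - start)
    return {ce: median(values) for ce, values in durations.items()}


def short_lived_ces(records, min_seconds):
    """Names of computing elements whose median runtime is below min_seconds."""
    medians = longevity_by_ce(records)
    return sorted(ce for ce, value in medians.items() if value < min_seconds)
```

A median is used rather than a mean so that a single very long or very short job does not dominate the figure for a computing element; the threshold passed as min_seconds is left to the user.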
future extensions/ideas
- performance-based fitness parameter based on the processing power of the computing element; requires greater diane integration (number of completed tasks per unit of time)
- directory service + agent factory as another service to greatly simplify using the grid (automatically create the worker pool and let the directory service manage it)
--
MaciejWos - 06 Aug 2008