latticeQCD: long-term theoretical physics computation; 3 months so far, ~1000 workers, 1.2M iterations, ...
problem at hand: automatic worker submission, keeping a specified number of workers alive, etc.
issues: jobs do not live forever, architecture issues, various failures (more in error analysis)
meta-algorithm:
a dynamic algorithm that adapts to changing conditions on the Grid
good computing elements and bad computing elements - where to draw the line?
positive feedback: running jobs and jobs that finished without any errors
negative feedback: pending jobs (to avoid over-submitting), failed jobs, and all other jobs without a clear status
fitness: a measure of performance; the ratio of running plus completed (error-free) jobs to the total number of known jobs, i.e. fitness = (running + completed ok) / total
how it works: maintain a list of known computing elements with their corresponding fitness values and non-deterministically choose a computing element with probability proportional to its fitness; alternatively, submit the job to the Grid without specifying any computing element at all (see the sketch after the examples below)
characteristics:
once the fitness hits 0, no job is submitted to that computing element explicitly (one may still land there through the generic Grid slot)
forgetting old data: expired records no longer influence the decision process, which reflects the dynamic character of the Grid (conditions change)
forgetting unpromising computing elements: a consequence of removing old data; once its data have expired and the script is restarted, the information about that computing element is gone
examples:
good computing element (fitness ~1), bad computing element (fitness ~0)
handling jobs stuck in the queue and their effect on the fitness
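A minimal sketch of the selection step, assuming the per-CE counts are kept in a plain dict; the names (ce_stats, fitness, pick_ce) and the exact bookkeeping are illustrative, not the script's real implementation:

    import random

    # hypothetical per-CE job counts; old entries are assumed to have been expired already
    ce_stats = {
        "good-ce.example.org": {"running": 5, "completed_ok": 40, "pending": 2, "failed": 1},
        "bad-ce.example.org":  {"running": 0, "completed_ok": 0,  "pending": 7, "failed": 12},
    }

    def fitness(stats):
        """fitness = (running + completed without errors) / total known jobs"""
        total = sum(stats.values())
        return (stats["running"] + stats["completed_ok"]) / float(total) if total else 0.0

    def pick_ce(stats_by_ce):
        """Pick a CE with probability proportional to its fitness; None means
        'submit without specifying a CE', i.e. the generic Grid slot."""
        weights = {ce: fitness(s) for ce, s in stats_by_ce.items()}
        total = sum(weights.values())
        if total == 0:
            return None
        r = random.uniform(0, total)
        acc = 0.0
        for ce, w in weights.items():
            if w == 0:
                continue   # fitness 0: never chosen explicitly
            acc += w
            if r <= acc:
                return ce
        return None

A CE whose fitness has dropped to 0 gets weight 0 and is never chosen explicitly, matching the behaviour described above.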
generic Grid slot:
a balancing element that offloads part of the decision process to the Grid itself
fitness=1
used for discovery of new computing elements
a chance for a computing element with fitness 0 to rehabilitate itself (instead of waiting for the old data to be removed)
example: starting up the script, when no computing elements are known yet
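One way to fold the generic slot into the same weighted choice is to add a pseudo-candidate with fixed fitness 1, reusing the fitness() helper and the random import from the sketch above; the name GENERIC_SLOT is made up for illustration:

    GENERIC_SLOT = None   # "no CE specified": let the Grid broker decide

    def pick_target(stats_by_ce):
        """Weighted choice over known CEs plus the generic slot (fixed fitness 1)."""
        weights = {ce: fitness(s) for ce, s in stats_by_ce.items()}
        weights[GENERIC_SLOT] = 1.0   # always present: discovery and rehabilitation path
        candidates = list(weights)
        return random.choices(candidates,
                              weights=[weights[c] for c in candidates], k=1)[0]

Because the generic slot is always in the pool, a fresh start with no known computing elements still submits jobs, and a CE sitting at fitness 0 can still receive a job through it and earn its way back.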
usage scenarios + experience
acrontab (periodically start the script and let it run for a specified number of hours: 3h, 6h, etc.); works well with lxplus
live process
only one instance per workspace is allowed (the active monitoring does not tolerate more than one process)
file-lock system to prevent multiple instances from running in parallel (see the sketch after this list)
simple kill mechanism: when started on lxplus via acrontab it is unclear which machine the process ended up on, so an alternative way to stop it is very useful
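A minimal sketch of the lock and kill-file idea, assuming a filesystem shared between lxplus nodes (e.g. AFS); the file names and helpers are illustrative, not the script's actual ones:

    import fcntl
    import os
    import sys

    LOCK_FILE = "manager.lock"   # hypothetical name, one per workspace
    KILL_FILE = "manager.kill"   # creating this file asks the process to stop

    def acquire_single_instance_lock(path=LOCK_FILE):
        """Take an exclusive non-blocking lock; exit if another instance holds it."""
        fd = open(path, "w")
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            sys.exit("another instance is already running in this workspace")
        fd.write(str(os.getpid()))
        fd.flush()
        return fd   # keep the handle open for the lifetime of the process

    def kill_requested(path=KILL_FILE):
        """Checked in the main loop; touching the kill file works from any machine
        that sees the same filesystem, so it does not matter where the process runs."""
        return os.path.exists(path)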
observation results:
does not over-submit jobs to any particular computing element
bad computing elements quickly tend to 0 fitness
handles temporarily unavailable computing elements well (i.e. readjusts fitness over time)
error analysis
save error logs (stderr/stdout), dump the job object to a file, and record j.backend.loginfo -> done for both pending and failed jobs (see the sketch below)
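A rough sketch of how such diagnostics could be collected from within a Ganga session; j.backend.loginfo comes from the notes above, while the output layout, the save_diagnostics name, and the exact set of attributes dumped are assumptions:

    import os

    def save_diagnostics(j, outdir="error_logs"):
        """Dump diagnostics for a pending or failed Ganga job j into outdir."""
        os.makedirs(outdir, exist_ok=True)
        base = os.path.join(outdir, "job_%s" % j.id)

        # full printout of the job object
        with open(base + ".job.txt", "w") as f:
            f.write(str(j))

        # backend log information
        try:
            with open(base + ".loginfo.txt", "w") as f:
                f.write(str(j.backend.loginfo()))
        except Exception as e:
            print("could not get loginfo for job %s: %s" % (j.id, e))

        # stdout/stderr, if the job already produced them
        for name in ("stdout", "stderr"):
            src = os.path.join(j.outputdir, name)
            if os.path.exists(src):
                with open(src) as fin, open(base + "." + name, "w") as fout:
                    fout.write(fin.read())

One would call this for every job whose status is 'failed' as well as for jobs stuck in the pending/'submitted' state.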