TWiki > ArdaGrid Web > AgentFactory (revision 4)

DRAFT - Agent Factory - DRAFT

Lattice QCD: searching for QCD critical point

Quantum Chromodynamics (QCD) describes the strong interactions between quarks and gluons, which are normally confined inside protons, neutrons and other baryons. At high temperature or density, however, the situation changes: baryons "melt" and quarks and gluons, now deconfined, form a plasma. When the net baryon density is zero, this change is rapid but "smooth" (analytic) as the temperature is raised. By contrast, at high baryon density it is believed that this change is accompanied by a phase transition, similar to that of ice turning into water. For a particular intermediate baryon density, the latent heat of this phase transition vanishes (the phase transition is second-order). The corresponding temperature and baryon density define the so-called QCD critical point, which is the object of both experimental and theoretical searches, the former by heavy-ion collision experiments at RHIC (Brookhaven) and soon at the LHC, the latter by numerical lattice simulations.

What makes the project particularly interesting is that the measured response to an increase in baryon density is opposite in sign to standard expectations. This surprising result is obtained, for now, on a very coarse lattice, and needs to be confirmed on finer ones. If it turns out to be a property of the continuum QCD theory, it will have a profound impact on the QCD phase diagram. In particular, it will make it unlikely to find a QCD critical point at small baryon density.

To simulate QCD on a computer, one discretizes space and time into a 4-dimensional grid. The quark and gluon fields live respectively on the sites and bonds of this "lattice". The computer generates a sample of the most important configurations of quark and gluon fields, evolving them one Monte Carlo step at a time. In the present project, the same system is studied at 18 different temperatures around that of the phase transition, and what is measured is the response to a small increase in the baryon density. The signal is tiny and easily drowned by the statistical fluctuations. The necessary high statistics are obtained by running, at each temperature, many independent Monte Carlo threads. This strategy allows the simultaneous use of many Grid nodes in an embarrassingly parallel mode.

Computational model

The Lattice QCD computation is organized in the following way:

  • Quark and gluon field configurations are stored in snapshot files, each one about 10MB in size
  • At present there are 1063 snapshots
  • Each snapshot represents a separate computational task

We run the Lattice QCD simulation on the grid because Monte Carlo methods are inherently parallel (lots of independent random sampling). Each snapshot is a file of about 10MB containing the values of the quark and gluon fields on the lattice, and each represents a separate computational task: take the snapshot and perform the computation that evolves it to its next state. Additionally, some snapshots are more important than others, so a specific scheduling scheme is applied.


For the Lattice QCD computation we use the following setup:

The diane master is responsible for task scheduling. It decides the order in which the snapshots should be processed and schedules the tasks to individual workers.

The workers (running on the grid) do the computation. Each worker receives a snapshot to process and keeps evolving it. In order to avoid unnecessary network traffic, once the worker receives a particular snapshot it will continue to evolve it for as long as possible. This way it is not necessary to send a new snapshot each time the worker completes an iteration (the results of each iteration have to be sent back though).

Finally, the Agent Factory generates the workers and tries to maintain their number on a predefined level.
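The division of labour between the master and the workers can be sketched as follows. This is a minimal illustrative sketch; the class and method names (`Master`, `get_task`, `send_result`, `evolve`) are hypothetical and not the actual DIANE API.

```python
# Illustrative sketch of the master/worker split described above.
# Class and method names are hypothetical, not the actual DIANE API.

class Master:
    """Schedules snapshots and collects iteration results."""
    def __init__(self, snapshots):
        self.snapshots = list(snapshots)
        self.results = []

    def get_task(self):
        # Hand out one snapshot; the worker keeps it for the whole session
        # to avoid re-sending ~10MB of field data after every iteration.
        return self.snapshots.pop(0)

    def send_result(self, result):
        # Only the result of each iteration travels back over the network.
        self.results.append(result)

def evolve(snapshot):
    # Stand-in for one Monte Carlo step on the quark/gluon fields.
    return snapshot + 1

def worker_loop(master, n_iterations):
    snapshot = master.get_task()      # received once per session
    for _ in range(n_iterations):
        snapshot = evolve(snapshot)   # evolve the snapshot locally...
        master.send_result(snapshot)  # ...but report every iteration

master = Master(snapshots=[0])
worker_loop(master, n_iterations=3)
print(master.results)  # [1, 2, 3]
```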

It became apparent that we would do best by keeping the number of workers equal to the number of snapshots. Since submitting lots of jobs over an extended period of time is a tedious task, we came up with the idea of automating the process: a script for Ganga/diane that handles both generating Worker Agents and maintaining their number over time.


One thing we noticed at the start of this project was that when we submitted a lot of jobs to the grid, most of them failed shortly afterwards. The reason was that the executable was incompatible with AMD processors and, as it turned out, a significant part of our resources consisted of AMD Opteron based machines. There was also a vast array of other reasons why Worker Agents were failing, including missing libraries, network connectivity/firewall problems, server-side configuration problems, etc. Furthermore, it is not uncommon for faulty computing elements to have plenty of tasks scheduled to them by the resource broker. This is probably because quickly failing jobs do not use up the available resources, so the computing element, being unoccupied, is considered ready to accept more tasks.

The idea was that we could do better than blindly sending jobs to the grid by utilizing some sort of feedback on its current state. To do this I devised an algorithm based on fitness proportional selection, a method commonly used in genetic algorithms.

The algorithm aims to improve the submission success rate by micromanaging the computing elements. The idea is to discourage submitting jobs to computing elements that are deemed unreliable. At the same time we aim to balance the load on the good computing elements and avoid over-submitting; ideally we would push out as many workers as a computing element is willing to accept and stay close to that boundary.

Good / bad CE - where to draw the line?

For this approach to work we need a way to rank the reliability of the computing elements. After a job is submitted there are a few possible scenarios of what can happen: the job can sit in the queue, run, fail (immediately or after a while), be removed by the resource manager, etc. All of those cases can be divided into two categories: (i) those that count in the computing element's favour and (ii) those that do not.

In group (i), the positive feedback, we count the number of running jobs (in the "running" status) and the number of jobs that finished running without raising any errors (in the "completed" or "failed" state, but without any error messages on stdout; a job is considered failed when the "ERROR" or "EXCEPTION" keywords appear on its stderr).

The negative feedback, on the other hand, is everything else: failed jobs, jobs waiting to be run, jobs in the "completing" (or any other transient) status, etc.

Fitness = a measure of reliability

So the Agent Factory uses a fitness measure defined as: fitness = (running + completed without errors) / total.
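The feedback classification and the fitness ratio can be sketched as below. This is an illustrative simplification: the job dictionaries and status names stand in for Ganga job objects, and only the "ERROR"/"EXCEPTION" keyword check from the text is modelled.

```python
# Sketch of the fitness measure described above:
# fitness = (running + completed-without-errors) / total.
# The job dictionaries are a simplified stand-in for Ganga job objects.

def is_positive(job):
    # Positive feedback: running jobs, plus finished jobs whose stderr
    # carries no "ERROR"/"EXCEPTION" keywords.
    if job["status"] == "running":
        return True
    if job["status"] in ("completed", "failed"):
        stderr = job.get("stderr", "")
        return "ERROR" not in stderr and "EXCEPTION" not in stderr
    # Everything else (pending, completing, other transient states)
    # counts as negative feedback.
    return False

def fitness(jobs):
    if not jobs:
        return 0.0
    return sum(is_positive(j) for j in jobs) / float(len(jobs))

jobs = [
    {"status": "running"},
    {"status": "completed", "stderr": ""},
    {"status": "failed", "stderr": "EXCEPTION in solver"},
    {"status": "submitted"},
]
print(fitness(jobs))  # 2 positive out of 4 -> 0.5
```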

Nondeterministic decision process

Once we have computed the fitness for each computing element we can decide where to send individual jobs. We use a method called fitness proportional selection, which non-deterministically chooses a computing element with probability proportional to its share of the total fitness of the population (i.e. all discovered computing elements).

It works in the following way: first we add up the fitnesses of the individual computing elements, each of which is a number between 0 and 1. You can think of it graphically as constructing a line segment in which each computing element occupies a part proportional to its fitness. Then we add 1 to this total to include one extra element, the generic grid slot (which simply means that we submit the job without specifying any computing element and let the grid handle the placement). Once this is done, we can start choosing computing elements.

We draw a random number uniformly from the range between 0 and the total fitness (including the +1); it points to somewhere on the line segment. We then check which computing element occupies that region and select it to run the job.

One of the properties of this method is that when the fitness of a given computing element hits 0, we will never explicitly submit a job to it (it would occupy a part of the line segment of width 0, which is impossible to select). If, however, we submit the job to the grid through the generic grid slot, without any computing element annotation, we have no control over where it will run, and it may end up on a computing element which we dismissed as "bad", allowing that element to rehabilitate itself.
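The selection scheme above can be sketched as a classic roulette-wheel draw. This is a minimal illustration, not the Agent Factory code; the CE names are made up, and returning `None` stands for the generic grid slot.

```python
import random

# Fitness-proportional (roulette-wheel) choice over the known computing
# elements, plus one "generic grid slot" of fixed width 1 that leaves
# the placement to the resource broker. CE names are illustrative.

def choose_ce(ce_fitness, rng=random):
    # Lay the CEs out on a line segment; each occupies a span equal to
    # its fitness, and the generic slot adds a span of width 1.
    total = sum(ce_fitness.values()) + 1.0
    point = rng.uniform(0.0, total)
    for ce, fit in ce_fitness.items():
        if point < fit:
            return ce          # explicit submission to this CE
        point -= fit
    return None                # generic grid slot: no CE specified

# A CE with fitness 0 occupies a span of width 0 and can never be
# picked explicitly -- it can only rehabilitate via the generic slot.
fitness = {"ce.good.example": 0.9, "ce.bad.example": 0.0}
picks = [choose_ce(fitness) for _ in range(1000)]
assert "ce.bad.example" not in picks
```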

Worker Agent submission

Finally, there is the job submission phase to consider. In Agent Factory the submission mechanism builds on top of the Submitter class. A Submitter object encapsulates the logic to submit a single job; that is, it specifies where the job should be run (the local backend, the grid, etc.), while Agent Factory encapsulates the logic to submit many jobs over time in order to keep the worker level steady. The default submitter is the LCGSubmitter; it is a slightly changed version of a script which was available in previous versions of diane (and is probably still available now). The only real difference is that while the original was a script, LCGSubmitter is a class, which makes it more reusable and extensible.
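The split of responsibilities might look roughly like this. The classes below are a hedged sketch of the design, not the real Ganga/diane code; all names and method signatures are hypothetical.

```python
# Illustrative sketch of the Submitter / Agent Factory split described
# above; class and method names are hypothetical, not the real code.

class Submitter:
    """Knows how to submit a single job to some backend."""
    def submit(self, worker_spec):
        raise NotImplementedError

class LCGSubmitter(Submitter):
    # Stands in for submission to the LCG grid backend, optionally
    # pinning the job to a chosen computing element.
    def __init__(self, ce=None):
        self.ce = ce
    def submit(self, worker_spec):
        return "lcg:%s@%s" % (worker_spec, self.ce or "any")

class AgentFactory:
    """Submits many jobs over time to keep the worker level steady."""
    def __init__(self, submitter, target_workers):
        self.submitter = submitter
        self.target = target_workers
    def top_up(self, alive_workers):
        # Submit only as many workers as are currently missing.
        missing = max(0, self.target - alive_workers)
        return [self.submitter.submit("worker-%d" % i) for i in range(missing)]

factory = AgentFactory(LCGSubmitter(), target_workers=5)
print(len(factory.top_up(alive_workers=3)))  # 2 workers missing -> 2
```

The point of the split is that other backends (e.g. a local test submitter) can be swapped in without touching the worker-maintenance logic.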


Agent Factory is a Ganga script written in Python and follows the Ganga directory structure. When run, it creates the agent_factory directory in the gangadir where it stores its data; this includes internal data as well as logging information. As part of the logging, data about failed jobs is gathered and stored in the failure_log directory. This includes the full print of the job, loginfo, stdout, stderr, etc.

Additionally, a simple file lock system is implemented to prevent many instances running concurrently. This is necessary because Ganga allows only one monitoring loop and is very likely to crash if two instances with monitoring loops are started.
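A file lock of this kind can be sketched with an atomic exclusive file creation. This is only a minimal illustration of the idea; the lock file name and location are made up, not the actual Agent Factory paths.

```python
import os, sys, atexit

# Minimal sketch of a marker-file lock; the file name is illustrative,
# not the actual Agent Factory layout.
LOCK = "agent_factory.lock"

def acquire_lock(path=LOCK):
    # O_CREAT | O_EXCL makes creation atomic: a second instance fails
    # here instead of starting a second Ganga monitoring loop.
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except OSError:
        return False
    os.write(fd, str(os.getpid()).encode())
    os.close(fd)
    atexit.register(os.remove, path)  # release the lock on normal exit
    return True

if not acquire_lock():
    sys.exit("another Agent Factory instance appears to be running")
```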

Usage scenarios

Agent Factory can be started in several ways, and it has a few command line options to accommodate the possible cases.

For example you can just run it as a live process using the following command:

ganga --config=config-lostman.gear --diane-worker-number=1063 --diane-max-pending=50 --repeat-interval=300

The only options that must be specified are --diane-worker-number, the desired number of Worker Agents you would like to keep alive, and --repeat-interval, which specifies how often the Agent Factory should query the master for the number of alive workers. Additionally, you can specify --diane-max-pending, which limits the number of jobs you are willing to submit while other jobs are still waiting to run. For example, if you specify --diane-max-pending=50, submit 50 jobs, and none of them finish or start running, then the script will stop submitting new Worker Agents until the situation changes.
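The --diane-max-pending throttle amounts to a simple decision rule, sketched here with hypothetical variable names:

```python
# Sketch of the --diane-max-pending throttle described above.
# Function and variable names are illustrative.

def n_to_submit(target_workers, alive, pending, max_pending):
    # Submit only what is missing from the desired worker level...
    missing = max(0, target_workers - alive - pending)
    # ...and never let the number of pending jobs exceed max_pending.
    allowed = max(0, max_pending - pending)
    return min(missing, allowed)

# 50 jobs already pending and none running: stop submitting
# until the situation changes.
print(n_to_submit(target_workers=1063, alive=0, pending=50, max_pending=50))    # 0
print(n_to_submit(target_workers=1063, alive=500, pending=10, max_pending=50))  # 40
```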

Running Agent Factory as a live process does have certain problems, though. If you store any data on AFS (for example proxy certificates), then after 24h your AFS token will expire and cause errors (Ganga needs access to the proxy file). The solution is to run the Agent Factory via acrontab, which starts each scheduled process with a fresh ticket. The following command can be used to start it via acrontab:

ganga --config=config-lostman.gear --diane-worker-number=1063 --diane-max-pending=50 --repeat-interval=300 --run-time=10800 >& /dev/null

The only differences between this command and the previous one are the addition of the --run-time option and the redirection of the standard output. The former makes the script finish before the AFS token expires so that a fresh instance can be started; the latter sends stdout and stderr to /dev/null, since acrontab e-mails the user all standard output and it is much more convenient to read the logs instead.

There is also a simple mechanism to kill the script. If you run it with the --kill option, it will place a marker file in the agent_factory directory; upon detecting this file, the running script will terminate (and remove the marker too). This is particularly useful when running Agent Factory on lxplus via acrontab - started that way, you cannot see on which node the script runs, so killing it directly would be very hard (unless you want to check all lxplus nodes).

ganga --config=config-lostman.gear --kill

Agent Factory becomes really useful when a large number of diane workers is required over an extended period of time, and it has therefore been equipped with a variety of configuration options to accommodate the many possible use cases; the two main usage scenarios are the ones above - running it continuously as a live process or periodically through acrontab (cron).

Finally, Agent Factory is not without limitations. There can be only one instance running, so if more are required a separate configuration file and gangadir need to be created. Furthermore, there is no way to extend the lifetime of the proxy. Once submitted, each Worker Agent has a finite lifetime, and when the proxy expires all the Worker Agents are killed. One way to alleviate this is to create a new proxy a few days before the old one expires and let the workers be gradually replaced (as some of them are likely to fail even after running for a while).

Lattice QCD production data

Lattice QCD: the story so far

  • Production run using the Gear VO
    • Collaboration sites:
      • Swiss National Supercomputing Centre
      • National Institute for Subatomic Physics, Netherlands
      • CYFRONET, Poland
      • CERN
  • ~2 million CPU hours / 3 months = 231 years on a single machine!
  • ~4.3 TB of data transferred


  • introduction:
    • latticeQCD: long term theoretical physics computation, 3 months so far, ~1000 workers, 1.2M iterations, ...
    • computational model, snapshots, maximum worker amount, snapshots based on the beta parameter (physics)
    • problem at hand: automatic worker submission, keeping specified amount of workers alive, etc
    • issues: jobs do not live forever, architecture issues, various failures (more in error analysis)
  • meta-algorithm:
    • dynamic algorithm that can adapt to changes on the grid
    • heuristic approach independent of underlying application (relies only on the application exit code); no glue (JDL) requirements
    • good computing elements and bad computing elements - where to draw the line?
      • positive feedback: running jobs, jobs that finished running without any errors (based on diane stderr, application/framework error)
      • negative feedback: pending jobs (to avoid over submitting), failed jobs, all other jobs without clear status (stuck)
    • fitness: a measure of reliability; the ratio of running + completed (error free) jobs in total number of jobs
    • how does it work: maintain a list of known computing elements and their corresponding fitness ratios and non-deterministically choose a computing element with the probability proportional to its fitness; alternatively, submit a job to the grid without specifying any computing element at all
    • characteristics:
      • once the fitness hits 0 no job will be submitted to the computing element explicitly (might be submitted through generic grid slot)
      • forgetting about the old data - influences the decision process and reflects the dynamic character of the grid (conditions change)
      • forgetting about unpromising computing elements: result of removing the old data; once the data is removed and the script is restarted, the information about computing element will be forgotten
  • examples:
    • good computing element (fitness ~1), bad computing element (fitness ~0)
    • handling jobs stuck in the queue, effects on the fitness
  • generic Grid slot:
    • balancing element offsetting part of the decision process to the grid
    • fitness=1
    • used for discovery of new computing elements
    • a chance for a computing element with fitness 0 to rehabilitate itself (instead of waiting for the old data to be removed)
    • examples: starting up the script
  • implementation:
    • Ganga script (stores data in gangadir directory)
    • information from the master (#workers)
    • information from Ganga from the job repository
    • usage parameters
  • usage scenarios + experience
    • acrontab (periodically start up and run for a specified number of hours - 3h, 6h, etc); good for use with lxplus
    • live process
    • only one instance per workspace allowed (due to the active monitoring which doesn't like more than one process)
    • file lock system to prevent many instances running in parallel
    • simple kill mechanism: when running on lxplus via acrontab it is unclear on which computer the process is running so the alternative way to kill it is very useful
  • observation results:
    • does not over submit jobs to a particular computing element
    • bad computing elements quickly tend to 0 fitness
    • handles temporarily unavailable computing elements well (i.e. readjusts fitness over time)

Error analysis

  • jobs with stderr; types of errors (mostly due to incompatible architecture?)
  • no stderr, but has loginfo - check how long it was running on workernode
  • save error logs (stderr/stdout), print job to file, j.backend.loginfo -> both pending and failed jobs
  • gangadir/agent_factory/....
  • job longevity analysis (per CE)
  • detection of false-positives

Future extensions/ideas

  • performance-based fitness parameter based on the processing power of a computing element; requires greater diane integration (number of tasks completed in time)
  • directory service + agent factory as another service to greatly simplify using the grid (automatically create the worker pool and let the directory service manage it)

-- MaciejWos - 06 Aug 2008

Topic attachments
  • agent-factory-presentation.pdf (r1, 1461.9 K, 2008-09-10, MaciejWos) - A copy of my presentation about the Agent Factory and its use in the Lattice QCD project
Topic revision: r4 - 2008-09-10 - MaciejWos