TWiki> ArdaGrid Web>AgentFactory (revision 8)EditAttachPDF

DRAFT - Agent Factory - DRAFT

LQCDAgentFactoryDataFiles

Usage: acrontab / screen

AgentFactory is a Ganga script which is a part of diane distribution. Thus, it is imperative to check whether both Ganga and diane are available and whether the environment is set-up correctly; this can be done the following way:

(i) diane

> which diane-env /afs/cern.ch/sw/arda/diane/install/VERSION/bin/diane-env

if you don't have diane-env in PATH, please do the following: > $(/afs/cern.ch/sw/arda/diane/install/VERSION/bin/diane-env)

where VERSION is the suitable diane version; AF is distributed with 2.0-beta16 and newer.

(ii) Ganga

> which ganga /afs/cern.ch/sw/arda/diane/ganga/install/VERSION/bin/ganga

or if not present: export PATH=$PATH:/afs/cern.ch/sw/arda/diane/ganga/install/VERSION/bin

where VERSION is the suitable Ganga version

(iii) AgentFactory

> diane-env -d ganga AgentFactory.py --help

If everything is fine, running the above command should produce the AgentFactory help screen:

(...)
DIANE Ganga Submitter
usage: AgentFactory.py [options]

Submit worker agents to LCG. DIANE Worker Agent Submitter starts N worker
agents. The worker agents are configured according to the specification in a
run file and connect the specified run master or a directory service.
options:
  -h, --help            show this help message and exit
  --diane-run-file=RUNFILE
                        optional file containing the run description and
                        submission customization parameters (see diane-run
                        command)
  --diane-master=MASTERLOC
                        the run master identifier. With
                        MASTERLOC=workspace:auto (default) the workers will
                        connect to the last master (run) defined in the local
                        DIANE_USER_WORKSPACE. Use arbitrary file:
                        MASTERLOC=file:path/to/MasterOID (the file: prefix may
                        be skipped). Use the given run number:
                        MASTERLOC=workspace:N
(...)

(iv) Configuration file; AFS token expiry

(v) run the agent factory

> screen > diane-env -d ganga --config=CONFIG_FILE AgentFactory.py --diane-worker-number=1000 --diane-max-pending=50 --repeat-interval=300 --square-fitness

When no data is stored on AFS, it is possible to run AgentFactory as a live process. Since the most common use-case requires AF to run for extended period of time, it is worth running on screen; this can be done the following way:

ganga --config=config-lostman.gear AgentFactory.py --diane-worker-number=1063 --diane-max-pending=50 --repeat-interval=300 --run-time=10800 >& /dev/null

Lattice QCD: searching for QCD critical point

Quantum Chromodynamics (QCD) describes the strong interactions between quarks and gluons, which are normally confined inside protons, neutrons and other baryons. At high temperature or density, however, the situation changes: baryons "melt" and quarks and gluons, now deconfined, form a plasma. When the net baryon density is zero, this change is rapid but "smooth" (analytic) as the temperature is raised. On the contrary, at high baryon density it is believed that this change is accompanied by a phase transition, similar to that of ice turning into water. For a particular intermediate baryon density, the latent heat of this phase transition vanishes (the phase transition is second-order).The corresponding temperature and baryon density define the so-called QCD critical point, which is the object of both experimental and theoretical searches, the former by heavy-ion collision experiments at RHIC (Brookhaven) and soon at LHC, the latter by numerical lattice simulations.

What makes the project particularly interesting is that the measured response to an increase in baryon density is opposite in sign to standard expectations. This surprising result is obtained, for now, on a very coarse lattice, and needs to be confirmed on finer ones. If it turns out to be a property of the continuum QCD theory, it will have a profound impact on the QCD phase diagram. In particular, it will make it unlikely to find a QCD critical point at small baryon density.

To simulate QCD on a computer, one discretizes space and time into a 4-dimensional grid. The quark and gluon fields live respectively on the sites and bonds of this "lattice". The computer generates a sample of the most important configurations of quark and gluon fields, evolving them one Monte Carlo step at a time. In the present project, the same system is studied at 18 different temperatures around that of the phase transition, and what is measured is the response to a small increase in the baryon density. The signal is tiny and easily drowned by the statistical fluctuations. The necessary high statistics are obtained by running, at each temperature, many independent Monte Carlo threads. This strategy allows the simultaneous use of many Grid nodes in an embarrassingly parallel mode.

Computational model

The Lattice QCD computation is organized around the quark and gluon field snapshots. Those snapshots are files, each of the size of approximately 10 MB and they contain the most interesting field configurations, that is, the values of those fields spaced in discrete intervals.

computational-model.png

Each of the snapshots represents a separate computational task and its state is evolved independently. When a Worker Agent connects to the Master, it receives a snapshot to work on and keeps evolving it for as long as possible, sending the results of the consecutive iterations back to the Master. The number of snapshots varied during different phases of the project. Initially there were 1500 configurations to work with, but over time some of them have been deemed to be uninteresting and were discarded. At the time this text is being written there are 1063 snapshots left in the production run.

Furthermore, not all of the 18 different temperatures that the system is being studied at give equally interesting results. The scheduling algorithm has been adjusted to cope with this and it gives priority to certain configurations instead of evolving all of them equally.

Setup

All of the computation is run on the grid. The parallel nature of the Monte Carlo method (independent random sampling) makes it a perfect environment.

We use the following Ganga/diane setup:

setup.png

The diane master is responsible for task scheduling. It decides the order in which the snapshots should be processed and schedules the tasks to individual workers.

The workers do the computation and communicate the results back to the master. In order to avoid unnecessary network traffic, once a particular worker receives a snapshot, it will keep processing it until stopped by the master or by the resource manager. This way it is not necessary to send a new snapshot to the worker which has just finished running its previous iteration. It can continue to work on the same file.

Finally, the Agent Factory automatically generates the workers by submitting appropriate jobs to the grid. It also monitors their number and is responsible of maintaining it on a level specified by the user.

Motivation

One thing we noticed when Agent Factory development was started was that when a lot of jobs were submitted to the grid, most of the failed shortly afterwards. The reason was the executable being incompatible with AMD processors and, as it turned out, significant part of our grid resources consisted of AMD Opteron based machines. Further investigation also revealed a vast array of other reasons leading to workers failing to run properly. Those reasons include missing libraries, network connectivity/firewall problems, server side configuration problems, etc..

Furthermore, it is not uncommon for the faulty computing element to have numerous jobs scheduled by the resource broker. This might be happening because of the computing element being unoccupied (since no job can run on it) and thus considered ready to accept more tasks.

This led to a situation where the majority of the Worker Agents submitted failed to run and it was becoming increasingly tedious to send more and more of them "by hand".

Algorithm

The idea was that we try to do better than simply submitting workers to the grid and to do so we will utilize some sort of feedback on its current state. To do this, the fitness proportional selection (also known as roulette wheel algorithm commonly used in genetic algorithms) was adopted.

The algorithm aims to improve the submission success rate by micromanaging the computing elements. The idea is that submitting jobs to the computing elements which are deemed to be unreliable is discouraged, while at the same time the the good computing elements are balanced to avoid suffering from over submitting. Ideally, we would send as many workers as the computing element is willing to accept and stay close to this boundary. However, this boundary may change over time and we have to adjust our expectations to anticipate this.

Good / bad Computing Element - where to draw the line?

For the aforementioned idea to work we need to find a way to rank the reliability of the computing elements. In order to do so, we need to decide what kind of feedback to use and how to turn it into a reliability measure.

After a job is submitted to the grid, there are few possible scenarios of what can happen: the job can wait in the queue, run, fail (immediately or after running for some time), be removed from the queue by the resource manager, etc. One can classify those possibilities into two categories: (i) those that count in favor and (ii) those that do not and based on how we decide to carry on this classification, a simple reliability measure can be constructed.

In the group (i), being the positive feedback are the number of running jobs (in the "running" status) and the number of jobs that finished running without rising any erros (in the "completed" or "failed" state, but without any error messages on stdout; the job is considered failed when "ERROR" or "EXCEPTION" keywords appear on its stderr).

The negative feedback on the other hand is everything else: failed jobs, jobs waiting to be run, completing (or in any other transient status), etc.

Fitness = a measure of reliability

The Agent Factory ranks the computing elements using a reliability measure called fitness, defined in the following way:

( running + completed ) / total

fitness.png

This formula makes fitness a number in the [0,1] interval with fitness=1 meaning that the computing element is very reliable (all the jobs assigned to it are either running or finished running without any errors) and fitness=0 meaning that it is not reliable at all (none of the jobs assigned to it are running).

Nondeterministic decision process

Once we compute the fitness of each computing element, we can start deciding where to submit generated Worker Agents.

nondeterministic-decision-process.png

Once we compute the fitness for each of the computing elements we can then decide where to send individual jobs. We use the method called fitness proportional selection which nondeterministcially chooses the computing element with probability proportional to its fitness and the fitness of the whole population (i.e. entirety of the discovered computing elements).

It works in the following way: First we add up all the finesses of the individual computing elements, each of which is a number between 0 and 1; You can think of it graphically as constructing a line segment where each computing element occupies some part of this segment; Then we add +1 to this number to include one extra element - generic grid slot (which simply means that we submit the job without specifying any computing element and we let the grid handle this). Once we've done this, we can start choosing the computing elements.

We select a random number drawn uniformly from the range between 0 and the total fitness (including the +1); It points to somewhere on the line segment; we then check which computing element occupies this region and select it to run the job.

One of the properties of this method is that when the fitness of a given computing element hits 0, we will never explicitly submit a job to it (as it would be equivalent to occupying a part of line sement of width 0, which is impossible to select). If we however submit the job to the grid (through the generic grid slot), without the computing element annotation, we have no control on where it will be run and it is possible that it will be submitted to a computing element which we dismissed as being "bad", allowing it to rehabiliate itself.

Worker Agent submission

Finally there is the job submission phase. In Agent Factory the submission mechanism builds on top of Submitter class. The Submitter object encapsulates the logic to submit a single job; that is, it specifies where the job should be run (the local backend, grid, etc.), while Agent Factory encapsulates the logic to submit many jobs over time, in order to keep the steady worker level; The default submitter is the LCGSubmitter; it is a slightly changed version of the LCG.py script which was available in previous versions of diane (and is probably still available now). The only real difference is that while LCG.py was a script, LCGSubmitter is now a class and it makes it more reusable and extensible.

Implementation

Agent Factory is a Ganga script written in Python and it follows the Ganga directory structure. When run, it creates the agent_factory directory inside Ganga's output directory (~/gangadir) where it stores it's data. This includes the internal data as well as the additional logs. Attempt to gather all available information about failed jobs (full print, loginfo, stdout, stderr, etc.) is made, in order to provide an easy way to check what kind or problems one can expect.

Additionally a simple file lock system is implemented to prevent many Agent Factory instances from running in parallel. This is necessary as Ganga allows the existence one monitoring loop only and having two or more monitoring loops accessing the same workspace is very likely to result in a crash.

Usage scenarios

I've already mentioned that you can start the Agent Factory via acrontab. There are also other ways to run it and Agent Factory has few command line options to accommodate for the possible cases.

For example you can just run it as a live process using the following command:

ganga --config=config-lostman.gear AgentFactory.py --diane-worker-number=1063 --diane-max-pending=50 --repeat-interval=300

The only options that have to be specified are --diane-worker-number which is the desired number of Worker Agents that you'd like to keep alive and --repeat-interval which specifies how often should the AgentFactory query the Master for the number of alive workers. Additionally you can specify --diane-max-pending which puts a limit on the number of jobs you are willing to submit while having other jobs already waiting to be run. For example if you specify --diane-max-pending=50 and then submit 50 jobs and none of them will finish or start running, then the script will stop submitting new Worker Agents until situation changes.

Running Agent Factory as a live process does have certain problems though. If you store any data on AFS (example: proxy certificates) then after 24h your AFS token will expire and it will cause errors (Ganga needs the access to the proxy file). The solution to this is to run the Agent Factory via acrontab which starts each scheduled process with a fresh ticket. The following command can be used to start it via acrontab:

ganga --config=config-lostman.gear AgentFactory.py --diane-worker-number=1063 --diane-max-pending=50 --repeat-interval=300 --run-time=10800 >& /dev/null

There is also a simple mechanism to kill the script. If you run AgentFactory.py with --kill option, it will place the marker file in the agent_factory directory and upon detecting this file, the script will terminate (and remove the marker too). This is particularly usefull when running Agent Factory on lxplus via acrontab - if started that way you can't see on which node the script will run and thus killing it would be a very hard task (unless you wan't to check all lxplus nodes).

ganga --config=config-lostman.gear AgentFactory.py --kill

The only difference between this and the previous one is the addition of --run-time option and redirecting the standard output. The first one is just a means to make the script finish before the AFS token expires and let the new one be started and the second one redirects the stdout and stderr to /dev/null as the acrontab sends users the emails containing all standard output. Its much more convenient to access the logs instead.

Agent Factory becomes really useful when a large amount of diane workers is required over an extended period of time and thus it has been equipped with a variety of configuration options to accommodate for the many possible use cases; I would like to explain some of those options here and demonstrate two possible usage scenarios - running Agent Factory continuously as a live process or periodically through Acrontab (cron);

Finally, Agent Factory is not without limitations. There can be only once instance running, so if more are required a separate configuration file and gangadir needs to be created. Furthermore, there is no way to extends the life time of the proxy. Once submitted, each Worker Agent has a finite lifetime and when proxy expires all the Worker Agents are killed. One way to alleviate this is to create new proxy few days before old one finishes and let the workers be gradually replaced (as some of them are likely fail, even after running for a while).

Add

Lattice QCD production data

Lattice QCD: the story so far

Lattice QCD has been running using Gear VO. There are four collaboration sites involved in this project: CERN, Swiss National Supercomputing Centre, National Institute for Subatomic Physics in Netherlands and CYFRONET in Poland. Additionally, part of the testing was carried on using Geant4 VO.

Over the period of the 3 months we have done more than 2 million CPU hours and transmitted more than 4 TB of data over the network. This is roughly equivalent to 231 years of computation on a single machine.

presentation

  • introduction:
    • latticeQCD: long term theoretical physics computation, 3 months so far, ~1000 workers, 1.2M iterations, ...
    • computational model, snapshots, maximum worker amount, snapshots based on the beta parameter (physics)
    • problem at hand: automatic worker submission, keeping specified amount of workers alive, etc
    • issues: jobs do not live forever, architecture issues, various failures (more in error analysis)
  • meta-algorithm:
    • dynamic algorithm that can adapt to changes on the grid
    • heuristic approach independent of underlying application (relies only on the application exit code); no glue (JDL) requirements
    • good computing elements and bad computing elements - where to draw the line?
      • positive feedback: running jobs, jobs that finished running without any errors (based on diane stderr, application/framework error)
      • negative feedback: pending jobs (to avoid over submitting), failed jobs, all other jobs without clear status (stuck)
    • fitness: a measure of reliability; the ratio of running + completed (error free) jobs in total number of jobs
    • how does it work: maintain a list of known computing elements and their corresponding fitness ratios and non-deterministically choose a computing element with the probability proportional to its fitness; alternatively, submit a job to the grid without specifying any computing element at all
    • characteristics:
      • once the fitness hits 0 no job will be submitted to the computing element explicitly (might be submitted through generic grid slot)
      • forgetting about the old data - influences the decision process and reflects the dynamic character of the grid (conditions change)
      • forgetting about unpromising computing elements: result of removing the old data; once the data is removed and the script is restarted, the information about computing element will be forgotten
  • examples:
    • good computing element (fitness ~1), bad computing element (fitness ~0)
    • handling jobs stuck in the queue, effects on the fitness
  • generic Grid slot:
    • balancing element offsetting part of the decision process to the grid
    • fitness=1
    • used for discovery of new computing elements
    • a chance for a computing element with fitness 0 to rehabilitate itself (instead of waiting for the old data to be removed)
    • examples: starting up the script
  • implementation:
    • Ganga script (stores data in gangadir directory)
    • information from the master (#workers)
    • information from Ganga from job repostitory
    • usage parameters
  • usage scenarios + experience
    • acrontab (periodically start up and run for a specified number of hours - 3h, 6h, etc); good for use with lxplus
    • llive process
    • only one instance per workspace allowed (due to the active monitoring which doesn't like more than one process)
    • file lock system to prevent many instances running in parallel
    • simple kill mechanism: when running on lxplus via acrontab it is unclear on which computer the process is running so the alternative way to kill it is very useful
  • observation results:
    • does not over submit jobs to a particular computing element
    • bad computing elements quickly tend to 0 fitness
    • handles temporarily unavailable computing elements well (i.e. readjusts fitness over time)

error analysis

  • jobs with stderr; types of errors (mostly due to incompatible architecture?)
  • no stderr, but has loginfo - check how long it was running on workernode
  • save error logs (stderr/stdout), print job to file, j.backend.loginfo -> both pending and failed jobs
  • gangadir/agent_factory/....
  • job longevity analysis (per CE)
  • detection of false-positives

-+ future extensions/ideas

  • performance based fitness parameter based on processing power of computing element; requires greater diane integration (amount of completed tasks in time)
  • directory service + agent factory as a another service to greatly simplify using the grid (automatically create the worker pool and let directory service manage it)

-- MaciejWos - 06 Aug 2008

Topic attachments
I Attachment History Action Size Date Who Comment
PDFpdf agent-factory-presentation.pdf r1 manage 1461.9 K 2008-09-10 - 16:20 MaciejWos A copy of my presentation about the Agent Factory and it's use in Lattice QCD project
PNGpng computational-model.png r2 r1 manage 49.5 K 2008-09-12 - 02:51 MaciejWos  
PNGpng fitness.png r1 manage 77.3 K 2008-09-12 - 02:36 MaciejWos  
PNGpng nondeterministic-decision-process.png r1 manage 52.5 K 2008-09-12 - 02:38 MaciejWos  
PNGpng setup.png r1 manage 76.9 K 2008-09-12 - 02:39 MaciejWos  
Edit | Attach | Watch | Print version | History: r9 < r8 < r7 < r6 < r5 | Backlinks | Raw View | Raw edit | More topic actions...
Topic revision: r8 - 2009-02-13 - MaciejWos
 
    • Cern Search Icon Cern Search
    • TWiki Search Icon TWiki Search
    • Google Search Icon Google Search

    ArdaGrid All webs login

This site is powered by the TWiki collaboration platform Powered by PerlCopyright & 2008-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
or Ideas, requests, problems regarding TWiki? use Discourse or Send feedback