Future DIANE developments

Infrastructure changes

Release infrastructure

Adapt to the Ganga infrastructure for releases, external packages and testing.

Installation tree:

install
  2.0.0
     bin
     python
     ...
external
  omniorb
  ApMon -> /afs/cern.ch/sw/ganga/external/ApMon

Project tree with AFS prefix /afs/cern.ch/sw/arda/diane

install
external
www
workdir

CVS restructure

Source tree (CVS):

diane
  bin
    diane-run
    diane-submit-workers
    diane-directory-service
  python
    diane
      PACKAGE.py
      core
      idl
    GangaDIANE
  applications
    crashtest
      __init__.py
      crashtest.py
      crashtest.job

Manage workers and master by Ganga

Input/output files, submission to the backends, etc. are managed by Ganga (the --ganga mode becomes the default).

Hurng-Chun: Do we manage the workers as Ganga subjobs if they are submitted by Ganga? The advantage is having all the LCG workers submitted at once as a single bulk job.

Do we also manage the master as a Ganga local job?
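
A minimal sketch of what bulk worker submission could look like from a Ganga GPI session, assuming the standard Job/Executable/LCG/ArgSplitter objects; the diane-worker-start executable name and the <master-ior> argument are hypothetical placeholders:

# bulk submission sketch (Ganga GPI); diane-worker-start and <master-ior> are hypothetical
j = Job(name='diane-workers')
j.application = Executable(exe='diane-worker-start', args=['<master-ior>'])
j.backend = LCG()
# one subjob per worker: 20 workers leave as a single bulk submission
j.splitter = ArgSplitter(args=[['<master-ior>'] for i in range(20)])
j.submit()

# the master itself could similarly be run as a Ganga job on the Local backend:
# m = Job(name='diane-master', application=Executable(exe='diane-master-start'), backend=Local())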

Massimo: Not sure where to write this about "Master facilities". I tend to think (maybe politically incorrectly) that the Ganga layer should be "invisible", i.e. DIANE should be powered by Ganga without exposing (at zero level) Ganga commands. On the other hand, I would like to code something like the following example, which sounds very Ganga-like...


for task in taskList:
    if not master.knownTask(task):
        master.addTask(task)

or

for task in master.failedTasks():
    task.printSummary()

In some sense, all this would be nice (making the master a skeleton application steered by scripts, i.e. a clear API), but this could make the project too difficult and ultimately make it fail...

For some more basic facilities, I have a small list... probably some of these things are already there (but only Kuba knows how to use them).

Master facilities

  • Ping
    • via the directory service? I would go for something like ds-list [ds-oid] to get all the master OIDs and then diane-ping master-oid
  • Switch trace level (for debugging).
    • To allow more/less info to go to the log file (local to the master)
    • diane-set-verbosity 0/1/2/3/4
  • Interrogate the logs from "anywhere"
    • diane-logtail [-n] master-oid, similar to the Unix tail command (a sketch follows this list)
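
A minimal sketch of how diane-logtail could work against the directory service; the DirectoryService class and the master's log_tail() method are hypothetical placeholders, not existing DIANE calls:

# sketch of the proposed diane-logtail command; DirectoryService and log_tail() are hypothetical
import sys

def logtail(master_oid, lines=20):
    ds = DirectoryService()              # hypothetical: contact the directory service
    master = ds.lookup(master_oid)       # hypothetical: resolve the OID into a master proxy
    for line in master.log_tail(lines):  # hypothetical remote call returning the last N log lines
        sys.stdout.write(line + '\n')

if __name__ == '__main__':
    # usage: diane-logtail [-n] master-oid
    args = sys.argv[1:]
    n = 20
    if args and args[0].startswith('-'):
        n = int(args.pop(0)[1:])
    logtail(args[0], n)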

Share utilities with Ganga

For example logging/configuration?
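
For example, a minimal sketch assuming Ganga keeps its current Ganga.Utility.logging and Ganga.Utility.Config entry points (getLogger, makeConfig, addOption):

# sketch of reusing Ganga logging and configuration utilities (assumed Ganga API)
from Ganga.Utility.logging import getLogger
from Ganga.Utility.Config import makeConfig

logger = getLogger('DIANE')                          # shared logging facility
config = makeConfig('DIANE', 'DIANE options')        # shared configuration section
config.addOption('Verbosity', 2, 'default log verbosity of the master')

logger.info('master starting with verbosity %d', config['Verbosity'])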

Interface changes

Revisit and simplify shell commands

diane-run input.job [[WN,BKND]...]

This command is equivalent to diane.startjob.

diane-submit-workers

Allow Ganga as a user interface

See DIANEGangaIntegration prototypes.

Core changes

Heartbeat mechanisms from Workers to the Master

Currently the master pings the workers. It should be the other way round. Massimo: Maybe... Clearly you need it if there is no connection between workers and master... Could the DS help in this?
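
A minimal sketch of a worker-side heartbeat thread; master.heartbeat() stands in for a remote call the master would have to expose, and the 30-second period is an arbitrary example:

# worker-side heartbeat sketch; master.heartbeat() is a hypothetical remote call
import threading, time

def heartbeat_loop(master, worker_id, period=30):
    while True:
        try:
            master.heartbeat(worker_id)    # hypothetical: "I am alive"
        except Exception:
            # a failed heartbeat is not fatal; the master notices the silence instead
            pass
        time.sleep(period)

def start_heartbeat(master, worker_id):
    t = threading.Thread(target=heartbeat_loop, args=(master, worker_id))
    t.setDaemon(True)    # do not keep the worker process alive just for the heartbeat
    t.start()
    return t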

Dynamic change of connection-oriented / connection-less policy

The system should react better to broken TCP/IP connections ("BiDirConn gone"). It should also support a shutdown of idle connections for high-throughput applications with a low output rate. The shutdown is an omniORB feature, but the M/W logic must be correct for it to work.
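
omniORB can scan and close idle connections on its own; a minimal sketch of configuring this at ORB initialization (the 60 and 120 second scan periods are arbitrary examples):

# sketch: let omniORB shut down idle connections by tuning the scan periods at ORB_init
import sys
from omniORB import CORBA

args = sys.argv + ['-ORBoutConScanPeriod', '60',   # close idle outgoing connections after ~60 s
                   '-ORBinConScanPeriod', '120']   # close idle incoming connections after ~120 s
orb = CORBA.ORB_init(args, CORBA.ORB_ID)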

Dynamic changes of threading model in the server

Even though omniORB supports thread pools and other thread management mechanisms, this currently does not work because of a problem in the M/W logic.
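
For reference, a minimal sketch of switching the server from thread-per-connection to omniORB's thread pool via initialization options (the pool size of 20 is an arbitrary example); the M/W logic would still have to tolerate requests from one worker being served by different threads:

# sketch: select the omniORB thread pool model at ORB_init
import sys
from omniORB import CORBA

args = sys.argv + ['-ORBthreadPerConnectionPolicy', '0',   # use the thread pool instead
                   '-ORBmaxServerThreadPoolSize', '20']    # cap the pool size
orb = CORBA.ORB_init(args, CORBA.ORB_ID)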

Out-of-process application adapters

Aka "WorkerBlind". Implemented already by Paola, need a review and maybe a rewrite.

Background dataset transfer on the worker nodes

The messages may be received in multiple parts, especially if files are part of data messages and are delivered in chunks.
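
A minimal sketch of reassembling a file that arrives in chunks; put_chunk() stands in for whatever method the worker would expose to receive parts of a data message:

# chunked file reassembly sketch; put_chunk() is a hypothetical worker-side method
class ChunkedFileReceiver:
    def __init__(self, path):
        self.path = path
        self.chunks = {}          # sequence number -> data
        self.expected = None      # total number of chunks, once known

    def put_chunk(self, seqno, data, total):
        self.chunks[seqno] = data
        self.expected = total
        if len(self.chunks) == total:
            self._write()         # last chunk arrived: assemble the file

    def _write(self):
        f = open(self.path, 'wb')
        for i in range(self.expected):
            f.write(self.chunks[i])
        f.close()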

Constraints layer

Sending additional information (benchmarks, constraints, worker group identification). Allow workers to be 'labeled' or organized into an arbitrary data structure (e.g. a tree).
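
A minimal sketch of what labeled workers organized in a tree could look like; the attribute names (benchmark, labels) are hypothetical examples of the additional information:

# worker labelling and grouping sketch; attribute names are hypothetical
class WorkerInfo:
    def __init__(self, oid, benchmark=None, labels=()):
        self.oid = oid
        self.benchmark = benchmark       # e.g. a SpecInt-like number
        self.labels = set(labels)        # e.g. {'site:CERN', 'slc4'}

class WorkerGroup:
    def __init__(self, name):
        self.name = name
        self.workers = []
        self.subgroups = {}              # name -> WorkerGroup (tree structure)

    def select(self, label):
        # return all workers in this subtree carrying a given label
        found = [w for w in self.workers if label in w.labels]
        for g in self.subgroups.values():
            found.extend(g.select(label))
        return found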

Task layer

Task manager interface: ability to arbitrarily modify the parallel pattern. Dynamic task decomposition. Managing the message rate on the master by grouping work into larger tasks which may be dynamically decomposed into smaller units.
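
A minimal sketch of a task that can be decomposed on demand, so the master hands out coarse tasks and splits them further when needed; the event-range representation and the default split factor are arbitrary examples:

# dynamic task decomposition sketch; the event-range model is an arbitrary example
class Task:
    def __init__(self, first_event, n_events):
        self.first_event = first_event
        self.n_events = n_events

    def decompose(self, parts=2):
        # split this task into smaller units covering the same event range
        if self.n_events <= 1:
            return [self]
        step = self.n_events // parts
        pieces, start, left = [], self.first_event, self.n_events
        while left > 0:
            n = min(step or 1, left)
            pieces.append(Task(start, n))
            start += n
            left -= n
        return pieces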

Secure connections and GSI/SSL integration

omniORB supports it and has already been compiled with the SSL flags. How should it be configured?
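
A minimal sketch of enabling SSL endpoints with omniORBpy, assuming the sslTP module shipped with an SSL-enabled build; the certificate paths and password are placeholders, and GSI proxy handling would still have to be added on top:

# SSL endpoint sketch (omniORBpy with SSL support); file names and password are placeholders
import sys
from omniORB import CORBA, sslTP

sslTP.certificate_authority_file('cacert.pem')   # CA used to verify peers
sslTP.key_file('master_key_and_cert.pem')        # this process's key + certificate
sslTP.key_file_password('placeholder')           # password protecting the key file

orb = CORBA.ORB_init(sys.argv + ['-ORBendPoint', 'giop:ssl::'], CORBA.ORB_ID)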

Persistent checkpointing of the state of the master

To be able to resume an interrupted master.
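
A minimal sketch of periodic checkpointing with pickle; what exactly goes into the state dictionary (task queue, completed tasks, worker registry) is an open design question:

# master checkpointing sketch; the content of 'state' is an open design question
import pickle, os

def save_checkpoint(state, path):
    tmp = path + '.tmp'
    f = open(tmp, 'wb')
    pickle.dump(state, f)
    f.close()
    os.rename(tmp, path)      # atomic replace: an interrupted write never corrupts the checkpoint

def load_checkpoint(path):
    if not os.path.exists(path):
        return None           # fresh start, nothing to resume
    f = open(path, 'rb')
    state = pickle.load(f)
    f.close()
    return state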

Application and job files

The BOOT mechanism should be better defined. Policy methods may be defined in applications (as objects) and parameterized from the job file. The application adapters (tarballs) should be shipped automatically to the worker nodes as part of worker initialization (via CORBA, not the input sandbox). The tarballs should be refreshed automatically whenever any files change within the application adapter package.
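
A minimal sketch of the refresh check: hash every file in the adapter package and rebuild the tarball only when the combined checksum changes; the file names used here are arbitrary:

# tarball refresh sketch: rebuild only when the package contents change
import os, hashlib, tarfile

def package_checksum(directory):
    digest = hashlib.md5()
    for root, dirs, files in os.walk(directory):
        dirs.sort()                       # deterministic traversal order
        for name in sorted(files):
            path = os.path.join(root, name)
            digest.update(open(path, 'rb').read())
    return digest.hexdigest()

def refresh_tarball(directory, tarball, stamp_file):
    checksum = package_checksum(directory)
    old = None
    if os.path.exists(stamp_file):
        old = open(stamp_file).read()
    if checksum != old:                   # something changed: repack and record the new checksum
        tar = tarfile.open(tarball, 'w:gz')
        tar.add(directory)
        tar.close()
        open(stamp_file, 'w').write(checksum)
    return tarball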

Other (also "crazy")

  • secure access integration A
  • MacOS A
  • xterm for local master A
  • output sandbox in GangaDIANE (define) A
  • getting the stdout and stderr of the worker while the job is running (peek functionality via ping method) K
  • Massimo: Old question: what is the main difference between a master and a worker? Presently this prevents the possibility to create hierarchies of masters... My understanding is that the main master limitation comes from the Hz it can sustain (which is not really changed by having a multi-tier structure).
  • Massimo: At the EGEE User Forum I saw a presentation (Biomed in Italy, main author G. Donvito) where they (if I remember correctly) separated the control layer (e.g. telling the next worker something like "do task #5") from actually passing the ingredients to do the task. What they do is that a master implements the control layer only (e.g. walking the list of tasks and assigning them to workers) and the ingredients are fetched by the workers, since these are entries of a (separate) database. The advantages might be (see the sketch after this list):
    • higher performance, assuming several DBs could hold different sections of the task space
    • more reliability (if a DB does not respond, try a different one)
    • a more sophisticated model: the worker can get a bunch of tasks in one go, perform some, fail some and "refuse" others... think of a system where it is unclear where the input data are...
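
A minimal sketch of the control/data separation described above: the master only hands out task identifiers and the workers fetch the ingredients from one of several databases, falling back to the next one on failure; all names (get_task_batch, db.get, run_task) are hypothetical:

# control/data layer separation sketch; all calls are hypothetical placeholders
def fetch_ingredients(task_id, databases):
    for db in databases:
        try:
            return db.get(task_id)        # hypothetical DB call returning the task inputs
        except Exception:
            continue                      # this DB does not respond: try a different one
    raise RuntimeError('no database could provide task %s' % task_id)

def worker_loop(master, databases):
    while True:
        task_ids = master.get_task_batch()        # control layer: just a bunch of task ids
        if not task_ids:
            break
        for task_id in task_ids:
            ingredients = fetch_ingredients(task_id, databases)   # data layer: separate fetch
            master.report(task_id, run_task(ingredients))         # run_task() is hypothetical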

-- JakubMoscicki - 16 Aug 2006
