Future DIANE developments
Infrastructure changes
Release infrastructure
Adapt to the Ganga infrastructure for releases, external packages and testing.
Installation tree:
install
    2.0.0
        bin
        python
        ...
external
    omniorb
    ApMon -> /afs/cern.ch/sw/ganga/external/ApMon
Project tree with AFS prefix
/afs/cern.ch/sw/arda/diane
    install
    external
    www
    workdir
CVS restructure
Source tree (CVS):
diane
bin
diane-run
diane-submit-workers
diane-directory-service
python
diane
PACKAGE.py
core
idl
GangaDIANE
applications
crashtest
__init__.py
crashtest.py
crashtest.job
Manage workers and master by Ganga
Input/output files, submission to the backends etc. - managed by Ganga (--ganga mode becomes default)
Hurng-Chun: Do we manage the workers as Ganga subjobs if they are submitted by Ganga? The advantage is that all the LCG workers can be submitted at once as a single bulk job.
Do we also manage the master as a Ganga local job?
Massimo: Not sure where to write this about "Master facilities". I tend to think (maybe politically incorrect) that the Ganga layer should be "invisible", i.e. DIANE should be powered by Ganga without exposing (at zero level) Ganga commands. On the other hand, I would like to code something like the following example, which sounds very Ganga-like...
    for task in taskList:
        if not master.knownTask(task):
            master.addTask(task)
or
    for task in master.failedTasks():
        task.printSummary()
In some sense, all this would be nice (making the master a skeleton application steered by scripts, i.e. a clear API), but it could make the project too difficult and ultimately make it fail...
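A minimal sketch of what such a script-steered master could look like. Only the method names `knownTask`, `addTask`, `failedTasks` and `printSummary` come from the example above; the `Master` and `Task` classes, the `status` field and its values are hypothetical and are not existing DIANE code.

```python
class Task:
    """Hypothetical task record; not an existing DIANE class."""

    def __init__(self, task_id, status="new"):
        self.task_id = task_id
        self.status = status  # "new", "done" or "failed" in this sketch

    def summary(self):
        return "task %s: %s" % (self.task_id, self.status)

    def printSummary(self):
        print(self.summary())


class Master:
    """Hypothetical in-memory master that a script can steer."""

    def __init__(self):
        self._tasks = {}  # task_id -> Task

    def knownTask(self, task):
        return task.task_id in self._tasks

    def addTask(self, task):
        self._tasks[task.task_id] = task

    def failedTasks(self):
        return [t for t in self._tasks.values() if t.status == "failed"]


master = Master()
taskList = [Task(1), Task(2, status="failed"), Task(1)]

# Add only tasks the master has not seen yet (the second Task(1) is skipped).
for task in taskList:
    if not master.knownTask(task):
        master.addTask(task)

for task in master.failedTasks():
    task.printSummary()
```

Keeping the scripting surface this small is one way to get a clear API without exposing Ganga commands directly.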
For some more basic facilities, I have a small list... probably some of these things are already there (but only Kuba knows how to use them).
Master facilities
- Ping
- via directory service? I would go for something like: ds-list [ds-oid] to get all the master OIDs and then diane-ping master-oid
- Switch trace level (for debugging).
- To allow more/less info to go to the log file (local with the master)
- diane-set-verbosity 0/1/2/3/4
- Interrogate the logs from "anywhere"
- diane-logtail [-n] master-oid (similar to the Unix tail command)
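The verbosity switch and the log tail from the list above could be sketched as follows, using only the Python standard library. The mapping of the levels 0 to 4 onto `logging` levels is an assumption, and `set_verbosity` and `log_tail` are hypothetical helper names; how the request reaches the master (for example via its OID) is left out.

```python
import logging
from collections import deque

# Assumed mapping of "diane-set-verbosity 0/1/2/3/4" to logging levels.
VERBOSITY_TO_LEVEL = {
    0: logging.CRITICAL,
    1: logging.ERROR,
    2: logging.WARNING,
    3: logging.INFO,
    4: logging.DEBUG,
}


def set_verbosity(logger, verbosity):
    """Switch the trace level of an already running logger."""
    if verbosity not in VERBOSITY_TO_LEVEL:
        raise ValueError("verbosity must be 0..4, got %r" % (verbosity,))
    logger.setLevel(VERBOSITY_TO_LEVEL[verbosity])


def log_tail(lines, n=10):
    """Return the last n items of an iterable of log lines, like 'tail -n'."""
    return list(deque(lines, maxlen=n))
```

Because `log_tail` accepts any iterable, it can be fed an open log file directly, so the whole file never has to be held in memory.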
Share utilities with Ganga
For example logging/configuration?
Interface changes
Revisit and simplify shell commands
diane-run input.job [[WN,BKND]...]
This command is equivalent to diane.startjob.
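A sketch of how the simplified command line could be parsed. It assumes that each `WN,BKND` pair means "number of workers, backend name"; that reading, the function name `parse_run_args` and the backend names in the usage note are assumptions, not part of the existing commands.

```python
def parse_run_args(argv):
    """Parse 'input.job [WN,BKND ...]' into (job_file, [(count, backend), ...])."""
    if not argv:
        raise ValueError("usage: diane-run input.job [[WN,BKND]...]")
    job_file = argv[0]
    workers = []
    for spec in argv[1:]:
        count_text, sep, backend = spec.partition(",")
        if not sep or not backend:
            raise ValueError("expected WN,BKND but got %r" % (spec,))
        count = int(count_text)  # raises ValueError if WN is not a number
        if count < 1:
            raise ValueError("worker count must be positive in %r" % (spec,))
        workers.append((count, backend))
    return job_file, workers
```

For example, `parse_run_args(["input.job", "10,LCG", "2,Local"])` yields the job file name and two (count, backend) pairs.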
diane-submit-workers
Allow Ganga as a user interface
See DIANEGangaIntegration prototypes.
Core changes
Heartbeat mechanisms from Workers to the Master
Currently the master pings the workers. It should be the other way round.
Massimo: Maybe... Clearly you need it if there is no connection between workers and master... Could the DS help in this?
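A minimal sketch of the worker-to-master direction: workers report in, and the master only checks which of them have been silent for too long. `HeartbeatMonitor`, its methods and the timeout value are hypothetical; the remote call that would deliver `beat()` (CORBA, or a relay through the DS) is not shown.

```python
import time


class HeartbeatMonitor:
    """Hypothetical master-side record of the last heartbeat per worker."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout   # seconds of silence before a worker counts as lost
        self._clock = clock      # injectable so the logic can be tested without sleeping
        self._last_seen = {}     # worker_id -> time of last heartbeat

    def beat(self, worker_id):
        """Called on behalf of a worker each time its heartbeat arrives."""
        self._last_seen[worker_id] = self._clock()

    def lost_workers(self):
        """Return ids of workers whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(
            wid for wid, seen in self._last_seen.items()
            if now - seen > self.timeout
        )
```

With this arrangement the master never has to open a connection to a worker, which is the case that matters when workers sit behind firewalls.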
Dynamic change of connection-oriented / connection-less policy
The system should react better to broken TCP/IP connections ("BiDirConn gone"). It should also support shutting down idle connections for high-throughput applications with a low output rate. The shutdown is an omniORB feature, but the M/W logic must be correct for it to work.
Dynamic changes of threading model in the server
Even though omniORB supports thread pools and other thread management mechanisms, this currently does not work because of a problem in the M/W logic.
Out-of-process application adapters
Aka "WorkerBlind". Already implemented by Paola; needs a review and maybe a rewrite.
Background dataset transfer on the worker nodes
The messages may be received in multiple parts, especially if files are part of data messages and are delivered in chunks.
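Receiving a message in parts could be sketched like this. It assumes each chunk carries its index and that the total number of chunks is known up front; `ChunkAssembler` and that framing are hypothetical, not the existing message format.

```python
class ChunkAssembler:
    """Hypothetical collector for a message delivered in numbered chunks."""

    def __init__(self, total_chunks):
        self.total_chunks = total_chunks
        self._chunks = {}  # chunk index -> bytes

    def add(self, index, data):
        """Store one chunk; chunks may arrive in any order or be repeated."""
        if not 0 <= index < self.total_chunks:
            raise IndexError("chunk index %d out of range" % index)
        self._chunks[index] = data

    def complete(self):
        return len(self._chunks) == self.total_chunks

    def payload(self):
        """Return the reassembled message once every chunk has arrived."""
        if not self.complete():
            raise ValueError("message is still incomplete")
        return b"".join(self._chunks[i] for i in range(self.total_chunks))
```

A worker could keep processing its current task while chunks for the next dataset accumulate, and only switch once `complete()` returns true.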
Constraints layer.
Sending additional information (benchmarks, constraints, worker group identification). Allow workers to be 'labeled' or organized into an arbitrary data structure (e.g. a tree).
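One simple way to get a tree-like organization is to label each worker with a path and select workers by path prefix. The sketch below assumes that representation; `WorkerRegistry`, the label values and the benchmark numbers are purely illustrative.

```python
class WorkerRegistry:
    """Hypothetical registry that files workers under tree-like label paths."""

    def __init__(self):
        self._workers = {}  # worker_id -> (label path as tuple, benchmark)

    def register(self, worker_id, path, benchmark=None):
        self._workers[worker_id] = (tuple(path), benchmark)

    def select(self, prefix=()):
        """Return ids of workers whose label path starts with the given prefix."""
        prefix = tuple(prefix)
        return sorted(
            wid for wid, (path, _) in self._workers.items()
            if path[:len(prefix)] == prefix
        )


registry = WorkerRegistry()
registry.register("w1", ("lcg", "cern"), benchmark=1.2)
registry.register("w2", ("lcg", "infn"), benchmark=0.9)
registry.register("w3", ("local",))
```

Here `registry.select(("lcg",))` picks every worker in the "lcg" subtree, while an empty prefix selects all workers.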
Task layer
Task manager interface: ability to arbitrarily modify the parallel pattern. Dynamic task decomposition. Managing the message rate on the master by splitting into larger tasks which may be dynamically decomposed into smaller units.
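Dynamic decomposition can be illustrated with a task that covers a range of work items. The sketch assumes a task is a half-open `(start, stop)` range; that representation and the function name `decompose` are assumptions for illustration only.

```python
def decompose(task, max_size):
    """Split a (start, stop) half-open range into units of at most max_size items."""
    start, stop = task
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    return [
        (lo, min(lo + max_size, stop))
        for lo in range(start, stop, max_size)
    ]
```

The master could hand out one large range per message to keep its message rate low, and a task manager further down could call `decompose` on it only when smaller units are actually needed.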
Secure connections and GSI/SSL integration
omniORB supports it and has been compiled already with SSL flags. How to configure it?
Persistent checkpointing of the state of the master
To be able to resume an interrupted master.
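A sketch of one possible checkpointing scheme: the state is written to a temporary file and then renamed over the previous checkpoint, so an interruption in the middle of a write never leaves a truncated file behind. The JSON layout of the state and the file name are assumptions; the real master state would need its own serialization.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write state as JSON to a temp file, then atomically replace the checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic when both paths are on one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Read back the state saved by save_checkpoint."""
    with open(path) as handle:
        return json.load(handle)


# Illustrative round trip in a scratch directory.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "master.ckpt")
save_checkpoint({"done": [1, 2], "pending": [3]}, checkpoint_path)
restored = load_checkpoint(checkpoint_path)
```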
Application and job files
The BOOT mechanism should be better defined. Policy methods may be defined in applications (as objects) and parameterized from the job file. The application adapters (tarballs) should be shipped automatically to the worker nodes as part of worker initialization (via CORBA and not the input sandbox). The tarballs should be refreshed automatically whenever any file changes within the application adapter package.
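Detecting that a tarball needs to be refreshed could be done by comparing a digest of the adapter package directory with the digest recorded when the tarball was last built. The sketch below shows only the digest part; `package_digest` is a hypothetical helper and the file name is taken from the crashtest example purely for illustration.

```python
import hashlib
import os
import tempfile


def package_digest(root):
    """Return a SHA-256 digest over the relative paths and contents of all files."""
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            digest.update(os.path.relpath(full, root).encode("utf-8"))
            digest.update(b"\0")
            with open(full, "rb") as handle:
                digest.update(handle.read())
            digest.update(b"\0")
    return digest.hexdigest()


# Illustrative check: the digest changes only when a file changes.
package_dir = tempfile.mkdtemp()
source_file = os.path.join(package_dir, "crashtest.py")
with open(source_file, "w") as handle:
    handle.write("print('v1')\n")
before = package_digest(package_dir)
unchanged = package_digest(package_dir)
with open(source_file, "w") as handle:
    handle.write("print('v2')\n")
after = package_digest(package_dir)
```

A content digest is used here rather than modification times because timestamps are not reliable across checkouts or shared filesystems.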
Other (also "crazy")
- secure access integration A
- MacOS A
- xterm for local master A
- output sandbox in GangaDIANE (define) A
- getting the stdout and stderr of the worker while the job is running (peek functionality via ping method) K
- Massimo: Old question: what is the main difference between a master and a worker? At present this distinction prevents the creation of hierarchies of masters... My understanding is that the main limitation of the master comes from the rate (Hz) it can sustain (which is not really changed by having a multi-tier structure).
- Massimo: at the EGEE UF I saw a presentation (Biomed in Italy, main author G. Donvito) where they (if I remember correctly) separated the control layer (e.g. telling the next worker something like "do task #5") from actually passing the ingredients needed to do the task. The master implements the control layer only (e.g. walking the list of tasks and assigning them to workers), and the ingredients are fetched by the workers themselves, since they are entries in a (separate) database. The advantages might be:
- higher performance, assuming several DBs could hold different sections of the task space
- more reliability (if a DB does not respond, try a different one)
- a more sophisticated model: the worker can get a bunch of tasks in one go, perform some, fail others and "refuse" others... think of a system where it is unclear where the input data are...
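The "if a DB does not respond, try a different one" idea can be sketched on the worker side as below. Each data store is modelled as a plain callable that takes a task id; `fetch_input`, the store callables and the sample payload are hypothetical and do not describe the system from the presentation.

```python
def fetch_input(task_id, stores):
    """Try each store in turn and return the first payload that is found."""
    failures = 0
    for store in stores:
        try:
            return store(task_id)
        except (KeyError, ConnectionError):
            failures += 1  # this store is down or lacks the entry: try the next
    raise LookupError(
        "no store could provide task %r (%d tried)" % (task_id, failures)
    )


def broken_store(task_id):
    """Stand-in for a database that does not respond."""
    raise ConnectionError("database not responding")


replica = {5: "input for task 5"}  # stand-in for a second, working database
result = fetch_input(5, [broken_store, replica.__getitem__])
```

The master only ever sends the task id ("do task #5"); everything bulky travels between the worker and the stores.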
--
JakubMoscicki - 16 Aug 2006