Mandatory issues to resolve before the 2.0 release can be made
Documentation TODO
- generate web help for config options and scheduler policies
- DOC: MARSHAL PastMessage exception -> increase the omniORB buffer size...
- DOC: task scheduler callbacks must be NON-BLOCKING, otherwise TRANSIENT exceptions are seen on the worker agents
- DOC: explain the threading model for task scheduler and manager and the initialization sequence
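One way to keep a scheduler callback non-blocking, as required in the list above, is to have the callback only enqueue the completed tasks and return, leaving any slow processing to a separate thread. The sketch below is a minimal illustration in modern Python, not DIANE code: the class name NonBlockingScheduler, the slow_processing hook, the stop() method and the exact signature of tasks_completed() are all assumptions made for the example.

```python
import queue
import threading


class NonBlockingScheduler:
    """Hypothetical scheduler whose callback hands work to a background thread."""

    def __init__(self, slow_processing):
        self._pending = queue.Queue()
        self._slow_processing = slow_processing  # assumed user-supplied hook
        self.processed = []
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def tasks_completed(self, tasks):
        # Assumed to be called from the thread that serves worker requests:
        # it only enqueues the tasks and returns immediately.
        for task in tasks:
            self._pending.put(task)

    def _drain(self):
        # Background thread: the potentially slow work happens here.
        while True:
            task = self._pending.get()
            if task is None:  # sentinel placed by stop()
                break
            self.processed.append(self._slow_processing(task))

    def stop(self):
        # Ask the background thread to finish queued work and exit.
        self._pending.put(None)
        self._thread.join()
```

With this layout the callback returns as soon as the tasks are queued, so the caller is never held up by the processing itself; stop() is only needed for an orderly shutdown.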
TODO
- solve problem of application environment on the worker node (worker servlet)
- clarify command line syntax of diane-run etc.
- clarify passing around configuration parameters when run is initialized on the worker node
- clarify application boot (and shipping application modules) when run is initialized on the worker node
- ship application modules as a tarball via the internal FTS (so that an application module can be fixed "on-the-fly" if necessary without restarting the master, cf. the lqcd AMD fix)
- diane workspace directories handling
- clarify the strategy for handling exceptions raised in the scheduler callbacks or threads (an exception in the scheduler thread will just leave the master hanging...)
- BUG: worker registration before the task scheduler is fully initialized compromises the consistency of the scheduler
- saving user defined parameters in WorkerEntry
- define the default user struct and a way to override it
- MISSING FEATURE: how to map the task to the last worker in task_scheduler.tasks_completed() callback?
- MISSING FEATURE: checkpoint persistently the state of the scheduler and workers to allow master restart...
- TEST: memory leak in the core or in the LatticeQCD extension?
- BUG: ganga submitters: the default value of Executable.args is ['Hello World'] and the submitter appends to this default
- introduce worker_registry.initialized_workers() in order to avoid the following race condition:
2008-04-25 10:56:41,438 INFO: snapshot tasks (3)
2008-04-25 10:56:41,439 INFO: /storage/lqcd/apps/output/dat/snap_0000_5.1830_12345
2008-04-25 10:56:41,439 INFO: /storage/lqcd/apps/output/dat/snap_0000_5.1835_12352
2008-04-25 10:56:41,439 INFO: /storage/lqcd/apps/output/dat/snap_0000_5.1840_12359
2008-04-25 10:57:30,057 INFO: registering worker
2008-04-25 10:57:30,058 INFO: new worker registered: wid=1, worker_uuid=1274329418.31
Exception in thread diane.BaseThread.LQCDTaskScheduler:
Traceback (most recent call last):
  File "/usr/lib/python2.3/threading.py", line 436, in __bootstrap
    self.run()
  File "/storage/lqcd/apps/LatticeQCD2/__init__.py", line 193, in run
    alive_workers = [w for w in self.job_master.worker_registry.alive_workers.values() if not w.snapshot_task is None]
AttributeError: WorkerEntry instance has no attribute 'snapshot_task'
- BUG: an overwrite exception raised by the built-in file transfer client on upload in do_work() causes this problem on the master:
omniORB: Caught an unexpected Python exception during up-call.
Traceback (most recent call last):
  File "/home/moscicki/diane/install/2.0-beta11/python/diane/RunMaster.py", line 315, in put_task_result
    task_result = streamer.loads(task_result)
  File "/home/moscicki/diane/install/2.0-beta11/python/diane/streamer.py", line 9, in loads
    return cPickle.loads(zlib.decompress(dxp_repr))
TypeError: ('__init__() takes exactly 2 arguments (1 given)', <class 'DIANE_CORBA.XFileTransferError'>, ())
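The TypeError in the traceback above is consistent with an exception class whose __init__ requires an argument that is not forwarded to the base class: the pickled form then carries an empty argument tuple and the object cannot be rebuilt on the receiving side. The sketch below reproduces the effect in modern Python with the standard pickle module. BrokenTransferError and PicklableTransferError are hypothetical stand-ins, not the real DIANE_CORBA.XFileTransferError, and whether the same fix applies to the CORBA-generated class is an assumption.

```python
import pickle


class BrokenTransferError(Exception):
    """Hypothetical stand-in: the required argument is not forwarded."""

    def __init__(self, message):
        Exception.__init__(self)  # self.args stays empty
        self.message = message


class PicklableTransferError(Exception):
    """Hypothetical fix: forward the argument so it is stored in self.args."""

    def __init__(self, message):
        Exception.__init__(self, message)  # self.args == (message,)
        self.message = message


def roundtrip(exc):
    # Serialize and rebuild the exception, as a result streamer would.
    return pickle.loads(pickle.dumps(exc))


def survives_roundtrip(exc):
    # True if the exception can be rebuilt from its pickled form.
    try:
        roundtrip(exc)
    except TypeError:
        return False
    return True
```

In this sketch the broken class pickles without complaint and only fails on loading, which matches a failure appearing on the master rather than on the worker that raised the exception.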
DONE
- create combined ganga/diane package (installer)
- omniORB configuration (GSI...)
- download and install script
- setting diane environment (get rid of env.sh script?) and external packages (omniorb => lib/pythonX.Y problem)
- compile externals on slc4 (lxplus)
--
JakubMoscicki - 09 Jan 2008