Mandatory issues to resolve before the 2.0 release can be made

Documentation TODO

  • generate web help for config options and scheduler policies
  • DOC: on a MARSHAL PastMessage exception, increase the omniORB buffer size...
  • DOC: task scheduler callbacks must be NON-BLOCKING, otherwise TRANSIENT exceptions are seen on the worker agents
  • DOC: explain the threading model for task scheduler and manager and the initialization sequence
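
A minimal sketch of the non-blocking callback rule, written in modern Python. The class name, the queue-based hand-off and the callback signature are assumptions made for illustration, not the DIANE API; only the callback name tasks_completed() comes from this page. The callback just enqueues the completed tasks and returns, while a separate thread does the slow work:

```python
import queue
import threading
import time


class NonBlockingScheduler:
    """Hypothetical scheduler: the callback only enqueues work."""

    def __init__(self):
        self._completed = queue.Queue()
        self.processed = []
        # Slow post-processing runs in its own thread, not in the callback.
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def tasks_completed(self, tasks):
        # Called from the master's request-handling thread: return quickly.
        for task in tasks:
            self._completed.put(task)

    def _drain(self):
        while True:
            task = self._completed.get()
            if task is None:  # sentinel pushed by stop()
                break
            time.sleep(0.05)  # stands in for slow bookkeeping or I/O
            self.processed.append(task)

    def stop(self):
        self._completed.put(None)
        self._thread.join()
```

With this layout the time spent inside the callback does not depend on how long the post-processing takes, so worker requests are not held up while the scheduler is busy.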

TODO

  • solve problem of application environment on the worker node (worker servlet)
  • clarify command line syntax of diane-run etc.
  • clarify passing around configuration parameters when run is initialized on the worker node
  • clarify application boot (and shipping application modules) when run is initialized on the worker node
    • ship application modules as a tarball via the internal FTS, so that an application module can be fixed "on-the-fly" if necessary without restarting the master (cf. the lqcd AMD fix)
  • handling of diane workspace directories
  • clarify the strategy for handling exceptions raised in the scheduler callbacks or threads (an exception in the scheduler thread just leaves the master hanging...)
  • BUG: worker registration before the task scheduler is fully initialized compromises the consistency of the scheduler
  • saving user-defined parameters in WorkerEntry
    • define the default user struct and a way to override it
  • MISSING FEATURE: how to map a task to its last worker in the task_scheduler.tasks_completed() callback?
  • MISSING FEATURE: persistently checkpoint the state of the scheduler and workers to allow a master restart...
  • TEST: memory leak in the core or in the LatticeQCD extension?
  • BUG: ganga submitters: the default value of Executable.args is ['Hello World'] and the submitter appends to this argument
  • introduce worker_registry.initialized_workers() in order to avoid the following race condition:
2008-04-25 10:56:41,438 INFO: snapshot tasks (3)
2008-04-25 10:56:41,439 INFO: /storage/lqcd/apps/output/dat/snap_0000_5.1830_12345
2008-04-25 10:56:41,439 INFO: /storage/lqcd/apps/output/dat/snap_0000_5.1835_12352
2008-04-25 10:56:41,439 INFO: /storage/lqcd/apps/output/dat/snap_0000_5.1840_12359
2008-04-25 10:57:30,057 INFO: registering worker
2008-04-25 10:57:30,058 INFO: new worker registered: wid=1, worker_uuid=1274329418.31
Exception in thread diane.BaseThread.LQCDTaskScheduler:
Traceback (most recent call last):
  File "/usr/lib/python2.3/threading.py", line 436, in __bootstrap
    self.run()
  File "/storage/lqcd/apps/LatticeQCD2/__init__.py", line 193, in run
    alive_workers = [w for w in self.job_master.worker_registry.alive_workers.values() if not w.snapshot_task is None]
AttributeError: WorkerEntry instance has no attribute 'snapshot_task'
  • BUG: an overwrite exception raised by the built-in file transfer client on upload in do_work() causes this problem on the master:
omniORB: Caught an unexpected Python exception during up-call.
Traceback (most recent call last):
  File "/home/moscicki/diane/install/2.0-beta11/python/diane/RunMaster.py", line 315, in put_task_result
    task_result = streamer.loads(task_result)
  File "/home/moscicki/diane/install/2.0-beta11/python/diane/streamer.py", line 9, in loads
    return cPickle.loads(zlib.decompress(dxp_repr))
TypeError: ('__init__() takes exactly 2 arguments (1 given)', <class 'DIANE_CORBA.XFileTransferError'>, ())
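
The worker_registry.initialized_workers() proposal in the list above could look roughly like the sketch below. It is written in modern Python, and everything except the names alive_workers, WorkerEntry, snapshot_task and initialized_workers() is an assumption: the initialized flag, the add() and mark_initialized() helpers and the lock are hypothetical and not taken from the DIANE code. The idea is that the scheduler thread only sees a worker once its application-specific attributes have been attached:

```python
import threading


class WorkerEntry:
    """Hypothetical, simplified worker record."""

    def __init__(self, wid):
        self.wid = wid
        self.initialized = False  # True once application attributes are set


class WorkerRegistry:
    """Hypothetical registry exposing only fully initialized workers."""

    def __init__(self):
        self._lock = threading.Lock()
        self.alive_workers = {}

    def add(self, wid):
        # Registration: the entry exists but is not yet visible to the scheduler.
        with self._lock:
            entry = WorkerEntry(wid)
            self.alive_workers[wid] = entry
            return entry

    def mark_initialized(self, wid, **attrs):
        # Attach application attributes (e.g. snapshot_task), then publish.
        with self._lock:
            entry = self.alive_workers[wid]
            for name, value in attrs.items():
                setattr(entry, name, value)
            entry.initialized = True

    def initialized_workers(self):
        # Snapshot of the workers that are safe for the scheduler to inspect.
        with self._lock:
            return [w for w in self.alive_workers.values() if w.initialized]
```

A scheduler loop that iterates over initialized_workers() instead of alive_workers.values() would not touch a freshly registered entry that has no snapshot_task attribute yet.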

DONE

  • create combined ganga/diane package (installer)
  • omniORB configuration (GSI...)
  • download and install script
    • [90%]
  • setting up the diane environment (get rid of the env.sh script?) and external packages (omniORB => lib/pythonX.Y problem)
    • [done]
  • compile externals on slc4 (lxplus)
    • [done - check needed]

-- JakubMoscicki - 09 Jan 2008

Topic revision: r9 - 2008-07-15 - JakubMoscicki
 