This document has been enhanced after the discussion between Alvin and Kuba at CERN on 13.09.2006.

The issue

Ganga 4.2 currently does not shut down the monitoring service cleanly. This sometimes leaves repository locks behind, leads to attempts to download job output without write permission in the AFS area, etc. A clean shutdown of services is needed in two situations:

  • the Ganga process is terminated (e.g. user types ^D in the text shell)
  • credentials expire (such as AFS token or Grid proxy)

Ganga defines an atexit handler which makes sure that the repository is flushed, but it makes no attempt to shut down the monitoring loop. A commit operation in progress in the monitoring loop may therefore be interrupted by an abrupt Ganga shutdown. Any ongoing output downloads or job status queries may also be aborted, leaving Ganga in an inconsistent state.

Proxy checking is also not optimal: the LCG handler performs proxy-init when the LCG module is loaded. The proxy created as a side effect is then used by the remote repository (in authenticated mode). The current bootstrap procedure is as follows:

  • load system plugins (e.g. LCG -> ask for passphrase to create grid-proxy)
  • load custom (extension) plugins (e.g. GangaLHCb)
  • initialize repository
  • start monitoring (if enabled in the config file)
  • start user interface (e.g. IPython, GUI, ...)

Proposal

Internal Services

We introduce the concept of an InternalService, which explicitly supports correct runtime behaviour at bootstrap, shutdown and credential expiry.

We define an InternalService as a representation of a runtime entity which may be started and stopped, and which may depend on the validity of credentials.

Examples of the internal services include:

  • repository
    • Local repository on AFS depends on AFS token
    • Remote repository in authenticated mode depends on grid-proxy (or voms-proxy in the future)
  • workspace
    • Workspace on AFS depends on AFS token
  • backends
    • LCG backend depends on grid-proxy (if middleware is EDG) or voms-proxy (if middleware is gLite)
  • monitoring subsystem
    • monitoring threads depend on the repository, workspace and backends
      • if either the repository or the workspace is shut down, the whole monitoring is disabled
      • if a certain backend is shut down, monitoring of that specific backend is disabled
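The dependency rule for the monitoring subsystem can be sketched as a small function. This is only an illustration: the function name and the boolean/dictionary inputs are assumptions, not part of Ganga, and the rule is read as "monitoring runs only while both the repository and the workspace are available".

```python
def active_backend_monitors(repository_ok, workspace_ok, backends_ok):
    """Return the names of backends whose monitoring may run.

    repository_ok, workspace_ok: bool, whether each service is available.
    backends_ok: dict mapping backend name -> bool (hypothetical input shape).
    """
    # Without the repository or the workspace, all monitoring is disabled.
    if not (repository_ok and workspace_ok):
        return []
    # Otherwise only the unavailable backends are excluded.
    return sorted(name for name, ok in backends_ok.items() if ok)
```

For example, with an expired grid-proxy but a valid AFS token, only the LCG monitoring would be dropped while other backends continue to be monitored.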

The InternalService supports the following public methods:

  • enabled(): return True if the service is enabled (for example, if the required credentials are valid); this does not mean that the service is started
  • start(): start the service; has no effect if the service is already running or if not enabled()
  • stop(timeout=None): stop the service; return False if the timeout is exceeded; if timeout is None, block until the operation has terminated and return True

It is legal to start and stop the service multiple times.
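A minimal sketch of how this interface might look in Python is given below. Only enabled(), start() and stop(timeout=None) come from the proposal; the constructor arguments, is_running() and the _do_start()/_do_stop() hooks are assumptions made for illustration.

```python
import threading


class InternalService:
    """Hypothetical base class for the three public methods of the proposal."""

    def __init__(self, name, credential_ok=lambda: True):
        self.name = name
        self._credential_ok = credential_ok  # assumed callable returning bool
        self._running = False
        self._lock = threading.Lock()

    def enabled(self):
        # "Enabled" means allowed to run, not currently running.
        return bool(self._credential_ok())

    def is_running(self):
        return self._running

    def start(self):
        # No effect if already running or if not enabled().
        with self._lock:
            if self._running or not self.enabled():
                return
            self._do_start()
            self._running = True

    def stop(self, timeout=None):
        # True once stopped; False if the timeout was exceeded.
        with self._lock:
            if not self._running:
                return True
            finished = self._do_stop(timeout)
            if finished:
                self._running = False
            return finished

    def _do_start(self):
        """Hook for subclasses (e.g. connect to the repository)."""

    def _do_stop(self, timeout):
        """Hook for subclasses; must return False if timeout was exceeded."""
        return True
```

Calling start() or stop() repeatedly is harmless in this sketch, which matches the requirement that services may be started and stopped multiple times.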

Bootstrap of Ganga

The new bootstrap procedure looks as follows:

  • load system plugins (e.g. LCG -> DO NOT ask for passphrase to create grid-proxy)
  • load custom (extension) plugins (e.g. GangaLHCb)
  • initialize repository
    • depending on Remote/Local, AFS/nonAFS, authenticated/nonauthenticated case define enabled() method to check for proxy validity
    • if not enabled() then do not start() the service i.e. do not connect to the repository
  • create workspace internal service
    • define enabled() method depending on AFS/nonAFS case
  • start monitoring (if enabled in the config file)
    • main monitoring loop: if the repository or workspace services are not enabled() then stop() itself, i.e.:
      • do not insert any jobs for updateMonitoringInformation()
      • "broadcast a signal" (by setting an appropriate global flag) that all ongoing updateMonitoringInformation() methods should terminate ASAP (see Threading model and checkpoints)
    • do the selection of not enabled() backends and apply the same stop procedure to them
    • stop repository and workspace (i.e. flush uncommitted repository changes)
  • start user interface (e.g. IPython, GUI, ...)

The main monitoring loop periodically checks the time left on the credentials. If a certain alarm threshold is reached (e.g. 10 minutes before credential expiry), the user is notified that services will stop automatically when the stop threshold is reached (e.g. 5 minutes before expiry). Unless the credentials are renewed, the services are stopped accordingly.
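The two thresholds can be illustrated with a small helper that the monitoring loop could call on each pass. The function name, the return values and the use of seconds are assumptions; the 10 and 5 minute values are the examples given above.

```python
ALARM_THRESHOLD = 10 * 60  # seconds before expiry; example value from the text
STOP_THRESHOLD = 5 * 60    # seconds before expiry; example value from the text


def credential_action(time_left, alarm=ALARM_THRESHOLD, stop=STOP_THRESHOLD):
    """Classify a credential's remaining lifetime (in seconds).

    Returns "stop" if dependent services should be stopped now,
    "warn" if the user should be notified, and "ok" otherwise.
    """
    if time_left <= stop:
        return "stop"
    if time_left <= alarm:
        return "warn"
    return "ok"
```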

For optimization purposes, the enabled() method of a service need not call the Credential.time_left() method directly but may rely on a cached value. The caching may be done by the main monitoring loop.
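One possible shape for such a cache is sketched below. The class name, max_age and the injectable clock are assumptions; only Credential.time_left() is named in this document, and it is assumed here to return seconds.

```python
import time


class CachedTimeLeft:
    """Cache credential.time_left() so that enabled() checks stay cheap."""

    def __init__(self, credential, max_age=60.0, clock=time.monotonic):
        self._credential = credential
        self._max_age = max_age  # seconds before the cache is considered stale
        self._clock = clock
        self._value = None
        self._stamp = None

    def refresh(self):
        # Intended to be called by the main monitoring loop.
        self._value = self._credential.time_left()
        self._stamp = self._clock()
        return self._value

    def time_left(self):
        # Refresh lazily if nothing is cached yet or the cache is stale.
        if self._stamp is None or self._clock() - self._stamp > self._max_age:
            self.refresh()
        # Account for the time elapsed since the cached reading.
        elapsed = self._clock() - self._stamp
        return max(0.0, self._value - elapsed)
```

A service's enabled() could then compare cache.time_left() against the stop threshold without running an external credential query on every call.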

User interactions

Credential expiration

At alarm time the user should get a message like "AfsToken is going to expire in 10 minutes and services which use it will be stopped automatically in 5 minutes. Do AfsToken.renew() to re-enable the services."

By default the monitoring loop should NOT open xterms etc. and ask for a passphrase. The user may use AfsToken.renew() or GridProxy.renew(). As a side effect, renew() should restart all services which depend on the credential.
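The restart side effect of renew() could be arranged roughly as follows. This is a sketch under assumptions: the class name, register() and the do_renew callable are hypothetical, and a dependent service is any object with a start() method.

```python
class RenewableCredential:
    """Hypothetical credential that restarts dependent services on renew()."""

    def __init__(self, name, do_renew):
        self.name = name
        self._do_renew = do_renew  # assumed callable returning True on success
        self._dependents = []

    def register(self, service):
        # service is any object with a start() method, e.g. an InternalService.
        self._dependents.append(service)

    def renew(self):
        if not self._do_renew():
            return False
        # Side effect described in the text: restart every dependent service.
        for service in self._dependents:
            service.start()
        return True
```

Because start() has no effect on a service that is already running, calling renew() before the services were actually stopped is harmless.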

Ganga shutdown

When Ganga is shut down, the ongoing monitoring service must be stopped. This may take some time, as the monitoring loops should finish correctly (e.g. complete the download of output files).

There are 3 ways to proceed:

  • interactive mode - after a certain timeout (e.g. 5 seconds) Ganga should issue a message like "N remaining jobs in the monitoring loop. Aborting the monitoring loop may lead to inconsistent jobs (e.g. partially retrieved output). Do you want to abort the monitoring loop (y/n)?"
  • forced mode - wait for a certain timeout and shut down anyway
  • safe mode - wait until the monitoring has really shut down

These options may be configurable.
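The three modes might be dispatched as in the sketch below. The function name, the mode strings and the ask/remaining parameters are assumptions; stop is assumed to behave like InternalService.stop(timeout), returning False when the timeout is exceeded.

```python
def shutdown_monitoring(stop, mode="interactive", timeout=5.0, remaining=0, ask=input):
    """Stop monitoring according to the configured mode.

    Returns True for a clean stop, False if the monitoring loop was abandoned.
    """
    if mode not in ("interactive", "forced", "safe"):
        raise ValueError("unknown shutdown mode: %r" % (mode,))
    if mode == "safe":
        return stop(None)  # block until monitoring has really stopped
    if stop(timeout):
        return True
    if mode == "forced":
        return False  # timeout exceeded: shut down anyway
    prompt = ("%d remaining jobs in the monitoring loop. Aborting the monitoring "
              "loop may lead to inconsistent jobs (e.g. partially retrieved "
              "output). Do you want to abort the monitoring loop (y/n)? " % remaining)
    while True:
        if ask(prompt).strip().lower().startswith("y"):
            return False  # user chose to abort
        if stop(timeout):  # otherwise wait for another timeout period
            return True
```

In interactive mode this sketch asks again after every further timeout period until the user aborts or the monitoring stops by itself.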

Threading model and checkpoints

Daemonic threads are used for all internal services (e.g. monitoring). The reason is that non-daemonic Python threads cannot be killed, and Ganga should give the user the possibility to force the shutdown. This is how Ganga 4.2 is implemented, and it is OK.

However, the threads are required to cooperate with the Core system in order to react to a shutdown request as soon as possible. This means that threads should frequently check whether a shutdown has been requested and, if so, terminate promptly and cleanly.

There are conceptual checkpoints i.e. places in the code where it is safe to terminate the thread. For example, the backend monitoring has a loop:

for j in jobs: # <-- checkpoint
  do_some_monitoring_of(j)

If a shutdown has been requested, a thread will run until the next checkpoint and then exit. It is not possible to cleanly terminate the thread in between the checkpoints. In the particular case of the backend.updateMonitoringInformation() method, the special iterator over the jobs collection will have the checkpointing mechanism built in transparently.
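A checkpointing iterator of this kind could be written as a generator that tests a shared flag before handing out each job. The names below are hypothetical, and a threading.Event stands in for the "global flag" mentioned earlier.

```python
import threading

# Hypothetical global flag set by the Core system when shutdown is requested.
shutdown_requested = threading.Event()


def checkpointed(jobs, stop_event=shutdown_requested):
    """Yield jobs one by one, ending early once stop_event is set.

    Each iteration boundary is a checkpoint: the job being processed
    is always completed before the loop exits.
    """
    for job in jobs:
        if stop_event.is_set():
            return  # checkpoint reached after a shutdown request
        yield job
```

The monitoring code can then keep the plain form "for j in checkpointed(jobs): do_some_monitoring_of(j)" without testing the flag itself.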

Is a job repository an Internal Service?

Yes. The job repository however does not have separate threads of control:

  • stop() flushes the uncommitted changes;
  • start() is a no-op unless the repository was not initially connected.

-- JakubMoscicki - 27 Jun 2006

Topic revision: r7 - 2007-04-19 - AdrianMuraru
 