Ganga External Monitoring
This page describes the design and implementation strategy for external monitoring in Ganga via systems such as Monalisa and Dashboard.
Ganga is a client to the external monitoring systems in two places:
- the job on a worker node, via the job wrapper script generated by Ganga
- the Ganga client (interactive or script session)
This feature may be useful in the following ways:
- monitoring of the internal activity of the applications (number of processed events)
- collecting information about usage of Ganga (aka spyware)
- potential crosschecking with other monitoring systems (RGMA, Dirac,...)
- monitoring user actions (for example Dashboard wants to know when the job is submitted)
Implementation (02 Jul 2007)
This is work in progress which sits on this branch in CVS:
Ganga-4-4-0-dev-branch-kuba-monitoring-services
Example of a specific monitoring service implementation
OutputServerMS implements a simple debugging utility that quickly streams stdout and stderr back to the client if the application fails to execute.
Ganga/Lib/MonitoringServices/OutputServerMS
How to use it:
- Run the server
% Ganga/Lib/MonitoringServices/OutputServerMS/ganga_output_server.py [port]
- If the server runs on the same machine as your Ganga session and uses the default port, skip this step. Otherwise specify the server address like this:
% export GANGA_OUTPUTSERVERMS_URL=http://server.host.address:port
- Enable the service in ganga
% ganga
>>> config['MonitoringServices']['Executable'] = 'Ganga.Lib.MonitoringServices.OutputServerMS.OutputServerMS'
>>> Job().submit()
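The address rule in the steps above (use GANGA_OUTPUTSERVERMS_URL if it is set, otherwise fall back to a server on the local machine) could be sketched as follows. This is only an illustration, not the code in OutputServerMS: the function name resolve_output_server and the default URL, in particular port 8000, are assumptions, since this page does not state the default port.

```python
import os
from urllib.parse import urlparse

# Hypothetical default: this page does not say which port the server uses.
DEFAULT_URL = "http://localhost:8000"


def resolve_output_server(environ=None):
    """Return (host, port) of the output server.

    The GANGA_OUTPUTSERVERMS_URL variable wins if present; otherwise the
    (assumed) local default is used.
    """
    environ = os.environ if environ is None else environ
    url = environ.get("GANGA_OUTPUTSERVERMS_URL", DEFAULT_URL)
    parsed = urlparse(url)
    if parsed.scheme != "http" or not parsed.hostname or parsed.port is None:
        raise ValueError("expected http://host:port, got %r" % url)
    return parsed.hostname, parsed.port
```

Passing the environment as an argument keeps the function easy to test without touching the real process environment.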
Archive
Work so far (27 Jul 2006)
- B.G.:
- implemented Dashboard monitoring inside Athena wrapper script
- proposed a monitoring interface class
- use case:
- Athena exit code is meaningless (always 0)
- when Athena terminates it produces a log file which is then analyzed by a special tool which produces an xml file
- the XML file is parsed and the 'proper' exit code is transmitted to Dashboard
- B.K.:
- implemented Gaudi Service which produces xml files with event information while Gaudi runs
- modified Localhost and LSF handler to read the xml files and transmit the information to Monalisa
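The Athena use case above (parse the XML file, then report the 'proper' exit code) could look roughly like the sketch below. The XML layout, the element name exitcode and the function extract_exit_code are assumptions made for illustration; the real format produced by the log analysis tool is not described on this page.

```python
import xml.etree.ElementTree as ET

# Hypothetical summary file content; the real schema is not documented here.
SAMPLE = "<jobsummary><exitcode>3</exitcode></jobsummary>"


def extract_exit_code(xml_text, default=1):
    """Return the exit code stored in the summary, or `default` if unusable."""
    root = ET.fromstring(xml_text)
    node = root.find("exitcode")  # assumed element name, direct child of the root
    if node is None or node.text is None:
        return default
    try:
        return int(node.text.strip())
    except ValueError:
        # A non-numeric value is treated like a missing one.
        return default
```

Falling back to a non-zero default is a design choice of this sketch: if the summary is missing or malformed, the job is reported as failed rather than silently as successful.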
Implementation strategy
- The monitoring interface defines send() methods to forward the information; it also defines the list of files which must be shipped to the worker node (e.g. monalisa modules)
- This interface is used on the client and on the worker node; one monitoring object connects to one monitoring system
- The core framework automatically extracts the additional files and puts them in the _python/GangaWN/Monitoring directory on the worker node
- There is an implementation of the monitoring interface which aggregates monitoring objects and serves as a fan-out to these monitoring objects
- so from the point of view of the framework and wrapper scripts there is only one monitoring object!
- For the moment we decided that all backend scripts must be modified to call the monitoring object's send() methods; however, these lines are completely generic, like this:
import sys
sys.path.append('./_python')  # make the shipped GangaWN package importable
from GangaWN.Monitoring import getMonitoringObject
monit = getMonitoringObject()
monit.send_running(jobid,...)
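The strategy above (one interface, one object per monitoring system, and an aggregating object that fans out to the others) can be illustrated with a minimal sketch. Only send_running appears on this page; the class names, send_submitted, files_for_worker_node and the in-memory RecordingMonitor are hypothetical and are not taken from the attached IMonitoring.py.

```python
class MonitoringInterface:
    """Hypothetical interface: one object connects to one monitoring system."""

    def files_for_worker_node(self):
        # Files that must be shipped to the worker node (none by default).
        return []

    def send_submitted(self, jobid, **info):
        pass

    def send_running(self, jobid, **info):
        pass


class RecordingMonitor(MonitoringInterface):
    """Toy backend that records events in memory instead of contacting a server."""

    def __init__(self, files=()):
        self.files = list(files)
        self.events = []

    def files_for_worker_node(self):
        return list(self.files)

    def send_submitted(self, jobid, **info):
        self.events.append(("submitted", jobid, info))

    def send_running(self, jobid, **info):
        self.events.append(("running", jobid, info))


class CompositeMonitor(MonitoringInterface):
    """Fan-out: forwards each call to every aggregated monitoring object."""

    def __init__(self, monitors):
        self.monitors = list(monitors)

    def files_for_worker_node(self):
        # Union of the files required by all monitors, without duplicates.
        files = []
        for monitor in self.monitors:
            for name in monitor.files_for_worker_node():
                if name not in files:
                    files.append(name)
        return files

    def send_submitted(self, jobid, **info):
        for monitor in self.monitors:
            monitor.send_submitted(jobid, **info)

    def send_running(self, jobid, **info):
        for monitor in self.monitors:
            monitor.send_running(jobid, **info)
```

Because CompositeMonitor implements the same interface as the objects it aggregates, the framework and the wrapper scripts can hold a single monitoring object regardless of how many monitoring systems are configured.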
Actions (in order)
- B.G. sends to this page his monitoring interface
I have attached two files (IMonitoring.py and ARDADashboard.py) to the page. IMonitoring.py is the definition of the methods to be implemented by a monitoring object. Currently, the number of methods is very small:
- one is called by the JobManager when a job is submitted (it may have to be called from another place). For the Dashboard, it needs to know some application info already (such as the dataset and application version), the Grid Job Id and similar details.
- one is called when building the sandbox. The method returns the list of files to be added to the sandbox for the monitoring to work properly. So far, it was called in the Athena (application) LCG handler in order not to fiddle too much with other applications. However, it sounds more consistent to append files to the sandbox in a common place for all applications.
There is no method yet to be called by the job when it runs on the worker node (however, we definitely need one). This is because the Athena/LCG job is a shell script to which I have added a couple of lines that call the monitoring Python script (an executable). This script was originally distributed by the Dashboard developers for CMS and does not use the interface I have defined. In principle, if we stick to a shell script for Athena/LCG, the Python monitoring script should at least be based on the interface so that everything is consistent. I did not do it just because the script exists already.
I believe when the LCG wrapper is less shell script and more python, monitoring will "come for free" thanks to the hooks we will have put in.
- B.K. tries to use it in his wrapper scripts (possible revisions of the interface)
For now I am only going to work on the Localhost handler since LSF is playing up. The job wrapper scripts are very similar for both so this shouldn't be too much of a problem except for testing.
- We have to make sure HC has introduced the job wrapper script on the LCG (otherwise B.G. cannot move it out of the Athena script)
- Somebody will develop the aggregated monitoring object (editor's note: Andrew already volunteered)
- K.M. will make appropriate changes to the core
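B.G.'s remark above, that monitoring will "come for free" once the wrapper is more Python than shell, can be illustrated with a small sketch of a Python wrapper that calls monitoring hooks around the application. Everything here is hypothetical: run_with_monitoring, send_completed, the RecordingMonitor stand-in and DEMO_COMMAND are invented for illustration; only send_running is named on this page.

```python
import subprocess
import sys


class RecordingMonitor:
    """Stand-in monitoring object that records calls in memory."""

    def __init__(self):
        self.events = []

    def send_running(self, jobid, **info):
        self.events.append(("running", jobid, info))

    def send_completed(self, jobid, **info):
        self.events.append(("completed", jobid, info))


def run_with_monitoring(monit, jobid, command):
    """Run `command`, calling the monitoring hooks before and after it."""
    monit.send_running(jobid)
    exit_code = subprocess.call(command)  # blocks until the application ends
    monit.send_completed(jobid, exit_code=exit_code)
    return exit_code


# Demo application: a child Python process that exits with status 3.
DEMO_COMMAND = [sys.executable, "-c", "import sys; sys.exit(3)"]
```

With the hooks placed in the wrapper, the application itself needs no monitoring code, which is the point of moving the calls out of the Athena script.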
-- JakubMoscicki - 27 Jul 2006