Ganga/DIANE Monitoring

The Ganga/DIANE monitoring dashboard runs on port 80 at http://gangamon.cern.ch. The underlying host is currently voatlas90, which is managed through Quattor templates.

Instructions for users

How to configure Ganga to send monitoring messages?

.gangarc:

[MonitoringServices]
Executable/* = Ganga.Lib.MonitoringServices.MSGMS.MSGMS

[MSGMS]
server = ganga.msg.cern.ch
port = 6163

See also: /afs/cern.ch/sw/arda/install/su3/2009/config-lostman.gear
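As a quick sanity check, the settings above can be read back programmatically. The following is a minimal sketch that assumes the relevant part of .gangarc can be parsed by Python's standard configparser (Ganga itself uses its own configuration code); the function name msgms_settings is hypothetical.

```python
import configparser


def msgms_settings(gangarc_text):
    """Return (server, port) if MSG monitoring is configured, else None."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names such as "Executable/*" as written
    parser.read_string(gangarc_text)
    if not parser.has_option("MonitoringServices", "Executable/*"):
        return None
    if not parser.has_section("MSGMS"):
        return None
    return parser.get("MSGMS", "server"), parser.getint("MSGMS", "port")


EXAMPLE = """\
[MonitoringServices]
Executable/* = Ganga.Lib.MonitoringServices.MSGMS.MSGMS

[MSGMS]
server = ganga.msg.cern.ch
port = 6163
"""

print(msgms_settings(EXAMPLE))
```

To check a real configuration, pass the contents of your own .gangarc file instead of EXAMPLE.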

How to configure DIANE to send monitoring messages?

Messages are sent automatically. To disable them, set the following in the run() function of the run file: config.MSGMonitoring.MSG_MONITORING_ENABLED = False
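A minimal sketch of what such a run file might look like. The run(input, config) signature is an assumption about the DIANE run file convention, and the SimpleNamespace object is only a stand-in for the configuration object DIANE would pass in; the single assignment is the only part taken from the text above.

```python
from types import SimpleNamespace


def run(input, config):
    # Assumed run-file entry point: switch off MSG monitoring for this run.
    config.MSGMonitoring.MSG_MONITORING_ENABLED = False


# Hypothetical stand-in for the real DIANE configuration object.
config = SimpleNamespace(
    MSGMonitoring=SimpleNamespace(MSG_MONITORING_ENABLED=True)
)
run(input=None, config=config)
print(config.MSGMonitoring.MSG_MONITORING_ENABLED)
```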

Production service operations

The code is developed in SVN: http://svnweb.cern.ch/world/wsvn/ganga/trunk/external/dashb

The code is deployed in: /data/django/dashboard

MySQL DB access: configuration /data/django/dashboard/server/monitoringsite/settings.py

We use gangage as a service account.

How to deploy a new version of the service?

Make sure the developers have already tagged the code in SVN as dashboard-X-Y.

Log in as gangage@gangamon

For convenience define:

export DASHTAG=dashboard-X-Y

Export the SVN tag:

cd /data/django
svn export svn+ssh://svn.cern.ch/reps/ganga/tags/packages/$DASHTAG

Fix configuration strings and write-protect the release area:

cd /data/django/service
python configure-access.py $DASHTAG
chmod -R a-w /data/django/$DASHTAG

Switch the production code:

cd /data/django
rm -f dashboard && ln -s $DASHTAG dashboard

Restart the service.
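The "switch the production code" step removes the dashboard link before recreating it, so for a brief moment the link does not exist. If that window matters, one alternative (not what the procedure above does) is to create the new link under a temporary name and rename it over the old one. A minimal sketch, assuming a POSIX filesystem; switch_symlink is a hypothetical helper name.

```python
import os


def switch_symlink(link_path, target):
    """Point link_path at target, replacing any existing link in one rename."""
    tmp_path = link_path + ".new"
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)  # clear a leftover from an interrupted run
    os.symlink(target, tmp_path)
    # rename(2) replaces the old link atomically on POSIX systems
    os.replace(tmp_path, link_path)


# Example (run from a shell as the service account):
#   switch_symlink("/data/django/dashboard", "dashboard-X-Y")
```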

How to restart the service?

Log in as gangage@gangamon

The collector is restarted automatically via a crontab which is defined in Quattor.

The file /data/django/service/restart_gm_service_if_needed is executed every minute, so we're able to manage our own pseudo-cronjobs by tagging them onto the end of this file. In this manner, we have defined the following tasks:

  • Every minute: update the list of users who've sent jobs via Ganga, otherwise they won't be able to find their jobs in the web interface.
  • Every minute: check the number of ActiveMQ connections open to the messaging brokers. If this is inconsistent with the number of brokers behind the alias dashb-mb.cern.ch, restart the GangaMon service and send a mail to Ivan.
  • Every 5 minutes: check to see if the GangaMon service is running; if not, restart and notify Ivan.
  • Every hour at 2 minutes past: restart the service for good measure.
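The pseudo-cron idea, one wrapper invoked every minute that decides which tasks are due, can be sketched as follows. This is an illustration only: the real logic lives in the shell script named above, and the task names used here are hypothetical labels.

```python
import datetime


def due_tasks(now):
    """Return the tasks due when the wrapper script fires at time `now`."""
    # These two run on every invocation, i.e. every minute.
    tasks = ["update_user_list", "check_broker_connections"]
    if now.minute % 5 == 0:
        tasks.append("check_service_running")  # every 5 minutes
    if now.minute == 2:
        tasks.append("restart_service")  # hourly, at 2 minutes past
    return tasks


print(due_tasks(datetime.datetime(2010, 12, 9, 12, 2)))
```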

All service related commands are in /data/django/service.

To disable the collector for maintenance (e.g. a code upgrade), do this:

cd /data/django/service
./switch_to_maintanance
......
./switch_to_production
./status_gm_service

Manual check if the collector is running:

ps aux | grep runcollector

Restart apache (as root):

sudo /etc/init.d/httpd restart

Automatic check if the service is available (and alarm)....

Not implemented yet.
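Should such a check be added, a minimal sketch could look like the following. It only tests that a URL answers with HTTP status 200; the function name, the timeout and the choice of URL are assumptions, and raising an alarm (e.g. sending a mail) is left out.

```python
import urllib.request


def service_available(url, timeout=10):
    """Return True if `url` answers with HTTP status 200 within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers URLError/HTTPError, refused connections and timeouts.
        return False


# Example: service_available("http://gangamon.cern.ch")
```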

Creation of 'user name' lists

Both Gangamon and Dianemon present an interactive list of users as a starting point for interaction with the interface. These lists are generated from data held in the gangamon_users database table, which flags a user as having used either Ganga, Diane, or both.

To populate this table we need to run over all users in the Ganga and Diane tables. This is currently done once every 5 minutes, which seems to be a good compromise between how quickly a user might expect their jobs to appear on the display and how intensive the query to extract user names is (the operation takes around 12 seconds to complete).

Rather than request a new cronjob for the host, we tagged a command onto the script that gets called every minute to check the Gangamon services are running. The script lives at

/data/django/service/update_users.sh

and is called from within

/data/django/service/restart_gm_service_if_needed

Note that although the update_users.sh script is called every minute, it has an internal check that only permits the database query/update to run once every 5 minutes.
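How update_users.sh implements its internal check is not described here. One common way to get this behaviour is a timestamp file, sketched below; the function name and the stamp-file approach are assumptions, not a description of the actual script.

```python
import os
import time


def should_run(stamp_path, interval_seconds=300, now=None):
    """Return True, and refresh the stamp file, if the last run is old enough."""
    now = time.time() if now is None else now
    try:
        last_run = os.path.getmtime(stamp_path)
    except OSError:
        last_run = None  # no stamp file yet, so this is the first run
    if last_run is not None and now - last_run < interval_seconds:
        return False
    with open(stamp_path, "a"):
        pass  # create the stamp file if it does not exist
    os.utime(stamp_path, (now, now))  # record the time of this run
    return True
```

A script called every minute would perform the expensive query only when should_run returns True, which with the default interval is at most once every 5 minutes.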

Backup

Historically (voatlas65) backups went to: voatlas30:/data/gangamon/gangamon

We now use TSM to back up the /data/backups/data directory of voatlas90. The client to interact with the backup server is dsmc.

To query the backup status of a file, call dsmc query backup "/data/backups/data/*" from voatlas90's command line:

[0]=> dsmc query backup "/data/backups/data/*"
IBM Tivoli Storage Manager
Command Line Backup/Archive Client Interface
  Client Version 5, Release 5, Level 1.0  
  Client date/time: 12/09/2010 12:19:30
(c) Copyright by IBM Corporation and other(s) 1990, 2008. All Rights Reserved.

Node Name: VOATLAS90
Session established with server TSM66: Linux/x86_64
  Server Version 5, Release 5, Level 4.1
  Server date/time: 12/09/2010 12:19:30  Last access: 12/09/2010 12:18:51

           Size        Backup Date                Mgmt Class           A/I File
           ----        -----------                ----------           --- ----
             0  B  12/03/2010 12:49:13             DEFAULT              A  /data/backups/data/backup.log
    81,150,805  B  12/09/2010 03:13:52             DEFAULT              A  /data/backups/data/django.tgz
    89,699,745  B  12/09/2010 03:13:55             DEFAULT              A  /data/backups/data/gangamon.mysql.backup.gz
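For scripted checks (for example, to confirm that last night's backup exists), the file rows of this listing can be parsed. A minimal sketch, assuming the column layout shown above and that dates are printed as month/day/year; both are assumptions about this particular client configuration.

```python
import re
from datetime import datetime

# One file row: size, unit "B", date and time, management class, A/I flag, path.
ROW = re.compile(
    r"^\s*([\d,]+)\s+B\s+(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})"
    r"\s+(\S+)\s+([AI])\s+(\S+)\s*$"
)


def parse_backup_listing(text, date_format="%m/%d/%Y %H:%M:%S"):
    """Yield (path, size_in_bytes, backup_time) for each file row."""
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            size, stamp, _mgmt_class, _state, path = match.groups()
            yield path, int(size.replace(",", "")), datetime.strptime(stamp, date_format)


SAMPLE = """\
           Size        Backup Date                Mgmt Class           A/I File
           ----        -----------                ----------           --- ----
             0  B  12/03/2010 12:49:13             DEFAULT              A  /data/backups/data/backup.log
    81,150,805  B  12/09/2010 03:13:52             DEFAULT              A  /data/backups/data/django.tgz
"""

for path, size, when in parse_backup_listing(SAMPLE):
    print(path, size, when.isoformat())
```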

Restoring backups

The dsmc restore command is used to pull backups from the tape storage.

Some examples:

  • Restore a particular file to a given (new) location: dsmc restore /data/backups/data/django.tgz /tmp/
  • Restore all files in the backup to their default locations: dsmc restore "/data/backups/data/*"
  • Restore a file to the state it was in as of 1:00 PM on August 17th 2010: dsmc restore -pitd=8/17/2010 -pitt=13:00:00 /data/backups/data/django.tgz

So, for example, the gangamon.mysql.backup.gz file was pulled from tape in the following way:

#> dsmc restore /data/backups/data/gangamon.mysql.backup.gz /tmp/mkenyon/
IBM Tivoli Storage Manager
Command Line Backup/Archive Client Interface
  Client Version 5, Release 5, Level 1.0  
  Client date/time: 12/09/2010 14:30:48
(c) Copyright by IBM Corporation and other(s) 1990, 2008. All Rights Reserved.

Node Name: VOATLAS90
Session established with server TSM66: Linux/x86_64
  Server Version 5, Release 5, Level 4.1
  Server date/time: 12/09/2010 14:30:48  Last access: 12/09/2010 14:30:01

Restore function invoked.

Restoring      89,699,745 /data/backups/data/gangamon.mysql.backup.gz --> /tmp/mkenyon/gangamon.mysql.backup.gz [Done]      

Restore processing finished.
                                  
Total number of objects restored:         1
Total number of objects failed:           0
Total number of bytes transferred:   85.55 MB
Data transfer time:                    1.32 sec
Network data transfer rate:        66,309.05 KB/sec
Aggregate data transfer rate:      28,200.17 KB/sec
Elapsed processing time:           00:00:03

A quick checksum confirms that the file recovered from the backup is identical to the latest tarball on voatlas90:

#> md5sum /data/backups/data/gangamon.mysql.backup.gz /tmp/mkenyon/gangamon.mysql.backup.gz
6368376e8eac07a11cffd8fc6010ad7d  /data/backups/data/gangamon.mysql.backup.gz
6368376e8eac07a11cffd8fc6010ad7d  /tmp/mkenyon/gangamon.mysql.backup.gz
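The same comparison can be scripted, for instance as part of a periodic restore test. A minimal sketch using Python's hashlib; the function names are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """Return True if both files have the same MD5 digest."""
    return md5_of_file(path_a) == md5_of_file(path_b)


# Example:
#   same_content("/data/backups/data/gangamon.mysql.backup.gz",
#                "/tmp/mkenyon/gangamon.mysql.backup.gz")
```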

Disk space management - database logs

You may need to remove db logs from time to time:

mysql> PURGE BINARY LOGS BEFORE '2010-10-1 22:46:26';
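If the purge is done regularly, the cutoff timestamp can be computed rather than typed by hand. A minimal sketch that only builds the statement text; the 7-day retention period is an arbitrary assumption, not a documented policy.

```python
from datetime import datetime, timedelta


def purge_statement(now, keep_days=7):
    """Build a PURGE BINARY LOGS statement that keeps the last keep_days days."""
    cutoff = now - timedelta(days=keep_days)
    return "PURGE BINARY LOGS BEFORE '%s';" % cutoff.strftime("%Y-%m-%d %H:%M:%S")


print(purge_statement(datetime(2010, 10, 8, 22, 46, 26)))
```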

Production server configuration

The software is managed by quattor (list of exceptions below).

Lemon monitoring page: http://lemonweb.cern.ch/lemon-status/info.php?entity=voatlas90&type=host

Relevant quattor templates:

Installation notes for voatlas65 (local changes done by hand)

Local changes to voatlas65:

  • added /etc/httpd/conf.d/dashboard.conf
  • easy_installed the simplejson package in /usr/lib/python2.4/site-packages
  • created symlink: ln -s /usr/lib/python2.4/site-packages/django /usr/lib64/python2.5/site-packages/django
  • created symlink: ln -s /usr/lib/python2.4/site-packages/yaml /usr/lib64/python2.5/site-packages/yaml
  • disabled mod_python by renaming its configuration file to /etc/httpd/conf.d/python.conf__
  • gangausage app: added pygooglechart in /data/django/external, it is referred to by /data/django/dashboard/server/django.wsgi file

Migration of services from voatlas65 to voatlas90

In addition to the above (including the voatlas65-specific notes), some directories that aren't in SVN were rsynced across from voatlas65 to voatlas90:

  • voatlas65:/data/django/service/*
  • voatlas65:/data/django/external (which contains the pygooglechart package)

Note that /data/django/TRUNK/server/monitoringsite/settings.py had to be modified to include the correct MySQL password and SECRET_KEY, as these aren't stored in SVN.

The dashboard tools were checked out of the SVN Trunk:

cd /data/django
export SVNURL=svn+ssh://gangage@svn.cern.ch/reps/ganga
svn co $SVNURL/trunk/external/dashb ./TRUNK
ln -s TRUNK dashboard

apache configuration

# cat /etc/httpd/conf.d/dashboard.conf

WSGIScriptAlias /django /data/django/dashboard/server/django.wsgi

Alias /django_media /data/django/dashboard/server/monitoringsite/media

Alias /diane /data/django/dashboard/dianetaskmonitor/client
Alias /ganga /data/django/dashboard/gangataskmonitor/client

WSGIScriptAlias / /data/django/dashboard/server/django.wsgi

MSG Information

Destinations consumed by runcollector

Currently there is no protection for the production queues in the MSG server. NEVER run the collector in production mode outside the scope of the production service.

See runcollector.py source for up-to-date destinations.

As of April 2013:

  • Server: ganga.msg.cern.ch, which is actually a DNS alias in front of three Apollo messaging servers.
  • Port: 6163
  • Ganga destinations: /queue/ganga.status, /topic/ganga.status
  • Diane destinations: /queue/diane.journal, /queue/diane.status

TODO:

  • Restrict read-access of queues to official runcollector.

Web access to the message queues

You need a valid Grid certificate to access these pages (tested with Firefox).

Production server: https://gridmsg101.cern.ch/admin/queues.jsp

Development server: https://gridmsg001.cern.ch/admin/queues.jsp

Creating development environment

It is recommended that the development environment and production server are as close as possible, to avoid compatibility issues when moving the code into production.

Whenever possible, use the same versions as those installed on the gangamon server. Check the installed packages in production.

What do we need?

  • Apache2 webserver (with mod_wsgi installed)
  • MySQL database
  • Python 2.5+
  • SVN
  • Django

Ubuntu

Install Environment

Install apache, mysql, python, svn:

sudo apt-get install apache2 mysql-server python subversion

Install mod_wsgi for apache:

sudo apt-get install libapache2-mod-wsgi

Install MySQL support for Python:

sudo apt-get install python-mysqldb

Install Django:

sudo apt-get install python-django python-django-doc

If you want the latest development version, see: http://docs.djangoproject.com/en/1.1/intro/install

Setup MySQL database

Create the MySQL database (gangamon) and a user with the proper privileges (replace myuser and mypasswd with real values):

> mysql -u root
CREATE DATABASE gangamon CHARACTER SET utf8;
CREATE USER 'myuser'@'localhost' IDENTIFIED BY 'mypasswd';
GRANT ALL PRIVILEGES ON gangamon.* TO 'myuser'@'localhost';


Install Ganga/Diane Dashboard application

In production the application is installed in /data/django. We assume the same location for the development environment because you need root access anyway to modify the apache configuration files when setting up the application. It also makes it simpler to manage the transition between development and production environments.

Create working copy of Ganga/Diane Dashboard application:

mkdir -p /data/django
cd /data/django
svn co svn+ssh://svn.cern.ch/reps/ganga/trunk/external/dashb dashboard

Update apache configuration:

sudo cp /data/django/dashboard/server/dashboard.conf /etc/apache2/conf.d
sudo service apache2 restart

Note: on Scientific Linux the apache conf path is: /etc/httpd/conf.d

Update settings.py file with DB connection information:

cd /data/django/dashboard/server/monitoringsite
cp settings.py.TEMPLATE settings.py
# Edit settings.py and update DATABASE_USER, DATABASE_PASSWORD and SECRET_KEY

NEVER commit settings.py file to SVN (it contains sensitive information)
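If you need a fresh value for SECRET_KEY, any long random string will do. A minimal sketch that generates one from the operating system's entropy source; the length of 50 and the character set are assumptions chosen to resemble the keys Django generates for new projects.

```python
import random
import string


def generate_secret_key(length=50):
    """Return a random string suitable for pasting into settings.py."""
    alphabet = string.ascii_letters + string.digits + "!@#$%^&*(-_=+)"
    rng = random.SystemRandom()  # draws from the operating system's entropy source
    return "".join(rng.choice(alphabet) for _ in range(length))


print(generate_secret_key())
```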

Rename file settings.js-example (located in client/media/scripts/) to settings.js

Initialize django databases:

cd /data/django/dashboard/server/monitoringsite
python manage.py syncdb

You should get this output:

Creating table auth_permission
[...]

You just installed Django's auth system, which means you don't have any superusers defined.
Would you like to create one now? (yes/no): yes
Username (Leave blank to use 'myuser'): 
E-mail address: myuser@some.mail
Password: 
Password (again): 
Superuser created successfully.
Installing index for auth.Permission model
[...]

That's it! Try http://localhost/monitoring

If you want to test with a web browser running somewhere other than localhost, one more fix is needed: the JavaScript web application must be told at which URL Django is serving data. Edit the /data/django/dashboard/client/media/scripts/settings.js file and replace localhost with your fully qualified server name.

You may also run a collector in test mode (using test queues).

Development of Task Monitoring applications

Code base for taskmonitor ("TRUNK") is developed here: http://svnweb.cern.ch/world/wsvn/ganga/trunk/external/dashb/taskmonitor

DIANE task monitoring is a branch of taskmonitor and it is kept here: http://svnweb.cern.ch/world/wsvn/ganga/trunk/external/dashb/dianetaskmonitor

Keep the branch up to date with the trunk by merging: svn merge svn+ssh://svn.cern.ch/reps/ganga/trunk/external/dashb/taskmonitor

The following file is added and is DIANE-specific: client/media/scripts/settings.js (a simple svn add settings.js)

The following are svn copies:

  • client/media/css is a copy of client/media/css_example (and may be merged within the branch separately if needed)
  • index.html

The corresponding commands are:

svn cp index.html_example index.html
cd client/media
svn cp css_example css
svn ci -m "your commit message"

Similarly for Ganga (TODO).

Send test messages to your development service instance

Run collector in the test mode

By default the collector runs in the test mode. So simply run: python manage.py runcollector

It will use the development server gridmsg001.cern.ch instead.

The msg destinations used by the collector will be the same as on the production server.

Configure Ganga to send messages in test mode

Now, suppose that you'd like to test the gangausage or monitoring messages. You can force Ganga to send messages to the development msg server like this:

ganga -o[MSGMS]server=gridmsg001.cern.ch

This works as of release 5.5.14 (previously the usage destination was hardcoded).

Configure DIANE to send messages in test mode

This works with code later than the 2.1 release.

Set the DIANE_MSG_TEST environment variable and all messages will be published to the development server.

Notes

Other ways to restart Apache

/etc/init.d/apache2 restart
Or
/etc/init.d/httpd restart

Service Documentation Form

The service documentation form is available here: GangaDIANEMonitoring

Dashboard Release Notes

1-14 : 01/04/2011

Gangamon: Greatly optimised the DB lookups for finding Ganga jobs with subjobs. Previously, this was done by iterating over the table for each job to count its subjobs. Now we collect this information via ActiveMQ and place it straight into the gangamon_gangajobdetail table. This required changes to the server code (views.py, model.py and eventproc.py), plus the addition of a column in the gangajobdetail table to hold the 'number of subjobs' integer. Note that the (single-line) change to eventproc.py isn't included in the code tagged as 1-14 in SVN. It has since been added to the dashboard trunk, so it will automatically be included in 1-15 when that is tagged.

-- JakubMoscicki - 2009-09-04

-- JakubMoscicki - 10-Nov-2010

-- RobCurrie - 2014-10-31

Topic revision: r1 - 2014-10-31 - RobCurrie
 