ATLAS Job Monitoring development (via MSG)
Structure
ATLAS Job Monitoring consists of 3 main parts:
- MSG Consumer
- User Interfaces
- Database structure
MSG Consumer (installation details and troubleshooting)
The consumer runs on the pcitgd22 machine. Configuration file: /etc/msg-consume2db/msg-consume2db.conf
The configuration file contains the following:
- Consumer ID
- MSG Topic names
- Database connection details
- Name of the database view into which messages are inserted, and the description of its columns.
There are several log files:
- /var/log/msg-consume2db/msg-consume2db.log
- /var/log/msg-consume2db/msg-consume2db.err
- /var/log/msg-consume2db/msg-consume2db.out
Usually you should look at the .log file. If the consumer cannot start because of a critical error, its output is written to msg-consume2db.err instead.
To start, stop or restart the consumer, run as root:
service msg-consume2db {start|stop|restart}
Watchdog script
The watchdog script (bash) checks:
- whether the consume2db process is running, and starts the consumer if it has stopped;
- the current status of the log file. If the log file has not been updated during the last 15 minutes, the script checks the reason and:
- - if there are errors, sends a mail (SMS) to the responsible persons with the error messages from the log file (only once when the error occurs, and once more with an "OK" status when the problem is solved), and adds the error message from the consume2db log file to the watchdog_consume2db.log file;
- - if there are no errors, we probably have no job monitoring information, so the watchdog_consume2db.log file is updated with the message "No job monitoring information yet".
Crontab launches the watchdog script every 10 minutes.
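The decision logic of one watchdog run can be sketched roughly as follows. This is only an illustration in Python, not the actual bash script: the names WatchdogState and decide_actions and the action labels are hypothetical, and the rule for when the "OK" mail is sent (log fresh again after an alert) is an assumption.

```python
from dataclasses import dataclass

# "not updated during the last 15 minutes", taken from the description above
STALE_AFTER_SECONDS = 15 * 60


@dataclass
class WatchdogState:
    """Inputs for one watchdog run (hypothetical structure)."""
    process_running: bool     # is the consume2db process alive?
    log_mtime: float          # last modification time of the consumer log (epoch seconds)
    log_has_errors: bool      # does the consumer log contain error messages?
    alert_already_sent: bool  # was an error mail already sent for this incident?


def decide_actions(state, now):
    """Return the list of actions (hypothetical labels) for one run."""
    actions = []
    if not state.process_running:
        actions.append("start_consumer")

    log_is_stale = (now - state.log_mtime) > STALE_AFTER_SECONDS
    if log_is_stale:
        if state.log_has_errors:
            # The error is always copied to watchdog_consume2db.log,
            # but the mail is sent only once per incident.
            actions.append("append_error_to_watchdog_log")
            if not state.alert_already_sent:
                actions.append("send_error_mail")
        else:
            actions.append("log_no_job_monitoring_information")
    elif state.alert_already_sent:
        # Assumption: a fresh log after an alert means the problem is solved.
        actions.append("send_ok_mail")
    return actions
```

For example, a stopped consumer with a stale log containing errors yields start_consumer, append_error_to_watchdog_log and send_error_mail in that order.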
Common troubles
The latest version of the consumer is installed and is expected to run smoothly. While using the previous version, however, I ran into 2 important problems:
- The consumer can stop working due to an error. You should restart it.
The watchdog_consume2db script launched from crontab checks whether the consumer is running and restarts it if it is not, so this is no longer a problem.
- The consumer loses the connection to the Oracle database (ORA-03114) and stops inserting messages.
In this case a restart also works.
A steadily growing number of pending messages in your topic also points to this issue. You can check it in the MSG broker web interface.
I also installed a database scheduler job which checks when the last update happened. If it was more than 1 hour ago, Julia and Laura get an alarm e-mail.
MSG web interface
We're getting messages from the gridmsg002 MSG broker. This is the link to its interface:
https://gridmsg002.cern.ch/admin/
You need dteam membership to browse it. You can register here:
https://lcg-voms.cern.ch:8443/vo/dteam/vomrs
Once the consumer has connected to the broker, a virtual queue is created on the broker. This queue acts as a link between your consumer and the topic you want to receive messages from. To get the list of queues, click the Queues link at the top of the page.
The name of the queue is a concatenation of the word Consumer + Connection_id (see consumer configuration file) + topic name.
Currently we have:
- Connection_id: GANGA_Notification_Ins_NEW
- Topic names: jobStatusTest, jobMetaTest, jobProcessingAttributes, taskMetaTest
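As a small illustration, the queue names to look for can be generated from the values above. This is a sketch only: the "." separator is an assumption (the page only says the parts are concatenated), the broker may add its own prefix to the topic name, and the function names are hypothetical.

```python
# Values taken from the list above
CONNECTION_ID = "GANGA_Notification_Ins_NEW"
TOPICS = ["jobStatusTest", "jobMetaTest", "jobProcessingAttributes", "taskMetaTest"]


def queue_name(connection_id, topic, separator="."):
    """Concatenate 'Consumer' + connection id + topic name.

    The separator is an assumption; check the Queues page for the real form.
    """
    return separator.join(["Consumer", connection_id, topic])


def expected_queue_names(connection_id, topics, separator="."):
    """Build one queue name per subscribed topic."""
    return [queue_name(connection_id, topic, separator) for topic in topics]


if __name__ == "__main__":
    for name in expected_queue_names(CONNECTION_ID, TOPICS):
        print(name)
```

With four topics there should be four such queues on the Queues page, one per topic.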
Clicking the link with the selected queue name shows the number of messages that were enqueued, dequeued or dispatched from the topic. (Don't forget to click View Consumers!) The GANGA_Notification_Ins_NEW link gives you a list of all your subscriptions and message information.
Some information about what these numbers mean:
- Enqueue counter tells you the number of messages sent to topics where you have an active subscription. Those messages have been allocated in a queue for delivery.
- Dequeue counter tells you the number of messages that you have received (and acknowledged). They have been subtracted from the queue.
- Dispatched Counter tells you the number of messages that have been sent to the consumer. Note that dispatched messages may not have all been acknowledged.
- Pending Queue size represents the number of messages that are stored on the server waiting to be delivered. If a consumer is running smoothly it should be 0.
The best situation is when Enqueue counter = Dequeue counter = Dispatched Counter. It means that all the messages sent to the topic were received by the consumer and successfully inserted into the database.
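The interpretation of the counters can be written down as a small check. This is a minimal sketch, assuming the four numbers have been read by hand from the broker web interface; QueueCounters, diagnose and the returned labels are hypothetical names, and the order of the checks is a design choice, not something the broker defines.

```python
from dataclasses import dataclass


@dataclass
class QueueCounters:
    """Counters as shown for one queue in the broker web interface."""
    enqueued: int    # messages allocated in the queue for delivery
    dequeued: int    # messages received and acknowledged by the consumer
    dispatched: int  # messages sent to the consumer (maybe not yet acknowledged)
    pending: int     # messages stored on the server waiting to be delivered


def diagnose(c):
    """Classify the state of a queue from its counters."""
    if c.pending == 0 and c.enqueued == c.dequeued == c.dispatched:
        return "ok"  # everything sent to the topic was received and acknowledged
    if c.pending > 0:
        return "backlog"  # messages are waiting; the consumer may be stuck
    if c.dispatched > c.dequeued:
        return "unacknowledged"  # delivered to the consumer but not acknowledged
    return "check"  # counters disagree in some other way; look at the logs
```

A "backlog" result that keeps growing between two checks matches the lost Oracle connection case described under Common troubles, where restarting the consumer helps.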
Job Monitoring web interface links
At the moment 2 monitoring applications are available:
They point to the ATLAS_DASHBOARD_JobMon database, and the data comes from the Dashboard Ganga plugin via MSG.
Info for the developers! All changes that were added to the CMS Job Monitoring User Interface in order to meet ATLAS needs.
Common troubles with pcarda09 (server where monitoring applications are running)
Sometimes the server loses its Oracle connection. Restarting the httpd service helps.
How to restart httpd on pcarda09:
- Log in as root.
- Run
/usr/sbin/httpd -k stop
and then
/usr/sbin/httpd
Database structure
All incoming messages are inserted into the view described in the consumer configuration file. A trigger on that view inserts the messages into a raw data table, and then another trigger inserts the data into the database tables. Since we get messages from 4 different topics, there are 4 raw data tables:
- JOB_STATUS_MSG
- JOB_META_MSG
- JOB_PROC_ATTRIBUTES_MSG
- TASK_META_MSG
The triggers on these raw data tables insert the data into different database tables:
- ins_job_meta_info (after insert on JOB_META_MSG)
- ins_job_proc_attrs_info (after insert on JOB_PROC_ATTRIBUTES_MSG)
- ins_to_job_and_task (after insert on JOB_STATUS_MSG)
- ins_task_meta_info (after insert on TASK_META_MSG)
All insert errors are recorded in the MSG_INSERT_ERRORS table, so it is worth checking it from time to time.
The sources for the database schema, triggers, etc. can be found in the SVN module arda.dashboard.dao-oracle-job.
More details on the trigger logic can be found here.
Common schema of ATLAS Job Monitoring:
--
IrinaSidorova - 07-May-2010