ATLAS Job Monitoring development (via MSG)

Structure

ATLAS Job Monitoring consists of 3 main parts:

  • MSG Consumer
  • User Interfaces
  • Database structure

MSG Consumer (installation details and troubleshooting)

The consumer runs on the pcitgd22 machine. Configuration file: /etc/msg-consume2db/msg-consume2db.conf

It contains the following:

  • Consumer ID
  • MSG Topic names
  • Database connection details
  • Name of the database view into which messages are inserted, with a description of its columns

There are several log files:

  • /var/log/msg-consume2db/msg-consume2db.log
  • /var/log/msg-consume2db/msg-consume2db.err
  • /var/log/msg-consume2db/msg-consume2db.out

Usually you should look into msg-consume2db.log. Sometimes, however, the consumer can't start because of a critical error; in that case its output is written to msg-consume2db.err.

To start, stop or restart the consumer, run service msg-consume2db start, service msg-consume2db stop or service msg-consume2db restart as root.

Watchdog script

  • The watchdog script (bash) checks:
    • whether the consume2db process is running, and starts the consumer if it has stopped;
    • the current status of the log file. If the log file has not been updated during the last 15 minutes, the script checks the reason:
      • if there are errors, it sends a mail (SMS) to the responsible persons with the error messages from the log file. The mail is sent only once when the error occurs, and another one with 'OK' status is sent when the problem is solved. The script also adds the error message from the consume2db log file to the watchdog_consume2db.log file;
      • if there are no errors, we probably have no job monitoring information, so the watchdog_consume2db.log file is updated with the message: "No job monitoring information yet".

Cron launches the watchdog script every 10 minutes.
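The log-file part of the check described above can be sketched as follows. This is only an illustration in Python, not the actual bash watchdog: the function names and the "ERROR" marker used to recognise error lines are assumptions, and mail sending and process restarts are left out.

```python
import os
import time

STALE_AFTER_SECONDS = 15 * 60  # the 15-minute threshold mentioned above


def log_is_stale(log_path, now=None, max_age=STALE_AFTER_SECONDS):
    """Return True if log_path was not modified during the last max_age seconds."""
    if now is None:
        now = time.time()
    return (now - os.path.getmtime(log_path)) > max_age


def find_error_lines(log_path, marker="ERROR"):
    """Return the log lines containing marker (the marker is an assumption)."""
    with open(log_path, errors="replace") as handle:
        return [line.rstrip("\n") for line in handle if marker in line]


def watchdog_message(log_path, now=None):
    """Return the text to append to watchdog_consume2db.log, or None if the log is fresh."""
    if not log_is_stale(log_path, now):
        return None
    errors = find_error_lines(log_path)
    if errors:
        # Report the most recent error found in the consumer log.
        return errors[-1]
    return "No job monitoring information yet"
```

A caller would pass /var/log/msg-consume2db/msg-consume2db.log as log_path and append the returned text, if any, to watchdog_consume2db.log.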

Common troubles

The latest version of the consumer is installed and is expected to run smoothly. With the previous version, however, I ran into 2 important problems:

  • The consumer can stop working due to an error. You should restart it.
    The watchdog_consume2db script, run from cron, checks whether the consumer is running and restarts it if it is not, so this is no longer a problem.

  • The consumer loses the connection to the Oracle database (ORA-03114) and stops inserting messages.
    In this case a restart also helps.
    A steadily growing number of pending messages in your topic also points to this issue. You can check it in the MSG broker web interface.

I also installed a database scheduler job which checks when the last update happened. If it was more than 1 hour ago, Julia and Laura get an alarm e-mail.

MSG web interface

We're getting messages from the gridmsg002 MSG broker. This is the link to its interface: https://gridmsg002.cern.ch/admin/

You need a dteam membership to browse it. You can register here: https://lcg-voms.cern.ch:8443/vo/dteam/vomrs

Once the consumer has connected to the broker, a virtual queue is created inside it. This queue is a kind of link between your consumer and the topic you are going to get messages from. To get the list of queues, click on the Queues link at the top of the page.
The name of the queue is a concatenation of the word Consumer + Connection_id (see consumer configuration file) + topic name.
Currently we have:

  • Connection_id: GANGA_Notification_Ins_NEW
  • Topic names: jobStatusTest, jobMetaTest, jobProcessingAttributes, taskMetaTest
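
As an illustration of the naming rule above, the expected queue names can be derived from the configuration values. This is a minimal sketch: the "." separator between the parts is an assumption (the text above only says the parts are concatenated), and the function names are hypothetical.

```python
CONNECTION_ID = "GANGA_Notification_Ins_NEW"
TOPICS = ["jobStatusTest", "jobMetaTest", "jobProcessingAttributes", "taskMetaTest"]


def queue_name(connection_id, topic, separator="."):
    """Build a queue name as Consumer + connection id + topic name.

    The separator is an assumption; compare the result with the names
    shown on the Queues page of the broker interface.
    """
    return separator.join(["Consumer", connection_id, topic])


def expected_queues(connection_id=CONNECTION_ID, topics=TOPICS):
    """Return one expected queue name per subscribed topic."""
    return [queue_name(connection_id, topic) for topic in topics]


if __name__ == "__main__":
    for name in expected_queues():
        print(name)
```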

By clicking the link with the selected queue name you'll get the number of messages that were enqueued, dequeued or dispatched from the topic (don't forget to click View Consumers!). The GANGA_Notification_Ins_NEW link gives you a list of all your subscriptions and message information.

What these numbers mean:

  • Enqueue counter tells you the number of messages sent to topics where you have an active subscription. Those messages have been allocated in a queue for delivery.
  • Dequeue counter tells you the number of messages that you have received (and acknowledged). They have been subtracted from the queue.
  • Dispatched Counter tells you the number of messages that have been sent to the consumer. Note that dispatched messages may not have all been acknowledged.
  • Pending Queue size represents the number of messages that are stored in the server waiting to be delivered. If a consumer is running smoothly, it should be 0.

The best situation is when Enqueue counter = Dequeue counter = Dispatched counter. It means that all the messages sent to the topic were received by the consumer and successfully inserted into the database.
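
The counter rules above can be expressed as a small check. This is a sketch only: the QueueCounters type and the diagnose function are hypothetical helpers, and the numbers have to be read manually from the broker web interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueCounters:
    """Counters shown for one queue in the broker web interface."""
    enqueued: int
    dequeued: int
    dispatched: int
    pending: int


def diagnose(counters):
    """Return a list of findings; an empty list means the queue looks healthy."""
    findings = []
    if counters.pending > 0:
        findings.append("pending messages waiting for delivery")
    if counters.dispatched > counters.dequeued:
        findings.append("dispatched messages not yet acknowledged")
    if counters.enqueued > counters.dispatched:
        findings.append("enqueued messages not yet dispatched")
    return findings
```

For example, diagnose(QueueCounters(enqueued=100, dequeued=100, dispatched=100, pending=0)) returns an empty list, which corresponds to the best situation described above.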

Job Monitoring web interface links

At the moment 2 monitoring applications are available:

They point to the ATLAS_DASHBOARD_JobMon database. The data comes from the Dashboard Ganga plugin via MSG.

Info for developers: all the changes that were made to the CMS Job Monitoring User Interface in order to meet ATLAS needs.

Common troubles with pcarda09 (server where monitoring applications are running)

Sometimes the server loses its Oracle connection. Restarting the httpd service helps.
How to restart httpd on pcarda09:

  • Log in as root
  • Run /usr/sbin/httpd -k stop and then /usr/sbin/httpd

Database structure

All incoming messages are inserted into the view described in the consumer configuration file. A trigger on that view inserts the messages into a raw data table, and then another trigger inserts the data into the database tables. Since we're getting messages from 4 different topics, there are 4 raw data tables:

  • JOB_STATUS_MSG
  • JOB_META_MSG
  • JOB_PROC_ATTRIBUTES_MSG
  • TASK_META_MSG

The triggers on the tables insert data to different database tables:

  • ins_job_meta_info (after insert on JOB_META_MSG)
  • ins_job_proc_attrs_info (after insert on JOB_PROC_ATTRIBUTES_MSG)
  • ins_to_job_and_task (after insert on JOB_STATUS_MSG)
  • ins_task_meta_info (after insert on TASK_META_MSG)
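
For quick reference when debugging, the table and trigger lists above can be kept as a lookup. This is a sketch: the table-to-trigger pairs come from the list above, while the topic-to-table pairing is an assumption based only on the similarity of the names and should be verified against the consumer configuration file.

```python
# Trigger fired after insert on each raw data table (from the list above).
RAW_TABLE_TRIGGERS = {
    "JOB_STATUS_MSG": "ins_to_job_and_task",
    "JOB_META_MSG": "ins_job_meta_info",
    "JOB_PROC_ATTRIBUTES_MSG": "ins_job_proc_attrs_info",
    "TASK_META_MSG": "ins_task_meta_info",
}

# ASSUMPTION: topic-to-table pairing inferred from the names only.
TOPIC_TO_RAW_TABLE = {
    "jobStatusTest": "JOB_STATUS_MSG",
    "jobMetaTest": "JOB_META_MSG",
    "jobProcessingAttributes": "JOB_PROC_ATTRIBUTES_MSG",
    "taskMetaTest": "TASK_META_MSG",
}


def trigger_for_topic(topic):
    """Return (raw table, trigger) that would handle messages from topic."""
    table = TOPIC_TO_RAW_TABLE[topic]
    return table, RAW_TABLE_TRIGGERS[table]
```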

All insertion errors are recorded in the MSG_INSERT_ERRORS table, so it is worth checking it from time to time.

The sources for the database schema, triggers, etc. can be found in the SVN module arda.dashboard.dao-oracle-job.
More details on triggers logic can be found here.

Common schema of ATLAS Job Monitoring:

ganga_mon_common_schema.jpg

-- IrinaSidorova - 07-May-2010

Topic revision: r11 - 2011-11-30 - LauraSargsyan
 