Nagios-ActiveMQ binding

Practical information

This page describes the new bridge between Nagios and ActiveMQ : what was modified and how to set up the new architecture.

In short, the old directory-based queue has been removed and messages now flow directly into the messaging system through a local message broker.

-- JulienPerrochet - last update : 25 Aug 2009


Here is the current SVN address for the project :

Complementary Pages

  • NagiosActiveMQConfiguration : Details about ActiveMQ's configuration.
  • JMXQueryTool : Details about a script that uses Jmx4Perl to get information from JMX MBeans on a server running the corresponding applet.


The directory containing the updated scripts and data is structured in the following way :
  • config_scripts/ Directory holding the necessary data for automated configuration.
    • Call this script to configure the local ActiveMQ message broker.
    • Used by to generate the activemq.xml configuration file.
    • configs/ Holds configuration data.
      • activemq_raw_xml.xml Raw XML model for the configuration file. It has variables in it.
      • activemq-configurator.cfg Defines the variables contained in the raw .xml file.
  • message_handlers/ The updated message handlers.
    • New MetricOutput handler.
  • monitoring/ Scripts used for monitoring purposes.
    • Uses Jmx4Perl to get information about JMX MBeans on a server running the corresponding applet.
  • msg-to-handler The updated daemon.
  • nagios_scripts Directory holding the new scripts.

Useful Links


The New Idea

The idea is to set up a local ActiveMQ message broker and let it handle the forwarding of messages to and from the global messaging infrastructure (and the queueing of messages when the backbone is down, etc.). As ActiveMQ is already used at the top level, this also makes the messaging network more homogeneous.

Furthermore, ActiveMQ in combination with Apache Camel can be a very powerful tool for routing and message transformation.

Production Tests

Real production tests have not yet been carried out, but test benches closely resembling production conditions have given fully satisfactory results.

Practical Comments

  • Disk space : it is critical to react when the Nagios instance reports that it is running out of disk space. If disk space is insufficient, ActiveMQ will simply die. Nagios then falls back to the directory queue to store messages, which fills the disk even further, up to the point where ActiveMQ refuses to start at all. Nagios itself does not use much disk space, but I ran into this problem on a virtual machine with less than 10 GB of disk space.
  • Cache Recovery : Recovering disk-cached messages (via send_to_msg ) can take some time, especially if the local broker was down for several hours. In such cases the script generally takes too long and Nagios returns SEND_TO_MSG UNKNOWN - Timeout while fetching results. for the check. This is bad because it interrupts the upload process in an unpredictable way. I solved this by limiting the number of messages uploaded in one run, so a full cache recovery completes only after the corresponding check has been triggered several times.
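
The batch limit described under Cache Recovery can be sketched as follows. This is only a minimal illustration in Python, not the actual script: the function name, the send callable, the default limit of 100 and the assumption that cached file names sort in chronological order are all hypothetical.

```python
import os
from typing import Callable


def flush_cached_messages(cache_dir: str,
                          send: Callable[[bytes], None],
                          max_messages: int = 100) -> int:
    """Send at most max_messages cached files and return how many were sent.

    Assumption: file names sort in chronological order, so the oldest
    messages are uploaded first.
    """
    sent = 0
    for name in sorted(os.listdir(cache_dir)):
        if sent >= max_messages:
            break  # leave the rest for the next run of the check
        path = os.path.join(cache_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as handle:
            payload = handle.read()
        send(payload)    # if this raises, the file stays on disk
        os.remove(path)  # delete only after a successful send
        sent += 1
    return sent
```

Because each run handles a bounded number of messages, its duration stays bounded too, and any remaining backlog is picked up the next time the check is triggered.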

Stress Tests

To get an idea of how far a new setup could be pushed, I have done some stress tests on an Intel(R) Xeon(R) CPU E5410 @ 2.33GHz (2/8) box (lxbrf2711).

The system could durably absorb up to 20 messages per second on the fly. Higher rates (tested up to 2000 msg/sec) begin to fill up the incoming queues, but still without saturating CPU resources. The bottleneck here is the command pipe : dummy messages that are not written to it do not accumulate in the queues unless they arrive at rates higher than 1000 msg/sec.

The few tests I have run allow me to conclude that such a setup should be able to handle a relatively high constant message flow and to deal correctly with sizeable message bursts.
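
To get a feeling for what "catching up" after a burst means, here is a rough back-of-the-envelope model in Python. It assumes a constant drain rate (for example the 20 msg/s figure above) and linear behaviour, which the real system does not guarantee. The function names and the example numbers in the note below are purely illustrative and are not measurements.

```python
def backlog_after_burst(burst_rate: float, drain_rate: float,
                        burst_seconds: float) -> float:
    """Messages left in the queue after a burst, assuming constant rates."""
    return max(0.0, (burst_rate - drain_rate) * burst_seconds)


def catch_up_seconds(backlog: float, drain_rate: float,
                     steady_rate: float) -> float:
    """Time needed to empty the backlog while steady_rate msg/s keep arriving."""
    spare = drain_rate - steady_rate
    if spare <= 0:
        raise ValueError("the queue never empties if arrivals >= drain rate")
    return backlog / spare
```

For instance, under these assumptions a hypothetical 10 second burst at 2000 msg/s drained at 20 msg/s leaves (2000 - 20) * 10 = 19800 messages queued. With 10 msg/s still arriving afterwards, emptying the queue would take 19800 / (20 - 10) = 1980 seconds.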

message_LowStressTest2.png message_StressTest2.png

The upper images show the message processing rate : enqueued messages are in red, dequeued messages are in green. The lower images show the corresponding CPU usage. The sharp drop in the left set of images reflects the purge of the pending messages. On the right set, they are not purged and you can see how the system catches up.

CPU_LowStressTest2.png CPU_StressTest2.png

One possible explanation for the high system CPU usage once the message rate exceeds 20 msg/s is that several Nagios plugins and processes compete to write to the command pipe. This should be investigated if high message rates are to become common.
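
For context, every passive result ends up as one line written to the Nagios command pipe, in the standard external command format PROCESS_SERVICE_CHECK_RESULT. The Python sketch below shows what one such write looks like. The function names are hypothetical, the pipe path must be supplied by the caller, and replacing newlines with spaces is a simplification made for this example.

```python
import time
from typing import Optional


def format_passive_result(host: str, service: str, return_code: int,
                          output: str,
                          timestamp: Optional[int] = None) -> str:
    """Build one PROCESS_SERVICE_CHECK_RESULT external command line."""
    if return_code not in (0, 1, 2, 3):  # OK, WARNING, CRITICAL, UNKNOWN
        raise ValueError("invalid Nagios return code: %r" % return_code)
    if timestamp is None:
        timestamp = int(time.time())
    # Simplification: keep the command on a single line.
    clean_output = output.replace("\n", " ")
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, return_code, clean_output)


def write_to_command_pipe(pipe_path: str, line: str) -> None:
    """Write one command line to the pipe.

    Opening a FIFO for writing blocks until a reader (Nagios) has it open.
    """
    with open(pipe_path, "w") as pipe:
        pipe.write(line)
```

Each writer opens the pipe, writes and closes it again, so many concurrent writers mean many short-lived open/write/close cycles on the same pipe.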

Information about messaging on the server was obtained with the help of the JMXQueryTool.


Modified Scripts

  • handle_service_check_amq - updated - Now directly sends the MetricOutput to the local message broker. Messages will be stored on disk if the broker is unavailable.
  • handle_service_change_amq - updated - Now directly sends the Notifications to the local message broker. Messages will be stored on disk if the broker is unavailable.
  • recv_from_queue - removed - No longer in use : msg-to-handler now handles message importation.
  • check_config_amq - updated - Now does the following :
    • Checks if the local NCG SQLite database already holds new or updated imported configs. If so, it raises a warning.
    • Checks if the local NCG SQLite database holds configs to export and sends them to the message broker.
  • send_to_msg - updated to flush_dirqueue - Checks whether some outgoing messages were cached on disk and sends them to the local message broker. Messages are cached on disk only if the local broker is unavailable.
  • msg-to-handler - updated - This is the daemon listening on the incoming queues. Updated to listen on the local broker and on the correct queues.
  • test_local_broker - new - Sends a message to a queue and tries to receive it. Used to check the local broker's capabilities.
  • - updated - The MetricOutput message handler has been updated to directly write passive results to the command pipe.
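
The "send to the local broker, store on disk if it is unavailable" behaviour described for handle_service_check_amq and handle_service_change_amq could look roughly like the Python sketch below. It is not the actual handler code: the function name, the send callable, the file naming scheme and the choice of OSError as the failure signal are assumptions made for this illustration.

```python
import os
import time
import uuid
from typing import Callable


def send_or_cache(payload: bytes,
                  send: Callable[[bytes], None],
                  cache_dir: str) -> bool:
    """Try to send payload; if the broker is unreachable, cache it on disk.

    Returns True if the message was sent, False if it was cached.
    """
    try:
        send(payload)
        return True
    except OSError:  # connection and socket errors are subclasses of OSError
        os.makedirs(cache_dir, exist_ok=True)
        # Time-ordered, collision-resistant file name (hypothetical scheme).
        name = "%020d-%s.msg" % (time.time_ns(), uuid.uuid4().hex)
        tmp_path = os.path.join(cache_dir, name + ".tmp")
        with open(tmp_path, "wb") as handle:
            handle.write(payload)
        # Rename once fully written so a reader never sees a partial file.
        os.rename(tmp_path, os.path.join(cache_dir, name))
        return False
```

A separate job (the role played here by flush_dirqueue) can later pick up the cached files and resend them once the broker is back.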

Modified Nagios Services

  • org.egee.CheckLocalBrokerMessaging - new - Checks if the broker can receive and send a message to/from a queue.
  • org.egee.RecvFromQueue - removed - This service triggered the recv_from_queue script and is therefore no longer needed in the new architecture.
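
A round-trip check such as org.egee.CheckLocalBrokerMessaging can be pictured as in the Python sketch below. It is only an illustration: the send and receive callables stand in for a real broker client, and the mapping of failures to WARNING or CRITICAL is an assumption. Only the numeric Nagios return codes are standard.

```python
import uuid
from typing import Callable, Optional, Tuple

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_round_trip(send: Callable[[str], None],
                     receive: Callable[[float], Optional[str]],
                     timeout: float = 10.0) -> Tuple[int, str]:
    """Send a unique token to a queue and try to read it back."""
    token = uuid.uuid4().hex
    try:
        send(token)
        received = receive(timeout)
    except Exception as exc:  # any client error counts as a failed check
        return CRITICAL, "CRITICAL - broker error: %s" % exc
    if received is None:
        return CRITICAL, "CRITICAL - no message received within %.0f s" % timeout
    if received != token:
        return WARNING, "WARNING - received an unexpected message"
    return OK, "OK - message round trip succeeded"
```

A plugin built on this would print the returned text and exit with the returned code, which is how Nagios reads a check result.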

Configuration & Modifications


If no direct installation is available for the new architecture (through YUM or YAIM for example), you will have to make the Nagios modifications yourself. That is :
  • Copy the new Nagios scripts to /usr/libexec/grid-monitoring/plugins/nagios/ .
  • Modify the object definition files in /etc/nagios/wlcg.d/ .
  • Replace the msg-to-queue daemon with the msg-to-handler one.

ActiveMQ :

This script generates the /etc/activemq/activemq.xml file that ActiveMQ needs when it starts. It reads its own config file, a raw ActiveMQ configuration file and a message handler definition file to produce the corresponding ActiveMQ configuration.

It uses the Perl script to create the file.

See NagiosActiveMQConfiguration for more details about the new architecture's configuration.

ActiveMQ must be restarted for the modifications to take effect. This is usually done with $ service activemq restart.

Used Files
  • configs/activemq_raw_xml.xml The model for the ActiveMQ configuration file. It also holds some static configurations.
  • configs/activemq-configurator.cfg Some variables that will be put into the ActiveMQ configuration file can be set here.
  • /etc/msg-to-queue/msg-to-queue.conf This file holds the definitions of the message handlers. We need to read it because some topics and queues from the backbone have to be bridged, and they are defined in this file.
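
The general idea of producing activemq.xml from a raw model plus a file of variable definitions can be illustrated with the Python sketch below. The real configurator is a different script and its file formats are not shown on this page. The ${NAME} placeholder syntax, the "NAME = value" line format and both function names are assumptions made for the example.

```python
from string import Template
from typing import Dict


def parse_variables(cfg_text: str) -> Dict[str, str]:
    """Parse 'NAME = value' lines, ignoring blank lines and # comments."""
    variables: Dict[str, str] = {}
    for raw_line in cfg_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("not a NAME = value line: %r" % raw_line)
        variables[key.strip()] = value.strip()
    return variables


def render_config(raw_xml: str, variables: Dict[str, str]) -> str:
    """Replace ${NAME} placeholders; raises KeyError if one is undefined."""
    return Template(raw_xml).substitute(variables)
```

With this placeholder syntax, any literal $ in the raw model has to be written as $$. Failing loudly on an undefined variable is deliberate, since a half-filled configuration file would be worse than none.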


  • Current local messaging architecture (thank you to EmirImamagic for this picture) :

  • New messaging without failback details :


  • New messaging with failback details :

Topic attachments
  • CPU_LowStressTest2.png (r1, 37.6 K, 2009-08-26 11:31, UnknownUser) : CPU Usage during messaging stress test
  • CPU_StressTest2.png (r1, 44.0 K, 2009-08-27 09:29, UnknownUser) : Stress test with messages holding passive results - high rate then low constant - CPU
  • MSG-Nagios-Bridge.jpg (r1, 169.5 K, 2009-08-14 15:29, JulienPerrochet)
  • message_LowStressTest2.png (r1, 20.2 K, 2009-08-26 11:03, UnknownUser) : Stress test with messages holding passive results
  • message_StressTest2.png (r1, 24.5 K, 2009-08-27 09:30, UnknownUser) : Stress test with messages holding passive results - high rate then low constant - Messages
  • new_scheme.jpg (r1, 42.3 K, 2009-08-14 15:26, JulienPerrochet) : New local messaging scheme, without failback detail.
  • new_scheme2.jpg (r1, 128.3 K, 2009-08-20 15:00, UnknownUser) : Second scheme revision
  • new_scheme2_failback.jpg (r1, 145.6 K, 2009-08-20 15:00, UnknownUser) : Second scheme revision with failback detail.
  • new_scheme_failback.jpg (r1, 51.4 K, 2009-08-14 15:28, JulienPerrochet)