My Nagios Deployment History
This page collects notes written for personal use.
This is NOT an official document on Nagios or on the Nagios installation at CERN.
Installation and configuration
For the installation I used a single virtual machine and followed the instructions at GridMonitoringNcgYaim.
YAIM terminated successfully. Some of the links to the repo files are broken, so it is advisable to use yaimgen to install the UI and then install Nagios separately.
Issues
- The YAIM function config_nrpe_share fails when NCG_NRPE_OUTPUT_DIR is not set.
- Nagios is not started when configuring it with YAIM.
The httpd and nagios services are running correctly.
The file /var/log/httpd/error_log has:
[Tue Apr 07 09:31:53 2009] [error] [client 127.0.0.1] Directory index forbidden by rule: /var/www/html/
The httpd server responds correctly to HTTP requests, but there are problems with HTTPS:
[root@vtb-generic-80 yum.repos.d]# curl http://localhost/
HELLO GIANNI!
[root@vtb-generic-80 yum.repos.d]# curl https://localhost/
curl: (60) SSL certificate problem, verify that the CA cert is OK. Details:
error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
More details here: http://curl.haxx.se/docs/sslcerts.html
To address this problem, the BitFace CA certificate was appended to the file /usr/share/ssl/certs/ca-bundle.crt and the line
SSLCACertificateFile /usr/share/ssl/certs/ca-bundle.crt
was added to the file /etc/httpd/conf.d/ssl.conf. This line had been removed from that file by YAIM during the Nagios configuration.
Yet the curl https://localhost/ test still gives the same error.
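One way to narrow down a failure like this is to make curl verify the server against an explicit CA file rather than its built-in default. The sketch below assumes the bundle path mentioned above; the helper name check_https and the CA_BUNDLE variable are illustrative and not part of any tool.

```shell
#!/bin/bash
# Sketch: verify an HTTPS endpoint against an explicit CA bundle.
# CA_BUNDLE defaults to the bundle path mentioned in the notes above;
# check_https is an illustrative helper name.
CA_BUNDLE="${CA_BUNDLE:-/usr/share/ssl/certs/ca-bundle.crt}"

check_https() {
    local url="$1"
    # --cacert tells curl which CA file to use when verifying the server.
    # The response body is discarded; only the exit status matters here.
    curl --silent --show-error --cacert "$CA_BUNDLE" --output /dev/null "$url"
}

# Example (needs the web server to be running):
#   check_https https://localhost/ && echo "certificate verified"
```

curl also checks that the host name in the URL matches the certificate, so if the certificate was issued for the machine's full host name it may be worth repeating the test with that name instead of localhost.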
At this point one should be able to see the Nagios web interface at:
https://SERVER_NAME/nagios/
Monitoring a Linux machine with native checks
Using NRPE
NRPE was installed and used following the NRPE2.0 document. Thanks to Ethan Galstad for writing such a clear introduction!
Issues
- NRPE configuration: on an SLC4 machine where the FTS service was installed, the configuration failed because the C compiler was missing; it was installed with 'yum install gcc'. The configuration script then failed because of missing SSL headers, which were installed with 'yum install openssl-devel'.
- iptables configuration: if you get the following error when inserting a rule in the iptables chain:
[root@lxbra2310 nrpe-2.12]# iptables -I INPUT -p tcp -m tcp --dport 5666 -j accept
iptables v1.2.11: Couldn't load target `accept':/lib/iptables/libipt_accept.so: cannot open shared object file: No such file or directory
Try `iptables -h' or 'iptables --help' for more information.
you need to change '-j accept' into '-j ACCEPT': target names are case-sensitive.
- iptables: the Nagios host cannot run check_nrpe against the remote host:
[root@vtb-generic-69 nrpe-2.12]# /usr/local/nagios/libexec/check_nrpe -H 128.142.182.87
Connection refused by host
After re-executing the previous iptables -I command the problem disappeared, and the remote host is now contacted correctly:
[root@vtb-generic-69 nrpe-2.12]# /usr/local/nagios/libexec/check_nrpe -H 128.142.182.87
NRPE v2.12
I found out that on the monitored machine there is a cron job that runs hourly to maintain a certain firewall configuration. The right setup for a production environment still has to be clarified.
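Since a periodic job can remove the rule again, a small helper that re-adds it only when it is missing may be useful while the production setup is being sorted out. This is a sketch, not a recommended firewall policy: the function names are illustrative, and the parsing assumes the usual column layout of 'iptables -L INPUT -n'. It inspects the listing instead of relying on options that older iptables releases may not have.

```shell
#!/bin/bash
# Sketch: add the NRPE ACCEPT rule only if it is not already present.
# Function names are illustrative; 5666 is the NRPE port used above.
NRPE_PORT=5666

# Reads an 'iptables -L <chain> -n' listing on stdin and succeeds if it
# contains an ACCEPT rule for the given TCP destination port.
listing_accepts_port() {
    local port="$1"
    grep -Eq "^ACCEPT .*tcp .*dpt:${port}( |\$)"
}

# Must be run as root on the monitored host.
ensure_nrpe_rule() {
    if iptables -L INPUT -n | listing_accepts_port "$NRPE_PORT"; then
        echo "ACCEPT rule for port $NRPE_PORT already present"
    else
        iptables -I INPUT -p tcp -m tcp --dport "$NRPE_PORT" -j ACCEPT
    fi
}

# Example:
#   ensure_nrpe_rule
```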
After this, the check_nrpe!check_load service has been added to the object definition for the remote host and it worked fine.
The service details window in Nagios looks like the following picture:
Screnshot-1.png
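Before adding a service like this, it can help to run the same check by hand from the Nagios host. The sketch below wraps the check_nrpe call shown earlier and adds the -c option, which names the command the remote NRPE daemon should run. The helper name nrpe_check and the CHECK_NRPE variable are illustrative, and check_load is assumed to be defined in the nrpe.cfg of the monitored host.

```shell
#!/bin/bash
# Sketch: run a named NRPE command on a remote host by hand.
# CHECK_NRPE defaults to the plugin path used earlier on this page;
# nrpe_check is an illustrative helper name.
CHECK_NRPE="${CHECK_NRPE:-/usr/local/nagios/libexec/check_nrpe}"

nrpe_check() {
    local host="$1" remote_command="$2"
    # -H selects the monitored host, -c the command defined in its nrpe.cfg.
    "$CHECK_NRPE" -H "$host" -c "$remote_command"
}

# Example:
#   nrpe_check 128.142.182.87 check_load
```

The exit status of the helper is that of the plugin, so it can be used directly in shell conditionals.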
Using specific tests without proxy
For this test we used the FTS-basic tests available from the certification tests repository. The bash script FTS-basic checks the host, the Tomcat server and the LDAP server.
For this test the script was copied to the /tmp directory with group ownership set to nagioscmd.
At this point, the object created to manage the FTS host checks is fts32.cfg.
Using specific tests that require a proxy
Most of the tests used to monitor grid services need a VOMS proxy in order to execute commands from a UI.
As a first test, I used the FTS-service script, which checks some FTS properties using the CLI and therefore needs a valid proxy.
The proxy file has been created using the nagios account (test_user key/cert owned by nagios):
[root@vtb-generic-69 ~]# ls -ltr /tmp/x509up_u100
-rw------- 1 nagios nagios 6415 May 28 10:55 /tmp/x509up_u100
The test script is in /tmp.
After testing the script manually from the nagios account, I updated the object file fts32.cfg to include:
define command{
        command_name    FTS-services
        command_line    /tmp/FTS-services --site cert-tb-cern --fts $HOSTADDRESS$ --bdii lxbra2305.cern.ch
        }
define service{
        use                     generic-service
        host_name               lxbra2310.cern.ch
        service_description     FTS service checks
        check_command           FTS-services
        }
The following screenshot shows the successful execution of the check:
Screnshot-3.png
In a production installation, the proxy used by Nagios has to be renewed periodically. NCG (see below) provides a script to do this using the MyProxy server.
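Independently of how the proxy is renewed, it may be useful to let Nagios itself warn when the proxy is about to expire. The sketch below maps a remaining lifetime in seconds to the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name proxy_status and the default thresholds are illustrative, and the commented example assumes that voms-proxy-info is installed and accepts the -file and -timeleft options.

```shell
#!/bin/bash
# Sketch: turn a proxy lifetime (seconds) into a Nagios plugin result.
# proxy_status and the default thresholds are illustrative assumptions.
proxy_status() {
    local timeleft="$1" warn="${2:-7200}" crit="${3:-1800}"

    # Anything that is not a plain non-negative integer is reported as UNKNOWN.
    case "$timeleft" in
        ''|*[!0-9]*)
            echo "UNKNOWN: cannot read proxy lifetime"
            return 3
            ;;
    esac

    if [ "$timeleft" -le "$crit" ]; then
        echo "CRITICAL: proxy expires in ${timeleft}s"
        return 2
    elif [ "$timeleft" -le "$warn" ]; then
        echo "WARNING: proxy expires in ${timeleft}s"
        return 1
    fi
    echo "OK: proxy valid for ${timeleft}s"
    return 0
}

# Example (assumes voms-proxy-info is available on the Nagios host):
#   proxy_status "$(voms-proxy-info -file /tmp/x509up_u100 -timeleft 2>/dev/null)"
```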
Using NCG
NCG is the Nagios configuration generator. It reads site-specific information from a BDII and produces Nagios configuration files to monitor the resources published in the BDII for that site. The NCG installation is described here: GridMonitoringNcgYaim.
The NCG installation has been tried on a new virtual machine, vtb-generic-95.
Issues
- The yaim function config_nrpe_share fails when NCG_NRPE_OUTPUT_DIR is not set.
- The default ncg.conf works, but to automatically add the hosts found in the BDII you have to set ADD_HOST=1 in the NCG::SiteInfo::LDAP module and then restart ncg.pl and the Nagios daemon.
At this point Nagios shows in the web interface all the hosts found in the BDII with the CERN site name:
Screnshot-2.png
Tech Corner
This section gathers some technical notes and tips about Nagios picked up while reading various docs.
- A service, in Nagios terms, is always a (host, service) pair. Therefore, you can have two service definitions with the same name on different hosts.
- The file resource.cfg, readable only by nagios, is a good place to store passwords defined as macros.
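The "readable only by nagios" part of the last tip can be checked from the shell. This is a minimal sketch: the helper name is_private_file is illustrative, and the default path assumes a source installation under /usr/local/nagios, matching the plugin path used earlier on this page.

```shell
#!/bin/bash
# Sketch: succeed only if a file grants no permissions to group or others.
# RESOURCE_CFG is an assumed default path; is_private_file is illustrative.
RESOURCE_CFG="${RESOURCE_CFG:-/usr/local/nagios/etc/resource.cfg}"

is_private_file() {
    local file="$1" listing
    listing=$(ls -ld "$file") || return 2
    # Characters 5-10 of the mode string are the group and other bits.
    [ "$(printf '%s' "$listing" | cut -c5-10)" = "------" ]
}

# Example:
#   is_private_file "$RESOURCE_CFG" || echo "resource.cfg is too permissive"
```

The check only looks at the permission bits; confirming that the owner is the nagios account is a separate step, for example with ls -l.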
--
GianniPucciani - 07 Apr 2009