How to validate a ROC or NGI Nagios box
Overview
The egee-NAGIOS packaging and configuration cover four different configurations:
- Site-Nagios - Monitoring of a site
- Regional-Nagios - Monitoring of an EGEE ROC
- National-Nagios - Monitoring at the NGI level
- Project-Nagios - Central project monitoring
This document covers what you need to do to configure a National-Nagios or Regional-Nagios so that it can take over the definitive testing role within EGEE from the Project Nagios instance currently running.
Process for a National or Regional Nagios to get validated
A high-level description of the whole process:
- Join tool-admins mailing list. Register here
- Register your node as the relevant flavour of Nagios in GOCDB (Regional-Nagios, National-Nagios)
- Register for access to the SAM PI (ONLY if you don't want to use ATP as the topology provider for NCG)
- Open a GGUS ticket
- Ask in the ticket for it to be assigned to the 'Nagios' Support Unit
- Mention in the ticket the IP address of your Nagios instance and that you need access to the SAM PI to configure it
- Install egee-NAGIOS using the relevant configuration below
- Publish the GRIS running on your Nagios node into the information system
- Raise a GGUS ticket to the Nagios support unit to start the validation process, after first completing the steps described below.
Validation Process
The Project-level Nagios hosts are listed here: https://twiki.cern.ch/twiki/bin/view/EGEE/NagiosROCURL. The validation process consists of comparing the setup of a new regional or national Nagios against the current project Nagios instance.
In order to validate your instance, please follow these steps:
- Ensure that all the egee-sa1 packages are upgraded. To check, run the following query, which should return no data:
[root~]# repoquery --pkgnarrow=updates --disablerepo=\* --enablerepo=egee-sa1 -qa --queryformat ' yum update %{name} '
- Ensure that the ncg cron job is executed regularly (every 3 hours in our case): https://tomtools.cern.ch/jira/browse/SAM-402
- Check if your services are being tested by the metrics defined in the ROC SAM critical profile, described here: https://twiki.cern.ch/twiki/bin/view/LCG/MDDBProfilesSAM#ROC_SAM_critical
- Once you have done this, open a GGUS ticket to be assigned to the 'Nagios' support unit. Please state in the ticket which is your ROC/NGI Nagios instance, so that we can:
- add the Nagios instance to the ops-monitor Nagios (https://ops-monitor.cern.ch/nagios/) to compare the number of services and hosts with the project-level instance
- compare the status of your services to the ones defined in the central Nagios instance at CERN.
- For a ROC to validate an NGI Nagios instance, use the ops-monitor Nagios:
- Select Service Groups --> Summary --> SERVICE_NagiosNGIDiff
- Select your NGI and then check the ngi.nagios.diff, which describes the differences between the services and probes run by your ROC and your NGI. This check compares the metrics defined in the ROC_CRITICAL profile, i.e., it covers the sBDII, SRMv2 and CE services.
- Go to your NGI Nagios instance and check those services and metrics to see why they are failing there but not in the ROC Nagios instance.
- Once all those discrepancies are fixed or understood, the ROC considers the NGI Nagios instance validated.
List of ROC Nagios to validate.
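The first validation step (no pending egee-sa1 updates) can be turned into a pass/fail check. The following is a minimal sketch, not part of the egee-NAGIOS packaging: the helper name pending_updates_ok is hypothetical, and it only inspects whatever the repoquery command from that step prints.

```shell
#!/bin/sh
# Sketch: succeed only when the repoquery output lists no pending updates.
# pending_updates_ok is a hypothetical helper name, not an egee-NAGIOS tool.
pending_updates_ok() {
    # Read the query output from stdin and drop blank lines.
    pending=$(grep -v '^[[:space:]]*$' || true)
    if [ -n "$pending" ]; then
        printf 'Packages still to upgrade:\n%s\n' "$pending" >&2
        return 1
    fi
    return 0
}

# Usage on the Nagios box, with the query from the validation steps:
# repoquery --pkgnarrow=updates --disablerepo=\* --enablerepo=egee-sa1 -qa \
#     --queryformat ' yum update %{name} ' | pending_updates_ok
```

A zero exit status means the query returned no data, which is the condition the validation expects.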
Software version
You should deploy the same versions of the components that are running on your corresponding CERN Nagios instance, namely those included in the meta package egee-NAGIOS-1.0.0-48.el5.noarch.rpm, available from the egee-SA1 repository (http://www.sysadmin.hep.ac.uk/rpms/egee-SA1/centos5/). Releases are advertised through the tool-admins mailing list and through the SAM blog: https://svnweb.cern.ch/trac/sam/blog
Configuration
General Configuration details
VO to run the tests as
The tests should be run as the ops VO. We will accept two DNs per ROC or NGI for testing purposes, and you can join the ops VO here: https://lcg-voms.cern.ch:8443/vo/ops/vomrs
For ROC Nagios submissions, you should also request the lcgadmin role (/ops/Role=lcgadmin), which is used for SAM and CERN ROC Nagios submissions.
For NGIs, you should join your corresponding /ops/NGI/* group, in which case you do not need the lcgadmin role.
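As a rough illustration of the two cases, the sketch below picks the FQAN to request depending on whether the box is a ROC or an NGI instance. The helper ops_fqan is hypothetical, and the NGI group name is an assumption: pass whatever group under /ops/NGI/ you actually joined.

```shell
#!/bin/sh
# Sketch: choose the ops FQAN for a ROC or an NGI Nagios instance.
# ops_fqan is a hypothetical helper; the NGI group name is supplied by the caller.
ops_fqan() {
    case $1 in
        roc)
            # ROC submissions use the lcgadmin role.
            printf '/ops/Role=lcgadmin\n'
            ;;
        ngi)
            # NGI submissions use the NGI's own group instead of a role.
            if [ -z "$2" ]; then
                printf 'ngi requires a group name\n' >&2
                return 1
            fi
            printf '/ops/NGI/%s\n' "$2"
            ;;
        *)
            printf 'unknown role: %s\n' "$1" >&2
            return 1
            ;;
    esac
}

# Usage, assuming the VOMS client tools are installed:
# voms-proxy-init --voms "ops:$(ops_fqan roc)"
```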
DNs already registered in OPS
VOs/users to be supported
For the web interface you should support the dteam VO at a minimum. This allows first-level support to access the web interface. ROC and site contacts will be added automatically from the GOCDB, and Nagios will be configured so that ROC and site contacts can resubmit jobs.
Topology sources
Currently we use SAM as the source of topology (i.e. site names and services at the site).
Probes to use
In these instances we don't import any remote probe results (e.g. SAM, NPM).
This leads to the following settings for all configurations:
NCG_TOPOLOGY_USE_SAM=true
NCG_TOPOLOGY_USE_GOCDB=false
NCG_TOPOLOGY_USE_ENOC=false
NCG_TOPOLOGY_USE_LDAP=false
NCG_VO=ops
VO_OPS_NCG_DEFAULT_VO_FQAN=/ops/Role=lcgadmin
NCG_PROBES_TYPE=local
NCG_REMOTE_USE_NAGIOS=true
NCG_REMOTE_USE_ENOC=false
NCG_REMOTE_USE_SAM=false
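One way to confirm that a configuration file carries exactly these values is sketched below. It assumes the settings are stored as plain unquoted KEY=VALUE lines; the helper name check_ncg_settings and the file path you pass to it are hypothetical, so adapt them to wherever your YAIM settings live.

```shell
#!/bin/sh
# Sketch: compare KEY=VALUE lines in a file against the common settings.
# check_ncg_settings is a hypothetical helper; values are assumed unquoted.
check_ncg_settings() {
    conf_file=$1
    rc=0
    if [ ! -r "$conf_file" ]; then
        printf 'cannot read %s\n' "$conf_file" >&2
        return 2
    fi
    while IFS='=' read -r key expected; do
        # Use the last assignment of the key, since a later line overrides an earlier one.
        actual=$(sed -n "s/^${key}=//p" "$conf_file" | tail -n 1)
        if [ "$actual" != "$expected" ]; then
            printf '%s: expected "%s", found "%s"\n' "$key" "$expected" "$actual"
            rc=1
        fi
    done <<'EOF'
NCG_TOPOLOGY_USE_SAM=true
NCG_TOPOLOGY_USE_GOCDB=false
NCG_TOPOLOGY_USE_ENOC=false
NCG_TOPOLOGY_USE_LDAP=false
NCG_VO=ops
VO_OPS_NCG_DEFAULT_VO_FQAN=/ops/Role=lcgadmin
NCG_PROBES_TYPE=local
NCG_REMOTE_USE_NAGIOS=true
NCG_REMOTE_USE_ENOC=false
NCG_REMOTE_USE_SAM=false
EOF
    return $rc
}

# Usage (the path is a placeholder):
# check_ncg_settings /path/to/your/settings-file
```

The function prints one line per mismatching or missing key and returns non-zero if any is found.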
Specific Configuration details
We consider two different possibilities, each of which requires slightly different additional configuration in YAIM:
Existing EGEE ROC
NAGIOS_ROLE=roc
ROC_NAME=<ROC_NAME>
NCG_GOCDB_ROC_NAME=<ROC_NAME>
A new NGI having sites in a ROC in GOCDB
NAGIOS_ROLE=ngi
ROC_NAME=<ROC_NAME>
NCG_GOCDB_ROC_NAME=<ROC_NAME>
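The two cases differ only in the value of NAGIOS_ROLE. A small sketch that prints the role-specific settings for either case follows; the helper name nagios_role_settings is hypothetical and "ExampleROC" is a placeholder for your ROC name as registered in GOCDB.

```shell
#!/bin/sh
# Sketch: print the role-specific YAIM settings for a role and ROC name.
# nagios_role_settings is a hypothetical helper, not part of YAIM.
nagios_role_settings() {
    role=$1
    roc_name=$2
    # Only the two roles covered in this document are accepted.
    case $role in
        roc|ngi) ;;
        *)
            printf 'unknown role: %s\n' "$role" >&2
            return 1
            ;;
    esac
    if [ -z "$roc_name" ]; then
        printf 'a ROC name is required\n' >&2
        return 1
    fi
    printf 'NAGIOS_ROLE=%s\nROC_NAME=%s\nNCG_GOCDB_ROC_NAME=%s\n' \
        "$role" "$roc_name" "$roc_name"
}

# Usage:
# nagios_role_settings ngi ExampleROC
```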
Contact
If you have any questions, do not hesitate to send an email to tool-admins AT mailman.egi.eu