How to validate a ROC or NGI Nagios box

Overview

The egee-NAGIOS packaging and configuration cover four different configurations:

  • Site-Nagios - Monitoring of a site
  • Regional-Nagios - Monitoring of an EGEE ROC
  • National-Nagios - Monitoring at the NGI level
  • Project-Nagios - Central project monitoring

This document covers what you need to do to configure a National-Nagios or Regional-Nagios so that it can take over the definitive testing role within EGEE from the Project-Nagios instance currently running.

Process for a National or Regional Nagios to get validated

A high-level description of the whole process:

  1. Join the regional-nagios-admins mailing list. Register here
  2. Register your node in GOCDB as the relevant flavour of Nagios (Regional-Nagios or National-Nagios).
  3. Register for access to the SAM PI:
    1. Open a GGUS ticket.
    2. Ask in the ticket for it to be assigned to the 'Nagios' support unit.
    3. State in the ticket the IP address of your Nagios instance and that you need access to the SAM PI to configure it.
  4. Install egee-NAGIOS using the relevant configuration below.
  5. Publish the GRIS running on your Nagios node into the information system.
  6. Start the validation process by following the steps below, which end with raising a GGUS ticket to the 'Nagios' support unit.

Validation Process

The project-level Nagios hosts are listed here: https://twiki.cern.ch/twiki/bin/view/EGEE/NagiosROCURL. The validation process consists of comparing the setup of a new regional or national Nagios against the current project-level Nagios instance.

In order to validate your instance, please follow these steps:

  1. Ensure that all the egee-sa1 packages are upgraded. For this, the following query shouldn’t return any data:
    [root ~]# repoquery --pkgnarrow=updates --disablerepo=\* --enablerepo=egee-sa1 -qa --queryformat ' yum update %{name} '
  2. Send the following information to sam-support@cern.ch:
    1. your ncg.conf file. The format of your file should be like this one.
    2. the result of the following query, executed on your Nagios box:
      [root ~]# nagios -v /etc/nagios/nagios.cfg | grep Checked | grep services
    3. the glite-UI version you are using (we are currently using glite-UI-version-3.2.*-0):
      [root ~]# rpm -qa | grep glite-UI
  3. Ensure that the ncg cron job is executed regularly (every 3 hours in our case): https://tomtools.cern.ch/jira/browse/SAM-402
  4. Check if your services are being tested by the metrics defined in the ROC SAM critical profile, described here: https://twiki.cern.ch/twiki/bin/view/LCG/MDDBProfilesSAM#ROC_SAM_critical
  5. Once you have done this, open a GGUS ticket to be assigned to the 'Nagios' support unit. Please state in the ticket which ROC/NGI Nagios instance is yours, so that we can:
    1. add the Nagios instance to the ops-monitor Nagios ( https://ops-monitor.cern.ch/nagios/ ) and compare its number of services and hosts with the project-level instance;
    2. compare the status of your services with those defined in the central Nagios instance at CERN.
    3. For a ROC validating an NGI Nagios instance, use the ops-monitor Nagios as follows:
      1. Select Service Groups --> Summary --> SERVICE_NagiosNGIDiff
      2. Select your NGI and then check ngi.nagios.diff, which describes the differences between the services and probes run by your ROC and your NGI. This check compares the metrics defined in the ROC_CRITICAL profile, i.e. it covers the sBDII, SRMv2 and CE services.
      3. Go to your NGI Nagios instance and check those services and metrics to see why they are failing there but not in the ROC Nagios instance.
      4. Once all those discrepancies are fixed or understood, the ROC considers the NGI Nagios instance validated.
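
Steps 1 and 2 above can be scripted roughly as follows. This is a minimal sketch and not part of the egee-NAGIOS tooling: the function names no_pending_updates and service_count are hypothetical, and the parsing assumes that nagios -v prints a line of the form "Checked N services.", which should be confirmed against the output on your own box.

```shell
#!/bin/sh
# Hypothetical helpers for the validation steps; the names are illustrative only.

# Reads the output of the repoquery command from step 1 on stdin.
# Succeeds only if that output contains no non-blank text,
# i.e. no egee-sa1 package has a pending update.
no_pending_updates() {
    ! grep -q '[^[:space:]]'
}

# Reads `nagios -v` output on stdin and prints the number taken from the
# first "Checked N services" line (assumed output format).
service_count() {
    awk '/Checked/ && /services/ { print $2; exit }'
}

# Example use on the Nagios box (commands copied from the steps above):
#   repoquery --pkgnarrow=updates --disablerepo=\* --enablerepo=egee-sa1 -qa \
#       --queryformat ' yum update %{name} ' | no_pending_updates \
#       || echo "egee-sa1 updates are pending"
#   nagios -v /etc/nagios/nagios.cfg | service_count
```

Both helpers read from stdin, so they can be tried against saved command output before being run on a live instance.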

List of ROC Nagios to validate.

Software version

Deploy the same component versions as those running on your corresponding CERN Nagios instance, namely those included in the meta package egee-NAGIOS-1.0.0-48.el5.noarch.rpm, available from the egee-SA1 repository: http://www.sysadmin.hep.ac.uk/rpms/egee-SA1/centos5/. Releases are advertised through the egee3-operations-automation-discuss and regional-nagios-admins mailing lists and through the SAM blog: https://svnweb.cern.ch/trac/sam/blog
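
A quick way to compare the installed meta package with the release named above is sketched below. The helper names installed_pkg and is_expected_release are hypothetical, and the expected string is simply the release quoted in this section; a newer release may have been announced on the mailing lists since.

```shell
#!/bin/sh
# Release named in the text above; adjust when a newer one is announced.
expected_pkg='egee-NAGIOS-1.0.0-48.el5.noarch'

# Prints the installed meta package as name-version-release.arch.
installed_pkg() {
    rpm -q --queryformat '%{NAME}-%{VERSION}-%{RELEASE}.%{ARCH}\n' egee-NAGIOS
}

# Succeeds if the package string given as $1 is the expected release.
is_expected_release() {
    [ "$1" = "$expected_pkg" ]
}

# Example use on the Nagios box:
#   is_expected_release "$(installed_pkg)" \
#       || echo "installed egee-NAGIOS differs from $expected_pkg"
```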

Configuration

General Configuration details

VO to run the tests as

The tests should be run as the ops VO. We will accept two DNs per ROC or NGI for testing purposes. You can join the ops VO here: https://lcg-voms.cern.ch:8443/vo/ops/vomrs

For ROC Nagios submissions, you should also request the lcgadmin role (/ops/Role=lcgadmin), which is being used for SAM and CERN ROC Nagios submissions.

For NGIs, you should join your corresponding /ops/NGI/* group, so you do not need to use the lcgadmin role.
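
To check by hand that a registered DN carries the right attributes, the role or group can be requested explicitly when creating a proxy. The sketch below only builds the argument string; the function name ops_voms_arg and the group name passed to it are hypothetical, and it assumes the standard VOMS clients (voms-proxy-init, voms-proxy-info) are available on the box.

```shell
#!/bin/sh
# Prints the value to pass to `voms-proxy-init --voms` for a Nagios role.
# $1 = roc or ngi; $2 = group name under /ops/NGI/ (ngi only, placeholder).
ops_voms_arg() {
    case "$1" in
        roc)
            printf 'ops:/ops/Role=lcgadmin\n'
            ;;
        ngi)
            [ -n "$2" ] || { echo "ngi role needs a group name" >&2; return 1; }
            printf 'ops:/ops/NGI/%s\n' "$2"
            ;;
        *)
            echo "unknown role: $1" >&2
            return 1
            ;;
    esac
}

# Example use:
#   voms-proxy-init --voms "$(ops_voms_arg roc)"
#   voms-proxy-info --fqan    # lists the attributes carried by the proxy
```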

DNs already registered in OPS

VOs/users to be supported

For the web interface you should support the dteam VO at a minimum. This will allow first-level support to access the web interface. ROC and site contacts will be automatically added from the GOCDB, and Nagios will be configured so that ROC and site contacts can resubmit jobs.

Topology sources

Currently we use SAM as the source of topology (i.e. site names and services at the site).

Probes to use

In these instances we don't import any remote probe results (e.g. SAM, NPM).

This leads to the following settings, common to all configurations:

NCG_TOPOLOGY_USE_SAM=true
NCG_TOPOLOGY_USE_GOCDB=false
NCG_TOPOLOGY_USE_ENOC=false
NCG_TOPOLOGY_USE_LDAP=false
NCG_VO=ops
VO_OPS_NCG_DEFAULT_VO_FQAN=/ops/Role=lcgadmin
NCG_PROBES_TYPE=local
NCG_REMOTE_USE_NAGIOS=true
NCG_REMOTE_USE_ENOC=false
NCG_REMOTE_USE_SAM=false
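
The common settings above can be checked against a YAIM configuration file with a small script. This is a sketch under assumptions: the function name missing_settings is hypothetical, the file path is whatever file holds your YAIM variables, and the comparison is an exact whole-line match, so a value written with quotes or trailing comments would be reported as missing.

```shell
#!/bin/sh
# Expected common settings, copied from the list above.
expected_settings='NCG_TOPOLOGY_USE_SAM=true
NCG_TOPOLOGY_USE_GOCDB=false
NCG_TOPOLOGY_USE_ENOC=false
NCG_TOPOLOGY_USE_LDAP=false
NCG_VO=ops
VO_OPS_NCG_DEFAULT_VO_FQAN=/ops/Role=lcgadmin
NCG_PROBES_TYPE=local
NCG_REMOTE_USE_NAGIOS=true
NCG_REMOTE_USE_ENOC=false
NCG_REMOTE_USE_SAM=false'

# Prints every expected KEY=VALUE line that is absent from the file $1.
# Returns 0 if nothing is missing, 1 if something is, 2 if $1 is unreadable.
missing_settings() {
    [ -r "$1" ] || { echo "cannot read $1" >&2; return 2; }
    # grep -v prints the expected lines that have no exact match in the file
    # and exits 0 when it printed at least one such line.
    if printf '%s\n' "$expected_settings" | grep -vxF -f "$1"; then
        return 1
    fi
    return 0
}

# Example use (the path is a placeholder):
#   missing_settings /path/to/your/yaim/config || echo "settings above are missing"
```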

Specific Configuration details

We consider two different possibilities, each of which requires slightly different additional configuration in YAIM:

Existing EGEE ROC

NAGIOS_ROLE=roc
ROC_NAME=<ROC_NAME>
NCG_GOCDB_ROC_NAME=<ROC_NAME>

A new NGI having sites in a ROC in GOCDB

NAGIOS_ROLE=ngi
ROC_NAME=<ROC_NAME>
NCG_GOCDB_ROC_NAME=<ROC_NAME>
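
The two cases above differ only in the value of NAGIOS_ROLE. As an illustration, the sketch below prints the three role-specific lines for a given role and ROC name; the function name role_settings is hypothetical and the ROC name in the example is a placeholder, not a real GOCDB entry.

```shell
#!/bin/sh
# Prints the role-specific YAIM lines shown above.
# $1 = roc or ngi; $2 = ROC name as registered in GOCDB.
role_settings() {
    case "$1" in
        roc|ngi) ;;
        *) echo "unknown role: $1" >&2; return 1 ;;
    esac
    [ -n "$2" ] || { echo "missing ROC name" >&2; return 1; }
    printf 'NAGIOS_ROLE=%s\nROC_NAME=%s\nNCG_GOCDB_ROC_NAME=%s\n' "$1" "$2" "$2"
}

# Example use (EXAMPLE_ROC is a placeholder):
#   role_settings ngi EXAMPLE_ROC
```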

Contact

If you have any questions, do not hesitate to send an email to egee3-operations-automation-discuss AT cern.ch or to regional-nagios-admins AT cern.ch
Topic revision: r30 - 2010-07-21 - DavidCollados
 