SAM to Nagios migration of sensors and tests

Two main cases are considered:

  1. running existing SAM tests without modification under Nagios
  2. developing service checks for Nagios

NB! The grid-monitoring-probes-org.sam RPM is available from the EGEE SA1 repository http://www.sysadmin.hep.ac.uk/rpms/egee-SA1/ (and also via the egee-NAGIOS meta RPM). For more information see link.

Nagios checks - return codes and output

A tiny "crash course" on Nagios checks. From version 3.x, Nagios assumes that checks can produce multi-line output. The first line is treated as the check's summary and everything after it as the check's details data. Return codes: 0 - OK, 1 - WARNING, 2 - CRITICAL, 3 - UNKNOWN (for more details see "Plugin Return Codes").
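The convention above can be sketched as a small helper. This is a minimal illustration written for Python 3, not code from any of the RPMs mentioned on this page; the names format_check_output and run_check are hypothetical.

```python
# Nagios plugin return codes, as listed above.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def format_check_output(state, summary, details=()):
    """Return (exit_code, text): one summary line followed by details lines."""
    if state not in STATES:
        state = 3  # assumption: report anything unexpected as UNKNOWN
    lines = ["%s: %s" % (STATES[state], summary)]
    lines.extend(details)
    return state, "\n".join(lines)


def run_check():
    """Hypothetical check body; a real check would probe a service here."""
    return format_check_output(
        2, "my summary",
        ["my details data line 1", "my details data line 2"])


# Typical use at the end of a check script:
#     code, text = run_check()
#     print(text)
#     sys.exit(code)
```

The helper returns the exit code instead of exiting itself, so it can be reused and tested without terminating the caller.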

Example of running a dummy Nagios check:

$ ./check_dummy
check_dummy: Could not parse arguments
Usage: check_dummy <integer state> [optional text]
$
$ ./check_dummy 2 "my summary
> my details data line 1
> my details data line 2"
CRITICAL: my summary
my details data line 1
my details data line 2
$
$ echo $?
2

$ ./check_dummy 2 "my summary
> my details data line 1
> my details data line 2" | nl
     1  CRITICAL: my summary
     2  my details data line 1
     3  my details data line 2
$

When the check runs under Nagios and Nagios collects its output, line 1 and lines 2 onward of the check's output (from the example above) are stored in two different containers: summary and details data (SERVICEOUTPUT and LONGSERVICEOUTPUT respectively, in Nagios terms). By default, Nagios reads only 4 KB of the data returned by a check (this can be overridden by recompiling Nagios). That means 4096 ASCII characters (including spaces and line feeds) for summary and details data together. The Nagios web GUI displays summary and details data separately; details data is displayed with line feeds preserved.
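The split described above can be approximated as follows. This is a simplified sketch in Python 3 and not Nagios source code: it ignores performance data (anything after a "|"), and it counts characters rather than bytes, which is the same thing only for ASCII output. The names split_plugin_output and MAX_PLUGIN_OUTPUT are hypothetical.

```python
MAX_PLUGIN_OUTPUT = 4 * 1024  # default read limit described above


def split_plugin_output(raw, limit=MAX_PLUGIN_OUTPUT):
    """Split raw check output into (summary, details).

    Everything beyond `limit` characters is dropped first, then the
    first line becomes the summary and the remainder the details data.
    """
    truncated = raw[:limit]
    summary, _, details = truncated.partition("\n")
    return summary, details.rstrip("\n")
```

A check that prints a very long first line therefore leaves no room for details data, which is one reason to keep the summary short.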

Running existing SAM tests under Nagios - samtest-run

A wrapper check, samtest-run, was developed to run (a subset of) existing SAM tests under Nagios. It is useful during the transition period.

$ rpm -qf --queryformat "%{NAME}\n" /usr/libexec/grid-monitoring/probes/org.sam/samtest-run
grid-monitoring-probes-org.sam
$

Usage

Usage of the wrapper check

$ /usr/libexec/grid-monitoring/probes/org.sam/samtest-run
Usage:
samtest-run [-d <path> -s <name> -m <test>] | [-f <pathToTest>] [-H <hostname>]
[-p] [-e <env,..>] [-t|--timeout sec] [-V] [-h|--help] [-v|--vo <VO>]
[-x proxy] [-w <path>] [-o "SAM test options"]

    Provide test with
 [-d <dir> -s <name> -m <test>] | [-f <pathToTest>]
    or
 -h for help
$

Output of running ./samtest-run -h

$ ./samtest-run -h
Usage:
samtest-run [-d <path> -s <name> -m <test>] | [-f <pathToTest>] [-H <hostname>]
[-p] [-e <env,..>] [-t|--timeout sec] [-V] [-h|--help] [-v|--vo <VO>]
[-x proxy] [-w <path>] [-o "SAM test options"]

   Mandatory parameters:
-d <path>          Directory where SAM sensors are located. Absolute path.
-s <name>          Name of SAM sensor
-m <test>          Name of a test to be run. Eg. SRMv2-get-SURLs. Assumes tests
                   are located under /<path>/<name>/tests/
-f <pathToTest>    Test specified by an absolute path.
   Optional parameters:
-H <hostname>      Hostname the service is running on. (Default: localhost)
                   Usually, SAM tests assume that first positional parameter is
                   the name of the host to test. However, this might not be the
                   case for all the tests. Thus, if not provided 'localhost'
                   default is used.
-p                 If you want your '/<path>/<name>/prepare-<name>' script to
                   be executed. Note: the script shouldn't contain any calls to
                   SAM binaries (eg. same-publish-tuples) - only code really
                   preparing an "environment" for the test to be executed.
-e <env,..>        Comma delimited list of KEY=value environment variables that
                   are required to be exported before launching the test.
-h|--help          Displays help.
-t|--timeout sec   Sets test's global timeout. (Default: 600)
-v|--vo <VO>       VO name to set as SAME_VO environment variable.
                   (Default: ops)
-w <path>          Working directory for checks.
                   (Default: /var/run/gridprobes/<VO>/same)
-x                 VOMS proxy (Order: X509_USER_PROXY, /tmp/x509up_u<UID>, -x)
-o "options"       Options to be passed to the SAM test
-V                 Displays version.

You must specify test ([-d, -s, -m] | [-f]) and hostname (-H)

This script runs a test script available as an executable in
<directory>/<name>/tests/<test>
  or
<pathToTest>

Arguments given with -o option are passed to the test script.

The script captures (in line buffered mode) stdout and stderr of the test
script and produces Nagios compliant output consisting of
- test status (on the first line)
- multi-line details data

SAM exit codes are mapped to Nagios ones.
    Nagios      | SAM
    0, OK       | 0, (10, ok)
    1, WARNING  | (40, warning)
    2, CRITICAL | 1, (50, error), (60, critical)
    3, UNKNOWN  | (20, info), (30, notice), (100, maintenance)
                |
    1, WARNING  | in other cases
$
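The exit code mapping printed in the help text above can be written as a lookup table. This is a minimal Python 3 sketch of that table, not the wrapper's actual implementation; the names SAM_TO_NAGIOS and sam_to_nagios are hypothetical.

```python
NAGIOS_OK, NAGIOS_WARNING, NAGIOS_CRITICAL, NAGIOS_UNKNOWN = 0, 1, 2, 3

# SAM exit code -> Nagios exit code, following the help text above.
SAM_TO_NAGIOS = {
    0: NAGIOS_OK, 10: NAGIOS_OK,                  # 0, (10, ok)
    40: NAGIOS_WARNING,                           # (40, warning)
    1: NAGIOS_CRITICAL, 50: NAGIOS_CRITICAL,      # 1, (50, error)
    60: NAGIOS_CRITICAL,                          # (60, critical)
    20: NAGIOS_UNKNOWN, 30: NAGIOS_UNKNOWN,       # (20, info), (30, notice)
    100: NAGIOS_UNKNOWN,                          # (100, maintenance)
}


def sam_to_nagios(sam_code):
    """Map a SAM exit code to a Nagios one; any other code gives WARNING."""
    return SAM_TO_NAGIOS.get(sam_code, NAGIOS_WARNING)
```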

Detailed description

Detailed description of what the wrapper does:

  • sets SAM compliant environment for the test. Variables accessible are:
variable             | value
SAME_HOME *          | SAME_HOME=<path>/.. (one level up in the directory tree). In SAM corresponds to /opt/lcg/same/client/sensors/..
SAME_SENSOR_NAME *   | SAME_SENSOR_NAME=<name>
SAME_SENSOR_HOME *   | SAME_SENSOR_HOME=<path>/<name>
SAME_WORK            | default /var/run/gridprobes/$SAME_VO/same, or SAME_WORK=<path> if given by -w <path>. In SAM corresponds to $HOME/.same/
SAME_SENSOR_WORK *   | SAME_SENSOR_WORK=$SAME_WORK/$SAME_SENSOR_NAME
SAME_TEST_WORK *     | SAME_TEST_WORK=$SAME_SENSOR_WORK/nodes/<hostname>
SAME_TEST_DIRNAME ** | SAME_TEST_DIRNAME=dirname(<pathToTest>)
SAME_VO              | ops by default or given by [-v|--vo <VO>]
SAME_TIMEOUT         | 600 by default or given by [-t|--timeout sec]
SAME_OK              | 10
SAME_INFO            | 20
SAME_NOTICE          | 30
SAME_WARNING         | 40
SAME_ERROR           | 50
SAME_CRITICAL        | 60
SAME_MAINTENANCE     | 100

* set only if -d <path>, -s <name> and -m <test> are given

** set only if -f <pathToTest> is given

  • exports environment variables provided with -e
  • creates $SAME_WORK and $SAME_SENSOR_WORK if they do not exist
  • creates $SAME_SENSOR_WORK/nodes/<hostname> if it doesn't exist
  • sets a global timeout (alarm) on the run of the SAM test (600 sec by default); when the timeout is reached, gracefully kills the whole process group; stdout/stderr gathered from the child processes before the timeout is provided as details data
  • sets a SIGTERM handler; if SIGTERM is caught, the script exits gracefully, printing out the status and the details data gathered from the launched script before the signal was caught
  • if -p was specified, /<path>/<name>/prepare-<name> is executed before launching the test. The prepare-<name> script should be modified to contain only the code that prepares the environment the test needs to run correctly. Calls to SAM binaries (eg. same-publish-tuples) should be suppressed
  • forks SAM test with stdout/stderr joined
  • collects the test's output in line-buffered mode
  • picks up the test's exit code
  • translates the test's output and exit code to Nagios compliant output and exit code respectively
    • output (Nagios expects first line of the output to be summary of a check execution and all the rest - details data):
      • the wrapper checks whether the last line of the test's output starts with "summary: ". If it does, the part of the line that follows "summary: " (up to 256 characters) is taken as the summary data. The summary data is prefixed with the textual representation of the test's SAM exit code mapped to the Nagios one (OK, WARNING, CRITICAL, UNKNOWN), eg. "CRITICAL: unable to copy file". If no "summary: " line is found, only the textual representation of the exit code is printed to Nagios as the summary.
      • the rest of the test's output is treated as details data (and is output to Nagios following the summary line).
    • exit code:
      • the following mapping is used
SAM                                                         | Nagios
(10, SAME_OK)                                               | 0, OK
(40, SAME_WARNING)                                          | 1, WARNING
(50, SAME_ERROR), (60, SAME_CRITICAL)                       | 2, CRITICAL
(20, SAME_INFO), (30, SAME_NOTICE), (100, SAME_MAINTENANCE) | 3, UNKNOWN
in other cases                                              | 1, WARNING
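The output translation step described in the list above can be sketched as follows. This is an illustration in Python 3 under stated assumptions, not the samtest-run source: the function takes an already mapped Nagios exit code, and an empty "summary: " line is treated the same as a missing one (the text above does not say what samtest-run does in that case). The name to_nagios_output is hypothetical.

```python
STATE_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}
SUMMARY_PREFIX = "summary: "
SUMMARY_MAX = 256  # characters of summary data kept, as described above


def to_nagios_output(test_output, nagios_code):
    """Build Nagios output: a status/summary line followed by details data."""
    lines = test_output.splitlines()
    state = STATE_NAMES.get(nagios_code, "UNKNOWN")
    first_line = state
    if lines and lines[-1].startswith(SUMMARY_PREFIX):
        # The summary line is removed from the details and moved to the top.
        summary = lines.pop()[len(SUMMARY_PREFIX):][:SUMMARY_MAX]
        if summary:
            first_line = "%s: %s" % (state, summary)
    return "\n".join([first_line] + lines)
```

For example, a test that prints two progress lines and then "summary: unable to copy file" and exits with a code mapped to 2 yields "CRITICAL: unable to copy file" as the first line, followed by the two progress lines.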

Examples

Examples. No modifications to tests were required:

  • running /opt/lcg/same/client/sensors/LFC_C/LFC_C-ping prod-lfc-shared-central.cern.ch
$ ./samtest-run -d /opt/lcg/same/client/sensors -s LFC_C -m LFC_C-ping -H prod-lfc-shared-central.cern.ch
OK: 1.6.8-1
...
Executing lfc-ping on LFC node prod-lfc-shared-central.cern.ch:
<pre>
server version: 1.6.8-1
</pre>
$ echo $?
0
$

  • running /opt/lcg/same/client/sensors/SRMv2/SRMv2-{get-SURLs,put,del} gridsrm.pi.infn.it

#
# sandbox for sensors is empty
#
$ ll /var/run/gridprobes/ops/same/
total 0
$

#
# getting SRM endpoint and SAPath
#
$ ./samtest-run -d /opt/lcg/same/client/sensors -s SRMv2 -m SRMv2-get-SURLs -H gridsrm.pi.infn.it
OK
...
<h2>Checking full endpoint and default storage areas in BDII</h2>
...
+ ldapsearch -l 7 -LLL -h lcg-bdii.cern.ch:2170 -x -b o=grid '(|(&(objectClass=GlueSA)(GlueChunkKey=GlueSEUniqueID=gridsrm.pi.infn.it)(|(GlueSAAccessControlBaseRule=ops)(GlueSAAccessControlBaseRule=VO:ops))(GlueSALocalID=ops)) (&(objectClass=GlueService)(GlueServiceUniqueID=httpg://gridsrm.pi.infn.it*/srm/managerv2)(GlueServiceVersion=2.2.0)) )' GlueServiceEndpoint GlueSAPath
+ set +x
...
SRMv2 endpoint:
httpg://gridsrm.pi.infn.it:8444/srm/managerv2
Storage Area path:
/ops
...
<p>Test status: <b>OK</b></p>
$ echo $?
0

#
# working directories for the test were created
#
$ ll /var/run/gridprobes/ops/same/
total 4
drwxrwxr-x  3 500 500 4096 Mar 23 09:31 SRMv2
$ ll /var/run/gridprobes/ops/same/SRMv2/nodes/gridsrm.pi.infn.it/
total 12
-rw-rw-r--  1 500 500 20 Mar 23 09:31 comm
-rw-rw-r--  1 500 500 46 Mar 23 09:31 endpoint.txt
-rw-rw-r--  1 500 500  5 Mar 23 09:31 SApath.txt
$

#
# "prepare" script for SRMv2 sensor
#
$ cat /opt/lcg/same/client/sensors/SRMv2/prepare-SRMv2
#!/bin/bash
echo -n "0123456789" > $SAME_SENSOR_WORK/testFile.txt
cksum $SAME_SENSOR_WORK/testFile.txt > $SAME_SENSOR_WORK/testFile.cksum
$

#
# SRMv2-put requires sensor's "prepare-SRMv2" to be run first - "-p" parameter
#
$ ./samtest-run -d /opt/lcg/same/client/sensors -p -s SRMv2 -m SRMv2-put -H gridsrm.pi.infn.it
OK
...
Testing SRMv2 endpoint
<pre>srm://gridsrm.pi.infn.it:8444/srm/managerv2</pre>
...
Testing with SURL:
srm://gridsrm.pi.infn.it:8444/srm/managerv2?SFN=/ops/testfile-cp-20090323-093351-a910e5c6dca9.txt
...
+ lcg-cp -t 120 -b --vo ops -D srmv2 -U srmv2 -v file:/var/run/gridprobes/ops/same/SRMv2/testFile.txt 'srm://gridsrm.pi.infn.it:8444/srm/managerv2?SFN=/ops/testfile-cp-20090323-093351-a910e5c6dca9.txt'
        41472 bytes     29.95 KB/sec avg     29.95 KB/sec instDestination SE type: SRMv2
Destination SRM Request Token: f4bae3e6-e09f-4ffc-9eb6-8aa04fb53e47
Source URL: file:/var/run/gridprobes/ops/same/SRMv2/testFile.txt
File size: 41472
Source URL for copy: file:/var/run/gridprobes/ops/same/SRMv2/testFile.txt
Destination URL: gsiftp://gridsrm.pi.infn.it:2811/gpfs/gpfsift/srm/ops/testfile-cp-20090323-093351-a910e5c6dca9.txt
# streams: 1
# set timeout to  120 (seconds)

Transfer took 2070 ms
+ retcode=0
+ set +x
...
<b>OK</b>: File srm://gridsrm.pi.infn.it:8444/srm/managerv2?SFN=/ops/testfile-cp-20090323-093351-a910e5c6dca9.txt copied successfully.
</p>
<p>Test status: <b>OK</b></p>
$ echo $? 
0

#
# file was successfully copied to SRM and its SURL stored in the sensor's sandbox
#
$ cat /var/run/gridprobes/ops/same/SRMv2/nodes/gridsrm.pi.infn.it/testFile.surl
/ops srm://gridsrm.pi.infn.it:8444/srm/managerv2?SFN=/ops/testfile-cp-20090323-093351-a910e5c6dca9.txt
$

#
# deleting file from SRM
#
$ ./samtest-run -d /opt/lcg/same/client/sensors -s SRMv2 -m SRMv2-del -H gridsrm.pi.infn.it
OK
...
Testing SRMv2 endpoint
<pre>srm://gridsrm.pi.infn.it:8444/srm/managerv2</pre>
...
Testing with SURL:
..
srm://gridsrm.pi.infn.it:8444/srm/managerv2?SFN=/ops/testfile-cp-20090323-093351-a910e5c6dca9.txt
...
+ lcg-del -t 120 -v -b -l -D srmv2 -T srmv2 --vo ops 'srm://gridsrm.pi.infn.it:8444/srm/managerv2?SFN=/ops/testfile-cp-20090323-093351-a910e5c6dca9.txt'
VO name: ops
Timeout: 120 seconds
SE type: SRMv2
srm://gridsrm.pi.infn.it:8444/srm/managerv2?SFN=/ops/testfile-cp-20090323-093351-a910e5c6dca9.txt - DELETED
+ retcode=0
+ set +x
...
<b>OK</b>: File srm://gridsrm.pi.infn.it:8444/srm/managerv2?SFN=/ops/testfile-cp-20090323-093351-a910e5c6dca9.txt was deleted successfully.
<p>Test status: <b>OK</b></p>
$ echo $?
0
$

SAM tests that can be run with samtest-run

SAM critical tests

List of EGEE/WLCG Critical Probes used for Availability Metrics Calculations.

  • CE
test | command | modifications
CE-sft-brokerinfo | ./samtest-run -f /path/CE-sft-brokerinfo | no
CE-sft-caver | ./samtest-run -f /path/CE-sft-caver -o "-c /path/ca_data.conf -b /path/ca_data.dat" | no
CE-sft-csh | ./samtest-run -f /path/CE-sft-csh | no
CE-sft-softver | ./samtest-run -d /path/ -s testjob -m CE-sft-softver | no
"CE"-host-cert-valid | ./samtest-run -p -d /path/ -s host-cert -m host-cert-valid -o "ce111.cern.ch CE" | prepare-host-cert - only call to make a required test utility was left
CE-sft-lcg-rm | ... | wrapper for CE-sft-lcg-rm-* - can't be run
CE-sft-lcg-rm-gfal | ./samtest-run -d /path/ -s testjob -m CE-sft-lcg-rm-gfal | no
CE-sft-lcg-rm-free | ./samtest-run -d /path/ -s testjob -m CE-sft-lcg-rm-free | no
CE-sft-lcg-rm-cr | ./samtest-run -d /path/ -s testjob -m CE-sft-lcg-rm-cr -e LFC_HOST=prod-lfc-shared-central.cern.ch,LFC_HOME=/grid/ops/SAM | no
CE-sft-lcg-rm-cp | ./samtest-run -d /path/ -s testjob -m CE-sft-lcg-rm-cp -e LFC_HOST=prod-lfc-shared-central.cern.ch,LFC_HOME=/grid/ops/SAM | no
CE-sft-lcg-rm-rep | ./samtest-run -d /opt/lcg/same/client/sensors/ -s testjob -m CE-sft-lcg-rm-cr -e SAME_CENTRAL_SE=lxdpm104.cern.ch,LFC_HOST=prod-lfc-shared-central.cern.ch | no
CE-sft-lcg-rm-del | ./samtest-run -d /opt/lcg/same/client/sensors/ -s testjob -m CE-sft-lcg-rm-del -e LFC_HOST=prod-lfc-shared-central.cern.ch,LFC_HOME=/grid/ops/SAM | no

  • SRMv2
test | command | modifications
SRMv2-get-SURLs | ./samtest-run -d /path/ -s SRMv2 -m SRMv2-get-SURLs -H lxdpm104.cern.ch | no
SRMv2-ls-dir | ./samtest-run -d /path/ -s SRMv2 -m SRMv2-ls-dir -H lxdpm104.cern.ch | no
SRMv2-put | ./samtest-run -p -d /path/ -s SRMv2 -m SRMv2-put -H lxdpm104.cern.ch | prepare-SRMv2 - only code for creation of a test file was left (which could have been moved to SRMv2-put test itself)
SRMv2-ls | ./samtest-run -d /path/ -s SRMv2 -m SRMv2-ls -H lxdpm104.cern.ch | no
SRMv2-gt | ./samtest-run -d /path/ -s SRMv2 -m SRMv2-gt -H lxdpm104.cern.ch | no
SRMv2-get | ./samtest-run -d /path/ -s SRMv2 -m SRMv2-get -H lxdpm104.cern.ch | no
SRMv2-del | ./samtest-run -d /path/ -s SRMv2 -m SRMv2-del -H lxdpm104.cern.ch | no

SAM non-critical tests

  • LFC_C
test | command | modifications
LFC_C-ping | ./samtest-run -d /path/ -s LFC_C -m LFC_C-ping -H prod-lfc-shared-central.cern.ch | no
LFC_C-ls | ./samtest-run -d /path/ -s LFC_C -m LFC_C-ls -H prod-lfc-shared-central.cern.ch | no
LFC_C-writefile | ./samtest-run -d /path/ -s LFC_C -m LFC_C-writefile -H prod-lfc-shared-central.cern.ch | no

  • LFC_L
test | command | modifications
LFC_L-ping | ./samtest-run -d /path/ -s LFC_L -m LFC_L-ping -H lfc.triumf.ca | no
LFC_L-ls | ./samtest-run -d /path/ -s LFC_L -m LFC_L-ls -H lfc.triumf.ca | no

  • TODO: extend the list

List of SAM tests that require certain modifications to be able to run with samtest-run

test modifications command
working on it right now

Writing probes for Nagios

Nagios checks

For development of Nagios native checks please see Nagios plug-in development guidelines.

"Semi-Nagios" check. Running with nagtest-run

"Semi-Nagios" check is the one that is:

For such checks a wrapper exists - nagtest-run.

$ rpm -qf --queryformat "%{NAME}\n" /usr/libexec/grid-monitoring/probes/org.sam/nagtest-run
grid-monitoring-probes-org.sam
$

Usage

Usage of the wrapper check
$ /usr/libexec/grid-monitoring/probes/org.sam/nagtest-run
Usage:
nagtest-run -m <pathToTest> -H <hostname> -s <service> [-t|--timeout sec]
[-V] [-e <env,..>] [-h|--help] [--vo <VO>] [-x proxy] [-w <path>]
[-v <0-3>] [-o "test options"]

    Provide test with mandatory parameters
 -m <pathToTest> -s <service> -H <hostname>
    or
 -h for help

Output of running ./nagtest-run -h

$ ./nagtest-run -h
Usage:
nagtest-run -m <pathToTest> -H <hostname> -s <service> [-t|--timeout sec]
[-V] [-e <env,..>] [-h|--help] [--vo <VO>] [-x proxy] [-w <path>]
[-v <0-3>] [-o "test options"]

   Mandatory parameters:
-m <pathToTest>    Test specified by an absolute path.
-s <service>       Name of a service to be tested.
-H <hostname>      Hostname to test. Passed to test.
   Optional parameters:
-h|--help          Displays help
-t|--timeout sec   Sets test's global timeout. Passed to test as (-t).
                   (Default: 600)
--vo <VO>          VO name to set as NAG_VO environment variable.
                   (Default: ops)
-v <0-3>           Verbosity level. Passed to the test. (Default: 0)
-e <env,..>        Comma delimited list of KEY=value environment variables that
                   are required to be exported before launching the test.
-w <path>          Working directory for checks.
                   (Default: /var/run/gridprobes/<VO>/nag)
-x                 VOMS proxy (Order: X509_USER_PROXY, /tmp/x509up_u<UID>, -x)
-o "test options"  Options to be passed to the test.
-V                 Displays version.

You must specify test (-m), service name (-s) and hostname (-H)

The wrapper script runs a test script available as an executable <pathToTest>.

Arguments given with -o option are passed to the test script.

The script captures (in line buffered mode) stdout and stderr of the test
script and produces Nagios compliant output consisting of
- test status (on the first line)
- multi-line details data
The script assumes that test it runs returns 0-3. If return code is greater
than 3 it issues WARNING instead.
$

Detailed description

Detailed description of what the wrapper does:

  • command line parameters defined by Nagios and the ones that will be passed to tests.
          -V version (--version)
          -h help (--help)
          -t timeout (--timeout) *
          -w warning threshold (--warning)
          -c critical threshold (--critical)
          -H hostname (--hostname) *
          -v verbose (--verbose) *

  • variables available for checks

variable          | value
NAG_WORK          | <path>/<VO>/nag
NAG_SERVICE_NAME  | <service>
NAG_SERVICE_WORK  | $NAG_WORK/$NAG_SERVICE_NAME
NAG_CHECK_WORK    | $NAG_SERVICE_WORK/nodes/<hostname>
NAG_CHECK_DIRNAME | dirname(<pathToTest>)
NAG_OK            | 0
NAG_WARNING       | 1
NAG_CRITICAL      | 2
NAG_UNKNOWN       | 3
NAG_VO            | ops or <VO>
NAG_TIMEOUT       | 600 or -t|--timeout sec

  • exports environment variables provided with -e
  • creates $NAG_WORK, $NAG_SERVICE_WORK and $NAG_CHECK_WORK if they do not exist
  • sets a global timeout (alarm) on the run of the test (600 sec by default); when the timeout is reached, gracefully kills the whole process group; stdout/stderr gathered from the child processes before the timeout is provided as details data
  • sets a SIGTERM handler; if SIGTERM is caught, the script exits gracefully, printing out the status and the details data gathered from the launched script before the signal was caught
  • forks the test with stdout/stderr joined
  • collects the test's output in line-buffered mode
  • picks up the test's exit code
  • translates the test's output to Nagios compliant output. Checks test's exit code.
    • output:
      • summary of the test should go on the last line. Possibilities:
        • 256 characters of the last line
        • if line starts with "summary: " - 256 characters after this clause. If it happens to be empty, then:
          • textual representation of exit code - mapping {0 - OK, 1 - WARNING, 2 - CRITICAL, 3 - UNKNOWN }
      • previously printed data is considered as details data
    • exit code
      • checked if exit code is in range 0-3. If greater than 3 - WARNING is issued.
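The "run the test with a global timeout and kill the whole process group" steps in the list above can be sketched as follows. This is a Python 3, POSIX-only illustration and not the wrapper's code: it uses the timeout of subprocess.Popen.communicate instead of an alarm signal, it does not install a SIGTERM handler, and the name run_with_timeout is hypothetical.

```python
import os
import signal
import subprocess


def run_with_timeout(argv, timeout):
    """Run argv with stdout/stderr joined; return (exit_code, output, timed_out).

    The child is started in its own session, so its pid is also the id of
    a new process group that can be signalled as a whole on timeout.
    """
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # join stderr into stdout
        start_new_session=True,     # new session => new process group
        universal_newlines=True,
    )
    try:
        output, _ = proc.communicate(timeout=timeout)
        return proc.returncode, output, False
    except subprocess.TimeoutExpired:
        # Terminate the test and everything it spawned.
        os.killpg(proc.pid, signal.SIGTERM)
        # Output produced before the timeout is still returned here.
        output, _ = proc.communicate()
        return None, output, True
```

On timeout the sketch returns None as the exit code and leaves the choice of the Nagios status to the caller, since the text above does not state which status the wrapper reports in that case.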

Examples

  • example showing how nagtest-run treats summary and details data, and passes parameters to tests.


# A dummy test script.
#
$ cat /usr/libexec/grid-monitoring/probes/org.sam/tests/nagrun-test.py
#!/usr/bin/env python
import sys, getopt
host = 'dummy'; s = 'last'
def get_opts(argv):
        global host, s
        opts,_ = getopt.getopt(argv[1:],'H:s:')
        for o,v in opts:
                if o == '-s':
                        s = v
                elif o == '-H':
                        host = v
get_opts(sys.argv)
print "line 1 - details, testing host: "+host
print "line 2 - details, testing host: "+host
if s == 'sl': # last line with "summary: some text"
        print "summary: "+"."*(250-2)+"0123456789"
elif s == 'sle': # last line with "summary: "
        print "summary: "
else:
        print "."*(250-2)+"0123456789"
sys.exit(2)
$

#
# Launch w/o options. -H was passed to the test and last line got truncated to 256 chars,
# and printed as Nagios summary (first line).
#
$ ./nagtest-run -m /usr/libexec/grid-monitoring/probes/org.sam/tests/nagrun-test.py -s TEST -H localhost
/home/konstan/eclipse/workspace/sam-probes-branch-noWLCG/src/checks/nag1.py -H localhost
........................................................................................................................................................................................................................................................0123456
line 1 - details, testing host: localhost
line 2 - details, testing host: localhost
$ echo $?
2
$


#
# Launch with "-s sl". Test printed out "summary: ..." as last line. Only part after 
# "summary: " was taken and truncated to 256 chars, and printed as Nagios summary (first line).
#
$ ./nagtest-run -m /usr/libexec/grid-monitoring/probes/org.sam/tests/nagrun-test.py -s TEST -H localhost -o "-s sl"
/home/konstan/eclipse/workspace/sam-probes-branch-noWLCG/src/checks/nag1.py -H localhost -s sl
........................................................................................................................................................................................................................................................0123456
line 1 - details, testing host: localhost
line 2 - details, testing host: localhost
$ echo $?
2

#
# Launch with "-s sle". Test printed out "summary: ". Wrapper put textual representation of the test's exit 
# code as the summary.
#
$ ./nagtest-run -m /usr/libexec/grid-monitoring/probes/org.sam/tests/nagrun-test.py -s TEST -H localhost -o "-s sle"
/home/konstan/eclipse/workspace/sam-probes-branch-noWLCG/src/checks/nag1.py -H localhost -s sle
CRITICAL
line 1 - details, testing host: localhost
line 2 - details, testing host: localhost
$ echo $?
2
$                                    

  • "Nagios-enabled" version of SAM's LFC-ping. Nothing was changed except return codes. Note: --nopass-H is needed to say that hostname must be passed to the test w/o explicitly specifying -H parameter.

$ ./nagtest-run -m /usr/libexec/grid-monitoring/probes/eu.hec/checks/LFC-ping -H prod-lfc-shared-central.cern.ch -s LFC --nopass-H
1.6.8-1
Testing from host: kvs.cern.ch
DN: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=kskaburs/CN=658461/CN=Konstantin Skaburskas
Netork timeout on LFC: LFC_CONNTIMEOUT=5 LFC_CONRETRY=1 LFC_CONRETRYINT=2

Executing lfc-ping on LFC node prod-lfc-shared-central.cern.ch:
server version: 1.6.8-1
$ echo $?
0
$

Python based probes using 'gridmon' package from python-GridMon RPM

Python 'gridmon' package provides metric base class (probe.MetricGatherer) and a framework for writing Python based Nagios compliant probes and metrics.

Binaries

For org.sam probes developed using 'gridmon' package see SAM Probes and Metrics.

Sources

'gridmon' package from python-GridMon RPM

org.sam probes and metrics

Naming

Proposed naming conventions:

  • /path/<nameSpace>/<serviceAbbreviation>-probe - a probe is an executable script containing a set of metrics that test a particular service. <nameSpace> can be, eg., org.<VOname>. For example, in the case of SAM: /usr/libexec/grid-monitoring/probes/org.sam/SRM-probe.
  • metric - a test examining a particular functionality of a service. Metrics can follow the naming "convention" <nameSpace>.<serviceAbbreviation>-<testName>. Eg., org.sam/SRM-probe contains a number of metrics: org.sam.SRM-LsDir, org.sam.SRM-Put, org.sam.SRMv2-Del etc.

Probes/metrics naming (in short):

Probe                                   | Metric
<nameSpace>.<serviceAbbreviation>-probe | <nameSpace>.<serviceAbbreviation>-<testName>
org.sam.SRM-probe                       | org.sam.SRM-LsDir
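The proposed conventions can be expressed as two small helpers. This is only an illustration in Python 3; the function names probe_path and metric_name are hypothetical and not part of the 'gridmon' package.

```python
import posixpath


def probe_path(base_dir, namespace, service):
    """Return /path/<nameSpace>/<serviceAbbreviation>-probe."""
    return posixpath.join(base_dir, namespace, "%s-probe" % service)


def metric_name(namespace, service, test_name):
    """Return <nameSpace>.<serviceAbbreviation>-<testName>."""
    return "%s.%s-%s" % (namespace, service, test_name)
```

For example, metric_name("org.sam", "SRM", "LsDir") gives "org.sam.SRM-LsDir".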

Writing a probe

Please check org.sam/T-probe (template probe).

Skeleton of your probe:

# my imports + 'probe' module from 'gridmon' package
import sys
from gridmon import probe

class MYMetrics(probe.MetricGatherer):
   # my custom attributes
   def __init__(self, *args):
      # pass constructor arguments through to the base class
      probe.MetricGatherer.__init__(self, *args)
      # mandatory initialization code
      # my custom initialization code
   def metricMyMet1(self):
      # metric body
      pass
   def metricMyMet2(self):
      # metric body
      pass

runner = probe.Runner(MYMetrics, probe.ProbeFormatRenderer())
sys.exit(runner.run(sys.argv))

Simple step-by-step example:

  • import probe module from gridmon package.
import sys
try:
    from gridmon import probe
except ImportError,e:
    # report the failed import as UNKNOWN (Nagios return code 3)
    print "UNKNOWN: Error loading modules : %s" % (e)
    sys.exit(3)

  • derive your metrics class from probe.MetricGatherer class; set some class attributes
class LFCMetrics(probe.MetricGatherer):
   # dictionary describing metrics implemented in the class
   __metrics = { 'MyMet1':{'metricDescription':'My metric one'},
                 'MyMet2':{'metricDescription':'My metric two'} }

  • metrics description in a dictionary
class LFCMetrics(probe.MetricGatherer):
   __metrics = {

}

Reusing metrics implemented in classes of gridmetrics.*metrics modules

To reuse one or more metrics implemented in the metric classes defined in gridmetrics.*metrics modules (e.g., gridmetrics.srmmetrics.SRMMetrics) and add your own metrics alongside them, you need the following:

from gridmetrics import srmmetrics

class SRMMetrics(srmmetrics.SRMMetrics):
    my_metrics = {
                      'MyPut' : { 
                           # required keys
                           'metricDescription' : "this metric does the following...",
                           'metricLocality'    : 'local',
                           'metricType'        : 'status',
                           'metricVersion'     : '0.1',
                           # optional keys - example
                           'cmdLineOptions'    : ['a=','b=','c='],
                           'metricChildren'    : []
                           }}

    def __init__(self, tuples):
        self.set_metrics(self.my_metrics)
        srmmetrics.SRMMetrics.__init__(self, tuples, 'SRM')
        ...
        # not needed
        #self.set_metrics(self._metrics)

Varia

Nagios sends SIGKILL to the launched checks when service_check_timeout is reached. As of Monday, 7 June 2010, it is set to service_check_timeout=910 sec.

Perl based

Nagios grid probes by hr.srce

Migration of HEP VOs SAM tests

Follow the link SAM to Nagios - Practical Hints to get the technical details to migrate.
If you want to publish experiment specific tests directly into MSG and make Nagios consume them, please follow this HowTo ExperimentTestsMSGPublisher

ALICE

Follow the link SAM to Nagios - ALICE

ATLAS

Follow the link SAM to Nagios - ATLAS

CMS

Follow the link SAM to Nagios - CMS

LHCb

Follow the link SAM to Nagios - LHCb

Meetings related to the migration of VO tests into Nagios

Follow the link Meetings VOs SAM tests into Nagios

Running checks under Nagios

Environment

NB! [06 Jun 2009]

Nagios:

  • doesn't clean the environment before launching checks
  • adds NAGIOS_* variables
  • configuration parameter child_processes_fork_twice=[0|1] doesn't affect environment

  • Nagios server
    • when started from init or by service nagios start, Nagios's environment will be very limited and will definitely not contain grid environment.
    • Nagios server started from interactive shell via init.d launch script (/etc/init.d/nagios start) inherits the shell's environment. The very same environment will be propagated to checks.

  • WN
    • on WNs the Nagios process is started as a background process (not a daemon) and with the child_processes_fork_twice=0 configuration parameter in nagios.cfg. The primary purpose is to be friendly to the LRMS, i.e., not to leave runaway processes.
    • Nagios propagates the full environment of the user it was started by to the checks. I.e., if a check fails due to problems with importing (shared) libraries/modules, "command not found" etc., the cause is badly defined paths in the login shell.
    • script that launches Nagios on WNs doesn't change environment.

  • Observations:
    • it was observed that Nagios leaks environment changes from one check to other checks. This was observed in the following situation: Python checks launched w/o any NCG wrappers were getting the grid environment set. The environment ($PYTHONPATH) corresponded to the one defined in the /usr/lib/perl5/site_perl/5.8.5/GridMon/sgutils.pm module, which is used by NCG Perl Nagios wrappers (and Perl grid checks).
      • Just speculating... This apparently has something to do with Nagios's internal Perl interpreter, which loads the client's libraries; if they contain statements that modify %ENV, the modifications are applied to the whole Nagios process. Thus, all subsequent checks (regardless of whether they are Perl or not) get this modified environment.
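The mechanism speculated about above (a change made to the environment of the parent process is inherited by every child started afterwards) can be demonstrated in isolation. This Python 3 sketch only models that general behaviour of process environments; it does not reproduce or confirm what Nagios or its embedded Perl interpreter actually do. The variable name DEMO_LEAKED_PATH and the function names are hypothetical.

```python
import os
import subprocess
import sys


def child_sees(var):
    """Return the value of `var` as seen by a freshly started child process."""
    code = "import os; print(os.environ.get(%r, '<unset>'))" % var
    out = subprocess.check_output([sys.executable, "-c", code],
                                  universal_newlines=True)
    return out.strip()


def demo(var="DEMO_LEAKED_PATH"):
    """Show that a change in the parent's environment reaches later children."""
    os.environ.pop(var, None)
    before = child_sees(var)            # child started before the change
    os.environ[var] = ":/opt/example"   # stands in for a %ENV change in the parent
    after = child_sees(var)             # child started after the change
    os.environ.pop(var, None)           # clean up
    return before, after
```

Calling demo() returns the value seen by a child started before the change and by one started after it, so only the second child sees ":/opt/example".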

06/03/2009 03:04 PM

Here is an example. Please, try to follow till the end.

Configuration for org.sam.SRM-All-ops:

org.sam.SRM-All-ops 
ncg_check_native!/usr/libexec/grid-monitoring/probes/org.sam/SRM-probe!600!-x 
$USER2$ --vo ops --ldap-uri alice003.nipne.ro

where ncg_check_native is:

define command{
         command_name                    ncg_check_native
         command_line                    $ARG1$ -H $HOSTNAME$ -t $ARG2$ $ARG3$
}

which (by my understanding) instructs Nagios to fork the probe 
path/org.sam/SRM-probe + arguments. No actual ncg wrapper here! Assuming that 
Nagios cleans env before launching each check the above one should fail (on 
"import lcg_util") all the time... But it doesn't fail, if in 
/usr/lib/perl5/site_perl/5.8.5/GridMon/sgutils.pm one sets

$ENV{PYTHONPATH}="$PYTHOPATH:/opt/lcg/lib/python2.3/site-packages:/opt/lcg/lib/python:/opt/glite/lib/python2.3/site-packages:/opt/glite/lib/python::/opt/glite/lib/python2.3/site-packages/amga:/opt/fpconst/lib/python2.3/site-packages:/opt/ZSI/lib/python2.3/site-packages:/opt/SOAPpy/lib/python2.3/site-packages";

... and in fact SRM-probe gets FULL environment (see below). Below, I "caught" a 
running instance of SRM-probe. PYTHONPATH is there and contains pre-pended colon 
":/opt/..." because of empty $PYTHOPATH in 
$ENV{PYTHONPATH}="$PYTHOPATH:/opt/..." (kind of an indicator in this case). 
Check this out:

[root@samnag004 ~]# ps fU nagios|grep SRM-probe && for p in `ps -u nagios -o 
pid|grep -v PID`;do strings /proc/${p}/environ |grep PYTHONPATH;done|sort|uniq -c
25929 ?        S      0:00  \_ python 
/usr/libexec/grid-monitoring/probes/org.sam/SRM-probe -H 
lcgdpmse.dnp.fmph.uniba.sk -t 600 -x /etc/nagios/globus/userproxy.pem-ops --vo 
ops --ldap-uri lcgmonitor.dnp.fmph.uniba.sk
      18 
PYTHONPATH=/opt/glite/lib/python:/opt/ZSI/lib/python2.3/site-packages:/opt/SOAPpy/lib/python2.3/site-packages:/opt/lcg/lib/python:/opt/glite/lib/python2.3/site-packages/amga:/opt/fpconst/lib/python2.3/site-packages
      61 
PYTHONPATH=:/opt/lcg/lib/python2.3/site-packages:/opt/lcg/lib/python:/opt/glite/lib/python2.3/site-packages:/opt/glite/lib/python::/opt/glite/lib/python2.3/site-packages/amga:/opt/fpconst/lib/python2.3/site-packages:/opt/ZSI/lib/python2.3/site-packages:/opt/SOAPpy/lib/python2.3/site-packages

[root@samnag004 ~]# ps fwwe -p 25929
   PID TTY      STAT   TIME COMMAND
25929 ?        S      0:00 python 
/usr/libexec/grid-monitoring/probes/org.sam/SRM-probe -H 
lcgdpmse.dnp.fmph.uniba.sk -t 600 -x /etc/nagios/globus/userproxy.pem-ops --vo 
ops --ldap-uri lcgmonitor.dnp.fmph.uniba.sk 
GLITE_LOCATION_VAR=/opt/glite/var 
GLOBUS_TCP_PORT_RANGE=20000,25000 ... 
LCG_GFAL_INFOSYS=sam-bdii.cern.ch:2170 ...
PYTHONPATH=:/opt/lcg/lib/python2.3/site-packages:/opt/lcg/lib/python:/opt/glite/lib/python2.3/site-packages:/opt/glite/lib/python::/opt/glite/lib/python2.3/site-packages/amga:/opt/fpconst/lib/python2.3/site-packages:/opt/ZSI/lib/python2.3/site-packages:/opt/SOAPpy/lib/python2.3/site-packages 
...
[root@samnag004 ~]#

I might have made some mistakes in the assumptions/interpretations above, but if 
everything was correct... how did it happen that SRM-probe got 
"PYTHONPATH=:/opt/..." in its env?

K.

NB! I wasn't able to reproduce this behavior with two simple tests. Check A: a Perl test sets an environment variable. Check B: a shell test checks for the presence of that variable. A Nagios daemon instance with only these two checks was run for quite some time, but Check B didn't see the variable set by Check A.

-- KonstantinSkaburskas - 17 Mar 2009

Topic revision: r41 - 2010-08-16 - AleDiGGi
 