SAM to Nagios migration of sensors and tests
Two main cases are considered:
- running existing SAM tests without modification under Nagios
- developing service checks for Nagios
NB! The grid-monitoring-probes-org.sam RPM is available through the EGEE SA1 repository http://www.sysadmin.hep.ac.uk/rpms/egee-SA1/ (also via the egee-NAGIOS meta RPM). For more information see link.
Nagios checks - return codes and output
A short crash course on Nagios checks. Nagios (from version 3.x) accepts multi-line output from checks. The first line is treated as the check's summary and all remaining lines as the check's details data. Return codes:
0 - OK, 1 - WARNING, 2 - CRITICAL, 3 - UNKNOWN
(for more details see "Plugin Return Codes").
Example of running a dummy Nagios check:
$ ./check_dummy
check_dummy: Could not parse arguments
Usage: check_dummy <integer state> [optional text]
$
$ ./check_dummy 2 "my summary
> my details data line 1
> my details data line 2"
CRITICAL: my summary
my details data line 1
my details data line 2
$
$ echo $?
2
$ ./check_dummy 2 "my summary
> my details data line 1
> my details data line 2" | nl
1 CRITICAL: my summary
2 my details data line 1
3 my details data line 2
$
When the check is run under Nagios and its output is collected by Nagios, line 1 and lines 2 onwards of the check's output (from the example above) are stored in two different containers: summary and details data (SERVICEOUTPUT and LONGSERVICEOUTPUT in Nagios terms, respectively). By default, Nagios reads only the first 4 KB of data returned by a check (this can be changed only by recompiling Nagios). That means 4096 ASCII characters (including spaces and line feeds) for summary and details data together. The Nagios web GUI displays summary and details data separately. Details data is displayed with line feeds preserved.
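The output convention above can be illustrated with a minimal Python sketch. It is not taken from any existing probe: format_check_output and main are hypothetical names, and the truncation simply mirrors the default 4 KB read limit mentioned above.

```python
# Nagios plugin return codes, as listed above.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_TEXT = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# Default amount of plugin output read by Nagios (4 KB), counted here in characters.
MAX_OUTPUT_CHARS = 4096


def format_check_output(status, summary, details=()):
    """Return the text a check should print: summary line first, details after."""
    lines = ["%s: %s" % (STATUS_TEXT[status], summary)]
    lines.extend(details)
    # Anything beyond the limit would not be read by Nagios anyway.
    return "\n".join(lines)[:MAX_OUTPUT_CHARS]


def main():
    """Hypothetical check body: always CRITICAL, like the check_dummy run above."""
    print(format_check_output(CRITICAL, "my summary",
                              ["my details data line 1", "my details data line 2"]))
    return CRITICAL


if __name__ == "__main__":
    # A real check would finish with sys.exit(main()) so that Nagios sees the code.
    main()
```

The first printed line ends up in SERVICEOUTPUT and the remaining lines in LONGSERVICEOUTPUT; the value returned by main() is what a real check would pass to sys.exit().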
Running existing SAM tests under Nagios - samtest-run
A wrapper check was developed to run (a subset of) the existing SAM tests under Nagios. It is useful during the transition period.
$ rpm -qf --queryformat "%{NAME}\n" /usr/libexec/grid-monitoring/probes/org.sam/samtest-run
grid-monitoring-probes-org.sam
$
Usage
Usage of the wrapper check
$ /usr/libexec/grid-monitoring/probes/org.sam/samtest-run
Usage:
samtest-run [-d <path> -s <name> -m <test>] | [-f <pathToTest>] [-H <hostname>]
[-p] [-e <env,..>] [-t|--timeout sec] [-V] [-h|--help] [-v|--vo <VO>]
[-x proxy] [-w <path>] [-o "SAM test options"]
Provide test with
[-d <dir> -s <name> -m <test>] | [-f <pathToTest>]
or
-h for help
$
Output of running ./samtest-run -h:
$ ./samtest-run -h
Usage:
samtest-run [-d <path> -s <name> -m <test>] | [-f <pathToTest>] [-H <hostname>]
[-p] [-e <env,..>] [-t|--timeout sec] [-V] [-h|--help] [-v|--vo <VO>]
[-x proxy] [-w <path>] [-o "SAM test options"]
Mandatory parameters:
-d <path> Directory where SAM sensors are located. Absolute path.
-s <name> Name of SAM sensor
-m <test> Name of a test to be run. Eg. SRMv2-get-SURLs. Assumes tests
are located under /<path>/<name>/tests/
-f <pathToTest> Test specified by an absolute path.
Optional parameters:
-H <hostname> Hostname the service is running on. (Default: localhost)
Usually, SAM tests assume that first positional parameter is
the name of the host to test. However, this might not be the
case for all the tests. Thus, if not provided 'localhost'
default is used.
-p If you want your '/<path>/<name>/prepare-<name>' script to
be executed. Note: the script shouldn't contain any calls to
SAM binaries (eg. same-publish-tuples) - only code really
preparing an "environment" for the test to be executed.
-e <env,..> Comma delimited list of KEY=value environment variables that
are required to be exported before launching the test.
-h|--help Displays help.
-t|--timeout sec Sets test's global timeout. (Default: 600)
-v|--vo <VO> VO name to set as SAME_VO environment variable.
(Default: ops)
-w <path> Working directory for checks.
(Default: /var/run/gridprobes/<VO>/same)
-x VOMS proxy (Order: X509_USER_PROXY, /tmp/x509up_u<UID>, -x)
-o "options" Options to be passed to the SAM test
-V Displays version.
You must specify test ([-d, -s, -m] | [-f]) and hostname (-H)
This script runs a test script available as an executable in
<directory>/<name>/tests/<test>
or
<pathToTest>
Arguments given with -o option are passed to the test script.
The script captures (in line buffered mode) stdout and stderr of the test
script and produces Nagios compliant output consisting of
- test status (on the first line)
- multi-line details data
SAM exit codes are mapped to Nagios ones.
Nagios | SAM
0, OK | 0, (10, ok)
1, WARNING | (40, warning)
2, CRITICAL | 1, (50, error), (60, critical)
3, UNKNOWN | (20, info), (30, notice), (100, maintenance)
|
1, WARNING | in other cases
$
Detailed description
Detailed description of what the wrapper does:
- sets a SAM compliant environment for the test. The following variables are available:
variable | value
SAME_HOME * | SAME_HOME=<path>/.. (one level up in the directory hierarchy). In SAM corresponds to /opt/lcg/same/client/sensors/..
SAME_SENSOR_NAME * | SAME_SENSOR_NAME=<name>
SAME_SENSOR_HOME * | SAME_SENSOR_HOME=<path>/<name>
SAME_WORK | /var/run/gridprobes/$SAME_VO/same by default, or SAME_WORK=<path> if given by -w <path>. In SAM corresponds to $HOME/.same/
SAME_SENSOR_WORK * | SAME_SENSOR_WORK=$SAME_WORK/$SAME_SENSOR_NAME
SAME_TEST_WORK * | SAME_TEST_WORK=$SAME_SENSOR_WORK/nodes/<hostname>
SAME_TEST_DIRNAME ** | SAME_TEST_DIRNAME=dirname(<pathToTest>)
SAME_VO | ops by default, or given by -v or --vo <VO>
SAME_TIMEOUT | 600 by default, or given by -t or --timeout sec
SAME_OK | 10
SAME_INFO | 20
SAME_NOTICE | 30
SAME_WARNING | 40
SAME_ERROR | 50
SAME_CRITICAL | 60
SAME_MAINTENANCE | 100
* set only if -d <path>, -s <name> and -m <test> are given
** set only if -f <pathToTest> is given
- exports environment variables provided with -e
- creates $SAME_WORK and $SAME_SENSOR_WORK if they do not exist
- creates $SAME_SENSOR_WORK/nodes/<hostname> if it doesn't exist
- sets a global timeout (alarm) on the run of the SAM test (600 sec by default); when the timeout is reached, gracefully kills the whole process group; stdout/stderr from the child processes gathered before the timeout is provided as details data
- sets a SIGTERM handler; if SIGTERM is caught, the script exits gracefully, printing the status and the details data of the launched script gathered before the signal was caught
- if -p was specified, /<path>/<name>/prepare-<name> is executed before launching the test. The prepare-<name> script should be modified to contain only the code that prepares the environment the test needs to run correctly. Calls to SAM binaries (eg. same-publish-tuples) should be removed
- forks the SAM test with stdout/stderr joined
- collects the test's output in line-buffered mode
- picks up the test's exit code
- translates the test's output and exit code to Nagios compliant output and exit code respectively
  - output (Nagios expects the first line of the output to be the summary of a check execution and all the rest to be details data):
    - the code checks whether the last line of the check's output starts with "summary: ". If so, the part of the string (256 characters) that follows "summary: " is taken as summary data. The summary data is prefixed with the textual representation of the check's SAM exit code mapped to the Nagios one (OK, WARNING, CRITICAL, UNKNOWN). Eg.: CRITICAL: unable to copy file. If the check's summary data is not found, only the textual representation of the exit code is printed to Nagios as the summary.
    - all the rest of the check's output is treated as details data (and is output to Nagios following the summary line)
  - exit code:
    - the following mapping is used
SAM | Nagios
(10, SAME_OK) | 0, OK
(40, SAME_WARNING) | 1, WARNING
(50, SAME_ERROR), (60, SAME_CRITICAL) | 2, CRITICAL
(20, SAME_INFO), (30, SAME_NOTICE), (100, SAME_MAINTENANCE) | 3, UNKNOWN
in other cases | 1, WARNING
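The translation of a SAM test's exit code and output into Nagios form can be sketched in Python as follows. This is an illustration only, not the wrapper's actual code: translate is a hypothetical name, the raw codes 0 and 1 are taken from the samtest-run help output, and falling back to the bare status when nothing follows "summary: " is an assumption.

```python
# SAM status codes and their Nagios counterparts, from the mapping table above.
SAM_TO_NAGIOS = {10: 0, 40: 1, 50: 2, 60: 2, 20: 3, 30: 3, 100: 3}
# Raw exit codes 0 and 1, as listed in the samtest-run help output.
SAM_TO_NAGIOS.update({0: 0, 1: 2})

NAGIOS_TEXT = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}
SUMMARY_PREFIX = "summary: "
SUMMARY_MAX = 256


def translate(sam_exit_code, output_lines):
    """Return (nagios_code, text) for a finished SAM test."""
    nagios_code = SAM_TO_NAGIOS.get(sam_exit_code, 1)  # "in other cases" -> WARNING
    status = NAGIOS_TEXT[nagios_code]
    details = list(output_lines)
    summary = status
    if details and details[-1].startswith(SUMMARY_PREFIX):
        # Take up to 256 characters after "summary: " and drop that line from details.
        text = details.pop()[len(SUMMARY_PREFIX):][:SUMMARY_MAX]
        if text:  # assumption: an empty summary falls back to the bare status
            summary = "%s: %s" % (status, text)
    # Summary goes on the first line, details data follows.
    return nagios_code, "\n".join([summary] + details)


if __name__ == "__main__":
    code, text = translate(50, ["copying file", "summary: unable to copy file"])
    print(text)
```

Keeping the mapping in a dictionary with a default makes the "in other cases" rule explicit: any exit code not listed is reported as WARNING.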
Examples
Examples. No modifications to tests were required:
- running /opt/lcg/same/client/sensors/LFC_C/LFC_C-ping prod-lfc-shared-central.cern.ch
$ ./samtest-run -d /opt/lcg/same/client/sensors -s LFC_C -m LFC_C-ping -H prod-lfc-shared-central.cern.ch
OK: 1.6.8-1
...
Executing lfc-ping on LFC node prod-lfc-shared-central.cern.ch:
<pre>
server version: 1.6.8-1
</pre>
$ echo $?
0
$
- running /opt/lcg/same/client/sensors/SRMv2/SRMv2-{get-SURLs,put,del} gridsrm.pi.infn.it
#
# sandbox for sensors is empty
#
$ ll /var/run/gridprobes/ops/same/
total 0
$
#
# getting SRM endpoint and SAPath
#
$ ./samtest-run -d /opt/lcg/same/client/sensors -s SRMv2 -m SRMv2-get-SURLs -H gridsrm.pi.infn.it
OK
...
<h2>Checking full endpoint and default storage areas in BDII</h2>
...
+ ldapsearch -l 7 -LLL -h lcg-bdii.cern.ch:2170 -x -b o=grid '(|(&(objectClass=GlueSA)(GlueChunkKey=GlueSEUniqueID=gridsrm.pi.infn.it)(|(GlueSAAccessControlBaseRule=ops)(GlueSAAccessControlBaseRule=VO:ops))(GlueSALocalID=ops)) (&(objectClass=GlueService)(GlueServiceUniqueID=httpg://gridsrm.pi.infn.it*/srm/managerv2)(GlueServiceVersion=2.2.0)) )' GlueServiceEndpoint GlueSAPath
+ set +x
...
SRMv2 endpoint:
httpg://gridsrm.pi.infn.it:8444/srm/managerv2
Storage Area path:
/ops
...
<p>Test status: <b>OK</b></p>
$ echo $?
0
#
# working directories for the test were created
#
$ ll /var/run/gridprobes/ops/same/
total 4
drwxrwxr-x 3 500 500 4096 Mar 23 09:31 SRMv2
$ ll /var/run/gridprobes/ops/same/SRMv2/nodes/gridsrm.pi.infn.it/
total 12
-rw-rw-r-- 1 500 500 20 Mar 23 09:31 comm
-rw-rw-r-- 1 500 500 46 Mar 23 09:31 endpoint.txt
-rw-rw-r-- 1 500 500 5 Mar 23 09:31 SApath.txt
$
#
# "prepare" script for SRMv2 sensor
#
$ cat /opt/lcg/same/client/sensors/SRMv2/prepare-SRMv2
#!/bin/bash
echo -n "0123456789" > $SAME_SENSOR_WORK/testFile.txt
cksum $SAME_SENSOR_WORK/testFile.txt > $SAME_SENSOR_WORK/testFile.cksum
$
#
# SRMv2-put requires sensor's "prepare-SRMv2" to be run first - "-p" parameter
#
$ ./samtest-run -d /opt/lcg/same/client/sensors -p -s SRMv2 -m SRMv2-put -H gridsrm.pi.infn.it
OK
...
Testing SRMv2 endpoint
<pre>srm://gridsrm.pi.infn.it:8444/srm/managerv2</pre>
...
Testing with SURL:
srm://gridsrm.pi.infn.it:8444/srm/managerv2?SFN=/ops/testfile-cp-20090323-093351-a910e5c6dca9.txt
...
+ lcg-cp -t 120 -b --vo ops -D srmv2 -U srmv2 -v file:/var/run/gridprobes/ops/same/SRMv2/testFile.txt 'srm://gridsrm.pi.infn.it:8444/srm/managerv2?SFN=/ops/testfile-cp-20090323-093351-a910e5c6dca9.txt'
41472 bytes 29.95 KB/sec avg 29.95 KB/sec instDestination SE type: SRMv2
Destination SRM Request Token: f4bae3e6-e09f-4ffc-9eb6-8aa04fb53e47
Source URL: file:/var/run/gridprobes/ops/same/SRMv2/testFile.txt
File size: 41472
Source URL for copy: file:/var/run/gridprobes/ops/same/SRMv2/testFile.txt
Destination URL: gsiftp://gridsrm.pi.infn.it:2811/gpfs/gpfsift/srm/ops/testfile-cp-20090323-093351-a910e5c6dca9.txt
# streams: 1
# set timeout to 120 (seconds)
Transfer took 2070 ms
+ retcode=0
+ set +x
...
<b>OK</b>: File srm://gridsrm.pi.infn.it:8444/srm/managerv2?SFN=/ops/testfile-cp-20090323-093351-a910e5c6dca9.txt copied successfully.
</p>
<p>Test status: <b>OK</b></p>
$ echo $?
0
#
# file was successfully copied to SRM and its SURL stored in the sensor's sandbox
#
$ cat /var/run/gridprobes/ops/same/SRMv2/nodes/gridsrm.pi.infn.it/testFile.surl
/ops srm://gridsrm.pi.infn.it:8444/srm/managerv2?SFN=/ops/testfile-cp-20090323-093351-a910e5c6dca9.txt
$
#
# deleting file from SRM
#
$ ./samtest-run -d /opt/lcg/same/client/sensors -s SRMv2 -m SRMv2-del -H gridsrm.pi.infn.it
OK
...
Testing SRMv2 endpoint
<pre>srm://gridsrm.pi.infn.it:8444/srm/managerv2</pre>
...
Testing with SURL:
..
srm://gridsrm.pi.infn.it:8444/srm/managerv2?SFN=/ops/testfile-cp-20090323-093351-a910e5c6dca9.txt
...
+ lcg-del -t 120 -v -b -l -D srmv2 -T srmv2 --vo ops 'srm://gridsrm.pi.infn.it:8444/srm/managerv2?SFN=/ops/testfile-cp-20090323-093351-a910e5c6dca9.txt'
VO name: ops
Timeout: 120 seconds
SE type: SRMv2
srm://gridsrm.pi.infn.it:8444/srm/managerv2?SFN=/ops/testfile-cp-20090323-093351-a910e5c6dca9.txt - DELETED
+ retcode=0
+ set +x
...
<b>OK</b>: File srm://gridsrm.pi.infn.it:8444/srm/managerv2?SFN=/ops/testfile-cp-20090323-093351-a910e5c6dca9.txt was deleted successfully.
<p>Test status: <b>OK</b></p>
$ echo $?
0
$
SAM tests that can be run with samtest-run
SAM critical tests
List of EGEE/WLCG Critical Probes used for Availability Metrics Calculations.
test | command | modifications
CE-sft-brokerinfo | ./samtest-run -f /path/CE-sft-brokerinfo | no
CE-sft-caver | ./samtest-run -f /path/CE-sft-caver -o "-c /path/ca_data.conf -b /path/ca_data.dat" | no
CE-sft-csh | ./samtest-run -f /path/CE-sft-csh | no
CE-sft-softver | ./samtest-run -d /path/ -s testjob -m CE-sft-softver | no
"CE"-host-cert-valid | ./samtest-run -p -d /path/ -s host-cert -m host-cert-valid -o "ce111.cern.ch CE" | prepare-host-cert - only the call to make a required test utility was left
CE-sft-lcg-rm | ... | wrapper for CE-sft-lcg-rm-* - can't be run
CE-sft-lcg-rm-gfal | ./samtest-run -d /path/ -s testjob -m CE-sft-lcg-rm-gfal | no
CE-sft-lcg-rm-free | ./samtest-run -d /path/ -s testjob -m CE-sft-lcg-rm-free | no
CE-sft-lcg-rm-cr | ./samtest-run -d /path/ -s testjob -m CE-sft-lcg-rm-cr -e LFC_HOST=prod-lfc-shared-central.cern.ch,LFC_HOME=/grid/ops/SAM | no
CE-sft-lcg-rm-cp | ./samtest-run -d /path/ -s testjob -m CE-sft-lcg-rm-cp -e LFC_HOST=prod-lfc-shared-central.cern.ch,LFC_HOME=/grid/ops/SAM | no
CE-sft-lcg-rm-rep | ./samtest-run -d /opt/lcg/same/client/sensors/ -s testjob -m CE-sft-lcg-rm-rep -e SAME_CENTRAL_SE=lxdpm104.cern.ch,LFC_HOST=prod-lfc-shared-central.cern.ch | no
CE-sft-lcg-rm-del | ./samtest-run -d /opt/lcg/same/client/sensors/ -s testjob -m CE-sft-lcg-rm-del -e LFC_HOST=prod-lfc-shared-central.cern.ch,LFC_HOME=/grid/ops/SAM | no
test | command | modifications
SRMv2-get-SURLs | ./samtest-run -d /path/ -s SRMv2 -m SRMv2-get-SURLs -H lxdpm104.cern.ch | no
SRMv2-ls-dir | ./samtest-run -d /path/ -s SRMv2 -m SRMv2-ls-dir -H lxdpm104.cern.ch | no
SRMv2-put | ./samtest-run -p -d /path/ -s SRMv2 -m SRMv2-put -H lxdpm104.cern.ch | prepare-SRMv2 - only the code that creates a test file was left (which could have been moved to the SRMv2-put test itself)
SRMv2-ls | ./samtest-run -d /path/ -s SRMv2 -m SRMv2-ls -H lxdpm104.cern.ch | no
SRMv2-gt | ./samtest-run -d /path/ -s SRMv2 -m SRMv2-gt -H lxdpm104.cern.ch | no
SRMv2-get | ./samtest-run -d /path/ -s SRMv2 -m SRMv2-get -H lxdpm104.cern.ch | no
SRMv2-del | ./samtest-run -d /path/ -s SRMv2 -m SRMv2-del -H lxdpm104.cern.ch | no
SAM non-critical tests
test | command | modifications
LFC_C-ping | ./samtest-run -d /path/ -s LFC_C -m LFC_C-ping -H prod-lfc-shared-central.cern.ch | no
LFC_C-ls | ./samtest-run -d /path/ -s LFC_C -m LFC_C-ls -H prod-lfc-shared-central.cern.ch | no
LFC_C-writefile | ./samtest-run -d /path/ -s LFC_C -m LFC_C-writefile -H prod-lfc-shared-central.cern.ch | no
LFC_L |
LFC_L-ping | ./samtest-run -d /path/ -s LFC_L -m LFC_L-ping -H lfc.triumf.ca | no
LFC_L-ls | ./samtest-run -d /path/ -s LFC_L -m LFC_L-ls -H lfc.triumf.ca | no
List of SAM tests that require certain modifications to be able to run with samtest-run
Writing probes for Nagios
Nagios checks
For development of Nagios native checks please see Nagios plug-in development guidelines.
"Semi-Nagios" check. Running with nagtest-run
"Semi-Nagios" check is the one that is:
For such checks a wrapper exists - nagtest-run.
$ rpm -qf --queryformat "%{NAME}\n" /usr/libexec/grid-monitoring/probes/org.sam/nagtest-run
grid-monitoring-probes-org.sam
$
Usage
Usage of the wrapper check
$ /usr/libexec/grid-monitoring/probes/org.sam/nagtest-run
Usage:
nagtest-run -m <pathToTest> -H <hostname> -s <service> [-t|--timeout sec]
[-V] [-e <env,..>] [-h|--help] [--vo <VO>] [-x proxy] [-w <path>]
[-v <0-3>] [-o "test options"]
Provide test with mandatory parameters
-m <pathToTest> -s <service> -H <hostname>
or
-h for help
Output of running ./nagtest-run -h:
$ ./nagtest-run -h
Usage:
nagtest-run -m <pathToTest> -H <hostname> -s <service> [-t|--timeout sec]
[-V] [-e <env,..>] [-h|--help] [--vo <VO>] [-x proxy] [-w <path>]
[-v <0-3>] [-o "test options"]
Mandatory parameters:
-m <pathToTest> Test specified by an absolute path.
-s <service> Name of a service to be tested.
-H <hostname> Hostname to test. Passed to test.
Optional parameters:
-h|--help Displays help
-t|--timeout sec Sets test's global timeout. Passed to test as (-t).
(Default: 600)
--vo <VO> VO name to set as NAG_VO environment variable.
(Default: ops)
-v <0-3> Verbosity level. Passed to the test. (Default: 0)
-e <env,..> Comma delimited list of KEY=value environment variables that
are required to be exported before launching the test.
-w <path> Working directory for checks.
(Default: /var/run/gridprobes/<VO>/nag)
-x VOMS proxy (Order: X509_USER_PROXY, /tmp/x509up_u<UID>, -x)
-o "test options" Options to be passed to the test.
-V Displays version.
You must specify test (-m), service name (-s) and hostname (-H)
The wrapper script runs a test script available as an executable <pathToTest>.
Arguments given with -o option are passed to the test script.
The script captures (in line buffered mode) stdout and stderr of the test
script and produces Nagios compliant output consisting of
- test status (on the first line)
- multi-line details data
The script assumes that test it runs returns 0-3. If return code is greater
than 3 it issues WARNING instead.
$
Detailed description
Detailed description of what the wrapper does:
- command line parameters defined by Nagios; the ones marked with * are passed to tests:
-V version (--version)
-h help (--help)
-t timeout (--timeout) *
-w warning threshold (--warning)
-c critical threshold (--critical)
-H hostname (--hostname) *
-v verbose (--verbose) *
- variables available for checks
variable | value
NAG_WORK | <path>/<VO>/nag
NAG_SERVICE_NAME | <service>
NAG_SERVICE_WORK | $NAG_WORK/$NAG_SERVICE_NAME
NAG_CHECK_WORK | $NAG_SERVICE_WORK/nodes/<hostname>
NAG_CHECK_DIRNAME | dirname(<pathToTest>)
NAG_OK | 0
NAG_WARNING | 1
NAG_CRITICAL | 2
NAG_UNKNOWN | 3
NAG_VO | ops or <VO>
NAG_TIMEOUT | 600 or the value given by -t or --timeout sec
- exports environment variables provided with -e
- creates $NAG_WORK, $NAG_SERVICE_WORK and $NAG_CHECK_WORK if they do not exist
- sets a global timeout (alarm) on the run of the test (600 sec by default); when the timeout is reached, gracefully kills the whole process group; stdout/stderr from the child processes gathered before the timeout is provided as details data
- sets a SIGTERM handler; if SIGTERM is caught, the script exits gracefully, printing the status and the details data of the launched script gathered before the signal was caught
- forks the test with stdout/stderr joined
- collects the test's output in line-buffered mode
- picks up the test's exit code
- translates the test's output to Nagios compliant output and checks the test's exit code
  - output:
    - the summary of the test should go on the last line. Possibilities:
      - the first 256 characters of the last line
      - if the line starts with "summary: ", the 256 characters that follow it. If that part is empty, then:
      - the textual representation of the exit code, using the mapping {0 - OK, 1 - WARNING, 2 - CRITICAL, 3 - UNKNOWN}
    - previously printed data is treated as details data
  - exit code:
    - checked to be in the range 0-3. If it is greater than 3, WARNING is issued.
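The timeout handling and the exit-code check described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the wrapper's actual code: it relies on the timeout support of subprocess instead of an alarm signal, run_test and nagios_exit_code are hypothetical names, and reporting negative codes (test killed by a signal) as WARNING is an assumption, since the text only covers codes greater than 3. The sketch is POSIX only.

```python
import os
import signal
import subprocess
import sys

NAG_WARNING = 1  # same value as the NAG_WARNING variable listed above


def run_test(argv, timeout=600):
    """Run a test in its own process group.

    Returns (exit_code, output, timed_out); output holds stdout and stderr joined.
    """
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # join stderr into stdout
        universal_newlines=True,
        start_new_session=True,     # the child becomes leader of a new process group
    )
    timed_out = False
    try:
        output, _ = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        timed_out = True
        # Terminate the test together with any children it has started.
        os.killpg(proc.pid, signal.SIGTERM)
        # Collect whatever was written before the timeout.
        output, _ = proc.communicate()
    return proc.returncode, output, timed_out


def nagios_exit_code(code):
    """Pass codes 0-3 through; report anything else as WARNING."""
    return code if 0 <= code <= 3 else NAG_WARNING


if __name__ == "__main__":
    # Hypothetical test command: prints one line, then sleeps past the timeout.
    cmd = [sys.executable, "-u", "-c",
           "print('started'); import time; time.sleep(30)"]
    code, output, timed_out = run_test(cmd, timeout=1)
    print(nagios_exit_code(code), timed_out, output.strip())
```

Starting the test in a new session gives it a process group of its own, so a single os.killpg call reaches the test and everything it forked, while the output gathered before the timeout is still returned as details data.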
Examples
- example showing how nagtest-run treats summary and details data, and passes parameters to tests.
# A dummy test script.
#
$ cat /usr/libexec/grid-monitoring/probes/org.sam/tests/nagrun-test.py
#!/usr/bin/env python
import sys, getopt
host = 'dummy'; s = 'last'
def get_opts(argv):
    global host, s
    opts,_ = getopt.getopt(argv[1:],'H:s:')
    for o,v in opts:
        if o == '-s':
            s = v
        elif o == '-H':
            host = v
get_opts(sys.argv)
print "line 1 - details, testing host: "+host
print "line 2 - details, testing host: "+host
if s == 'sl': # last line with "summary: some text"
    print "summary: "+"."*(250-2)+"0123456789"
elif s == 'sle': # last line with "summary: "
    print "summary: "
else:
    print "."*(250-2)+"0123456789"
sys.exit(2)
$
#
# Launch w/o options. -H was passed to the test and last line got truncated to 256 chars,
# and printed as Nagios summary (first line).
#
$ ./nagtest-run -m /usr/libexec/grid-monitoring/probes/org.sam/tests/nagrun-test.py -s TEST -H localhost
/home/konstan/eclipse/workspace/sam-probes-branch-noWLCG/src/checks/nag1.py -H localhost
........................................................................................................................................................................................................................................................0123456
line 1 - details, testing host: localhost
line 2 - details, testing host: localhost
$ echo $?
2
$
#
# Launch with "-s sl". Test printed out "summary: ..." as last line. Only part after
# "summary: " was taken and truncated to 256 chars, and printed as Nagios summary (first line).
#
$ ./nagtest-run -m /usr/libexec/grid-monitoring/probes/org.sam/tests/nagrun-test.py -s TEST -H localhost -o "-s sl"
/home/konstan/eclipse/workspace/sam-probes-branch-noWLCG/src/checks/nag1.py -H localhost -s sl
........................................................................................................................................................................................................................................................0123456
line 1 - details, testing host: localhost
line 2 - details, testing host: localhost
$ echo $?
2
#
# Launch with "-s sle". Test printed out "summary: ". Wrapper put textual representation of the test's exit
# code as the summary.
#
$ ./nagtest-run -m /usr/libexec/grid-monitoring/probes/org.sam/tests/nagrun-test.py -s TEST -H localhost -o "-s sle"
/home/konstan/eclipse/workspace/sam-probes-branch-noWLCG/src/checks/nag1.py -H localhost -s sle
CRITICAL
line 1 - details, testing host: localhost
line 2 - details, testing host: localhost
$ echo $?
2
$
- "Nagios-enabled" version of SAM's LFC-ping. Nothing was changed except return codes. Note: --nopass-H is needed to say that the hostname must be passed to the test w/o explicitly specifying the -H parameter.
$ ./nagtest-run -m /usr/libexec/grid-monitoring/probes/eu.hec/checks/LFC-ping -H prod-lfc-shared-central.cern.ch -s LFC --nopass-H
1.6.8-1
Testing from host: kvs.cern.ch
DN: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=kskaburs/CN=658461/CN=Konstantin Skaburskas
Netork timeout on LFC: LFC_CONNTIMEOUT=5 LFC_CONRETRY=1 LFC_CONRETRYINT=2
Executing lfc-ping on LFC node prod-lfc-shared-central.cern.ch:
server version: 1.6.8-1
$ echo $?
0
$
Python based probes using 'gridmon' package from python-GridMon RPM
Python 'gridmon' package provides metric base class (probe.MetricGatherer) and a framework for writing Python based Nagios compliant probes and metrics.
Binaries
For org.sam probes developed using the 'gridmon' package see SAM Probes and Metrics.
Sources
'gridmon' package from python-GridMon RPM
- anonymous read-only access
org.sam probes and metrics
- anonymous read-only access
- code can be browsed from here
Naming
Proposed naming conventions:
- /path/<nameSpace>/<serviceAbbreviation>-probe - a probe is an executable script containing a set of metrics to test a particular service. <nameSpace> can eg. be org.<VOname>. Eg., in case of SAM: /usr/libexec/grid-monitoring/probes/org.sam/SRM-probe.
- metric - a test examining a particular functionality of a service. Metrics can follow the naming "convention" <nameSpace>.<serviceAbbreviation>-<testName>. Eg.: org.sam/SRM-probe contains a number of metrics: org.sam.SRM-LsDir, org.sam.SRM-Put, org.sam.SRMv2-Del etc.
Probes/metrics naming (in short):
Probe | Metric
<nameSpace>.<serviceAbbreviation>-probe | <nameSpace>.<serviceAbbreviation>-<testName>
org.sam.SRM-probe | org.sam.SRM-LsDir
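As a small illustration of the proposed convention, the names can be composed as in the Python sketch below. The helper names probe_path and metric_name are hypothetical and are not part of the 'gridmon' package.

```python
def probe_path(base_dir, name_space, service):
    """Build /path/<nameSpace>/<serviceAbbreviation>-probe."""
    return "%s/%s/%s-probe" % (base_dir.rstrip("/"), name_space, service)


def metric_name(name_space, service, test_name):
    """Build <nameSpace>.<serviceAbbreviation>-<testName>."""
    return "%s.%s-%s" % (name_space, service, test_name)


if __name__ == "__main__":
    print(probe_path("/usr/libexec/grid-monitoring/probes", "org.sam", "SRM"))
    print(metric_name("org.sam", "SRM", "LsDir"))
```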
Writing a probe
Please check org.sam/T-probe (template probe).
Skeleton of your probe:
# my imports + 'probe' module from 'gridmon' package
import sys
from gridmon import probe

class MYMetrics(probe.MetricGatherer):
    # my custom attributes
    def __init__(self, tuples):
        # constructor arguments are assumed to follow the SRMMetrics example
        # further below; check org.sam/T-probe for the exact signature
        probe.MetricGatherer.__init__(self, tuples, 'MYService')
        # mandatory initialization code
        # my custom initialization code
    def metricMyMet1(self):
        # metric body
        pass
    def metricMyMet2(self):
        # metric body
        pass

runner = probe.Runner(MYMetrics, probe.ProbeFormatRenderer())
sys.exit(runner.run(sys.argv))
Simple step-by-step example:
- import the probe module from the gridmon package.
import sys
try:
    from gridmon import probe
except ImportError,e:
    print "UNKNOWN: Error loading modules : %s" % (e)
    sys.exit(3)
- derive your metrics class from the probe.MetricGatherer class; set some class attributes
class LFCMetrics(probe.MetricGatherer):
    # dictionary describing metrics implemented in the class
    __metrics = { 'MyMet1':{'metricDescription':'My metric one'},
                  'MyMet2':{'metricDescription':'My metric two'} }
- metrics description in a dictionary
class LFCMetrics(probe.MetricGatherer):
    __metrics = {
    }
Reusing metrics implemented in classes of gridmetrics.*metrics modules
If you want to reuse one or more metrics implemented in metric classes defined in gridmetrics.*metrics (e.g., gridmetrics.srmmetrics.SRMMetrics) and add your own metrics to them, you need the following:
from gridmetrics import srmmetrics
class SRMMetrics(srmmetrics.SRMMetrics):
    my_metrics = {
        'MyPut' : {
            # required keys
            'metricDescription' : "this metric does the following...",
            'metricLocality' : 'local',
            'metricType' : 'status',
            'metricVersion' : '0.1',
            # optional keys - example
            'cmdLineOptions' : ['a=','b=','c='],
            'metricChildren' : []
        }}
    def __init__(self, tuples):
        self.set_metrics(self.my_metrics)
        srmmetrics.SRMMetrics.__init__(self, tuples, 'SRM')
        ...
        # not needed
        #self.set_metrics(self._metrics)
Varia
Nagios sends SIGKILL to the launched checks when service_check_timeout is reached. New [Monday, June 07 2010]: it is set to service_check_timeout=910 sec.
Perl based
Nagios grid probes by hr.srce
Migration of HEP VOs SAM tests
Follow the link SAM to Nagios - Practical Hints for the technical details of the migration.
If you want to publish experiment specific tests directly into MSG and make Nagios consume them, please follow this HowTo: ExperimentTestsMSGPublisher
ALICE
Follow the link SAM to Nagios - ALICE
ATLAS
Follow the link SAM to Nagios - ATLAS
CMS
Follow the link SAM to Nagios - CMS
LHCb
Follow the link SAM to Nagios - LHCb
Meetings related to the migration of VO tests into Nagios
Follow the link Meetings VOs SAM tests into Nagios
Running checks under Nagios
Environment
NB! [06 Jun 2009]
Nagios:
- doesn't clean the environment before launching checks
- adds NAGIOS_* variables
- the configuration parameter child_processes_fork_twice=[0|1] doesn't affect the environment
- Nagios server
  - when started from init or by service nagios start, Nagios's environment is very limited and definitely does not contain the grid environment.
  - a Nagios server started from an interactive shell via the init.d launch script (/etc/init.d/nagios start) inherits the shell's environment. The very same environment is propagated to checks.
- WN
  - on WNs the Nagios process is started as a background process (not a daemon) and with the child_processes_fork_twice=0 configuration parameter in nagios.cfg. The primary purpose is to be friendly to the LRMS, i.e., not to leave runaway processes.
  - Nagios propagates the full environment of the user it was started by to the checks. I.e., if a check fails due to problems with importing (shared) libraries/modules, "command not found" etc., this is a problem of badly defined paths in the login shell.
  - the script that launches Nagios on WNs doesn't change the environment.
- Observations:
  - it was observed that Nagios leaks environment changes from one check to another. This was observed in the following situation: Python checks launched w/o any NCG wrappers were getting the grid environment set. The environment ($PYTHONPATH) corresponded to the one defined in the /usr/lib/perl5/site_perl/5.8.5/GridMon/sgutils.pm module, which is used by NCG Perl Nagios wrappers (and Perl grid checks).
  - Just speculating... This apparently has something to do with Nagios's internal Perl interpreter, which loads the client's libraries; if they contain statements that modify %ENV, the modifications are applied to the whole Nagios process. Thus, all subsequent checks (regardless of whether they are Perl or not) get this modified environment.
06/03/2009 03:04 PM
Here is an example. Please, try to follow till the end.
Configuration for org.sam.SRM-All-ops:
org.sam.SRM-All-ops
ncg_check_native!/usr/libexec/grid-monitoring/probes/org.sam/SRM-probe!600!-x
$USER2$ --vo ops --ldap-uri alice003.nipne.ro
where ncg_check_native is:
define command{
command_name ncg_check_native
command_line $ARG1$ -H $HOSTNAME$ -t $ARG2$ $ARG3$
}
which (by my understanding) instructs Nagios to fork the probe
path/org.sam/SRM-probe + arguments. No actual ncg wrapper here! Assuming that
Nagios cleans env before launching each check the above one should fail (on
"import lcg_util") all the time... But it doesn't fail, if in
/usr/lib/perl5/site_perl/5.8.5/GridMon/sgutils.pm one sets
$ENV{PYTHONPATH}="$PYTHOPATH:/opt/lcg/lib/python2.3/site-packages:/opt/lcg/lib/python:/opt/glite/lib/python2.3/site-packages:/opt/glite/lib/python::/opt/glite/lib/python2.3/site-packages/amga:/opt/fpconst/lib/python2.3/site-packages:/opt/ZSI/lib/python2.3/site-packages:/opt/SOAPpy/lib/python2.3/site-packages";
... and in fact SRM-probe gets FULL environment (see below). Below, I "caught" a
running instance of SRM-probe. PYTHONPATH is there and contains pre-pended colon
":/opt/..." because of empty $PYTHOPATH in
$ENV{PYTHONPATH}="$PYTHOPATH:/opt/..." (kind of an indicator in this case).
Check this out:
[root@samnag004 ~]# ps fU nagios|grep SRM-probe && for p in `ps -u nagios -o
pid|grep -v PID`;do strings /proc/${p}/environ |grep PYTHONPATH;done|sort|uniq -c
25929 ? S 0:00 \_ python
/usr/libexec/grid-monitoring/probes/org.sam/SRM-probe -H
lcgdpmse.dnp.fmph.uniba.sk -t 600 -x /etc/nagios/globus/userproxy.pem-ops --vo
ops --ldap-uri lcgmonitor.dnp.fmph.uniba.sk
18
PYTHONPATH=/opt/glite/lib/python:/opt/ZSI/lib/python2.3/site-packages:/opt/SOAPpy/lib/python2.3/site-packages:/opt/lcg/lib/python:/opt/glite/lib/python2.3/site-packages/amga:/opt/fpconst/lib/python2.3/site-packages
61
PYTHONPATH=:/opt/lcg/lib/python2.3/site-packages:/opt/lcg/lib/python:/opt/glite/lib/python2.3/site-packages:/opt/glite/lib/python::/opt/glite/lib/python2.3/site-packages/amga:/opt/fpconst/lib/python2.3/site-packages:/opt/ZSI/lib/python2.3/site-packages:/opt/SOAPpy/lib/python2.3/site-packages
[root@samnag004 ~]# ps fwwe -p 25929
PID TTY STAT TIME COMMAND
25929 ? S 0:00 python
/usr/libexec/grid-monitoring/probes/org.sam/SRM-probe -H
lcgdpmse.dnp.fmph.uniba.sk -t 600 -x /etc/nagios/globus/userproxy.pem-ops --vo
ops --ldap-uri lcgmonitor.dnp.fmph.uniba.sk
GLITE_LOCATION_VAR=/opt/glite/var
GLOBUS_TCP_PORT_RANGE=20000,25000 ...
LCG_GFAL_INFOSYS=sam-bdii.cern.ch:2170 ...
PYTHONPATH=:/opt/lcg/lib/python2.3/site-packages:/opt/lcg/lib/python:/opt/glite/lib/python2.3/site-packages:/opt/glite/lib/python::/opt/glite/lib/python2.3/site-packages/amga:/opt/fpconst/lib/python2.3/site-packages:/opt/ZSI/lib/python2.3/site-packages:/opt/SOAPpy/lib/python2.3/site-packages
...
[root@samnag004 ~]#
I might made some mistakes in the assumptions/interpretations above, but if
everything was correct... how did it happen that SRM-probe got
"PYTHONPATH=:/opt/..." in its env?
K.
NB! I wasn't able to reproduce this behavior with two simple tests. Check A: a Perl test sets an environment variable.
Check B: a shell test checks for the presence of that variable. A Nagios daemon instance with only these two checks was run
for quite some time, but Check B never saw the variable set by Check A.
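A check along the lines of Check B can be sketched in Python as shown below. This is an illustration only and not the test that was actually run: check_env_var and the marker name LEAK_TEST_MARKER are hypothetical, and reporting a leaked variable as CRITICAL is an assumption.

```python
import os

OK, CRITICAL = 0, 2


def check_env_var(name, environ=None):
    """Return (nagios_code, text) saying whether `name` is set in the environment."""
    environ = os.environ if environ is None else environ
    if name in environ:
        summary = "CRITICAL: %s leaked into the check's environment" % name
        details = "%s=%s" % (name, environ[name])
        # Summary on the first line, the offending value as details data.
        return CRITICAL, summary + "\n" + details
    return OK, "OK: %s is not set" % name


if __name__ == "__main__":
    # LEAK_TEST_MARKER is a hypothetical variable that "Check A" would set.
    code, text = check_env_var("LEAK_TEST_MARKER")
    print(text)
    # A real check would finish with sys.exit(code).
```

Scheduled next to a check that sets the marker variable, such a check turns CRITICAL as soon as a change made by one check becomes visible to another, and it shows the leaked value in the details data.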
--
KonstantinSkaburskas - 17 Mar 2009