Practical Hints to migrate to Nagios®

This document gives practical tips on how to set up your VO-specific Nagios project, write the probes, integrate them into the Nagios framework, verify their publication and correctness, and build RPMs for deployment.

Nagios basic configuration

Nagios configuration under /etc/nagios/

  • nagios.cfg: regenerated by YAIM. It contains the main configuration. See parameters with grep -v '^#' nagios.cfg | grep -vP '^\s'. Some examples:
    • cfg_dir=/etc/nagios/wlcg.d/
    • cfg_dir=/etc/nagios/gstat/
    • resource_file=/etc/nagios/wlcg_resource.cfg
  • wlcg_resource.cfg: generated by ?. It contains these macro definitions (values are examples):
    • $USER2$=/etc/nagios/globus/userproxy.pem-cms
    • $USER3$=nagios
    • $USER4$=NagiosRetrieve-sam-cms.cern.ch-cms
  • resource.cfg: do not edit. It contains this macro definition:
    • $USER1$=/usr/lib64/nagios/plugins
  • wlcg.d/: contains files generated via cron job by /usr/sbin/ncg.pl every three hours. It contains the following:
    • commands.cfg: contains definitions of ncg_check_* commands plus other commands
    • host.groups.cfg: contains some hostgroup definitions like
      • node-<nodetype> (e.g. node-CE)
      • alias-<alias> (e.g. alias-dcache-se-desy01.desy.de)
      • site-<site> (e.g. site-CIEMAT-LCG2)
    • host.templates.cfg: contains the definition of ncg-generic-host
    • service.groups.cfg: contains some servicegroup definitions like
      • SITE_<site>_<servicetype> (e.g. SITE_CIEMAT-LCG2_SRMv2)
      • <metric>_<node> (e.g. org.sam.SRM-All_srm.ciemat.es)
      • SERVICE_<servicetype> (e.g. SERVICE_SRMv2)
      • <metricset> (e.g. SRM2)
      • <vo> (e.g. cms)
      • <fqan> (e.g. /cms/Role=lcgadmin)
      • local
    • service.templates.cfg: contains the definition of
      • ncg-generic-service
      • ncg-passive-service
    • users.cfg: contains contact definitions
    • wlcg.nagios.cfg: defines these items:
      • servicegroup_name               nagios-internal
      • timeperiod_name                 ncg-24x7
      • contact_name                    nagiosadmin
      • contactgroup_name               nagios-admins
      • contact_name                    msg-contact
      • contactgroup_name               msg-contacts
      • several services related to the Nagios server
    • directories named after sites (e.g. CIEMAT-LCG2) containing these files:
      • contacts.cfg: defines the site contact information
      • hosts.cfg: listing all hosts in the site to be tested. Most important attributes are
        • check_command (e.g. ncg_check_host_alive)
        • hostgroups (e.g. site-CIEMAT-LCG2, node-APEL, node-CE)
      • services.cfg: listing all services in the site. Most important attributes are
        • host_name (e.g. lcg02.ciemat.es)
        • servicegroups (e.g. local, org.sam.CE, SITE_CIEMAT-LCG2_CE, SERVICE_CE, cms, /cms/Role=lcgadmin)
        • service_description (e.g. org.sam.CE-JobState-/cms/Role=lcgadmin)
        • check_command, which defines the actual command to make the check
        • _metric_name (e.g. org.sam.CE-JobState)
        • _service_flavour (e.g. CE)
        • _metric_set (e.g. org.sam.CE)
        • and many others
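
To inspect the generated files, it can help to list the attributes described above programmatically. The following is a minimal sketch (written for Python 3, not part of the WLCG tools) that parses Nagios "define ... { ... }" blocks from a string. The function name parse_nagios_objects and the sample excerpt are assumptions made for illustration; the sample only reuses the example values quoted above.

```python
import re

# Matches one "define <type> { ... }" block of a Nagios object file.
DEFINE_RE = re.compile(r"define\s+(\w+)\s*\{(.*?)\}", re.DOTALL)


def parse_nagios_objects(text):
    """Return a list of (object_type, attributes) pairs found in text."""
    objects = []
    for obj_type, body in DEFINE_RE.findall(text):
        attrs = {}
        for line in body.splitlines():
            line = line.split(";", 1)[0].strip()  # drop inline ';' comments
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)  # attribute name, then its value
            attrs[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
        objects.append((obj_type, attrs))
    return objects


# Hypothetical excerpt of a services.cfg file, built from the examples above.
SAMPLE = """
define service{
    host_name            lcg02.ciemat.es
    servicegroups        local, org.sam.CE, SITE_CIEMAT-LCG2_CE, SERVICE_CE, cms
    service_description  org.sam.CE-JobState-/cms/Role=lcgadmin
    _metric_name         org.sam.CE-JobState
    _service_flavour     CE
}
"""

if __name__ == "__main__":
    for obj_type, attrs in parse_nagios_objects(SAMPLE):
        if obj_type == "service":
            print(attrs["host_name"], attrs["_metric_name"])
```

To use it on a real file, read the content of a services.cfg under /etc/nagios/wlcg.d/<site>/ and pass it to parse_nagios_objects.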

Standard plugins and probes

  • /usr/libexec/grid-monitoring/plugins/nagios/: contains WLCG plugins, use --help for docs
    • check_config: checks if new configurations from remote Nagioses are available
    • gather_sam, gather_npm: Nagios plugin for gathering SAM results and publishing them as passive checks
    • nagios-gocdb-downtime: Nagios GOCDB downtime importer
    • recv_from_queue: Check imports messages from dir queue to Nagios as passive checks
    • send_to_db: check imports messages from dir queue to Metric Results Store
    • send_to_msg: check imports messages from dir queue to Nagios as passive checks
  • /usr/libexec/grid-monitoring/probes: contains WLCG probes. Directory trees:
    • ch.cern: FTS, LFC, RGMA
    • hr.srce: CAdistribution, DPN, DPNS, globus-GRAM, gsiftp, GridProxy, MyProxy, ResourceBroker, SRM, org.glite.wms.WMProxy, org.glite.wms.NetworkServer
    • org.bdii: BDII
    • org.ggus: GGUS
    • org.nagiosexchange: ?
    • org.sam: SRM, CREAMCE, CREAMCEDJS, CE, WN, WMS
    • org.sam.sec: security tests
  • /usr/libexec/grid-monitoring/probes/org.sam/wnjob: contains the Nagios distribution to be run on the WN
    • nagrun.sh: script to launch Nagios on the worker node
    • org.sam.gridJob.jdl.template: JDL template to send jobs to CEs
    • org.sam.gridJob.WMS.jdl.template: JDL template to test the WMS
    • org.sam.gridJob.CREAMCE-djs.jdl.template: JDL template for direct submission to CREAM
    • nagios.d/: standard Nagios files
    • org.sam/: contains configuration and probes for the standard worker node tests
      • probes/org.sam: contains the standard probes and some commands
        • check_pyver: checks the Python version
        • checks_lib.sh: some utility functions for Bash scripts
        • nagtest-run: wrapper script for "semi"-Nagios checks
        • samtest-run: wrapper script for SAM checks
        • sam/: contains the SAM tests

Setup your project

  • Choose your development machine. Subversion clients are available on LXPLUS and the Nagios framework is available on the Nagios machines dedicated to each experiment. For example, you could code on LXPLUS and test the probes on the Nagios box, provided that AFS is installed.
    • Note: to log on your Nagios box (sam-<vo>.cern.ch) as root you have to first log on gd01.cern.ch as yourself and from there use option 3).
  • Prepare your skeleton project. The following structure - available for LHCb - is strongly recommended:
[lxplus308] ~/scratch0/nagios $ tree project/
project/
|-- CHANGES
|-- Makefile
|-- dist
|   `-- grid-monitoring-probes-org.lhcb-0.0.1
|       `-- usr
|           `-- libexec
|               `-- grid-monitoring
|                   `-- probes
|                       `-- org.lhcb
|                           `-- wnjob
|                               `-- org.lhcb
|                                   |-- etc
|                                   |   `-- wn.d
|                                   `-- probes
|                                       `-- org.lhcb
|                                           |-- SRM-lhcb-FileAccess
|                                           |-- WN-sft-brokerinfo
|                                           |-- WN-sft-csh
|                                           |-- WN-sft-lcg-rm-gfal
|                                           |-- WN-sft-vo-swdir
|                                           |-- WN-sft-vo-swdir~
|                                           `-- WN-sft-voms
|-- grid-monitoring-probes-org.lhcb.spec
`-- src
    |-- README
    |-- SRM-probe
    `-- wnjob
        |-- org.lhcb
        |   |-- etc
        |   |   `-- wn.d
        |   |       `-- org.lhcb
        |   |           |-- commands.cfg
        |   |           `-- services.cfg
        |   `-- probes
        |       `-- org.lhcb
        |           |-- SRM-lhcb-FileAccess
        |           |-- WN-sft-brokerinfo
        |           |-- WN-sft-csh
        |           |-- WN-sft-lcg-rm-gfal
        |           |-- WN-sft-vo-swdir
        |           `-- WN-sft-voms
        `-- org.lhcb.gridJob.jdl.template


Please note that the SRM-probe script, the wrapper for running SRM passive and active checks, sits at the same level as the wnjob directory, which contains the active checks for CE, CREAMCE and the WN that require submission through the Nagios framework.

Write your Worker Node tests and configure them in Nagios

This section explains how to convert the SAM tests that you already submit regularly with the SAM framework into active checks on the WN. It also explains how to test your probes and verify that everything has been migrated properly.

  • Edit your test probes under org.lhcb/src/wnjob/org.lhcb/probes/. The code can be written in any language. For example one can take the original CE-sft-*, CE-lhcb-* SAM tests and modify them following this simple recipe:
    • Add at the beginning the following enumeration:
      • NAGIOS_OK=0
      • NAGIOS_WARNING=1
      • NAGIOS_ERROR=2
    • Replace all SAM_* exit codes with the corresponding NAGIOS_* return codes.
    • Remove all echo <pre> and echo </pre> instructions
    • Try to keep the test output as concise as possible
    • Ensure that the very first line of your test output is the summary of the test itself. In your old SAM tests this is usually the last line. You can therefore either use Konstantin's flip utility, or collect all the output in a variable and echo it at a convenient time, after the summary has been printed.
  • Add to the services.cfg file under org.lhcb/src/wnjob/org.lhcb/etc/wn.d/org.lhcb/ something similar to (assuming you are adapting the CE-sft-vo-swdir test):
define service{
   use         sam-generic-wn-active
   service_description   org.lhcb.WN-sft-vo-swdir-<VO>
   check_command      wn-sft-vo-swdir
   }

This file tells Nagios which tests (which become services in Nagios) to run on the WN. In this example there are no dependencies between the services and they are all active. Each block refers to a service, and the value of the check_command key refers to a command in commands.cfg. Note: in the example above the service_description ends with <VO>. However, if you defined the FQAN in your profile in the Nagios configurator configuration file (/etc/ncg/ncg.conf), Nagios will try to use VOMS certificates, and this should be reflected in services.cfg as:

define service{
   use         sam-generic-wn-active
   service_description   org.lhcb.WN-sft-vo-swdir-<VOMS>
   check_command      wn-sft-vo-swdir
   }

The commands.cfg file must contain something like:

define command {
        command_name    wn-sft-vo-swdir
        command_line    $USER3$/org.lhcb/WN-sft-vo-swdir
}
Where:
  • $USER3$ = <nagiosRoot>/probes

The following options might be used:

  • -w <dir>: working directory
  • -d <dir>: directory containing the SAM sensors
  • -s <sensor>: sensor name
  • -m <test>: test to run
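
The conversion recipe given earlier in this section (Nagios return codes, no <pre> markers, summary on the first line) can be sketched as follows. This is only an illustration in Python 3, since the probes can be written in any language; the helper names summary_first and run_check and the stand-in test check_vo_swdir are hypothetical, and the message strings are invented for the example.

```python
import sys

# Nagios return codes, named as in the enumeration of the recipe.
NAGIOS_OK = 0
NAGIOS_WARNING = 1
NAGIOS_ERROR = 2


def summary_first(sam_output):
    """Move the last line (the old SAM summary) to the top of the output.

    Empty lines and the <pre> / </pre> markers echoed by SAM tests are dropped.
    """
    lines = [
        line for line in sam_output.splitlines()
        if line.strip() and line.strip() not in ("<pre>", "</pre>")
    ]
    if not lines:
        return "no output produced by the test"
    return "\n".join([lines[-1]] + lines[:-1])


def run_check(test):
    """Run test(), which returns (return_code, output), and print the report."""
    code, output = test()
    print(summary_first(output))
    return code


def check_vo_swdir():
    """Hypothetical stand-in for a converted SAM test."""
    output = "<pre>\nChecking the VO software directory\n</pre>\nsoftware directory is accessible"
    return NAGIOS_OK, output


if __name__ == "__main__":
    sys.exit(run_check(check_vo_swdir))
```

Collecting the output before printing it is the "pipe everything into a variable" option of the recipe; the exit status of the script is what Nagios interprets as the result of the check.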

NCG

The ncg.pl command generates the WLCG-specific configuration based on its configuration, under /etc/ncg/, and the Hash.pm file.

How to choose on which sites to run Nagios

To include specific sites edit the file /etc/ncg/ncg.localdb and add lines like:
ADD_SITE!sitename
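
When many sites have to be included, the lines can be generated rather than typed. The sketch below (Python 3) only produces text in the ADD_SITE!sitename form shown above; the function name add_site_lines is an assumption, and writing the result into /etc/ncg/ncg.localdb is left to the operator.

```python
def add_site_lines(sites):
    """Return one ADD_SITE!<sitename> line per site, skipping blanks and duplicates."""
    seen = []
    for site in sites:
        site = site.strip()
        if site and site not in seen:
            seen.append(site)
    return ["ADD_SITE!%s" % site for site in seen]


if __name__ == "__main__":
    # CIEMAT-LCG2 is the site name used in the examples of this document.
    print("\n".join(add_site_lines(["CIEMAT-LCG2"])))
```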

Creating your SVN project and configuring your clients

It is recommended to store your probes in the CERN Subversion repository. The repository can be accessed using:

$ svn co svn+ssh://svn.cern.ch/reps/it-es-vos

Access to the web interface is via https://svnweb.cern.ch/VCS.

Validating your probes (development mode)

  • You have to log into your Nagios machine (sam-lhcb.cern.ch as root and then "su -l nagios").
  • The following fairly complex command line sets all options that are usually hidden by Nagios when running in production (as RPM installed probes).
> /usr/libexec/grid-monitoring/probes/org.sam/CE-probe -m org.sam.CE-JobState -H dummy.host.ce --mb-destination /a/c --no-submit -v 3 --add-wntar-nag /afs/cern.ch/user/s/santinel/scratch0/nagios/project/src/wnjob/org.lhcb --add-wntar-nag-nosam --vo lhcb --namespace org.lhcb -x /afs/cern.ch/user/s/santinel/scratch0/nagios/project/src/wnjob/x509up_u7442
This command creates, under /var/run/gridprobes/lhcb/org.lhcb/CE/dummy.host.ce/, a series of directories and files that Nagios uses to bundle the grid job (JDL, tarball, executable, options) to be submitted to the remote WN. The executable is nagrun.sh, which takes some default options that can be modified as needed. We recommend running this command first and checking that everything is properly set before submitting to a real site. If everything is set up properly, the output of the above command will look like:
OK: Asked not to submit job. Bailing out
OK: Asked not to submit job. Bailing out
Testing from: samnag015.cern.ch
DN: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=santinel/CN=564059/CN=Roberto Santinelli/CN=proxy
VOMS FQANs: /lhcb/Role=NULL/Capability=NULL
Preparing for job submission
WMS to be used
System default.
Creating WN tarball
/var/run/gridprobes/lhcb/org.lhcb/CE/dummy.host.ce/gridjob.tgz
Adding externally provided directories:
/afs/cern.ch/user/s/santinel/scratch0/nagios/project/src/wnjob/org.lhcb
cp -rf /afs/cern.ch/user/s/santinel/scratch0/nagios/project/src/wnjob/org.lhcb/* /var/run/gridprobes/lhcb/org.lhcb/CE/dummy.host.ce/nagios.d
Creating the archive ... done.
JDL patterns for template substitutions
jdlInputSandboxExecutable : /var/run/gridprobes/lhcb/org.lhcb/CE/dummy.host.ce/nagrun.sh
jdlRetryCount : 0
jdlInputSandboxTarball : /var/run/gridprobes/lhcb/org.lhcb/CE/dummy.host.ce/gridjob.tgz
jdlExecutable : nagrun.sh
jdlShallowRetryCount : 1
jdlReqCEInfoHostName : dummy.host.ce
jdlArguments : -v lhcb  -d /a/c  -t 600 -w 1 -l prod-lfc-shared-central.cern.ch -s samdpm002.cern.ch
JDL template:
file: /usr/libexec/grid-monitoring/probes/org.sam/wnjob/org.sam.gridJob.jdl.template
[
Type="Job";
JobType="Normal";
Executable = "<jdlExecutable>";
StdError = "gridjob.out";
StdOutput = "gridjob.out";
Arguments = "<jdlArguments>";
InputSandbox =  {"<jdlInputSandboxExecutable>", "<jdlInputSandboxTarball>"};
OutputSandbox = {"gridjob.out"};
RetryCount = <jdlRetryCount>;
ShallowRetryCount = <jdlShallowRetryCount>;
Requirements = other.GlueCEInfoHostName == "<jdlReqCEInfoHostName>";
]
Resulting JDL:
[
Type="Job";
JobType="Normal";
Executable = "nagrun.sh";
StdError = "gridjob.out";
StdOutput = "gridjob.out";
Arguments = "-v lhcb  -d /a/c  -t 600 -w 1 -l prod-lfc-shared-central.cern.ch -s samdpm002.cern.ch";
InputSandbox =  {"/var/run/gridprobes/lhcb/org.lhcb/CE/dummy.host.ce/nagrun.sh", "/var/run/gridprobes/lhcb/org.lhcb/CE/dummy.host.ce/gridjob.tgz"};
OutputSandbox = {"gridjob.out"};
RetryCount = 0;
ShallowRetryCount = 1;
Requirements = other.GlueCEInfoHostName == "dummy.host.ce";
]
Saving JDL to /var/run/gridprobes/lhcb/org.lhcb/CE/dummy.host.ce/gridJob.jdl ... done.
Job submit command
glite-wms-job-submit -a  -o /var/run/gridprobes/lhcb/org.lhcb/CE/dummy.host.ce/jobID /var/run/gridprobes/lhcb/org.lhcb/CE/dummy.host.ce/gridJob.jdl

Asked not to submit job. Bailing out ... Asked not to submit job. Bailing out

More information on the above (important) command is available here.
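
Because the CE-probe command line is long, it may be convenient to assemble it programmatically. The sketch below (Python 3) uses only options that appear in the command lines of this section; the function names ce_probe_argv and as_shell are hypothetical, and the paths in the usage part are placeholders to be replaced with your own project directory and proxy.

```python
import shlex

CE_PROBE = "/usr/libexec/grid-monitoring/probes/org.sam/CE-probe"


def ce_probe_argv(host, wntar_dir, proxy, mb_destination,
                  mb_uri=None, wms=None, submit=False):
    """Assemble the CE-probe command line as a list of arguments."""
    argv = [CE_PROBE, "-m", "org.sam.CE-JobState", "-H", host,
            "--mb-destination", mb_destination]
    if mb_uri:
        argv += ["--mb-uri", mb_uri]
    if not submit:
        # Dry run: build the job bundle but do not submit it.
        argv += ["--no-submit", "-v", "3"]
    argv += ["--add-wntar-nag", wntar_dir, "--add-wntar-nag-nosam",
             "--vo", "lhcb", "--namespace", "org.lhcb"]
    if wms:
        argv += ["--wms", wms]
    return argv + ["-x", proxy]


def as_shell(argv):
    """Render the argument list as a single, safely quoted shell command."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    # Placeholder paths, to be adapted to your own working area.
    print(as_shell(ce_probe_argv("dummy.host.ce",
                                 "/path/to/project/src/wnjob/org.lhcb",
                                 "/path/to/proxy", "/a/c")))
```

The resulting list can be passed directly to subprocess.call when run as the nagios user on the Nagios box.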

  • Submit a real Nagios probe to a real site and check the results on the Message Bus. The following command submits through a specified WMS server to a specified CE using specified credentials, and ships only the probes created under the org.lhcb project on AFS (by default the org.sam probes are submitted). Please note that the Message Bus URI option (--mb-uri) requires a fully qualified URI (protocol, hostname and port), in the example stomp://gridmsg002.cern.ch:6163:
/usr/libexec/grid-monitoring/probes/org.sam/CE-probe -m org.sam.CE-JobState -H ce123.cern.ch --mb-destination /queue/test.org.lhcb --mb-uri stomp://gridmsg002.cern.ch:6163 --add-wntar-nag /afs/cern.ch/user/s/santinel/scratch0/nagios/project/src/wnjob/org.lhcb --add-wntar-nag-nosam --vo lhcb --namespace org.lhcb --wms wms203.cern.ch -x /var/log/nagios/our.pem

and the output produced is:

OK: [Submitted]
OK: [Submitted]

Connecting to the service https://wms203.cern.ch:7443/glite_wms_wmproxy_server


====================== glite-wms-job-submit Success ======================

The job has been successfully submitted to the WMProxy
Your job identifier is:

https://wms203.cern.ch:9000/5CAVZQmA3aJEFzh6eWR81A

The job identifier has been saved in the following file:

  • Check that your probes are producing meaningful output. There are at least two ways to do so:
    • directly via the Apache web server installed on the Nagios box you are sending messages to, checking them at https://gridmsg002.cern.ch/admin/queues.jsp. Select the destination queue specified in the above command line (--mb-destination option); in the example this was set to /queue/test.org.lhcb. You will see the number of active consumers (0) and the messages produced, and you can browse them by clicking on the queue.
    • alternatively, you can subscribe as an "active consumer" via telnet, as shown in the following example session:
  > telnet gridmsg002.cern.ch 6163
Trying 128.142.178.147...
Connected to gridmsg002.cern.ch.
Escape character is '^]'.
CONNECT

^@   (to create this character CTRL-SHIFT-2)
CONNECTED
session:ID:gridmsg002.cern.ch-41905-1274272973929-2:1780742


SUBSCRIBE
destination:/queue/test.org.sam

^@

SEND
destination:/queue/test.org.sam

hello
^@
MESSAGE
message-id:ID:gridmsg002.cern.ch-41905-1274272973929-2:1780742:-1:1:1
destination:/queue/test.org.sam
timestamp:1274884331503
expires:0
priority:0

hello

^]q

telnet> q
Connection closed.
[kvs]:~ > telnet gridmsg002.cern.ch 6163
Trying 128.142.178.147...
Connected to gridmsg002.cern.ch.
Escape character is '^]'.
CONNECT

^@
CONNECTED
session:ID:gridmsg002.cern.ch-41905-1274272973929-2:1781022


SUBSCRIBE
destination:/queue/test.org.sam

^@
^]q

telnet> q
Connection closed.

  • You can add further probes at any time. For example, if you have created a new wn.d probe called SRM-lhcb-FileAccess, modify services.cfg and commands.cfg accordingly and import the probe into SVN with:
svn import SRM-lhcb-FileAccess https://www.sysadmin.hep.ac.uk/svn/grid-monitoring/trunk/probe/org.lhcb/src/wnjob/org.lhcb/probes/org.lhcb/SRM-lhcb-FileAccess 
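
The telnet session shown earlier in this section follows the STOMP frame layout: a command line, optional "name:value" header lines, an empty line, an optional body, and a terminating NUL byte (the ^@ typed with CTRL-SHIFT-2). The following Python 3 sketch builds and parses such frames without opening any network connection; the function names stomp_frame and parse_frame are assumptions made for illustration.

```python
NUL = b"\x00"  # displayed as ^@ in the telnet session


def stomp_frame(command, headers=None, body=""):
    """Build a frame: command, header lines, empty line, body, NUL."""
    lines = [command]
    for key, value in (headers or {}).items():
        lines.append("%s:%s" % (key, value))
    head = "\n".join(lines) + "\n\n"
    return head.encode("utf-8") + body.encode("utf-8") + NUL


def parse_frame(raw):
    """Split one received frame into (command, headers, body)."""
    text = raw.split(NUL, 1)[0].decode("utf-8").lstrip("\n")
    head, _, body = text.partition("\n\n")
    lines = head.split("\n")
    # Split on the first ':' only, since values such as message-id contain ':'.
    headers = dict(line.split(":", 1) for line in lines[1:])
    return lines[0], headers, body


if __name__ == "__main__":
    queue = "/queue/test.org.sam"
    print(repr(stomp_frame("CONNECT")))
    print(repr(stomp_frame("SUBSCRIBE", {"destination": queue})))
    print(repr(stomp_frame("SEND", {"destination": queue}, "hello")))
```

These byte strings are what would be written to a socket connected to the broker port (6163 in the example) in place of the text typed by hand in telnet.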

Writing your SRM probe

Introduction

Your SRM-probe has to be created at the same level as org.lhcb/src/wnjob. Unlike the CE probes, there is little freedom in the choice of programming language. This section provides some tips on how to integrate your VO-specific code into the Nagios framework. The SRM-probe is a Python script that wraps all SRM metrics and complies with Nagios. By default this wrapper invokes predefined metrics that come with org.sam, available in a library written by the ops people. The structure of these active and passive checks is well documented at https://twiki.cern.ch/twiki/bin/view/LCG/SAMProbesMetrics#SRM. To understand the choices made in the rest of this section, recall that for ops the metric org.sam.SRM-All is the only "active" metric, while all the others are "passive" metrics (i.e. the execution of each metric depends on the results of the others, and its result is reported passively from the results of lower-level metrics). Before continuing, I strongly suggest you read https://twiki.cern.ch/twiki/bin/view/LCG/SAMToNagios#Python_based_probes_using_gridmo, where you can find the basics needed to understand how this wrapper works and which packages have to be imported to create your own SRM-probe.

Running an "ops" SRM probe

You can already run the existing metrics that come by default with your Nagios box installation (org.sam, code available under /usr/lib/python2.4/site-packages/gridmetrics/srmmetrics.py). The syntax in the following example is the same one you will use to run your future custom VO-specific metrics.

./SRM-probe -m org.sam.SRM-Put -x /tmp/nagios -H srm-lhcb.cern.ch --vo lhcb

This command explicitly asks to run the SRM-Put metric against the SRM endpoint for lhcb at CERN, as an lhcb member and using a proxy located under /tmp/nagios. The SRM-Put Python code is available in the module srmmetrics.py of the package gridmetrics, which contains similar code for other services as well.

Writing your own SRM-Probe

  • (On the Nagios box) Copy the template (an empty probe) found under /usr/libexec/grid-monitoring/probes/org.sam/T-probe into your working area (org.lhcb/src) and name it SRM-probe. This gives you the generic structure expected by Nagios.
An example at http://www.sysadmin.hep.ac.uk/svn/grid-monitoring/trunk/probe/org.lhcb/src/SRM-probe shows a (work in progress) SRM-probe for LHCb with the relevant parts to modify. Here I will just describe what you have to do to enable your own metrics for SRM.
  • Change the flavor of the metric gatherer replacing "T" with "SRM" as in the example
    def __init__(self, tuples):
        probe.MetricGatherer.__init__(self, tuples, 'SRM')
        # command line parameters required by the probe/metrics
        # and usage hints; only "long" parameters MUST be used
        self.usage = """     Metrics specific parameters:

  • Importing non-standard modules: if a module is not found in the PYTHONPATH, the Python code would crash without reporting a proper return code. To avoid this, we suggest wrapping your import block in a try...except structure. If a module cannot be loaded, the probe then exits gracefully with exit code 3 (UNKNOWN) and a coherent message.

import sys  # needed for sys.exit() in the except branch

try:
    from DIRAC.Core.Base.Script                     import initialize
    initialize( enableCommandLine = False )
    from DIRAC.Resources.Storage.StorageFactory     import StorageFactory
    from DIRAC.Core.Utilities.File                  import getSize
    from gridmon import probe
    from gridmon import utils
    from gridmon import gridutils
    from DIRAC import gLogger
    gLogger.setLevel('FATAL') #shut up DIRAC
except ImportError,e:
    # Python 2.4 syntax, as installed on the Nagios box
    print "UNKNOWN: Error loading modules : %s" % str(e)
    sys.exit(3)

The main class is SRMMetrics. Through inheritance from the base class MetricGatherer (available in the module "probe"), it provides the ancillary methods needed to comply with Nagios. These include utility methods such as "parse_cmd_args(tuples)" for parsing the input parameters, or "self.make_workdir()", which can create (if the user wishes) a working directory identifying each service/site/implementation to be tested (for example a unique working area for each space token at each site). This may be useful to minimize clashes between tests running concurrently. There are also printing methods for Nagios-compliant output: prints(string) prints the summary line (the very first one) and printd(string) prints the detailed report (everything following the summary).

An important member of this class, however, is the dictionary of dictionaries self._metrics, where you declare which metrics belong to the class.

       self._metrics = {
                        'PutExistsFile' : {
                                # required keys
                                'metricDescription' : "Put a file and check its existence on SRM.",
                                'metricLocality'    : 'local',
                                'metricType'        : 'status',
                                'metricVersion'     : '0.1',
                                # optional keys - example
                                'cmdLineOptions'    : ['file=',
                                                       'space-token=',],
                                'metricChildren'    : []
                                },
                        'PutRemoveFile' : {
                                # required keys
                                'metricDescription' : "Put a file and remove afterwards on SRM.",
                                'metricLocality'    : 'local',
                                'metricType'        : 'status',
                                'metricVersion'     : '0.1',
                                # optional keys - example
                                'cmdLineOptions'    : ['file=',
                                                       'space-token=',],
                                'metricChildren'    : []
                                }
                        }

The two metrics in the example, PutExistsFile and PutRemoveFile, are two active checks (they do not depend on any other metric); both require the two arguments "file=" and "space-token=", and a description has to be provided. To allow the class SRMMetrics to handle these non-standard options properly, the method "parse_args" has been modified:

    def parse_args(self, opts):
        """Store the values of the metric-specific command line options."""
        for o,v in opts:
            if o == '--file':
                self.srcFile = v
            elif o == '--space-token':
                self.sp_token = v 
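
The 'file=' and 'space-token=' entries use the long-option notation of the Python getopt module, where a trailing '=' means that the option takes a value. Assuming the framework parses the command line in getopt style (an assumption, not verified against the gridmon code), the (option, value) pairs received by parse_args can be reproduced with this standalone Python 3 sketch. The METRICS dictionary is a reduced copy of the structure above, and the Settings class is a hypothetical stand-in for the probe class.

```python
import getopt

# Reduced, hypothetical copy of the self._metrics structure.
METRICS = {
    "PutExistsFile": {"cmdLineOptions": ["file=", "space-token="]},
    "PutRemoveFile": {"cmdLineOptions": ["file=", "space-token="]},
}


def long_options(metrics):
    """Collect the distinct long options declared by all metrics."""
    options = []
    for definition in metrics.values():
        for option in definition.get("cmdLineOptions", []):
            if option not in options:
                options.append(option)
    return options


class Settings(object):
    """Stand-in for the probe class; only parse_args matters here."""
    srcFile = None
    sp_token = None

    def parse_args(self, opts):
        for o, v in opts:
            if o == "--file":
                self.srcFile = v
            elif o == "--space-token":
                self.sp_token = v


argv = ["--file", "/etc/passwd", "--space-token", "CERN-USER"]
opts, _ = getopt.getopt(argv, "", long_options(METRICS))
settings = Settings()
settings.parse_args(opts)
print(settings.srcFile, settings.sp_token)
```

Note that getopt reports each option with its leading "--", which is why parse_args compares against '--file' and '--space-token' rather than the names listed in cmdLineOptions.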

Finally, here is the code of the metric itself. Please bear in mind that it could also live in another module that you invoke from within the probe. We just want to highlight a few simple rules of thumb on how to write it.

    def metricPutExistsFile(self):
        """ """
        rc, summary, detmsg = self.setUp()
        if rc != 'OK':
            return rc, summary, detmsg
        status = 'OK'
        srcFile=self.srcFile
        self.printd(self.lcg_gfal_ver)
        self.printd('Testing: %s' % self.hostName)
        self.printd('File to copy: %s' % self.srcFile)
        self.printd('SP Token: %s' % self.sp_token, v=2)
        srcFileSize = getSize(srcFile)
        testFileName = 'testFile.%s' % time.time()
        remoteFile = self.storage.getCurrentURL(testFileName)['Value']
        self.printd(" remote file is: "+remoteFile+"  \n")
        fileDict = {remoteFile:srcFile}
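        putFileRes = self.storage.putFile(fileDict)
        if not putFileRes['OK']:
            # report the error returned by the failed put
            return ('CRITICAL', putFileRes['Message'], putFileRes['Message'])

        existsFileRes = self.storage.exists(remoteFile)
        if not existsFileRes['OK']:
            # report the error returned by the failed existence check
            return ('CRITICAL', existsFileRes['Message'], existsFileRes['Message'])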
        
        self.prints(status) # status message

        return status

The method setUp() in the example is a method of the SRMMetrics class, added to instantiate and configure the DIRAC objects and invoked by all metrics in the lhcb SRM-probe. Its explanation is beyond the scope of this document. Please note that, by convention, the name of the method must be the concatenation of "metric" and the name of the metric (e.g. metricPutExistsFile for PutExistsFile). Note also that self.prints can be invoked at any point in the code (i.e. when a final status is reached): the Nagios framework will then adjust the output and put the string on the very first line (as required). The rest of the metric code is just (adapted) DIRAC-specific code that produces some result to be interpreted and translated into a return status code (by default 'OK'). It is important to silence the underlying methods of the experiment-specific framework, to avoid unexpected output printed outside the Nagios framework.
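
The naming convention and the ordering of the output can be illustrated with a small standalone sketch in Python 3. It is not the gridmon implementation: the class MiniGatherer, its run method and the way the "org.lhcb.SRM-" prefix is stripped are assumptions made only to show how a metric name could be mapped to a "metric" + name method, and how the summary set by prints ends up on the first line regardless of when it is called.

```python
# Standard Nagios plugin exit codes.
EXIT_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


class MiniGatherer(object):
    """Hypothetical, much simplified stand-in for a metric gatherer."""

    def __init__(self, namespace, flavour):
        self.prefix = "%s.%s-" % (namespace, flavour)  # e.g. "org.lhcb.SRM-"
        self.summary = ""
        self.details = []

    def prints(self, text):
        """Set the summary line; may be called at any point of the metric."""
        self.summary = text

    def printd(self, text):
        """Append a line to the detailed report."""
        self.details.append(text)

    def run(self, metric):
        """Call the method 'metric' + <name> and return (status, report)."""
        name = metric[len(self.prefix):] if metric.startswith(self.prefix) else metric
        method = getattr(self, "metric" + name, None)
        if method is None:
            return "UNKNOWN", "UNKNOWN: no such metric %s" % metric
        result = method()
        if isinstance(result, tuple):  # (status, summary, details)
            status, self.summary = result[0], result[1]
            self.details.append(result[2])
        else:
            status = result
        # The summary always goes first, whatever the call order was.
        return status, "\n".join([self.summary] + self.details)

    def metricPutExistsFile(self):
        self.printd("Testing: srm-lhcb.cern.ch")
        self.prints("OK")
        return "OK"


if __name__ == "__main__":
    import sys
    status, report = MiniGatherer("org.lhcb", "SRM").run("org.lhcb.SRM-PutExistsFile")
    print(report)
    sys.exit(EXIT_CODES[status])
```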

Running your own SRM-probe

The following steps are required to run the SRM-probe for lhcb, which in turn requires a proper lhcb and DIRAC environment:

source /afs/cern.ch/lhcb/software/releases/LBSCRIPTS/LBSCRIPTS_v5r1/InstallArea/scripts/LbLogin.sh
SetupProject LHCbDirac
PYTHONPATH=$PYTHONPATH:/usr/lib/python2.4/site-packages/
export X509_USER_PROXY=/tmp/nagios_px
./SRM-probe -H srm-lhcb.cern.ch -m org.lhcb.SRM-PutExistsFile --file /etc/passwd --space-token CERN-USER

I report this explicitly in case your custom metrics, as is the case for DIRAC, require special scripts to set up the environment. Please note that the PYTHONPATH for the gridmon packages has to be set again after this setup, because DIRAC resets the paths. Note also how the metric is invoked (-m option) and how the new parameters (--file and --space-token) are passed.

Installing and setting up these new probes is not a trivial procedure: you have to modify the Nagios configuration file and the Hash.pm file (/usr/lib/perl5/vendor_perl/5.8.5/NCG/LocalMetrics/Hash.pm) and restart the Nagios service. In any case, since the box is managed by quattor, manual changes will be lost at the next reconfiguration, and the procedure will also change soon, so I strongly suggest getting in touch with Konstantin when you need to register new metrics permanently on your Nagios box. For the sake of completeness, we report here the relevant part of the Hash.pm file, with the description of a new metric:

########## LHCb ##############
# SRM-All does a set of tests, and returns detailed results via Passive tests
$WLCG_SERVICE->{'org.lhcb.SRM-PutExistsFile'}->{native} = "Nagios";
$WLCG_SERVICE->{'org.lhcb.SRM-PutExistsFile'}->{config} = {%{$SERVICE_TEMPL->{60}}};
$WLCG_SERVICE->{'org.lhcb.SRM-PutExistsFile'}->{probe} = 'org.lhcb/SRM-probe';
$WLCG_SERVICE->{'org.lhcb.SRM-PutExistsFile'}->{metricset} = "org.lhcb.SRM";
$WLCG_SERVICE->{'org.lhcb.SRM-PutExistsFile'}->{dependency}->{"hr.srce.SRM2-CertLifetime"} = 1;
$WLCG_SERVICE->{'org.lhcb.SRM-PutExistsFile'}->{dependency}->{"hr.srce.GridProxy-Valid"} = 0;
$WLCG_SERVICE->{'org.lhcb.SRM-PutExistsFile'}->{attribute}->{VONAME} = "--vo";
$WLCG_SERVICE->{'org.lhcb.SRM-PutExistsFile'}->{attribute}->{VO_FQAN} = "--vo-fqan";
$WLCG_SERVICE->{'org.lhcb.SRM-PutExistsFile'}->{attribute}->{X509_USER_PROXY} = "-x";
$WLCG_SERVICE->{'org.lhcb.SRM-PutExistsFile'}->{flags}->{NOLBNODE} = 1;
$WLCG_SERVICE->{'org.lhcb.SRM-PutExistsFile'}->{flags}->{VO} = 1;
$WLCG_SERVICE->{'org.lhcb.SRM-PutExistsFile'}->{flags}->{NRPE} = 1;
$WLCG_SERVICE->{'org.lhcb.SRM-PutExistsFile'}->{flags}->{OBSESS} = 1;
$WLCG_SERVICE->{'org.lhcb.SRM-PutExistsFile'}->{docurl} = "lhcb-doc.#SRM";
$WLCG_SERVICE->{'org.lhcb.SRM-PutExistsFile'}->{parameter}->{"--space-token"} = "CERN-USER";
# this should come from ATP
########## LHCb ##############

This configuration expects the new SRM-probe to be located under the path where the RPMs are installed:

[root@samnag015 ~]# cp /afs/cern.ch/user/s/santinel/scratch0/nagios/project/src/./SRM-probe  /usr/libexec/grid-monitoring/probes/org.lhcb/SRM-probe
We forced things a bit (by doing it manually) to check that everything is properly installable, and in the end we were happy: the probe is 100% Nagios-compliant, as you can see from this output.

[root@samnag015 ~]# nagios-run-check -H srm-lhcb.cern.ch -s org.lhcb.SRM-PutExistsFile-lhcb -v -d
Executing command:
su nagios -l -c '/usr/libexec/grid-monitoring/probes/org.lhcb/SRM-probe -H "srm-lhcb.cern.ch" -t 600 --vo lhcb -x /etc/nagios/globus/userproxy.pem-lhcb --space-token CERN-USER'

Building the RPMs

RPM build

The .spec file in the org.lhcb example works perfectly: I managed to build an RPM for lhcb just by running the following command (from sam-lhcb.cern.ch):
> sudo make rpmel5
The resulting RPM contains all the probes and configuration files that Nagios needs to run these active checks once installed. Its contents can be listed with:
rpm -qilp /usr/src/redhat/RPMS/noarch/grid-monitoring-probes-org.lhcb-0.0.1-1.el5.noarch.rpm

Koji

The Nagios developers, however, suggest using the Koji utility, which keeps you aligned with the rest of the ops team as far as the release and deployment process is concerned. Koji lets you build RPMs remotely, directly from SVN, and feed a validation repository. This allows you to keep a single production instance of your Nagios box (a validation instance is indeed missing): once you are confident in the new RPMs, it is just a matter of modifying the quattor template to point to the new (validation) release of your probes. I have verified that the information on the Koji project page about installing (just "yum install koji") and configuring the Koji client on your Nagios box is exhaustive. Any further update of the Nagios box should be coordinated with Wojciech Lapka. Once the Koji client is installed (and your DN has been enabled as reported above), you should have on your Koji machine a directory $HOME/.koji with the following content:
-bash-3.00$ ls -l .koji/
total 17
-rw-r--r--  1 digirola zp 1505 Jul 20 11:09 28a58577.0
-rw-r--r--  1 digirola zp 3554 Jul 20 11:09 cern-ca.pem
-rw-r--r--  1 digirola zp  165 Jul 20 11:09 config
-rw-r--r--  1 digirola zp 4944 Jul 20 11:14 usercertkey.pem
The files 28a58577.0 and cern-ca.pem are the certificates and CRL of your Certification Authority, while usercertkey.pem contains both your user certificate and your user key merged into a single file.

Please note that the location of the certificates can be configured as needed in your .koji/config file, whose content looks like:

cat ../.koji/config 
[koji]
server = https://koji.afroditi.hellasgrid.gr/kojihub
topdir = /mnt/koji
cert = ~/.koji/usercertkey.pem
ca = ~/.koji/cern-ca.pem
serverca = ~/.koji/28a58577.0
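
Before launching a build it can help to confirm that the three certificate files named in the config are really in place. The following is a minimal Python sketch, assuming the config has the [koji] section and the cert, ca and serverca options shown above; the helper names credential_paths and missing_files are made up for this illustration and are not part of the koji client.

```python
import configparser
import os

# Sample content copied from the .koji/config shown above.
SAMPLE_CONFIG = """\
[koji]
server = https://koji.afroditi.hellasgrid.gr/kojihub
topdir = /mnt/koji
cert = ~/.koji/usercertkey.pem
ca = ~/.koji/cern-ca.pem
serverca = ~/.koji/28a58577.0
"""


def credential_paths(config_text):
    """Return {option: expanded path} for the certificate options in [koji]."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    section = parser["koji"]
    return {
        key: os.path.expanduser(section[key])
        for key in ("cert", "ca", "serverca")
        if key in section
    }


def missing_files(paths):
    """Return the option names whose file does not exist on this machine."""
    return sorted(key for key, path in paths.items() if not os.path.isfile(path))


if __name__ == "__main__":
    paths = credential_paths(SAMPLE_CONFIG)
    for key, path in paths.items():
        print(key, path)
    print("missing:", missing_files(paths))
```

To check a real installation, read the text of $HOME/.koji/config and pass it to credential_paths instead of SAMPLE_CONFIG.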

At this point you can run the command that builds the RPMs on the remote koji server (the one specified in the config file). The following example compiles and produces a noarch RPM containing the org.atlas probes, using the spec file and Makefile available in the given SVN repository:

 koji build --scratch --nowait centos5-egee 'svn+http://www.sysadmin.hep.ac.uk/svn/grid-monitoring/trunk/probe/org.atlas/#HEAD'

To run this command just log in to gdui.cern.ch (where the koji client is pre-installed). The output will look like:

Created task: 7833
Task info: http://localhost/koji/taskinfo?taskID=7833
The task id (7833 in this example) can be monitored at the koji link. If successful, the task results in a package in the centos5-vo_monitor-devel koji tag. In order to build and tag (without scratch) you should send the names of the packages to ctria@GRIDNOSPAMPLEASE.AUTH.GR and then run the same command without the --scratch option; for VO grid probes the tag must be centos5-vo_monitor:
[santinel@lxb7962 ~]$ koji build  --nowait centos5-vo_monitor 'svn+http://www.sysadmin.hep.ac.uk/svn/grid-monitoring/trunk/probe/org.lhcb/#HEAD'
Created task: 7880
Task info: http://localhost/koji/taskinfo?taskID=7880
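
The "Task info" line printed by the client points at localhost, so when scripting builds it can be handy to extract the task id and rebuild the link against the real web front end. Below is a small Python sketch, assuming the output format shown above ("Created task: N", or "Created task N" as printed by tag-pkg); the function names and the web base URL passed in the example are hypothetical.

```python
import re


def parse_task_id(output):
    """Extract the numeric task id from a 'Created task' line, or return None."""
    match = re.search(r"^Created task:?\s+(\d+)\s*$", output, re.MULTILINE)
    return int(match.group(1)) if match else None


def task_url(web_base, task_id):
    """Build a taskinfo link; web_base is whatever koji web URL you use."""
    return f"{web_base.rstrip('/')}/taskinfo?taskID={task_id}"


if __name__ == "__main__":
    sample = (
        "Created task: 7833\n"
        "Task info: http://localhost/koji/taskinfo?taskID=7833\n"
    )
    task = parse_task_id(sample)
    # The host below is a placeholder, not a confirmed address.
    print(task_url("http://koji.example.org/koji", task))
```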
The next step, once you feel the package is ready for testing, is to tag it for testing:
koji tag-pkg centos5-vo_monitor-testing <buildname>
where <buildname> is the name of the build as it appears in the koji web interface. A practical example:
[santinel@lxb7962 ~]$ koji tag-pkg centos5-vo_monitor grid-monitoring-probes-org.lhcb-0.0.2-1.el5
Created task 7886
Watching tasks (this may be safely interrupted)...
7886 tagBuild (noarch): free
7886 tagBuild (noarch): free -> open (koji-builder01.grid.auth.gr)
7886 tagBuild (noarch): open (koji-builder01.grid.auth.gr) -> closed
  0 free  0 open  1 done  0 failed

7886 tagBuild (noarch) completed successfully

The following koji tags are available:

  • centos5-vo_monitor-build: The building environment for vo_monitor software
  • centos5-vo_monitor-devel: The target repository for all centos5-vo_monitor builds
  • centos5-vo_monitor-testing: The tag you could use to tag the builds for testing repository
  • centos5-vo_monitor: The tag you could use to tag the builds for release repository
Once the package has been tagged for testing, the release repository admins tag the build as released when it is ready for release:
koji tag-pkg centos5-vo_monitor <buildname>

Each tag may result in a YUM repository, so the above procedure gives us the following yum repositories:

  • centos5-vo_monitor-devel for new builds
  • centos5-vo_monitor-testing for builds in testing phase
  • centos5-vo_monitor for builds that are released to public

Testing your new RPMs

Before installing permanently you want to be sure the code is working. Follow these steps:
  • Log in as root to your Nagios box
  • Download the RPM from Koji
[root@samnag015 ~]# wget http://koji.afroditi.hellasgrid.gr/packages/grid-monitoring-probes-org.lhcb/0.0.2/1.el5/noarch/grid-monitoring-probes-org.lhcb-0.0.2-1.el5.noarch.rpm
--2010-07-21 17:31:57--  http://koji.afroditi.hellasgrid.gr/packages/grid-monitoring-probes-org.lhcb/0.0.2/1.el5/noarch/grid-monitoring-probes-org.lhcb-0.0.2-1.el5.noarch.rpm
Resolving koji.afroditi.hellasgrid.gr... 195.251.55.109
Connecting to koji.afroditi.hellasgrid.gr|195.251.55.109|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12752 (12K) [application/x-rpm]
Saving to: `grid-monitoring-probes-org.lhcb-0.0.2-1.el5.noarch.rpm'

100%[==================================================================================================================================================>] 12,752      --.-K/s   in 0.1s    

2010-07-21 17:31:57 (94.1 KB/s) - `grid-monitoring-probes-org.lhcb-0.0.2-1.el5.noarch.rpm' saved [12752/12752]

  • Install it
[root@samnag015 ~]# rpm -ivh grid-monitoring-probes-org.lhcb-0.0.2-1.el5.noarch.rpm 
Preparing...                ########################################### [100%]
   1:grid-monitoring-probes-########################################### [100%]
and verify that the code has been put in the right directory:
[root@samnag015 ~]# ls -l /usr/libexec/grid-monitoring/probes/org.lhcb/
total 40
-rwxr-xr-x 1 root root 26799 Jul 21 16:28 SRM-probe
drwxr-xr-x 3 root root  4096 Jul 21 17:32 wnjob
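
The download URL used in the wget step appears to follow the pattern packages/<name>/<version>/<release>/<arch>/<buildname>.<arch>.rpm. The Python sketch below rebuilds it from a build name; the pattern is inferred from the single example above, not from Koji documentation, and the function names are made up for this illustration.

```python
def split_nvr(build_name):
    """Split 'name-version-release' on the last two dashes."""
    parts = build_name.rsplit("-", 2)
    if len(parts) != 3:
        raise ValueError(f"not a name-version-release string: {build_name!r}")
    return tuple(parts)


def rpm_url(build_name, arch="noarch",
            base="http://koji.afroditi.hellasgrid.gr/packages"):
    """Rebuild the package URL following the layout seen in the wget example."""
    name, version, release = split_nvr(build_name)
    return f"{base}/{name}/{version}/{release}/{arch}/{build_name}.{arch}.rpm"


if __name__ == "__main__":
    print(rpm_url("grid-monitoring-probes-org.lhcb-0.0.2-1.el5"))
```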

Now submit your test using the installed path to your probes directory

/usr/libexec/grid-monitoring/probes/org.sam/CE-probe -m org.sam.CE-JobState -H ce125.cern.ch --mb-destination /queue/test.org.lhcb --mb-uri stomp://gridmsg002.cern.ch:6163 --add-wntar-nag /usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb --add-wntar-nag-nosam --vo lhcb --namespace org.lhcb --wms wms203.cern.ch -x /tmp/x509up_u7442
or for the SRM tests:

-sh-3.2$ /afs/cern.ch/user/d/digirola/public/nagios_atlas/project/src/SRM/org.atlas/src/SRM-probe -H srmatlas.pic.es -m org.atlas.SRM-All  -x /afs/cern.ch/user/d/digirola/public/nagios_atlas/mypr --vo atlas  --pass-check-dest active --stdout
The options --pass-check-dest active --stdout make the probe output more verbose.

Configuring your Nagios box to run your custom probes.

Now that your installed tests work and you are happy with them, you need to instruct Nagios to run them. First of all, define the sites where you want to run. You have to modify the file /etc/ncg/ncg.localdb. In this example we explicitly require the tests to run at 7 big centres and only there.
[root@samnag013 ~]# cat /etc/ncg/ncg.localdb
#
# Local Rules file to modify NCG configuration
#
ADD_SITE!NIKHEF-ELPROD
ADD_SITE!IN2P3-CC
ADD_SITE!pic
ADD_SITE!CERN-PROD
ADD_SITE!INFN-T1
ADD_SITE!RAL-LCG2
ADD_SITE!FZK-LCG2
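
To double-check which sites a rules file adds, the ADD_SITE lines can be parsed with a few lines of Python. This is only a sketch: it assumes the RULE!value layout and the # comments visible above, ignores any other rule type, and the function name added_sites is invented for the example.

```python
def added_sites(text):
    """Return the site names listed in ADD_SITE!<site> rules, in file order."""
    sites = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rule, _, value = line.partition("!")
        if rule == "ADD_SITE" and value:
            sites.append(value)
    return sites


SAMPLE = """\
#
# Local Rules file to modify NCG configuration
#
ADD_SITE!NIKHEF-ELPROD
ADD_SITE!IN2P3-CC
ADD_SITE!pic
ADD_SITE!CERN-PROD
ADD_SITE!INFN-T1
ADD_SITE!RAL-LCG2
ADD_SITE!FZK-LCG2
"""

if __name__ == "__main__":
    print(added_sites(SAMPLE))
```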
The ncg.localdb file is referenced from the main configuration file /etc/ncg/ncg.conf, in the section concerning topology. In this example we ask Nagios to run on the sites specified in the previous file and on the sites in the CERN ROC.

...
<NCG::SiteSet>
 <GOCDB>
    ROC=CERN
 </GOCDB>
  <File>
      DB_FILE=/etc/ncg/ncg.localdb
      DB_DIRECTORY=/etc/ncg/ncg-localdb.d
  </File>

</NCG::SiteSet>

<NCG::SiteContacts>
  <GOCDB/>
  <GOCDB>
    CONTACT_TYPE=alarm
  </GOCDB>
  <GOCDB>
    CONTACT_TYPE=roc
    ROC=CERN
  </GOCDB>
</NCG::SiteContacts>

Be aware that the same information apparently has to be set in several places in ncg.conf:

....

<NCG::SiteInfo>

  <SAM>
    TIMEOUT=600
  </SAM>
  <File>
    DB_FILE=/etc/ncg/ncg.localdb
    DB_DIRECTORY=/etc/ncg/ncg-localdb.d
  </File>
</NCG::SiteInfo>

For completeness, here are the uncommented parts of the ATLAS ncg.conf file:

[root@samnag013 ~]# cat /etc/ncg/ncg.conf |grep -v ^#


<NCG::SiteSet>
 <GOCDB>
    ROC=CERN
 </GOCDB>
  <File>
      DB_FILE=/etc/ncg/ncg.localdb
      DB_DIRECTORY=/etc/ncg/ncg-localdb.d
  </File>

</NCG::SiteSet>

<NCG::SiteContacts>
  <GOCDB/>
  <GOCDB>
    CONTACT_TYPE=alarm
  </GOCDB>
  <GOCDB>
    CONTACT_TYPE=roc
    ROC=CERN
  </GOCDB>
</NCG::SiteContacts>


<NCG::ConfigGen>

  # Second level block defines specific module
  # parameters passed at invocation.

  <Nagios>
    NAGIOS_SERVER = sam-atlas.cern.ch
    MYPROXY_SERVER = myproxy.cern.ch
    MYPROXY_NAME = NagiosRetrieve-sam-atlas.cern.ch
    MYPROXY_USER = nagios
    GLITE_VERSION=32
    PROBES_TYPE=local
    TEMPLATES_DIR = /usr/share/grid-monitoring/config-gen/nagios
    OUTPUT_DIR = /etc/nagios/wlcg.d
    NRPE_OUTPUT_DIR = /etc/nagios/nrpe/
    NRPE_UI = 
    NAGIOS_ROLE = vo
    INCLUDE_EMPTY_HOSTS = 0
    ENABLE_NOTIFICATIONS = 1
    NAGIOS_ADMIN = root@localhost
    VO = atlas
    ROC=CERN
    LOCAL_METRIC_STORE = 1
  </Nagios>
</NCG::ConfigGen>


<NCG::SiteInfo>

  <SAM>
    TIMEOUT=600
  </SAM>
  <File>
    DB_FILE=/etc/ncg/ncg.localdb
    DB_DIRECTORY=/etc/ncg/ncg-localdb.d
  </File>
</NCG::SiteInfo>

<NCG::ConfigPublish>
  <ConfigCache>
     VO          = atlas
     NAGIOS_ROLE = vo
     NAGIOS_SERVER = sam-atlas.cern.ch
  </ConfigCache>
</NCG::ConfigPublish>

<NCG::LocalMetricsAttrs>
  <LDAP>
    LDAP_ADDRESS=$BDII
  </LDAP>

  <Active>
    GLITE_VERSION=32
  </Active>

  <File>
    DB_FILE=/etc/ncg/ncg.localdb
    DB_DIRECTORY=/etc/ncg/ncg-localdb.d
  </File>
</NCG::LocalMetricsAttrs>

<NCG::LocalRules>
  <File>
    DB_FILE=/etc/ncg/ncg.localdb
    DB_DIRECTORY=/etc/ncg/ncg-localdb.d
  </File>
</NCG::LocalRules>

<NCG::LocalMetrics>
  <File>
    DB_FILE=/etc/ncg/ncg.localdb
    DB_DIRECTORY=/etc/ncg/ncg-localdb.d
  </File>
  <Hash>
      PROFILE = VO
  </Hash>
</NCG::LocalMetrics>
<NCG::RemoteMetrics>
</NCG::RemoteMetrics>
include ncg.conf.d/*.conf
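
Since the same DB_FILE setting is repeated in several blocks, a quick scan can show which top-level blocks carry it, which makes an inconsistent edit easier to spot. The following Python sketch assumes the Apache-style layout shown above (one tag or KEY=value per line); it is a line-based scan, not a full parser for the NCG configuration format, and the function name is hypothetical.

```python
import re

OPEN_BLOCK = re.compile(r"^<(NCG::\w+)>$")
CLOSE_BLOCK = re.compile(r"^</NCG::\w+>$")
DB_FILE = re.compile(r"^DB_FILE\s*=\s*(\S+)$")


def blocks_using_db_file(conf_text):
    """Map each top-level <NCG::...> block to the DB_FILE values found inside it."""
    found = {}
    current = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            continue  # comment line
        opened = OPEN_BLOCK.match(line)
        if opened:
            current = opened.group(1)
        elif CLOSE_BLOCK.match(line):
            current = None
        else:
            entry = DB_FILE.match(line)
            if entry and current:
                found.setdefault(current, []).append(entry.group(1))
    return found


SAMPLE = """\
<NCG::SiteSet>
  <File>
      DB_FILE=/etc/ncg/ncg.localdb
  </File>
</NCG::SiteSet>
<NCG::ConfigGen>
  <Nagios>
    VO = atlas
  </Nagios>
</NCG::ConfigGen>
<NCG::SiteInfo>
  <File>
    DB_FILE=/etc/ncg/ncg.localdb
  </File>
</NCG::SiteInfo>
"""

if __name__ == "__main__":
    for block, files in blocks_using_db_file(SAMPLE).items():
        print(block, files)
```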

Finally, you have to modify the Hash.pm file, which contains the description of the metrics for the different profiles and how Nagios must invoke them (options). The Hash.pm file is located under /usr/lib/perl5/vendor_perl/5.8.5/NCG/LocalMetrics. The complexity of the commands we have run so far to test our new code is hidden precisely by the instructions specified in this file.

Please note that by default all profiles (which could be used to set up sets or sub-sets of Nagios metrics, e.g. roc, VO, site, infrastructure) are merged into one single profile in the default file that you have to modify. Keep this in mind when you define a service for a profile, as shown in this snippet:

$WLCG_NODETYPE->{roc}->{SRMv2} = [
'org.atlas.SRM-PutGetDel'
];
Alternatively, for the WN tests, you must add the list of tests that you want Nagios to be aware of (and to publish on the Nagios portal page for VO X). Both tests org.sam.CE-JobState and org.sam.CE-JobSubmit must be kept; optionally you can remove all the other OPS tests.

$WLCG_NODETYPE->{roc}->{CE} = [
'org.sam.CE-JobState',
'org.sam.CE-JobSubmit',
'org.atlas.WN-sft-vo-swdir',
'org.sam.WN-SoftVer',
'org.atlas.WN-gangarobot_panda',
'org.atlas.WN-gangarobot_wms'
];

With definitions like these you automatically instruct the Nagios server on your box to consider the tests valid for the other profiles as well. The relevant parts to add to this large file are reported below for both the CE and SRM services for the VO profile.


$SERVICE_TEMPL->{2}->{timeout} = 5;

$WLCG_SERVICE->{'org.sam.CE-JobState'}->{parameter}->{"--add-wntar-nag-nosamcf"} = 0;
#$WLCG_SERVICE->{'org.sam.CE-JobState'}->{parameter}->{"--add-wntar-nag"} = "/afs/cern.ch/user/d/digirola/public/nagios_atlas/project/src/wnjob/org.atlas/";
$WLCG_SERVICE->{'org.sam.CE-JobState'}->{parameter}->{"--add-wntar-nag"} = "/usr/libexec/grid-monitoring/probes/org.atlas/wnjob/org.atlas";
$WLCG_SERVICE->{'org.sam.CE-JobState'}->{parameter}->{"--mb-uri"} = "stomp://gridmsg002.cern.ch:6163";

Please note that the CE-JobState probe dispatches the jobs to the WN and must be told exactly where the input tarball to be shipped to the WN sits. If you want to replicate the complex command used to run your WN test jobs, as described in the section Validating your probes (development mode), you must add the lines above for the $WLCG_SERVICE org.sam.CE-JobState.

Then add all the WN checks that must be run and that Nagios has to know about, as follows. Please note again that the metricset here is org.atlas.WN, while that of org.sam.CE-JobState is different.

# WN checks
# WN ATLAS
$WLCG_SERVICE->{'org.atlas.WN-sft-vo-swdir'}->{flags}->{PASSIVE} = 1;
$WLCG_SERVICE->{'org.atlas.WN-sft-vo-swdir'}->{parent} = "org.sam.CE-JobState";
$WLCG_SERVICE->{'org.atlas.WN-sft-vo-swdir'}->{flags}->{VO} = 1;
$WLCG_SERVICE->{'org.atlas.WN-sft-vo-swdir'}->{flags}->{OBSESS} = 1;
$WLCG_SERVICE->{'org.atlas.WN-sft-vo-swdir'}->{metricset} = 'org.atlas.WN';
$WLCG_SERVICE->{'org.atlas.WN-sft-vo-swdir'}->{docurl} = "https://twiki.cern.ch/twiki/bin/view/LCG/SAMProbesMetrics#WN";
# 

or alternatively for SRM checks:





#
# SRM checks 
# SRM-All does a set of tests, and returns detailed results via Passive tests
#$WLCG_SERVICE->{'org.atlas.SRM-All-atlas'}->{native} = "Nagios";
#$WLCG_SERVICE->{'org.atlas.SRM-All-atlas'}->{config} = {%{$SERVICE_TEMPL->{60}}};
#$WLCG_SERVICE->{'org.atlas.SRM-All-atlas'}->{probe} = 'org.atlas/SRM/org.atlas/src/SRM-probe';
#$WLCG_SERVICE->{'org.atlas.SRM-All-atlas'}->{metricset} = "org.atlas.SRM";
##$WLCG_SERVICE->{'org.atlas.SRM-All-atlas'}->{dependency}->{"hr.srce.SRM2-CertLifetime"} = 1;
##$WLCG_SERVICE->{'org.atlas.SRM-All-atlas'}->{dependency}->{"hr.srce.GridProxy-Valid"} = 0;
#$WLCG_SERVICE->{'org.atlas.SRM-All-atlas'}->{attribute}->{VONAME} = "--vo";
#$WLCG_SERVICE->{'org.atlas.SRM-All-atlas'}->{attribute}->{VO_FQAN} = "--vo-fqan";
#$WLCG_SERVICE->{'org.atlas.SRM-All-atlas'}->{attribute}->{X509_USER_PROXY} = "-x";
#$WLCG_SERVICE->{'org.atlas.SRM-All-atlas'}->{flags}->{NOLBNODE} = 1;
#$WLCG_SERVICE->{'org.atlas.SRM-All-atlas'}->{flags}->{VO} = 1;
#$WLCG_SERVICE->{'org.atlas.SRM-All-atlas'}->{flags}->{NRPE} = 1;
#$WLCG_SERVICE->{'org.atlas.SRM-All-atlas'}->{flags}->{OBSESS} = 1;
#$WLCG_SERVICE->{'org.atlas.SRM-All-atlas'}->{docurl} = "https://twiki.cern.ch/twiki/bin/view/LCG/SAMProbesMetrics#SRM";
#
#
#$WLCG_SERVICE->{'org.atlas.SRM-GetSURLs'}->{flags}->{PASSIVE} = 1;
#$WLCG_SERVICE->{'org.atlas.SRM-GetSURLs'}->{parent} = "org.atlas.SRM-All-atlas";
#$WLCG_SERVICE->{'org.atlas.SRM-GetSURLs'}->{flags}->{VO} = 1;
#$WLCG_SERVICE->{'org.atlas.SRM-GetSURLs'}->{flags}->{OBSESS} = 1;
#$WLCG_SERVICE->{'org.atlas.SRM-GetSURLs'}->{metricset} = "org.atlas.SRM";
#$WLCG_SERVICE->{'org.atlas.SRM-GetSURLs'}->{docurl} = "https://twiki.cern.ch/twiki/bin/view/LCG/SAMProbesMetrics#SRM";


$WLCG_SERVICE->{'org.atlas.SRM-PutGetDel'}->{parameter}->{"-m"} ='org.atlas.SRM-PutGetDel';
$WLCG_SERVICE->{'org.atlas.SRM-PutGetDel'}->{native} = "Nagios";
$WLCG_SERVICE->{'org.atlas.SRM-PutGetDel'}->{config} = {%{$SERVICE_TEMPL->{60}}};
$WLCG_SERVICE->{'org.atlas.SRM-PutGetDel'}->{probe} = 'org.atlas/SRM/org.atlas/src/SRM-probe';
$WLCG_SERVICE->{'org.atlas.SRM-PutGetDel'}->{metricset} = "org.atlas.SRM";
#$WLCG_SERVICE->{'org.atlas.SRM-PutGetDel'}->{dependency}->{"hr.srce.SRM2-CertLifetime"} = 1;
#$WLCG_SERVICE->{'org.atlas.SRM-PutGetDel'}->{dependency}->{"hr.srce.GridProxy-Valid"} = 0;
$WLCG_SERVICE->{'org.atlas.SRM-PutGetDel'}->{attribute}->{VONAME} = "--vo";
$WLCG_SERVICE->{'org.atlas.SRM-PutGetDel'}->{attribute}->{VO_FQAN} = "--vo-fqan";
$WLCG_SERVICE->{'org.atlas.SRM-PutGetDel'}->{attribute}->{X509_USER_PROXY} = "-x";
$WLCG_SERVICE->{'org.atlas.SRM-PutGetDel'}->{flags}->{NOLBNODE} = 1;
$WLCG_SERVICE->{'org.atlas.SRM-PutGetDel'}->{flags}->{VO} = 1;
$WLCG_SERVICE->{'org.atlas.SRM-PutGetDel'}->{flags}->{NRPE} = 1;
$WLCG_SERVICE->{'org.atlas.SRM-PutGetDel'}->{flags}->{OBSESS} = 1;
$WLCG_SERVICE->{'org.atlas.SRM-PutGetDel'}->{docurl} = "https://twiki.cern.ch/twiki/bin/view/LCG/SAMProbesMetrics#SRM";
### To execute ONLY one metric of your test (and not the SRM-All one) you also have to add this
$WLCG_SERVICE->{'org.atlas.SRM-GetATLASInfo'}->{parameter}->{"-m"} = 'org.atlas.SRM-GetATLASInfo';

Please note the convention by which gridmetrics decides which metric a probe invokes. By default, when a probe is invoked without specifying a metric (i.e. without the -m option), the framework checks whether an "All" metric is defined in the self.metrics dictionary and, if so, launches it. When you inherit from gridmetrics.srmmetrics.SRMMetrics, "All" is defined there and is present in the self._metrics dictionary (which is used to set self.metrics via self.set_metrics()). To actively invoke a particular metric you have to pass -m to the probe, i.e. you need to add this parameter definition to the metric definition:
$WLCG_SERVICE->{'org.atlas.SRM-GetATLASInfo'}->{parameter}->{"-m"} =
'org.atlas.SRM-GetATLASInfo';
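
The selection rule described above can be pictured with a toy Python function. This is not the gridmetrics code: the function name, the short metric keys, and what happens when neither -m nor "All" is available (here: None is returned) are assumptions made only for the illustration.

```python
def select_metric(requested, metrics):
    """Pick the metric to run: the -m value if given, otherwise "All" if defined.

    requested: value of the -m option, or None when the option is absent.
    metrics:   dictionary of known metrics (stand-in for self.metrics).
    """
    if requested is not None:
        if requested not in metrics:
            raise KeyError(f"unknown metric: {requested}")
        return requested
    if "All" in metrics:
        return "All"
    return None  # assumption: nothing to run when "All" is not defined


if __name__ == "__main__":
    # Placeholder dictionary; the real framework stores richer entries.
    known = {"All": "run every SRM metric", "GetATLASInfo": "single metric"}
    print(select_metric(None, known))            # no -m: falls back to "All"
    print(select_metric("GetATLASInfo", known))  # explicit -m wins
```

This is why a definition without the "-m" parameter ends up running the whole "All" set instead of the single metric you meant to schedule.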

Reconfiguration

Once you have modified the Nagios configuration files you have to instruct the Nagios server to pick them up. To do so, run the following steps:

  1. Remove /etc/nagios/wlcg.d/ (back it up first)
  2. Run ncg.pl (which runs anyway every three hours via cron job)
  3. Run "nagios -v /etc/nagios/nagios.cfg"
  4. Run "service nagios restart"
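
The four steps above can be scripted so that they always run in the same order. The Python sketch below is one possible way to do it, assuming ncg.pl lives at /usr/sbin/ncg.pl as mentioned earlier in this page; the backup directory name is a made-up convention, and the script defaults to a dry run that only prints the commands.

```python
import subprocess
import time


def reconfigure_commands(wlcg_dir="/etc/nagios/wlcg.d", stamp=None):
    """Return the commands for the reconfiguration steps, in order."""
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    source = wlcg_dir.rstrip("/")
    backup = f"{source}.bak-{stamp}"  # hypothetical backup location
    return [
        ["mv", source, backup],                      # step 1: back up and remove
        ["/usr/sbin/ncg.pl"],                        # step 2: regenerate wlcg.d
        ["nagios", "-v", "/etc/nagios/nagios.cfg"],  # step 3: validate the config
        ["service", "nagios", "restart"],            # step 4: restart Nagios
    ]


def run(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True aborts on the first failing command.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run(reconfigure_commands())  # dry run: nothing is modified
```

Because each command is run with check=True, a failed "nagios -v" validation stops the sequence before the restart is attempted.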

A special metric: gLExec

The gLExec metric (like some others) comes for free with the native Nagios-compliant WN-probe code. The probe delivering the org.sam.glexec.WN-gLExec test to the WN is org.sam.glexec.CE-JobState. Nagios does not use the same probe (org.sam.CE-JobState) that delivers the WN org.lhcb jobs because this test usually requires a different FQAN (namely Role=pilot) than the production test jobs. This subtle difference forces the definition of another profile in the Hash.pm file and corresponding modifications to the /etc/ncg/ncg.conf file, as we will see in more detail in this section.

Running with your own credentials

You can run the metric directly by invoking it via the WN-probe on the local host (just to check what the code does) using the following syntax:
/usr/libexec/grid-monitoring/probes/org.sam/WN-probe -m org.sam.WN-gLExec -H localhost -x /etc/nagios/globus/userproxy.pem-lhcb 
and the output will look like:

UNKNOWN: glexec command not found.
UNKNOWN: glexec command not found.
Testing from: samnag015.cern.ch
DN: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=santinel/CN=564059/CN=Roberto Santinelli/CN=proxy/CN=proxy
VOMS FQANs: /lhcb/Role=production/Capability=NULL, /lhcb/Role=NULL/Capability=NULL
glexec command not found.
$GLEXEC_LOCATION not set.
Searching in $PATH.
glexec not found in: /usr/sue/bin:/usr/kerberos/bin:/opt/d-cache/srm/bin:/opt/d-cache/dcap/bin:/opt/edg/bin:/opt/glite/bin:/opt/globus/bin:/opt/lcg/bin:/usr/local/bin:/bin:/usr/bin:/opt/glite/sbin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/bin:/usr/bin

The test failed because glexec is neither installed nor configured on the box where it ran. However, it is easy to see that this is the command that will be issued once the job lands on the remote WN. How do you deliver it there? In the same way as for the SAM tests, with the difference that, being inherited from org.sam.glexec, it is already Nagios compliant. The probe to invoke on the UI is CE-probe, which dispatches the job containing the tarball. The following command submits the glexec tests available in org.sam.glexec:

/usr/libexec/grid-monitoring/probes/org.sam/CE-probe -m org.sam.CE-JobState -H ce123.cern.ch --mb-destination /queue/test.org.lhcb --mb-uri stomp://gridmsg002.cern.ch:6163 --add-wntar-nag /usr/libexec/grid-monitoring/probes/org.sam/wnjob/org.sam.glexec --add-wntar-nag-nosam --vo lhcb --namespace org.lhcb --wms wms203.cern.ch -x /tmp/pilot
Have a look at the *.cfg files under /usr/libexec/grid-monitoring/probes/org.sam/wnjob/org.sam.glexec/ for a better idea of the metric invoked by the probe on the WN.

The WN-probe can be queried with the -l option, which lists the metrics defined in it:

-sh-3.2$ /usr/libexec/grid-monitoring/probes/org.sam/WN-probe -l
...
serviceType: WN
metricDescription: gLExec - change unix identity.
metricLocality: remote
metricType: status
metricName: org.sam.WN-gLExec
EOT
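
If you want to post-process the listing, for example to collect all metric names, the "key: value" records can be parsed as in the Python sketch below. It assumes that every record ends with an EOT line, as in the excerpt above; since part of the real output is elided there, treat the format as an assumption. The function name is invented for this example.

```python
def parse_metric_listing(output):
    """Turn 'key: value' lines terminated by EOT into a list of dictionaries."""
    records = []
    current = {}
    for raw in output.splitlines():
        line = raw.strip()
        if line == "EOT":
            if current:
                records.append(current)
            current = {}
            continue
        key, separator, value = line.partition(": ")
        if separator:  # lines without "key: value" are ignored
            current[key] = value
    return records  # a trailing record without EOT is dropped


SAMPLE = """\
serviceType: WN
metricDescription: gLExec - change unix identity.
metricLocality: remote
metricType: status
metricName: org.sam.WN-gLExec
EOT
"""

if __name__ == "__main__":
    for record in parse_metric_listing(SAMPLE):
        print(record["metricName"], "-", record["metricDescription"])
```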

Here is the snippet for the configuration in Hash.pm:

$WLCG_SERVICE->{'org.sam.glexec.CE-JobState'}->{parameter}->{"--add-wntar-nag-nosamcfg"} 
= "";
$WLCG_SERVICE->{'org.sam.glexec.CE-JobState'}->{parameter}->{"--add-wntar-nag"} 
= "/usr/libexec/grid-monitoring/probes/org.sam/wnjob/org.sam.glexec";

If you need to modify the test itself, just point to the correct directory, with the probes and *.cfg files included, from which the tarball shipped to the WN is built. E.g.:

$WLCG_SERVICE->{'org.sam.glexec.CE-JobState'}->{parameter}->{"--add-wntar-nag"} 
= "/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb.glexec";

Everything is already there under the glexec profile in Hash.pm:

# glexec profile
$WLCG_NODETYPE->{glexec}->{'CE'} = [
'org.sam.glexec.CE-JobSubmit',
'org.sam.glexec.CE-JobState',
'org.sam.glexec.WN-gLExec'];

To run the gLExec tests as they are, it is just a matter of making Nagios aware of this profile by modifying ncg.conf (the Nagios configurator configuration file) at the relevant points, which we report here:

  <Hash>
      PROFILE = glexec
      VO_FQAN = /lhcb/Role=pilot
  </Hash>
This activates the glexec profile (already existing in Hash.pm), to be compared with the "VO" profile that is active by default on the VO Nagios boxes:
  <Hash>
      PROFILE = VO
      VO_FQAN = /lhcb/Role=production
  </Hash>

Furthermore, the following information must be inserted within the Nagios tag. Please note that the default FQAN must be specified too. This allows different profiles to use different capabilities.

  <Nagios>
    MYPROXY_NAME = NagiosRetrieve-sam-lhcb.cern.ch
    MYPROXY_USER = nagios
    TEMPLATES_DIR = /usr/share/grid-monitoring/config-gen/nagios
    OUTPUT_DIR = /etc/nagios/wlcg.d
    NRPE_OUTPUT_DIR = /etc/nagios/nrpe/
    NRPE_UI =
    NAGIOS_ROLE = vo
    INCLUDE_EMPTY_HOSTS = 0
    ENABLE_NOTIFICATIONS = 0

    VO_LHCB_DEFAULT_VO_FQAN = /lhcb/Role=production
    LOCAL_METRIC_STORE = 1
  </Nagios>

Writing your gLexec probe

  • Under your working area /afs/cern.ch/user/s/santinel/scratch0/nagios/project/src/wn.job create a directory org.lhcb.glexec (at the same level as org.lhcb, which contains the WN SAM tests) and there create the subdirectories etc and probes.
  • Under the etc directory put the usual services and commands .cfg files
  • Under probes put the glexec WN probe
The reference Python code is available at http://svnweb.cern.ch/guest/sam/trunk/probes/src/gridmetrics/wnmetrics.py. The metric is:
def metricgLExec(self):

Like all Python org.sam metrics, it needs to be a method of a class sub-classed from probe.MetricGatherer. All the rest is the same as for the SRM metrics.
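
To give an idea of the shape of such a probe, here is a self-contained Python sketch. The MetricGatherer class below is a local stand-in so that the example runs on its own; it is not the real probe.MetricGatherer, and the (status, summary) return value, the constructor argument and the $GLEXEC_LOCATION/sbin/glexec location are assumptions. The sketch only locates the glexec binary, mirroring the "glexec command not found" output shown earlier; the reference metric in wnmetrics.py does more than that.

```python
import os
import shutil


class MetricGatherer:
    """Local stand-in for probe.MetricGatherer (NOT the real class)."""


def find_glexec(environ):
    """Look for glexec under $GLEXEC_LOCATION first, then in $PATH."""
    location = environ.get("GLEXEC_LOCATION")
    if location:
        # Assumption: the binary sits in the sbin subdirectory.
        candidate = os.path.join(location, "sbin", "glexec")
        if os.access(candidate, os.X_OK):
            return candidate
    return shutil.which("glexec", path=environ.get("PATH", ""))


class GlexecMetrics(MetricGatherer):
    def __init__(self, environ=None):
        self.environ = os.environ if environ is None else environ

    def metricgLExec(self):
        """Report whether the glexec command can be found."""
        path = find_glexec(self.environ)
        if path is None:
            return ("UNKNOWN", "glexec command not found.")
        return ("OK", f"glexec found at {path}")


if __name__ == "__main__":
    status, summary = GlexecMetrics().metricgLExec()
    print(f"{status}: {summary}")
```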

NEET

Any other experiment-specific test submitted within the Experiment Framework could well be published into Nagios via the Message Brokers as a passive check. Details and a how-to are described in this link.

-- RobertoSantinel - 14-Jun-2010
Topic revision: r21 - 2010-11-29 - unknown