Monitoring with Nagios


DPM nagios packages include probes to monitor each of our ''node types''.

Installation

  1. Configure our [wiki:Dpm/Dev/Components#YumRepository unstable yum repository]
  2. Install the nagios plugins rpm corresponding to the node type
       # yum install nagios-nrpe nagios-plugins-[lcgdm|dpm-head|dpm-disk]
       
    ''nagios-plugins-lcgdm'' should be installed on the nagios host itself.

Configuration (monitored machine)

  1. Enable the ''nrpe.d'' config directory in the nrpe configuration
       # vim /etc/nagios/nrpe.cfg
       ...
       include_dir=/etc/nrpe.d/
       ...
       
  2. Enable nrpe via xinetd
       # cp /opt/lcg/share/doc/nagios-plugins-lcgdm/examples/nrpe /etc/xinetd.d
       
    Use ''/usr/share/doc/nagios-plugins-lcgdm/examples/nrpe'' instead for an EMI installation.
  3. Restart the xinetd service
       # service xinetd restart
       

Configuration (nagios host)

The ''nagios host'' is the machine running the nagios daemon.

  1. Declare the following files in the main nagios configuration file (these are the definition files for the nagios-plugins-lcgdm probes)
       # vim /etc/nagios/nagios.cfg
       ...
       cfg_file=/etc/nagios/generic-service.cfg
       cfg_file=/etc/nagios/lcgdm-services.cfg
       cfg_file=/etc/nagios/lcgdm-hosts.cfg
       cfg_file=/etc/nagios/lcgdm-commands.cfg
       ...
       
  2. For each of the machines to be monitored, add to ''/etc/nagios/lcgdm-hosts.cfg'' an entry like:
       # vim /etc/nagios/lcgdm-hosts.cfg
       ...
       define host {
               use             generic-host
               host_name       <hostname>
               name            <machine description>
               hostgroups      <node type(s)>
       }
       ...
       
    With the ''nagios-plugins-lcgdm'' rpm you get default configurations for the 3 node types: ''dpm-disks'', ''dpm-heads'', ''nagios-host''. You can fill ''hostgroups'' with a comma separated list of any of these types (as appropriate). Each of these types has to have at least one host in it.
  3. (For nagios 2 only) Some probes are installed on the nagios host. These probes need the server hostname to work correctly. To set it, edit the file ''/etc/nagios/lcgdm-hosts.cfg'' and modify the '-H' option.
       ...
       define command{
               command_name    check_dpns
               command_line    /usr/lib64/nagios/plugins/lcgdm/check_dpns -H testdpm-h
       }
       ...
       
  4. Reload the nagios daemon
       # service nagios reload
       

Plotting the data

Detailed information under [wiki:Dpm/Admin/Monitoring/pnp4nagios How to set pnp4nagios].

How to develop a probe

Detailed information under [wiki:Dpm/Admin/Monitoring/ProbeDevel Probe development].

Frequently Asked Questions (FAQ)

How to enable / disable a probe

If you want to disable a probe on every client, the easiest way is to comment out the nagios service definition.
vim /etc/nagios/lcgdm-services.cfg

#define service {
#        use                     lcgdm-generic-service
#        hostgroup_name          dpm-heads, dpm-disks
#        service_description     DM_CERT
#        check_command           check_nrpe!check_hostcert
#}

If you want to disable a probe for only one group (headnode or disknode), then you have to modify the hostgroup_name option:

vim /etc/nagios/lcgdm-services.cfg

define service {
        use                     lcgdm-generic-service
        hostgroup_name          dpm-heads (, dpm-disks, other-group)
        service_description     DM_CERT
        check_command           check_nrpe!check_hostcert
}

How to change the check frequency of a probe

The check frequency is defined by the nagios option "normal_check_interval". This option can be applied either at a template level or at a service level, as follows:
define service {
        use                     lcgdm-generic-service
        hostgroup_name          dpm-heads
        service_description     DM_SPACE_TOKEN
        check_command           check_nrpe!check_space_token
        normal_check_interval   60
}
The value specified in the service definition overrides the value from the template.

Available probes

The DPM nagios probes are split into multiple rpm packages, depending on where they should be deployed.

nagios-plugins-lcgdm-common    Probes common to the DPM Head Node, Disk Node and the LFC
nagios-plugins-dpm-head        Probes for the DPM Head Node
nagios-plugins-dpm-disk        Probes for the DPM Disk Node
nagios-plugins-lfc             Probes for the LFC
nagios-plugins-lcgdm           Probes, pnp4nagios templates and configuration files for the Nagios host

Package: nagios-plugins-lcgdm-common

check_cpu

# /usr/lib64/nagios/plugins/lcgdm/check_cpu -h
Checks the CPU activity (System/Idle/IOwait/IRQ)

   -w, --warning   Sets the warning values. Default: 60,60,60,100,80,60,60
   -c, --critical   Sets the critical values. Default: 70,70,70,100,90,70,70
   -i, --interval   Measure interval. Default: 5

The warning and critical value order is user,nice,system,idle,iowait,irq,softirq.
Each value is interpreted as a percentage of the total. There is no need to use the % symbol.
The result gives the percentage of time the CPU has spent in each of the following states:

        user:           normal processes executing in user mode
        nice:           niced processes executing in user mode
        system:         processes executing in kernel mode
        idle:           twiddling thumbs
        iowait:         waiting for I/O to complete
        irq:            servicing interrupts
        softirq:        servicing softirqs

Description of work executed by the probe:
        1. Read the cpu activity counters from /proc/stat
        2. Wait 5 seconds (the measure interval)
        3. Read the cpu activity counters again
        4. From the difference between the two samples, compute the percentage of time the cpu spent in each mode
        5. Return values to nagios
                 Warning alert is triggered if a cpu state reaches its corresponding warning threshold (as a percentage of total usage)
                 Critical alert is triggered if a cpu state reaches its corresponding critical threshold (as a percentage of total usage)
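
The two-sample computation described above can be sketched as follows. This is an illustration only, not the probe's actual source: the function names are hypothetical, the input is the text of /proc/stat passed in as a string, and treating "reaches" as "greater than or equal" is an assumption.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq")


def parse_cpu_line(stat_text):
    """Return the first seven counters of the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return dict(zip(FIELDS, (int(v) for v in parts[1:8])))
    raise ValueError("no aggregate 'cpu' line found")


def cpu_percentages(before, after):
    """Percentage of the elapsed ticks spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def reached(percentages, thresholds):
    """Names of the states whose percentage reaches its threshold (assumed >=)."""
    return [n for n, limit in zip(FIELDS, thresholds) if percentages[n] >= limit]
```

In a live check the two strings would come from reading /proc/stat twice, separated by a sleep of the chosen interval; the threshold list follows the user,nice,system,idle,iowait,irq,softirq order given above.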


check_hostcert

# /usr/lib64/nagios/plugins/lcgdm/check_hostcert -h
Checks the validity of host certificates and CRLs

Usage: /usr/lib64/nagios/plugins/lcgdm/check_hostcert [options]

   -w, --warning   Sets the warning value, in days. Default: 10
   -c, --critical   Sets the critical value, in days. Default: 2
   -s, --subject   Checks the hostname vs the certificate's subject

Description of work executed by the probe:

        1. Execute an openssl command and retrieve
                subject
                startdate
                enddate
        2. Execute a command to retrieve the local hostname
        3. Return the values to nagios
                Warning alert is triggered if the remaining validity is below the warning value
                Critical alert is triggered if the certificate is not valid or the remaining validity is below the critical value
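
As a rough illustration of the validity check, the sketch below parses date lines in the form printed by `openssl x509 -noout -startdate -enddate` and maps the remaining validity to a status. The exact openssl invocation used by the probe is not documented here, so that input format is an assumption, as are the function names and the use of "less than or equal" at the thresholds.

```python
from datetime import datetime

# Date layout printed by openssl, e.g. "notAfter=Dec 17 12:00:00 2026 GMT"
OPENSSL_DATE_FORMAT = "%b %d %H:%M:%S %Y %Z"


def parse_openssl_date(line):
    """Parse a 'notBefore=...' or 'notAfter=...' line into a naive UTC datetime."""
    _, _, value = line.strip().partition("=")
    return datetime.strptime(value, OPENSSL_DATE_FORMAT)


def cert_status(start, end, now, warning_days=10, critical_days=2):
    """Map the certificate validity window to a nagios status string."""
    if now < start or now > end:
        return "CRITICAL"  # not yet valid, or already expired
    days_left = (end - now).total_seconds() / 86400.0
    if days_left <= critical_days:
        return "CRITICAL"
    if days_left <= warning_days:
        return "WARNING"
    return "OK"
```

The defaults mirror the documented -w 10 and -c 2 values; `now` is passed in explicitly so the logic can be exercised without touching the system clock.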

check_network

# /usr/lib64/nagios/plugins/lcgdm/check_network -h
Checks the network activity

Usage: /usr/lib64/nagios/plugins/lcgdm/check_network [options]

   -i, --interval   Measure interval. Default: 5
   -n, --interfaces   Which network interfaces must be monitored. All by default.

For each network interface to be monitored, this probe returns the average input/output throughput and the percentage of packets dropped during the last interval (5 seconds by default).

Description of work executed by the probe:

        1. Read the network interface counters from /proc/net/dev
        2. Wait 5 seconds
        3. Read the network interface counters again
        4. By subtracting these values, the following information can be determined
                Number of bytes transferred
                Number of packets transferred
                Number of packets dropped
        5. Return these numbers to nagios
                No warning or critical threshold can be set
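
A minimal sketch of the subtraction step, assuming the standard Linux /proc/net/dev layout (eight receive counters followed by eight transmit counters per interface). The function and key names are hypothetical, and expressing drops as a percentage of transferred packets is an assumption about how the probe computes it.

```python
def parse_net_dev(text):
    """Parse the content of /proc/net/dev into {interface: counters}."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, _, rest = line.partition(":")
        fields = [int(v) for v in rest.split()]
        counters[name.strip()] = {
            "rx_bytes": fields[0], "rx_packets": fields[1], "rx_drop": fields[3],
            "tx_bytes": fields[8], "tx_packets": fields[9], "tx_drop": fields[11],
        }
    return counters


def interface_activity(before, after, interval):
    """Throughput and drop percentage per interface between two samples."""
    result = {}
    for name, new in after.items():
        old = before.get(name)
        if old is None:
            continue  # interface appeared between the two samples
        delta = {key: new[key] - old[key] for key in new}
        packets = delta["rx_packets"] + delta["tx_packets"]
        dropped = delta["rx_drop"] + delta["tx_drop"]
        result[name] = {
            "rx_bytes_per_s": delta["rx_bytes"] / interval,
            "tx_bytes_per_s": delta["tx_bytes"] / interval,
            "dropped_percent": 100.0 * dropped / packets if packets else 0.0,
        }
    return result
```

Restricting the result to the interfaces named with -n would be a simple filter on the returned dictionary.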

check_process

# /usr/lib64/nagios/plugins/lcgdm/check_process -h

Checks a process activity

Usage: /usr/lib64/nagios/plugins/lcgdm/check_process [options]

   -p, --processes   A list of processes separated by commas (e.g. gridftp,rfiod)
   -w, --warning   Warning limits. It is a tuple with the values of instances,cpu%,mem%,threads,connections,descriptors. Default: 10,80%,50%,100,100,800
   -c, --critical   Critical limits. Same format as warning. Default: 20

For a given process (specified with -p option), the probe returns the following information:

        process instance:       Number of running instances
        process cpu:            Percentage of CPU used
        process mem:            Percentage of memory used
        process thread:         Number of running threads
        process conn:           Number of open connections
        process fd:             Number of associated file descriptors

Description of work executed by the probe:

        1. Get information about process activity using the ps command
        2. Retrieve basic information about the process:
                Number of instances
                CPU / memory usage dedicated to it
                Number of running threads
        3. Execute a "lsof" command to find out how many network connections are currently open for this process
        4. Execute a "lsof" command to find out how many file descriptors are currently linked to this process
        5. Return values to nagios
                Warning or critical alerts are triggered if too many resources are dedicated to one process
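
To illustrate how a limits tuple such as the documented warning default might be compared with measured values, here is a small sketch. The names are hypothetical, the comparison ("strictly greater than") is an assumption, and the sketch only accepts a full six-value tuple, which may be stricter than the real probe.

```python
LIMIT_NAMES = ("instances", "cpu", "mem", "threads", "connections", "descriptors")


def parse_limits(spec):
    """Turn '10,80%,50%,100,100,800' into a {name: limit} dictionary."""
    values = [float(item.strip().rstrip("%")) for item in spec.split(",")]
    if len(values) != len(LIMIT_NAMES):
        raise ValueError("expected %d comma-separated values" % len(LIMIT_NAMES))
    return dict(zip(LIMIT_NAMES, values))


def exceeded(measured, limits):
    """Names of the measurements that are above their limit."""
    return [name for name in LIMIT_NAMES if measured[name] > limits[name]]
```

A non-empty result from `exceeded` against the warning limits would correspond to a warning alert, and likewise for the critical limits.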

Package: nagios-plugins-dpm-head

check_dpm_infosys

# /usr/lib64/nagios/plugins/lcgdm/check_dpm_infosys -h

Checks correctness of information published in the BDII

Usage:  /usr/lib64/nagios/plugins/lcgdm/check_dpm_infosys [options]

        -H, --host      The host to query. If not specified, DPM_HOST will be used. 'localhost' in last instance.
        -p, --port      The ldap port. Default: 2170

This probe expects a running local BDII and checks the correctness of the information published in it. The rfio, gridftp and srm (both versions) protocols are checked, as well as the srm manager services (httpg://$DPM_HOST:8443/srm/managerv1 and httpg://$DPM_HOST:8446/srm/managerv2).

Description of work executed by the probe:

        1. Initialize an ldap connection to the headnode
        2. Check if information about the gridftp and rfio protocols is correctly published
        3. Check if information about the srmv1 and srmv2 protocols is correctly published
        4. Check if information about the "httpg://hostname:8443/srm/managerv1" and "httpg://hostname:8446/srm/managerv2" services is correctly published
        5. Return values to nagios
                No warning alert can be set
                Critical alert is triggered if the ldap server is unreachable or one of the previous items does not publish information correctly

check_dpm_perf

# /usr/lib64/nagios/plugins/lcgdm/check_dpm_perf -h

Check some function's statistics in the DPM logfile

Usage: /usr/lib64/nagios/plugins/lcgdm/check_dpm_perf [options]

        -l, --logfile   Sets the dpm logfile path. Default: /var/log/dpm/log
        -f, --functions Sets of function to monitor. Default: putdone
        -i, --interval  Sets the interval for the analysis. Default: 10 

The probe parses the last interval of the DPM logfile and returns the number of executions and the total time spent executing each function in the list given with the "--functions" option.

Description of work executed by the probe:

        1. Open the DPM logfile and read it backwards from the end, covering the last 10 minutes
        2. When a "returns" statement is found
                Save the timestamp of the line
        3. When a "request" statement is found
                Link the request to its matching "returns"
                Compute the time spent in this function
        4. Create a results dictionary where the keys are the function names and the values are:
                Total number of executions of this function
                Total time spent executing this function
        5. Return values to nagios
                No warning or critical threshold can be set
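
The request/returns pairing can be sketched independently of the log format. The sketch below assumes the log lines have already been parsed into (timestamp, kind, function, request_id) tuples ordered newest first; that tuple layout, the request_id used to link the two statements, and the function name are all hypothetical, since the actual log syntax is not described here.

```python
from collections import defaultdict


def function_stats(events, functions):
    """Count executions and sum elapsed time per function.

    events: iterable of (timestamp, kind, function, request_id), newest first,
            with timestamps in seconds and kind being "request" or "returns".
    """
    pending_returns = {}  # request_id -> timestamp of its "returns" line
    stats = defaultdict(lambda: [0, 0.0])
    for timestamp, kind, function, request_id in events:
        if function not in functions:
            continue
        if kind == "returns":
            pending_returns[request_id] = timestamp
        elif kind == "request" and request_id in pending_returns:
            elapsed = pending_returns.pop(request_id) - timestamp
            stats[function][0] += 1
            stats[function][1] += elapsed
    return {name: tuple(values) for name, values in stats.items()}
```

Because the log is read from the end, each "returns" is seen before its "request", which is why the return timestamp is stored first and consumed when the request line is reached.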

check_dpns_perf

# /usr/lib64/nagios/plugins/lcgdm/check_dpns_perf -h

Check some function's statistics in the DPNS logfile

Usage: /usr/lib64/nagios/plugins/lcgdm/check_dpns_perf [options]

        -l, --logfile   Sets the dpns logfile path. Default: /var/log/dpns/log
        -f, --functions Sets of function to monitor. Default: mkdir, opendir, readdir, rmdir, creat, unlink
        -i, --interval  Sets the interval for the analysis. Default: 10 

The probe parses the last interval of the DPNS logfile and returns the number of executions and the total time spent executing each function in the list given with the "--functions" option.

Description of work executed by the probe:

        1. Open the DPNS logfile and read it backwards from the end, covering the last 10 minutes
        2. When a "returns" statement is found
                Save the timestamp of the line
        3. When a "request" statement is found
                Link the request to its matching "returns"
                Compute the time spent in this function
        4. Create a results dictionary where the keys are the function names and the values are:
                Total number of executions of this function
                Total time spent executing this function
        5. Return values to nagios
                No warning or critical threshold can be set

check_dpm_pool

# /usr/lib64/nagios/plugins/lcgdm/check_dpm_pool -h

Checks the DPM pools usage

Usage: /usr/lib64/nagios/plugins/lcgdm/check_dpm_pool [options]

        -w, --warning   Sets the warning limit for free space. It can be two values: pool. It accepts suffixes. (e.g. -w 100G). Default 5G.
        -c, --critical  Sets the critical limit for free space. It can be two values: pool. It accepts suffixes. (e.g. -c 50G). Default 1G.
        -p, --pools     Restricts the pools to check to a list separated by commas. (e.g. pool1,pool2)
        -O, --VO        Restricts the pools to check to a specific VO. By name or id. (e.g. dteam)

The accepted suffixes are K(ibibyte), M(ebibyte), G(ibibyte), T(ebibyte) and P(ebibyte).

For each pool defined with the "--pools" option, the probe returns the current used and free space. If the free space is under the warning or critical threshold, the probe triggers an alert.

Description of work executed by the probe:

        1. Retrieve information on the monitored pools by calling "dpm_getpools()"
        2. Parse the result to find
                The free space
                The used space
        3. Return values to nagios
                Warning alert is triggered if free space is under the warning limit (5G by default)
                Critical alert is triggered if free space is under the critical limit (1G by default)
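
The suffix handling and the free-space comparison could look like the sketch below. It is not taken from the probe: the function names are hypothetical, and accepting lowercase suffixes and fractional values is an assumption.

```python
# Binary multipliers for the documented suffixes (kibi, mebi, gibi, tebi, pebi).
SUFFIXES = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40, "P": 2**50}


def parse_size(text):
    """Convert a size such as '100G' or '512' into a number of bytes."""
    text = text.strip()
    if text and text[-1].upper() in SUFFIXES:
        return int(float(text[:-1]) * SUFFIXES[text[-1].upper()])
    return int(text)


def pool_status(free_bytes, warning="5G", critical="1G"):
    """Compare the free space of a pool with the warning and critical limits."""
    if free_bytes < parse_size(critical):
        return "CRITICAL"
    if free_bytes < parse_size(warning):
        return "WARNING"
    return "OK"
```

For example, a pool with 3 GiB free falls between the default limits and would be reported as a warning.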

check_space_token

# /usr/lib64/nagios/plugins/lcgdm/check_space_token -h

Checks the available space per space token

Usage: /usr/lib64/nagios/plugins/lcgdm/check_space_token [options]

        -H, --host      MySQL server.
        -u, --user      MySQL user name. Must have access to both dpm_db and cns_db;
        -p, --password  MySQL user password.
        -w, --warning   Default warning threshold in percent of unused space in a space token. (Default: 30 ).
        -c, --critical  Default critical threshold in percent of unused space in a space token. (Default: 10 ).

Description of work executed by the probe:

        1. Open a connection to the dpm database
        2. Query the database to retrieve information about space tokens (select * from dpm_space_reserv)
        3. Retrieve for each monitored space token
                The free space
                The used space
        4. Return values to nagios
                Warning alert is triggered if free space is under 30 percent of the total space
                Critical alert is triggered if free space is under 10 percent of the total space
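
The percentage comparison for one space token reduces to a few lines. The sketch below works on plain numbers rather than on database rows, because the column layout of dpm_space_reserv is not described here; the function name and the handling of an empty token are assumptions.

```python
def space_token_status(free_bytes, total_bytes, warning=30.0, critical=10.0):
    """Return (status, free_percent) for one space token."""
    if total_bytes <= 0:
        return "CRITICAL", 0.0  # assumption: an empty token counts as full
    free_percent = 100.0 * free_bytes / total_bytes
    if free_percent < critical:
        return "CRITICAL", free_percent
    if free_percent < warning:
        return "WARNING", free_percent
    return "OK", free_percent
```

The free percentage is returned alongside the status so it can also be reported as performance data for plotting.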

Package: nagios-plugins-dpm-disk

check_partition_activity

# /usr/lib64/nagios/plugins/lcgdm/check_partition_activity -h

Checks all the partitions activity

Usage: /usr/lib64/nagios/plugins/lcgdm/check_partition_activity [options]

        -w, --warning   Sets the warning value. Default: 100
        -c, --critical  Sets the critical value. Default: 102
        -i, --interval  Measure interval. Default: 5
        -p, --partition Restrict the probe to one or more specific partitions, separated by commas. Use the device name, not the path (e.g. sda1 instead of /dev/sda1)
        -s, --sector    The sector size, in bytes. By default, 512

Returns the read and write throughput of the specified partitions during the last interval.

Package: nagios-plugins-lfc

check_oracle_expiration

Package: nagios-plugins-lcgdm

check_dpns

# /usr/lib64/nagios/plugins/lcgdm/check_dpns -h

Checks if the DPNS service is up, and the response time

Usage: /usr/lib64/nagios/plugins/lcgdm/check_dpns [options]

        -w, --warning   Sets the warning value, in milliseconds. Default: 300
        -c, --critical  Sets the critical value, in milliseconds. Default: 1000
        -H, --host      The host to query. If not specified, DPNS_HOST will be used. 'localhost' in last instance.

Description of work executed by the probe:

        1. Query the DPNS daemon with an http request on port 5010
                Server alive if answer: 'Connection reset by peer' (return code 104)
                Server down if answer: 'Connection refused' (return code 111)
        2. Return the time taken by the request to nagios
                Warning state is triggered if the request takes longer than 0.3s
                Critical state is triggered if the request takes longer than 1s
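
The alive/down decision above can be sketched with a plain TCP socket. This is an approximation under stated assumptions, not the probe's code: the function names are hypothetical, the request bytes are a placeholder, and mapping a refused connection to CRITICAL and any other error to UNKNOWN is a guess at sensible nagios behaviour. On Linux, errno.ECONNRESET is 104 and errno.ECONNREFUSED is 111, matching the return codes listed.

```python
import errno
import socket
import time


def classify(error_code, elapsed_ms, warning_ms=300, critical_ms=1000):
    """Translate a connection outcome and its duration into a nagios status."""
    if error_code == errno.ECONNREFUSED:
        return "CRITICAL"  # nothing is listening: server down
    if error_code not in (None, errno.ECONNRESET):
        return "UNKNOWN"  # timeout, name resolution failure, ...
    if elapsed_ms > critical_ms:
        return "CRITICAL"
    if elapsed_ms > warning_ms:
        return "WARNING"
    return "OK"


def probe(host, port, timeout=5.0):
    """Connect, send a dummy HTTP request line and time the exchange."""
    started = time.monotonic()
    error_code = None
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"GET / HTTP/1.0\r\n\r\n")
            conn.recv(1)
    except socket.timeout:
        error_code = errno.ETIMEDOUT
    except OSError as exc:
        error_code = exc.errno if exc.errno is not None else errno.EIO
    elapsed_ms = (time.monotonic() - started) * 1000.0
    return error_code, elapsed_ms
```

A check would then be `classify(*probe(host, 5010))`; the same pattern applies to the other daemons by changing the port.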

check_dpm

# /usr/lib64/nagios/plugins/lcgdm/check_dpm -h

Checks if the DPM service is up, and the response time

Usage: /usr/lib64/nagios/plugins/lcgdm/check_dpm [options]

        -w, --warning   Sets the warning value, in milliseconds. Default: 300
        -c, --critical  Sets the critical value, in milliseconds. Default: 1000
        -H, --host      The host to query. If not specified, DPM_HOST will be used. 'localhost' in last instance.

Description of work executed by the probe:

        1. Query the DPM daemon with an http request on port 5015
                Server alive if answer: 'Connection reset by peer' (return code 104)
                Server down if answer: 'Connection refused' (return code 111)
        2. Return the time taken by the request to nagios
                Warning state is triggered if the request takes longer than 0.3s
                Critical state is triggered if the request takes longer than 1s

check_gridftp

# /usr/lib64/nagios/plugins/lcgdm/check_gridftp -h

Checks if the GridFTP server is up, and the response times

Usage: /usr/lib64/nagios/plugins/lcgdm/check_gridftp [options]

        -w, --warning    Sets the warning value, in milliseconds. Default: 300
        -c, --critical   Sets the critical value, in milliseconds. Default: 1000
        -H, --host       The host to query. If not specified, DPM_HOST will be used.
        -p, --port       The GridFTP port. Default: 2811

Pings the gridftp server to find out whether it is currently working. Warning and critical thresholds can be set to trigger an alert if the ping delay is too high.

Description of work executed by the probe:

        1. Query the GridFTP daemon with an http request on port 2811
                Server alive if answer: 'Connection reset by peer' (return code 104)
                Server down if answer: 'Connection refused' (return code 111)
        2. Return the time taken by the request to nagios
                Warning state is triggered if the request takes longer than 0.3s
                Critical state is triggered if the request takes longer than 1s

check_rfio

# /usr/lib64/nagios/plugins/lcgdm/check_rfio -h

Checks if the rfio server is up, and the response times

Usage: /usr/lib64/nagios/plugins/lcgdm/check_rfio [options]

        -w, --warning   Sets the warning value, in milliseconds. Default: 300
        -c, --critical  Sets the critical value, in milliseconds. Default: 1000
        -H, --host      The host to query. If not specified, DPM_HOST will be used.
        -p, --port      The rfio port. Default: 5001

Pings the rfio server to find out whether it is currently working. Warning and critical thresholds can be set to trigger an alert if the ping delay is too high.

Description of work executed by the probe:

        1. Query the rfio daemon with an http request on port 5001
                Server alive if answer: 'Connection reset by peer' (return code 104)
                Server down if answer: 'Connection refused' (return code 111)
        2. Return the time taken by the request to nagios
                Warning state is triggered if the request takes longer than 0.3s
                Critical state is triggered if the request takes longer than 1s

-- FabrizioFurano - 2016-12-07
