LCGDM Monitoring with NAGIOS

Initial wish list

  • Server status: up, down, draining
    • dpm_ping hangs if the server is down (Jean Philippe is going to fix this)
  • Filesystem status: prod, drain, disabled
    • There are no differences between drain and disabled
  • Collecting information about timing of single operations like open on both the metadata server and the disk server.
    • DPM/LFC specific
      • Use the Python API for this (as the LFC probes listed below do)
  • Have a look into
    • LFC-probe (Uses the LFC API to write/read/measure time.)
    • SRM-probe (VO specific: BDII checks, uploads, get TURL, etc.)
    • SRM-probe (From LHCB: Put file and perform action)
    • LFC-probe (From LHCB: Same one as first LFC-probe)
    • GSSD-SRM2-probe (Checks published information in bdii/localhost)
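
The timing item in the wish list could be approached roughly as follows. This is only a minimal sketch: it times an arbitrary callable and maps the elapsed time to the standard Nagios exit codes (0 OK, 1 WARNING, 2 CRITICAL). The names time_operation, classify and run_probe are hypothetical, and the callable stands in for whatever DPM/LFC API call is being measured; no real DPM or LFC function is assumed here.

```python
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def time_operation(operation):
    """Run a callable and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = operation()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def classify(elapsed_ms, warn_ms, crit_ms):
    """Map an elapsed time to a Nagios status; a higher value is worse."""
    if elapsed_ms >= crit_ms:
        return CRITICAL
    if elapsed_ms >= warn_ms:
        return WARNING
    return OK


def run_probe(operation, warn_ms, crit_ms):
    """Time the operation and build a plugin output line with perfdata."""
    try:
        _, elapsed_ms = time_operation(operation)
    except Exception as exc:  # assumption: any failure is reported as critical
        return CRITICAL, "CRITICAL - operation failed: %s" % exc
    status = classify(elapsed_ms, warn_ms, crit_ms)
    message = "%s - operation took %.1f ms | time=%.1fms;%d;%d" % (
        LABELS[status], elapsed_ms, elapsed_ms, warn_ms, crit_ms)
    return status, message
```

A real probe would pass the API call to measure (for example an open on the metadata server) as the callable, print the message, and exit with the returned status code.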

LCGDM plugins

  • Check validity of host certificates. Warning and critical messages wi
    • check_hostcert
    • Warning and critical configurable: Days until the certificate expires
  • DB password lifetime
    • check_oracle_expiration
    • Warning and critical configurable: Days until the password expires
    • Connection string, user and password can be specified
  • Disk partitions activity (bytes/s in and out)
  • CPU utilization (System/Idle/IOwait/IRQ)
    • check_cpu
    • Warning and critical configurable: Upper limit of CPU percentage per category
  • Network activity: bytes/s in and out (and error percentage)
    • check_network
    • No warning or critical criteria.
    • Individual interfaces can be selected
  • Pool free space plus filesystem status
    • check_dpm_pool
    • Warning and critical configurable: Free space per subsystem or per pool. Specified as bytes (with suffixes K,M,G,T,P).
    • Individual pools can be selected, but not individual filesystems.
  • Collecting information about disk server activity (network, disk I/O, memory, number of connections) splitting the information between sequential I/O (gridFTP and rfcp) and random I/O (rfio and xroot)
    • check_process can be used for this, except for disk I/O and network usage (apparently a kernel patch is needed for those)
    • Warning and critical configurable: Number of instances, % of CPU, % of memory, number of threads, number of connections, number of file descriptors.
    • Individual processes can be selected.
    • It needs sudo permissions for lsof in order to retrieve file descriptor and socket information.
  • DPNS ping
    • check_dpns
    • Warning and critical configurable: ping time in milliseconds.
    • Can be used remotely.
  • GridFTP
    • check_gridftp
    • No warning criteria. Critical if a file cannot be uploaded or downloaded, or if the comparison fails.
    • Can be used remotely.
  • Published information
    • check_dpm_infosys
    • No warning criteria. Critical if any of the requested information is not being published.
    • Can be used remotely.
  • RFIO
  • DPM transfers
    • check_dpm_transfers
    • Parses the DPM log and reports the number of successful transfers, the total, and the warning and error margins. The count is incremental, so it grows over the course of the day.
  • Number of requests per VO
    • check_requests_per_vo
    • Queries the database to retrieve the number of requests per VO in the last interval.
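
Several of the plugins above take warning and critical thresholds, and check_dpm_pool expresses them as bytes with the suffixes K, M, G, T and P. The sketch below shows one way such thresholds could be parsed and compared against free space. It is an illustration rather than the plugin's actual code: parse_size and check_free_space are hypothetical names, and treating the suffixes as powers of 1024 is an assumption (the real plugin may use powers of 1000).

```python
import re

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

# Assumption: binary multipliers; the real plugin may use powers of 1000.
MULTIPLIERS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3,
               "T": 1024 ** 4, "P": 1024 ** 5}


def parse_size(text):
    """Convert a string such as '500G' or '2T' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGTP]?)\s*", text.upper())
    if match is None:
        raise ValueError("invalid size: %r" % text)
    number, suffix = match.groups()
    return int(float(number) * MULTIPLIERS[suffix])


def check_free_space(free_bytes, warn, crit):
    """Return a Nagios status for free space; a lower value is worse."""
    if free_bytes <= parse_size(crit):
        return CRITICAL
    if free_bytes <= parse_size(warn):
        return WARNING
    return OK
```

With thresholds of "2T" (warning) and "1T" (critical), a pool with 1500G free would be reported as WARNING under these assumptions. The same lower-is-worse comparison would apply to the "days until expiry" thresholds of check_hostcert and check_oracle_expiration.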

From NAGIOS itself

  • DB activity and size
    • NAGIOS: check_oracle, check_mysql
  • Number of processes and threads in use
    • NAGIOS: check_procs (not threads, though)
  • Check if filesystem correctly mounted
    • NAGIOS: check_disk already does this
  • Disk partitions: used and free
    • NAGIOS: check_disk
  • Memory: swap, free and used
    • NAGIOS: check_swap
  • Load average
    • NAGIOS: check_load

From grid-monitoring

Other resources

Topic revision: r25 - 2011-07-11 - AlejandroAlvarez
 