Grid Monitoring Probes Specification

Versioning

This document defines version 0.91 of the specification, produced on 4 June 2007 (4/6/2007).

Overview

Many site fabric monitoring systems are used within WLCG, but there is currently no standard set of monitoring probes to check the status of grid services. We want a common set of probes to be used to gather metrics on WLCG Grid Services, and we want this set of probes to be runnable from within any fabric monitoring system. The probes must therefore be neutral with respect to any particular fabric monitoring solution. To achieve this, this specification defines:
  • A single format for the input to and output from a probe
  • A way for the probe to be interrogated to determine which metrics exist, and some information about them.

We have based this format on the format used for SAM Tests, with some modifications to make it easier to integrate into other monitoring systems, e.g. return codes, argument names, ...

Note that we focus on probes that are grid-specific, and that usually probe the external, public interface of a service. In particular, we assume that low-level metrics (e.g. whether a daemon is running, or how much disk space is free on a partition) would be gathered using fabric-specific probes.

Message Formats

We assume that the probe format is normally used as an "internal" interface within a fabric monitoring service. Thus, for the sake of efficiency, we have not chosen an XML format, but a simpler key-value pair format. Most values are limited to a single line, and are terminated by a '\n'. There is one exception, detailsData, which allows a single multi-line value to be part of the message. If it exists, it must be the final entry in the message, and it is terminated by a line containing only the string EOT.

A message has the following format (in EBNF), where "number" and "digit" have the normal definitions. We use the EBNF syntax based on Regular Expressions since this is the most commonly understood version (see http://www.cs.man.ac.uk/~pjj/bnf/bnf.html#EBNF for more details):

special            ::= '.' | '-' | ... (* probably this is all printable ASCII apart from '\n' ! *)
messageend         ::= "EOT\n"
colon              ::= ':'
space              ::= ' '
key                ::= ALPHA  ( ALPHA | DIGIT )*
simplevalue        ::= ( ALPHA | DIGIT ) ( ALPHA | DIGIT | space | special )*
multilinevalue     ::= ( simplevalue  '\n' )+ 
line               ::= key space* colon space* simplevalue '\n'
detailsline        ::= "detailsData" space* colon space* multilinevalue
message            ::= line+ detailsline? messageend
Sample messages are:
serviceType: glite-LFC
metricName: org.wlcg.LFC-Write
metricType: status
EOT

serviceType: globus-GridFTP
metricName: org.wlcg.GridFTP-Transfer
metricStatus: OK
timestamp: 2007-02-09T14:42:20.024534Z
summaryData: Transfer OK
detailsData: Upload to remote computer succeeded. 
Download from remote computer succeeded. 
File successfully removed from remote computer. 
Received file is valid. 
EOT
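
The grammar above can be read with a very small parser. The following Python sketch is one possible reading of the format, not a reference implementation. The function name parse_message is hypothetical. It assumes that every entry before detailsData is a single "key: value" line, that the multi-line key is spelled detailsData as in the sample messages, and that everything between that key and the EOT line belongs to the multi-line value.

```python
def parse_message(text):
    """Parse one probe message into a dict (illustrative sketch only)."""
    record = {}
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if line == "EOT":
            return record
        # Split on the first colon only, so values such as timestamps
        # and URIs may themselves contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError("not a key: value line: %r" % line)
        key = key.strip()
        if key == "detailsData":
            # Everything up to the EOT line belongs to the multi-line value.
            details = [value.strip()]
            i += 1
            while i < len(lines) and lines[i] != "EOT":
                details.append(lines[i].strip())
                i += 1
            if i == len(lines):
                raise ValueError("message not terminated by EOT")
            record[key] = "\n".join(details)
            return record
        record[key] = value.strip()
        i += 1
    raise ValueError("message not terminated by EOT")
```

Because the value is split at the first colon only, a line such as "timestamp: 2007-02-09T14:42:20.024534Z" keeps its full value.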

Date Formats

For date fields, we assume a restricted subset of ISO8601. This is the same as in the GridMonitoringDataExchangeStandard#Timestamp_format document, except we allow fractional time:
  • W3C date and time format (based on ISO8601)
  • all components from year to second
  • Fractional seconds are allowed
  • all values in UTC ("Z" as timezone)

An example of this format would be: "2007-02-09T14:42:20.024534Z"
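
As an illustration, timestamps in this restricted format can be produced and read with the Python standard library. This is a minimal sketch; the helper names format_timestamp and parse_timestamp are hypothetical, and the parser assumes at most six fractional digits (the limit of Python's %f directive).

```python
from datetime import datetime, timezone


def format_timestamp(moment):
    """Render a timezone-aware datetime in UTC with fractional seconds and 'Z'."""
    utc = moment.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S.%f") + "Z"


def parse_timestamp(text):
    """Parse a timestamp with or without fractional seconds into an aware UTC datetime."""
    for pattern in ("%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%SZ"):
        try:
            return datetime.strptime(text, pattern).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError("not a UTC timestamp in the expected format: %r" % text)
```

A probe could call format_timestamp(datetime.now(timezone.utc)) to fill in its timestamp field.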

Service Types

We expect there to be a relatively small set of services that would be monitored, so we have not used a highly structured namespace. A service type is simply a string, usually the name of the service as commonly known, or the name of the main daemon for the service. For projects (such as Globus, gLite, ...) you can prefix the project name. Examples include glite-LFC, globus-GridFTP, mysql, httpd. In particular, if there is a set of names already in common usage (e.g. GLUE for naming services in the information system GlueService table), those names should be used in preference to inventing new ones.

In the output of a probe, the service type is signified by a line with the key serviceType e.g.

serviceType : globus-GridFTP

Namespace for metric name

In comparison to Service Types, we expect there to be many more metrics, and we expect them to be written by many people. Therefore we suggest a namespace, based on reverse DNS naming (as used e.g. in VO naming), to identify the author of the metric or the party responsible for it.

Examples include:

metricName: org.glite.testing.LFC-Read
metricName: org.osg.GRAM-Check
metricName: ch.cern.fio.CheckHostCert
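
A consumer that wants to group metrics by author can split such a name into its namespace and its local part. The sketch below assumes that the last dot-separated component is the local metric name, which holds for the examples above but is not stated by this specification; split_metric_name is a hypothetical helper.

```python
def split_metric_name(metric_name):
    """Split a metric name at its last dot into (namespace, local name)."""
    namespace, dot, local_name = metric_name.rpartition(".")
    if not dot or not namespace or not local_name:
        raise ValueError("metric name has no namespace: %r" % metric_name)
    return namespace, local_name
```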

Types of Versioning

In the messages described by this specification, probes can return various versioning information. We define the following keys to represent 'standard' version types:
Key                        Description
probeVersion               The version of the probe software itself
probeSpecificationVersion  The version of this document that the probe input and output conforms to
serviceVersion             The version of the service that the probe can run against

The serviceVersion may be a single value, an inequality (as in RPM dependency versioning), or a list, e.g. all of the following are valid:

serviceVersion: 2.5.0
serviceVersion: >= 1.2.3
serviceVersion: 2.5.0, 2.5.1, 2.5.2
serviceVersion: 1.0.0, >= 1.2.3

We assume normal semantics are used for when to change minor, major and patch version numbers.
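
To show how a monitoring framework might interpret a serviceVersion value, here is a minimal Python sketch. It rests on several assumptions that this specification does not state: versions are purely numeric and dot-separated, a comma-separated list means "any of", and the only operators are >=, <=, >, < and =. The name version_supported is hypothetical.

```python
import re

_OPS = {
    ">=": lambda a, b: a >= b,
    "<=": lambda a, b: a <= b,
    ">": lambda a, b: a > b,
    "<": lambda a, b: a < b,
    "=": lambda a, b: a == b,
}

_CONSTRAINT = re.compile(r"\s*(>=|<=|>|<|=)?\s*([0-9]+(?:\.[0-9]+)*)\s*")


def _as_tuple(version):
    # "2.5.0" -> (2, 5, 0), so that versions compare numerically.
    return tuple(int(part) for part in version.split("."))


def version_supported(service_version, actual):
    """Return True if `actual` satisfies any comma-separated alternative."""
    actual_t = _as_tuple(actual)
    for alternative in service_version.split(","):
        match = _CONSTRAINT.fullmatch(alternative)
        if match is None:
            raise ValueError("cannot parse version constraint: %r" % alternative)
        op = match.group(1) or "="  # a bare version means an exact match
        if _OPS[op](actual_t, _as_tuple(match.group(2))):
            return True
    return False
```

With this reading, "1.0.0, >= 1.2.3" accepts 1.0.0 and 2.5.0 but rejects 1.1.0.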

Considerations about Metrics

Status and Performance Metrics

We consider two different types of metric:
  • Status Metric: This determines the status of a particular subset of the functionality of the service. It is returned from the probe with key metricStatus, and uses one of the following standard status values:

Status    Description
OK        Service is running as expected
WARNING   Service may be degraded in some way, or about to become degraded
CRITICAL  Service has a problem affecting functionality and/or availability
UNKNOWN   Cannot determine service status

Note that UNKNOWN is used by the probe when the probe has an internal problem which means that it cannot accurately determine the status of the service. This is different from, for instance, the service not being contactable.

  • Performance Metric: This is used to gather some information about the service, e.g. load, number of transactions/sec, ... It is returned from the probe with key performanceData. This is a typed value, e.g. string, int, float, boolean.

Locality of probe

We have mentioned in the overview that this document is mainly concerned with probes that connect to the public external interface of a service. However, some information may also be gathered directly on the host at the OS level, e.g. by parsing log files.

In order to accommodate both these possibilities, there are two different ways that a metric can signal where it was run:

  • When the metric is gathered locally on the host. Here, the hostname is published in a line with key hostName. It should be an FQDN, not just a bare host name. E.g.
serviceType: globus-gsi
metricName: org.wlcg.check-host-cert
hostName: lfc101.cern.ch
...
EOT
  • When the metric is gathered via the public service interface. Here the URI of the service endpoint used is published in a line with key serviceURI. This will be a URI (equivalent to GlueServiceURI in the Information System). Also, the host on which the probe was run will be published in a line with key gatheredAt. E.g.
serviceType: glite-LFC
metricName: org.wlcg.LFC-Read
serviceURI: lfc://lfc101.cern.ch/
gatheredAt: mon.cern.ch
...
EOT

Note here that the serviceURI may not, in fact, be an actual URI. This is because in WLCG not all services publish full URIs in the Information System; many publish only hostnames or hostname:port pairs.
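
The choice between the two cases can be sketched in a few lines of Python. The function locality_fields and its local_host parameter are hypothetical. Note also that socket.getfqdn() is only a best-effort way to obtain a fully qualified name, so a real probe may need a site-specific method.

```python
import socket


def locality_fields(service_uri=None, local_host=None):
    """Return the key/value pairs that record where a metric was gathered."""
    # Best effort: getfqdn() may return a short name on a misconfigured host.
    host = local_host or socket.getfqdn()
    if service_uri is None:
        # Local metric: only the host name is published.
        return [("hostName", host)]
    # Remote metric: publish the endpoint and the host that ran the probe.
    return [("serviceURI", service_uri), ("gatheredAt", host)]
```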

Supported probe input arguments

-h, --help : Documenting the probe

The -h or --help option should give information on the probe, for instance the service name and the service versions it supports running against. This may be in plain text format, but we advise providing, as a minimum, the following fields at the top of the output: probeVersion, serviceType, serviceVersion, probeSpecificationVersion.

Any additional arguments that the probe supports should also be documented here.

An example is shown below:

$ GRAM-probe -h
GRAM-probe
probeVersion: 1.7
serviceType: globus-GRAM
serviceVersion: 2.5.0, 3.1.1
probeSpecificationVersion: 0.91

Probe for functional checking of Globus GRAM Gatekeeper

Usage: GRAM-probe
    -u, --uri SERVICEURI
        Service URI. Accepted URIs are:
            host
            host:port
            host:port/service
            host/service
    -m, --metric STRING
        Which metric to perform.
    -l
        Print WLCG-style metric list
    -t, --timeout INTEGER
        Set timeout
    -h, --help
        Print help message

    -x, --proxy CERTFILE
        Location of Nagios user's proxy file (Default: /tmp/x509up_u500)

    -w, --warning INTEGER
        Warning threshold for certificate lifetime
    -c, --critical INTEGER
        Critical threshold for certificate lifetime

-l : Listing metrics that a probe runs

All probes should support a -l argument that outputs the description of the supported metrics. It should output on STDOUT a textual record per metric that it supports. For a status metric the following three fields are mandatory:

Field        Description
serviceType  The service type that this probe works against
metricName   The name of the metric
metricType   This should be the constant value 'status'

An example is:

serviceType: SRM
metricName: org.glite.SRM-getFile
metricType: status
EOT

And for a performance metric, the following four fields are mandatory:

Field        Description
serviceType  The service type that this probe works against
metricName   The name of the metric
metricType   This should be the constant value 'performance'
dataType     The type of the data: float, int, string, boolean

An example is:

serviceType: globus-GridFTP
metricName: org.wlcg.GridFTP-Transfer-Speed
dataType: float
metricType: performance
EOT
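
A probe written in Python might generate these -l records with a helper like the one below. This is a sketch only; describe_metric is a hypothetical name, and the checks simply mirror the mandatory fields listed above.

```python
def describe_metric(service_type, metric_name, metric_type, data_type=None):
    """Render one metric description record, as printed by the -l option."""
    if metric_type not in ("status", "performance"):
        raise ValueError("unknown metricType: %r" % metric_type)
    lines = [
        "serviceType: %s" % service_type,
        "metricName: %s" % metric_name,
        "metricType: %s" % metric_type,
    ]
    if metric_type == "performance":
        # dataType is mandatory for performance metrics only.
        if data_type not in ("float", "int", "string", "boolean"):
            raise ValueError("performance metrics need a valid dataType")
        lines.append("dataType: %s" % data_type)
    lines.append("EOT")
    return "\n".join(lines) + "\n"
```

A probe supporting several metrics would print one such record per metric, one after another.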

General options

When gathering metric values from a probe, there are several standard options:

Option   Required                Meaning
-m       Yes                     The specific metric to gather.
-u       Yes, for remote probes  ServiceURI - the URI of a service endpoint to run the test against.
-t secs  No                      A per-metric timeout, in seconds, that the probe should obey. It is up to the probe to enforce the timeout, if provided.

An example is shown below:

$ LFC-probe -u lfc://lfc-central.cern.ch/ -m org.wlcg.LFC-Read -v dteam
serviceType: glite-LFC
metricName: org.wlcg.LFC-Read
metricStatus: OK
timestamp: 2007-02-09T14:42:21.024Z
summaryData: OK
voName: dteam
serviceURI: lfc://lfc-central.cern.ch/
gatheredAt: lxadm01.cern.ch
EOT

Note that additional, probe-specific arguments are allowed (in this example, -v dteam).
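
For probes written in Python, the standard options could be handled with argparse roughly as follows. This is a sketch under stated assumptions: the long option names (--metric, --uri, --timeout) are taken from the GRAM-probe help example earlier in this document rather than mandated by this specification, and build_parser, check_arguments and the program name example-probe are hypothetical.

```python
import argparse


def build_parser(prog="example-probe"):
    """Command-line parser for the standard probe options (sketch)."""
    # add_help=False so that -h/--help can print the specification-style
    # header fields instead of argparse's generated help text.
    parser = argparse.ArgumentParser(prog=prog, add_help=False)
    parser.add_argument("-h", "--help", action="store_true",
                        help="print probe documentation")
    parser.add_argument("-l", dest="list_metrics", action="store_true",
                        help="list supported metrics")
    parser.add_argument("-m", "--metric", help="the metric to gather")
    parser.add_argument("-u", "--uri", help="service endpoint (remote probes)")
    parser.add_argument("-t", "--timeout", type=int,
                        help="per-metric timeout in seconds")
    return parser


def check_arguments(args, remote=True):
    """Return an error string if a required option is missing, else None."""
    if args.help or args.list_metrics:
        return None
    if not args.metric:
        return "-m is required when gathering a metric"
    if remote and not args.uri:
        return "-u is required for a remote probe"
    return None
```

Probe-specific options, such as -v in the example above, would be added to the same parser with further add_argument calls.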

Probe output Format

The probe returns, on STDOUT, one formatted record per metric gathered. The fields in the record are:

Field            Required?  Description
serviceType      Required   The service the metric was gathered from
metricName       Required   The name of the metric
metricStatus     Required   A return status code, selected from the status codes above
performanceData  Optional   Performance data returned by a performance metric
summaryData      Optional   A one-line summary for the gathered metric
detailsData      Optional   This allows a multi-line detailed entry to be provided - it must be the last entry before the EOT
voName           Optional   The VO that the metric was gathered for
hostName         Optional   The hostName on which a local metric was gathered
serviceURI       Optional   The URI of a remote service the metric was gathered for
gatheredAt       Optional   The name of the host which gathered the metric
timestamp        Required   The time the metric was gathered

There is no defined ordering on the return values, and a probe may return extra values with keys different from those mentioned here.

An example for a status metric is:

serviceType: glite-LFC
metricName: org.glite.LFC-Write
timestamp: 2007-02-09T14:42:20.024534Z
summaryData: OK
metricStatus: OK
voName: dteam
serviceURI: lfc://lfc101.cern.ch/
gatheredAt: mon.cern.ch
EOT

An example for a performance metric is:

serviceType: glite-LFC
metricName: org.glite.LFC-Readdir
timestamp: 2007-02-09T14:42:22.000Z
summaryData: 0.145
metricStatus: OK
voName: dteam
serviceURI: lfc://lfc-central.cern.ch/
gatheredAt:  mon.cern.ch
EOT
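
Records like the ones above can be serialised with a small helper. The Python sketch below is illustrative only; format_record is a hypothetical name. It checks the four required fields, rejects multi-line values for every key except detailsData, and writes detailsData last.

```python
def format_record(fields):
    """Serialise a dict of fields as one probe output record."""
    for required in ("serviceType", "metricName", "metricStatus", "timestamp"):
        if required not in fields:
            raise ValueError("missing required field: %s" % required)
    lines = []
    for key, value in fields.items():
        if key == "detailsData":
            continue  # written last, after all single-line entries
        if "\n" in str(value):
            raise ValueError("%s must fit on a single line" % key)
        lines.append("%s: %s" % (key, value))
    if "detailsData" in fields:
        # The only multi-line value; it must be the final entry before EOT.
        lines.append("detailsData: %s" % fields["detailsData"])
    lines.append("EOT")
    return "\n".join(lines) + "\n"
```

The sketch does not guard against a detailsData line that consists only of the string EOT, which a reader would take as the end of the message.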

The return code from the probe can take one of two values, and should be synchronized with the value provided in metricStatus:

  • 0: The probe could gather the metric successfully. metricStatus is OK, WARNING or CRITICAL.
  • 1: The probe could not gather the metric successfully. metricStatus must be UNKNOWN. More details on the problem can be given in the summaryData and detailsData fields of the metric data.
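
The rule that the return code follows metricStatus can be written down directly. In this Python sketch the names STATUSES and exit_code_for are hypothetical.

```python
STATUSES = ("OK", "WARNING", "CRITICAL", "UNKNOWN")


def exit_code_for(metric_status):
    """Map a metricStatus value to the probe's process return code."""
    if metric_status not in STATUSES:
        raise ValueError("not a standard status: %r" % metric_status)
    # Only UNKNOWN means that the probe itself failed to gather the metric;
    # a CRITICAL service is still a successfully gathered metric.
    return 1 if metric_status == "UNKNOWN" else 0
```

A probe would typically finish by passing this value to sys.exit() after printing its record.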

Specification Changelog

Version 0.91 - 4/6/07

  • Made ServiceURI (-u option) not be mandatory in CLI options, since it could be a 'local fabric' probe
  • Updated and clarified EBNF syntax. All examples in this document are unchanged, but some other possibly valid messages are no longer valid
  • Noted that serviceType should be based, where possible, on an existing scheme for naming services, e.g. GLUE
  • Added note on use of version in the messages.

Version 0.90 - 25/5/07

  • First labelled version

Comments

Please add your comments below:

  • Arvind 18 March 2008

    • I am wondering if it'd be useful to modify that in the next spec version to include a few more pieces of information:
      • Type of metric (SE/CE/foo/bar)
      • How often is the recommended interval to run the probe/metric
      • Help webpage, etc. -- miscellaneous information
      • Re-instate recommended abbreviation for metric

  • -- DanielRodrigues - 21 Feb 2008
    • simple value should allow '/' for the possibility of a path value (ex: /usr/dfrodrig/config.xml )
  • -- s.m.fisher@rlNOSPAMPLEASE.ac.uk and a.j.wilson@rlNOSPAMPLEASE.ac.uk
    • Please use proper EBNF - it is now an ISO standard
    • There is a lot of commonality between this and the proposed OGF logging format. This is based on key = value pairs, has time stamps and has an "event name" which is rather similar to your "metric name"
    • We don't like the use of EOT. It is not compatible with the OGF logging format
    • Don't define your own service types but just follow GLUE
    • The service version should be a metric in its own right
    • The probe version should be returned by a -v or --version
  • -- Emir Imamagic - 30 Mar 2007
    • I think that probe naming will cause us problems. What happens when multiple probes for the same service appear. Possible solution is to make a wrapper around them, but that's just another layer. Assuming that we have a repository of probes and DB with mappings between services and probes, naming of executable is not really that important. The only important thing is that serviceName-metricName is unique. DONE - Requirement dropped on probe naming
    • I think it would be good to have standardized option for port defined in the standard. Since we have serviceEndpoint in data exchange standard, maybe that option (e.g. -u serviceUri) should be mentioned here as alternative to host and port? DONE - -h now removed, and -u added
    • Do we really need timestamp in ISO8601? In a way that is just a burden of conversion on probes. Isn't it more efficient to perform this conversion in transport or presentation layer, if we assume that the probes are invoked far more often. On the other hand it would be good to be consistent with standard data exchange. (comment on JamesCasey - 30 Mar 2007 - 1) DONE - ISO8601 like the exchange format
    • This is actually good question. VO will be defined in proxy certificate used for running checks anyways. However, some probes could need it, e.g. for forming directory path on SRM or some other query. But in that sense it doesn't differ from other possible sensor-specific (e.g. service port, dedicated core service, etc) and maybe shouldn't be explicitly listed in standard. (comment on JamesCasey - 30 Mar 2007 - 2) DONE - no longer a standard option - just an 'additional probe argument' which is probe-specific
    • Didn't we agree to go with serviceName/hostName/gatheredAt combination? I already put this logic in my probes: if option -s (currently -h) is not defined probe hostName, otherwise serviceName. Parameter gatheredAt is present only when pulling data from remote monitoring instance. (comment on JamesCasey - 22 Mar 2007) DONE - option will be -u, not -s, since we specify a URI
    • Option -v would be a good thing, output could be version of probe and version of service. Although supported version of service can be defined per metric. However that will just make dependencies too complicated. (comment on JamesCasey 's comment on Emir Imamagic - 16 Mar 2007) DONE - it's actually called -h or --help
  • -- JamesCasey - 30 Mar 2007 (from discussions with Alain Roy/Rob Quick)
    • Should the date format be ISO8601? DONE - yes
    • How is the --vo argument used? (Interaction with vo attribute in VOMS ?) DONE - no longer needed
  • -- JamesCasey - 22 Mar 2007 (from discussions with Emir/Ian)
    • We need to resolve how to signify that a given metric has been gathered locally or remotely (so they have the same name, for correlation, but are two different metric sequences) DONE
    • We do need serviceName, for SAM style remote checks (which might be run locally at the site, but are still "remote" to the tested host) DONE
  • -- Emir Imamagic - 16 Mar 2007
    • I agree with Ian's first two comments. Would it make sense to include probe version and/or supported service version in probe output? OK - will look how to encode this - perhaps another option ( -v) ? DONE - -h option
    • I would suggest changing nodeName to hostName, since the host is more general expression for machine than node. Usage of more general expressions would make this standard applicable more widely (e.g. grid infrastructures beside WLCG). DONE
    • It would be nice to have a timeout option (-t) for the standard probe. Probe developer probably knows best how to cleanly stop retrieving specific metric and therefore should implement stop mechanism in the probe. DONE
    • I suggest that the voName field (and vo parameter) be optional. There are cases where you want to make a generic check of service (e.g. if the service is up). That would also enable probes to be "easily" utilized in environments and monitoring frameworks which are vo-agnostic. DONE
  • -- IanNeilson - 16 Mar 2007
    • It may be too late given existing infrastructures but I believe that the return code of the probe should not relate to the state of the service probed but only to whether the probe was actually run successfully. There is usually a gray area between but in general it will be neater, and imposes no significant overhead, to encode the metric result in the probe's output and leave the status to indicate whether such output is even there and valid. In this model for multiple metrics the return status would be 'fail' if any single metric failed to be gathered since the probe was asked to gather them all. DONE
    • I think I prefer to see serviceName explicit (if you mean e.g. LFC) included even though it should be extractable from metricName. An ugly use-case for the future there might be to allow the re-use of some probes against new implementations of service interfaces? DONE
    • There seems no encoding to return per-probe availability. For me this seems the most complex task and is quite distinct from the metric gathering, so I would rather see a separation into an 'availability calculator' (which could take the piped input from the probe(s) and/or be done elsewhere) rather than trying to make the probe a more general object. AGREE - this should not be in the probe
  • -- JamesCasey - 12 Mar 2007
    • Return code for probe when more than 1 metric is run ? DONE - now only one metric can be run at a time
    • serviceName - as well as, or instead of, nodeName ? DONE -both
    • availability - can we calculate this per metric ? Or per probe ? EXTERNAL to probe
-- JamesCasey - 12 Mar 2007
Topic revision: r25 - 2008-03-18 - JamesCasey
 