SAM Probes and Metrics

gLite M/W Services to be tested

List of gLite M/W services/modules that require testing.

Service | Probe | Comment
AMGA_mysql | - | developed by SA3. At least AMGA-ping (service level ping) is needed
AMGA_postgres | - | developed by SA3. At least AMGA-ping (service level ping) is needed
t/sBDII | org.bdii/check_bdii_*, /usr/bin/gstat-validate-*, nagios/check_ldap | gstat-validation, probes-org.bdii
CE | org.sam/CE-probe, org.sam/WN-probe | LCG-CE via WMS (probes-org.sam)
CREAM CE | org.sam/CREAMCE-probe, org.sam/WN-probe, org.sam/CREAMCEDJS-probe | CREAM CE via WMS and direct job submission (probes-org.sam)
FTS_oracle | ch.cern/FTS-probe | ch.cern.FTS-ChannelList (probes-ch.cern)
FTA_oracle | - | -
FTM | - | -
LB | native Nagios check | org.nagios.LocalLogger-PortCheck
LFC_mysql/oracle | ch.cern/LFC-probe | ch.cern.LFC-{Read,Write,Readdir,ReadDli} (probes-ch.cern)
MON | ch.cern/RGMA-probe | ch.cern.RGMA-ServiceStatus (probes-ch.cern)
PX | hr.srce/MyProxy-probe | hr.srce.MyProxy-Store (probes-hr.srce)
SE_dcache | org.sam/SRM-probe | org.sam.SRM-<metricName>
SE_dpm_disk | org.sam/SRM-probe | org.sam.SRM-<metricName>
SE_dpm_mysql | org.sam/SRM-probe | org.sam.SRM-<metricName>
TORQUE_client | - | site level fabric monitoring
TORQUE_server | - | site level fabric monitoring; plus APEL as indirect test
VOBOX | org.alice/VOBOX-probe | org.alice.VOBOX-{6 tests} link
VOMS_mysql/oracle | org.nmap | -
WMS | org.sam/WMS-probe, hr.srce/{WMProxy-probe,WMS-probe} | probes-org.sam - asynchronous; probes-hr.srce synchronous

SAM vs Nagios tests naming correspondence

For the naming correspondence between critical SAM tests and Nagios metrics, see link.

grid-monitoring-probes-org.sam RPM

The grid-monitoring-probes-org.sam RPM is available through the EGEE SA1 repository http://www.sysadmin.hep.ac.uk/rpms/egee-SA1/ (and also via the egee-NAGIOS meta RPM). The RPM's directory structure is as follows:

/etc/gridmon/
/usr/lib/python2.4/site-packages/gridmetrics/
/usr/libexec/grid-monitoring/probes/org.sam/
/usr/libexec/grid-monitoring/probes/org.sam/wnjob
/usr/libexec/grid-monitoring/probes/org.sam/wnjob/nagios.d/{bin/,etc/,lib/,plugins/,probes/,tmp/,var/}
/usr/libexec/grid-monitoring/probes/org.sam/wnjob/org.sam/{etc/wn.d/org.sam/,probes/org.sam/}

  • Dependencies
    python >= 2.4
    python-GridMon >= 1.1.3
    python-ldap  
    python-suds >= 0.3.5
    grid-monitoring-probes-hr.srce >= 0.20.1
    

Currently the RPM consists of:

  • SAM Nagios probes (in /usr/libexec/grid-monitoring/probes/org.sam/):
    • CE-probe - CE probe containing a number of CE tests (metrics) for jobs submission via WMS
    • CREAMCE-probe - as above, but for CREAM CEs
    • CREAMCEDJS-probe - direct job submission to CREAM CEs (asynchronous)
    • SRM-probe - SRM probe containing a number of metrics for SRM service
    • T-probe - template probe, which serves as an example for writing your own probes based on the Python framework currently provided by the package (see Writing a probe under "Python based probes using org.sam's 'gridmonsam' module" section on the same page)
    • WN-probe - WN probe containing a number of metrics to be run on WNs
    • WMS-probe - metrics to test if jobs submission through WMS works (asynchronous)
  • wrapper checks (in /usr/libexec/grid-monitoring/probes/org.sam/):
    • samtest-run - to run "native" SAM tests (see link)
    • nagtest-run - to run "semi"-Nagios checks (see link)
  • /usr/libexec/grid-monitoring/probes/org.sam/wnjob - directory containing
    • nagios.d/ - directory with Nagios used as checks' scheduler on WNs
    • nagrun.sh - wrapper script to be launched on WNs (sets up required environment, launches and monitors Nagios, periodically sends WN metrics results to Message Bus)
    • org.sam/ -
      • probes/ - directory with SAM WN probes/tests ("new and old" ones), samtest-run and nagtest-run wrappers
      • etc/ - WN Nagios configuration for the above checks
  • gridmetrics Python package (in /usr/lib/python2.4/site-packages/):
    • used by the above SAM probes.
  • /etc/gridmon/ - configuration directory:
    • org.sam.conf - main configuration file
    • org.sam.errdb - collection of common gLite m/w error messages and their mapping to Nagios statuses

Source code can be browsed here: https://svnweb.cern.ch/trac/sam/browser/trunk/probes, http://svnweb.cern.ch/guest/sam/trunk/probes

SRM

  • Equivalence [see link for critical tests defined in SAM for SRM for 'OPS' VO]

SAM Sensor | WLCG/Nagios probe
SRMv2 | org.sam.SRM-probe

SAM Test | WLCG/Nagios metric
* SRMv2-host-cert-valid | hr.srce.SRM2-CertLifetime
- | org.sam.SRM-All
* SRMv2-get-SURLs | org.sam.SRM-GetSURLs
* SRMv2-ls-dir | org.sam.SRM-LsDir
* SRMv2-put | org.sam.SRM-Put
* SRMv2-ls | org.sam.SRM-Ls
* SRMv2-gt | org.sam.SRM-GetTURLs
* SRMv2-get | org.sam.SRM-Get
* SRMv2-del | org.sam.SRM-Del
* - critical & accounted for availability

  • Probe org.sam.SRM-probe tests the SRM service, versions 1 and 2.

probeName: org.sam.SRM-probe
serviceVersion: 1.*, 2.*
  • Metrics descriptions
Metric | Description
org.sam.SRM-All | Wrapper metric to launch the other metrics and publish passive checks results to Nagios.
org.sam.SRM-GetSURLs | Get full SRM endpoint(s) and storage areas from BDII.
org.sam.SRM-LsDir | List content of VO's top level space area(s) in SRM.
org.sam.SRM-Put | Copy a local file to the SRM into default space area(s).
org.sam.SRM-Ls | List (previously copied) file(s) on the SRM.
org.sam.SRM-GetTURLs | Get Transport URLs for the file copied to storage.
org.sam.SRM-Get | Copy given remote file(s) from SRM to a local file.
org.sam.SRM-Del | Delete given file(s) from SRM.

  • Metrics specific options
    • SRM type (version)
      • --srmv [1|2] (Default: 2)
    • LDAP URL
      • --ldap-uri [ldap://]server[:port] (Defaults: server lcg-bdii.cern.ch, port 2170) org.sam.SRM-GetSURLs
    • timeouts:
      • --ldap-timeout timeout (sec) (Default: 10) org.sam.SRM-GetSURLs
      • --se-timeout timeout (sec) (Default: 120) all except org.sam.SRM-GetSURLs
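
The LDAP option format listed above can be illustrated with a small parser. This is only a sketch of how a value of the form [ldap://]server[:port] might be split and completed with defaults; the function name parse_ldap_uri is hypothetical and is not taken from the probe. The default host used here is the one quoted in the list above (the probe's help output quotes a different server, so treat the constant as an assumption to adjust).

```python
DEFAULT_LDAP_HOST = "lcg-bdii.cern.ch"   # default quoted in the option list (assumption)
DEFAULT_LDAP_PORT = 2170                 # default port quoted in the option list


def parse_ldap_uri(value=None):
    """Split '[ldap://]server[:port]' into (host, port), filling in defaults."""
    if not value:
        return DEFAULT_LDAP_HOST, DEFAULT_LDAP_PORT
    rest = value.strip()
    if rest.startswith("ldap://"):
        rest = rest[len("ldap://"):]
    rest = rest.rstrip("/")                       # tolerate a trailing slash
    host, sep, port = rest.partition(":")
    if not host:
        raise ValueError("missing LDAP server in %r" % value)
    return host, (int(port) if sep and port else DEFAULT_LDAP_PORT)


print(parse_ldap_uri("ldap://bdii.example.org:2171/"))
```

bdii.example.org is a placeholder host name used only for the example.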

  • Dependency tree for the metrics in org.sam.SRM-probe
   
        1:GetSURLs
         ^     ^
        /       \
   2:LsDir  ____3:Put________
            ^   ^     ^     ^
           /   /       \     \
       4:Ls 5:GetTURLs 6:Get 7:Del
   
   eg.:
   2:LsDir - "sequence number":"metrics abbreviation"
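
The tree above can be written down as a parent map and walked in sequence order. The sketch below is illustrative only: the tree and the sequence numbers come from the diagram, but the policy of skipping a metric whose parent did not return OK is an assumption, and run_in_order and run_metric are hypothetical names.

```python
# Parent of each metric, transcribed from the tree above (GetSURLs is the root).
PARENT = {
    "GetSURLs": None,
    "LsDir": "GetSURLs",
    "Put": "GetSURLs",
    "Ls": "Put",
    "GetTURLs": "Put",
    "Get": "Put",
    "Del": "Put",
}
# Sequence numbers 1..7 from the tree give the execution order.
ORDER = ["GetSURLs", "LsDir", "Put", "Ls", "GetTURLs", "Get", "Del"]


def run_in_order(run_metric):
    """Run metrics in sequence; skip a metric whose parent did not return OK (0).

    run_metric is a callable taking a metric abbreviation and returning a
    Nagios-style return code. Skipped metrics are recorded as None.
    """
    results = {}
    for name in ORDER:
        parent = PARENT[name]
        if parent is not None and results.get(parent) != 0:
            results[name] = None          # assumed policy: prerequisite failed
            continue
        results[name] = run_metric(name)
    return results


# Example: Put fails with CRITICAL (2), so Ls, GetTURLs, Get and Del are skipped.
demo = run_in_order(lambda name: 2 if name == "Put" else 0)
print(demo)
```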

  • help output from org.sam.SRM-probe

[kvs] src > ./SRM-probe
Usage: /usr/libexec/grid-monitoring/probes/org.sam/SRM-probe
[-H|--hostname <FQDN>]|[-u|--uri <URI>] [-m|--metric <name>] [-t|--timeout sec] 
[-V] [-h|--help] [--wlcg] [-v|--verbose 0-3] [-l|--list] [-x proxy] [<metric 
specific parameters>]

-V                 Displays version
-h|--help          Displays help
-t|--timeout sec   Sets metric's global timeout. (Default: 600)
-m|--metric <name> Name of a metric to be collected. Eg. org.sam.SRMv2-Put.
                   If not given, a default wrapper metric will be executed.
-H|--hostname FQDN Hostname where a service to be tested is running on
-u|--uri <URI>     Service URI to be tested
-v|--verbose 0-3   Verbosity. (Default: 0)
                   0 Single line, minimal output. Summary
                   1 Single line, additional information
                   2 Multi line, configuration debug output
                   3 Lots of details for plugin problem diagnosis  
-l|--list          Metrics list in WLCG format
-x                 VOMS proxy (Order: -x, X509_USER_PROXY, /tmp/x509up_u<UID>)
--nosanity         Don't sanitize metrics output.

  Mandatory paramters: hostname (-H) or URI (-u). 

  If specified with -m|--metric <name>, the given metric will be executed. 
  Otherwise, a wrapper metric (acting as an active check) will be run. The 
  latter is equivalent to "-m|--metric <nameSpace>.<Service>-All"  

    Metrics common parameters:   

Reporting passive checks (when used with wrapper checks)
    
--pass-check-dest <config|nsca|nagcmd|active> (Default: config) 

--pass-check-conf <path> Configuration file for reporting passive checks.
                         Used with '--pass-check-dest config'. Overrides 
                         passive checks submission library default one.

--nsca-server <fqdn|ip> NSCA server FQDN or IP. Required if --pass-check-dest
                        is set to 'nsca'.
--nsca-port <port>      Port NSCA is listening on (Default: 5667)
--send-nsca <path>      NSCA client binary.  (Default: /usr/sbin/send_nsca)
--send-nsca-conf <path> NSCA configuration file. (Default: /etc/nagios/send_nsca.cfg)

--nagcmdfile <path>   Nagios command file. 
                      Order: $NAGIOS_COMMANDFILE, --nagcmdfile 
                      (Default: /var/nagios/rw/nagios.cmd) 

--vo <name>           Virtual Organization. (Default: ops)
--err-db <file>       Full path. Database file containing gLite CLI/API errors
                      for categorizing runtime errors. (Default: /etc/gridmon/gridmon.errdb)
--err-topics <top1,>  Comma separated list of topics (Default: default)

--work-dir <dir>      Working directory for metrics.
                      (Default: /var/run/gridprobes/<VO>)

--stdout              Detailed output of metrics will be printed to stdout as 
                      it is being produced by metrics. The default is to store
                      the output in a container and, then, produce Nagios 
                      compiant output.

    Metrics specific options:

--srmv <1|2>           (Default: 2)

org.sam.SRM-GetSURLs
--ldap-uri <URI>       Format [ldap://]hostname[:port[/]] 
                       (Default: ldap://sam-bdii.cern.ch:2170)
--ldap-timeout <sec>   (Default: 10)   
    
org.sam.SRM-{LsDir,Put,Ls,GetTURLs,Get,Del}
--se-timeout <sec>     (Default: 120)
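
The lookup order documented for the -x option (-x, then X509_USER_PROXY, then /tmp/x509up_u<UID>) can be sketched as follows. This is an illustration, not the probe's code: resolve_proxy is a hypothetical helper, and it does not check that the file exists or that the proxy is still valid.

```python
import os


def resolve_proxy(cli_proxy=None, environ=None, uid=None):
    """Return the proxy path using the order -x, X509_USER_PROXY, /tmp/x509up_u<UID>."""
    environ = os.environ if environ is None else environ
    if cli_proxy:                                  # value given with -x
        return cli_proxy
    if environ.get("X509_USER_PROXY"):             # environment variable
        return environ["X509_USER_PROXY"]
    uid = os.getuid() if uid is None else uid      # fall back to the per-user default
    return "/tmp/x509up_u%d" % uid


print(resolve_proxy(None, {}, 1000))
```

Passing environ and uid explicitly keeps the function easy to test; by default it reads the real environment and the current user ID.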

SRM Results

  • - -- no values
  • / -- no mapping

  • [1] SAM SRMv2-get-SURLs vs [2] SAM-Nag SRMv2-org.sam.SRM-GetSURLs
STATUS [1] [2] # [1] [2] # [1] [2]
- 2009-06-03; 08:20 # 2009-06-03; 16:00 # 2009-06-09; 15:54
0/na - - # - 1 # - 3
10/ok 324 323 # 323 322 # 322 310
20/info - / # - / # - /
30/note - / # - / # - /
40/warn - - # - - # 6 12
50/error 13 5 # 14 5 # 6 8
60/crit - - # - - # - /
100/maint - / # - / # - /
nodes 337 328 # 337 328 # 334 333

  • [1] SAM SRMv2-put vs [2] SAM-Nag SRMv2-org.sam.SRM-Put
STATUS [1] [2] # [1] [2] # [1] [2]
- 2009-06-03; 08:20 # 2009-06-03; 16:00 # 2009-06-09; 15:54
0/na - - # - 1 # - -
10/ok 324 323 # 323 322 # 328 321
20/info - / # - / # - /
30/note - / # - / # - /
40/warn - - # - - # - 6
50/error 13 5 # 14 5 # 6 6
60/crit - / # - / # - /
100/maint - / # - / # - /
nodes 337 328 # 337 328 # 334 333

CREAM CE

submission via WMS

With respect to this submission method, there are no differences between LCG and CREAM CEs (and thus none in monitoring). Please refer to the CE section.

Probe and metrics names differ only in the name of the service (CREAM vs CE): probe org.sam/CREAMCE-probe, metrics org.sam.CREAMCE-*

direct submission

org.sam.CREAMCEDJS-probe probe with the following metrics

Metric | Description
org.sam.CREAMCEDJS-DirectJobState | [Active+Passive] Direct job submission to CREAM CE
org.sam.CREAMCEDJS-DirectJobStatus | [Passive] Final status of direct job submission to CREAM CE
org.sam.CREAMCEDJS-DirectJobMonit | [Active] Babysit submitted grid jobs
org.sam.CREAMCEDJS-ServiceInfo | Get CREAM CE service info
org.sam.CREAMCEDJS-SubmitAllowed | Check if submission to the CREAM CE is allowed
org.sam.CREAMCEDJS-DelegateProxy | Delegate proxy to CREAM CE

CE

  • Equivalence [see link for critical tests defined in SAM for CE for 'OPS' VO]
SAM Sensor | WLCG/Nagios probe
CE | org.sam.CE-probe

SAM Test | WLCG/Nagios metric
* CE-sft-job | org.sam.CE-JobSubmit
* CE-host-cert-valid | hr.srce.GRAM-CertLifetime
* - critical & accounted for availability

  • Probe org.sam.CE-probe tests job submission to CEs via WMS.
    • The check delivers tests to WNs (eg. org.sam.WN-probe) and executes respective metrics there. Currently, Nagios is used as a scheduler on WNs. 'handle_service_check' OCSP is used to store metrics results as WLCG tuples and, then, 'send_to_msg' periodically (invoked from wrapper script on WN) sends the tuples to MB.

  • Metrics descriptions

metricName | metricDescription | metricType | metricLocality
org.sam.CE-JobState | Submits grid job to CE | status | remote
org.sam.CE-JobMonit | Monitors grid jobs submitted to CEs | status | remote

  • help output from org.sam.CE-probe

# ./CE-probe -h
Usage: /usr/libexec/grid-monitoring/probes/org.sam/CE-probe
[-H|--hostname <FQDN>]|[-u|--uri <URI>] [-m|--metric <name>] [-t|--timeout sec] 
[-V] [-h|--help] [--wlcg] [-v|--verbose 0-3] [-l|--list] [-x proxy] [<metric 
specific parameters>]

-V                 Displays version
-h|--help          Displays help
-t|--timeout sec   Sets metric's global timeout. (Default: 600)
-m|--metric <name> Name of a metric to be collected. Eg. org.sam.SRMv2-Put.
                   If not given, a default wrapper metric will be executed.
-H|--hostname FQDN Hostname where a service to be tested is running on
-u|--uri <URI>     Service URI to be tested
-v|--verbose 0-3   Verbosity. (Default: 0)
                   0 Single line, minimal output. Summary
                   1 Single line, additional information
                   2 Multi line, configuration debug output
                   3 Lots of details for plugin problem diagnosis  
-l|--list          Metrics list in WLCG format
-x                 VOMS proxy (Order: -x, X509_USER_PROXY, /tmp/x509up_u<UID>)
--nosanity         Don't sanitize metrics output.

  Mandatory paramters: hostname (-H) or URI (-u). 

  If specified with -m|--metric <name>, the given metric will be executed. 
  Otherwise, a wrapper metric (acting as an active check) will be run. The 
  latter is equivalent to "-m|--metric <nameSpace>.<Service>-All"  

    Metrics common parameters:   

Reporting passive checks (when used with wrapper checks)
    
--pass-check-dest <config|nsca|nagcmd|active> (Default: config) 

--pass-check-conf <path> Configuration file for reporting passive checks.
                         Used with '--pass-check-dest config'. Overrides 
                         passive checks submission library default one.

--nsca-server <fqdn|ip> NSCA server FQDN or IP. Required if --pass-check-dest
                        is set to 'nsca'.
--nsca-port <port>      Port NSCA is listening on (Default: 5667)
--send-nsca <path>      NSCA client binary.  (Default: /usr/sbin/send_nsca)
--send-nsca-conf <path> NSCA configuration file. (Default: /etc/nagios/send_nsca.cfg)

--nagcmdfile <path>   Nagios command file. 
                      Order: $NAGIOS_COMMANDFILE, --nagcmdfile 
                      (Default: /var/nagios/rw/nagios.cmd) 

--vo <name>           Virtual Organization. (Default: ops)
--err-db <file>       Full path. Database file containing gLite CLI/API errors
                      for categorizing runtime errors. (Default: /etc/gridmon/gridmon.errdb)
--err-topics <top1,>  Comma separated list of topics (Default: default)

--work-dir <dir>      Working directory for metrics.
                      (Default: /var/run/gridprobes/<VO>)

--stdout              Detailed output of metrics will be printed to stdout as 
                      it is being produced by metrics. The default is to store
                      the output in a container and, then, produce Nagios 
                      compiant output.

    Metrics specific parameters:

--namespace <string>    Name-space for the probe. (Default: org.sam)
--config <file1,>       Comma separated list of metrics configuration files.
                        (Default: /etc/gridmon/org.sam.conf)
org.sam.CE-JobState
--mb-destination <dest> Mandatory parameter. The destination queue/topic on 
                        Message Broker to publish to.
--mb-uri <URI>   Message Broker URI. If not given, MB discovery will be 
                 performed on WN to find working MB. 
                 Format for <URI>: [failover://\(]<uri>,[...][\)]
                 <uri> - stomp://FQDN:port/ or http://FQDN/message
                 (Default: service discovery on WN.)
--wms <wms>      WMS to be used for job submission. If not given, default 
                 WMProxy end-points defined on the UI will be used.
--timeout-wnjob-global <sec>   Global timeout for a job on WN. (Default: 600)
--add-wntar-nag <d1,d2,..>  Comma-separated list of top level directories with 
                            Nagios compliant directories structure to be added 
                            to tarball to be sent to WN.
--add-wntar-nag-nosam       Instructs the metric not to include standard SAM WN
                            probes and their Nagios config to WN tarball. 
                            (Default: WN probes are included)
--add-wntar-nag-nosamcfg    Instructs the metric not to include Nagios 
                            configuration for SAM WN probes to WN tarball. The 
                            probes themselves and respective Python packages, 
                            however, will be included.
--jdl-templ <file>    JDL template file (full path). Default: 
                      <org.sam.ProbesLocation>/wnjob/org.sam.gridJob.jdl.template
--jdl-retrycount <val>          JDL RetryCount (Default: 0).
--jdl-shallowretrycount <val>   JDL ShallowRetryCount (Default: 1).
--wnjob-location <dir>  Full path to directory contaning WN scheduler.
                        (Default: <org.sam.ProbesLocation>/wnjob)
--wnjob-verb <0-3>    Verbosity level on WN (Default: 1)

org.sam.CE-JobMonit
--timeout-job-global <sec>  Global timeout for jobs. Job will be canceled
                            and dropped if it is not in terminal state by 
                            that time. (Default: 3300)
--timeout-job-waiting <sec> Time allowed for a job to stay in Waiting with 
                            'no compatible resources'. (Default: 2700)
--hosts <h1,h2,..>  Comma-separated list of CE hostnames to run monitor on.

CE Results

  • - -- no values
  • / -- no mapping

  • [1] SAM CE-sft-job vs [2] SAM-Nag org.sam.CE-JobSubmit
STATUS [1] [2] # [1] [2] # [1] [2] # [1] [2]
- 2009-05-29; 17:10 # 2009-05-29; 18:30 # 2009-06-03; 08:20 # 2009-06-09; 15:54
0/na - - # - - # - - # - -
10/ok - 254 # 366 340 # 367 351 # 372 362
20/info - / # - / # - / # - /
30/note - / # - / # - / # 1 /
40/warn - 25 # - - # - 39 # - 29
50/error - - # 28 38 # 28 - # 24 -
60/crit - / # - / # - / # - /
100/maint - / # - / # - / # - /
nodes - 279 # 394 378 # 395 390 # 397 391

Nagios CE testing

  • three Nagios checks
    • CE-JobState - active + passive check (service <hostNameCE,CE-JobState>). Runs hourly.
      • submits grid job to CE
      • accepts passive check results (from CE-JobMonit) for submitted grid job - holds a status of the grid job
    • CE-JobMonit - active check (service <localhost,CE-JobMonit>). Runs each 5 min.
      • checks statuses of all submitted jobs and updates CE-JobState and CE-JobSubmit (acts as a babysitter for all grid jobs submitted by CE-JobState service instances). CE-JobState and CE-JobSubmit are updated (as passive checks) either via the Nagios command file or NSCA.
    • CE-JobSubmit - passive check (service <hostNameCE,CE-JobSubmit>)
      • holds terminal status of job submission to CE (mapping from gLite job terminal states ['Done','Aborted','Canceled'] to Nagios status [OK,WARNING,CRITICAL,UNKNOWN])
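
When passive results are delivered through the Nagios command file, each result is a single PROCESS_SERVICE_CHECK_RESULT external command line. The sketch below shows that line format; the helper names are hypothetical, and the host and service names are borrowed from the example configuration on this page. It illustrates the mechanism and is not the submission library used by the probes.

```python
import time


def passive_result_line(host, service, code, output, now=None):
    """Format a PROCESS_SERVICE_CHECK_RESULT external command for Nagios."""
    ts = int(time.time() if now is None else now)
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        ts, host, service, code, output)


def submit_passive(cmdfile, line):
    """Append the command to the Nagios command file (a named pipe on a live server)."""
    with open(cmdfile, "a") as fh:
        fh.write(line)


line = passive_result_line("ce110.cern.ch", "org.sam.CE-JobSubmit-ops", 0,
                           "OK: Done (Success)", now=1265625600)
print(line, end="")
# On a Nagios host one would then call, for example:
# submit_passive("/var/nagios/rw/nagios.cmd", line)
```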

  • Example Nagios configuration.
    • services.cfg for org.sam.CE-{JobState,JobMonit,JobSubmit}-<VO> using the ncg-generic-service and ncg-passive-service service object templates
# org.sam.CE-JobState : [active+passive] submits grid job to CE, holds a status of the grid job
define service{
        use                             ncg-generic-service
        host_name                       ce110.cern.ch
        servicegroups                   local, ops
        service_description             org.sam.CE-JobState-ops
        contact_groups                  CERN_PPS-site
        check_command                   ncg_check_native!$USER10$/CE-probe!600!-x $USER5$ --vo ops -m org.sam.CE-JobState --mb-destination /topic/grid...
        active_checks_enabled           1
        passive_checks_enabled          1
        normal_check_interval           60
        retry_check_interval            15
        max_check_attempts              3
        obsess_over_service             0
        # + _vo, _service_uri, _metric_name, _metric_set, _site_name
}

# org.sam.CE-JobMonit : [active] babysitter
define service{
        use                             ncg-generic-service
        host_name                       lxvm0325.cern.ch
        servicegroups                   local, ops
        service_description             org.sam.CE-JobMonit-ops
        contact_groups                  nagios-admins
        check_command                   ncg_check_native!$USER10$/CE-probe!600!-x $USER5$ --vo ops -m org.sam.CE-JobMonit --mb-destination /topic/grid...
        active_checks_enabled           1
        passive_checks_enabled          0
        normal_check_interval           5
        retry_check_interval            2
        max_check_attempts              2
        obsess_over_service             0
        # + _vo, _service_uri, _metric_name, _metric_set, _site_name
}

# org.sam.CE-JobSubmit : [passive] terminal status of job submission to CE
define service{
        use                             ncg-passive-service
        host_name                       ce110.cern.ch
        servicegroups                   local, ops
        service_description             org.sam.CE-JobSubmit-ops
        contact_groups                  CERN_PPS-site
        check_command                   ncg_check_passive!"just nothing"
        obsess_over_service             1
        # + _vo, _service_uri, _metric_name, _metric_set, _site_name
}

Jobs Submission and Monitoring

According to the WMS Job State Machine (link, p.17), a job can be in one of the following states:

  • non-terminal: Submitted, Waiting, Ready, Scheduled, Running
    • Submitted: job is entered by the user to the UI but not yet transferred to NS for processing
    • Waiting: job has been accepted by NS and is waiting for WM processing or is being processed by WM Helper modules (e.g., WM is busy, no appropriate CE (cluster) has been found yet, ...).
    • Ready: job has been processed by WM and its Helper modules (especially, appropriate CE has been found) but not yet transferred to the CE (local batch system queue) via Job Controller and CondorC.
    • Scheduled: job is waiting in the queue on the Computing Element.
    • Running: job is running.
  • terminal: Done, Aborted, Canceled, Cleared
    • Done: job exited or is considered to be in a terminal state by CondorC (e.g., submission to CE has failed in an unrecoverable way).
    • Aborted: job processing was aborted by WMS (waiting in the WM queue or CE for too long, over-use of quotas, expiration of user credentials, etc.).
    • Canceled: job has been successfully canceled on user request.
    • Cleared: output sandbox was transferred to the user or removed due to the timeout.

On the WMS, two main parameters control timeouts in job matchmaking:

  • MatchRetryPeriod = 3500 (58 min) - interval between successive attempts to match a job to a resource (T_WMS_MatchRetr)
  • ExpiryPeriod = 7200 (2 hours) - time after which the job is aborted with 'no compatible resources' (T_WMS_Exp)

With these defaults, a job can be matched at most three times within the two hours following submission.

With JDL

   JobType="Normal";
   ...
   RetryCount = 0;
   ShallowRetryCount = 1;
   Requirements = other.GlueCEInfoHostName == "<CE hostname>";
   

and a 1 hour interval between job submissions, it is advisable to set e.g. MatchRetryPeriod = 1320 (22 min) and ExpiryPeriod = 3000 (50 min). This way the WMS will naturally abort jobs if information about the CE is not available in the IS.
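
The number of matchmaking attempts implied by the two WMS parameters can be checked with a short calculation. The sketch assumes a simplified model in which the first attempt happens at submission (t = 0) and further attempts follow every MatchRetryPeriod seconds until ExpiryPeriod is reached; match_attempts is a hypothetical helper name.

```python
def match_attempts(match_retry_period, expiry_period):
    """Count attempts at t = 0, R, 2R, ... that fall strictly before expiry."""
    return -(-expiry_period // match_retry_period)   # ceiling division


print(match_attempts(3500, 7200))   # WMS defaults
print(match_attempts(1320, 3000))   # values suggested above
```

Under this model both settings give three attempts; with the suggested values the job expires after 50 minutes, before the next hourly submission.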

In Nagios, job submission and monitoring are implemented as follows.

  • timeouts defined for org.sam.CE-JobState and org.sam.CE-JobMonit metrics
    org.sam.CE-JobState
    --timeout-job-discard <sec> Discard job after the timeout. (Default: 21600)
    
    org.sam.CE-JobMonit
    --timeout-job-global <sec>  Global timeout for jobs. Job will be canceled
                                and dropped if it is not in terminal state by 
                                that time. (Default: 3300)
    --timeout-job-waiting <sec> Time allowed for a job to stay in Waiting with 
                                'no compatible resources'. (Default: 2700)
    --timeout-job-discard <sec> Discard job after the timeout. (Default: 21600)
    --timeout-job-schedrun <sec> Scheduled/Running states timeout. (Default: 19800)
       

  • org.sam.CE-JobState metric (active Nagios check). Runs hourly (normal_check_interval 60).
    • initially submits job and saves /<workdirRun>/<voName>/<nameSpace>/<serviceType>/<nodeName>/activejob.map with submitTimeStamp|hostNameCE|serviceDesc|jobID|jobState|lastStateTimeStamp.
    • if activejob.map was found
      • jobState is terminal state - discard the job, proceed with submission
      • jobState is non-terminal state
        • lastStateTimeStamp - submitTimeStamp < timeout-job-discard - exit with OK: Active job - <jobState> [time]
        • lastStateTimeStamp - submitTimeStamp > timeout-job-discard - discard the job, proceed with submission
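
The org.sam.CE-JobState decision described above can be sketched in a few lines. The record layout is the one given for activejob.map; treating both timestamps as integer epoch seconds, treating the equality case as "still active", and the names parse_activejob and jobstate_action are assumptions made for this illustration. The job ID and WMS host in the example are placeholders.

```python
from collections import namedtuple

FIELDS = "submitTimeStamp hostNameCE serviceDesc jobID jobState lastStateTimeStamp"
ActiveJob = namedtuple("ActiveJob", FIELDS)
TERMINAL = {"Done", "Aborted", "Canceled", "Cleared"}


def parse_activejob(line):
    """Parse one '|'-separated activejob.map record."""
    parts = line.strip().split("|")
    if len(parts) != 6:
        raise ValueError("expected 6 fields, got %d" % len(parts))
    return ActiveJob(int(parts[0]), parts[1], parts[2], parts[3], parts[4], int(parts[5]))


def jobstate_action(job, timeout_job_discard=21600):
    """Return 'submit' (discard any old job, submit a new one) or 'active' (exit OK)."""
    if job is None or job.jobState in TERMINAL:
        return "submit"
    if job.lastStateTimeStamp - job.submitTimeStamp > timeout_job_discard:
        return "submit"
    return "active"


job = parse_activejob(
    "1265625600|ce110.cern.ch|org.sam.CE-JobState-ops|"
    "https://wms.example.org:9000/abc|Waiting|1265626200")
print(jobstate_action(job))
```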

  • org.sam.CE-JobMonit metric (active Nagios check; checks all jobs and updates activejob.map, org.sam.CE-JobState & org.sam.CE-JobSubmit). Runs each 5 min (normal_check_interval 5). For all currently submitted jobs (activejob.map files) get job state from WMS
    • on error getting job state
      • UI problem - update org.sam.CE-JobState with WARNING
      • WMS problem
        • timeNow - submitTimeStamp < timeout-job-discard - update org.sam.CE-JobState with WARNING (unable to get job status. Job will be deleted in N min; N = (timeout-job-discard - (timeNow - submitTimeStamp))/60)
        • else - update org.sam.CE-JobState with WARNING and org.sam.CE-JobSubmit with UNKNOWN (unable to get job status. Job discarded.)
    • on OK getting job state
      • Done
        • Current Status: Done (Success) - update org.sam.CE-Job{State,Submit} with OK.
        • Current Status: Done (Exit Code != 0) - the framework on the WN exits with Nagios-compliant exit codes. Check the exit code and update org.sam.CE-Job{State,Submit} with WARNING, CRITICAL or UNKNOWN respectively.
        • delete activejob.map.
      • Aborted
        • get logging info and get reason
          • request expired
            • BrokerHelper: no compatible resources - update org.sam.CE-Job{State,Submit} with CRITICAL (Job was aborted. Failed to match.).
            • else - update org.sam.CE-Job{State,Submit} with UNKNOWN (Job was aborted. Check WMS.)
          • else - update org.sam.CE-Job{State,Submit} with CRITICAL (Job was aborted.).
        • delete activejob.map.
      • Cleared
        • delete activejob.map.
      • Canceled
        • delete activejob.map.
      • Waiting
        • get logging info and get reason
          • no compatible resources
            • timeNow - submitTimeStamp > timeout-job-waiting - update org.sam.CE-Job{State,Submit} with CRITICAL (BrokerHelper: no compatible resources). Cancel and discard the job.
            • else - update org.sam.CE-JobState with WARNING. Update activejob.map.
          • else
            • timeNow - submitTimeStamp > timeout-job-discard - cancel & delete activejob.map.
      • Ready, Submitted
        • timeNow - submitTimeStamp > timeout-job-global - update org.sam.CE-Job{State,Submit} with UNKNOWN. Get logging info & include into details data; cancel & delete activejob.map.
        • else - update activejob.map.
      • Scheduled, Running
        • timeNow - submitTimeStamp > timeout-job-schedrun - update org.sam.CE-Job{State,Submit} with WARNING. Get logging info & include into details data; cancel & delete activejob.map. Issue CRITICAL on the second successive timeout. (JobMonit has the states counter).
        • else - update activejob.map.
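
The "on OK getting job state" branch of the list above can be condensed into one decision function. This is a sketch under several assumptions: the function name, the return convention and the 'keep'/'delete'/'cancel' action labels are hypothetical; the exit code of a Done job is assumed to already be a Nagios code (0..3); the case of a Waiting job with another reason that has not yet reached timeout-job-discard is assumed to be left as is; the error branches (UI or WMS problems) and the counter that escalates a second successive Scheduled/Running timeout to CRITICAL are omitted.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Default timeouts in seconds, as listed for org.sam.CE-JobMonit.
DEFAULTS = {"waiting": 2700, "global": 3300, "schedrun": 19800, "discard": 21600}


def monit_decision(state, age, reason="", exit_code=0, timeouts=DEFAULTS):
    """Return (JobState status, JobSubmit status, action) for one job.

    age is timeNow - submitTimeStamp in seconds. A status of None means no
    passive result is sent. action is 'keep' (update activejob.map),
    'delete' (delete activejob.map) or 'cancel' (cancel the job and delete).
    """
    if state == "Done":
        return exit_code, exit_code, "delete"
    if state == "Aborted":
        if "request expired" in reason and "no compatible resources" not in reason:
            return UNKNOWN, UNKNOWN, "delete"      # "Job was aborted. Check WMS."
        return CRITICAL, CRITICAL, "delete"
    if state in ("Cleared", "Canceled"):
        return None, None, "delete"
    if state == "Waiting":
        if "no compatible resources" in reason:
            if age > timeouts["waiting"]:
                return CRITICAL, CRITICAL, "cancel"
            return WARNING, None, "keep"
        if age > timeouts["discard"]:
            return None, None, "cancel"
        return None, None, "keep"                  # assumption: nothing to do yet
    if state in ("Ready", "Submitted"):
        if age > timeouts["global"]:
            return UNKNOWN, UNKNOWN, "cancel"
        return None, None, "keep"
    if state in ("Scheduled", "Running"):
        if age > timeouts["schedrun"]:
            return WARNING, WARNING, "cancel"      # first timeout only
        return None, None, "keep"
    raise ValueError("unexpected job state: %r" % state)


print(monit_decision("Waiting", 3000, "BrokerHelper: no compatible resources"))
```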

Currently [08-02-2010] we are monitoring with all the defaults

                T_N_SCHED(1h)                T_J_DISCARD
          T_J_GLOB |               T_J_SCHEDRUN  |
      T_J_WAIT  |  |                          |  |
|-----------|---|--#---------------|------...-|--| -> t
                  |                |
                T_WMS_MatchRetr  T_WMS_Exp

T_J_WAIT - 45 min (--timeout-job-waiting)
T_J_GLOB - 55 min (--timeout-job-global)
T_WMS_MatchRetr - 58 min (MatchRetryPeriod)
T_N_SCHED - 1 hour (Nagios metric scheduling)
T_WMS_Exp - 2 hours (ExpiryPeriod)
T_J_SCHEDRUN - 5h30min (--timeout-job-schedrun)
T_J_DISCARD - 6 hours (--timeout-job-discard)
   
Thus, in most cases we cancel jobs being in Waiting due to no compatible resources when T_J_WAIT kicks in (after only one initial matchmaking in WM) and issue CRITICAL for org.sam.CE-Job{State,Submit}.
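
The ordering of the timeouts on the timeline can be verified with a few lines. The values are those listed above, converted to seconds; the dictionary and variable names are only illustrative.

```python
# Timeouts in seconds, as listed above.
TIMEOUTS = {
    "T_J_WAIT": 45 * 60,
    "T_J_GLOB": 55 * 60,
    "T_WMS_MatchRetr": 3500,
    "T_N_SCHED": 60 * 60,
    "T_WMS_Exp": 7200,
    "T_J_SCHEDRUN": 19800,
    "T_J_DISCARD": 21600,
}

timeline = sorted(TIMEOUTS, key=TIMEOUTS.get)
for name in timeline:
    print("%-16s %6d s  (%5.1f min)" % (name, TIMEOUTS[name], TIMEOUTS[name] / 60.0))

# T_J_WAIT fires before the first WMS matchmaking retry, so a job waiting on
# 'no compatible resources' is canceled after a single matchmaking attempt.
first_retry_after_wait = TIMEOUTS["T_J_WAIT"] < TIMEOUTS["T_WMS_MatchRetr"]
print(first_retry_after_wait)
```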

Moving to the case T_WMS_MatchRetr < T_WMS_Exp < T_J_WAIT (or even 2*T_WMS_MatchRetr < T_WMS_Exp < T_J_WAIT) is quite feasible. Jobs sent to CEs that are not (properly) published in the IS would then be discarded naturally by the WMS (Aborted; reason: no compatible resources). The monitoring metric (org.sam.CE-JobMonit) already handles this case and will issue CRITICAL against the CE.

WN

  • Equivalence
SAM Sensor | WLCG/Nagios probe(s)
testjob | org.sam.WN-probe, samtest-run + SAM native tests

SAM Test | WLCG/Nagios metric
* CE-sft-brokerinfo | org.sam.WN-Bi [samtest-run]
* CE-sft-caver | org.sam.WN-CAver [samtest-run]
* CE-sft-csh | org.sam.WN-Csh [samtest-run]
* CE-sft-softver | org.sam.WN-SoftVer [samtest-run]
* CE-sft-lcg-rm | org.sam.WN-Rep [wrapper for: org.sam.WN-{RepISenv,...,RepDel}], [WN-probe]
-* CE-sft-lcg-rm-gfal | org.sam.WN-RepISenv
-* CE-sft-lcg-rm-free | org.sam.WN-RepFree
-* CE-sft-lcg-rm-cr | org.sam.WN-RepCr
-* CE-sft-lcg-rm-cp | org.sam.WN-RepGet
-* CE-sft-lcg-rm-rep | org.sam.WN-RepRep
-* CE-sft-lcg-rm-del | org.sam.WN-RepDel
CE-sft-posix | org.sam.WN-Gfal [wrapper for: org.sam.WN-{GfalCp,...,GfalDel}]; remote POSIX I/O via GFAL: should depend on successful completion of org.sam.WN-Bi and org.sam.WN-RepCr
- | org.sam.WN-GfalCp
- | org.sam.WN-GfalRead
- | org.sam.WN-GfalWrite
- | org.sam.WN-GfalDel
CE-wn-sec-crl | org.sam.sec or SWAT
CE-wn-sec-fp | org.sam.sec or SWAT
CE-sft-wn | SWAT
CE-sft-vo-tag | SWAT
CE-sft-vo-swdir | SWAT
CE-sft-rgma | SWAT
* - critical & accounted for availability

-* - sub-tests of CE-sft-lcg-rm, if one of them fails the main wrapper fails

  • Probe org.sam.WN-probe
    • performs security, replica management, remote POSIX I/O with GFAL checks on WNs

  • Metrics descriptions

metricName | metricDescription | metricType | metricLocality
org.sam.WN-Rep | Wrapper check to launch the replica management checks and publish passive check results to Nagios. | status | local
org.sam.WN-RepISenv | Check if LCG_GFAL_INFOSYS variable is set | status | local
org.sam.WN-RepFree | Check if Close (or VO default) SE has any free space left according to the information system. | status | remote
org.sam.WN-RepCr | Copy and register a file to the Close (or default) SE into default space area. Retrieve list of replicas. | status | remote
org.sam.WN-RepGet | Copy the file back from Close SE to the WN. Compare the files. | status | remote
org.sam.WN-RepRep | Replicate the file from close SE to a chosen 'central' SE. | status | remote
org.sam.WN-RepDel | Delete given file(s) from SRM. | status | remote
org.sam.WN-PyVer | Check version of Python installed on WN. | status | local
org.sam.WN-Gfal | Wrapper check to launch checks for remote POSIX I/O via GFAL and publish passive check results to Nagios. | status | local
org.sam.WN-Gfal* | ... TODO ... | status | remote
org.sam.WN-CAver | CA integrity and existence (NB! validity is not tested. It's a responsibility of IGTF http://signet-ca.ijs.si/nagios/) | status | local
org.sam.WN-CAcrl | validity of CA CRL | status | local

  • Dependency tree for the org.sam.WN-Rep* metrics in org.sam.WN-probe probe:

        0:Rep (wrapper)
          |
     1:RepISenv
       ^     ^
      /       \
 2:RepFree   __3:RepCr____
              ^   ^      ^
             /    |       \
       4:RepGet 5:RepRep 6:RepDel

Metrics 1 to 6 are all considered critical: if at least one of them fails, the wrapper metric org.sam.WN-Rep fails as well. This matches the current SAM test implementation and the GridView availability calculations (compare the SAM check table above for CE-sft-lcg-rm / org.sam.WN-Rep and the SAM critical tests).
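
The dependency tree and the "any failure fails the wrapper" rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout and the helper names (execution_order, wrapper_ok) are hypothetical and are not taken from WN-probe, and each sub-metric result is reduced to a simple pass/fail flag.

```python
# Dependency tree for org.sam.WN-Rep* as drawn above: metric -> parent metric.
# The data layout and helper names are illustrative assumptions,
# not the actual WN-probe implementation.
DEPENDS_ON = {
    "RepISenv": None,        # 1: run first by the wrapper
    "RepFree": "RepISenv",   # 2
    "RepCr": "RepISenv",     # 3
    "RepGet": "RepCr",       # 4
    "RepRep": "RepCr",       # 5
    "RepDel": "RepCr",       # 6
}


def execution_order(depends_on):
    """Return metric names so that every parent precedes its children."""
    ordered = []
    remaining = dict(depends_on)
    while remaining:
        ready = [metric for metric, parent in remaining.items()
                 if parent is None or parent in ordered]
        if not ready:
            raise ValueError("dependency cycle or missing parent")
        for metric in ready:
            ordered.append(metric)
            del remaining[metric]
    return ordered


def wrapper_ok(results):
    """The wrapper passes only if every critical sub-metric passed."""
    return all(results.get(metric, False) for metric in DEPENDS_ON)


if __name__ == "__main__":
    print(execution_order(DEPENDS_ON))
    results = {metric: True for metric in DEPENDS_ON}
    results["RepRep"] = False      # a single failure is enough
    print(wrapper_ok(results))
```

A metric with no recorded result is treated as failed here, which is a conservative choice made for the sketch.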

  • help output from org.sam.WN-probe


# ./WN-probe -h
Usage: /usr/libexec/grid-monitoring/probes/org.sam/WN-probe
[-H|--hostname <FQDN>]|[-u|--uri <URI>] [-m|--metric <name>] [-t|--timeout sec] 
[-V] [-h|--help] [--wlcg] [-v|--verbose 0-3] [-l|--list] [-x proxy] [<metric 
specific parameters>]

-V                 Displays version
-h|--help          Displays help
-t|--timeout sec   Sets metric's global timeout. (Default: 600)
-m|--metric <name> Name of a metric to be collected. Eg. org.sam.SRMv2-Put.
                   If not given, a default wrapper metric will be executed.
-H|--hostname FQDN Hostname where a service to be tested is running on
-u|--uri <URI>     Service URI to be tested
-v|--verbose 0-3   Verbosity. (Default: 0)
                   0 Single line, minimal output. Summary
                   1 Single line, additional information
                   2 Multi line, configuration debug output
                   3 Lots of details for plugin problem diagnosis  
-l|--list          Metrics list in WLCG format
-x                 VOMS proxy (Order: -x, X509_USER_PROXY, /tmp/x509up_u<UID>)
--nosanity         Don't sanitize metrics output.

  Mandatory paramters: hostname (-H) or URI (-u). 

  If specified with -m|--metric <name>, the given metric will be executed. 
  Otherwise, a wrapper metric (acting as an active check) will be run. The 
  latter is equivalent to "-m|--metric <nameSpace>.<Service>-All"  

    Metrics common parameters:   

Reporting passive checks (when used with wrapper checks)
    
--pass-check-dest <config|nsca|nagcmd|active> (Default: config) 

--pass-check-conf <path> Configuration file for reporting passive checks.
                         Used with '--pass-check-dest config'. Overrides 
                         passive checks submission library default one.

--nsca-server <fqdn|ip> NSCA server FQDN or IP. Required if --pass-check-dest
                        is set to 'nsca'.
--nsca-port <port>      Port NSCA is listening on (Default: 5667)
--send-nsca <path>      NSCA client binary.  (Default: /usr/sbin/send_nsca)
--send-nsca-conf <path> NSCA configuration file. (Default: /etc/nagios/send_nsca.cfg)

--nagcmdfile <path>   Nagios command file. 
                      Order: $NAGIOS_COMMANDFILE, --nagcmdfile 
                      (Default: /var/nagios/rw/nagios.cmd) 

--vo <name>           Virtual Organization. (Default: ops)
--err-db <file>       Full path. Database file containing gLite CLI/API errors
                      for categorizing runtime errors. (Default: /etc/gridmon/gridmon.errdb)
--err-topics <top1,>  Comma separated list of topics (Default: default)

--work-dir <dir>      Working directory for metrics.
                      (Default: /var/run/gridprobes/<VO>)

--stdout              Detailed output of metrics will be printed to stdout as 
                      it is being produced by metrics. The default is to store
                      the output in a container and, then, produce Nagios 
                      compiant output.

    Metrics specific options:

org.sam.WN-{Rep,Rep{Cr,Get,Rep,Del}}
--lfc <FQDN>    LFC to be used with the replica management tests.
                (Default: prod-lfc-shared-central.cern.ch)

org.sam.WN-{Rep,RepRep}
--se-rep <FQDN> "Central" SE to be used with the replica management test.
                (Default: samdpm002.cern.ch)

gLExec

This is a WN test that tries to execute export GLEXEC_CLIENT_CERT=$X509_USER_PROXY; <path_to>/glexec /usr/bin/id -a. A separate set of CE job submission/monitoring metrics (org.sam.glexec.CE-Job{State,Submit,Monit}-<VO>) is used to deliver the test to WNs. The Nagios metric name is org.sam.glexec.WN-gLExec-<VO>. The jobs are submitted under the OPS VO with Role=pilot.

Currently [Friday, March 05 2010], the test exit codes (Nagios) and summaries are the following:

   status = 'OK'
   summary = 'success'
   
   status = 'UNKNOWN'
   summary = "glexec command not found."

   status = 'WARNING'
   summary = "client cert file error: <error message>"
   
   status = 'WARNING'
   summary = "executable can't be executed (126)"
   
   status = 'WARNING'
   summary = "client error (201)"
   
   status = 'WARNING'
   summary = "system error (202)"
   
   status = 'WARNING'
   summary = "authorization error (203)"
   
   status = 'UNKNOWN'
   summary = "exit code overlap (204)"
   
   status = 'UNKNOWN'
   summary = "unrecognised exit code (<exit code>)"
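
The mapping above can be expressed as a small lookup table. The following Python sketch is an illustration only, not the code of the actual check. It assumes that a glexec exit code of 0 corresponds to the 'OK' / 'success' case, it uses the standard Nagios plugin return codes, and it leaves out the "client cert file error" case because the listing does not say how that condition is detected. The function name classify_glexec and the glexec_found flag are hypothetical.

```python
# Standard Nagios plugin return codes.
NAGIOS_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}

# glexec exit code -> (Nagios status, summary), following the list above.
# Mapping exit code 0 to OK is an assumption of this sketch.
GLEXEC_RESULTS = {
    0: ("OK", "success"),
    126: ("WARNING", "executable can't be executed (126)"),
    201: ("WARNING", "client error (201)"),
    202: ("WARNING", "system error (202)"),
    203: ("WARNING", "authorization error (203)"),
    204: ("UNKNOWN", "exit code overlap (204)"),
}


def classify_glexec(exit_code, glexec_found=True):
    """Return (status, summary) for a glexec run."""
    if not glexec_found:
        return "UNKNOWN", "glexec command not found."
    if exit_code in GLEXEC_RESULTS:
        return GLEXEC_RESULTS[exit_code]
    return "UNKNOWN", "unrecognised exit code (%d)" % exit_code


if __name__ == "__main__":
    status, summary = classify_glexec(203)
    print("%s: %s (plugin exit code %d)" % (status, summary, NAGIOS_CODES[status]))
```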
   

WN Results


  • - -- no values
  • / -- no mapping

  • [1] SAM CE-sft-lcg-rm vs [2] SAM-Nag org.sam.WN-Rep
STATUS [1] [2] # [1] [2] # [1] [2] # [1] [2] # [1] [2]
- 2009-05-29; 17:10 # 2009-05-29; 18:40 # 2009-06-03; 08:20 # 2009-06-03; 16:00 # 2009-06-09; 15:54
0/na - 9 # - 35 # - 35 # - 38 # - 20
10/ok - 35 # 355 148 # 352 173 # 353 182 # 369 292
20/info - / # - / # - / # - / # - /
30/note - / # - / # - / # - / # - /
40/warn - 3 # - 4 # 1 4 # - 4 # 1 4
50/error - 6 # 30 47 # 32 59 # 33 49 # 19 34
60/crit - / # - / # - / # - / # - /
100/maint - / # - / # - / # - / # - /
nodes - 53 # 385 234 # 385 271 # 386 273 # 389 350

  • [1] SAM CE-sft-caver vs [2] SAM-Nag org.sam.WN-CAver
STATUS [1] [2] # [1] [2]
- 2009-06-03; 08:20 # 2009-06-09; 15:54
0/na - - # - 1
10/ok 383 262 # 387 342
20/info - / # - /
30/note - / # - /
40/warn - 9 # - 7
50/error 1 - # - -
60/crit 1 / # 2 /
100/maint - / # - /
nodes 385 271 # 389 350

  • [1] SAM CE-sft-softver vs [2] SAM-Nag CE-org.sam.WN-SoftVer
STATUS [1] [2] # [1] [2]
- 2009-06-03; 08:20 # 2009-06-09; 15:54
0/na - - # - -
10/ok 385 262 # 389 343
20/info - / # - /
30/note - / # - /
40/warn - 9 # - 7
50/error - - # - -
60/crit - / # - /
100/maint - / # - /
nodes 385 271 # 389 350

Missing WN test results

No. Problem Reason Region / Site / CE
1. Can't locate URI.pm in @INC site: missing perl-URI or not in $PERL5LIB 1. AP PH-ASTI-BUHAWI buhawi.pscigrid.gov.ph GGUS ticket Yes / Done
2. CA VICTORIA-LCG2 lcg-ce.rcf.uvic.ca GGUS ticket Yes / Done
3. UKI UKI-NORTHGRID-LIV-HEP hepgrid3.ph.liv.ac.uk GGUS ticket Yes / Done
4. DECH SCAI cedric.scai.fraunhofer.de GGUS ticket Yes / Done
5. CA CA-SCINET-T2 lcg-ce1.scinet.utoronto.ca GGUS ticket Yes / Done
6. SWE CESGA-EGEE ce2.egee.cesga.es GGUS ticket Yes / Done
2. Can't locate Net/LDAP.pm in @INC site: missing perl-LDAP or not in $PERL5LIB 1. NE ITPA-LCG2 spektras.itpa.lt GGUS ticket Yes / Done
2. AP INDIACMS-TIFR ce.indiacms.res.in GGUS ticket Yes / Done
3. Compilation failed in .../send_to_msg Can't load ... auto/Time/HiRes/HiRes.so: wrong ELF class: ELFCLASS32 FW Yes / Done: Perl x86_64 loading i386 auto/Time/HiRes/HiRes.so [solution: r.635 removed 'auto/'] 1. NE HPC2N glite02-kvm.hpc2n.umu.se
4. Passive... 'org.sam.WNRepCr-ops'..., but the service could not be found! FW Yes / Done: "dash" disappears 'org.sam.WN_-_RepCr-ops' [solved: "dash" miraculously disappeared from the code 8-/] Affected sites that failed one of tests in a set of dependent tests (in publication of passive check results)
5. Can't connect to given MB (gridmsg002). Can't discover others (either can't connect to tBDII (Perl LDAP) or can't connect to discovered MBs) site: check_broker & find_all_brokers - sites firewall outgoing TCP:{6162,6163}. Request to add the ports to M/W ports table was submitted to John White [09-02-2010] 1. IT GRISU-ENEA-GRID egce1-cresco.portici.enea.it, egce-cresco.portici.enea.it, egce.frascati.enea.it GGUS #55378 Yes / Done
2. SWE IEETA axon-g01.ieeta.pt GGUS #55377 Yes / Done
3. SWE UPorto grid001.fe.up.pt, grid001.fc.up.pt GGUS #55379 Yes / Done
4. UKI cpDIASie gridgate.cp.dias.ie GGUS #55380 Yes / Done
5. IT SISSA-Trieste ce-01.grid.sissa.it (GOCDB 07-02-2010 not monitored) No


Issue 5.

CEs seen suffering from the issue but now in OK state. Time window:

 Wed Feb  3 01:00:00 CET 2010
 Sun Feb  8 18:45:36 CET 2010

SWE UPV-GRyCAP ramses.dsic.upv.es [08-02-2010]
NE KTU-BG-GLITE ce.bg.ktu.lt OK [07-02-2010]
SEE HG-02-IASA ce01.marie.hellasgrid.gr OK [07-02-2010]
CE PSNC creamce.reef.man.poznan.pl OK [07-02-2010] CREAM CE
NE CSC egee-ce.csc.fi OK [07-02-2010]
CE GUP-JKU egee-ce1.gup.uni-linz.ac.at OK [07-02-2010]
UKI UKI-SOUTHGRID-BHAM-HEP epgce4.ph.bham.ac.uk OK [07-02-2010]
AP MY-UPM-BIRUNI-01 haitham.biruni.upm.my OK [07-02-2010]
FR AUVERGRID iut03auvergridce01.univ-bpclermont.fr OK [07-02-2010]
CANADA VICTORIA-LCG2 lcg-ce.rcf.uvic.ca OK [07-02-2010]
UKI UKI-SOUTHGRID-OX-HEP ngsce-test.oerc.ox.ac.uk OK [07-02-2010]
DECH UNI-DORTMUND udo-ce03.grid.tu-dortmund.de OK [07-02-2010]
SEE WEIZMANN-LCG2 wipp-ce.weizmann.ac.il OK [07-02-2010]

Detailed error messages for the above issues.

  1. Can't locate URI.pm in @INC
    Check if provided MBs are working.
    WARNING: Provided MB isn't accessible stomp://gridmsg002.cern.ch:6163/
    Trying to obtain it from IS.
    ERROR: Failed to obtain Message Broker URI from bdii.grid.sinica.edu.tw:2170.
    Can't locate URI.pm in @INC (@INC contains: /opt/lcg/lib64/perl /opt/gpt/lib/perl/x86_64-linux-thread-multi /opt/gpt/lib/perl /home/ops031/globus-tmp.compute-1-13.1672.0/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2f-LUJNLqgrm7yadDNrsfHmQ/.nagios/lib/perl5/site_perl/5.8.5 /home/ops031/globus-tmp.compute-1-13.1672.0/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2f-LUJNLqgrm7yadDNrsfHmQ/.nagios/lib/perl5/vendor_perl/5.8.5/x86_64-linux-thread-multi /home/ops031/globus-tmp.compute-1-13.1672.0/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2f-LUJNLqgrm7yadDNrsfHmQ/.nagios/lib/perl5/vendor_perl/5.8.5 /usr/lib64/perl5/site_perl/5.8.8/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.7/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.6/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.5/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.8.8 /usr/lib/perl5/site_perl/5.8.7 /usr/lib/perl5/site_perl/5.8.6 /usr/lib/perl5/site_perl/5.8.5 /usr/lib/perl5/site_perl /usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.8.7/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.8.6/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.8.5/x86_64-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.8 /usr/lib/perl5/vendor_perl/5.8.7 /usr/lib/perl5/vendor_perl/5.8.6 /usr/lib/perl5/vendor_perl/5.8.5 /usr/lib/perl5/vendor_perl /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/5.8.8 .) at /home/ops031/globus-tmp.compute-1-13.1672.0/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2f-LUJNLqgrm7yadDNrsfHmQ/.nagios/lib/perl5/site_perl/5.8.5/GridMon/MsgBroker.pm line 30, <DATA> line 225. BEGIN failed--compilation aborted at /home/ops031/globus-tmp.compute-1-13.1672.0/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2f-LUJNLqgrm7yadDNrsfHmQ/.nagios/lib/perl5/site_perl/5.8.5/GridMon/MsgBroker.pm line 30, <DATA> line 225. 
Compilation failed in require at /home/ops031/globus-tmp.compute-1-13.1672.0/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2f-LUJNLqgrm7yadDNrsfHmQ/.nagios/bin/find_broker line 34, <DATA> line 225. BEGIN failed--compilation aborted at /home/ops031/globus-tmp.compute-1-13.1672.0/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2f-LUJNLqgrm7yadDNrsfHmQ/.nagios/bin/find_broker line 34, <DATA> line 225.
    lobus-tmp.compute-1-13.1672.0/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2f-LUJNLqgrm7yadDNrsfHmQ/.nagios/bin/check_broker line 17.
       
  2. Can't locate Net/LDAP.pm in @INC
    Check if provided MBs are working.
    WARNING: Provided MB isn't accessible stomp://gridmsg002.cern.ch:6163/
    Trying to obtain it from IS.
    ERROR: Failed to obtain Message Broker URI from bdii.mif.vu.lt:2170.
    Can't locate Net/LDAP.pm in @INC (@INC contains: /opt/gLite/lcg/lib/perl /usr/lib/perl5/vendor_perl/5.8.8/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.8 /usr/lib/perl5/vendor_perl/5.8.5/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.5 /opt/gLite/gpt/lib/perl/i386-linux-thread-multi /opt/gLite/gpt/lib/perl /home/ops003/globus-tmp.wn4a.26989.0/https_3a_2f_2fwmssamtest02.cern.ch_3a9000_2fBkx0Jmc1zU7RKQtN-gXq4w/.nagios/lib/perl5/site_perl/5.8.5 /home/ops003/globus-tmp.wn4a.26989.0/https_3a_2f_2fwmssamtest02.cern.ch_3a9000_2fBkx0Jmc1zU7RKQtN-gXq4w/.nagios/lib/perl5/vendor_perl/5.8.5/i386-linux-thread-multi /home/ops003/globus-tmp.wn4a.26989.0/https_3a_2f_2fwmssamtest02.cern.ch_3a9000_2fBkx0Jmc1zU7RKQtN-gXq4w/.nagios/lib/perl5/vendor_perl/5.8.5 /usr/lib/perl5/site_perl/5.8.8/i386-linux-thread-multi /usr/lib/perl5/site_perl/5.8.7/i386-linux-thread-multi /usr/lib/perl5/site_perl/5.8.6/i386-linux-thread-multi /usr/lib/perl5/site_perl/5.8.5/i386-linux-thread-multi /usr/lib/perl5/site_perl/5.8.8 /usr/lib/perl5/site_perl/5.8.7 /usr/lib/perl5/site_perl/5.8.6 /usr/lib/perl5/site_perl/5.8.5 /usr/lib/perl5/site_perl /usr/lib/perl5/vendor_perl/5.8.8/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.7/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.6/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.5/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.8 /usr/lib/perl5/vendor_perl/5.8.7 /usr/lib/perl5/vendor_perl/5.8.6 /usr/lib/perl5/vendor_perl/5.8.5 /usr/lib/perl5/vendor_perl /usr/lib/perl5/5.8.8/i386-linux-thread-multi /usr/lib/perl5/5.8.8 .) at /home/ops003/globus-tmp.wn4a.26989.0/https_3a_2f_2fwmssamtest02.cern.ch_3a9000_2fBkx0Jmc1zU7RKQtN-gXq4w/.nagios/bin/find_broker line 32. BEGIN failed--compilation aborted at /home/ops003/globus-tmp.wn4a.26989.0/https_3a_2f_2fwmssamtest02.cern.ch_3a9000_2fBkx0Jmc1zU7RKQtN-gXq4w/.nagios/bin/find_broker line 32.
    idMon/MsgBroker.pm line 30.
    BEGIN failed--compilation aborted at /home/ops003/globus-tmp.wn4a.26989.0/https_3a_2f_2fwmssamtest02.cern.ch_3a9000_2fBkx0Jmc1zU7RKQtN-gXq4w/.nagios/lib/perl5/site_perl/5.8.5/GridMon/MsgBroker.pm line 30.
    Compilation failed in require at /home/ops003/globus-tmp.wn4a.26989.0/https_3a_2f_2fwmssamtest02.cern.ch_3a9000_2fBkx0Jmc1zU7RKQtN-gXq4w/.nagios/bin/check_broker line 17.
    BEGIN failed--compilation aborted at /home/ops003/globus-tmp.wn4a.26989.0/https_3a_2f_2fwmssamtest02.cern.ch_3a9000_2fBkx0Jmc1zU7RKQtN-gXq4w/.nagios/bin/check_broker line 17.
       
  3. Compilation failed in require at /glite/home/ops/ops042/.../.nagios/libexec/grid-monitoring/plugins/nagios/send_to_msg
    _send_to_msg [Sat Jan 16 09:29:35 CET 2010]: 
    /ops/ops042/gram_scratch_aI3WCJLGUe/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fsCjfOt0BlxCoJ0te6SJ8qA/.nagios/lib/perl5/site_perl/5.8.5/GridMon/MsgCache.pm line 23.
    Compilation failed in require at /glite/home/ops/ops042/gram_scratch_aI3WCJLGUe/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fsCjfOt0BlxCoJ0te6SJ8qA/.nagios/libexec/grid-monitoring/plugins/nagios/send_to_msg line 57.
    BEGIN failed--compilation aborted at /glite/home/ops/ops042/gram_scratch_aI3WCJLGUe/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fsCjfOt0BlxCoJ0te6SJ8qA/.nagios/libexec/grid-monitoring/plugins/nagios/send_to_msg line 57.
    Can't load '/glite/home/ops/ops042/gram_scratch_aI3WCJLGUe/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fsCjfOt0BlxCoJ0te6SJ8qA/.nagios/lib/perl5/vendor_perl/5.8.5/auto/Time/HiRes/HiRes.so' for module Time::HiRes: /glite/home/ops/ops042/gram_scratch_aI3WCJLGUe/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fsCjfOt0BlxCoJ0te6SJ8qA/.nagios/lib/perl5/vendor_perl/5.8.5/auto/Time/HiRes/HiRes.so: wrong ELF class: ELFCLASS32 at /usr/lib/perl/5.8/DynaLoader.pm line 225.
     at /glite/home/ops/ops042/gram_scratch_aI3WCJLGUe/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fsCjfOt0BlxCoJ0te6SJ8qA/.nagios/lib/perl5/vendor_perl/5.8.5/IPC/DirQueue.pm line 63
    Compilation failed in require at /glite/home/ops/ops042/gram_scratch_aI3WCJLGUe/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fsCjfOt0BlxCoJ0te6SJ8qA/.nagios/lib/perl5/vendor_perl/5.8.5/IPC/DirQueue.pm line 63.
    BEGIN failed--compilation aborted at /glite/home/ops/ops042/gram_scratch_aI3WCJLGUe/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fsCjfOt0BlxCoJ0te6SJ8qA/.nagios/lib/perl5/vendor_perl/5.8.5/IPC/DirQueue.pm line 63.
    Compilation failed in require at /glite/home/ops/ops042/gram_scratch_aI3WCJLGUe/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fsCjfOt0BlxCoJ0te6SJ8qA/.nagios/lib/perl5/site_perl/5.8.5/GridMon/MsgCache.pm line 23.
    BEGIN failed--compilation aborted at /glite/home/ops/ops042/gram_scratch_aI3WCJLGUe/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fsCjfOt0BlxCoJ0te6SJ8qA/.nagios/lib/perl5/site_perl/5.8.5/GridMon/MsgCache.pm line 23.
       
  4. Passive... 'org.sam.WNRepCr-ops'..., but the service could not be found!
    Warning:  Passive check result was received for service 'org.sam.WNRepCr-ops' on host 'localhost.localdomain', but the service could not be found!
       
  5. ERROR: Failed to obtain Message Broker URI from ... Need more debug info on what has happened on WN!
    Check if provided MBs are working.
    WARNING: Provided MB isn't accessible stomp://gridmsg002.cern.ch:6163/
    Trying to obtain it from IS.
    ERROR: Failed to obtain Message Broker URI from egee-bdii.cnaf.infn.it:2170.
    Could not connect to BDII at egee-bdii.cnaf.infn.it:2170 at /gpor_proj/spagogrid/egee/home/crescoops004/gram_scratch_71nVDKzHgW/https_3a_2f_2fwmssamtest02.cern.ch_3a9000_2f9Ll_5fjnEXmasADcsUWm2rxQ/.nagios/bin/find_broker line 81, <DATA> line 225.
       
  6. Message Broker URI: empty; site problem - firewalling outgoing TCP:6163. Request to add TCP:{6162,6163} to M/W ports table was submitted to John White.
    Check if provided MB is accessible [stomp://gridmsg002.cern.ch:6163/].
    INFO : Testing Broker: stomp://gridmsg002.cern.ch:6163/
    INFO : Couldn't connect to : stomp://gridmsg002.cern.ch:6163/
    WARNING: Provided MB isn't accessible [stomp://gridmsg002.cern.ch:6163/].
    Trying to obtain it from IS.
    All found brokers [BDII topbdii01.ncg.ingrid.pt:2170]:
    INFO : Getting info from BDII topbdii01.ncg.ingrid.pt:2170
    INFO : Testing Broker: stomp://gridmsg101.cern.ch:6163/
    INFO : Testing Broker: stomp://msg.cro-ngi.hr:6163/
    ERROR: No Brokers found in 'PROD' network [BDII topbdii01.ncg.ingrid.pt:2170].
    Exiting.
       

WMS

  • SAM vs Nagios metrics
SAM Sensor WLCG/Nagios probe
gRB org.sam.WMS-probe

SAM Test WLCG/Nagios metric
* gRB-sft-submit org.sam.WMS-{JobState,JobStatus}
* gRB-host-cert-valid hr.srce.GRAM-CertLifetime
* - critical

  • Probe org.sam.WMS-probe tests job submission to WMS.

  • Metrics descriptions

metricName metricDescription metricType metricLocality
org.sam.WMS-JobState Submits grid job to CE status remote
org.sam.WMS-JobMonit Monitors grid jobs submitted to CEs status remote
org.sam.WMS-JobSubmit [Passive] (in Nagios sense) Holds final status of job submission status remote

Running org.sam.* probes/metrics

Integration of WN checks

Default behavior of org.sam.CE-JobState metric

The WN tarball is assembled at runtime by the org.sam.CE-JobState metric (org.sam/CE-probe -m org.sam.CE-JobState ...). By default, only org.sam probes/checks (org.sam/WN-probe and org.sam/wnjob/org.sam/probes/org.sam/*) are packed into the WN tarball. Because Nagios is used as the checks launcher on WNs, it is packed into the tarball too, together with the Nagios configuration of the above org.sam WN probes/checks (located in org.sam/wnjob/org.sam/etc/wn.d/org.sam/). Example of submitting CE+WN tests:

/usr/libexec/grid-monitoring/probes/org.sam/CE-probe -H ce000.cern.ch \
                         -m org.sam.CE-JobState --mb-destination /a/b/c

Integration of your WN checks

The following CLI parameters to org.sam.CE-JobState metric are available:

--add-wntar-nag <d1,d2,..>  Comma-separated list of top level directories with 
                            Nagios compliant directories structure to be added 
                            to tarball to be sent to WN.
--add-wntar-nag-nosam       Instructs the metric not to include standard SAM WN
                            probes and their Nagios config to WN tarball. 
                            (Default: WN probes are included)
--add-wntar-nag-nosamcfg    Instructs the metric not to include Nagios 
                            configuration for SAM WN probes to WN tarball. The 
                            probes themselves and respective Python packages, 
                            however, will be included.

  • with the --add-wntar-nag <d1,d2,..> parameter, each "Nagios compliant directories structure" should look like this:
       [kvs] ~ tree /path/to/your/probes/wnjob/org.my/
       /path/to/your/probes/wnjob/org.my/
       |-- etc
       |   `-- wn.d
       |       `-- org.my
       |           |-- commands.cfg
       |           `-- services.cfg
       `-- probes
           `-- org.my
               |-- check_A
               |-- check_B
               `-- checks_lib.sh
    
    • probes/org.my/* should contain your probes/checks
    • etc/wn.d/org.my/ should contain file(s) with the .cfg extension holding Nagios command and service object definitions (and, optionally, service dependency definitions). In your etc/wn.d/org.my/*.cfg files, please use the following path-defining Nagios macro and framework template name:
      • $USER3$ - macro defining path to <nagiosRoot>/probes/ directory on WN. Usage:
        define command{
               command_name   check_A1
               command_line   $USER3$/org.my/check_A
               }
        
      • <wnjobWorkDir> - will be substituted with the job's working directory on WN. Handy if your check requires and creates a working directory. Possible usage (assumes -w instructs check_A to create <wnjobWorkDir>/.mygridprobes directory):
        define command{
               command_name   check_A2
               command_line   $USER3$/org.my/check_A -w <wnjobWorkDir>/.mygridprobes
               }
        

    For this particular part of Nagios objects configuration and macros please see the following Nagios resources: Object Configuration Overview, Service definitions, Command definitions, Service dependency definitions, Nagios macros, Macros and resource file.

    Also, as an example, you might want to check the object configurations defined for the org.sam WN checks in /usr/libexec/grid-monitoring/probes/org.sam/wnjob/org.sam/etc/wn.d/org.sam/ of the grid-monitoring-probes-org.sam RPM. The Nagios resource file that will be used on the WN is located on the UI at /usr/libexec/grid-monitoring/probes/org.sam/wnjob/nagios.d/etc/resource.cfg.

    Example (only relevant CLI parameters are shown):

       ./CE-probe -m org.sam.CE-JobState \
                  --add-wntar-nag /path/to/your/probes/wnjob/org.my/,/path/to/org.sam.sec
    

    This adds two sets of WN checks, provided by org.my and org.sam.sec, to the WN tarball. NB! The org.sam WN checks and their Nagios configurations will still be added and launched on the WN as well!

  • Use --add-wntar-nag-nosam if you want neither to run the org.sam probes on the WN nor to use them with your own custom configurations. This instructs the org.sam.CE-JobState metric not to include the org.sam WN probes and their Nagios configuration in the WN tarball. Example (only relevant CLI parameters are shown):
          ./CE-probe -m org.sam.CE-JobState --add-wntar-nag-nosam \
                     --add-wntar-nag /path/to/your/probes/wnjob/org.my/
    
    The WN tarball will contain only your probes and Nagios configurations from /path/to/your/probes/wnjob/org.my/.
  • Use --add-wntar-nag-nosamcfg if you don't want the org.sam probes to be run on the WN, but still want them included in the WN tarball. This instructs the org.sam.CE-JobState metric to include the org.sam WN probes and the Python gridmon and gridmonsam packages (with their modules) in the WN tarball, but not the Nagios configuration of the probes. This is useful in two cases:
    1. you want to use some of the probes/wrappers provided by org.sam (e.g. $USER3$/org.sam/{nag,sam}test-run). In that case you have to supply the Nagios object configurations for them yourself (they can be taken directly from /usr/libexec/grid-monitoring/probes/org.sam/wnjob/org.sam/etc/wn.d/org.sam/*.cfg and included in your *.cfg files);
    2. you developed your Python WN probes using the gridmon or gridmonsam packages. These packages are added to $PYTHONPATH before Nagios is launched on the WN, so your probes can safely import the required modules from them.
    Example (only relevant CLI parameters are shown):
          ./CE-probe -m org.sam.CE-JobState --add-wntar-nag-nosamcfg \
                     --add-wntar-nag /path/to/your/probes/wnjob/org.my/
    
    The WN tarball will contain your probes and Nagios configurations from /path/to/your/probes/wnjob/org.my/, as well as the org.sam probes (but without their standard org.sam Nagios configurations) and the gridmon and gridmonsam Python packages.
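
Before passing a directory to --add-wntar-nag it can be handy to check that it follows the layout described above (etc/wn.d/<namespace>/*.cfg plus probes/<namespace>/). The Python sketch below is a hypothetical helper written for this page, not part of the org.sam probes; the function name check_wntar_layout and its messages are assumptions.

```python
import os


def check_wntar_layout(top_dir, namespace):
    """Return a list of problems found in a --add-wntar-nag directory."""
    problems = []
    cfg_dir = os.path.join(top_dir, "etc", "wn.d", namespace)
    probes_dir = os.path.join(top_dir, "probes", namespace)

    # Nagios object definitions must be in files with the .cfg extension.
    if not os.path.isdir(cfg_dir):
        problems.append("missing directory: %s" % cfg_dir)
    elif not any(name.endswith(".cfg") for name in os.listdir(cfg_dir)):
        problems.append("no .cfg files in: %s" % cfg_dir)

    # The probes/checks themselves.
    if not os.path.isdir(probes_dir):
        problems.append("missing directory: %s" % probes_dir)
    elif not os.listdir(probes_dir):
        problems.append("no probes in: %s" % probes_dir)

    return problems


if __name__ == "__main__":
    for problem in check_wntar_layout("/path/to/your/probes/wnjob/org.my", "org.my"):
        print(problem)
```

An empty list means the directory has the expected shape; it says nothing about whether the Nagios definitions inside the .cfg files are valid.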

Providing JDL

The JDL used by the framework is the following:

[
Type="Job";
JobType="Normal";
Executable = "<jdlExecutable>";
StdError = "gridjob.out";
StdOutput = "gridjob.out";
Arguments = "<jdlArguments>";
InputSandbox =  {"<jdlInputSandboxExecutable>", "<jdlInputSandboxTarball>"};
OutputSandbox = {"gridjob.out"};
RetryCount = <jdlRetryCount>;
ShallowRetryCount = <jdlShallowRetryCount>;
Requirements = other.GlueCEInfoHostName == "<jdlReqCEInfoHostName>";
]

and is located in /usr/libexec/grid-monitoring/probes/org.sam/wnjob/org.sam.gridJob.jdl.template.

  • Substitutable elements of the template are
    • <jdlExecutable> - name of the executable on the WN (set by the framework; default: nagrun.sh)
    • <jdlArguments> - list of arguments to <jdlExecutable> (set by the framework)
    • <jdlInputSandboxExecutable> - path to executable on UI (set by the framework to /metric/work/dir/<jdlExecutable>)
    • <jdlInputSandboxTarball> - path to WN tarball (set by the framework to /metric/work/dir/gridjob.tgz)
    • <jdlRetryCount> - default 0 (can be modified via CLI parameter --jdl-retrycount (see below))
    • <jdlShallowRetryCount> - default 1 (can be modified via CLI parameter --jdl-shallowretrycount (see below))
    • <jdlReqCEInfoHostName> - CE host name (set by the framework)
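
The substitution of the template elements can be illustrated with plain string replacement. This is a sketch under stated assumptions, not the framework's own code: the function name render_jdl is hypothetical, the value given for <jdlArguments> is a made-up placeholder, and hostname.to.test stands for the CE under test.

```python
# The template text is copied from the JDL shown above.
JDL_TEMPLATE = """[
Type="Job";
JobType="Normal";
Executable = "<jdlExecutable>";
StdError = "gridjob.out";
StdOutput = "gridjob.out";
Arguments = "<jdlArguments>";
InputSandbox =  {"<jdlInputSandboxExecutable>", "<jdlInputSandboxTarball>"};
OutputSandbox = {"gridjob.out"};
RetryCount = <jdlRetryCount>;
ShallowRetryCount = <jdlShallowRetryCount>;
Requirements = other.GlueCEInfoHostName == "<jdlReqCEInfoHostName>";
]
"""


def render_jdl(template, values):
    """Replace every <name> placeholder that has an entry in `values`."""
    rendered = template
    for name, value in values.items():
        rendered = rendered.replace("<%s>" % name, str(value))
    return rendered


if __name__ == "__main__":
    print(render_jdl(JDL_TEMPLATE, {
        "jdlExecutable": "nagrun.sh",
        "jdlArguments": "hypothetical-arguments",   # really set by the framework
        "jdlInputSandboxExecutable": "/metric/work/dir/nagrun.sh",
        "jdlInputSandboxTarball": "/metric/work/dir/gridjob.tgz",
        "jdlRetryCount": 0,
        "jdlShallowRetryCount": 1,
        "jdlReqCEInfoHostName": "hostname.to.test",
    }))
```

Placeholders without a value are left untouched, which makes an incomplete substitution easy to spot in the rendered JDL.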

"Third-party" JDL

You can provide your own JDL template with the following parameter to the org.sam.CE-JobState metric:

--jdl-templ <file>    JDL template file (full path). Default:
                      <org.sam.ProbesLocation>/wnjob/org.sam.gridJob.jdl.template

For better flexibility, the RetryCount and ShallowRetryCount ClassAd attributes are exposed as parameters to the org.sam.CE-JobState metric:

--jdl-retrycount <val>          JDL RetryCount (Default: 0).
--jdl-shallowretrycount <val>   JDL ShallowRetryCount (Default: 1).

Defining context for metric execution

When the same resources (endpoints) are tested from the same Nagios instance under the same VO, but with different probes/metrics, and the metrics require working directories, a working directory of the form /<probesWorkDir>/<VO>/<nameSpace>/<serviceAbbr>/<hostnameORendpoint>/ is clearly not enough on its own. The reason is that <nameSpace> is defined by the provider of the probe, so all submissions made with the same probe share it. This situation usually arises in job submission tests.

Thus, the following parameter was added to the CE and CREAM CE probes:

    Metrics specific parameters:

--namespace <string>    Name-space for the probe. (Default: org.sam)

If given, the working directory for the probe's metrics will be /<probesWorkDir>/<VO>/<string>/<serviceAbbr>/<hostnameORendpoint>/

The following use case of security checks (org.sam.sec) clarifies the situation.

Preconditions:

  • a set of WN security checks must be submitted to WNs
  • the checks are to be submitted using the org.sam/{CREAM}CE-probe -m org.sam.CE-Job{State,Monit} metrics, which submit and monitor CE jobs
  • submission will be carried out under the same VO as for normal org.sam CE test jobs (say ops VO)
  • either
    • the checks must not be submitted with the normal org.sam CE test jobs
      • or
    • the required scheduling interval for the security checks is different from the normal org.sam CE ones

This effectively means that, unless defined otherwise, the separate CE test submissions from org.sam and org.sam.sec will use the same job working directories (e.g., /var/run/gridprobes/ops/org.sam/CE/hostname.to.test/), which is not desirable (in fact, it will not work properly).

  • org.sam WN checks, ops VO, to CE hostname.to.test
    ./CE-probe -H hostname.to.test -m org.sam.CE-JobState \
               --vo ops ...
  • org.sam.sec WN checks (must be taken from /path/to/probes/org.sam.sec), org.sam checks mustn't be executed on WN, ops VO, to CE hostname.to.test
    ./CE-probe -H hostname.to.test -m org.sam.CE-JobState \
               --vo ops --add-wntar-nag-nosam \
               --add-wntar-nag /path/to/probes/org.sam.sec ...

Instead, by using --namespace org.sam.sec one can separate the metrics' working contexts:

    ./CE-probe -H hostname.to.test -m org.sam.CE-JobState \
               --namespace org.sam.sec --vo ops --add-wntar-nag-nosam \
               --add-wntar-nag /path/to/probes/org.sam.sec ...

This instructs CE-probe to create /var/run/gridprobes/ops/org.sam.sec/CE/hostname.to.test/ and perform job preparation/submission from that directory.

A corresponding org.sam.CE-JobMonit check should be configured on the Nagios instance to monitor the jobs in the org.sam.sec context:

    ./CE-probe -m org.sam.CE-JobMonit --namespace org.sam.sec ...
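
The effect of --namespace on the working directory can be shown with a short path-building sketch. It is only an illustration of the /<probesWorkDir>/<VO>/<nameSpace>/<serviceAbbr>/<hostnameORendpoint>/ scheme described above; the function name metric_work_dir is hypothetical and does not come from the probes.

```python
import posixpath


def metric_work_dir(probes_work_dir, vo, namespace, service_abbr, host):
    """Build /<probesWorkDir>/<VO>/<nameSpace>/<serviceAbbr>/<host>/."""
    return posixpath.join(probes_work_dir, vo, namespace, service_abbr, host) + "/"


if __name__ == "__main__":
    base = "/var/run/gridprobes"
    # Default name-space: shared by every submission made with CE-probe.
    print(metric_work_dir(base, "ops", "org.sam", "CE", "hostname.to.test"))
    # With --namespace org.sam.sec the security checks get their own directory.
    print(metric_work_dir(base, "ops", "org.sam.sec", "CE", "hostname.to.test"))
```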

Wrapper checks (for "complex" checks)

To preserve the order and context of a sequence of metrics (a so-called "complex" check), it is sometimes desirable to wrap the execution of the metrics in a (Nagios) active check (e.g., the org.sam.SRM-All and org.sam.WN-Rep metrics) and report the results of the wrapped metrics (to Nagios) as passive checks.

NB! See the sub-section "Running wrapper checks with nagios-run-check" for running wrapper checks on a Nagios instance.

Reporting passive check results from wrapper checks

In most cases wrapper checks are used non-interactively and report the wrapped metrics' results to Nagios. A wrapper check (e.g., org.sam.SRM-All for org.sam.SRM-{GetSURLs,Put,...}) can report the metric results to Nagios as passive checks via either the Nagios command file or NSCA. Snippet from running a probe with the "-h" option:

Reporting passive checks (when used with wrapper checks)

--pass-check-dest <config|nsca|nagcmd|active> (Default: config)

--pass-check-conf <path> Configuration file for reporting passive checks.
                         Used with '--pass-check-dest config'. Overrides
                         passive checks submission library default one.

--nsca-server <fqdn|ip> NSCA server FQDN or IP. Required if --pass-check-dest
                        is set to 'nsca'.
--nsca-port <port>      Port NSCA is listening on (Default: 5667)
--send-nsca <path>      NSCA client binary.  (Default: /usr/sbin/send_nsca)
--send-nsca-conf <path> NSCA configuration file. (Default: /etc/nagios/send_nsca.cfg)

--nagcmdfile <path>   Nagios command file.
                      Order: $NAGIOS_COMMANDFILE, --nagcmdfile
                      (Default: /var/nagios/rw/nagios.cmd)

To report results of internally run (wrapped) metrics to Nagios as passive check results, three options <config|nsca|nagcmd> of the --pass-check-dest parameter are available.

  • Nagios command file (if the probe is running on the Nagios box)
    • --pass-check-dest nagcmd [--nagcmdfile /path/to/named.fifo]

  • NSCA send_nsca binary (if the probe is running on a UI box)
    • --pass-check-dest nsca --nsca-server <fqdn|ip> [--nsca-port port] [--send-nsca /path/to/send_nsca] [--send-nsca-conf /path/to/send_nsca.conf]

  • config - the method (NSCA or Nagios command file) is taken from a configuration file.
    • --pass-check-dest config [--pass-check-conf <path>]

By default, probes use the config option. This means that the module responsible for passive check submission (gridmon/nagios/nagios.py from python-GridMon) uses the node-global configuration defined in a configuration file (by default, /etc/nagios-submit.conf). If nsca or nagcmd is given explicitly, it is up to the user to supply the correct options for the submission method (i.e., the configuration file is not used).
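
For orientation, the two submission methods differ mainly in the record that gets written. The Python sketch below formats one passive service check result for each of them: a PROCESS_SERVICE_CHECK_RESULT external command line for the Nagios command file, and a tab-separated line as read by send_nsca on standard input (tab is its default delimiter). It is not the python-GridMon implementation; the function names are hypothetical, and the service description org.sam.SRM-Put-ops is only an illustrative value.

```python
import time


def nagcmd_line(host, service, return_code, output, timestamp=None):
    """Format a PROCESS_SERVICE_CHECK_RESULT line for the Nagios command file."""
    if timestamp is None:
        timestamp = int(time.time())
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, return_code, output)


def nsca_line(host, service, return_code, output):
    """Format one service check result as send_nsca reads it from stdin."""
    return "%s\t%s\t%d\t%s\n" % (host, service, return_code, output)


if __name__ == "__main__":
    args = ("lxdpm104.cern.ch", "org.sam.SRM-Put-ops", 0,
            "OK: File was copied to SRM.")
    # With nagcmd this line would be written to /var/nagios/rw/nagios.cmd.
    print(nagcmd_line(*args), end="")
    # With nsca this line would be piped to /usr/sbin/send_nsca.
    print(nsca_line(*args), end="")
```

The sketch only builds the strings; writing to the named pipe or calling send_nsca is left out on purpose, so it can be run without a Nagios installation.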

Reporting "active" check results from wrapper checks

To run a wrapper check from the command line and see the "wrapped" metrics' results printed to stdout, set --pass-check-dest to active:

[kvs] ~ > /usr/libexec/grid-monitoring/probes/org.sam/SRM-probe -m org.sam.SRM-All \
                                        -u lxdpm104.cern.ch --pass-check-dest active
metric >>> org.sam.SRM-GetSURLs
OK: Got SRM endpoint(s) and SAPath(s) from BDII
OK: Got SRM endpoint(s) and SAPath(s) from BDII
metric >>> org.sam.SRM-LsDir
OK: SAPath[/dpm/cern.ch/home/ops]-ok;
OK: SAPath[/dpm/cern.ch/home/ops]-ok;
metric >>> org.sam.SRM-Put
OK: File was copied to SRM.
OK: File was copied to SRM.
metric >>> org.sam.SRM-Ls
OK: listing [/dpm/cern.ch/home/ops/testfile-put-1231319739-3cbfe0c470ba.txt]-ok;
OK: listing [/dpm/cern.ch/home/ops/testfile-put-1231319739-3cbfe0c470ba.txt]-ok;
metric >>> org.sam.SRM-GetTURLs
OK: TURLs gsiftp, rfio
OK: TURLs gsiftp, rfio
metric >>> org.sam.SRM-Get
OK: File was copied from SRM. Diff successful.
OK: File was copied from SRM. Diff successful.
metric >>> org.sam.SRM-Del
OK: file was deleted from SRM.
OK: file was deleted from SRM.
OK: success.
OK: success.
[kvs] ~ > echo $?
0
[kvs] ~ >

This way it is possible to test the behavior of the probe/metrics without needing to have a writable named pipe (Nagios command file) or working NSCA.
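
When testing this way, it can be handy to condense the printed output into one status per wrapped metric. The sketch below assumes the output layout shown in the transcript above (a "metric >>> name" header followed by status lines) and keeps the first status line after each header; parse_active_output is a hypothetical helper, not part of the probes.

```python
def parse_active_output(text):
    """Map each wrapped metric name to the first status line printed after it."""
    results = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("metric >>>"):
            current = line[len("metric >>>"):].strip()
        elif current is not None and current not in results and ":" in line:
            status, _, summary = line.partition(":")
            results[current] = (status.strip(), summary.strip())
    return results


# Shortened copy of the transcript above.
SAMPLE = """\
metric >>> org.sam.SRM-GetSURLs
OK: Got SRM endpoint(s) and SAPath(s) from BDII
OK: Got SRM endpoint(s) and SAPath(s) from BDII
metric >>> org.sam.SRM-Put
OK: File was copied to SRM.
OK: File was copied to SRM.
OK: success.
"""

if __name__ == "__main__":
    for name, (status, summary) in parse_active_output(SAMPLE).items():
        print(name, status, summary)
```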

Running wrapper checks with nagios-run-check

By default, wrapper checks report sub-check results to the Nagios command file as passive checks. When running them for testing purposes on a Nagios instance, it is advisable to disable the publication of passive checks to Nagios. First, obtain the exact command line with nagios-run-check (note the -d and -v options):

~> nagios-run-check -d -v -H axon-g05.ieeta.pt -s org.sam.SRM-All-/ops/Role=lcgadmin
Executing command:
su nagios -l -c '/usr/libexec/grid-monitoring/probes/org.sam/SRM-probe -H 
      "axon-g05.ieeta.pt" -t 600 --vo ops --vo-fqan /ops/Role=lcgadmin 
      -x /etc/nagios/globus/userproxy.pem--ops-Role_lcgadmin'

Copy the command, append --pass-check-dest active, and run it:

~> su nagios -l -c '/usr/libexec/grid-monitoring/probes/org.sam/SRM-probe -H 
      "axon-g05.ieeta.pt" -t 600 --vo ops --vo-fqan /ops/Role=lcgadmin 
      -x /etc/nagios/globus/userproxy.pem--ops-Role_lcgadmin 
      --pass-check-dest active'

You can also add the --stdout option:

--stdout              Detailed output of metrics will be printed to stdout as
                      it is being produced by metrics. The default is to store
                      the output in a container and, then, produce Nagios
                      compliant output.

gstat-validation, grid-monitoring-probes-org.bdii RPMs

sBDII

  • Equivalence
SAM/gstat Sensor WLCG/Nagios probe(s)
sBDII /usr/bin/gstat-validate-sanity-check,/usr/libexec/grid-monitoring/probes/org.bdii/check_bdii_entries

SAM Test WLCG/Nagios metric
* sBDII-performance org.bdii.Entries
* sBDII-sanity org.gstat.SanityCheck

sBDII Results

  • "-": no values
  • "/": no mapping

  • [1] SAM sBDII-performance vs [2] SAM-Nag org.bdii.Entries
Snapshots (left to right): 2009-05-15 14:00 # 2009-05-19 10:00 # 2009-06-03 08:20 # 2009-06-09 15:54
STATUS    [1] [2] # [1] [2] # [1] [2] # [1] [2]
0/na        6  11 #   6  14 #   -   4 #   2   5
10/ok     250 270 # 262 267 # 271 284 # 270 281
20/info     -   / #   -   / #   -   / #   -   /
30/note     -   / #   -   / #   -   / #   -   /
40/warn     -   - #   -   - #   -   - #   -   -
50/error    1   - #   1   - #   1   - #   1   -
60/crit     -   / #   -   / #   -   / #   -   /
100/maint  21   / #  12   / #  12   / #   8   /
nodes     278 281 # 281 281 # 283 288 # 281 286

  • [1] SAM sBDII-sanity vs [2] SAM-Nag org.gstat.SanityCheck
Snapshots (left to right): 2009-05-15 14:00 # 2009-05-19 10:00 # 2009-06-03 08:20 # 2009-06-09 15:54
STATUS    [1] [2] # [1] [2] # [1] [2] # [1] [2]
0/na        -   - #   -   - #   -   - #   -   -
10/ok     245 209 # 255 251 # 262 248 # 263 252
20/info     -   / #   -   / #   -   / #   -   /
30/note     2   / #   3   / #   1   / #   1   /
40/warn     3  20 #   4  19 #   9  26 #   7  22
50/error    6  52 #   7  11 #   -  14 #   2  12
60/crit     -   / #   -   / #   -   / #   -   /
100/maint  21   / #  12   / #  11   / #   8   /
nodes     277 281 # 281 281 # 283 288 # 281 286

Probes/metrics by ch.cern and hr.srce

As of: Thu Nov 6 15:07:47 CET 2008

The following probes/metrics are provided by grid-monitoring-probes-ch.cern and grid-monitoring-probes-hr.srce.

grid-monitoring-probes-ch.cern

serviceType probeName metricName metricDescription metricType metricLocality
glite-FTS-WS ch.cern/FTS-probe -
ch.cern.FTS-ChannelList (ks) list channels on FTS status remote
glite-LFC ch.cern/LFC-probe
ch.cern.LFC-ReadDli Do a read from a DLI status remote
ch.cern.LFC-Write Test if we can update the modification time of an entry in the catalog status remote
ch.cern.LFC-Read Test if we can read an entry in the catalog status remote
ch.cern.LFC-Readdir Time how long it takes to read a directory (/grid) performance remote
glite-RGMA ch.cern/RGMA-probe
ch.cern.RGMA-ServiceStatus ... status remote
ch.cern.RGMA-CertLifetime ... status remote

/usr/libexec/grid-monitoring/probes/ch.cern/FTS-probe
serviceType: glite-FTS-WS
metricName: ch.cern.FTS-ChannelList
metricType: status
EOT

/usr/libexec/grid-monitoring/probes/ch.cern/LFC-probe
serviceType: glite-LFC
metricDescription: Do a read from a DLI
metricLocality: remote
metricType: status
metricName: ch.cern.LFC-ReadDli
EOT
serviceType: glite-LFC
metricDescription: Test if we can update the modification time of an entry in the catalog
metricLocality: remote
metricType: status
metricName: ch.cern.LFC-Write
EOT
serviceType: glite-LFC
metricDescription: Test if we can read an entry in the catalog
metricLocality: remote
metricType: status
metricName: ch.cern.LFC-Read
EOT
serviceType: glite-LFC
dataType: float
metricDescription: Time how long it takes to read a directory (/grid)
metricType: performance
metricLocality: remote
metricName: ch.cern.LFC-Readdir
EOT

/usr/libexec/grid-monitoring/probes/ch.cern/RGMA-probe
serviceType: glite-RGMA
metricName: ch.cern.RGMA-ServiceStatus
metricType: status
EOT
serviceType: glite-RGMA
metricName: ch.cern.RGMA-CertLifetime
metricType: status
EOT
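
The listing above is a sequence of "key: value" records, each terminated by an EOT line. Assuming that layout (it is inferred from the listing itself, not from a format specification), a minimal Python sketch for turning it into dictionaries could look like this; parse_metric_listing is a hypothetical name.

```python
def parse_metric_listing(text):
    """Split 'key: value' records terminated by an EOT line into dicts."""
    records = []
    current = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line == "EOT":
            if current:
                records.append(current)
            current = {}
        elif ": " in line:
            key, _, value = line.partition(": ")
            current[key] = value
        # Any other line (e.g. the probe path) is ignored.
    return records


# Two records copied from the LFC-probe listing above.
SAMPLE = """\
/usr/libexec/grid-monitoring/probes/ch.cern/LFC-probe
serviceType: glite-LFC
metricDescription: Do a read from a DLI
metricLocality: remote
metricType: status
metricName: ch.cern.LFC-ReadDli
EOT
serviceType: glite-LFC
dataType: float
metricDescription: Time how long it takes to read a directory (/grid)
metricType: performance
metricLocality: remote
metricName: ch.cern.LFC-Readdir
EOT
"""

if __name__ == "__main__":
    for record in parse_metric_listing(SAMPLE):
        print(record["metricName"], record["metricType"])
```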

grid-monitoring-probes-hr.srce

serviceType probeName metricName metricDescription metricType metricLocality
CAdistribution hr.srce/CAdist-probe -
hr.srce.CAdist-Version ... status remote (?)
DPM hr.srce/DPM-probe
hr.srce.DPM-Query ... status remote
DPNS hr.srce/DPNS-probe
hr.srce.DPNS-List ... status remote (?)
globus-GRAM hr.srce/GRAM-probe
hr.srce.GRAM-CertLifetime ... status remote
hr.srce.GRAM-Auth ... status remote
hr.srce.GRAM-Command ... status remote
gsiftp hr.srce/GridFTP-probe
hr.srce.GridFTP-Transfer ... status remote
GridProxy hr.srce/GridProxy-probe
hr.srce.GridProxy-Valid ... status local
MyProxy hr.srce/MyProxy-probe
hr.srce.MyProxy-CertLifetime ... status remote
hr.srce.MyProxy-ProxyLifetime ... status remote
hr.srce.MyProxy-Store ... status remote
ResourceBroker hr.srce/ResourceBroker-probe
hr.srce.ResourceBroker-CertLifetime ... status remote
hr.srce.ResourceBroker-RunJob ... status remote
SRM hr.srce/SRM-probe
hr.srce.SRM1-CertLifetime ... status remote
hr.srce.SRM1-Ping ... status remote
hr.srce.SRM2-CertLifetime ... status remote
hr.srce.SRM-Transfer ... status remote
org.glite.wms.WMProxy hr.srce/WMProxy-probe
hr.srce.WMProxy-CertLifetime ... status remote
hr.srce.WMProxy-RunJob ... status remote
org.glite.wms.NetworkServer hr.srce/WMS-probe
hr.srce.WMS-CertLifetime ... status remote
hr.srce.WMS-RunJob ... status remote

/usr/libexec/grid-monitoring/probes/hr.srce/CAdist-probe
serviceType: CAdistribution
metricName: hr.srce.CAdist-Version
metricType: status
EOT

/usr/libexec/grid-monitoring/probes/hr.srce/DPM-probe
serviceType: DPM
metricName: hr.srce.DPM-Query
metricType: status
EOT

/usr/libexec/grid-monitoring/probes/hr.srce/DPNS-probe
serviceType: DPNS
metricName: hr.srce.DPNS-List
metricType: status
EOT

/usr/libexec/grid-monitoring/probes/hr.srce/GRAM-probe
serviceType: globus-GRAM
metricName: hr.srce.GRAM-CertLifetime
metricType: status
EOT
serviceType: globus-GRAM
metricName: hr.srce.GRAM-Auth
metricType: status
EOT
serviceType: globus-GRAM
metricName: hr.srce.GRAM-Command
metricType: status
EOT

/usr/libexec/grid-monitoring/probes/hr.srce/GridFTP-probe
serviceType: gsiftp
metricName: hr.srce.GridFTP-Transfer
metricType: status
EOT

/usr/libexec/grid-monitoring/probes/hr.srce/GridProxy-probe
serviceType: GridProxy
metricName: hr.srce.GridProxy-Valid
metricType: status
EOT

/usr/libexec/grid-monitoring/probes/hr.srce/MyProxy-probe
serviceType: MyProxy
metricName: hr.srce.MyProxy-CertLifetime
metricType: status
EOT
serviceType: MyProxy
metricName: hr.srce.MyProxy-ProxyLifetime
metricType: status
EOT
serviceType: MyProxy
metricName: hr.srce.MyProxy-Store
metricType: status
EOT

/usr/libexec/grid-monitoring/probes/hr.srce/ResourceBroker-probe
serviceType: ResourceBroker
metricName: hr.srce.ResourceBroker-CertLifetime
metricType: status
EOT
serviceType: ResourceBroker
metricName: hr.srce.ResourceBroker-RunJob
metricType: status
EOT

/usr/libexec/grid-monitoring/probes/hr.srce/SRM-probe
serviceType: SRM
metricName: hr.srce.SRM1-CertLifetime
metricType: status
EOT
serviceType: SRM
metricName: hr.srce.SRM1-Ping
metricType: status
EOT
serviceType: SRM
metricName: hr.srce.SRM2-CertLifetime
metricType: status
EOT
serviceType: SRM
metricName: hr.srce.SRM-Transfer
metricType: status
EOT

/usr/libexec/grid-monitoring/probes/hr.srce/WMProxy-probe
serviceType: org.glite.wms.WMProxy
metricName: hr.srce.WMProxy-CertLifetime
metricType: status
EOT
serviceType: org.glite.wms.WMProxy
metricName: hr.srce.WMProxy-RunJob
metricType: status
EOT

/usr/libexec/grid-monitoring/probes/hr.srce/WMS-probe
serviceType: org.glite.wms.NetworkServer
metricName: hr.srce.WMS-CertLifetime
metricType: status
EOT
serviceType: org.glite.wms.NetworkServer
metricName: hr.srce.WMS-RunJob
metricType: status
EOT

Metrics naming:

Probe Metrics
<nameSpace>.<serviceAbbreviation>-probe <nameSpace>.<serviceAbbreviation>-<metricName>
org.sam.SRM-probe org.sam.SRM-LsDir
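
The naming scheme can be illustrated with a small Python sketch that splits a metric name into its parts and derives the probe name from it. The helper names are hypothetical, and the sketch follows the pattern above literally; some metrics listed earlier (e.g. hr.srce.SRM1-Ping, served by the SRM probe) do not map back to their probe this way.

```python
def split_metric_name(metric):
    """Split '<nameSpace>.<serviceAbbreviation>-<metricName>' into 3 parts."""
    prefix, dash, metric_name = metric.partition("-")
    namespace, dot, service = prefix.rpartition(".")
    if not (dash and dot and namespace and service and metric_name):
        raise ValueError("not a valid metric name: %r" % metric)
    return namespace, service, metric_name


def probe_name(metric):
    """Derive '<nameSpace>.<serviceAbbreviation>-probe' from a metric name."""
    namespace, service, _ = split_metric_name(metric)
    return "%s.%s-probe" % (namespace, service)


if __name__ == "__main__":
    print(split_metric_name("org.sam.SRM-LsDir"))
    print(probe_name("org.sam.SRM-LsDir"))
```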

Timeouts:

  • metric global timeout: metricTimeout
  • timeouts on operations in a metric (set): metricOperationTimeouts = {metricOperationTimeout_1, ..., metricOperationTimeout_N}

condition:

metricTimeout >= SUM_{i=1}^N (metricOperationTimeout_i) - N
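
A quick way to sanity-check a set of timeouts against this condition is sketched below in Python. It takes the condition literally as written, including the "- N" term; timeouts_consistent is a hypothetical helper, and all values are in seconds.

```python
def timeouts_consistent(metric_timeout, operation_timeouts):
    """Check metricTimeout >= sum(metricOperationTimeout_i) - N."""
    n = len(operation_timeouts)
    return metric_timeout >= sum(operation_timeouts) - n


if __name__ == "__main__":
    # Illustrative numbers only: a 600 s metric with three 120 s operations.
    print(timeouts_consistent(600, [120, 120, 120]))
    # Two 60 s operations do not fit into a 100 s metric timeout.
    print(timeouts_consistent(100, [60, 60]))
```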

Command line options:

  • Probe's general options
-h|--help
-l|--list
-m|--metric metricName
-u|--uri serviceURI
-t|--timeout timeout (sec)
-n|--node hostname (FQDN)
-v|--vo VO
--wlcg [WLCG output instead of default Nagios]
  • Metrics specific options (NB! all should be long options)
    • see for each probe/metric specifically
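
The general options above could be declared as follows. This is only a sketch using Python's argparse module, under the assumption that a new probe wants the same option names; it is not how the existing probes parse their arguments, and the help strings are illustrative.

```python
import argparse


def build_parser():
    """Declare the probe's general options; -h|--help is added by argparse."""
    parser = argparse.ArgumentParser(description="Probe general options (sketch)")
    parser.add_argument("-l", "--list", action="store_true",
                        help="list the metrics provided by the probe")
    parser.add_argument("-m", "--metric", metavar="metricName",
                        help="name of the metric to run")
    parser.add_argument("-u", "--uri", metavar="serviceURI",
                        help="URI of the service to test")
    parser.add_argument("-t", "--timeout", type=int, metavar="sec",
                        help="metric global timeout in seconds")
    parser.add_argument("-n", "--node", metavar="hostname",
                        help="FQDN of the node to test")
    parser.add_argument("-v", "--vo", metavar="VO",
                        help="VO to run the metric for")
    parser.add_argument("--wlcg", action="store_true",
                        help="WLCG output instead of default Nagios")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["-m", "org.sam.SRM-LsDir", "-u", "lxdpm104.cern.ch", "-t", "600"])
    print(args.metric, args.uri, args.timeout, args.wlcg)
```

Metric-specific options would be added to the same parser as long options only, per the note above.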

Getting data for comparison from SAM DBs

A Python script is available that fetches the (latest) test data from two SAM DBs (SAM Prod and SAM-Nagios) and prints them as one table per test. Running it requires the cx_Oracle Python module (available, e.g., on sam-val.cern.ch) and the accounts and passwords for both DBs. Tests, DBs, the query etc. are set in the script itself; modify them in place. The script is attached to this page as sam-samnag-cmp_stat.py. Example run:

[kvs] src > ./sam-samnag-cmp_stat.py
>>> fetching SAM: CE-sft-job ... done
>>> fetching SAMNag: CE-org.sam.CE-JobSubmit ... done
>>> fetching SAM: CE-sft-lcg-rm ... done
>>> fetching SAMNag: CE-org.sam.WN-Rep ... done
>>> fetching SAM: SRMv2-get-SURLs ... done
>>> fetching SAMNag: SRMv2-org.sam.SRM-GetSURLs ... done
>>> fetching SAM: SRMv2-put ... done
>>> fetching SAMNag: SRMv2-org.sam.SRM-Put ... done
>>> fetching SAM: sBDII-sanity ... done
>>> fetching SAMNag: sBDII-org.gstat.SanityCheck ... done
Wed, 03 Jun 2009 06:19:57 +0000
===> {'SAM': 'CE-sft-job', 'SAMNag': 'CE-org.sam.CE-JobSubmit'}
na   | - | - |
ok   |367|351|
info | - | - |
note | - | - |
warn | - | 39|
error| 28| - |
crit | - | - |
maint| - | - |
     |395|390|
===> {'SAM': 'CE-sft-lcg-rm', 'SAMNag': 'CE-org.sam.WN-Rep'}
na   | - | 35|
ok   |352|173|
info | - | - |
note | - | - |
warn |  1|  4|
error| 32| 59|
crit | - | - |
maint| - | - |
     |385|271|
===> {'SAM': 'SRMv2-get-SURLs', 'SAMNag': 'SRMv2-org.sam.SRM-GetSURLs'}
na   | - | - |
ok   |323|323|
info | - | - |
note | - | - |
warn | - | - |
error| 14|  5|
crit | - | - |
maint| - | - |
     |337|328|
===> {'SAM': 'SRMv2-put', 'SAMNag': 'SRMv2-org.sam.SRM-Put'}
na   | - |  4|
ok   |308|291|
info | - | - |
note | - | - |
warn | 14|  5|
error| 15| 28|
crit | - | - |
maint| - | - |
     |337|328|
===> {'SAM': 'sBDII-sanity', 'SAMNag': 'sBDII-org.gstat.SanityCheck'}
na   | - | - |
ok   |262|248|
info | - | - |
note |  1| - |
warn |  9| 26|
error| - | 14|
crit | - | - |
maint| 11| - |
     |283|288|
[kvs] src >
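
The per-test tables above can be reproduced from two dictionaries of status counts. The Python sketch below covers only that formatting step; it is not the attached script, the database access is left out, and the names STATUSES, cell and format_table are hypothetical.

```python
STATUSES = ["na", "ok", "info", "note", "warn", "error", "crit", "maint"]


def cell(count):
    """Render a count in a 3-character cell; zero or missing becomes ' - '."""
    return "%3d" % count if count else " - "


def format_table(sam, samnag):
    """Return one line per status plus a final line with the column totals."""
    lines = ["%-5s|%s|%s|" % (s, cell(sam.get(s)), cell(samnag.get(s)))
             for s in STATUSES]
    lines.append("%-5s|%3d|%3d|" % ("", sum(sam.values()), sum(samnag.values())))
    return lines


if __name__ == "__main__":
    # Counts taken from the CE-sft-job / org.sam.CE-JobSubmit table above.
    sam = {"ok": 367, "error": 28}
    samnag = {"ok": 351, "warn": 39}
    print("\n".join(format_table(sam, samnag)))
```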

P.S.: worth reading

  • Integral checks, which perform multiple operations on a service in "one go", generally assume that the check by itself already defines the integral availability of that service, because the different functional operations on the service are in fact parts of one test. Only this integral value then reaches the Metrics DB. Such an approach leaves no flexibility in service availability calculations (which could otherwise use different sets of metrics at different times). It also reduces the modularity ("plug-ability") of the probes.

SAM MDDB Profiles

Follow the link - SAM MDDB Profiles

-- KonstantinSkaburskas - 11 Oct 2008

Topic attachments
Attachment History Size Date Who Comment
sam-samnag-cmp_stat.py.txt r1 2.5 K 2009-06-02 - 15:35 KonstantinSkaburskas simple script to fetch tests data from two DBs and compare the results
Topic revision: r97 - 2010-08-06 - unknown