SAM Probes and Metrics
gLite M/W Services to be tested
List of gLite M/W services/modules that require testing.
| Service | Probe | Comment |
| AMGA_mysql | - | developed by SA3. At least AMGA-ping (service level ping) is needed |
| AMGA_postgres | - | developed by SA3. At least AMGA-ping (service level ping) is needed |
| t/sBDII | org.bdii/check_bdii_*, /usr/bin/gstat-validate-*, nagios/check_ldap | gstat-validation, probes-org.bdii |
| CE | org.sam/CE-probe, org.sam/WN-probe | LCG-CE via WMS (probes-org.sam) |
| CREAM CE | org.sam/CREAMCE-probe, org.sam/WN-probe, org.sam/CREAMCEDJS-probe | CREAM CE via WMS and direct job submission (probes-org.sam) |
| FTS_oracle | ch.cern/FTS-probe | ch.cern.FTS-ChannelList (probes-ch.cern) |
| FTA_oracle | - | - |
| FTM | - | - |
| LB | native Nagios check | org.nagios.LocalLogger-PortCheck |
| LFC_mysql/oracle | ch.cern/LFC-probe | ch.cern.LFC-{Read,Write,Readdir,ReadDli} (probes-ch.cern) |
| MON | ch.cern/RGMA-probe | ch.cern.RGMA-ServiceStatus (probes-ch.cern) |
| PX | hr.srce/MyProxy-probe | hr.srce.MyProxy-Store (probes-hr.srce) |
| SE_dcache | org.sam/SRM-probe | org.sam.SRM-<metricName> |
| SE_dpm_disk | org.sam/SRM-probe | org.sam.SRM-<metricName> |
| SE_dpm_mysql | org.sam/SRM-probe | org.sam.SRM-<metricName> |
| TORQUE_client | - | site level fabric monitoring |
| TORQUE_server | - | site level fabric monitoring; plus APEL as indirect test |
| VOBOX | org.alice/VOBOX-probe | org.alice.VOBOX-{6 tests} |
| VOMS_mysql/oracle | org.nmap | - |
| WMS | org.sam/WMS-probe, hr.srce/{WMProxy-probe,WMS-probe} | probes-org.sam - asynchronous; probes-hr.srce synchronous |
SAM vs Nagios tests naming correspondence
For the naming correspondence between critical SAM tests and Nagios metrics, see the equivalence tables in the SRM and WN sections below.
grid-monitoring-probes-org.sam RPM
The grid-monitoring-probes-org.sam RPM is available through the EGEE SA1 repository http://www.sysadmin.hep.ac.uk/rpms/egee-SA1/ (also via the egee-NAGIOS meta RPM). The RPM's directory structure is the following:
/etc/gridmon/
/usr/lib/python2.4/site-packages/gridmetrics/
/usr/libexec/grid-monitoring/probes/org.sam/
/usr/libexec/grid-monitoring/probes/org.sam/wnjob
/usr/libexec/grid-monitoring/probes/org.sam/wnjob/nagios.d/{bin/,etc/,lib/,plugins/,probes/,tmp/,var/}
/usr/libexec/grid-monitoring/probes/org.sam/wnjob/org.sam/{etc/wn.d/org.sam/,probes/org.sam/}
Currently the RPM consists of:
- SAM Nagios probes (in /usr/libexec/grid-monitoring/probes/org.sam/):
  - CE-probe - CE probe containing a number of CE tests (metrics) for job submission via WMS
  - CREAMCE-probe - as above, but for CREAM CEs
  - CREAMCEDJS-probe - direct job submission to CREAM CEs (asynchronous)
  - SRM-probe - SRM probe containing a number of metrics for the SRM service
  - T-probe - template probe, which serves as an example for writing your own probes based on the Python framework currently provided by the package (see "Writing a probe" under the "Python based probes using org.sam's 'gridmonsam' module" section on the same page)
  - WN-probe - WN probe containing a number of metrics to be run on WNs
  - WMS-probe - metrics to test whether job submission through WMS works (asynchronous)
- wrapper checks (in /usr/libexec/grid-monitoring/probes/org.sam/):
  - samtest-run - to run "native" SAM tests (see link)
  - nagtest-run - to run "semi"-Nagios checks (see link)
- /usr/libexec/grid-monitoring/probes/org.sam/wnjob - directory containing:
  - nagios.d/ - directory with Nagios used as the checks' scheduler on WNs
  - nagrun.sh - wrapper script to be launched on WNs (sets up the required environment, launches and monitors Nagios, periodically sends WN metric results to the Message Bus)
  - org.sam/
    - probes/ - directory with SAM WN probes/tests ("new and old" ones) and the samtest-run and nagtest-run wrappers
    - etc/ - WN Nagios configuration for the above checks
- gridmetrics Python package (in /usr/lib/python2.4/site-packages/): used by the above SAM probes.
- /etc/gridmon/ - configuration directory:
  - org.sam.conf - main configuration file
  - org.sam.errdb - collection of common gLite m/w error messages and their mapping to Nagios statuses
Source code can be browsed here: https://svnweb.cern.ch/trac/sam/browser/trunk/probes, http://svnweb.cern.ch/guest/sam/trunk/probes
SRM
- Equivalence [see link for critical tests defined in SAM for SRM for 'OPS' VO]
| SAM Test | WLCG/Nagios metric |
| * SRMv2-host-cert-valid | hr.srce.SRM2-CertLifetime |
| - | org.sam.SRM-All |
| * SRMv2-get-SURLs | org.sam.SRM-GetSURLs |
| * SRMv2-ls-dir | org.sam.SRM-LsDir |
| * SRMv2-put | org.sam.SRM-Put |
| * SRMv2-ls | org.sam.SRM-Ls |
| * SRMv2-gt | org.sam.SRM-GetTURLs |
| * SRMv2-get | org.sam.SRM-Get |
| * SRMv2-del | org.sam.SRM-Del |
* - critical & accounted for availability
- Probe org.sam.SRM-probe tests the SRM service, versions 1 and 2.
probeName: org.sam.SRM-probe
serviceVersion: 1.*, 2.*
| Metrics | Description |
| org.sam.SRM-All | Wrapper metric to launch the other metrics and publish passive checks results to Nagios. |
| org.sam.SRM-GetSURLs | Get full SRM endpoint(s) and storage areas from BDII. |
| org.sam.SRM-LsDir | List content of VO's top level space area(s) in SRM. |
| org.sam.SRM-Put | Copy a local file to the SRM into default space area(s). |
| org.sam.SRM-Ls | List (previously copied) file(s) on the SRM. |
| org.sam.SRM-GetTURLs | Get Transport URLs for the file copied to storage. |
| org.sam.SRM-Get | Copy given remote file(s) from SRM to a local file. |
| org.sam.SRM-Del | Delete given file(s) from SRM. |
- Metrics specific options
  - SRM type (version)
    - --srmv [1|2] (Default: 2)
  - LDAP URL
    - --ldap-url [ldap://]server[:port] (Defaults: server lcg-bdii.cern.ch, port 2170); applies to org.sam.SRM-GetSURLs
  - timeouts:
    - --ldap-timeout timeout (sec) (Default: 10); applies to org.sam.SRM-GetSURLs
    - --se-timeout timeout (sec) (Default: 120); applies to all except org.sam.SRM-GetSURLs
- Dependency tree for the metrics in org.sam.SRM-probe
      1:GetSURLs
        ^    ^
       /      \
  2:LsDir ____3:Put________
           ^   ^     ^    ^
          /   /       \    \
      4:Ls 5:GetTURLs 6:Get 7:Del
  eg.: 2:LsDir - "sequence number":"metrics abbreviation"
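The dependency tree above can be expressed as a small lookup table. The sketch below is not taken from the probe's source; it assumes the tree reads as "LsDir and Put depend on GetSURLs; Ls, GetTURLs, Get and Del depend on Put", and it assumes that metrics below a failed metric are simply skipped. The function name `runnable_metrics` is hypothetical.

```python
# Metric -> metric it depends on, as read from the tree above (assumption).
DEPENDS_ON = {
    "GetSURLs": None,
    "LsDir": "GetSURLs",
    "Put": "GetSURLs",
    "Ls": "Put",
    "GetTURLs": "Put",
    "Get": "Put",
    "Del": "Put",
}


def runnable_metrics(failed):
    """Return metrics whose whole dependency chain succeeded.

    `failed` is a set of metric abbreviations that did not return OK.
    Skipping dependents of a failed metric is an assumption of this sketch.
    """
    runnable = []
    for metric in DEPENDS_ON:  # dict order follows the sequence numbers
        parent = DEPENDS_ON[metric]
        blocked = False
        while parent is not None:
            if parent in failed:
                blocked = True
                break
            parent = DEPENDS_ON[parent]
        if not blocked:
            runnable.append(metric)
    return runnable


print(runnable_metrics({"Put"}))
```

With `Put` marked as failed, only `GetSURLs`, `LsDir` and `Put` itself remain in the list, since the four metrics below `Put` need a file that was never stored.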
- help output from org.sam.SRM-probe:
[kvs] src > ./SRM-probe
Usage: /usr/libexec/grid-monitoring/probes/org.sam/SRM-probe
[-H|--hostname <FQDN>]|[-u|--uri <URI>] [-m|--metric <name>] [-t|--timeout sec]
[-V] [-h|--help] [--wlcg] [-v|--verbose 0-3] [-l|--list] [-x proxy] [<metric
specific parameters>]
-V Displays version
-h|--help Displays help
-t|--timeout sec Sets metric's global timeout. (Default: 600)
-m|--metric <name> Name of a metric to be collected. Eg. org.sam.SRMv2-Put.
If not given, a default wrapper metric will be executed.
-H|--hostname FQDN Hostname where a service to be tested is running on
-u|--uri <URI> Service URI to be tested
-v|--verbose 0-3 Verbosity. (Default: 0)
0 Single line, minimal output. Summary
1 Single line, additional information
2 Multi line, configuration debug output
3 Lots of details for plugin problem diagnosis
-l|--list Metrics list in WLCG format
-x VOMS proxy (Order: -x, X509_USER_PROXY, /tmp/x509up_u<UID>)
--nosanity Don't sanitize metrics output.
Mandatory paramters: hostname (-H) or URI (-u).
If specified with -m|--metric <name>, the given metric will be executed.
Otherwise, a wrapper metric (acting as an active check) will be run. The
latter is equivalent to "-m|--metric <nameSpace>.<Service>-All"
Metrics common parameters:
Reporting passive checks (when used with wrapper checks)
--pass-check-dest <config|nsca|nagcmd|active> (Default: config)
--pass-check-conf <path> Configuration file for reporting passive checks.
Used with '--pass-check-dest config'. Overrides
passive checks submission library default one.
--nsca-server <fqdn|ip> NSCA server FQDN or IP. Required if --pass-check-dest
is set to 'nsca'.
--nsca-port <port> Port NSCA is listening on (Default: 5667)
--send-nsca <path> NSCA client binary. (Default: /usr/sbin/send_nsca)
--send-nsca-conf <path> NSCA configuration file. (Default: /etc/nagios/send_nsca.cfg)
--nagcmdfile <path> Nagios command file.
Order: $NAGIOS_COMMANDFILE, --nagcmdfile
(Default: /var/nagios/rw/nagios.cmd)
--vo <name> Virtual Organization. (Default: ops)
--err-db <file> Full path. Database file containing gLite CLI/API errors
for categorizing runtime errors. (Default: /etc/gridmon/gridmon.errdb)
--err-topics <top1,> Comma separated list of topics (Default: default)
--work-dir <dir> Working directory for metrics.
(Default: /var/run/gridprobes/<VO>)
--stdout Detailed output of metrics will be printed to stdout as
it is being produced by metrics. The default is to store
the output in a container and, then, produce Nagios
compiant output.
Metrics specific options:
--srmv <1|2> (Default: 2)
org.sam.SRM-GetSURLs
--ldap-uri <URI> Format [ldap://]hostname[:port[/]]
(Default: ldap://sam-bdii.cern.ch:2170)
--ldap-timeout <sec> (Default: 10)
org.sam.SRM-{LsDir,Put,Ls,GetTURLs,Get,Del}
--se-timeout <sec> (Default: 120)
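As an illustration of the options listed in the help output above, the following sketch assembles an SRM-probe command line. It only uses flags that appear in the help text (-H, -m, --vo, --srmv, --se-timeout); the helper name `srm_probe_argv` and the host name `se.example.org` are hypothetical, and the sketch does not run the probe.

```python
import shlex

PROBE = "/usr/libexec/grid-monitoring/probes/org.sam/SRM-probe"

# Metrics for which the help output lists --se-timeout.
SE_TIMEOUT_METRICS = {
    "org.sam.SRM-LsDir", "org.sam.SRM-Put", "org.sam.SRM-Ls",
    "org.sam.SRM-GetTURLs", "org.sam.SRM-Get", "org.sam.SRM-Del",
}


def srm_probe_argv(hostname, metric=None, vo="ops", srmv=2, se_timeout=120):
    """Build the argument list for one SRM-probe invocation."""
    argv = [PROBE, "-H", hostname, "--vo", vo, "--srmv", str(srmv)]
    if metric is not None:
        argv += ["-m", metric]
    # Only pass --se-timeout to the metrics that document it.
    if metric in SE_TIMEOUT_METRICS:
        argv += ["--se-timeout", str(se_timeout)]
    return argv


argv = srm_probe_argv("se.example.org", "org.sam.SRM-Put")
print(" ".join(shlex.quote(a) for a in argv))
```

Leaving `metric` unset yields a command without -m, which according to the help text runs the default wrapper metric.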
SRM Results
- "-" -- no values
- "/" -- no mapping
- [1] SAM SRMv2-get-SURLs vs [2] SAM-Nag SRMv2-org.sam.SRM-GetSURLs
| STATUS | [1] | [2] | # | [1] | [2] | # | [1] | [2] |
| - | 2009-06-03; 08:20 || # | 2009-06-03; 16:00 || # | 2009-06-09; 15:54 ||
| 0/na | - | - | # | - | 1 | # | - | 3 |
| 10/ok | 324 | 323 | # | 323 | 322 | # | 322 | 310 |
| 20/info | - | / | # | - | / | # | - | / |
| 30/note | - | / | # | - | / | # | - | / |
| 40/warn | - | - | # | - | - | # | 6 | 12 |
| 50/error | 13 | 5 | # | 14 | 5 | # | 6 | 8 |
| 60/crit | - | - | # | - | - | # | - | / |
| 100/maint | - | / | # | - | / | # | - | / |
| nodes | 337 | 328 | # | 337 | 328 | # | 334 | 333 |
- [1] SAM SRMv2-put vs [2] SAM-Nag SRMv2-org.sam.SRM-Put
| STATUS | [1] | [2] | # | [1] | [2] | # | [1] | [2] |
| - | 2009-06-03; 08:20 || # | 2009-06-03; 16:00 || # | 2009-06-09; 15:54 ||
| 0/na | - | - | # | - | 1 | # | - | - |
| 10/ok | 324 | 323 | # | 323 | 322 | # | 328 | 321 |
| 20/info | - | / | # | - | / | # | - | / |
| 30/note | - | / | # | - | / | # | - | / |
| 40/warn | - | - | # | - | - | # | - | 6 |
| 50/error | 13 | 5 | # | 14 | 5 | # | 6 | 6 |
| 60/crit | - | / | # | - | / | # | - | / |
| 100/maint | - | / | # | - | / | # | - | / |
| nodes | 337 | 328 | # | 337 | 328 | # | 334 | 333 |
submission via WMS
With respect to job submission via WMS (and thus its monitoring), there are no differences between LCG CEs and CREAM CEs. Please refer to the CE section.
Probe and metric names differ only in the name of the service (CREAM vs CE): probe org.sam/CREAMCE-probe, metrics org.sam.CREAMCE-*
direct submission
The org.sam.CREAMCEDJS-probe probe provides the following metrics:
| metric | description |
| org.sam.CREAMCEDJS-DirectJobState | [Active+Passive] Direct job submission to CREAM CE |
| org.sam.CREAMCEDJS-DirectJobStatus | [Passive] Final status of direct job submission to CREAM CE |
| org.sam.CREAMCEDJS-DirectJobMonit | [Active] Babysit submitted grid jobs |
| org.sam.CREAMCEDJS-ServiceInfo | Get CREAM CE service info |
| org.sam.CREAMCEDJS-SubmitAllowed | Check if submission to the CREAM CE is allowed |
| org.sam.CREAMCEDJS-DelegateProxy | Delegate proxy to CREAM CE |
CE
- Equivalence [see link for critical tests defined in SAM for CE for 'OPS' VO]
* - critical & accounted for availability
- Probe org.sam.CE-probe tests job submission to CEs via WMS.
- The check delivers tests to WNs (e.g. org.sam.WN-probe) and executes the respective metrics there. Currently, Nagios is used as the scheduler on WNs. The 'handle_service_check' OCSP command stores metric results as WLCG tuples; 'send_to_msg', invoked periodically from the wrapper script on the WN, then sends the tuples to the MB.
- help output from org.sam.CE-probe:
# ./CE-probe -h
Usage: /usr/libexec/grid-monitoring/probes/org.sam/CE-probe
[-H|--hostname <FQDN>]|[-u|--uri <URI>] [-m|--metric <name>] [-t|--timeout sec]
[-V] [-h|--help] [--wlcg] [-v|--verbose 0-3] [-l|--list] [-x proxy] [<metric
specific parameters>]
-V Displays version
-h|--help Displays help
-t|--timeout sec Sets metric's global timeout. (Default: 600)
-m|--metric <name> Name of a metric to be collected. Eg. org.sam.SRMv2-Put.
If not given, a default wrapper metric will be executed.
-H|--hostname FQDN Hostname where a service to be tested is running on
-u|--uri <URI> Service URI to be tested
-v|--verbose 0-3 Verbosity. (Default: 0)
0 Single line, minimal output. Summary
1 Single line, additional information
2 Multi line, configuration debug output
3 Lots of details for plugin problem diagnosis
-l|--list Metrics list in WLCG format
-x VOMS proxy (Order: -x, X509_USER_PROXY, /tmp/x509up_u<UID>)
--nosanity Don't sanitize metrics output.
Mandatory paramters: hostname (-H) or URI (-u).
If specified with -m|--metric <name>, the given metric will be executed.
Otherwise, a wrapper metric (acting as an active check) will be run. The
latter is equivalent to "-m|--metric <nameSpace>.<Service>-All"
Metrics common parameters:
Reporting passive checks (when used with wrapper checks)
--pass-check-dest <config|nsca|nagcmd|active> (Default: config)
--pass-check-conf <path> Configuration file for reporting passive checks.
Used with '--pass-check-dest config'. Overrides
passive checks submission library default one.
--nsca-server <fqdn|ip> NSCA server FQDN or IP. Required if --pass-check-dest
is set to 'nsca'.
--nsca-port <port> Port NSCA is listening on (Default: 5667)
--send-nsca <path> NSCA client binary. (Default: /usr/sbin/send_nsca)
--send-nsca-conf <path> NSCA configuration file. (Default: /etc/nagios/send_nsca.cfg)
--nagcmdfile <path> Nagios command file.
Order: $NAGIOS_COMMANDFILE, --nagcmdfile
(Default: /var/nagios/rw/nagios.cmd)
--vo <name> Virtual Organization. (Default: ops)
--err-db <file> Full path. Database file containing gLite CLI/API errors
for categorizing runtime errors. (Default: /etc/gridmon/gridmon.errdb)
--err-topics <top1,> Comma separated list of topics (Default: default)
--work-dir <dir> Working directory for metrics.
(Default: /var/run/gridprobes/<VO>)
--stdout Detailed output of metrics will be printed to stdout as
it is being produced by metrics. The default is to store
the output in a container and, then, produce Nagios
compiant output.
Metrics specific parameters:
--namespace <string> Name-space for the probe. (Default: org.sam)
--config <file1,> Comma separated list of metrics configuration files.
(Default: /etc/gridmon/org.sam.conf)
org.sam.CE-JobState
--mb-destination <dest> Mandatory parameter. The destination queue/topic on
Message Broker to publish to.
--mb-uri <URI> Message Broker URI. If not given, MB discovery will be
performed on WN to find working MB.
Format for <URI>: [failover://\(]<uri>,[...][\)]
<uri> - stomp://FQDN:port/ or http://FQDN/message
(Default: service discovery on WN.)
--wms <wms> WMS to be used for job submission. If not given, default
WMProxy end-points defined on the UI will be used.
--timeout-wnjob-global <sec> Global timeout for a job on WN. (Default: 600)
--add-wntar-nag <d1,d2,..> Comma-separated list of top level directories with
Nagios compliant directories structure to be added
to tarball to be sent to WN.
--add-wntar-nag-nosam Instructs the metric not to include standard SAM WN
probes and their Nagios config to WN tarball.
(Default: WN probes are included)
--add-wntar-nag-nosamcfg Instructs the metric not to include Nagios
configuration for SAM WN probes to WN tarball. The
probes themselves and respective Python packages,
however, will be included.
--jdl-templ <file> JDL template file (full path). Default:
<org.sam.ProbesLocation>/wnjob/org.sam.gridJob.jdl.template
--jdl-retrycount <val> JDL RetryCount (Default: 0).
--jdl-shallowretrycount <val> JDL ShallowRetryCount (Default: 1).
--wnjob-location <dir> Full path to directory contaning WN scheduler.
(Default: <org.sam.ProbesLocation>/wnjob)
--wnjob-verb <0-3> Verbosity level on WN (Default: 1)
org.sam.CE-JobMonit
--timeout-job-global <sec> Global timeout for jobs. Job will be canceled
and dropped if it is not in terminal state by
that time. (Default: 3300)
--timeout-job-waiting <sec> Time allowed for a job to stay in Waiting with
'no compatible resources'. (Default: 2700)
--hosts <h1,h2,..> Comma-separated list of CE hostnames to run monitor on.
CE Results
- "-" -- no values
- "/" -- no mapping
- [1] SAM CE-sft-job vs [2] SAM-Nag org.sam.CE-JobSubmit
| STATUS | [1] | [2] | # | [1] | [2] | # | [1] | [2] | # | [1] | [2] |
| - | 2009-05-29; 17:10 || # | 2009-05-29; 18:30 || # | 2009-06-03; 08:20 || # | 2009-06-09; 15:54 ||
| 0/na | - | - | # | - | - | # | - | - | # | - | - |
| 10/ok | - | 254 | # | 366 | 340 | # | 367 | 351 | # | 372 | 362 |
| 20/info | - | / | # | - | / | # | - | / | # | - | / |
| 30/note | - | / | # | - | / | # | - | / | # | 1 | / |
| 40/warn | - | 25 | # | - | - | # | - | 39 | # | - | 29 |
| 50/error | - | - | # | 28 | 38 | # | 28 | - | # | 24 | - |
| 60/crit | - | / | # | - | / | # | - | / | # | - | / |
| 100/maint | - | / | # | - | / | # | - | / | # | - | / |
| nodes | - | 279 | # | 394 | 378 | # | 395 | 390 | # | 397 | 391 |
Nagios CE testing
- three Nagios checks
  - CE-JobState - active + passive check (service <hostNameCE,CE-JobState>). Runs hourly.
    - submits a grid job to the CE
    - accepts passive check results (from CE-JobMonit) for the submitted grid job; holds the status of the grid job
  - CE-JobMonit - active check (service <localhost,CE-JobMonit>). Runs every 5 min.
    - checks the statuses of all submitted jobs and updates CE-JobState and CE-JobSubmit (acts as a babysitter for all grid jobs submitted by CE-JobState service instances). CE-JobState and CE-JobSubmit are updated (as passive checks) either via the Nagios command file or NSCA.
  - CE-JobSubmit - passive check (service <hostNameCE,CE-JobSubmit>)
    - holds the terminal status of job submission to the CE (mapping from gLite job terminal states ['Done','Aborted','Canceled'] to Nagios status [OK,WARNING,CRITICAL,UNKNOWN])
- Nagios configuration by example.
  - services.cfg for org.sam.CE-{JobState,JobMonit,JobSubmit}-<VO> using the ncg-generic-service and ncg-passive-service service object templates
# org.sam.CE-JobState : [active+passive] submits grid job to CE, holds a status of the grid job
define service{
    use                     ncg-generic-service
    host_name               ce110.cern.ch
    servicegroups           local, ops
    service_description     org.sam.CE-JobState-ops
    contact_groups          CERN_PPS-site
    check_command           ncg_check_native!$USER10$/CE-probe!600!-x $USER5$ --vo ops -m org.sam.CE-JobState --mb-destination /topic/grid...
    active_checks_enabled   1
    passive_checks_enabled  1
    normal_check_interval   60
    retry_check_interval    15
    max_check_attempts      3
    obsess_over_service     0
    # + _vo, _service_uri, _metric_name, _metric_set, _site_name
}
# org.sam.CE-JobMonit : [active] babysitter
define service{
    use                     ncg-generic-service
    host_name               lxvm0325.cern.ch
    servicegroups           local, ops
    service_description     org.sam.CE-JobMonit-ops
    contact_groups          nagios-admins
    check_command           ncg_check_native!$USER10$/CE-probe!600!-x $USER5$ --vo ops -m org.sam.CE-JobMonit --mb-destination /topic/grid...
    active_checks_enabled   1
    passive_checks_enabled  0
    normal_check_interval   5
    retry_check_interval    2
    max_check_attempts      2
    obsess_over_service     0
    # + _vo, _service_uri, _metric_name, _metric_set, _site_name
}
# org.sam.CE-JobSubmit : [passive] terminal status of job submission to CE
define service{
    use                     ncg-passive-service
    host_name               ce110.cern.ch
    servicegroups           local, ops
    service_description     org.sam.CE-JobSubmit-ops
    contact_groups          CERN_PPS-site
    check_command           ncg_check_passive!"just nothing"
    obsess_over_service     1
    # + _vo, _service_uri, _metric_name, _metric_set, _site_name
}
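The passive updates of CE-JobState and CE-JobSubmit via the Nagios command file can be pictured with the sketch below. It uses the standard Nagios external command PROCESS_SERVICE_CHECK_RESULT and the default command file path from the probe's help output; the helper names are hypothetical, and the probes' own passive check submission library is not shown on this page, so this is only an approximation of what it does.

```python
import time

NAGIOS_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def passive_result_line(host, service, status, output, now=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line."""
    timestamp = int(time.time() if now is None else now)
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, NAGIOS_CODES[status], output)


def submit_passive_result(line, cmdfile="/var/nagios/rw/nagios.cmd"):
    """Write the command to the Nagios command file (a named pipe)."""
    with open(cmdfile, "w") as pipe:
        pipe.write(line)


# Example: report a terminal job status for the passive CE-JobSubmit service.
print(passive_result_line("ce110.cern.ch", "org.sam.CE-JobSubmit-ops",
                          "OK", "Job done", now=1244562840), end="")
```

`submit_passive_result` is only defined, not called, because it needs a running Nagios instance with external commands enabled.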
Jobs Submission and Monitoring
According to the WMS Job State Machine (link p.17), a job can be in the following states:
- non-terminal: Submitted, Waiting, Ready, Scheduled, Running
  - Submitted: job has been entered by the user at the UI but not yet transferred to NS for processing
  - Waiting: job has been accepted by NS and is waiting for WM processing or is being processed by WM Helper modules (e.g., WM is busy, no appropriate CE (cluster) has been found yet, ...).
  - Ready: job has been processed by WM and its Helper modules (especially, appropriate CE has been found) but not yet transferred to the CE (local batch system queue) via Job Controller and CondorC.
  - Scheduled: job is waiting in the queue on the Computing Element.
  - Running: job is running.
- terminal: Done, Aborted, Canceled, Cleared
  - Done: job exited or is considered to be in a terminal state by CondorC (e.g., submission to CE has failed in an unrecoverable way).
  - Aborted: job processing was aborted by WMS (waiting in the WM queue or CE for too long, over-use of quotas, expiration of user credentials, etc.).
  - Canceled: job has been successfully canceled on user request.
  - Cleared: output sandbox was transferred to the user or removed due to the timeout.
On WMS there are two main parameters responsible for timeouts in job matchmaking:
- MatchRetryPeriod = 3500 (58 min) - interval between successive attempts to match a job to a resource (T_WMS_MatchRetr)
- ExpiryPeriod = 7200 (2 hours) - time after which the job will be aborted with 'no compatible resources' (T_WMS_Exp)
The defaults allow a job to be matched at most three times within two hours after job submission.
With JDL
JobType="Normal";
...
RetryCount = 0;
ShallowRetryCount = 1;
Requirements = other.GlueCEInfoHostName == "<CE hostname>";
and a 1 hour interval between job submissions, it is advisable to set e.g. MatchRetryPeriod = 1320 (22 min) and ExpiryPeriod = 3000 (50 min). This way WMS will naturally abort jobs if information about the CE isn't available in the IS.
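The "at most three times" figure can be checked with a line of arithmetic. The sketch below assumes that the first matchmaking attempt happens at submission time (t = 0) and that each retry follows exactly MatchRetryPeriod seconds after the previous one; the real WMS scheduling may differ slightly. The function name is hypothetical.

```python
def match_attempts(match_retry_period, expiry_period):
    """Count matchmaking attempts that fit before the job expires.

    Assumes attempts at t = 0, P, 2P, ... with P = match_retry_period,
    all strictly within expiry_period seconds (both in seconds).
    """
    return expiry_period // match_retry_period + 1


for retry, expiry in [(3500, 7200), (1320, 3000)]:
    print(retry, expiry, match_attempts(retry, expiry))
```

Under these assumptions both the WMS defaults (3500/7200) and the suggested values (1320/3000) give three attempts, so the shorter periods keep the number of matchmaking attempts while letting the job expire within the one hour submission interval.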
In Nagios, job submission and monitoring is implemented in the following way.
- org.sam.CE-JobState metric (active Nagios check). Runs hourly (normal_check_interval 60).
  - initially submits a job and saves /<workdirRun>/<voName>/<nameSpace>/<serviceType>/<nodeName>/activejob.map with submitTimeStamp|hostNameCE|serviceDesc|jobID|jobState|lastStateTimeStamp.
  - if activejob.map was found
    - jobState is a terminal state - discard the job, proceed with submission
    - jobState is a non-terminal state
      - lastStateTimeStamp - submitTimeStamp < timeout-job-discard - exit with OK: Active job - <jobState> [time]
      - lastStateTimeStamp - submitTimeStamp > timeout-job-discard - discard the job, proceed with submission
- org.sam.CE-JobMonit metric (active Nagios check; checks all jobs and updates activejob.map, org.sam.CE-JobState & org.sam.CE-JobSubmit). Runs every 5 min (normal_check_interval 5). For all currently submitted jobs (activejob.map files) get the job state from WMS
  - on error getting job state
    - UI problem - update org.sam.CE-JobState with WARNING
    - WMS problem
      - timeNow - submitTimeStamp < timeout-job-discard - update org.sam.CE-JobState with WARNING (unable to get job status. Job will be deleted in N min; N = (timeout-job-discard - (timeNow - submitTimeStamp))/60)
      - else - update org.sam.CE-JobState with WARNING and org.sam.CE-JobSubmit with UNKNOWN (unable to get job status. Job discarded.)
  - on OK getting job state
    - Done
      - Current Status: Done (Success) - update org.sam.CE-Job{State,Submit} with OK.
      - Current Status: Done (Exit Code != 0) - the framework on the WN exits with Nagios compliant exit codes. Check the exit code and update org.sam.CE-Job{State,Submit} respectively with WARNING, CRITICAL, UNKNOWN.
      - delete activejob.map.
    - Aborted - get logging info and get reason
      - request expired
        - BrokerHelper: no compatible resources - update org.sam.CE-Job{State,Submit} with CRITICAL (Job was aborted. Failed to match.).
        - else - update org.sam.CE-Job{State,Submit} with UNKNOWN (Job was aborted. Check WMS.)
      - else - update org.sam.CE-Job{State,Submit} with CRITICAL (Job was aborted.).
      - delete activejob.map.
    - Cleared
    - Canceled
    - Waiting - get logging info and get reason
      - no compatible resources
        - timeNow - submitTimeStamp > timeout-job-waiting - update org.sam.CE-Job{State,Submit} with CRITICAL (BrokerHelper: no compatible resources). Cancel and discard the job.
        - else - update org.sam.CE-JobState with WARNING. Update activejob.map.
      - else
        - timeNow - submitTimeStamp > timeout-job-discard - cancel & delete activejob.map.
    - Ready, Submitted
      - timeNow - submitTimeStamp > timeout-job-global - update org.sam.CE-Job{State,Submit} with UNKNOWN. Get logging info & include it in the details data; cancel & delete activejob.map.
      - else - update activejob.map.
    - Scheduled, Running
      - timeNow - submitTimeStamp > timeout-job-schedrun - update org.sam.CE-Job{State,Submit} with WARNING. Get logging info & include it in the details data; cancel & delete activejob.map. Issue CRITICAL on the second successive timeout (JobMonit keeps a counter of the states).
      - else - update activejob.map.
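The org.sam.CE-JobState branch of the logic above (what to do when activejob.map already exists) can be sketched as follows. This is not the probe's code: the function and type names are hypothetical, timestamps are assumed to be integer epoch seconds, and the case where the elapsed time equals timeout-job-discard exactly (not covered by the description) is treated as a discard.

```python
from collections import namedtuple

# Field order follows the activejob.map record described above.
ActiveJob = namedtuple(
    "ActiveJob",
    "submit_ts host_ce service_desc job_id job_state last_state_ts")

TERMINAL_STATES = {"Done", "Aborted", "Canceled", "Cleared"}


def parse_activejob(line):
    """Parse one '|'-separated activejob.map record."""
    fields = line.strip().split("|")
    return ActiveJob(int(fields[0]), fields[1], fields[2], fields[3],
                     fields[4], int(fields[5]))


def jobstate_action(job, timeout_job_discard):
    """Decide what CE-JobState does when activejob.map was found."""
    if job.job_state in TERMINAL_STATES:
        return "discard-and-submit"
    if job.last_state_ts - job.submit_ts < timeout_job_discard:
        return "OK: Active job - %s" % job.job_state
    return "discard-and-submit"


record = ("1000|ce.example.org|org.sam.CE-JobState-ops|"
          "https://wms.example.org:9000/abc|Running|4600")
job = parse_activejob(record)
print(jobstate_action(job, 6 * 3600))  # 6 hours, the T_J_DISCARD default
```

The record used here is made up for illustration; only the field layout comes from the description above.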
Currently [08-02-2010] we are monitoring with all the defaults:
             T_N_SCHED(1h)                  T_J_DISCARD
          T_J_GLOB |                T_J_SCHEDRUN |
       T_J_WAIT |  |                          |  |
|-----------|---|--#---------------|------...-|--| -> t
                   |               |
            T_WMS_MatchRetr    T_WMS_Exp
T_J_WAIT - 45 min (--timeout-job-waiting)
T_J_GLOB - 55 min (--timeout-job-global)
T_WMS_MatchRetr - 58 min (MatchRetryPeriod)
T_N_SCHED - 1 hour (Nagios metric scheduling)
T_WMS_Exp - 2 hours (ExpiryPeriod)
T_J_SCHEDRUN - 5h30min (--timeout-job-schedrun)
T_J_DISCARD - 6 hours (--timeout-job-discard)
Thus, in most cases we cancel jobs that are in Waiting due to 'no compatible resources' when T_J_WAIT kicks in (after only one initial matchmaking attempt in WM) and issue CRITICAL for org.sam.CE-Job{State,Submit}.
Moving to the case T_WMS_MatchRetr < T_WMS_Exp < T_J_WAIT (or even 2*T_WMS_MatchRetr < T_WMS_Exp < T_J_WAIT) is quite feasible. Jobs sent to CEs that are not (properly) published in the IS would then be discarded by WMS itself (Aborted; reason 'no compatible resources'). The monitoring metric (org.sam.CE-JobMonit) already handles this case and will issue CRITICAL against the CE.
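The two regimes discussed above differ only in which timeout is reached first by a job stuck in Waiting with 'no compatible resources'. The sketch below makes that comparison explicit; the function name is hypothetical, and the tie case (both timeouts equal) is arbitrarily attributed to WMS.

```python
MIN = 60  # seconds per minute


def waiting_outcome(t_j_wait, t_wms_exp):
    """Who ends a job stuck in Waiting with 'no compatible resources'?"""
    if t_j_wait < t_wms_exp:
        return "monitor cancels job at T_J_WAIT"
    return "WMS aborts job at T_WMS_Exp"


# Current defaults: T_J_WAIT = 45 min, T_WMS_Exp = 2 hours.
print(waiting_outcome(45 * MIN, 120 * MIN))
# Suggested WMS setting ExpiryPeriod = 3000 (50 min) with a hypothetical
# T_J_WAIT raised to 55 min, so that T_WMS_Exp < T_J_WAIT.
print(waiting_outcome(55 * MIN, 3000))
```

In both regimes the end result for the CE is CRITICAL; the difference is whether the monitoring framework or WMS itself terminates the job.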
WN
| SAM Test | WLCG/Nagios metric |
| * CE-sft-brokerinfo | org.sam.WN-Bi [samtest-run] |
| * CE-sft-caver | org.sam.WN-CAver [samtest-run] |
| * CE-sft-csh | org.sam.WN-Csh [samtest-run] |
| * CE-sft-softver | org.sam.WN-SoftVer [samtest-run] |
| | |
| * CE-sft-lcg-rm | org.sam.WN-Rep [wrapper for: org.sam.WN-{RepISenv,...,WN-RepDel}], [WN-probe] |
| -* CE-sft-lcg-rm-gfal | org.sam.WN-RepISenv |
| -* CE-sft-lcg-rm-free | org.sam.WN-RepFree |
| -* CE-sft-lcg-rm-cr | org.sam.WN-RepCr |
| -* CE-sft-lcg-rm-cp | org.sam.WN-RepGet |
| -* CE-sft-lcg-rm-rep | org.sam.WN-RepRep |
| -* CE-sft-lcg-rm-del | org.sam.WN-RepDel |
| | |
| CE-sft-posix | org.sam.WN-Gfal [wrapper for: org.sam.WN-{GfalCp,...,WN-GfalDel}]; remote POSIX I/O via GFAL: should depend on successful completion of org.sam.WN-Bi and org.sam.WN-RepCr |
| - | org.sam.WN-GfalCp |
| - | org.sam.WN-GfalRead |
| - | org.sam.WN-GfalWrite |
| - | org.sam.WN-GfalDel |
| | |
| CE-wn-sec-crl | org.sam.sec or SWAT |
| CE-wn-sec-fp | org.sam.sec or SWAT |
| CE-sft-wn | SWAT |
| CE-sft-vo-tag | SWAT |
| CE-sft-vo-swdir | SWAT |
| CE-sft-rgma | SWAT |
* - critical & accounted for availability
-* - sub-tests of CE-sft-lcg-rm; if one of them fails, the main wrapper fails
- Probe org.sam.WN-probe - performs security, replica management, and remote POSIX I/O (with GFAL) checks on WNs
| metricName | metricDescription | metricType | metricLocality |
| org.sam.WN-Rep | Wrapper check to launch the replica management checks and publish passive check results to Nagios. | status | local |
| org.sam.WN-RepISenv | Check if LCG_GFAL_INFOSYS variable is set | status | local |
| org.sam.WN-RepFree | Check if Close (or VO default) SE has any free space left according to the information system. | status | remote |
| org.sam.WN-RepCr | Copy and register a file to the Close (or default) SE into default space area. Retrieve list of replicas. | status | remote |
| org.sam.WN-RepGet | Copy the file back from Close SE to the WN. Compare the files. | status | remote |
| org.sam.WN-RepRep | Replicate the file from close SE to a chosen 'central' SE. | status | remote |
| org.sam.WN-RepDel | Delete given file(s) from SRM. | status | remote |
| | | | |
| org.sam.WN-PyVer | Check version of Python installed on WN. | status | local |
| | | | |
| org.sam.WN-Gfal | Wrapper check to launch checks for remote POSIX I/O via GFAL and publish passive check results to Nagios. | status | local |
| org.sam.WN-Gfal* | ... TODO ... | status | remote |
| org.sam.WN-CAver | CA integrity and existence (NB! validity is not tested. It's a responsibility of IGTF http://signet-ca.ijs.si/nagios/ ) | status | local |
| org.sam.WN-CAcrl | validity of CA CRL | status | local |
- Dependency tree for the org.sam.WN-Rep* metrics in the org.sam.WN-probe probe:
    0:Rep (wrapper)
           |
       1:RepISenv
         ^    ^
        /      \
  2:RepFree __3:RepCr____
             ^    ^    ^
            /     |     \
    4:RepGet 5:RepRep 6:RepDel
All metrics "1..6" are considered critical. If at least one of them fails, the wrapper metric org.sam.WN-Rep fails as well. This corresponds to the current SAM test implementation and GridView availability calculations (for equivalence, see the SAM check table above for CE-sft-lcg-rm / org.sam.WN-Rep and the SAM critical tests).
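The "any critical sub-metric fails, the wrapper fails" rule can be written down in a few lines. This is a sketch only: the page does not say which Nagios status the wrapper reports on failure, so the generic label "FAILED" is used, and a metric without a result is counted as failed (an assumption). The function name is hypothetical.

```python
# The six critical sub-metrics "1..6" of org.sam.WN-Rep.
REP_METRICS = ["RepISenv", "RepFree", "RepCr", "RepGet", "RepRep", "RepDel"]


def wrapper_status(results):
    """Return (status, failed_metrics) for the org.sam.WN-Rep wrapper.

    `results` maps metric abbreviation -> Nagios status string.
    A missing result counts as a failure (assumption of this sketch).
    """
    failed = [m for m in REP_METRICS if results.get(m) != "OK"]
    if failed:
        return "FAILED", failed
    return "OK", failed


all_ok = dict((m, "OK") for m in REP_METRICS)
print(wrapper_status(all_ok))
print(wrapper_status(dict(all_ok, RepCr="CRITICAL")))
```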
- help output from org.sam.WN-probe:
# ./WN-probe -h
Usage: /usr/libexec/grid-monitoring/probes/org.sam/WN-probe
[-H|--hostname <FQDN>]|[-u|--uri <URI>] [-m|--metric <name>] [-t|--timeout sec]
[-V] [-h|--help] [--wlcg] [-v|--verbose 0-3] [-l|--list] [-x proxy] [<metric
specific parameters>]
-V Displays version
-h|--help Displays help
-t|--timeout sec Sets metric's global timeout. (Default: 600)
-m|--metric <name> Name of a metric to be collected. Eg. org.sam.SRMv2-Put.
If not given, a default wrapper metric will be executed.
-H|--hostname FQDN Hostname where a service to be tested is running on
-u|--uri <URI> Service URI to be tested
-v|--verbose 0-3 Verbosity. (Default: 0)
0 Single line, minimal output. Summary
1 Single line, additional information
2 Multi line, configuration debug output
3 Lots of details for plugin problem diagnosis
-l|--list Metrics list in WLCG format
-x VOMS proxy (Order: -x, X509_USER_PROXY, /tmp/x509up_u<UID>)
--nosanity Don't sanitize metrics output.
Mandatory paramters: hostname (-H) or URI (-u).
If specified with -m|--metric <name>, the given metric will be executed.
Otherwise, a wrapper metric (acting as an active check) will be run. The
latter is equivalent to "-m|--metric <nameSpace>.<Service>-All"
Metrics common parameters:
Reporting passive checks (when used with wrapper checks)
--pass-check-dest <config|nsca|nagcmd|active> (Default: config)
--pass-check-conf <path> Configuration file for reporting passive checks.
Used with '--pass-check-dest config'. Overrides
passive checks submission library default one.
--nsca-server <fqdn|ip> NSCA server FQDN or IP. Required if --pass-check-dest
is set to 'nsca'.
--nsca-port <port> Port NSCA is listening on (Default: 5667)
--send-nsca <path> NSCA client binary. (Default: /usr/sbin/send_nsca)
--send-nsca-conf <path> NSCA configuration file. (Default: /etc/nagios/send_nsca.cfg)
--nagcmdfile <path> Nagios command file.
Order: $NAGIOS_COMMANDFILE, --nagcmdfile
(Default: /var/nagios/rw/nagios.cmd)
--vo <name> Virtual Organization. (Default: ops)
--err-db <file> Full path. Database file containing gLite CLI/API errors
for categorizing runtime errors. (Default: /etc/gridmon/gridmon.errdb)
--err-topics <top1,> Comma separated list of topics (Default: default)
--work-dir <dir> Working directory for metrics.
(Default: /var/run/gridprobes/<VO>)
--stdout Detailed output of metrics will be printed to stdout as
it is being produced by metrics. The default is to store
the output in a container and, then, produce Nagios
compiant output.
Metrics specific options:
org.sam.WN-{Rep,Rep{Cr,Get,Rep,Del}}
--lfc <FQDN> LFC to be used with the replica management tests.
(Default: prod-lfc-shared-central.cern.ch)
org.sam.WN-{Rep,RepRep}
--se-rep <FQDN> "Central" SE to be used with the replica management test.
(Default: samdpm002.cern.ch)
gLExec
This is a WN test which tries to execute
export GLEXEC_CLIENT_CERT=$X509_USER_PROXY; <path_to>/glexec /usr/bin/id -a
A separate CE job submission/monitoring chain (org.sam.glexec.CE-Job{State,Submit,Monit}-<VO>) is used to deliver the test to WNs. The Nagios metric name is org.sam.glexec.WN-gLExec-<VO>. OPS VO Role=pilot is used to submit the jobs.
Currently [Friday, March 05 2010], the test exit codes (Nagios) and summaries are the following:
status = 'OK'
summary = 'success'
status = 'UNKNOWN'
summary = "glexec command not found."
status = 'WARNING'
summary = "client cert file error: <error message>"
status = 'WARNING'
summary = "executable can't be executed (126)"
status = 'WARNING'
summary = "client error (201)"
status = 'WARNING'
summary = "system error (202)"
status = 'WARNING'
summary = "authorization error (203)"
status = 'UNKNOWN'
summary = "exit code overlap (204)"
status = 'UNKNOWN'
summary = "unrecognised exit code (<exit code>)"
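The status/summary pairs above can be read as a lookup on the glexec exit code. The following is a minimal sketch of that reading, not the probe's actual code: the helper name classify_glexec_exit is hypothetical, the assumption that exit code 0 corresponds to 'success' is ours, and the two cases that do not depend on an exit code ("glexec command not found." and the client cert file error) are left out.

```python
# Standard Nagios plugin return codes.
NAGIOS_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}

# Exit codes and summaries as listed above.
# Assumption: exit code 0 is what the probe reports as 'success'.
_GLEXEC_EXIT_MAP = {
    0: ("OK", "success"),
    126: ("WARNING", "executable can't be executed (126)"),
    201: ("WARNING", "client error (201)"),
    202: ("WARNING", "system error (202)"),
    203: ("WARNING", "authorization error (203)"),
    204: ("UNKNOWN", "exit code overlap (204)"),
}


def classify_glexec_exit(exit_code):
    """Hypothetical helper: map a glexec exit code to (status, summary)."""
    default = ("UNKNOWN", "unrecognised exit code (%d)" % exit_code)
    return _GLEXEC_EXIT_MAP.get(exit_code, default)


if __name__ == "__main__":
    for code in (0, 126, 203, 204, 42):
        status, summary = classify_glexec_exit(code)
        print("%s: %s (Nagios exit %d)" % (status, summary, NAGIOS_CODES[status]))
```

Keeping the table as data makes it easy to compare against the list above when the probe's summaries change.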
WN Results
Legend:
- "-" -- no values
- "/" -- no mapping
- [1] SAM CE-sft-lcg-rm vs [2] SAM-Nag org.sam.WN-Rep
STATUS | [1] | [2] | # | [1] | [2] | # | [1] | [2] | # | [1] | [2] | # | [1] | [2]
- | 2009-05-29; 17:10 | # | 2009-05-29; 18:40 | # | 2009-06-03; 08:20 | # | 2009-06-03; 16:00 | # | 2009-06-09; 15:54
0/na | - | 9 | # | - | 35 | # | - | 35 | # | - | 38 | # | - | 20
10/ok | - | 35 | # | 355 | 148 | # | 352 | 173 | # | 353 | 182 | # | 369 | 292
20/info | - | / | # | - | / | # | - | / | # | - | / | # | - | /
30/note | - | / | # | - | / | # | - | / | # | - | / | # | - | /
40/warn | - | 3 | # | - | 4 | # | 1 | 4 | # | - | 4 | # | 1 | 4
50/error | - | 6 | # | 30 | 47 | # | 32 | 59 | # | 33 | 49 | # | 19 | 34
60/crit | - | / | # | - | / | # | - | / | # | - | / | # | - | /
100/maint | - | / | # | - | / | # | - | / | # | - | / | # | - | /
nodes | - | 53 | # | 385 | 234 | # | 385 | 271 | # | 386 | 273 | # | 389 | 350
- [1] SAM CE-sft-caver vs [2] SAM-Nag org.sam.WN-CAver
STATUS | [1] | [2] | # | [1] | [2]
- | 2009-06-03; 08:20 | # | 2009-06-09; 15:54
0/na | - | - | # | - | 1
10/ok | 383 | 262 | # | 387 | 342
20/info | - | / | # | - | /
30/note | - | / | # | - | /
40/warn | - | 9 | # | - | 7
50/error | 1 | - | # | - | -
60/crit | 1 | / | # | 2 | /
100/maint | - | / | # | - | /
nodes | 385 | 271 | # | 389 | 350
- [1] SAM CE-sft-softver vs [2] SAM-Nag CE-org.sam.WN-SoftVer
STATUS | [1] | [2] | # | [1] | [2]
- | 2009-06-03; 08:20 | # | 2009-06-09; 15:54
0/na | - | - | # | - | -
10/ok | 385 | 262 | # | 389 | 343
20/info | - | / | # | - | /
30/note | - | / | # | - | /
40/warn | - | 9 | # | - | 7
50/error | - | - | # | - | -
60/crit | - | / | # | - | /
100/maint | - | / | # | - | /
nodes | 385 | 271 | # | 389 | 350
Missing WN test results
Issue 5.
CEs seen suffering from the issue but now in OK state. Time window:
Wed Feb 3 01:00:00 CET 2010
Sun Feb 8 18:45:36 CET 2010
SWE UPV-GRyCAP ramses.dsic.upv.es [08-02-2010]
NE KTU-BG-GLITE ce.bg.ktu.lt OK [07-02-2010]
SEE HG-02-IASA ce01.marie.hellasgrid.gr OK [07-02-2010]
CE PSNC creamce.reef.man.poznan.pl OK [07-02-2010] CREAM CE
NE CSC egee-ce.csc.fi OK [07-02-2010]
CE GUP-JKU egee-ce1.gup.uni-linz.ac.at OK [07-02-2010]
UKI UKI-SOUTHGRID-BHAM-HEP epgce4.ph.bham.ac.uk OK [07-02-2010]
AP MY-UPM-BIRUNI-01 haitham.biruni.upm.my OK [07-02-2010]
FR AUVERGRID iut03auvergridce01.univ-bpclermont.fr OK [07-02-2010]
CANADA VICTORIA-LCG2 lcg-ce.rcf.uvic.ca OK [07-02-2010]
UKI UKI-SOUTHGRID-OX-HEP ngsce-test.oerc.ox.ac.uk OK [07-02-2010]
DECH UNI-DORTMUND udo-ce03.grid.tu-dortmund.de OK [07-02-2010]
SEE WEIZMANN-LCG2 wipp-ce.weizmann.ac.il OK [07-02-2010]
Detailed error messages for the above issues.
-
Can't locate URI.pm in @INC
Check if provided MBs are working.
WARNING: Provided MB isn't accessible stomp://gridmsg002.cern.ch:6163/
Trying to obtain it from IS.
ERROR: Failed to obtain Message Broker URI from bdii.grid.sinica.edu.tw:2170.
Can't locate URI.pm in @INC (@INC contains: /opt/lcg/lib64/perl /opt/gpt/lib/perl/x86_64-linux-thread-multi /opt/gpt/lib/perl /home/ops031/globus-tmp.compute-1-13.1672.0/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2f-LUJNLqgrm7yadDNrsfHmQ/.nagios/lib/perl5/site_perl/5.8.5 /home/ops031/globus-tmp.compute-1-13.1672.0/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2f-LUJNLqgrm7yadDNrsfHmQ/.nagios/lib/perl5/vendor_perl/5.8.5/x86_64-linux-thread-multi /home/ops031/globus-tmp.compute-1-13.1672.0/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2f-LUJNLqgrm7yadDNrsfHmQ/.nagios/lib/perl5/vendor_perl/5.8.5 /usr/lib64/perl5/site_perl/5.8.8/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.7/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.6/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.5/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.8.8 /usr/lib/perl5/site_perl/5.8.7 /usr/lib/perl5/site_perl/5.8.6 /usr/lib/perl5/site_perl/5.8.5 /usr/lib/perl5/site_perl /usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.8.7/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.8.6/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.8.5/x86_64-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.8 /usr/lib/perl5/vendor_perl/5.8.7 /usr/lib/perl5/vendor_perl/5.8.6 /usr/lib/perl5/vendor_perl/5.8.5 /usr/lib/perl5/vendor_perl /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/5.8.8 .) at /home/ops031/globus-tmp.compute-1-13.1672.0/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2f-LUJNLqgrm7yadDNrsfHmQ/.nagios/lib/perl5/site_perl/5.8.5/GridMon/MsgBroker.pm line 30, <DATA> line 225. BEGIN failed--compilation aborted at /home/ops031/globus-tmp.compute-1-13.1672.0/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2f-LUJNLqgrm7yadDNrsfHmQ/.nagios/lib/perl5/site_perl/5.8.5/GridMon/MsgBroker.pm line 30, <DATA> line 225. 
Compilation failed in require at /home/ops031/globus-tmp.compute-1-13.1672.0/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2f-LUJNLqgrm7yadDNrsfHmQ/.nagios/bin/find_broker line 34, <DATA> line 225. BEGIN failed--compilation aborted at /home/ops031/globus-tmp.compute-1-13.1672.0/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2f-LUJNLqgrm7yadDNrsfHmQ/.nagios/bin/find_broker line 34, <DATA> line 225.
lobus-tmp.compute-1-13.1672.0/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2f-LUJNLqgrm7yadDNrsfHmQ/.nagios/bin/check_broker line 17.
-
Can't locate Net/LDAP.pm in @INC
Check if provided MBs are working.
WARNING: Provided MB isn't accessible stomp://gridmsg002.cern.ch:6163/
Trying to obtain it from IS.
ERROR: Failed to obtain Message Broker URI from bdii.mif.vu.lt:2170.
Can't locate Net/LDAP.pm in @INC (@INC contains: /opt/gLite/lcg/lib/perl /usr/lib/perl5/vendor_perl/5.8.8/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.8 /usr/lib/perl5/vendor_perl/5.8.5/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.5 /opt/gLite/gpt/lib/perl/i386-linux-thread-multi /opt/gLite/gpt/lib/perl /home/ops003/globus-tmp.wn4a.26989.0/https_3a_2f_2fwmssamtest02.cern.ch_3a9000_2fBkx0Jmc1zU7RKQtN-gXq4w/.nagios/lib/perl5/site_perl/5.8.5 /home/ops003/globus-tmp.wn4a.26989.0/https_3a_2f_2fwmssamtest02.cern.ch_3a9000_2fBkx0Jmc1zU7RKQtN-gXq4w/.nagios/lib/perl5/vendor_perl/5.8.5/i386-linux-thread-multi /home/ops003/globus-tmp.wn4a.26989.0/https_3a_2f_2fwmssamtest02.cern.ch_3a9000_2fBkx0Jmc1zU7RKQtN-gXq4w/.nagios/lib/perl5/vendor_perl/5.8.5 /usr/lib/perl5/site_perl/5.8.8/i386-linux-thread-multi /usr/lib/perl5/site_perl/5.8.7/i386-linux-thread-multi /usr/lib/perl5/site_perl/5.8.6/i386-linux-thread-multi /usr/lib/perl5/site_perl/5.8.5/i386-linux-thread-multi /usr/lib/perl5/site_perl/5.8.8 /usr/lib/perl5/site_perl/5.8.7 /usr/lib/perl5/site_perl/5.8.6 /usr/lib/perl5/site_perl/5.8.5 /usr/lib/perl5/site_perl /usr/lib/perl5/vendor_perl/5.8.8/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.7/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.6/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.5/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.8 /usr/lib/perl5/vendor_perl/5.8.7 /usr/lib/perl5/vendor_perl/5.8.6 /usr/lib/perl5/vendor_perl/5.8.5 /usr/lib/perl5/vendor_perl /usr/lib/perl5/5.8.8/i386-linux-thread-multi /usr/lib/perl5/5.8.8 .) at /home/ops003/globus-tmp.wn4a.26989.0/https_3a_2f_2fwmssamtest02.cern.ch_3a9000_2fBkx0Jmc1zU7RKQtN-gXq4w/.nagios/bin/find_broker line 32. BEGIN failed--compilation aborted at /home/ops003/globus-tmp.wn4a.26989.0/https_3a_2f_2fwmssamtest02.cern.ch_3a9000_2fBkx0Jmc1zU7RKQtN-gXq4w/.nagios/bin/find_broker line 32.
idMon/MsgBroker.pm line 30.
BEGIN failed--compilation aborted at /home/ops003/globus-tmp.wn4a.26989.0/https_3a_2f_2fwmssamtest02.cern.ch_3a9000_2fBkx0Jmc1zU7RKQtN-gXq4w/.nagios/lib/perl5/site_perl/5.8.5/GridMon/MsgBroker.pm line 30.
Compilation failed in require at /home/ops003/globus-tmp.wn4a.26989.0/https_3a_2f_2fwmssamtest02.cern.ch_3a9000_2fBkx0Jmc1zU7RKQtN-gXq4w/.nagios/bin/check_broker line 17.
BEGIN failed--compilation aborted at /home/ops003/globus-tmp.wn4a.26989.0/https_3a_2f_2fwmssamtest02.cern.ch_3a9000_2fBkx0Jmc1zU7RKQtN-gXq4w/.nagios/bin/check_broker line 17.
-
Compilation failed in require at /glite/home/ops/ops042/.../.nagios/libexec/grid-monitoring/plugins/nagios/send_to_msg
_send_to_msg [Sat Jan 16 09:29:35 CET 2010]:
/ops/ops042/gram_scratch_aI3WCJLGUe/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fsCjfOt0BlxCoJ0te6SJ8qA/.nagios/lib/perl5/site_perl/5.8.5/GridMon/MsgCache.pm line 23.
Compilation failed in require at /glite/home/ops/ops042/gram_scratch_aI3WCJLGUe/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fsCjfOt0BlxCoJ0te6SJ8qA/.nagios/libexec/grid-monitoring/plugins/nagios/send_to_msg line 57.
BEGIN failed--compilation aborted at /glite/home/ops/ops042/gram_scratch_aI3WCJLGUe/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fsCjfOt0BlxCoJ0te6SJ8qA/.nagios/libexec/grid-monitoring/plugins/nagios/send_to_msg line 57.
Can't load '/glite/home/ops/ops042/gram_scratch_aI3WCJLGUe/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fsCjfOt0BlxCoJ0te6SJ8qA/.nagios/lib/perl5/vendor_perl/5.8.5/auto/Time/HiRes/HiRes.so' for module Time::HiRes: /glite/home/ops/ops042/gram_scratch_aI3WCJLGUe/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fsCjfOt0BlxCoJ0te6SJ8qA/.nagios/lib/perl5/vendor_perl/5.8.5/auto/Time/HiRes/HiRes.so: wrong ELF class: ELFCLASS32 at /usr/lib/perl/5.8/DynaLoader.pm line 225.
at /glite/home/ops/ops042/gram_scratch_aI3WCJLGUe/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fsCjfOt0BlxCoJ0te6SJ8qA/.nagios/lib/perl5/vendor_perl/5.8.5/IPC/DirQueue.pm line 63
Compilation failed in require at /glite/home/ops/ops042/gram_scratch_aI3WCJLGUe/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fsCjfOt0BlxCoJ0te6SJ8qA/.nagios/lib/perl5/vendor_perl/5.8.5/IPC/DirQueue.pm line 63.
BEGIN failed--compilation aborted at /glite/home/ops/ops042/gram_scratch_aI3WCJLGUe/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fsCjfOt0BlxCoJ0te6SJ8qA/.nagios/lib/perl5/vendor_perl/5.8.5/IPC/DirQueue.pm line 63.
Compilation failed in require at /glite/home/ops/ops042/gram_scratch_aI3WCJLGUe/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fsCjfOt0BlxCoJ0te6SJ8qA/.nagios/lib/perl5/site_perl/5.8.5/GridMon/MsgCache.pm line 23.
BEGIN failed--compilation aborted at /glite/home/ops/ops042/gram_scratch_aI3WCJLGUe/https_3a_2f_2fwmssamtest01.cern.ch_3a9000_2fsCjfOt0BlxCoJ0te6SJ8qA/.nagios/lib/perl5/site_perl/5.8.5/GridMon/MsgCache.pm line 23.
-
Passive... 'org.sam.WNRepCr-ops'..., but the service could not be found!
Warning: Passive check result was received for service 'org.sam.WNRepCr-ops' on host 'localhost.localdomain', but the service could not be found!
-
ERROR: Failed to obtain Message Broker URI from ...
Need more debug info on what has happened on WN!
Check if provided MBs are working.
WARNING: Provided MB isn't accessible stomp://gridmsg002.cern.ch:6163/
Trying to obtain it from IS.
ERROR: Failed to obtain Message Broker URI from egee-bdii.cnaf.infn.it:2170.
Could not connect to BDII at egee-bdii.cnaf.infn.it:2170 at /gpor_proj/spagogrid/egee/home/crescoops004/gram_scratch_71nVDKzHgW/https_3a_2f_2fwmssamtest02.cern.ch_3a9000_2f9Ll_5fjnEXmasADcsUWm2rxQ/.nagios/bin/find_broker line 81, <DATA> line 225.
-
Message Broker URI:
empty; site problem - firewalling outgoing TCP:6163. Request to add TCP:{6162,6163} to M/W ports table was submitted to John White.
Check if provided MB is accessible [stomp://gridmsg002.cern.ch:6163/].
INFO : Testing Broker: stomp://gridmsg002.cern.ch:6163/
INFO : Couldn't connect to : stomp://gridmsg002.cern.ch:6163/
WARNING: Provided MB isn't accessible [stomp://gridmsg002.cern.ch:6163/].
Trying to obtain it from IS.
All found brokers [BDII topbdii01.ncg.ingrid.pt:2170]:
INFO : Getting info from BDII topbdii01.ncg.ingrid.pt:2170
INFO : Testing Broker: stomp://gridmsg101.cern.ch:6163/
INFO : Testing Broker: stomp://msg.cro-ngi.hr:6163/
ERROR: No Brokers found in 'PROD' network [BDII topbdii01.ncg.ingrid.pt:2170].
Exiting.
WMS
* - critical
- Probe org.sam.WMS-probe tests job submission to WMS.
metricName | metricDescription | metricType | metricLocality
org.sam.WMS-JobState | Submits grid job to CE | status | remote
org.sam.WMS-JobMonit | Monitors grid jobs submitted to CEs | status | remote
org.sam.WMS-JobSubmit | [Passive] (in Nagios sense) Holds final status of job submission | status | remote
Running org.sam.* probes/metrics
Integration of WN checks
Default behavior of org.sam.CE-JobState metric
The WN tarball is assembled at runtime by the
org.sam/CE-probe -m org.sam.CE-JobState ...
metric.
By default, only org.sam probes/checks (org.sam/WN-probe and org.sam/wnjob/org.sam/probes/org.sam/*) are packed into the WN tarball. Because Nagios is used as the checks launcher on WNs, it is also packed into the tarball, together with the respective Nagios configurations (located in org.sam/wnjob/org.sam/etc/wn.d/org.sam/) of the above org.sam WN probes/checks. Example of submission of CE+WN tests:
/usr/libexec/grid-monitoring/probes/org.sam/CE-probe -H ce000.cern.ch \
-m org.sam.CE-JobState --mb-destination /a/b/c
Integration of your WN checks
The following CLI parameters to the org.sam.CE-JobState metric are available:
--add-wntar-nag <d1,d2,..> Comma-separated list of top level directories with
Nagios compliant directories structure to be added
to tarball to be sent to WN.
--add-wntar-nag-nosam Instructs the metric not to include standard SAM WN
probes and their Nagios config to WN tarball.
(Default: WN probes are included)
--add-wntar-nag-nosamcfg Instructs the metric not to include Nagios
configuration for SAM WN probes to WN tarball. The
probes themselves and respective Python packages,
however, will be included.
Example:
./CE-probe -m org.sam.CE-JobState \
--add-wntar-nag /path/to/your/probes/wnjob/org.my/,/path/to/org.sam.sec
This adds two sets of WN checks, provided by org.my and org.sam.sec, to the WN tarball. NB! The org.sam WN checks and their respective Nagios configurations will still be added and launched on the WN as well.
Use --add-wntar-nag-nosam if you do not want org.sam probes to be run on the WN and do not want to reuse the org.sam probes with your own custom configurations. This instructs the org.sam.CE-JobState metric not to include the org.sam WN probes and their Nagios configuration in the WN tarball.
Example (only relevant CLI parameters are shown):
./CE-probe -m org.sam.CE-JobState --add-wntar-nag-nosam \
--add-wntar-nag /path/to/your/probes/wnjob/org.my/
The WN tarball will contain only your probes and Nagios configurations from /path/to/your/probes/wnjob/org.my/.
Use --add-wntar-nag-nosamcfg if you don't want org.sam probes to be run on the WN, but still want the probes to be included in the WN tarball. This instructs the org.sam.CE-JobState metric to include the org.sam WN probes and the Python gridmon and gridmonsam packages, with their respective modules, in the WN tarball. The Nagios configuration of the probes will not be included in the WN tarball. This is useful in two cases:
- you want to use some of the probes/wrappers provided by org.sam (e.g. $USER3$/org.sam/{nag,sam}test-run). In this case you have to supply your custom Nagios object configurations yourself (they can be taken directly from /usr/libexec/grid-monitoring/probes/org.sam/wnjob/org.sam/etc/wn.d/org.sam/*.cfg and included in your *.cfg files)
- you developed your Python WN probes using the gridmon or gridmonsam packages. These packages are added to $PYTHONPATH before Nagios is launched on the WN, so your probes can safely import the required modules from them.
Example (only relevant CLI parameters are shown):
./CE-probe -m org.sam.CE-JobState --add-wntar-nag-nosamcfg \
--add-wntar-nag /path/to/your/probes/wnjob/org.my/
The WN tarball will contain your probes and Nagios configurations from /path/to/your/probes/wnjob/org.my/, as well as the org.sam probes (but without their standard org.sam Nagios configurations) and the gridmon and gridmonsam Python packages.
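The effect of the three --add-wntar-nag* options on the tarball can be summarised in a few lines of code. This is only an illustration of the rules described above, not the metric's implementation: the function name and the component labels are hypothetical, and it assumes that Nagios itself is always packed and that --add-wntar-nag-nosam takes precedence if both "no" options are given.

```python
def wn_tarball_contents(extra_dirs=(), nosam=False, nosamcfg=False):
    """Hypothetical summary of what ends up in the WN tarball.

    extra_dirs -- directories passed via --add-wntar-nag
    nosam      -- True if --add-wntar-nag-nosam is given
    nosamcfg   -- True if --add-wntar-nag-nosamcfg is given
    """
    # Assumption: Nagios is the checks launcher on the WN, so it is always packed.
    contents = ["Nagios"]
    if not nosam:
        contents.append("org.sam WN probes")
        contents.append("gridmon and gridmonsam Python packages")
        if not nosamcfg:
            contents.append("org.sam Nagios configuration")
    # Custom checks are added regardless of the org.sam switches.
    contents.extend("custom checks: %s" % d for d in extra_dirs)
    return contents


if __name__ == "__main__":
    dirs = ["/path/to/your/probes/wnjob/org.my/"]
    print(wn_tarball_contents(dirs))
    print(wn_tarball_contents(dirs, nosam=True))
    print(wn_tarball_contents(dirs, nosamcfg=True))
```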
Providing JDL
The JDL used by the framework is the following:
[
Type="Job";
JobType="Normal";
Executable = "<jdlExecutable>";
StdError = "gridjob.out";
StdOutput = "gridjob.out";
Arguments = "<jdlArguments>";
InputSandbox = {"<jdlInputSandboxExecutable>", "<jdlInputSandboxTarball>"};
OutputSandbox = {"gridjob.out"};
RetryCount = <jdlRetryCount>;
ShallowRetryCount = <jdlShallowRetryCount>;
Requirements = other.GlueCEInfoHostName == "<jdlReqCEInfoHostName>";
]
The template is located in /usr/libexec/grid-monitoring/probes/org.sam/wnjob/org.sam.gridJob.jdl.template.
Substitutable elements of the template are:
- <jdlExecutable> - name of the executable on the WN (default, set by the framework: nagrun.sh)
- <jdlArguments> - list of arguments to <jdlExecutable> (set by the framework)
- <jdlInputSandboxExecutable> - path to the executable on the UI (set by the framework to /metric/work/dir/<jdlExecutable>)
- <jdlInputSandboxTarball> - path to the WN tarball (set by the framework to /metric/work/dir/gridjob.tgz)
- <jdlRetryCount> - default 0 (can be modified via the CLI parameter --jdl-retrycount, see below)
- <jdlShallowRetryCount> - default 1 (can be modified via the CLI parameter --jdl-shallowretrycount, see below)
- <jdlReqCEInfoHostName> - CE host name (set by the framework)
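To make the substitution step concrete, here is a minimal sketch that fills the template quoted above by plain string replacement. It is not the framework's code: render_jdl and EXAMPLE_VALUES are hypothetical names, the empty argument string is a stand-in, and hostname.to.test is only an example CE.

```python
# JDL template as quoted above (placeholders in angle brackets).
JDL_TEMPLATE = """[
Type="Job";
JobType="Normal";
Executable = "<jdlExecutable>";
StdError = "gridjob.out";
StdOutput = "gridjob.out";
Arguments = "<jdlArguments>";
InputSandbox = {"<jdlInputSandboxExecutable>", "<jdlInputSandboxTarball>"};
OutputSandbox = {"gridjob.out"};
RetryCount = <jdlRetryCount>;
ShallowRetryCount = <jdlShallowRetryCount>;
Requirements = other.GlueCEInfoHostName == "<jdlReqCEInfoHostName>";
]"""


def render_jdl(template, values):
    """Replace every <name> placeholder with str(values[name])."""
    rendered = template
    for name, value in values.items():
        rendered = rendered.replace("<%s>" % name, str(value))
    return rendered


# Illustrative values only; the real framework sets most of them internally.
EXAMPLE_VALUES = {
    "jdlExecutable": "nagrun.sh",
    "jdlArguments": "",
    "jdlInputSandboxExecutable": "/metric/work/dir/nagrun.sh",
    "jdlInputSandboxTarball": "/metric/work/dir/gridjob.tgz",
    "jdlRetryCount": 0,
    "jdlShallowRetryCount": 1,
    "jdlReqCEInfoHostName": "hostname.to.test",
}

if __name__ == "__main__":
    print(render_jdl(JDL_TEMPLATE, EXAMPLE_VALUES))
```

A custom template can use the same placeholders; any placeholder it omits is simply not substituted in this sketch.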
"Third-party" JDL
You can provide your own JDL template with the following parameter to the org.sam.CE-JobState metric:
--jdl-templ <file> JDL template file (full path). Default:
<org.sam.ProbesLocation>/wnjob/org.sam.gridJob.jdl.template
For better flexibility, the RetryCount and ShallowRetryCount ClassAd attributes are exposed as parameters to the org.sam.CE-JobState metric:
--jdl-retrycount <val> JDL RetryCount (Default: 0).
--jdl-shallowretrycount <val> JDL ShallowRetryCount (Default: 1).
Defining context for metric execution
When the same resources (endpoints) are tested from the same Nagios instance under the same VO, but with different probes/metrics, and the metrics require working directories, it is not enough to use
/<probesWorkDir>/<VO>/<nameSpace>/<serviceAbbr>/<hostnameORendpoint>/
as the working directory.
This is because <nameSpace> is defined by the provider of the probe, so different sets of checks run through the same probe end up in the same directory. This situation usually arises in job submission tests.
For this reason, the following parameter was added to the CE and CREAM CE probes:
Metrics specific parameters:
--namespace <string> Name-space for the probe. (Default: org.sam)
If given, the metrics' working directory of the probe will be
/<probesWorkDir>/<VO>/<string>/<serviceAbbr>/<hostnameORendpoint>/
The following example, a security checks use case (org.sam.sec), clarifies the situation.
Preconditions:
- a set of WN security checks must be submitted to WNs
- the checks are to be submitted and monitored with the org.sam/{CREAM}CE-probe -m org.sam.CE-Job{State,Monit} metrics
- submission will be carried out under the same VO as for normal org.sam CE test jobs (say, the ops VO)
- either of the following holds:
  - the checks must not be submitted with the normal org.sam CE test jobs
  - the required scheduling interval for the security checks is different from that of the normal org.sam CE ones
This effectively means that (unless defined otherwise) submissions of the separate CE tests from org.sam and org.sam.sec will use the same job working directories (e.g., /var/run/gridprobes/ops/org.sam/CE/hostname.to.test/), which is not desirable (in fact, it will not work properly).
- org.sam WN checks, ops VO, to CE hostname.to.test
./CE-probe -H hostname.to.test -m org.sam.CE-JobState \
--vo ops ...
- org.sam.sec WN checks (must be taken from /path/to/probes/org.sam.sec); org.sam checks must not be executed on the WN; ops VO, to CE hostname.to.test
./CE-probe -H hostname.to.test -m org.sam.CE-JobState \
--vo ops --add-wntar-nag-nosam \
--add-wntar-nag /path/to/probes/org.sam.sec ...
Instead, by using --namespace org.sam.sec one can separate the metrics' working contexts:
./CE-probe -H hostname.to.test -m org.sam.CE-JobState \
--namespace org.sam.sec --vo ops --add-wntar-nag-nosam \
--add-wntar-nag /path/to/probes/org.sam.sec ...
This instructs CE-probe to create /var/run/gridprobes/ops/org.sam.sec/CE/hostname.to.test/ and perform job preparation/submission from that directory.
A corresponding org.sam.CE-JobMonit check should be configured on the Nagios instance to monitor jobs in the org.sam.sec context:
./CE-probe -m org.sam.CE-JobMonit --namespace org.sam.sec ...
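The way --namespace changes the working directory can be sketched as a small path-building function. This is an illustration of the directory layout described above, not code from the probes; metric_work_dir is a hypothetical name, and the default root /var/run/gridprobes is taken from the documented default of --work-dir.

```python
import posixpath


def metric_work_dir(vo, service_abbr, host, namespace="org.sam",
                    work_root="/var/run/gridprobes"):
    """Build /<probesWorkDir>/<VO>/<nameSpace>/<serviceAbbr>/<hostnameORendpoint>/."""
    return posixpath.join(work_root, vo, namespace, service_abbr, host) + "/"


if __name__ == "__main__":
    # Default namespace: both sets of checks would share this directory.
    print(metric_work_dir("ops", "CE", "hostname.to.test"))
    # With --namespace org.sam.sec the security checks get their own directory.
    print(metric_work_dir("ops", "CE", "hostname.to.test", namespace="org.sam.sec"))
```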
Wrapper checks (for "complex" checks)
To preserve the order and the context in the execution of a sequence of metrics (a so-called "complex" check), it is sometimes desirable to wrap the metrics' execution in a (Nagios) active check (e.g., the org.sam.SRM-All and org.sam.WN-Rep metrics) and report the wrapped metrics' results (to Nagios) as passive checks.
NB! See the sub-section "Running wrapper checks with nagios-run-check" for running wrapper checks with nagios-run-check on a Nagios instance.
Reporting passive check results from wrapper checks
In most cases wrapper checks will be used non-interactively and report the wrapped metrics' results to Nagios.
A wrapper check (e.g., org.sam.SRM-All for org.sam.SRM-{GetSURLs,Put,...}) can report metrics' results to Nagios as passive checks through the Nagios command file or NSCA. Snippet from running a probe with the "-h" option:
Reporting passive checks (when used with wrapper checks)
--pass-check-dest <config|nsca|nagcmd|active> (Default: config)
--pass-check-conf <path> Configuration file for reporting passive checks.
Used with '--pass-check-dest config'. Overrides
passive checks submission library default one.
--nsca-server <fqdn|ip> NSCA server FQDN or IP. Required if --pass-check-dest
is set to 'nsca'.
--nsca-port <port> Port NSCA is listening on (Default: 5667)
--send-nsca <path> NSCA client binary. (Default: /usr/sbin/send_nsca)
--send-nsca-conf <path> NSCA configuration file. (Default: /etc/nagios/send_nsca.cfg)
--nagcmdfile <path> Nagios command file.
Order: $NAGIOS_COMMANDFILE, --nagcmdfile
(Default: /var/nagios/rw/nagios.cmd)
To report the results of internally run (wrapped) metrics to Nagios as passive check results, three values (<config|nsca|nagcmd>) of the --pass-check-dest parameter are available.
- Nagios command file (if the probe is running on the Nagios box)
--pass-check-dest nagcmd [--nagcmdfile /path/to/named.fifo]
- NSCA send_nsca binary (if the probe is running on a UI box)
--pass-check-dest nsca --nsca-server <fqdn|ip> [--nsca-port port] [--send-nsca /path/to/send_nsca] [--send-nsca-conf /path/to/send_nsca.conf]
- config: the NSCA or Nagios command file method is taken from a configuration file
--pass-check-dest config [--pass-check-conf <path>]
By default, probes use the config option. This means that the module responsible for passive checks submission (gridmon/nagios/nagios.py from python-GridMon) uses the node-global configuration defined in a configuration file (by default, /etc/nagios-submit.conf). If nsca or nagcmd is used explicitly, it is up to the user to supply correct options for the submission method (i.e., the configuration file is not used).
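For orientation, the two transport formats involved look roughly as follows. The sketch below only formats the records; it does not write to the command file or call send_nsca, and it is not the code of gridmon/nagios/nagios.py. The function names are hypothetical. The command-file line uses the standard Nagios external command PROCESS_SERVICE_CHECK_RESULT, and the NSCA record uses the tab-delimited input format that send_nsca reads by default.

```python
import time


def nagcmd_line(host, service, return_code, output, now=None):
    """Format a passive service check result for the Nagios command file."""
    timestamp = int(time.time() if now is None else now)
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, return_code, output)


def nsca_line(host, service, return_code, output):
    """Format one record for send_nsca (default delimiter is a tab)."""
    return "%s\t%s\t%d\t%s\n" % (host, service, return_code, output)


if __name__ == "__main__":
    # Host and service names are examples only.
    print(nagcmd_line("lxdpm104.cern.ch", "org.sam.SRM-Put", 0,
                      "OK: File was copied to SRM."), end="")
    print(nsca_line("lxdpm104.cern.ch", "org.sam.SRM-Put", 0,
                    "OK: File was copied to SRM."), end="")
```

With nagcmd the formatted line would be written to the named pipe given by --nagcmdfile; with nsca the record would be piped to the send_nsca binary together with the server and port options.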
Reporting "active" check results from wrapper checks
To run a wrapper check from the command line and see the "wrapped" metrics' results printed to stdout, set --pass-check-dest to active:
[kvs] ~ > /usr/libexec/grid-monitoring/probes/org.sam/SRM-probe -m org.sam.SRM-All \
-u lxdpm104.cern.ch --pass-check-dest active
metric >>> org.sam.SRM-GetSURLs
OK: Got SRM endpoint(s) and SAPath(s) from BDII
OK: Got SRM endpoint(s) and SAPath(s) from BDII
metric >>> org.sam.SRM-LsDir
OK: SAPath[/dpm/cern.ch/home/ops]-ok;
OK: SAPath[/dpm/cern.ch/home/ops]-ok;
metric >>> org.sam.SRM-Put
OK: File was copied to SRM.
OK: File was copied to SRM.
metric >>> org.sam.SRM-Ls
OK: listing [/dpm/cern.ch/home/ops/testfile-put-1231319739-3cbfe0c470ba.txt]-ok;
OK: listing [/dpm/cern.ch/home/ops/testfile-put-1231319739-3cbfe0c470ba.txt]-ok;
metric >>> org.sam.SRM-GetTURLs
OK: TURLs gsiftp, rfio
OK: TURLs gsiftp, rfio
metric >>> org.sam.SRM-Get
OK: File was copied from SRM. Diff successful.
OK: File was copied from SRM. Diff successful.
metric >>> org.sam.SRM-Del
OK: file was deleted from SRM.
OK: file was deleted from SRM.
OK: success.
OK: success.
[kvs] ~ > echo $?
0
[kvs] ~ >
This way it is possible to test the behavior of the probe/metrics without needing to have a writable named pipe (Nagios command file) or working NSCA.
Running wrapper checks with nagios-run-check
By default, wrapper checks report sub-checks' results to the Nagios command file as passive checks. When running them for testing purposes on a Nagios instance, it is therefore advisable to disable the publication of the passive checks to Nagios. Proceed as follows. First, obtain the full command line (note the -d and -v options to nagios-run-check):
~> nagios-run-check -d -v -H axon-g05.ieeta.pt -s org.sam.SRM-All-/ops/Role=lcgadmin
Executing command:
su nagios -l -c '/usr/libexec/grid-monitoring/probes/org.sam/SRM-probe -H
"axon-g05.ieeta.pt" -t 600 --vo ops --vo-fqan /ops/Role=lcgadmin
-x /etc/nagios/globus/userproxy.pem--ops-Role_lcgadmin'
Copy the command and add --pass-check-dest active. Then run it as:
~> su nagios -l -c '/usr/libexec/grid-monitoring/probes/org.sam/SRM-probe -H
"axon-g05.ieeta.pt" -t 600 --vo ops --vo-fqan /ops/Role=lcgadmin
-x /etc/nagios/globus/userproxy.pem--ops-Role_lcgadmin
--pass-check-dest active'
You can also add the --stdout option:
--stdout Detailed output of metrics will be printed to stdout as
it is being produced by metrics. The default is to store
the output in a container and, then, produce Nagios
compliant output.
gstat-validation, grid-monitoring-probes-org.bdii RPMs
sBDII
sBDII Results
Legend:
- "-" -- no values
- "/" -- no mapping
- [1] SAM sBDII-performance vs [2] SAM-Nag org.bdii.Entries
STATUS | [1] | [2] | # | [1] | [2] | # | [1] | [2] | # | [1] | [2]
- | 2009-05-15; 14:00 | # | 2009-05-19; 10:00 | # | 2009-06-03; 08:20 | # | 2009-06-09; 15:54
0/na | 6 | 11 | # | 6 | 14 | # | - | 4 | # | 2 | 5
10/ok | 250 | 270 | # | 262 | 267 | # | 271 | 284 | # | 270 | 281
20/info | - | / | # | - | / | # | - | / | # | - | /
30/note | - | / | # | - | / | # | - | / | # | - | /
40/warn | - | - | # | - | - | # | - | - | # | - | -
50/error | 1 | - | # | 1 | - | # | 1 | - | # | 1 | -
60/crit | - | / | # | - | / | # | - | / | # | - | /
100/maint | 21 | / | # | 12 | / | # | 12 | / | # | 8 | /
nodes | 278 | 281 | # | 281 | 281 | # | 283 | 288 | # | 281 | 286
- [1] SAM sBDII-sanity vs [2] SAM-Nag org.gstat.SanityCheck
STATUS | [1] | [2] | # | [1] | [2] | # | [1] | [2] | # | [1] | [2]
- | 2009-05-15; 14:00 | # | 2009-05-19; 10:00 | # | 2009-06-03; 08:20 | # | 2009-06-09; 15:54
0/na | - | - | # | - | - | # | - | - | # | - | -
10/ok | 245 | 209 | # | 255 | 251 | # | 262 | 248 | # | 263 | 252
20/info | - | / | # | - | / | # | - | / | # | - | /
30/note | 2 | / | # | 3 | / | # | 1 | / | # | 1 | /
40/warn | 3 | 20 | # | 4 | 19 | # | 9 | 26 | # | 7 | 22
50/error | 6 | 52 | # | 7 | 11 | # | - | 14 | # | 2 | 12
60/crit | - | / | # | - | / | # | - | / | # | - | /
100/maint | 21 | / | # | 12 | / | # | 11 | / | # | 8 | /
nodes | 277 | 281 | # | 281 | 281 | # | 283 | 288 | # | 281 | 286
Probes/metrics by ch.cern and hr.srce
As of Thu Nov 6 15:07:47 CET 2008, the following probes/metrics are provided by grid-monitoring-probes-ch.cern and grid-monitoring-probes-hr.srce.
grid-monitoring-probes-ch.cern
serviceType | probeName | metricName | metricDescription | metricType | metricLocality
glite-FTS-WS | ch.cern/FTS-probe | -
| ch.cern.FTS-ChannelList | (ks) list channels on FTS | status | remote
glite-LFC | ch.cern/LFC-probe
| ch.cern.LFC-ReadDli | Do a read from a DLI | status | remote
| ch.cern.LFC-Write | Test if we can update the modification time of an entry in the catalog | status | remote
| ch.cern.LFC-Read | Test if we can read an entry in the catalog | status | remote
| ch.cern.LFC-Readdir | Time how long it takes to read a directory (/grid) | performance | remote
glite-RGMA | ch.cern/RGMA-probe
| ch.cern.RGMA-ServiceStatus | ... | status | remote
| ch.cern.RGMA-CertLifetime | ... | status | remote
/usr/libexec/grid-monitoring/probes/ch.cern/FTS-probe
serviceType: glite-FTS-WS
metricName: ch.cern.FTS-ChannelList
metricType: status
EOT
/usr/libexec/grid-monitoring/probes/ch.cern/LFC-probe
serviceType: glite-LFC
metricDescription: Do a read from a DLI
metricLocality: remote
metricType: status
metricName: ch.cern.LFC-ReadDli
EOT
serviceType: glite-LFC
metricDescription: Test if we can update the modification time of an entry in the catalog
metricLocality: remote
metricType: status
metricName: ch.cern.LFC-Write
EOT
serviceType: glite-LFC
metricDescription: Test if we can read an entry in the catalog
metricLocality: remote
metricType: status
metricName: ch.cern.LFC-Read
EOT
serviceType: glite-LFC
dataType: float
metricDescription: Time how long it takes to read a directory (/grid)
metricType: performance
metricLocality: remote
metricName: ch.cern.LFC-Readdir
EOT
/usr/libexec/grid-monitoring/probes/ch.cern/RGMA-probe
serviceType: glite-RGMA
metricName: ch.cern.RGMA-ServiceStatus
metricType: status
EOT
serviceType: glite-RGMA
metricName: ch.cern.RGMA-CertLifetime
metricType: status
EOT
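The listing above has a simple shape: a probe path, followed by one or more blocks of "key: value" lines, each terminated by EOT. A minimal sketch of a parser for that shape is shown below; parse_metric_listing is a hypothetical helper written for this page, and the sample text is copied from the listing.

```python
def parse_metric_listing(text):
    """Parse 'key: value' blocks terminated by EOT into a list of dicts.

    Lines starting with '/' are treated as the path of the probe that
    the following blocks belong to.
    """
    records, current, probe = [], {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "EOT":
            if current:
                current["probe"] = probe
                records.append(current)
            current = {}
        elif line.startswith("/"):
            probe = line
        elif ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return records


SAMPLE = """/usr/libexec/grid-monitoring/probes/ch.cern/FTS-probe
serviceType: glite-FTS-WS
metricName: ch.cern.FTS-ChannelList
metricType: status
EOT
/usr/libexec/grid-monitoring/probes/ch.cern/LFC-probe
serviceType: glite-LFC
metricDescription: Do a read from a DLI
metricLocality: remote
metricType: status
metricName: ch.cern.LFC-ReadDli
EOT
"""

if __name__ == "__main__":
    for record in parse_metric_listing(SAMPLE):
        print(record["probe"], record["metricName"], record["metricType"])
```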
grid-monitoring-probes-hr.srce
serviceType | probeName | metricName | metricDescription | metricType | metricLocality
CAdistribution | hr.srce/CAdist-probe | -
| hr.srce.CAdist-Version | ... | status | remote (?)
DPM | hr.srce/DPM-probe
| hr.srce.DPM-Query | ... | status | remote
DPNS | hr.srce/DPNS-probe
| hr.srce.DPNS-List | ... | status | remote (?)
globus-GRAM | hr.srce/GRAM-probe
| hr.srce.GRAM-CertLifetime | ... | status | remote
| hr.srce.GRAM-Auth | ... | status | remote
| hr.srce.GRAM-Command | ... | status | remote
gsiftp | hr.srce/GridFTP-probe
| hr.srce.GridFTP-Transfer | ... | status | remote
! GridProxy | hr.srce/GridProxy-probe
| ! hr.srce.GridProxy-Valid | ... | status | local
MyProxy | hr.srce/MyProxy-probe
| hr.srce.MyProxy-CertLifetime | ... | status | remote
| hr.srce.MyProxy-ProxyLifetime | ... | status | remote
| hr.srce.MyProxy-Store | ... | status | remote
ResourceBroker | hr.srce/ResourceBroker-probe
| hr.srce.ResourceBroker-CertLifetime | ... | status | remote
| hr.srce.ResourceBroker-RunJob | ... | status | remote
SRM | hr.srce/SRM-probe
| hr.srce.SRM1-CertLifetime | ... | status | remote
| hr.srce.SRM1-Ping | ... | status | remote
| hr.srce.SRM2-CertLifetime | ... | status | remote
| hr.srce.SRM-Transfer | ... | status | remote
org.glite.wms.WMProxy | hr.srce/WMProxy-probe
| hr.srce.WMProxy-CertLifetime | ... | status | remote
| hr.srce.WMProxy-RunJob | ... | status | remote
org.glite.wms.NetworkServer | hr.srce/WMS-probe
| hr.srce.WMS-CertLifetime | ... | status | remote
| hr.srce.WMS-RunJob | ... | status | remote
/usr/libexec/grid-monitoring/probes/hr.srce/CAdist-probe
serviceType: CAdistribution
metricName: hr.srce.CAdist-Version
metricType: status
EOT
/usr/libexec/grid-monitoring/probes/hr.srce/DPM-probe
serviceType: DPM
metricName: hr.srce.DPM-Query
metricType: status
EOT
/usr/libexec/grid-monitoring/probes/hr.srce/DPNS-probe
serviceType: DPNS
metricName: hr.srce.DPNS-List
metricType: status
EOT
/usr/libexec/grid-monitoring/probes/hr.srce/GRAM-probe
serviceType: globus-GRAM
metricName: hr.srce.GRAM-CertLifetime
metricType: status
EOT
serviceType: globus-GRAM
metricName: hr.srce.GRAM-Auth
metricType: status
EOT
serviceType: globus-GRAM
metricName: hr.srce.GRAM-Command
metricType: status
EOT
/usr/libexec/grid-monitoring/probes/hr.srce/GridFTP-probe
serviceType: gsiftp
metricName: hr.srce.GridFTP-Transfer
metricType: status
EOT
/usr/libexec/grid-monitoring/probes/hr.srce/GridProxy-probe
serviceType: GridProxy
metricName: hr.srce.GridProxy-Valid
metricType: status
EOT
/usr/libexec/grid-monitoring/probes/hr.srce/MyProxy-probe
serviceType: MyProxy
metricName: hr.srce.MyProxy-CertLifetime
metricType: status
EOT
serviceType: MyProxy
metricName: hr.srce.MyProxy-ProxyLifetime
metricType: status
EOT
serviceType: MyProxy
metricName: hr.srce.MyProxy-Store
metricType: status
EOT
/usr/libexec/grid-monitoring/probes/hr.srce/ResourceBroker-probe
serviceType: ResourceBroker
metricName: hr.srce.ResourceBroker-CertLifetime
metricType: status
EOT
serviceType: ResourceBroker
metricName: hr.srce.ResourceBroker-RunJob
metricType: status
EOT
/usr/libexec/grid-monitoring/probes/hr.srce/SRM-probe
serviceType: SRM
metricName: hr.srce.SRM1-CertLifetime
metricType: status
EOT
serviceType: SRM
metricName: hr.srce.SRM1-Ping
metricType: status
EOT
serviceType: SRM
metricName: hr.srce.SRM2-CertLifetime
metricType: status
EOT
serviceType: SRM
metricName: hr.srce.SRM-Transfer
metricType: status
EOT
/usr/libexec/grid-monitoring/probes/hr.srce/WMProxy-probe
serviceType: org.glite.wms.WMProxy
metricName: hr.srce.WMProxy-CertLifetime
metricType: status
EOT
serviceType: org.glite.wms.WMProxy
metricName: hr.srce.WMProxy-RunJob
metricType: status
EOT
/usr/libexec/grid-monitoring/probes/hr.srce/WMS-probe
serviceType: org.glite.wms.NetworkServer
metricName: hr.srce.WMS-CertLifetime
metricType: status
EOT
serviceType: org.glite.wms.NetworkServer
metricName: hr.srce.WMS-RunJob
metricType: status
EOT
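The listing above has a regular shape: a probe path, followed by one or more "key: value" records, each closed by an EOT line. A minimal sketch of a parser for it follows, assuming exactly that layout; the function name parse_probe_listing is hypothetical and not part of any probe package.

```python
def parse_probe_listing(text):
    """Group the 'key: value' records of a probe listing by probe path.

    Assumed layout (taken from the listing above): a line starting
    with '/' names a probe, 'EOT' closes one record.
    """
    probes = {}
    current_probe = None
    record = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("/"):
            # A new probe path starts a new group of records.
            current_probe = line
            probes.setdefault(current_probe, [])
            record = {}
        elif line == "EOT":
            # End of one metric description.
            if current_probe is not None and record:
                probes[current_probe].append(record)
            record = {}
        elif ":" in line:
            key, value = line.split(":", 1)
            record[key.strip()] = value.strip()
    return probes


if __name__ == "__main__":
    sample = """/usr/libexec/grid-monitoring/probes/hr.srce/WMS-probe
serviceType: org.glite.wms.NetworkServer
metricName: hr.srce.WMS-CertLifetime
metricType: status
EOT
serviceType: org.glite.wms.NetworkServer
metricName: hr.srce.WMS-RunJob
metricType: status
EOT
"""
    for probe, metrics in parse_probe_listing(sample).items():
        print(probe, [m["metricName"] for m in metrics])
```

Keeping each record as a plain dictionary means that extra keys in a listing are preserved without changes to the parser.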
Metrics naming:
Probe | Metrics |
<nameSpace>.<serviceAbbreviation>-probe | <nameSpace>.<serviceAbbreviation>-<metricName> |
org.sam.SRM-probe | org.sam.SRM-LsDir |
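The naming scheme can be illustrated with a small sketch that splits a full metric name into its parts and derives the probe name from it. The helper names split_metric_name and probe_name are hypothetical, and the sketch assumes that neither the name space nor the service abbreviation contains a "-".

```python
def split_metric_name(full_name):
    """Split '<nameSpace>.<serviceAbbreviation>-<metricName>' into its parts.

    Assumption: the first '-' separates the metric name from the rest,
    and the last '.' before it separates name space and service.
    """
    prefix, dash, metric = full_name.partition("-")
    if not dash or not metric:
        raise ValueError("no '-<metricName>' part in %r" % full_name)
    namespace, dot, service = prefix.rpartition(".")
    if not dot or not namespace or not service:
        raise ValueError("no '<nameSpace>.<serviceAbbreviation>' part in %r" % full_name)
    return namespace, service, metric


def probe_name(full_name):
    """Derive the probe name that corresponds to a metric name."""
    namespace, service, _metric = split_metric_name(full_name)
    return "%s.%s-probe" % (namespace, service)


if __name__ == "__main__":
    print(split_metric_name("org.sam.SRM-LsDir"))
    print(probe_name("org.sam.SRM-LsDir"))
```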
Timeouts:
- metric global timeout:
metricTimeout
- timeouts on operations in a metric (set):
metricOperationTimeouts = {metricOperationTimeout_1, ..., metricOperationTimeout_N}
condition:
metricTimeout >= SUM_{i=1}^N (metricOperationTimeout_i) - N
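The condition can be checked mechanically. The sketch below takes the inequality literally as written above, including the "- N" term; the function name timeouts_consistent is hypothetical, and the timeout values in the usage example are made up for illustration.

```python
def timeouts_consistent(metric_timeout, operation_timeouts):
    """Check metricTimeout >= SUM(metricOperationTimeout_i) - N.

    The '- N' term is copied from the condition in the text as is;
    N is the number of per-operation timeouts.
    """
    n = len(operation_timeouts)
    return metric_timeout >= sum(operation_timeouts) - n


if __name__ == "__main__":
    # Illustrative values only (seconds).
    print(timeouts_consistent(600, [120, 240, 243]))
    print(timeouts_consistent(599, [120, 240, 243]))
```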
Command line options:
-h|--help
-l|--list
-m|--metric metricName
-u|--uri serviceURI
-t|--timeout timeout (sec)
-n|--node hostname (FQDN)
-v|--vo VO
--wlcg [WLCG output instead of default Nagios]
- Metric-specific options (NB: all should be long options): see each probe/metric for details
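As an illustration, the common options listed above could be declared with Python's standard argparse module as follows. This is only a sketch and not the probes' actual option handling; the help texts, in particular the meaning given to --list, are assumptions.

```python
import argparse


def build_parser():
    """Declare the common probe options; -h/--help is added by argparse."""
    parser = argparse.ArgumentParser(description="Common probe options (sketch)")
    parser.add_argument("-l", "--list", action="store_true",
                        help="list metrics (assumed meaning)")
    parser.add_argument("-m", "--metric", metavar="metricName",
                        help="metric to run")
    parser.add_argument("-u", "--uri", metavar="serviceURI",
                        help="URI of the service to test")
    parser.add_argument("-t", "--timeout", type=int, metavar="SEC",
                        help="timeout in seconds")
    parser.add_argument("-n", "--node", metavar="FQDN",
                        help="hostname (FQDN)")
    parser.add_argument("-v", "--vo", metavar="VO",
                        help="virtual organisation")
    parser.add_argument("--wlcg", action="store_true",
                        help="WLCG output instead of default Nagios")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Metric-specific options would be added to the same parser as long options only, in line with the note above.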
Getting data for comparison from SAM DBs
The attached Python script fetches the latest test data from two SAM DBs (SAM Prod and SAM-Nagios) and prints them as one table per test. To run it you need the cx_Oracle Python module (available, e.g., on sam-val.cern.ch) and the account names and passwords for both DBs.

The tests, DBs, query, etc. are set in the script itself; modify them in place. The script is attached to this page as sam-samnag-cmp_stat.py. Example of running the script:
[kvs] src > ./sam-samnag-cmp_stat.py
>>> fetching SAM: CE-sft-job ... done
>>> fetching SAMNag: CE-org.sam.CE-JobSubmit ... done
>>> fetching SAM: CE-sft-lcg-rm ... done
>>> fetching SAMNag: CE-org.sam.WN-Rep ... done
>>> fetching SAM: SRMv2-get-SURLs ... done
>>> fetching SAMNag: SRMv2-org.sam.SRM-GetSURLs ... done
>>> fetching SAM: SRMv2-put ... done
>>> fetching SAMNag: SRMv2-org.sam.SRM-Put ... done
>>> fetching SAM: sBDII-sanity ... done
>>> fetching SAMNag: sBDII-org.gstat.SanityCheck ... done
Wed, 03 Jun 2009 06:19:57 +0000
===> {'SAM': 'CE-sft-job', 'SAMNag': 'CE-org.sam.CE-JobSubmit'}
na | - | - |
ok |367|351|
info | - | - |
note | - | - |
warn | - | 39|
error| 28| - |
crit | - | - |
maint| - | - |
|395|390|
===> {'SAM': 'CE-sft-lcg-rm', 'SAMNag': 'CE-org.sam.WN-Rep'}
na | - | 35|
ok |352|173|
info | - | - |
note | - | - |
warn | 1| 4|
error| 32| 59|
crit | - | - |
maint| - | - |
|385|271|
===> {'SAM': 'SRMv2-get-SURLs', 'SAMNag': 'SRMv2-org.sam.SRM-GetSURLs'}
na | - | - |
ok |323|323|
info | - | - |
note | - | - |
warn | - | - |
error| 14| 5|
crit | - | - |
maint| - | - |
|337|328|
===> {'SAM': 'SRMv2-put', 'SAMNag': 'SRMv2-org.sam.SRM-Put'}
na | - | 4|
ok |308|291|
info | - | - |
note | - | - |
warn | 14| 5|
error| 15| 28|
crit | - | - |
maint| - | - |
|337|328|
===> {'SAM': 'sBDII-sanity', 'SAMNag': 'sBDII-org.gstat.SanityCheck'}
na | - | - |
ok |262|248|
info | - | - |
note | 1| - |
warn | 9| 26|
error| - | 14|
crit | - | - |
maint| 11| - |
|283|288|
[kvs] src >
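The per-test tables in the output above can be produced from two dictionaries of status counts. The sketch below shows only the formatting step and is not taken from sam-samnag-cmp_stat.py; the function names and the exact column widths are assumptions, and fetching the counts from the DBs is left out.

```python
# Status rows in the order used in the example output.
STATUSES = ["na", "ok", "info", "note", "warn", "error", "crit", "maint"]


def format_cell(count):
    """Render a count in three characters, ' - ' when there is none."""
    return " - " if not count else "%3d" % count


def format_comparison(sam_counts, samnag_counts):
    """Return a side-by-side table of status counts plus a totals row."""
    lines = []
    for status in STATUSES:
        lines.append("%-5s|%s|%s|" % (
            status,
            format_cell(sam_counts.get(status, 0)),
            format_cell(samnag_counts.get(status, 0)),
        ))
    # Last row: totals per DB, with an empty label column.
    lines.append("%-5s|%3d|%3d|" % (
        "", sum(sam_counts.values()), sum(samnag_counts.values())))
    return "\n".join(lines)


if __name__ == "__main__":
    # Counts of the first table in the example output.
    print(format_comparison({"ok": 367, "error": 28},
                            {"ok": 351, "warn": 39}))
```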
P.S.: worth reading
- Integral checks, which perform multiple operations on a service in "one go", generally assume that the check by itself already defines the integral availability of that service, because the different functional operations on the service are in fact parts of one test. Only this integral value then reaches the Metrics DB. Such an approach leaves no flexibility in service availability calculations (which could otherwise use different metrics at different times), and it also reduces the modularity ("plug-ability") of the probes.
SAM MDDB Profiles
Follow the link: SAM MDDB Profiles
-- KonstantinSkaburskas - 11 Oct 2008