Review of the EGEE/WLCG Critical Probes used for Availability Metrics Calculations
Critical probes defined in SAM for 'OPS' VO
GridView calculates the status of a site by looking at the SAM results of the critical tests defined for the CE, SRMv2 and sBDII services.
The critical tests defined for each of these services are the following:
Computing Element (CE)
- Tests executed from SAM UI.
- CE-host-cert-valid
- Test if the service/host certificate on the CE is valid.
- cvs
- This sensor returns OK if the host certificate of the CE is valid, i.e. has not expired.
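The expiry comparison behind this sensor can be sketched as follows. This is only an illustration, not the probe's actual code: the function name `host_cert_not_expired` is hypothetical, and the sketch assumes the certificate's notAfter time has already been extracted as text.

```python
import ssl
import time


def host_cert_not_expired(not_after, now=None):
    """Return True if the certificate's notAfter time is still in the future.

    not_after is the textual form used in certificates,
    e.g. "Jan  5 09:34:43 2018 GMT".
    now is a Unix timestamp; it defaults to the current time.
    """
    if now is None:
        now = time.time()
    # ssl.cert_time_to_seconds converts the certificate time string
    # to seconds since the epoch (UTC).
    return now < ssl.cert_time_to_seconds(not_after)
```

A sensor built on this would report OK when the function returns True.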
- CE-sft-job
- Job submission
- cvs
- This is a pseudo-test executed on the UI to publish the results of the test job submission and output retrieval. The sensor succeeds only if the job finished successfully and its output was retrieved.
- Tests executed from WN.
- CE-sft-brokerinfo
- BrokerInfo
- cvs
- This test runs on the WN. It checks whether the name of the CE to which the job was dispatched can be obtained with the glite-brokerinfo or edg-brokerinfo commands. These commands read this information from a BrokerInfo file (created by the RB) that is sent to the job's working directory on the WN in the input sandbox.
- First, check whether the BrokerInfo file is defined in the $GLITE_WMS_RB_BROKERINFO, $GLITE_WL_RB_BROKERINFO or $EDG_WL_RB_BROKERINFO variables. If not, the test fails.
- Then, try to get the CE host name using the edg-brokerinfo getCE or glite-brokerinfo getCE commands.
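The two steps above can be sketched roughly as follows. The variable and command names come from the description; the function names, the order in which the commands are tried, and the treatment of empty output as a failure are assumptions made for illustration.

```python
import shutil
import subprocess

# Variable and command names as listed in the test description.
BROKERINFO_VARS = (
    "GLITE_WMS_RB_BROKERINFO",
    "GLITE_WL_RB_BROKERINFO",
    "EDG_WL_RB_BROKERINFO",
)
BROKERINFO_CMDS = ("glite-brokerinfo", "edg-brokerinfo")


def find_brokerinfo_file(env):
    """Step 1: return the BrokerInfo path from the first variable that is set."""
    for name in BROKERINFO_VARS:
        value = env.get(name)
        if value:
            return value
    return None


def get_ce_hostname(env):
    """Step 2: ask whichever brokerinfo command is installed for the CE name.

    Returns None if step 1 fails or no command yields a CE name.
    """
    if find_brokerinfo_file(env) is None:
        return None
    for cmd in BROKERINFO_CMDS:
        path = shutil.which(cmd)
        if path is None:
            continue  # command not installed on this node
        result = subprocess.run(
            [path, "getCE"], capture_output=True, text=True, env=env
        )
        if result.returncode == 0 and result.stdout.strip():
            return result.stdout.strip()
    return None
```

On a WN this would be called as `get_ce_hostname(dict(os.environ))`; a return value of None corresponds to a failed test.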
- CE-sft-caver
- CA certs version
- cvs
- Check the version of the CA RPMs installed on the WN and compare them with the reference ones. If the RPM check fails for any reason (for example, because another installation method was used), the test falls back to a physical file test (MD5 checksum comparison of all CA certs against the reference list). This sensor returns OK if the installed CA RPMs are identical to the references.
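A minimal sketch of the compare-then-fall-back logic, under stated assumptions: the function names and the data shapes (dictionaries of package versions and of certificate file contents) are hypothetical, and how the real test queries RPM or locates the reference list is not shown here.

```python
import hashlib


def md5_of(data):
    """Return the hex MD5 digest of a bytes object."""
    return hashlib.md5(data).hexdigest()


def ca_certs_ok(installed_rpms, reference_rpms, cert_files, reference_md5):
    """Compare installed CA packages with the reference set.

    installed_rpms: dict of package name -> version, or None when the
                    RPM query could not be performed (assumed convention).
    reference_rpms: dict of package name -> expected version.
    cert_files:     dict of file name -> file contents (bytes).
    reference_md5:  dict of file name -> expected MD5 hex digest.
    """
    if installed_rpms is not None:
        # Normal path: installed versions must match the reference exactly.
        return installed_rpms == reference_rpms
    # Fallback path: compare checksums of the certificate files themselves.
    actual = {name: md5_of(data) for name, data in cert_files.items()}
    return actual == reference_md5
```

Passing None for `installed_rpms` models the case where the nodes were not installed from RPMs, so only the file checksums can be compared.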
- CE-sft-csh
- csh test
- cvs
- Try to create and execute a very simple csh script that dumps the environment variables to a file. This sensor fails if the csh script cannot be executed and the dump file is missing.
- CE-sft-lcg-rm
- Replica Management
- cvs
- This is a super-test that succeeds only if all of the following tests succeed:
- CE-sft-lcg-rm-gfal
- GFAL Information System
- cvs
- Check if $LCG_GFAL_INFOSYS variable is set.
- CE-sft-lcg-rm-free
- Free space on default SE
- cvs
- Check if the default SE has any free space left according to the information system.
- CE-sft-lcg-rm-cr
- lcg-cr to local SE
- cvs
- Copy and register a short text file to the default SE using the lcg-cr command. Retrieve the list of replicas with the lcg-lr command.
- CE-sft-lcg-rm-cp
- lcg-cp from local SE
- cvs
- Copy the file registered in test CE-sft-lcg-rm-cr to the WN using the lcg-cp command.
- CE-sft-lcg-rm-rep
- lcg-rep to "central" SE
- cvs
- Replicate the file registered in test CE-sft-lcg-rm-cr to the chosen "central" SE using the lcg-rep command.
- CE-sft-lcg-rm-del
- lcg-del
- cvs
- Delete the replicas of all files registered in the previous tests using the lcg-del command.
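The super-test rule (CE-sft-lcg-rm succeeds only if every sub-test succeeds) amounts to a logical AND over the six results. A minimal sketch, in which the function name and the treatment of a missing result as a failure are assumptions:

```python
# Sub-tests of CE-sft-lcg-rm, in the order they are listed above.
SUBTESTS = (
    "CE-sft-lcg-rm-gfal",
    "CE-sft-lcg-rm-free",
    "CE-sft-lcg-rm-cr",
    "CE-sft-lcg-rm-cp",
    "CE-sft-lcg-rm-rep",
    "CE-sft-lcg-rm-del",
)


def lcg_rm_succeeded(results):
    """Return True only if every sub-test succeeded.

    results maps a sub-test name to True (succeeded) or False (failed).
    A sub-test with no recorded result is treated as failed (assumption).
    """
    return all(results.get(name, False) for name in SUBTESTS)
```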
- CE-sft-softver
- Software Version (WN)
- cvs
- Detect the version of the software actually installed on the WN. The lcg-version command is used to detect the version; if the command is not available (very old versions of LCG), the test script checks only the version number of the GFAL-client RPM.
SRM version 2 (SRMv2)
Tests executed from SAM UI. Listed in the order of execution.
- SRMv2-host-cert-valid
- Test if the service/host certificate on the SRMv2 is valid.
- cvs
- This sensor returns OK if the host certificate of the SRM is valid, i.e. has not expired.
- SRMv2-get-SURLs
- Get full SRM endpoints and space areas from BDII.
- cvs
- SRMv2-ls-dir
- Lists VO's top level space area(s) in SRM. Acts as a light-weight equivalent to 'srmping' test.
- cvs
- SRMv2-put
- Copy a local file to the SRM into default space area(s).
- cvs
- SRMv2-ls
- List (previously copied) file(s) on the SRM.
- cvs
- SRMv2-gt
- Get Transport URLs for the file copied to storage.
- cvs
- SRMv2-get
- Copy given remote file(s) from SRM to a local file.
- cvs
- SRMv2-del
- Delete given file(s) from SRM.
- cvs
Site BDII (sBDII)
- sBDII-sanity
- GIIS Sanity Check
- cvs
- Performs the following syntax checks on GIIS:
- Check for blank lines of non-zero length (lines containing only spaces). These may cause problems.
- Check for entries that have no values.
- Check for lines without ":". These should not exist.
- Check for a missing newline character between two attributes. This looks like two lines combined together.
- Check for duplicate GlueCEStateWorstResponseTime in each CE.
- Performs the following logic checks (missing attributes) on GIIS:
- This sensor returns:
- OK - when there were no problems.
- NOTE - when blank lines exist.
- WARN - when blank values or invalid entries were found.
- ERROR - when the query failed.
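Some of the syntax checks and the resulting status can be sketched as follows. This is a simplified illustration, not the sensor's code: the function name is hypothetical, None stands for a failed query, WARN is assumed to take precedence over NOTE, LDIF comment and continuation lines are skipped, and the merged-attribute and duplicate-attribute checks are left out.

```python
def sanity_status(ldif_text):
    """Classify ldapsearch-style output as OK, NOTE, WARN or ERROR.

    ldif_text is the query output as a string, or None if the query failed.
    """
    if ldif_text is None:
        return "ERROR"
    blank_with_spaces = False
    invalid = False
    for line in ldif_text.split("\n"):
        if line == "":
            continue  # genuinely empty line separating entries
        if line.strip() == "":
            blank_with_spaces = True  # non-zero-length blank line
            continue
        if line.startswith("#") or line.startswith(" "):
            continue  # LDIF comment or continuation line
        if ":" not in line:
            invalid = True  # line without ":"
        elif line.split(":", 1)[1].strip() == "":
            invalid = True  # attribute with no value
    if invalid:
        return "WARN"
    if blank_with_spaces:
        return "NOTE"
    return "OK"
```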
- sBDII-performance
- GIIS Perf Check
- cvs
- This sensor shares the same agent as the SanityCheck sensor and uses the same ldapsearch query results.
- The number of entries found, the number of old entries (not modified within the last 10 minutes) and the query response time (ms) are recorded.
- This sensor returns:
- OK - when there were no problems.
- INFO - when the response time > 10 seconds.
- ERROR - when no entries were found or when they were old.
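The mapping from the three recorded values to a status can be sketched like this. The function name is hypothetical, and two points are assumptions: any old entry is taken to trigger ERROR (the description does not say whether one or all entries must be old), and ERROR is checked before INFO.

```python
RESPONSE_LIMIT_MS = 10_000  # 10 seconds, expressed in milliseconds


def performance_status(num_entries, num_old_entries, response_time_ms):
    """Return OK, INFO or ERROR from the values recorded by the sensor."""
    if num_entries == 0 or num_old_entries > 0:
        return "ERROR"  # nothing returned, or stale entries present
    if response_time_ms > RESPONSE_LIMIT_MS:
        return "INFO"  # query worked but was slow
    return "OK"
```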
Access to (SAM) Sites Availability metrics
There are several ways to visualize or pull the site availability metrics. Some of them are described here:
-- DavidCollados - 03 Dec 2007