Review of the EGEE/WLCG Critical Probes used for Availability Metrics Calculations

Critical probes defined in SAM for 'OPS' VO

GridView calculates the status of a site by looking at the SAM results of the critical tests defined for the CE, SRMv2 and sBDII services.
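
As a rough illustration of how critical-test results could be combined into a site status, here is a minimal sketch. The test names are the ones listed on this page; the all-must-pass rule, the "OK"/"DOWN" labels and the function names are assumptions made for illustration, not a description of the actual GridView algorithm.

```python
# Critical tests per service, as listed on this page.
# The aggregation rule below (every critical test must be OK) and the
# "OK"/"DOWN" labels are assumptions, not the documented GridView logic.
CRITICAL_TESTS = {
    "CE": [
        "CE-host-cert-valid", "CE-sft-job", "CE-sft-brokerinfo",
        "CE-sft-caver", "CE-sft-csh", "CE-sft-lcg-rm", "CE-sft-softver",
    ],
    "SRMv2": [
        "SRMv2-host-cert-valid", "SRMv2-get-SURLs", "SRMv2-ls-dir",
        "SRMv2-put", "SRMv2-ls", "SRMv2-gt", "SRMv2-get", "SRMv2-del",
    ],
    "sBDII": ["sBDII-sanity", "sBDII-performance"],
}


def service_ok(service, results):
    """Return True if every critical test of the service reported OK.

    results maps a test name to its status string; a missing result
    counts as not OK.
    """
    return all(results.get(test) == "OK" for test in CRITICAL_TESTS[service])


def site_status(results_by_service):
    """Return "OK" only if all three services pass their critical tests."""
    for service in CRITICAL_TESTS:
        if not service_ok(service, results_by_service.get(service, {})):
            return "DOWN"
    return "OK"
```

Non-critical tests are simply ignored here because only the names in CRITICAL_TESTS are ever looked up.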

The critical tests defined for each of these services are the following:

Computing Element (CE)

  • Tests executed from SAM UI.
    • CE-host-cert-valid - Test if the service/host certificate on the CE is valid. - cvs
      • This sensor returns OK if the host certificate of the CE is valid, i.e. is not expired.
    • CE-sft-job - Job submission - cvs
      • This is a pseudo-test executed on the UI to publish the results of the test job submission and output retrieval. The sensor succeeds only if the job finished successfully and the output was retrieved.
  • Tests executed from WN.
    • CE-sft-brokerinfo - BrokerInfo - cvs
      • This test is run on the WN. It checks whether the name of the CE to which the job was dispatched can be obtained with the glite-brokerinfo or edg-brokerinfo commands. These commands read this information from a BrokerInfo file (created by the RB) that is sent to the job working directory on the WN in the input sandbox.
      • First, check whether the BrokerInfo file is defined in one of the $GLITE_WMS_RB_BROKERINFO, $GLITE_WL_RB_BROKERINFO or $EDG_WL_RB_BROKERINFO variables. If not, the test fails.
      • Then, try to get CE host name using edg-brokerinfo getCE or glite-brokerinfo getCE commands.
    • CE-sft-caver - CA certs version - cvs
      • Check the versions of the CA RPMs installed on the WN and compare them with the reference ones. If the RPM check fails for any reason (for example, because a different installation method was used), the test falls back to a check of the physical files (MD5 checksum comparison of all CA certs against the reference list). This sensor returns OK if the installed CA RPMs are identical to the references.
    • CE-sft-csh - csh test - cvs
      • Try to create and execute a very simple csh script which dumps the environment variables to a file. This sensor fails if the csh script cannot be executed and the dump file is missing.
    • CE-sft-lcg-rm - Replica Management - cvs
      • This is a super-test that succeeds only if all of the following tests succeed:
        • CE-sft-lcg-rm-gfal - GFAL Information System - cvs
          • Check if $LCG_GFAL_INFOSYS variable is set.
        • CE-sft-lcg-rm-free - Free space on default SE - cvs
          • Check if the default SE has any free space left according to the information system.
        • CE-sft-lcg-rm-cr - lcg-cr to local SE - cvs
          • Copy and register a short text file to the default SE using lcg-cr command. Retrieve list of replicas with lcg-lr command.
        • CE-sft-lcg-rm-cp - lcg-cp from local SE - cvs
          • Copy the file registered in test CE-sft-lcg-rm-cr to the WN using lcg-cp command.
        • CE-sft-lcg-rm-rep - lcg-rep to "central" SE - cvs
          • Replicate the file registered in test CE-sft-lcg-rm-cr to the chosen "central" SE using lcg-rep command.
        • CE-sft-lcg-rm-del - lcg-del - cvs
          • Delete replicas of all the files registered in previous tests using lcg-del command.
    • CE-sft-softver - Software Version (WN) - cvs
      • Detect the version of the software that is actually installed on the WN. The lcg-version command is used for this. If the command is not available (very old versions of LCG), the test script checks only the version number of the GFAL-client RPM.
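
The two steps of CE-sft-brokerinfo described above (look up the BrokerInfo file, then ask for the CE host name) can be sketched as follows. This is a minimal sketch, not the actual sensor: the variable and command names come from the test description, while the function names, the order in which the variables and commands are tried, and the handling of a missing command are assumptions.

```python
import os
import subprocess

# Environment variables named in the test description.
# The order in which they are tried is an assumption.
BROKERINFO_VARS = (
    "GLITE_WMS_RB_BROKERINFO",
    "GLITE_WL_RB_BROKERINFO",
    "EDG_WL_RB_BROKERINFO",
)

# Commands named in the test description; trying glite first is an assumption.
BROKERINFO_COMMANDS = (
    ["glite-brokerinfo", "getCE"],
    ["edg-brokerinfo", "getCE"],
)


def find_brokerinfo_file(env=None):
    """Return the BrokerInfo path from the first variable that is set, else None."""
    env = os.environ if env is None else env
    for name in BROKERINFO_VARS:
        path = env.get(name)
        if path:
            return path
    return None


def get_ce_host(env=None):
    """Return the CE name reported by a brokerinfo command, or None on failure."""
    env = os.environ if env is None else env
    if find_brokerinfo_file(env) is None:
        return None  # step 1 failed: no BrokerInfo variable is defined
    for command in BROKERINFO_COMMANDS:
        try:
            proc = subprocess.run(command, capture_output=True, text=True, env=env)
        except OSError:
            continue  # command not installed on this WN, try the next one
        if proc.returncode == 0 and proc.stdout.strip():
            return proc.stdout.strip()
    return None
```

Passing the environment as a parameter keeps the lookup testable outside a worker node, where none of these variables is normally set.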

SRM version 2 (SRMv2)

Tests executed from SAM UI. Listed in the order of execution.

  • SRMv2-host-cert-valid - Test if the service/host certificate on the SRMv2 is valid. - cvs
    • This sensor returns OK if the host certificate of the SRM is valid, i.e. is not expired.
  • SRMv2-get-SURLs - Get full SRM endpoints and space areas from BDII. - cvs
  • SRMv2-ls-dir - Lists VO's top level space area(s) in SRM. Acts as a light-weight equivalent to 'srmping' test. - cvs
  • SRMv2-put - Copy a local file to the SRM into default space area(s). - cvs
  • SRMv2-ls - List (previously copied) file(s) on the SRM. - cvs
  • SRMv2-gt - Get Transport URLs for the file copied to storage. - cvs
  • SRMv2-get - Copy given remote file(s) from SRM to a local file. - cvs
  • SRMv2-del - Delete given file(s) from SRM - cvs
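
The expiry check behind the host-cert-valid tests (for both the CE and the SRMv2) can be illustrated with a short sketch. It assumes the certificate's notAfter date is already available as a string in the format used by Python's ssl module; the function name and the "ERROR" label for an expired certificate are assumptions, since the page only states when the sensor returns OK.

```python
import ssl
import time


def host_cert_status(not_after, now=None):
    """Return "OK" if the certificate has not expired yet, else "ERROR".

    not_after: notAfter date as produced by Python's ssl module,
               e.g. "Jan  5 09:34:43 2018 GMT".
    now:       reference time in seconds since the epoch (defaults to the
               current time); exposed mainly so the check can be tested.
    """
    now = time.time() if now is None else now
    expires = ssl.cert_time_to_seconds(not_after)
    # "ERROR" is an assumed label; the page only defines the OK case.
    return "OK" if now < expires else "ERROR"
```

Only the expiry date is compared here. Other aspects of certificate validity, such as the issuing CA or revocation, are outside the scope of this sketch.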


Site BDII (sBDII)

  • sBDII-sanity - GIIS Sanity Check - cvs
    • Performs the following syntax checks on GIIS:
      • Check for blank lines of non-zero length, i.e. lines containing only spaces. These may cause problems.
      • Check for entries that have no values.
      • Check for lines without ":". These should not exist.
      • Check for a missing newline character between two attributes. This looks like two lines merged together.
      • Check for duplicate GlueCEStateWorstResponseTime in each CE.
    • Performs the following logic checks (missing attributes) on GIIS:
      • Check if the GlueCEUnique and GlueSEUnique DNs specified in
        dn: GlueCESEBindGroupCEUniqueID=
      • Check if srm_v1/edg-se SEs have consistent access rules between the GlueSARoot and GlueServiceURI DN entries.
      • Check if several critical DN entries and their attributes exist.
    • This sensor returns:
      • OK - when there were no problems.
      • NOTE - when blank lines exist.
      • WARN - when blank values or invalid entries were found.
      • ERROR - when the query failed.
  • sBDII-performance - GIIS Perf Check - cvs
    • This sensor shares the same agent as the SanityCheck sensor and uses the same ldapsearch query results.
    • The number of entries found, the number of old entries (not modified within the last 10 minutes) and the query response time (in ms) are recorded.
    • This sensor returns:
      • OK - when there were no problems.
      • INFO - when the response time > 10 seconds.
      • ERROR - when no entries were found or when they were old.
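
The return codes of the two sBDII sensors can be summarised in a small sketch. It covers only the simplest syntax checks (blank lines containing spaces, attributes without a value, lines without ":") and the performance thresholds listed above. The function names, the precedence between statuses, the treatment of any old entry as an error, and the assumption that the LDIF output has no folded continuation lines are all illustrative choices, not taken from the sensor code.

```python
def sanity_status(lines):
    """Classify ldapsearch output; lines is None when the query failed."""
    if lines is None:
        return "ERROR"
    note = False
    warn = False
    for line in lines:
        if line == "":
            continue  # zero-length line: normal separator between entries
        if line.strip() == "":
            note = True  # blank line of non-zero length (only spaces)
        elif ":" not in line:
            warn = True  # invalid entry: no attribute/value separator
        elif line.split(":", 1)[1].strip() == "":
            warn = True  # attribute present but value is blank
    # Assumed precedence: WARN outranks NOTE.
    if warn:
        return "WARN"
    if note:
        return "NOTE"
    return "OK"


def performance_status(entries, old_entries, response_ms):
    """Classify the recorded performance figures of the same query."""
    # Assumption: a single old entry is enough to report ERROR,
    # and ERROR outranks INFO.
    if entries == 0 or old_entries > 0:
        return "ERROR"
    if response_ms > 10_000:  # response time above 10 seconds
        return "INFO"
    return "OK"
```

For example, sanity_status(["dn: o=grid", "   "]) yields "NOTE", while an output containing "GlueSiteName:" with no value yields "WARN".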

Access to (SAM) Sites Availability metrics

There are several ways to visualize or retrieve the site availability metrics. Some of them are described here:

