OLD CMS page

ATLAS Critical Tests FAQ

What is a "critical" test?

Each VO has the possibility to flag any SAM test executed by any VO (even a different one) as "critical". As SAM tests are organized into "sensors", where a sensor corresponds to a Grid service type (CE, SE, SRM, BDII, FTS, etc.), each VO defines zero or more critical tests for each Grid service type.

Which SAM tests are interesting for CMS?

The only SAM tests that can possibly have any relevance for CMS are those submitted with a cms certificate (obviously) and those submitted with an ops certificate. Among them, CMS is currently interested only in the tests for the CE, the SE, the SRM and FTS, but others might be added in future.

What is the ops VO?

The ops VO is the VO used by the EGEE Grid operations to run the SAM tests for the Grid infrastructure monitoring. The ops VO is guaranteed to exist at all EGEE sites, while it typically does not exist at OSG sites. Therefore, the ops SAM tests are normally not run on OSG (with the exception of FNAL). This means that if a SAM ops test is selected as critical for CMS, this choice will be totally irrelevant for the average OSG site.

What happens when a test is flagged as critical?

There are two kinds of effects. The first effect is that if a Grid service instance fails a critical test, it will be flagged as "unavailable" in the WLCG monitoring. If all Grid services of a given type at a site fail one or more critical tests, the Grid service type (e.g. the "CE") is flagged as unavailable for the site (e.g. "site X has no working CEs"). If a site has a Grid service type considered to be essential (e.g. the CE, the SE, the SRM) which is unavailable, the site is marked as unavailable as a whole. This implies that extreme caution must be taken when defining which SAM tests are critical!

The second effect is relevant only for the CE and SE: the VO has the choice to automatically exclude a CE or an SE from the information system depending on whether the CE or the SE has failed at least one critical test. The rationale is that if a Grid service has failed a critical test, it is "useless" and should not even be visible to the user.
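The first effect, the roll-up from service instance to service type to site, can be illustrated with a minimal Python sketch. The data shapes (`Instance`, `Site`), the function names and the choice of critical tests in the example are assumptions made for illustration; the real WLCG calculation is not reproduced here.

```python
from typing import Dict, List, Set

# Assumption: the essential service types are the ones given as examples above.
ESSENTIAL_TYPES = ("CE", "SE", "SRM")

# Hypothetical data shapes: one instance = {test name: passed?},
# one site = {service type: [instances]}.
Instance = Dict[str, bool]
Site = Dict[str, List[Instance]]


def instance_available(results: Instance, critical: Set[str]) -> bool:
    """An instance is unavailable as soon as it fails one critical test."""
    return all(passed for test, passed in results.items() if test in critical)


def type_available(instances: List[Instance], critical: Set[str]) -> bool:
    """A service type is unavailable only if all its instances are unavailable."""
    return any(instance_available(i, critical) for i in instances)


def site_available(site: Site, critical_by_type: Dict[str, Set[str]]) -> bool:
    """A site is unavailable if any essential type it runs is unavailable."""
    for service_type in ESSENTIAL_TYPES:
        instances = site.get(service_type, [])
        if not instances:
            continue  # the site does not run this service type
        if not type_available(instances, critical_by_type.get(service_type, set())):
            return False
    return True


# Illustrative critical sets; not the actual CMS configuration.
critical = {"CE": {"js"}, "SRM": {"put", "get"}}
site = {
    "CE": [{"js": False, "csh": True}, {"js": True, "csh": False}],
    "SRM": [{"put": True, "get": True}],
}
print(site_available(site, critical))
```

In this example the first CE fails the critical "js" test, but the second CE passes it, so the CE service type and the site as a whole remain available. A failure of "csh" has no effect because it is not in the critical set.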

What is the difference between a CMS test and a CMS critical test?

There is a problem of terminology that never ceases to generate confusion. The following statements should hopefully make things clearer:

  • a CMS test is a SAM test developed internally in CMS, typically used to test very CMS-specific functionalities (the CMSSW installation, FroNTier, etc.); these tests are quite naturally run with a cms proxy
  • a CMS critical test is any test that CMS has defined as critical: it can be a test run by any VO, and actually at the moment most of the CMS critical tests are run with an ops proxy
  • an OPS test is a somewhat ambiguous concept: it may refer to any test run with an ops proxy, or to a test developed by the EGEE operations team, but possibly also run by other VOs. For example, in this second sense the "job submission" test is an OPS test (i.e. developed by EGEE) but it is run both by OPS and CMS. The first meaning is certainly the most appropriate for the expression
  • the WLCG availability for CMS uses all and only the CMS critical tests, irrespective of the VO that was used to run them.
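The last point can be sketched in Python as follows, assuming that a critical test is identified by a (test name, VO that ran it) pair, as the "Critical for (when run by)" column in the lists further down suggests. The `Result` type, the function names and the short test name "mc" are hypothetical.

```python
from typing import List, NamedTuple, Set, Tuple


class Result(NamedTuple):
    test: str      # short test name, e.g. "js"
    run_by: str    # VO whose proxy was used to run the test
    passed: bool


# CMS critical CE tests as (test, run-by VO) pairs, taken from the CE list below.
CMS_CRITICAL_CE: Set[Tuple[str, str]] = {("js", "cms"), ("ca", "ops")}


def critical_results(results: List[Result],
                     critical: Set[Tuple[str, str]]) -> List[Result]:
    """Keep all and only the results that belong to the critical set."""
    return [r for r in results if (r.test, r.run_by) in critical]


def passes_critical(results: List[Result],
                    critical: Set[Tuple[str, str]]) -> bool:
    """True if no critical test failed; non-critical failures are ignored."""
    return all(r.passed for r in critical_results(results, critical))


results = [
    Result("js", "ops", False),  # run by ops: not critical for CMS
    Result("js", "cms", True),   # run by cms: critical for CMS
    Result("ca", "ops", True),   # run by ops, but critical for CMS
    Result("mc", "cms", False),  # CMS-specific test, not critical
]
print(passes_critical(results, CMS_CRITICAL_CE))
```

Here the CE counts as passing for CMS: the failed CMS-specific test and the failed ops-run job submission are both outside the CMS critical set.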

What is the difference between the WLCG availability for CMS, the WLCG availability for OPS and the CMS availability?

This is another common source of confusion:
  • the WLCG availability for CMS is the WLCG availability calculated using the CMS critical tests
  • the WLCG availability for OPS is the WLCG availability calculated using the OPS critical tests
  • the CMS availability is the availability as calculated internally in CMS either with Brian Bockelman's script, or with the ARDA dashboard; actually they use different definitions, which is yet another source of confusion. What they have in common is that the CMS-specific tests are treated like critical tests.

This explains why there can be huge differences between different site availability calculations: they are inherently different, and none of them is "right" or "wrong".

What is FCR (Freedom of Choice for Resources)?

FCR is a web interface allowing the VO manager to decide:
  • which tests are critical for the VO, for each Grid service type
  • for every single CE and SE in the Grid supporting the VO, whether the CE or SE should be a) always visible to the user, b) never visible to the user, or c) visible to the user only when it passes all the critical tests. Due to an outstanding bug, it is not possible to change this setting for OSG resources: they are always visible to the user.
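The three visibility settings, together with the OSG exception, amount to a small decision rule. The following Python sketch is only an illustration; the `Visibility` enum and the `is_visible` function are hypothetical names and do not correspond to any actual FCR interface.

```python
from enum import Enum


class Visibility(Enum):
    ALWAYS = "always"          # a) always visible to the user
    NEVER = "never"            # b) never visible to the user
    IF_PASSING = "if_passing"  # c) visible only when all critical tests pass


def is_visible(setting: Visibility, failed_critical: bool,
               is_osg: bool = False) -> bool:
    """Decide whether a CE or SE is visible to the users of the VO."""
    if is_osg:
        # Because of the bug mentioned above, the setting has no effect
        # for OSG resources: they are always visible.
        return True
    if setting is Visibility.ALWAYS:
        return True
    if setting is Visibility.NEVER:
        return False
    return not failed_critical


print(is_visible(Visibility.IF_PASSING, failed_critical=True))
print(is_visible(Visibility.NEVER, failed_critical=False, is_osg=True))
```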

What does "not visible to the user" mean?

For the CE, it means that in the top BDII the following attribute=value pair is removed, e.g.:

GlueCEAccessControlBaseRule: VO:cms

Similarly for the SE: for example:

GlueSAAccessControlBaseRule: cms

is removed. All the rest of the information is left untouched, so the CE or SE do not really "disappear" from the information system: they just cease to advertise that they support the VO CMS.
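As an illustration of the removal described above, here is a minimal Python sketch that represents an information system entry as a dictionary of attribute names to lists of values. The representation, the `hide_from_vo` function and the example CE identifier are assumptions; the actual BDII mechanism is not shown.

```python
from typing import Dict, List

Entry = Dict[str, List[str]]


def hide_from_vo(entry: Entry, vo: str) -> Entry:
    """Return a copy of the entry without the access rule for the given VO.

    Only the two attribute=value pairs quoted above are touched;
    every other attribute is copied unchanged.
    """
    rules = {
        "GlueCEAccessControlBaseRule": f"VO:{vo}",  # CE form
        "GlueSAAccessControlBaseRule": vo,          # SE form
    }
    filtered: Entry = {}
    for attribute, values in entry.items():
        unwanted = rules.get(attribute)
        kept = [value for value in values if value != unwanted]
        if kept:
            filtered[attribute] = kept
    return filtered


# Hypothetical CE entry, for illustration only.
ce_entry = {
    "GlueCEUniqueID": ["ce.example.org:2119/jobmanager-pbs-cms"],
    "GlueCEAccessControlBaseRule": ["VO:cms", "VO:atlas"],
}
print(hide_from_vo(ce_entry, "cms"))
```

The CE entry is still present after the call and still advertises support for the other VO; only the rule for cms is gone.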

A BDII can choose to ignore FCR completely, in which case those attribute=value pairs are never removed. This is the case, for example, for the BDII used by the Resource Brokers which submit the SAM tests: CEs must always be reachable by SAM jobs!

What does really happen if a CE "disappears"?

The real effect of the "deletion" of a CE is that an LCG Resource Broker or a gLite WMS connected to a BDII that uses FCR will not "see" that CE anymore: it cannot submit any jobs to it! Again, the idea behind this is: if a CE failed a critical test (or has been blacklisted in FCR), user jobs should never be submitted there. It is an effective way to protect users from hopelessly trying to run their jobs on bad CEs.

What does really happen if a SE "disappears"?

In this case, the only practical effect would be when using the lcg_util commands provided by EGEE for simple data management operations: they may check whether the SE where the user is trying to write a file supports the user's VO. As CMS is not using these commands, the effect is in practice none at all. FTS does not use that information, and will not be affected either.

Just a note on SRM: for historical reasons, FCR cannot exclude SRMs, even if now in practice every SE is an SRM and vice versa. This can be seen as a bug in FCR, but a totally irrelevant one for CMS.

How to choose the CE critical tests?

The fact that the criticality of a test affects both the CE availability and the ability to reach the CE using the RB or the WMS poses a very strong constraint on the choice of the critical tests for the CE: only tests whose failure means the CE is practically unusable for CMS should be critical. As a matter of fact, only the "job submission" test is a reasonable choice, while the CMS-specific tests, like those for FroNTier, or the MC test, should not be set as critical because they are relevant only for some kinds of jobs. The implication is that in the WLCG site availability calculation sites will tend to look better than they are, because failures of the more CMS-specific tests will not be accounted for. This is the basic reason why CMS also calculates its own site availability, which may be radically different from the WLCG availability.

A possible way to unify the CMS and the WLCG availability is to configure FCR in order to ignore the SAM test results: doing so, no CE will ever disappear because it failed a critical test, while critical test failures will still be reflected in the WLCG availability. CMS could just choose which tests best represent the "goodness" of a site, without any side effect. However, this would put on CRAB and on the ProdAgent the responsibility to find out if a CE is working fine or not, which could be practically achieved by looking in real time at the results of the latest SAM tests, by querying either the ARDA dashboard or the SAM database.
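The client-side check mentioned above could look roughly like the following Python sketch. It assumes a hypothetical `fetch_latest` callable that returns the latest SAM results for a CE as a mapping from test name to pass/fail; how such results would actually be obtained from the ARDA dashboard or the SAM database is not covered here.

```python
from typing import Callable, Dict, Iterable, List, Set


def usable_ces(ces: Iterable[str],
               fetch_latest: Callable[[str], Dict[str, bool]],
               required: Set[str]) -> List[str]:
    """Return the CEs whose latest results pass every required test.

    A test with no recent result is treated as failed; this is a
    conservative choice made for this sketch.
    """
    good = []
    for ce in ces:
        latest = fetch_latest(ce)
        if all(latest.get(test, False) for test in required):
            good.append(ce)
    return good


# Stand-in for a real query; host names and results are invented.
fake_results = {
    "ce1.example.org": {"js": True, "ca": True},
    "ce2.example.org": {"js": False, "ca": True},
    "ce3.example.org": {"js": True},
}


def fake_fetch(ce: str) -> Dict[str, bool]:
    return fake_results.get(ce, {})


print(usable_ces(fake_results, fake_fetch, {"js", "ca"}))
```

With such a check on the client side, the set of required tests can differ per kind of job, which is exactly what a single FCR-wide critical set cannot express.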

How to choose the SE/SRM critical tests?

For the SE and the SRM, no side effects are expected anyway for CMS, so already now it is safe for CMS to define whatever SAM tests it prefers as critical, for the sake of the site availability calculation. As currently there is a one-to-one correspondence between SE and SRM, it is reasonable just not to set any critical test for the SE (and ignore the SE altogether) and to set critical tests for the SRM.

List of critical CE tests

| Test | Description | Developed by | Critical for (when run by) | Reason |
| js | Job submission | ops | ops (ops), cms (cms) | if it fails, the CE cannot even run an ops job |
| ver | Finds the version of gLite | ops | ops (ops) | it must be possible to find the gLite version |
| ca | Checks the CA certificates | ops | ops (ops), cms (ops) | if it fails, authentication errors will occur |
| bi | Checks that the brokerinfo command works | ops | ops (ops) | the brokerinfo must work |
| csh | Checks that the C-shell works | ops | ops (ops) | C-shell scripts must work |
| rm | Replica management | ops | ops (ops) | the gLite data management must work |
| cert | Host certificate | ops | ops (ops) | the host certificate must be valid |

List of critical SE tests

| Test | Description | Developed by | Critical for (when run by) | Reason |
| cr | Copies and registers a file to the SE | ops | ops (ops) | it must be possible to copy & register a file to the SE |
| cp | Copies a file from the SE | ops | ops (ops) | it must be possible to copy a file from the SE |
| del | Deletes a file from the SE | ops | ops (ops) | it must be possible to delete a file from the SE |

List of critical SRM v1 tests

| Test | Description | Developed by | Critical for (when run by) | Reason |
| v1-put | Stores a file to SRM (put) using srmcp | cms | cms (cms) | it must be possible to copy a file to the SRM |
| put | Stores a file to SRM (put) using lcg-cr | ops | ops (ops) | it must be possible to copy a file to the SRM |
| get | Copy a file back from the SRM (get) using lcg-cp | ops | ops (ops) | it must be possible to copy back a file from the SRM |
| del | Delete a file from the SRM (advisory-delete) using lcg-del | ops | ops (ops) | it must be possible to delete a file from the SRM |

List of critical FTS tests

| Test | Description | Developed by | Critical for (when run by) | Reason |
| cert | Host certificate | ops | ops (ops) | the host certificate must be valid |
| ftschn | FTS channels | ops | ops (ops) | it must be possible to list the channels |
| ftsinfo | FTS information | ops | ops (ops) | the FTS must be correctly registered in the information system |
-- AndreaSciaba - 06 Nov 2007

-- AleDiGGi - 13 May 2008

Topic revision: r1 - 2008-05-13 - AleDiGGi
 