OLD CMS page
ATLAS Critical Tests FAQ
What is a "critical" test?
Each VO has the possibility to flag any SAM test executed by any VO (even a different one) as "critical". As SAM tests are organized into "sensors", where a sensor corresponds to a Grid service type (CE, SE, SRM, BDII, FTS, etc.), each VO defines zero or more critical tests for each Grid service type.
Which SAM tests are interesting for CMS?
The only SAM tests that can possibly have any relevance for CMS are those submitted with a cms certificate (obviously) and those submitted with an ops certificate. Among them, CMS is currently interested only in the tests for the CE, the SE, the SRM and FTS, but others might be added in future.
What is the ops VO?
The ops VO is the VO used by the EGEE Grid operations to run the SAM tests for the Grid infrastructure monitoring. The ops VO is guaranteed to exist at all EGEE sites, while it typically does not exist at OSG sites. Therefore, the ops SAM tests are normally not run on OSG (with the exception of FNAL). This means that if an ops SAM test is selected as critical for CMS, this choice will be totally irrelevant for the average OSG site.
What happens when a test is flagged as critical?
There are two kinds of effects. The first effect is that if a Grid service instance fails a critical test, it will be flagged as "unavailable" in the WLCG monitoring. If all Grid services of a given type at a site fail one or more critical tests, the Grid service type (e.g. the "CE") is flagged as unavailable for the site (e.g. "site X has no working CEs"). If a site has a Grid service type considered to be essential (e.g. the CE, the SE, the SRM) which is unavailable, the site is marked as unavailable as a whole. This implies that extreme caution must be taken when defining which SAM tests are critical!
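The roll-up from test results to site availability described above can be sketched as follows. This is only an illustration of the rule as stated in this FAQ, not the actual WLCG algorithm: the function names, the `ESSENTIAL_TYPES` set and the choice to treat a test that was not run as passed are all assumptions.

```python
from typing import Dict, List, Set

# Assumed set of service types considered essential for a site.
ESSENTIAL_TYPES: Set[str] = {"CE", "SE", "SRM"}


def instance_available(results: Dict[str, bool], critical: Set[str]) -> bool:
    """A service instance is available if it fails no critical test.

    `results` maps test name -> True (passed) / False (failed).
    Non-critical tests are ignored; a critical test with no result
    is treated as passed (an assumption of this sketch).
    """
    return all(results.get(test, True) for test in critical)


def service_type_available(instances: List[Dict[str, bool]],
                           critical: Set[str]) -> bool:
    """A service type is available if at least one instance is available."""
    return any(instance_available(r, critical) for r in instances)


def site_available(site: Dict[str, List[Dict[str, bool]]],
                   critical_by_type: Dict[str, Set[str]]) -> bool:
    """A site is available if every essential service type it has is available."""
    return all(
        service_type_available(instances, critical_by_type.get(stype, set()))
        for stype, instances in site.items()
        if stype in ESSENTIAL_TYPES
    )
```

With two CEs of which only one fails a critical test, the CE type (and hence the site) stays available; it becomes unavailable only when every CE fails at least one critical test.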
The second effect is relevant only for the CE and SE: the VO can choose to automatically exclude a CE or an SE from the information system if it has failed at least one critical test. The rationale is that a Grid service that has failed a critical test is "useless" and should not even be visible to the user.
What is the difference between a CMS test and a CMS critical test?
There is a problem of terminology that never ceases to generate confusion. The following statements should hopefully make things clearer:
- a CMS test is a SAM test developed internally in CMS, typically used to test very CMS-specific functionalities (the CMSSW installation, FroNTier, etc.); these tests are quite naturally run with a cms proxy
- a CMS critical test is any test that CMS has defined as critical: it can be a test run by any VO, and actually at the moment most of the CMS critical tests are run with an ops proxy
- an OPS test is a somewhat ambiguous concept: it may refer to any test run with an ops proxy, or to a test developed by the EGEE operations team, but possibly run also by other VOs. For example, in this sense the "job submission" test is an OPS test (i.e. developed by EGEE) but it is run both by OPS and CMS. The first meaning is certainly the most appropriate for the expression.
- the WLCG availability for CMS uses all and only the CMS critical tests, irrespective of the VO that was used to run them.
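One way to keep the terminology straight is to identify each test run by the service type it targets, its name and the VO that ran it, and to treat "critical for CMS" as a set of such triples. The sketch below is purely illustrative: `SamRun`, `CMS_CRITICAL` and `critical_results` are hypothetical names, and the triples are the ones marked "cms" in the tables at the end of this page.

```python
from typing import List, NamedTuple, Set, Tuple


class SamRun(NamedTuple):
    sensor: str   # Grid service type, e.g. "CE"
    test: str     # test name, e.g. "js"
    run_by: str   # VO whose proxy was used to run the test
    passed: bool


# (sensor, test, run_by) triples flagged as critical by CMS,
# copied from the tables on this page.
CMS_CRITICAL: Set[Tuple[str, str, str]] = {
    ("CE", "js", "cms"),
    ("CE", "ca", "ops"),
    ("SRM", "v1-put", "cms"),
}


def critical_results(runs: List[SamRun],
                     critical: Set[Tuple[str, str, str]]) -> List[SamRun]:
    """Keep only the runs that count as critical, whichever VO ran them."""
    return [r for r in runs if (r.sensor, r.test, r.run_by) in critical]
```

Note that the same test name can appear twice, once per VO running it, and only the flagged combination is used for the availability.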
What is the difference between the WLCG availability for CMS, the WLCG availability for OPS and the CMS availability?
This is another common source of confusion:
- the WLCG availability for CMS is the WLCG availability calculated using the CMS critical tests
- the WLCG availability for OPS is the WLCG availability calculated using the OPS critical tests
- the CMS availability is the availability as calculated internally in CMS either with Brian Bockelman's script, or with the ARDA dashboard; actually they use different definitions, which is yet another source of confusion. What they have in common is that the CMS-specific tests are treated like critical tests.
This explains why there can be huge differences between different site availability calculations: they are inherently different, and none of them is "right" or "wrong".
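A toy calculation shows how the same test history yields different numbers depending on which tests are treated as critical. This is a deliberately simplified sketch: the `availability` function, the sample data and the rule "a time slot is good if all critical tests passed" are assumptions, and none of the real calculations mentioned above is this simple.

```python
from typing import Dict, List, Set


def availability(samples: List[Dict[str, bool]], critical: Set[str]) -> float:
    """Fraction of time slots in which every critical test passed.

    Each sample maps test name -> passed for one time slot.
    A critical test missing from a sample is treated as passed
    (an assumption of this sketch).
    """
    if not samples:
        raise ValueError("no samples")
    good = sum(1 for s in samples if all(s.get(t, True) for t in critical))
    return good / len(samples)


# Hypothetical history of four time slots for one site.
history = [
    {"js": True, "frontier": True},
    {"js": True, "frontier": False},
    {"js": False, "frontier": False},
    {"js": True, "frontier": True},
]

wlcg_like = availability(history, {"js"})             # only job submission is critical
cms_like = availability(history, {"js", "frontier"})  # CMS-specific test also counted
```

Here `wlcg_like` is 0.75 and `cms_like` is 0.5, even though both are computed from identical test results.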
What is FCR (Freedom of Choice for Resources)?
FCR is a web interface allowing the VO manager to decide:
- which tests are critical for the VO, for each Grid service type
- for every single CE and SE in the Grid supporting the VO, whether the CE or SE should be a) always visible to the user, b) never visible to the user, or c) visible to the user only when it passes all the critical tests. Due to an outstanding bug, it is not possible to change this setting for OSG resources: they are always visible to the user.
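The three per-resource settings can be thought of as a small policy applied to the latest critical test results. The following is a minimal sketch of that decision only; `Visibility` and `is_visible` are hypothetical names and do not correspond to anything in the FCR implementation.

```python
from enum import Enum
from typing import Dict, Set


class Visibility(Enum):
    ALWAYS = "always"          # a) always visible to the user
    NEVER = "never"            # b) never visible to the user
    IF_PASSING = "if_passing"  # c) visible only when all critical tests pass


def is_visible(policy: Visibility,
               results: Dict[str, bool],
               critical: Set[str]) -> bool:
    """Decide whether a CE or SE should be visible to the VO's users.

    `results` maps test name -> passed. A critical test with no
    result is treated as passed (an assumption of this sketch).
    """
    if policy is Visibility.ALWAYS:
        return True
    if policy is Visibility.NEVER:
        return False
    return all(results.get(test, True) for test in critical)
```

For example, `is_visible(Visibility.IF_PASSING, {"js": False}, {"js"})` is false, while the same results under `Visibility.ALWAYS` leave the resource visible.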
What does "not visible to the user" mean?
For the CE, it means that in the top BDII the following attribute=value pair is removed, e.g.:
GlueCEAccessControlBaseRule: VO:cms
Similarly, for the SE, a pair such as
GlueSAAccessControlBaseRule: cms
is removed. All the rest of the information is left untouched, so the CE or SE does not really "disappear" from the information system: it just ceases to advertise that it supports the CMS VO.
A BDII can choose to ignore FCR completely, in which case those attribute=value pairs are never removed. This is the case, for example, for the BDII used by the Resource Brokers which submit the SAM tests: CEs must always be reachable by SAM jobs!
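The attribute removal described above can be illustrated on an in-memory copy of an information system entry. This is a sketch under stated assumptions: the entry is modelled as a plain dictionary of attribute name to list of values, `hide_from_vo` is a hypothetical helper, and the host name in the usage note is made up. It is not how the top BDII actually applies FCR.

```python
from typing import Dict, List

Entry = Dict[str, List[str]]


def hide_from_vo(entry: Entry, vo: str) -> Entry:
    """Return a copy of `entry` that no longer advertises support for `vo`.

    Only the access control values for that VO are dropped; every other
    attribute and value is left untouched.
    """
    # Value to drop for each access control attribute, in the two
    # formats shown in this FAQ.
    to_drop = {
        "GlueCEAccessControlBaseRule": f"VO:{vo}",
        "GlueSAAccessControlBaseRule": vo,
    }
    hidden: Entry = {}
    for attr, values in entry.items():
        kept = [v for v in values if v != to_drop.get(attr)]
        if kept:  # omit an attribute left with no values
            hidden[attr] = kept
    return hidden
```

Applied to an entry with `GlueCEUniqueID: ce.example.org:2119/jobmanager-lcgpbs-cms` and access rules `VO:cms` and `VO:atlas`, the result keeps the unique ID and `VO:atlas` and drops only `VO:cms`.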
What really happens if a CE "disappears"?
The real effect of the "deletion" of a CE is that an LCG Resource Broker or a gLite WMS connected to a BDII that uses FCR will not "see" that CE anymore: it cannot submit any jobs to it! Again, the idea behind this is: if a CE failed a critical test (or has been blacklisted in FCR), user jobs should never be submitted there. It is an effective way to protect users from hopelessly trying to run their jobs on bad CEs.
What really happens if an SE "disappears"?
In this case, the only practical effect would be when using the lcg_util commands provided by EGEE for simple data management operations: they may check whether the SE where the user is trying to write a file supports the user's VO. As CMS is not using these commands, in practice there is no effect at all. FTS does not use that information and is not affected either.
Just a note on SRM: for historical reasons, FCR cannot exclude SRMs, even if now in practice every SE is an SRM and vice versa. This can be seen as a bug in FCR, but a totally irrelevant one for CMS.
How to choose the CE critical tests?
The fact that the criticality of a test affects both the CE availability and the ability to reach the CE using the RB or the WMS poses a very strong constraint on the choice of the critical tests for the CE: only tests whose failure means the CE is practically unusable for CMS should be critical. As a matter of fact, only the "job submission" test is a reasonable choice, while the CMS-specific tests, like those for FroNTier, or the MC test, should not be set as critical because they are relevant only for some kinds of jobs. The implication is that in the WLCG site availability calculation sites will tend to look better than they are, because failures of the more CMS-specific tests will not be accounted for. This is the basic reason why CMS also calculates its own site availability, which may be radically different from the WLCG availability.
A possible way to unify the CMS and the WLCG availability is to configure FCR to ignore the SAM test results: doing so, no CE will ever disappear because it failed a critical test, while critical test failures will still be reflected in the WLCG availability. CMS could just choose which tests best represent the "goodness" of a site, without any side effect. However, this would put the responsibility on CRAB and on the ProdAgent to find out whether a CE is working fine or not, which could be practically achieved by looking in real time at the results of the latest SAM tests, by querying either the ARDA dashboard or the SAM database.
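The kind of check that CRAB or the ProdAgent would then need can be sketched as follows. Everything here is hypothetical: `SamResult` and `usable_ces` are illustrative names, the results are assumed to have already been fetched from the ARDA dashboard or the SAM database (no query interface is shown or implied), and a CE with no result for a required test is conservatively treated as not usable.

```python
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class SamResult(NamedTuple):
    ce: str
    test: str
    timestamp: int  # seconds since the epoch
    passed: bool


def usable_ces(results: Iterable[SamResult], required: Set[str]) -> Set[str]:
    """Return the CEs whose latest result for every required test is a pass."""
    # Keep only the most recent result for each (CE, test) pair.
    latest: Dict[Tuple[str, str], SamResult] = {}
    for r in results:
        key = (r.ce, r.test)
        if key not in latest or r.timestamp > latest[key].timestamp:
            latest[key] = r

    ces = {ce for ce, _ in latest}
    return {
        ce for ce in ces
        if all((ce, t) in latest and latest[(ce, t)].passed for t in required)
    }
```

Different job types could pass different `required` sets, for instance adding the FroNTier test only for jobs that need it, which is exactly the flexibility that a single global list of critical tests cannot offer.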
How to choose the SE/SRM critical tests?
For the SE and the SRM, no side effects are expected anyway for CMS, so already now it is safe for CMS to define whatever SAM tests it prefers as critical, for the sake of the site availability calculation. As currently there is a one-to-one correspondence between SE and SRM, it is reasonable just not to set any critical test for the SE (and ignore the SE altogether) and to set critical tests for the SRM.
List of critical CE tests
| Test | Description | Developed by | Critical for (when run by) | Reason |
| js | Job submission | ops | ops (ops), cms (cms) | if it fails, the CE cannot even run an ops job |
| ver | Finds the version of gLite | ops | ops (ops) | it must be possible to find the gLite version |
| ca | Checks the CA certificates | ops | ops (ops), cms (ops) | if it fails, authentication errors will occur |
| bi | Checks that the brokerinfo command works | ops | ops (ops) | the brokerinfo must work |
| csh | Checks that the C-shell works | ops | ops (ops) | C-shell scripts must work |
| rm | Replica management | ops | ops (ops) | the gLite data management must work |
| cert | Host certificate | ops | ops (ops) | the host certificate must be valid |
List of critical SE tests
| Test | Description | Developed by | Critical for (when run by) | Reason |
| cr | Copies and registers a file to the SE | ops | ops (ops) | it must be possible to copy & register a file to the SE |
| cp | Copies a file from the SE | ops | ops (ops) | it must be possible to copy a file from the SE |
| del | Deletes a file from the SE | ops | ops (ops) | it must be possible to delete a file from the SE |
List of critical SRM v1 tests
| Test | Description | Developed by | Critical for (when run by) | Reason |
| v1-put | Stores a file to SRM (put) using srmcp | cms | cms (cms) | it must be possible to copy a file to the SRM |
| put | Stores a file to SRM (put) using lcg-cr | ops | ops (ops) | it must be possible to copy a file to the SRM |
| get | Copy a file back from the SRM (get) using lcg-cp | ops | ops (ops) | it must be possible to copy back a file from the SRM |
| del | Delete a file from the SRM (advisory-delete) using lcg-del | ops | ops (ops) | it must be possible to delete a file from the SRM |
List of critical FTS tests
| Test | Description | Developed by | Critical for (when run by) | Reason |
| cert | Host certificate | ops | ops (ops) | the host certificate must be valid |
| ftschn | FTS channels | ops | ops (ops) | it must be possible to list the channels |
| ftsinfo | FTS information | ops | ops (ops) | the FTS must be correctly registered in the information system |
-- AndreaSciaba - 06 Nov 2007
-- AleDiGGi - 13 May 2008