Pilot Startup Site Test (under development)
The Pilot Startup Site Test (PSST) checks the functionality needed for successful CMS production and analysis job execution on a worker node. It unifies the worker node tests from the GlideinWMS pilot, SAM, and HC. If the test is not successful, the return code identifies the missing functionality or broken component. The test reports its result into the pilot job ClassAd, and the result can be viewed in Kibana.
List of Functionality Checks Performed
Functionality is checked sequentially, and only the first missing functionality is reported to Elasticsearch. For warnings, only the last one is reported.
PSST tests:
- Connectivity and clock skew test
  - Tries to fetch an HTTP header from google.com or cern.ch.
  - If the HTTP header is fetched successfully, the timestamp from the header is compared with the timestamp from the worker node. The maximum allowed clock skew is 60 seconds.
- CPU load test
  - A warning is reported if the CPU load of the last minute is higher than the number of physical CPUs, or if that load plus the number of cores assigned to the pilot is higher than the number of physical CPUs.
- Software area test
  - Checks that the CMS software area directory (usually /cvmfs/cms.cern.ch) exists and is readable.
  - Checks that $SW_DIR/cmsset_default.sh is sourced successfully.
  - If CVMFS is available, the free and maximum cache space are added to the report.
  - Checks that the CMS_PATH variable is defined and that the directory exists.
  - Checks that scramv1_version is found.
  - Selects 10 random files from a specific CMSSW release and checks that the size reported by ls -l equals the byte count from wc -c.
- Space test
  - Checks that the worker node has 20 GB per core of free scratch space (VO CARD requirements).
  - Checks that the worker node has 5 GB per core of free scratch space.
  - Checks that the worker node has at least 10 MB of free space in /tmp.
- Proxy test
  - Checks that the CERT directory exists.
  - Checks that the /tmp/x509up_u**** file exists.
  - Checks that voms-proxy-info executes successfully.
  - Checks whether the proxy validity ends in less than 6 hours.
- Site config validation
  - Checks that the JobConfig directory and site-local-config.xml exist and that specific strings are defined in siteconf.
- Squid test
- Singularity test
  - Checks whether Singularity is missing from the system.
  - Checks whether the singularity command fails to execute.
- StageOut test
  - Gets the primary and secondary stage-out methods from site-local-config.xml.
  - Checks that the tools used for primary and secondary stage-out work from within CMS Singularity environments.
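Several of the checks above reduce to simple threshold comparisons. The Python sketch below illustrates the clock skew, CPU load, and scratch space rules as described in the list. It is not the PSST implementation: all function and constant names are hypothetical, it assumes the remote timestamp is taken from the HTTP Date header, and it assumes the 5 GB and 20 GB per-core limits map to the error and warning cases respectively.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen

MAX_CLOCK_SKEW_S = 60            # maximum allowed clock skew, from the list above
WARN_SCRATCH_GB_PER_CORE = 20    # assumed to be the warning threshold
ERROR_SCRATCH_GB_PER_CORE = 5    # assumed to be the error threshold


def fetch_date_header(url, timeout=10):
    """Return the Date header of a HEAD request (requires network access)."""
    with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
        return response.headers["Date"]


def clock_skew_seconds(date_header, local_now):
    """Absolute difference in seconds between an HTTP Date header and local_now."""
    remote = parsedate_to_datetime(date_header)
    return abs((local_now - remote).total_seconds())


def cpu_load_warning(load_1min, physical_cpus, pilot_cores):
    """True if the one-minute load, alone or plus pilot cores, exceeds the CPU count."""
    return load_1min > physical_cpus or load_1min + pilot_cores > physical_cpus


def scratch_space_status(free_gb, cores):
    """Classify free scratch space per core against the two thresholds."""
    per_core = free_gb / cores
    if per_core < ERROR_SCRATCH_GB_PER_CORE:
        return "ERROR"
    if per_core < WARN_SCRATCH_GB_PER_CORE:
        return "WARNING"
    return "OK"


# Deterministic example: the header is 45 seconds behind the local clock.
now = datetime(2015, 10, 21, 7, 28, 45, tzinfo=timezone.utc)
skew = clock_skew_seconds("Wed, 21 Oct 2015 07:28:00 GMT", now)
print(skew, skew <= MAX_CLOCK_SKEW_S)  # 45.0 True
```

On a worker node, the inputs could come from fetch_date_header("https://cern.ch"), datetime.now(timezone.utc), os.getloadavg()[0], and shutil.disk_usage() on the scratch directory.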
Metrics
Metrics are added to the pilot's startd ClassAd.
| Metric Name | Description |
| PSST_TIMESTAMP | the time at which PSST starts running, generated by PSST |
| PSST_ISSUE_CODES | list of all test exit codes |
| PSST_ISSUE_TEXT | list of all test summary issue details |
| PSST_ISSUE_SEVERITY | final test severity |
The metrics can be retrieved from the pilot's startd ClassAd.
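One way to inspect these metrics is to dump the startd ClassAd as text (for example with condor_status -long) and pick out the PSST_ attributes. The Python sketch below assumes the usual "Name = value" line format of that output. The function name and the sample values are hypothetical, and the exact value formats written by PSST are not specified here.

```python
def parse_psst_metrics(classad_text):
    """Collect PSST_* attributes from 'Name = value' ClassAd lines."""
    metrics = {}
    for line in classad_text.splitlines():
        name, sep, value = line.partition("=")
        name = name.strip()
        if sep and name.startswith("PSST_"):
            # Keep the raw value, minus surrounding whitespace and quotes.
            metrics[name] = value.strip().strip('"')
    return metrics


# Hypothetical ClassAd excerpt; real values may be formatted differently.
sample = '''
Machine = "wn001.example.org"
PSST_TIMESTAMP = 1500000000
PSST_ISSUE_CODES = "10051,10054"
PSST_ISSUE_SEVERITY = "WARNING"
'''
print(parse_psst_metrics(sample))
```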
List of Return Codes Generated
PSST uses the range 10000 - 19999 of the standard CMS job exit codes, which defines failures related to the environment setup. The OK severity uses 0, NOTICE uses the range 10001 - 10020, WARNING uses the range 10021 - 10090, and ERROR uses the range 10091 - 10230.
- 0: Success
- 10001: ERROR: connectivity problems
- 10002: ERROR: cpu load is too high
- 10003: ERROR: CMS software initialization script cmsset_default.sh failed
- 10004: ERROR: CMS_PATH not defined
- 10005: ERROR: CMS_PATH directory does not exist
- 10006: ERROR: scramv1 command not found
- 10007: ERROR: ls -l and wc -c byte_count was not equal (randomly selecting 10 files from some CMSSW release)
- 10008: ERROR: scratch directory was not found
- 10009: ERROR: less than 5 GB/core of free space in scratch dir
- 10010: ERROR: could not find X509 certificate directory
- 10011: ERROR: could not find X509 proxy certificate
- 10012: ERROR: Unable to locate the glidein configuration file
- 10013: ERROR: No sitename string was found in site-local-config.xml
- 10014: ERROR: No PhEDEx node name found for local or fallback stageout in site-local-config.xml
- 10015: ERROR: No LOCAL_STAGEOUT section in site-local-config.xml
- 10016: ERROR: No frontier-connect section in site-local-config.xml
- 10017: ERROR: No calib-data section in site-local-config.xml
- 10018: ERROR: site-local-config.xml was not found
- 10019: ERROR: TrivialFileCatalog string missing in site-local-config.xml
- 10020: ERROR: event_data section is missing in site-local-config.xml
- 10021: ERROR: no proxy string in site-local-config.xml
- 10022: ERROR: failed squid test
- 10023: ERROR: Clock skew is bigger than 60 seconds
- 10031: ERROR: Can not find CMS software dir
- 10032: ERROR: No singularity on system
- 10033: ERROR: Failed to exec singularity command
- 10050: WARNING: test_squid.py: One of the load balance Squid proxies
- 10051: WARNING: less than 20 GB of free space in scratch directory
- 10052: WARNING: less than 10MB free in /tmp
- 10053: WARNING: CPU load of last minute + pilot cores is higher than number of physical CPUs
- 10054: WARNING: proxy shorter than 6 hours
https://gitlab.cern.ch/rmaciula/PSST/blob/master/exit_codes.txt
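The severity ranges stated at the start of this section can be expressed as a small lookup. This is a minimal Python sketch with hypothetical names. It follows the stated numeric ranges only (not the ERROR/WARNING labels attached to individual codes in the list), and treating any code outside those ranges as UNKNOWN is an assumption.

```python
# (severity, first code, last code), taken from the ranges stated in this section.
SEVERITY_RANGES = [
    ("NOTICE", 10001, 10020),
    ("WARNING", 10021, 10090),
    ("ERROR", 10091, 10230),
]


def severity_for(code):
    """Map a PSST return code to a severity name by numeric range."""
    if code == 0:
        return "OK"
    for name, first, last in SEVERITY_RANGES:
        if first <= code <= last:
            return name
    return "UNKNOWN"  # assumption: codes outside the stated ranges are unclassified


for code in (0, 10010, 10051, 10100):
    print(code, severity_for(code))
```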
Reporting the results
The results generated during testing are added to the ClassAd of the startd that the pilot starts. This is implemented with the function 'add_condor_vars_line'; more information can be found at http://glideinwms.fnal.gov/doc.prd/factory/custom_scripts.html. In the near future, all monitoring information will be available in Elasticsearch and Kibana.
In the old version, the PSST results were reported to Elasticsearch and could be seen in Kibana.
PSST configuration for Frontend
The lines below should be added to /etc/gwms-frontend/frontend.xml on the frontend machine:
<file absfname="/data/rmaciula/psst.tgz" after_entry="True" after_group="True" const="True" executable="False" period="0" prefix="GLIDEIN_PS_" untar="True" wrapper="False">
<untar_options absdir_outattr="CMS_PSST" cond_attr="TRUE"/>
</file>
<file absfname="/data/rmaciula/psst_wrapper.sh" after_entry="True" after_group="True" const="True" executable="True" period="0" prefix="GLIDEIN_PS_" untar="False" wrapper="False">
<untar_options cond_attr="TRUE"/>
</file>
To apply changes, run $ service gwms-frontend reconfig
Links to Related Information
Responsible:
RokasMaciulaitis