Pilot Startup Site Test (under development)

The Pilot Startup Site Test (PSST) checks the functionality needed for successful CMS production and analysis job execution on a computer. It is a unified version of the worker node tests from the GlideinWMS pilot, SAM, and HC. If the test is not successful, the return code identifies the missing functionality or broken component. The test reports its result into the pilot job ClassAd, and the result can be seen in Kibana.

List of Functionality Checks Performed

Functionality is checked sequentially. Checks marked Stop are fatal, and only the first missing functionality is reported to Elasticsearch. Checks marked ALERT! produce warnings, and only the last warning is reported.
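
The reporting rule above can be sketched as follows. This is a minimal illustration and not the PSST implementation: the CheckResult type, the select_reported function, and the severity labels are hypothetical names introduced here. The exit codes in the usage example are taken from the list of return codes on this page.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CheckResult:
    # Hypothetical record of one check outcome; not a PSST data structure.
    code: int      # exit code of the check, 0 means it passed
    severity: str  # assumed labels: "OK", "WARNING" or "ERROR"


def select_reported(results: List[CheckResult]) -> Tuple[Optional[int], Optional[int]]:
    """Return (first error code, last warning code) for results in execution order."""
    first_error = None
    last_warning = None
    for result in results:
        if result.severity == "ERROR" and first_error is None:
            first_error = result.code   # keep only the first fatal problem
        elif result.severity == "WARNING":
            last_warning = result.code  # later warnings overwrite earlier ones
    return first_error, last_warning


results = [
    CheckResult(0, "OK"),
    CheckResult(10053, "WARNING"),
    CheckResult(10052, "WARNING"),
    CheckResult(10011, "ERROR"),
    CheckResult(10022, "ERROR"),
]
print(select_reported(results))  # (10011, 10052)
```

With these inputs the first error (10011) and the last warning (10052) are selected; the earlier warning and the later error are dropped.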

PSST tests:

  1. Connectivity + clock skew test.
    • Stop Tries to fetch an HTTP header from google.com or cern.ch.
    • Stop If the HTTP header is fetched successfully, its timestamp is compared with the timestamp of the worker node. The maximum allowed clock skew is 60 seconds.
  2. CPU load test
    • ALERT! A warning is reported if the number of physical CPUs is lower than the CPU load of the last minute, or if the CPU load of the last minute plus the cores assigned to the pilot is higher than the number of CPUs.
  3. Software area test
    • Stop Checks that the CMS software area directory (usually /cvmfs/cms.cern.ch) exists and is readable.
    • Stop Checks that $SW_DIR/cmsset_default.sh is sourced successfully.
    • If CVMFS is available, the free and maximum cache space are added to the report.
    • Stop Checks that the CMS_PATH variable is defined and the directory exists.
    • Stop Checks that scramv1_version is found.
    • Stop Selects 10 random files from a specific CMSSW release and checks that the size reported by ls -l equals the byte count from wc -c.
  4. Space test.
    • ALERT! Checks that the worker node has 20 GB per core of free scratch space (VO card requirement).
    • Stop Checks that the worker node has 5 GB per core of free scratch space.
    • ALERT! Checks that the worker node has at least 10 MB of free space in /tmp.
  5. Proxy test.
    • Stop Checks that the CERT directory exists.
    • Stop Checks that the /tmp/x509up_u**** file exists.
    • ALERT! Checks that voms-proxy-info executes successfully.
    • ALERT! Warns if the proxy validity ends in less than 6 hours.
  6. Site config validation.
    • Checks that the JobConfig directory and site-local-config.xml exist and that specific strings are defined in siteconf.
  7. Squid test
  8. Singularity test
    • Stop Fails if Singularity is not found on the system.
    • Stop Fails if the singularity command cannot be executed.
  9. StageOut test
    • Gets the primary and secondary stage-out methods from site-local-config.xml.
    • Checks that the tools used for primary/secondary stage-out work from within the CMS Singularity environments.
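
As an illustration of the clock skew part of test 1, the comparison between an HTTP Date header and the local clock might look like the sketch below. It assumes the header has already been fetched (for example with an HTTP HEAD request), so it runs without network access. The function names and the sample timestamps are hypothetical; only the 60 second limit comes from this page.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

MAX_CLOCK_SKEW_SECONDS = 60  # limit stated for the connectivity test


def clock_skew_seconds(date_header: str, local_time: datetime) -> float:
    """Absolute difference between an HTTP Date header and the local clock."""
    server_time = parsedate_to_datetime(date_header)  # timezone-aware for "GMT"
    return abs((local_time - server_time).total_seconds())


def skew_is_acceptable(date_header: str, local_time: datetime) -> bool:
    """True if the skew does not exceed the allowed maximum."""
    return clock_skew_seconds(date_header, local_time) <= MAX_CLOCK_SKEW_SECONDS


# Sample values only; a real check would use the fetched header and
# datetime.now(timezone.utc) as the local time.
header = "Thu, 04 Apr 2019 12:00:00 GMT"
local_ok = datetime(2019, 4, 4, 12, 0, 30, tzinfo=timezone.utc)
local_bad = datetime(2019, 4, 4, 12, 2, 0, tzinfo=timezone.utc)
print(skew_is_acceptable(header, local_ok))   # True  (30 s skew)
print(skew_is_acceptable(header, local_bad))  # False (120 s skew)
```

Taking the absolute difference treats a worker node clock that runs ahead the same as one that runs behind.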

Metrics

Metrics are added to the pilot's Startd ClassAd:

Metric Name          Description
PSST_TIMESTAMP       Time at which PSST starts running (in test cases, generated by PSST)
PSST_ISSUE_CODES     List of all test exit codes
PSST_ISSUE_TEXT      List of summary issue details for all tests
PSST_ISSUE_SEVERITY  Final test severity
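
The sketch below shows one way the four metrics above could be assembled from individual test results. It is an assumption-laden illustration, not the PSST code: the build_metrics function, the comma and semicolon separators, and the rule that the final severity is the highest severity seen are all hypothetical. The severity labels follow the return code section of this page.

```python
from typing import Dict, List, Tuple

# Assumed ordering from least to most severe.
SEVERITY_ORDER = ["OK", "NOTICE", "WARNING", "ERROR"]


def build_metrics(timestamp: int, issues: List[Tuple[int, str, str]]) -> Dict[str, str]:
    """Build the PSST_* metrics from (exit_code, severity, text) tuples."""
    codes = ",".join(str(code) for code, _, _ in issues)
    texts = "; ".join(text for _, _, text in issues)
    # Assumption: the final severity is the most severe one encountered.
    final = max(
        (severity for _, severity, _ in issues),
        key=SEVERITY_ORDER.index,
        default="OK",
    )
    return {
        "PSST_TIMESTAMP": str(timestamp),
        "PSST_ISSUE_CODES": codes,
        "PSST_ISSUE_TEXT": texts,
        "PSST_ISSUE_SEVERITY": final,
    }


metrics = build_metrics(
    1554379200,
    [
        (10052, "WARNING", "less than 10MB free in /tmp"),
        (10011, "ERROR", "could not find X509 proxy certificate"),
    ],
)
for name, value in metrics.items():
    print(name, "=", value)
```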

Metrics that can be obtained from the pilot's Startd ClassAd:

Metric        In Glidein ClassAd  Description
HOSTNAME      Machine             Worker node on which PSST was running
CMS_SITENAME  GLIDEIN_CMSSite     CMS site name

List of Return Codes Generated

PSST uses the range 10000 to 19999 of the standard CMS job exit codes, which is defined for failures related to the environment setup. OK severity uses 0, NOTICE severity uses the range 10001 to 10020, WARNING severity uses the range 10021 to 10090, and ERROR severity uses the range 10091 to 10230.
  • 0: Success
  • 10001: ERROR: connectivity problems
  • 10002: ERROR: cpu load is too high
  • 10003: ERROR: CMS software initialization script cmsset_default.sh failed
  • 10004: ERROR: CMS_PATH not defined
  • 10005: ERROR: CMS_PATH directory does not exist
  • 10006: ERROR: scramv1 command not found
  • 10007: ERROR: ls -l and wc -c byte counts were not equal (randomly selecting 10 files from some CMSSW release)
  • 10008: ERROR: scratch directory was not found
  • 10009: ERROR: less than 5 GB/core of free space in scratch dir
  • 10010: ERROR: could not find X509 certificate directory
  • 10011: ERROR: could not find X509 proxy certificate
  • 10012: ERROR: Unable to locate the glidein configuration file
  • 10013: ERROR: No sitename string was found in site-local-config.xml
  • 10014: ERROR: No PhEDEx node name found for local or fallback stageout in site-local-config.xml
  • 10015: ERROR: No LOCAL_STAGEOUT section in site-local-config.xml
  • 10016: ERROR: No frontier-connect section in site-local-config.xml
  • 10017: ERROR: No calib-data section in site-local-config.xml
  • 10018: ERROR: site-local-config.xml was not found
  • 10019: ERROR: TrivialFileCatalog string missing in site-local-config.xml
  • 10020: ERROR: event_data section is missing in site-local-config.xml
  • 10021: ERROR: no proxy string in site-local-config.xml
  • 10022: ERROR: failed squid test
  • 10023: ERROR: Clock skew is bigger than 60 seconds
  • 10031: ERROR: Cannot find CMS software dir
  • 10032: ERROR: No singularity on system
  • 10033: ERROR: Failed to exec singularity command

  • 10050: WARNING: test_squid.py: One of the load balance Squid proxies
  • 10051: WARNING: less than 20 GB of free space in scratch directory
  • 10052: WARNING: less than 10MB free in /tmp
  • 10053: WARNING: CPU load of last minutes + pilot cores is higher than number of physical CPUs
  • 10054: WARNING: proxy shorter than 6 hours

The exit codes are listed in https://gitlab.cern.ch/rmaciula/PSST/blob/master/exit_codes.txt

Reporting the results

The results generated during testing are added to the ClassAd of the startd that the pilot starts. This is implemented with the function 'add_condor_vars_line'; more information can be found at http://glideinwms.fnal.gov/doc.prd/factory/custom_scripts.html. In the near future, all monitoring information will be available in Elasticsearch and Kibana. In the old version, the PSST results were reported to Elasticsearch and could be seen in Kibana.

PSST configuration for Frontend

The lines below should be added to /etc/gwms-frontend/frontend.xml on the frontend machine:

      <file absfname="/data/rmaciula/psst.tgz" after_entry="True" after_group="True" const="True" executable="False" period="0" prefix="GLIDEIN_PS_" untar="True" wrapper="False">
         <untar_options absdir_outattr="CMS_PSST" cond_attr="TRUE"/>
      </file>
      <file absfname="/data/rmaciula/psst_wrapper.sh" after_entry="True" after_group="True" const="True" executable="True" period="0" prefix="GLIDEIN_PS_" untar="False" wrapper="False">
         <untar_options cond_attr="TRUE"/>
      </file>

To apply changes, run $ service gwms-frontend reconfig

Links to Related Information

Responsible: RokasMaciulaitis

Topic revision: r39 - 2019-04-04 - XiaoweiJiang
 