CMS SAM tests

SAM tests status

WLCG Site availability

  • The definition is here. To make a long story short:
    • the daily service availability is the fraction of time when all critical tests were ok
    • the daily site availability is the fraction of time when all services were ok
      • if a site has multiple CEs, it is enough that one of them is available
      • if a site has multiple SEs, all of them must be available
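The rules above can be sketched in a few lines of Python. This is only an illustration, not the WLCG or SAM implementation: it assumes the day is split into equal time slots, that each service is described by one boolean per slot (True when all its critical tests were ok), and that the site has at least one CE. The function names are hypothetical.

```python
from typing import Dict, List


def service_availability(slots: List[bool]) -> float:
    """Fraction of time slots in which all critical tests of one service were ok."""
    return sum(slots) / len(slots) if slots else 0.0


def site_availability(ce_ok: Dict[str, List[bool]],
                      se_ok: Dict[str, List[bool]]) -> float:
    """Fraction of time slots in which the site as a whole was ok.

    A slot counts as ok when at least one CE is ok and all SEs are ok.
    Assumes every list has the same length and ce_ok is not empty.
    """
    n_slots = len(next(iter(ce_ok.values())))
    ok_slots = 0
    for i in range(n_slots):
        any_ce = any(status[i] for status in ce_ok.values())
        all_se = all(status[i] for status in se_ok.values())
        if any_ce and all_se:
            ok_slots += 1
    return ok_slots / n_slots


# Hypothetical example: two CEs and two SEs sampled in four slots.
ces = {"ce1": [True, False, True, True], "ce2": [False, True, True, False]}
ses = {"se1": [True, True, True, False], "se2": [True, True, False, False]}
print(service_availability(ces["ce1"]))
print(site_availability(ces, ses))
```

In this example the site is ok in the first two slots only: the CEs cover each other's failures, while a single failing SE is enough to mark a slot as bad.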

In SAM one can define several profiles, which correspond to different sets of critical tests. For CMS there are two profiles:

  • CMS_CRITICAL: it contains only org.sam.CONDOR-JobSubmit, org.cms.WN-env, org.cms.WN-isolation, org.cms.SE-GSIftp-9summary, and org.cms.SE-WebDAV-9summary
  • CMS_CRITICAL_FULL: it contains (almost) all the CMS SAM tests.
The former is used to produce reports for the WLCG management, where issues depending on experiment configuration should not be visible, while CMS_CRITICAL_FULL should always be used in the context of CMS computing operations.

Current test list

| Name | Description | Maintainer | Status |
| org.cms.WN-env | Verify basic, CMS-independent WN setup and print some useful information | A. Sciabà | Running |
| org.cms.WN-basic | Minimal "site is alive" test. Verify local site configuration, SW installation and TFC | A. Sciabà | Running |
| org.cms.WN-mc | Verify site is OK for MC Production. Check stage out and clean up | J. Hernandez | Running |
| org.cms.WN-remotestageout | Verify site can stage out to a remote site. Check stage out and clean up | N. Magini | Running |
| org.cms.WN-squid | Verify Squid is working and can fetch from ORCOFF | L. Linares | Running |
| org.cms.WN-frontier | Verify CMSSW jobs access non-event data via Frontier | L. Linares | Running |
| org.cms.WN-analysis | Site is validated for user/organised Data Analysis | S. Belforte | Running |
| org.cms.WN-isolation | Runs the Singularity and the glexec tests and passes if at least one of them passes | B. Bockelman | Running |
| org.cms.WN-CVMFS | Check that CVMFS is deployed and working correctly | A. Sciabà | Running |
| org.cms.WN-xrootd-access | Data access via xrootd | B. Bockelman | Running |
| org.cms.WN-xrootd-fallback | Access fallback mechanism | B. Bockelman | Running |
| org.cms.SE-xrootd-* | Verify site xrootd endpoint(s), without relying on other sites | St. Lammel | Running |
| org.cms.SE-WebDAV-* | Test WebDAV protocol access of SE | St. Lammel | Running |
| org.cms.SE-GSIftp-* | Test GSIftp protocol access of SE | St. Lammel | Running |
| org.cms.WN-swinst | WN-CVMFS predecessor | C. Wissing | Not Running |
| SRMv2-* | Verify site SRMv2 from the SAM UI, without relying on other sites | N. Magini | Not Running |

Links to the source code:

Links to useful documentation:

Site-specific functionality to test

  • CMS can send a job to the site with all the relevant VOMS (Virtual Organization Membership Service) FQANs
  • The WN fulfills the VO card prescriptions
  • The CVMFS (CERN Virtual Machine File System) CMS software area exists and is readable
  • The required middleware libraries are installed
    • lcg_utils (obsolete) and the GFAL2 client. Others?
  • It is possible to write from the WN to the local SE with the protocol(s) chosen by CMS in /store/unmerged and /store/temp
  • It is possible to read data in the local SE from the WN with the protocol(s) chosen by CMS in /store/unmerged and /store/temp
    • Again, the reading should be done using WMCore code
  • The xrootd fallback works
  • The local squid server works
  • The connectivity to all the relevant central services works
    • Currently this includes:
      • the DQM GUI (Data Quality Monitoring) for harvesting jobs
      • cmsweb (which automatically covers all services running behind it)
  • (Obsolete) The remote stageout to a (set of) reference SE(s) via lcg-cp works
    • (Obsolete) We need to have at least a few reference SEs, on both sides of the Atlantic. Tier-1 sites are not an option due to access restrictions.
  • Singularity works as required by glideins
  • It is possible to copy files from the Nagios server to the site SE (in /store/merged and /store/temp) and vice versa
  • It is possible to delete files from the site SE

CMS-specific functionality to test

  • (Obsolete) The CMS software is correctly installed (assuming that the software area exists and is readable)
  • (Obsolete) The software tags are consistent with the installed software
  • The site local configuration is complete and consistent with the GIT version
  • The LFN-to-PFN (Logical File Name to Physical File Name) translation works well
    • At the very least we check that a PFN is generated, as today. It might be extended to test different typical LFNs
  • The trivial file catalogue is complete and consistent with the GIT version and the PhEDEx version
  • The Frontier system is working (assuming that the local squid server works)
  • It is possible to run a test analysis on a local dataset (assuming that read access works)
  • It is possible to locally stage out data using the relevant WMCore code and the fallback mechanism
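The LFN-to-PFN check mentioned in the list above could be extended to several typical LFNs along these lines. This is a minimal sketch that assumes the simplest possible translation, a plain prefix prepended to the LFN; real site configurations may use more elaborate rules. The function name, the prefix and the file names are hypothetical.

```python
def lfn_to_pfn(lfn: str, prefix: str) -> str:
    """Translate an LFN into a PFN by prepending a site-specific prefix.

    Assumption of this sketch: the translation is a pure prefix rule
    and all CMS LFNs start with /store/.
    """
    if not lfn.startswith("/store/"):
        raise ValueError("not a CMS LFN: " + lfn)
    return prefix.rstrip("/") + lfn


# Hypothetical prefix and LFNs, for illustration only.
PREFIX = "davs://se.example.org:2880/pnfs/example.org/cms"
TYPICAL_LFNS = [
    "/store/unmerged/test/file1.root",
    "/store/temp/test/file2.root",
]

for lfn in TYPICAL_LFNS:
    pfn = lfn_to_pfn(lfn, PREFIX)
    # The minimal check described above: a PFN is generated for every LFN.
    print(lfn, "->", pfn)
```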

SAM tests guidelines

  • Site-related tests may be CMS-specific in their implementation but should be generic in their meaning (i.e. any LHC VO could develop equivalent tests). Whenever possible the implementation should also be generic to avoid duplication of effort and improve consistency among VOs.
  • The test output must be in text format and the last line before exiting must have the format summary: ERROR_STRING, where ERROR_STRING is replaced by a machine-parseable string that gives some information about why the test failed.
  • The text output should be as concise as possible and use key: value structures whenever practical.
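As an illustration of the output guidelines above, a test could assemble its report as follows. This is only a sketch: the helper name, the keys and the error string are hypothetical, and only the key: value layout and the final summary line come from the guidelines.

```python
def format_report(fields: dict, error_string: str) -> str:
    """Build concise text output: key: value lines, then the summary line last."""
    lines = [f"{key}: {value}" for key, value in fields.items()]
    # The last line before exiting must have the format "summary: ERROR_STRING".
    lines.append(f"summary: {error_string}")
    return "\n".join(lines)


# Hypothetical values, for illustration only.
report = format_report(
    {"hostname": "wn001.example.org", "squid": "not reachable"},
    "SQUID_NOT_REACHABLE",
)
print(report)
```

Keeping the summary string short and free of spaces makes it easy to parse and to group failures by cause.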

SAM CE/Worker Node Timeouts/Limits

  • SAM ETF has a timeout of 1410 minutes (23.5 hours) for jobs and a limit of 1380 minutes (23 hours) for jobs in idle state.
  • Jobs are submitted to HTCondor-CE and ARC-CE with a 30-minute wall time request.
  • The SAM job executes two worker node tests simultaneously, each with a 570 sec timeout. When all tests have completed, or after 600 sec (10 min), the SAM job shuts down, sending any remaining tests a SIGTERM followed by a SIGKILL 25 sec later.
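Test authors may want to check how their test behaves when it is stopped at the deadline. The sketch below is not the ETF code; it only reproduces the general pattern of a deadline followed by a grace period, using Python's subprocess module. The function name is hypothetical, and the sketch uses Popen.terminate() and Popen.kill(), which send SIGTERM and SIGKILL on POSIX systems; the choice of the final signal is an assumption of the sketch.

```python
import subprocess
import sys


def run_with_limit(cmd, timeout_s, grace_s):
    """Run cmd; after timeout_s send SIGTERM, after a further grace_s kill it.

    Returns the exit code of the process (negative signal number on POSIX
    when the process was ended by a signal).
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.terminate()  # polite request to stop
        try:
            return proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            proc.kill()  # the process ignored the request
            return proc.wait()


# Short limits for illustration; the limits quoted above are 600 s and 25 s.
quick = run_with_limit([sys.executable, "-c", "print('ok')"], 30, 5)
slow = run_with_limit([sys.executable, "-c", "import time; time.sleep(60)"], 1, 5)
print(quick, slow)
```

A test that traps SIGTERM can use the grace period to print its summary line before it is killed.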

Requirements for sites to receive SAM tests

  • site needs to be registered with CMS
  • CE resources, i.e. HTCondor-CE or ARC-CE services, need to have a glide-in WMS factory entry
    • if a production factory has an entry the resource is considered in production and SAM results of it will be included in the site evaluation
    • if there is only an entry in an integration factory, the resource will be flagged as test and results disregarded in the site evaluation
  • storage endpoints, i.e. GSI/GridFTP and/or WebDAV services, need to have a protocol entry in the Rucio Storage Element, RSE
    • sites specify the prefix/LFN-to-PFN translation in their SITECONF storage.json file
    • if the RSE is set to auto-update, new protocol definitions and updates to existing protocols are automatically propagated to Rucio; protocol removals are never propagated automatically and always require a GGUS ticket to "CMS Datatransfers". If auto-update is not enabled, a GGUS ticket to "CMS Datatransfers" is required to propagate the storage.json update to Rucio
  • XRootD endpoints, i.e. the site redirector(s), are handled separately and require a GGUS ticket (or email) to "CMS Site Support"
  • perfSONAR endpoints are not yet probed but the ones of OSG sites are already defined; to update an endpoint please submit a GGUS ticket (or email) to "CMS Site Support".
  • we recommend registering production computing resources with the grid middleware provider, i.e. EGI or OSG, so that downtime information can be considered in SAM reliability calculations
  • the VO-feed checks and updates endpoint information from the various sources every half hour
  • the Experiment Test Framework (ETF) service schedules and runs the SAM tests and updates its configuration from the VO-feed twice a day

Failure reasons from production experience

  • Transfers (Stefan P.)
    • missing input files due to storage inconsistency, causing massive job failures, resolved by a BDV check and invalidations
  • Prompt Skimming (Diego B.)
    • I/O errors due to
      • Files not available for reading due to storage system errors
      • Corrupted copies of the files
      • Errors when writing out the produced files
    • Software errors: problems running the CMSSW setup scripts
  • Processing and Production (Edgar F.)
    • see Prompt Skimming
    • permission problems for glideins when they stage out (via lcg-cp) to /store/unmerged or /store/mc, or when they read files from the SE, sometimes while site admins do not see any problem
    • problems related to memory limits: it might be useful to have test glidein jobs that go to the memory limit and see what happens
    • insufficient level of replication of MinBias datasets (shows up only when there are hundreds of concurrent jobs)

What to do if tests fail

Hints for site admin on how to find/fix the reasons for SAM CMS test failures

Running SAM tests manually

If a SAM test is failing and you want to run the tests locally on your machine:

  • Create a Proxy
    • voms-proxy-init -voms cms
    • Enter Grid Password
    • Copy the path from the output of the command, for example: "Created proxy in /tmp/x509up_u124228"
    • Paste it after "X509_USER_PROXY=" in the next step.

  • Define and update the location of the proxy
    • export X509_CERT_DIR=/etc/grid-security/certificates
    • export X509_USER_PROXY=/tmp/x509up_u124228

  • Run the SAM test, for example:
    • python cmssam/SiteTests/testjob/tests/CE-cms-xrootd-fallback
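The steps above can also be wrapped in a small Python helper, which avoids copying the proxy path by hand. This is a sketch under assumptions: the helper names are hypothetical, the proxy is expected to exist already (created with voms-proxy-init as described above), and the test is started with the same python command as in the example.

```python
import os
import subprocess


def sam_test_env(proxy_path, cert_dir="/etc/grid-security/certificates",
                 base_env=None):
    """Return a copy of the environment with the two X509 variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["X509_CERT_DIR"] = cert_dir
    env["X509_USER_PROXY"] = proxy_path
    return env


def run_sam_test(test_path, proxy_path):
    """Run one SAM test script locally and return its exit code."""
    if not os.path.isfile(proxy_path):
        raise FileNotFoundError("no proxy at " + proxy_path
                                + ", run voms-proxy-init -voms cms first")
    result = subprocess.run(["python", test_path], env=sam_test_env(proxy_path))
    return result.returncode


# Example call, using the paths from the steps above:
# run_sam_test("cmssam/SiteTests/testjob/tests/CE-cms-xrootd-fallback",
#              "/tmp/x509up_u124228")
```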

How to resubmit the SAM tests

To run an on-demand CondorG submission test (and consequently re-run all the WN tests) one can:
  • go to the SAM ETF web page
  • expand "Services" in the toolbar on the left and select "All Services"
  • find your CE in the window and, in the third column ("Icons") of the row for "org.sam.CONDOR-JobState-/cms/Role=..." (choose the desired role), click on the green forward-backward arrow pair to trigger a submission

For the SRM tests the procedure is identical but the test to resubmit is org.cms.SRM-AllCMS-/cms/Role=production.

How to retransfer the SAM test dataset

If files in a SAM dataset are lost or corrupted, please open a GGUS ticket with the transfer/data management team (CMS Support Unit = "CMS Datatransfers") and ask them to locally invalidate the dataset. Rucio will then automatically re-transfer the dataset. (Corrupted files don't need to be deleted locally; Rucio will take care of removing or overwriting any existing file.) The active SAM datasets are:
  • old SAM dataset: /GenericTTbar/SAM-CMSSW_5_3_1_START53_V5-v1/GEN-SIM-RECO (4 files, 3.4 GB)
  • new SAM dataset: /GenericTTbar/SAM-CMSSW_9_0_0_90X_mcRun1_realistic_v4-v1/AODSIM (2 files, 2.3 GB)
  • newest SAM dataset: /GenericTTbar/SAM-CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/AODSIM (3 files, 2.3 GB)
Topic revision: r57 - 2022-06-01 - StephanLammel