CMS SAM tests
SAM tests status
WLCG Site availability
- The definition is here. To make a long story short:
- the daily service availability is the fraction of time when all critical tests were ok
- the daily site availability is the fraction of time when all services were ok
- if a site has multiple CEs, it is enough that one of them is available
- if a site has multiple SEs, all of them must be available
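The rules above can be sketched in a few lines of Python. This is only an illustration of the definition, not the code used by SAM: the function names, the representation of a day as a list of equal-length time bins, and the sample host names are all assumptions made for the example.

```python
from typing import Dict, List


def service_availability(critical_ok: List[bool]) -> float:
    """Fraction of time bins in which all critical tests of one service were OK."""
    if not critical_ok:
        return 0.0
    return sum(critical_ok) / len(critical_ok)


def site_availability(ce_ok: Dict[str, List[bool]],
                      se_ok: Dict[str, List[bool]]) -> float:
    """Fraction of time bins in which the site counts as available.

    A bin counts as available when at least one CE is OK and all SEs are OK.
    Every list is assumed to cover the same time bins.
    """
    n_bins = len(next(iter(ce_ok.values())))
    ok_bins = 0
    for i in range(n_bins):
        any_ce = any(bins[i] for bins in ce_ok.values())
        all_se = all(bins[i] for bins in se_ok.values())
        if any_ce and all_se:
            ok_bins += 1
    return ok_bins / n_bins


# Hypothetical site with two CEs and two SEs, four time bins in the day.
ces = {"ce1": [True, False, True, True], "ce2": [False, True, False, True]}
ses = {"se1": [True, True, True, False], "se2": [True, True, True, True]}
print(site_availability(ces, ses))
```

In this example the site is unavailable only in the last bin, where one of the two SEs fails, even though both CEs are fine.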
In SAM one can define several profiles, which correspond to different sets of critical tests. For CMS there are two profiles:
- CMS_CRITICAL: it contains only org.sam.CONDOR-JobSubmit, org.cms.WN-env, org.cms.WN-isolation, org.cms.SRM-GetPFNFromTFC, org.cms.SRM-VOPut and org.cms.SRM-VOGet
- CMS_CRITICAL_FULL: it contains (almost) all the CMS SAM tests.
The former is used to produce reports for the WLCG management, where issues depending on experiment configuration should not be visible, while CMS_CRITICAL_FULL should always be used in the context of CMS computing operations.
Current test list
Name | Description | Maintainer | Status |
SRMv2-* | verify site SRMv2 from the SAM UI, without relying on other sites | N. Magini | Running |
SRMv2 |
org.cms.WN-xrootd-fallback | Access fallback mechanism | B. Bockelman | Running |
org.cms.WN-xrootd-access | Data access via xrootd | B. Bockelman | Running |
org.cms.WN-swinst | verification of SW installation and that CMSSW can be installed remotely | C. Wissing | Running |
org.cms.WN-squid | verify Squid is working and can fetch from ORCOFF | L. Linares | Running |
org.cms.WN-remotestageout | verify site can stageout to a remote site. Check stage out and clean up | N. Magini | Running |
org.cms.WN-mc | verify site is OK for MC Production. Check stage out and clean up | J. Hernandez | Running |
org.cms.WN-isolation | This test runs the Singularity and the glexec tests and passes if at least one of them passes | B. Bockelman | Running |
org.cms.WN-frontier | verify CMSSW jobs access non-event data via Frontier | L. Linares | Running |
org.cms.WN-env | Verify basic WN setup, CMS-independent and print some useful information | A. Sciabà | Running |
org.cms.WN-CVMFS | check that CVMFS is deployed and working correctly | A. Sciabà | Running |
org.cms.WN-basic | minimal "site is alive" test. Verify local site configuration, SW installation and TFC | A. Sciabà | Running |
org.cms.WN-analysis | site is validated for user/organised Data Analysis | S. Belforte | Running |
org.cms.SE-xrootd-* | verify site xrootd endpoint(s), without relying on other sites | St. Lammel | Testing |
CE |
Links to the source code:
Links to useful documentation:
Site-specific functionality to test
- CMS can send a job to the site with all the relevant VOMS (Virtual Organization Membership Service) FQANs
- The WN fulfills the VO card prescriptions
- The CVMFS (CERN Virtual Machine File System) CMS software area exists and is readable
- The required middleware libraries are installed
- lcg_utils (obsolete) and the GFAL2 client. Others?
- It is possible to write from the WN to the local SE with the protocol(s) chosen by CMS in /store/unmerged and /store/temp
- As different directories are writeable by different roles, the test will need to be run multiple times with different roles. The writing should be done using WMCore (Workflow Management Core) code
- It may be appropriate to clone the test and give it different names depending on the FQAN/output directory
- Note from SB, as charged during the meeting: the test should rather use and respect the CMS LFN space as defined/described here (and the two twikis should be made consistent)
- It is possible to read data in the local SE from the WN with the protocol(s) chosen by CMS in /store/unmerged and /store/temp
- Again, the reading should be done using WMCore code
- The xrootd fallback works
- The local squid server works
- The connectivity to all the relevant central services works
- Currently this includes:
- the DQM GUI (Data Quality Monitoring) for harvesting jobs
- cmsweb (which automatically covers all services running behind it)
- (Obsolete) The remote stageout to a (set of) reference SE(s) via lcg-cp works
- (Obsolete) We need to have at least a few reference SEs, on both sides of the Atlantic. Tier-1 sites are not an option due to access restrictions.
- Singularity works as required by glideins
- It is possible to copy files from the Nagios server to the site SE (in /store/merged and /store/temp) and vice versa
- It is possible to delete files from the site SE
CMS-specific functionality to test
- The CMS software is correctly installed (assuming that the software area exists and is readable) (obsolete)
- The software tags are consistent with the installed software (obsolete)
- The site local configuration is complete and consistent with the GIT version
- The LFN-to-PFN (Logical File Name to Physical File Name) translation works well
- At the very least we check that a PFN is generated, as today. It might be extended to test different typical LFNs
- The trivial file catalogue is complete and consistent with the GIT version and the PhEDEx version
- The Frontier system is working (assuming that the local squid server works)
- It is possible to run a test analysis on a local dataset (assuming that read access works)
- It is possible to locally stage out data using the relevant WMCore code and the fallback mechanism
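The LFN-to-PFN check in the list above ("a PFN is generated", possibly for several typical LFNs) can be illustrated with a small Python sketch. It assumes that the translation rules are available as pairs of a regular expression and a result template with $1-style placeholders, similar in spirit to trivial file catalogue rules. The function name, the rule list and the storage paths are hypothetical; this is not the WMCore or CMSSW implementation.

```python
import re
from typing import List, Optional, Tuple

# (regular expression matching the LFN, result template using $1, $2, ...)
Rule = Tuple[str, str]


def lfn_to_pfn(lfn: str, rules: List[Rule]) -> Optional[str]:
    """Return the PFN given by the first matching rule, or None if none matches."""
    for pattern, template in rules:
        match = re.fullmatch(pattern, lfn)
        if match is None:
            continue
        pfn = template
        # Substitute each captured group for its $N placeholder.
        for index, group in enumerate(match.groups(), start=1):
            pfn = pfn.replace(f"${index}", group or "")
        return pfn
    return None


# Hypothetical rules: the more specific rule comes first.
RULES: List[Rule] = [
    (r"/+store/temp/(.*)", "/hypothetical/scratch/store/temp/$1"),
    (r"/+store/(.*)", "/hypothetical/storage/cms/store/$1"),
]

# A few typical LFNs; the check fails if any of them yields no PFN.
for lfn in ["/store/temp/user/file.root", "/store/unmerged/file.root"]:
    print(lfn, "->", lfn_to_pfn(lfn, RULES))
```

A test built this way would report an error when lfn_to_pfn returns None for any of the typical LFNs.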
SAM tests guidelines
- Site-related tests may be CMS-specific in their implementation but should be generic in their meaning (i.e. any LHC VO could develop equivalent tests). Whenever possible the implementation should also be generic to avoid duplication of effort and improve consistency among VOs.
- The test output must be in text format and the last line before exiting must have the format "summary: ERROR_STRING", where ERROR_STRING is replaced by a machine-parseable string that gives some information about why the test failed.
- The text output should be as concise as possible and use "key: value" structures whenever practical.
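As an illustration of the output guidelines above, the following Python sketch writes "key: value" lines followed by a final "summary:" line, and reads the summary back from the last line. The helper names, the keys and the error string are hypothetical examples, not names used by the actual tests.

```python
from typing import Dict


def format_test_output(details: Dict[str, str], error_string: str) -> str:
    """Build the text output: 'key: value' lines, then the final summary line."""
    lines = [f"{key}: {value}" for key, value in details.items()]
    lines.append(f"summary: {error_string}")
    return "\n".join(lines)


def parse_summary(output: str) -> str:
    """Extract ERROR_STRING from the last line of a test output."""
    last_line = output.strip().splitlines()[-1]
    prefix = "summary: "
    if not last_line.startswith(prefix):
        raise ValueError("last line is not a summary line")
    return last_line[len(prefix):]


# Hypothetical test result.
output = format_test_output(
    {"site": "T2_XX_Example", "check": "cvmfs_mount"},
    "CVMFS_NOT_MOUNTED",
)
print(output)
```

Keeping the summary on the last line means a consumer only needs to look at one line to classify the failure.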
Requirements for sites to receive SAM tests
- EGI and NorduGrid sites
- Site
- Register the site in GOCDB
- Register the site in CMS SiteDB. Make sure that the "LCG Name" of your site in SiteDB is the same as your site name in GOCDB
- SRMv2
- Register the storage element in GOCDB
- Register the storage element in SiteDB
- To also pass the tests, run PhEDEx Production agents publishing the storage element in the TrivialFileCatalog.
- CREAM-CE/ARC-CE
- Register the computing element in GOCDB
- Publish the computing element in BDII
- Make sure that the publication is propagated to the top-level BDII lcg-bdii.cern.ch
- The computing element should be published as "close" to your storage element
- Make sure that the attribute GlueCEImplementationName is published correctly: "CREAM" for CREAM-CE, "ARC-CE" for ARC-CE
- OSG sites
- Site
- Register the site in OIM
- Register the site in CMS SiteDB. Make sure that the "LCG Name" of your site in SiteDB is the same as your "Resource Group" name in MyOSG
- OSG-SRMv2
- Register the storage element in OIM
- Register the storage element in SiteDB
- To also pass the tests, run PhEDEx Production agents publishing the storage element in the TrivialFileCatalog.
- OSG-CE
- Register the computing element in OIM
- Publish the computing element in BDII
- Make sure that the publication is propagated to the top-level BDII lcg-bdii.cern.ch
- The computing element should be published as "close" to your storage element
- If you have followed the steps above correctly, your services should appear in the link below. Please also verify that they appear with the correct flavour:
- If the services are also published correctly in GOCDB/OIM, you should then start to receive SAM tests automatically.
Failure reasons from production experience
- Transfers (Stefan P.)
- missing input files due to storage inconsistency, causing massive job failures, resolved by a BDV check and invalidations
- Prompt Skimming (Diego B.)
- I/O errors due to
- Files not available for reading due to storage system errors
- Corrupted copies of the files
- Errors when writing out the produced files
- Software errors: problems running the CMSSW setup scripts
- Processing and Production (Edgar F.)
- see Prompt Skimming
- permission problems for glideins when they stage out (via lcg-cp) to /store/unmerged or /store/mc, or when they read files from the SE, even though site admins sometimes do not see any problem
- problems related to memory limits: it might be useful to have test glidein jobs that go to the memory limit and see what happens
- insufficient level of replication of MinBias datasets (shows up only when there are hundreds of concurrent jobs)
What to do if tests fail
Hints for site admin on how to find/fix the reasons for SAM CMS test failures
Running SAM tests manually
If a SAM test is failing and you want to run the tests locally on your machine:
- Clone the cmssam repository from gitlab using:
- Create a Proxy
- voms-proxy-init -voms cms
- Enter Grid Password
- Copy the path from the output of the command, for example: "Created proxy in /tmp/x509up_u124228"
- Paste it after "X509_USER_PROXY=" in the next step.
- Define and update the location of the proxy
- export X509_CERT_DIR=/etc/grid-security/certificates
- export X509_USER_PROXY=/tmp/x509up_u124228
- Run the SAM test, for example:
- python cmssam/SiteTests/testjob/tests/CE-cms-xrootd-fallback
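The manual steps above (take the proxy path from the voms-proxy-init output, export the two X509 variables, run the test) can also be wrapped in a small Python helper. This is a minimal sketch under stated assumptions: the function names are hypothetical, the proxy is assumed to have been created already, and the test is assumed to be runnable with the python found in the PATH.

```python
import os
import re
import subprocess
from typing import Dict, Optional


def extract_proxy_path(voms_output: str) -> Optional[str]:
    """Return the path from a 'Created proxy in <path>' line, if present."""
    match = re.search(r"Created proxy in (\S+)", voms_output)
    if match is None:
        return None
    # The line may end with a period; drop it if present.
    return match.group(1).rstrip(".")


def build_test_env(proxy_path: str,
                   cert_dir: str = "/etc/grid-security/certificates") -> Dict[str, str]:
    """Copy the current environment and set the two X509 variables."""
    env = dict(os.environ)
    env["X509_CERT_DIR"] = cert_dir
    env["X509_USER_PROXY"] = proxy_path
    return env


def run_sam_test(test_path: str, proxy_path: str) -> int:
    """Run one SAM test script with the proxy environment; return its exit code."""
    result = subprocess.run(["python", test_path], env=build_test_env(proxy_path))
    return result.returncode
```

With the values from the example above, the call would be run_sam_test("cmssam/SiteTests/testjob/tests/CE-cms-xrootd-fallback", "/tmp/x509up_u124228").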
How to resubmit the SAM tests
To run an on-demand CondorG submission test (and consequently re-run all the WN tests) one can:
- go to the SAM ETF web page
- expand "Services" in the toolbar on the left and select "All Services"
- find your CE in the window and, in the third column ("Icons"), on the row for "org.sam.CONDOR-JobState-/cms/Role=..." (choose the desired role), click on the green forward-backward arrow pair to trigger a submission
For the SRM tests the procedure is identical but the test to resubmit is org.cms.SRM-AllCMS-/cms/Role=production.
How to retransfer the SAM test dataset
Site data admins can re-transfer the SAM test dataset themselves. Go to the PhEDEx web page, i.e. https://cmsweb.cern.ch/phedex/prod/Request::Create?type=delete
Select 'remove subscriptions' = no, select your site, and type the dataset name in 'Data Items':
/GenericTTbar/SAM-CMSSW_5_3_1_START53_V5-v1/GEN-SIM-RECO
/GenericTTbar/SAM-CMSSW_9_0_0_90X_mcRun1_realistic_v4-v1/AODSIM
After you approve the request, PhEDEx will automatically delete and retransfer the dataset.