CMS SAM tests

SAM tests status

WLCG Site availability

  • The definition is here. To make a long story short:
    • the daily service availability is the fraction of time when all critical tests were ok
    • the daily site availability is the fraction of time when all services were ok
      • if a site has multiple CEs, it is enough that one of them is available
      • if a site has multiple SEs, all of them must be available
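
The rules above can be sketched in a few lines of Python. This is only an illustration, not the actual WLCG algorithm: it assumes the day is split into equal time bins, that each service already has one boolean per bin ("all critical tests ok"), and it does not model unknown results or scheduled downtimes. The host names are hypothetical.

```python
from typing import Dict, List

# One boolean per equal-length time bin (for example, hourly bins).
# True means that all critical tests of the service were ok in that bin.
Timeline = List[bool]


def service_availability(timeline: Timeline) -> float:
    """Fraction of time bins in which all critical tests were ok."""
    return sum(timeline) / len(timeline)


def site_availability(ces: Dict[str, Timeline], ses: Dict[str, Timeline]) -> float:
    """Fraction of bins in which at least one CE and every SE were ok."""
    n_bins = len(next(iter(ces.values())))
    ok_bins = 0
    for i in range(n_bins):
        ce_ok = any(t[i] for t in ces.values())  # one available CE is enough
        se_ok = all(t[i] for t in ses.values())  # every SE must be available
        if ce_ok and se_ok:
            ok_bins += 1
    return ok_bins / n_bins


# Hypothetical site with two CEs and one SE, four time bins.
ces = {"ce01": [True, False, True, True], "ce02": [False, True, True, False]}
ses = {"se01": [True, True, True, False]}
print(service_availability(ces["ce01"]))  # 0.75
print(site_availability(ces, ses))        # 0.75
```

In the example the site is unavailable only in the last bin, where the single SE fails even though one CE is still ok.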

In SAM one can define several profiles, which correspond to different sets of critical tests. For CMS there are two profiles:

  • CMS_CRITICAL: it contains only org.sam.CONDOR-JobSubmit, org.cms.WN-env, org.cms.WN-isolation, org.cms.SRM-GetPFNFromTFC, org.cms.SRM-VOPut and org.cms.SRM-VOGet
  • CMS_CRITICAL_FULL: it contains (almost) all the CMS SAM tests.
The former is used to produce reports for the WLCG management, where issues depending on experiment configuration should not be visible, while CMS_CRITICAL_FULL should always be used in the context of CMS computing operations.
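
As a rough sketch of how a profile changes the verdict for the same set of results, the snippet below evaluates one service against two sets of critical tests. The status strings "OK" and "CRITICAL" and the function name are assumptions made for the example. The second profile is only an illustrative superset, since the full content of CMS_CRITICAL_FULL is not listed here.

```python
# Critical tests of the CMS_CRITICAL profile, as listed above.
CMS_CRITICAL = {
    "org.sam.CONDOR-JobSubmit",
    "org.cms.WN-env",
    "org.cms.WN-isolation",
    "org.cms.SRM-GetPFNFromTFC",
    "org.cms.SRM-VOPut",
    "org.cms.SRM-VOGet",
}


def service_ok(results, profile):
    """Return True if every test in `results` that belongs to `profile` is OK.

    `results` maps a test name to a status string. Tests that are not part
    of the profile are ignored.
    """
    critical = {name: status for name, status in results.items() if name in profile}
    return all(status == "OK" for status in critical.values())


# Hypothetical results for one CE: the squid test fails.
results = {"org.cms.WN-env": "OK", "org.cms.WN-squid": "CRITICAL"}

# Illustrative superset standing in for CMS_CRITICAL_FULL (assumption).
fuller_profile = CMS_CRITICAL | {"org.cms.WN-squid"}

print(service_ok(results, CMS_CRITICAL))    # True
print(service_ok(results, fuller_profile))  # False
```

The failing squid test is invisible under CMS_CRITICAL but makes the service fail under the fuller profile, which is why the latter is the one to use for CMS computing operations.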

Current test list

| Name | Description | Maintainer | Status |
| CE | | | |
| org.cms.WN-env | Verify basic WN setup, CMS-independent, and print some useful information | A. Sciabà | Running |
| org.cms.WN-basic | Minimal "site is alive" test. Verify local site configuration, SW installation and TFC | A. Sciabà | Running |
| org.cms.WN-swinst | Verify the SW installation and that CMSSW can be installed remotely | C. Wissing | Running |
| org.cms.WN-mc | Verify site is OK for MC Production. Check stage out and clean up | J. Hernandez | Running |
| org.cms.WN-remotestageout | Verify site can stage out to a remote site. Check stage out and clean up | N. Magini | Running |
| org.cms.WN-squid | Verify Squid is working and can fetch from ORCOFF | L. Linares | Running |
| org.cms.WN-frontier | Verify CMSSW jobs access non-event data via Frontier | L. Linares | Running |
| org.cms.WN-analysis | Site is validated for user/organised Data Analysis | S. Belforte | Running |
| org.cms.WN-isolation | Runs the Singularity and the glexec tests; passes if at least one of them passes | B. Bockelman | Running |
| org.cms.WN-CVMFS | Check that CVMFS is deployed and working correctly | A. Sciabà | Running |
| org.cms.WN-xrootd-access | Data access via xrootd | B. Bockelman | Running |
| org.cms.WN-xrootd-fallback | Access fallback mechanism | B. Bockelman | Running |
| SRMv2 | | | |
| SRMv2-* | Verify site SRMv2 from the SAM UI, without relying on other sites | N. Magini | Running |
| org.cms.SE-xrootd-* | Verify site xrootd endpoint(s), without relying on other sites | St. Lammel | Testing |

Links to the source code:

Links to useful documentation:

Site-specific functionality to test

  • CMS can send a job to the site with all the relevant VOMS (Virtual Organization Membership Service) FQANs
  • The WN fulfills the VO card prescriptions
  • The CVMFS (CERN Virtual Machine File System) CMS software area exists and is readable
  • The required middleware libraries are installed
    • lcg_utils (obsolete) and the GFAL2 client. Others?
  • It is possible to write from the WN to the local SE with the protocol(s) chosen by CMS in /store/unmerged and /store/temp
  • It is possible to read data in the local SE from the WN with the protocol(s) chosen by CMS in /store/unmerged and /store/temp
    • Again, the reading should be done using WMCore code
  • The xrootd fallback works
  • The local squid server works
  • The connectivity to all the relevant central services works
    • Currently this includes:
      • the DQM GUI (Data Quality Monitoring) for harvesting jobs
      • cmsweb (which automatically covers all services running behind it)
  • (Obsolete) The remote stageout to a (set of) reference SE(s) via lcg-cp works
    • (Obsolete) We need to have at least a few reference SEs, on both sides of the Atlantic. Tier-1 sites are not an option due to access restrictions.
  • Singularity works as required by glideins
  • It is possible to copy files from the Nagios server to the site SE (in /store/merged and /store/temp) and vice versa
  • It is possible to delete files from the site SE
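
As an example of how one of the items above could be checked from a worker node, the sketch below tests that the CVMFS CMS software area exists and is readable. It is not the code of the actual org.cms.WN-CVMFS test. The function name and the output key are made up for the illustration, and the default path assumes the usual /cvmfs/cms.cern.ch mount point.

```python
import os


def software_area_readable(path="/cvmfs/cms.cern.ch"):
    """Return True if `path` is an existing directory that can be listed."""
    if not os.path.isdir(path):
        return False
    # R_OK is needed to read the entries, X_OK to traverse the directory.
    return os.access(path, os.R_OK | os.X_OK)


if __name__ == "__main__":
    # Hypothetical key: value output line, one per checked item.
    status = "OK" if software_area_readable() else "CRITICAL"
    print(f"cvmfs_area: {status}")
```

A real test would go further, for instance by reading a known file, because a stale mount can still look like a directory.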

CMS-specific functionality to test

  • (Obsolete) The CMS software is correctly installed (assuming that the software area exists and is readable)
  • (Obsolete) The software tags are consistent with the installed software
  • The site local configuration is complete and consistent with the GIT version
  • The LFN-to-PFN (Logical File Name to Physical File Name) translation works well
    • At the very least we check that a PFN is generated, as today. It might be extended to test different typical LFNs
  • The trivial file catalogue is complete and consistent with the GIT version and the PhEDEx version
  • The Frontier system is working (assuming that the local squid server works)
  • It is possible to run a test analysis on a local dataset (assuming that read access works)
  • It is possible to locally stage out data using the relevant WMCore code and the fallback mechanism
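
The LFN-to-PFN check mentioned above ("a PFN is generated") can be illustrated with a minimal rule matcher. This is a simplified sketch and not the WMCore or PhEDEx implementation: the rules, host names and paths are hypothetical, and features of a real trivial file catalogue such as rule chaining are not modelled.

```python
import re

# Hypothetical rules: (protocol, path-match regex, result template).
# "$1" in the template stands for the first captured group.
RULES = [
    ("direct", r"/+store/(.*)", "/pnfs/example.org/data/cms/store/$1"),
    ("srmv2", r"/+store/(.*)",
     "srm://se.example.org:8443/srm/managerv2?SFN=/pnfs/example.org/data/cms/store/$1"),
]


def lfn_to_pfn(lfn, protocol, rules=RULES):
    """Return the PFN from the first matching rule, or None if none matches."""
    for rule_protocol, path_match, result in rules:
        if rule_protocol != protocol:
            continue
        match = re.fullmatch(path_match, lfn)
        if match:
            pfn = result
            # Substitute $1, $2, ... with the captured groups.
            for i, group in enumerate(match.groups(), start=1):
                pfn = pfn.replace(f"${i}", group)
            return pfn
    return None


print(lfn_to_pfn("/store/unmerged/test.root", "direct"))
# /pnfs/example.org/data/cms/store/unmerged/test.root
```

Extending the check to "different typical LFNs" would then amount to looping over a list of representative LFNs and failing if any of them returns None.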

SAM tests guidelines

  • Site-related tests may be CMS-specific in their implementation but should be generic in their meaning (i.e. any LHC VO could develop equivalent tests). Whenever possible the implementation should also be generic to avoid duplication of effort and improve consistency among VOs.
  • The test output must be in text format and the last line before exiting must have the format summary: ERROR_STRING, where ERROR_STRING is replaced by a machine-parseable string that gives some information about why the test failed.
  • The text output should be as concise as possible and use key: value structures whenever practical.
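
A minimal sketch of an output that follows these guidelines is shown below. The helper name, the keys and the error string are invented for the example; only the key: value structure and the final summary: line come from the guidelines.

```python
def report(details, error_string):
    """Build the test output: key: value lines, then the summary line."""
    lines = [f"{key}: {value}" for key, value in details.items()]
    # The summary must be the last line printed before the test exits.
    lines.append(f"summary: {error_string}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical details collected by a failing test.
    details = {"hostname": "wn001.example.org", "cvmfs_area": "missing"}
    print(report(details, "CVMFS_AREA_NOT_FOUND"))
```

Keeping the error string to a single token without spaces makes it easy to aggregate failures across sites with simple text tools.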

Requirements for sites to receive SAM tests

  • EGI and NorduGrid sites
    • Site
      • Register the site in GOCDB
      • Register the site in CMS SiteDB. Make sure that the "LCG Name" of your site in SiteDB is the same as your site name in GOCDB
    • SRMv2
      • Register the storage element in GOCDB
      • Register the storage element in SiteDB
      • To also pass the tests, run PhEDEx Production agents publishing the storage element in the TrivialFileCatalog.
    • CREAM-CE/ARC-CE
      • Register the computing element in GOCDB
      • Publish the computing element in BDII
        • Make sure that the publication is propagated to the top-level BDII lcg-bdii.cern.ch
        • The computing element should be published as "close" to your storage element
        • Make sure that the attribute GlueCEImplementationName is published correctly: "CREAM" for CREAM-CE, "ARC-CE" for ARC-CE
  • OSG sites
    • Site
      • Register the site in OIM
      • Register the site in CMS SiteDB. Make sure that the "LCG Name" of your site in SiteDB is the same as your "Resource Group" name in MyOSG
    • OSG-SRMv2
      • Register the storage element in OIM
      • Register the storage element in SiteDB
      • To also pass the tests, run PhEDEx Production agents publishing the storage element in the TrivialFileCatalog.
    • OSG-CE
      • Register the computing element in OIM
      • Publish the computing element in BDII
        • Make sure that the publication is propagated to the top-level BDII lcg-bdii.cern.ch
        • The computing element should be published as "close" to your storage element
  • If you have followed the steps above correctly, your services should appear in the link below. Please also verify that they appear with the correct flavour:
  • If the services are also published correctly in GOCDB/OIM, you should then start to receive SAM tests automatically.
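
Two of the most common mistakes in the EGI/NorduGrid checklist above (mismatching site names and a wrong GlueCEImplementationName) can be caught with a simple consistency check before waiting for tests to arrive. The sketch below works on a hand-written dictionary; the record layout and field names are hypothetical and it does not query GOCDB, SiteDB or the BDII.

```python
# Expected GlueCEImplementationName for each CE flavour, as stated above.
EXPECTED_IMPLEMENTATION = {"CREAM-CE": "CREAM", "ARC-CE": "ARC-CE"}


def registration_problems(site):
    """Return a list of human-readable problems found in a site record."""
    problems = []
    if site["sitedb_lcg_name"] != site["gocdb_name"]:
        problems.append("SiteDB 'LCG Name' differs from the GOCDB site name")
    for ce in site["ces"]:
        want = EXPECTED_IMPLEMENTATION.get(ce["flavour"])
        if want is not None and ce["GlueCEImplementationName"] != want:
            problems.append(
                f"{ce['host']}: GlueCEImplementationName should be {want!r}"
            )
    return problems


# Hypothetical site record filled in by hand.
site = {
    "gocdb_name": "EXAMPLE-SITE",
    "sitedb_lcg_name": "EXAMPLE-SITE",
    "ces": [
        {"host": "ce01.example.org", "flavour": "CREAM-CE",
         "GlueCEImplementationName": "CREAM"},
        {"host": "ce02.example.org", "flavour": "ARC-CE",
         "GlueCEImplementationName": "ARC"},
    ],
}
for problem in registration_problems(site):
    print(problem)
```

In this example only ce02 is reported, because it publishes "ARC" instead of "ARC-CE".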

Failure reasons from production experience

  • Transfers (Stefan P.)
    • missing input files due to storage inconsistency, causing massive job failures, resolved by a BDV check and invalidations
  • Prompt Skimming (Diego B.)
    • I/O errors due to
      • Files not available for reading due to storage system errors
      • Corrupted copies of the files
      • Errors when writing out the produced files
    • Software errors: problems running the CMSSW setup scripts
  • Processing and Production (Edgar F.)
    • see Prompt Skimming
    • permission problems for glideins when they stage out (via lcg-cp) to /store/unmerged or /store/mc, or when they read files from the SE, sometimes without the site admins seeing any problem on their side
    • problems related to memory limits: it might be useful to have test glidein jobs that go to the memory limit and see what happens
    • insufficient level of replication of MinBias datasets (shows up only when there are hundreds of concurrent jobs)

What to do if tests fail

Hints for site admin on how to find/fix the reasons for SAM CMS test failures

How to resubmit the SAM tests

To run an on-demand CondorG submission test (and consequently re-run all the WN tests):
  • go to the SAM ETF web page
  • expand "Services" in the toolbar on the left and select "All Services"
  • find your CE in the list and, in the third column (Icons) of the row for "org.sam.CONDOR-JobState-/cms/Role=..." (choose the desired role), click on the green pair of forward/backward arrows to trigger a submission

For the SRM tests the procedure is identical but the test to resubmit is org.cms.SRM-AllCMS-/cms/Role=production.

How to retransfer the SAM test dataset

Site data admins can re-transfer the SAM test dataset themselves. Go to the PhEDEx web page, i.e.

https://cmsweb.cern.ch/phedex/prod/Request::Create?type=delete

Set 'remove subscriptions' to 'no', select your site, and type the dataset names in 'Data Items':

/GenericTTbar/SAM-CMSSW_5_3_1_START53_V5-v1/GEN-SIM-RECO
/GenericTTbar/SAM-CMSSW_9_0_0_90X_mcRun1_realistic_v4-v1/AODSIM

After you approve the request, PhEDEx will automatically delete and retransfer the dataset.

Topic revision: r47 - 2019-06-27 - AkankshaAhuja1
 