SAM CMS SRMv2 test suite

Maintainer: N. Magini

Overview

The purpose of these tests is to verify that an SRMv2 server at a site is up, running and usable by CMS, following these guidelines:
  • do this without interacting with SRMs at other sites
  • use the CMS endpoint at the site and CMS grid credentials
  • make the test as similar as possible to the PhEDEx operations activities:
    • by using the same endpoint (server)
    • by using the same TFC
    • by using a CMS proxy.
Some differences with real PhEDEx operations are unavoidable:
  • the tests must not write to a location in the storage which is mapped to tape (i.e. use T0D0, not T1Dx)
  • the credentials will not be the same as those used by the PhEDEx agents (which run with local site admin credentials).
Each test performs the operation with the space token specified in the site TFC, if any.
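
As an illustration of how a TFC rule can turn an LFN into a PFN and carry an optional space token, here is a minimal sketch. It assumes the usual lfn-to-pfn rule layout (protocol, path-match, result, space-token attributes); the host name, paths and token name are hypothetical, and rule chaining and destination matching are deliberately left out, so this is not the code used by the test itself.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical TFC fragment: host name, paths and token name are made up.
EXAMPLE_TFC = """
<storage-mapping>
  <lfn-to-pfn protocol="srmv2" path-match="/+store/unmerged/SAM/(.*)"
              space-token="CMS_DEFAULT"
              result="srm://se.example.org:8443/srm/managerv2?SFN=/data/cms/store/unmerged/SAM/$1"/>
  <lfn-to-pfn protocol="srmv2" path-match="/+store/(.*)"
              result="srm://se.example.org:8443/srm/managerv2?SFN=/data/cms/store/$1"/>
</storage-mapping>
"""


def lfn_to_pfn(tfc_xml, lfn, protocol="srmv2"):
    """Return (pfn, space_token) from the first matching rule, or None."""
    root = ET.fromstring(tfc_xml)
    for rule in root.iter("lfn-to-pfn"):
        if rule.get("protocol") != protocol:
            continue
        match = re.match(rule.get("path-match"), lfn)
        if match is None:
            continue
        # Replace $1, $2, ... in the result with the captured groups.
        pfn = re.sub(r"\$(\d+)",
                     lambda m: match.group(int(m.group(1))),
                     rule.get("result"))
        # space-token is optional; rule.get() returns None when it is absent.
        return pfn, rule.get("space-token")
    return None
```

With this fragment, an LFN under /store/unmerged/SAM resolves through the first rule and picks up its space token, while any other /store LFN falls through to the second rule and has no token.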

Test procedure

  • The tests are run from the SAM UI every 2 hours
    • this is not the third-party transfer used in PhEDEx, which involves a transfer to/from another SRM
    • this is intentional, to decouple the test at a site from any other site's SRM
    • a third-party SRM transfer to a common reference site could still be added later
  • Nicolo.Magini@cernNOSPAMPLEASE.ch is the author of the test scripts
  • The tests copy files to/from a "standard" location (in the CMS namespace). This location should be
    • known to site admins
    • writable by a proper VOMS role (or by everybody)
    • free for site admins to clean up as they like (a sort of T0D0 storage class)
    • not migrated/copied to tape (files are written and deleted here)
    • currently the same area as for the CE-cms-mc stageout test (/store/unmerged/SAM). Should it be changed?
    • "on the same hardware" as standard PhEDEx production transfers, as far as this makes sense for the site given the above constraints
  • The test uses the same TFC from TMDB used by local PhEDEx agents in order to construct the PFN
  • The SRMv2 test uses the lcg-utils Python API to perform the operations under test. The clients are used in No-BDII mode, which disables any communication with the BDII to find out published space tokens, storage areas, etc.; only the basic SRMv2 client functionality of lcg-utils is used.
  • The tests are executed in a chain.
    • If the initial TFC test fails, all other tests are skipped and return WARNING.
    • If the initial TFC test succeeds but the put test fails (file not copied to storage), all other following tests are skipped and return WARNING.
    • If the TFC and put tests succeed, all other tests are always executed, including the final del test which should clean up the test file from the storage.
  • The test file is created during the put test.
  • The tests currently implemented are:
| Test name | Critical | Description |
| org.cms.SRM-AllCMS | no | test chain to execute all following tests in a sequence |
| org.cms.SRM-GetPFNFromTFC | yes | use TFC to perform LFN to PFN matching |
| org.cms.SRM-VOLsDir | no | list directory entry on SRMv2 (Ls -d) |
| org.cms.SRM-VOPut | yes | copy local file to target PFN with lcg-cp (Put sequence) |
| org.cms.SRM-VOLs | no | list file on SRMv2 (Ls) |
| org.cms.SRM-VOGetTURLs | no | get a TURL usable to retrieve file back from SRMv2 with gridftp |
| org.cms.SRM-VOGet | yes | copy remote file back to local disk, compare with original copy (Get sequence) |
| org.cms.SRM-VODel | no | delete remote file (Rm) |

  • we will leave FTS out of this test, since it couples too many things (FTS server, FTS channel configuration) that are also specific to a channel (link) and not to a site. It would make things far too complicated; this is better addressed at the FTS level.
  • similarly, we do not plan to restart the PhEDEx heartbeat, and will stick to this basic "is site up" test
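
The skip rules of the test chain described above can be sketched as follows. This is a minimal sketch, not the actual test scripts: the execution order is assumed to be the order of the table, each test is stood in for by a hypothetical callable returning True on success, and the "FAILED" label is a placeholder, since the text only specifies that skipped tests return WARNING.

```python
OK, WARNING, FAILED = "OK", "WARNING", "FAILED"  # FAILED is a placeholder label

# Assumed execution order, taken from the table of tests.
CHAIN = ["GetPFNFromTFC", "VOLsDir", "VOPut", "VOLs",
         "VOGetTURLs", "VOGet", "VODel"]

# A failure of one of these tests causes every later test to be skipped.
BLOCKING = ("GetPFNFromTFC", "VOPut")


def run_chain(probes):
    """Run the chain; `probes` maps test name -> callable returning True/False."""
    results = {}
    blocked = False
    for name in CHAIN:
        if blocked:
            # Skipped tests are not executed and report WARNING.
            results[name] = WARNING
            continue
        succeeded = probes[name]()
        results[name] = OK if succeeded else FAILED
        if not succeeded and name in BLOCKING:
            blocked = True
    return results
```

Note that a failure of any test other than the TFC and put tests does not block the chain, so the final del test still runs and can clean up the test file.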

Source code

Troubleshooting

Currently known reasons for failure, and how to solve them:

Invalid argument

lcg-cp may fail with an obscure "Invalid argument" error, both in Put and in Get, for example:

srm://dgc-grid-34.brunel.ac.uk:8446/srm/managerv2?SFN=/dpm/brunel.ac.uk/home/cms/store/unmerged/SAM/testSRM/lcg-util/testfile-cp-20080413-205658.txt: Invalid argument
lcg_cp: Invalid argument

This usually does not mean that the SURL is incorrect or invalid - normally it means that the srmPrepareToPut or srmPrepareToGet request failed. Usually the command will eventually succeed after a few retries. Possible reasons for the failures include:

  • for srmPrepareToPut - issues with disk servers
  • for srmPrepareToGet - file not yet available on gridftp server
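
Since the copy usually succeeds after a few retries, a manual reproduction can wrap the failing command in a simple retry loop. The helper below is a generic sketch, not part of the test suite; the name retry and its parameters (attempts, delay, sleep) are hypothetical, and the operation to retry is passed in as a callable that raises on failure.

```python
import time


def retry(operation, attempts=3, delay=5.0, sleep=time.sleep):
    """Call operation() until it returns without raising, at most `attempts` times."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as err:  # broad on purpose: any failure triggers a retry
            last_error = err
            if attempt < attempts:
                sleep(delay)  # wait before the next attempt
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

The callable could, for instance, run the same copy command through subprocess.check_call, which raises when the command exits with a non-zero status.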

An effective way to get a more meaningful error message is to copy a file by hand using srmcp instead of lcg-cp. One can use the output of the failed org.cms.SRM-VOPut test to find the destination SURL, create a proxy with the production role and execute the command (preferably from LXPLUS):

srmcp -2 file:////etc/hosts dest_surl

CGSI-gSOAP: Error reading token data header: Connection closed
An instance of this error was seen at a site where the cause was a large clock offset on the SRM server due to an NTP misconfiguration.

No tests submitted
SRMv2 tests cannot be submitted if the SRM endpoint is not registered in the CMS VOFeed. Typically this happens when the site/endpoint is not published properly in SiteDB.

-- NicoloMagini - 20-Jun-2012

Topic revision: r13 - 2013-05-29 - AndreaSciaba
 