Test and Release process for Castor SRM2
This page outlines the testing process currently performed to certify a new Castor SRM2 release for production deployment. For the time being, it also tracks work in progress and/or plans to achieve the desired test process.
Definitions
- Major version: a software release where any digit of the version number can change with respect to the previous release. A major version upgrade may require new Castor libraries and/or schema changes and requires an intrusive intervention.
- Minor version: a software release where only the last digit is changed with respect to the previous release. A minor version upgrade requires neither new Castor libraries nor schema changes, and it may or may not be performed in a transparent (rolling) manner.
Functional test steps
To run any of the following tests, you need to have a valid grid certificate and a set of shell environment variables. For example, from lxplus it is advised to run:
source /afs/cern.ch/project/gd/LCG-share/current/etc/profile.d/grid-env.sh
grid-proxy-init
before initiating any test session. Moreover, it is required to use a directory in the Castor namespace where the pool account you're mapped to (usually dteam001) has write access. The S2 test suite uses by default /castor/cern.ch/grid/dteam/S2-test-results.
For any new version, the following functionality tests are performed against certification endpoints, lxsrmdev0N.cern.ch for N = 1, 2, 3, 4 (see SrmDev for the actual deployment):
- The SAM-based test
- The S2 test suite
- Extra Castor SRM tests
- GFAL prestaging
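As a rough illustration, the endpoint pattern above can be expanded in a loop so that the same test is launched against each certification box. This is only a sketch: it prints the commands instead of running them, it assumes the SAM-like script accepts a fully qualified host name as its endpoint argument, and the space token is simply the one that appears in the sample log on this page.

```shell
#!/bin/sh
# Print the certification endpoints named in the text:
# lxsrmdev0N.cern.ch for N = 1, 2, 3, 4.
cert_endpoints() {
    for n in 1 2 3 4; do
        echo "lxsrmdev0${n}.cern.ch"
    done
}

# Dry run: show the command that would be run against each endpoint.
# Assumption: srm2_testlcgutils.sh takes the host name as endpoint-name.
# Remove the leading "echo" to actually run the test.
for ep in $(cert_endpoints); do
    echo ./srm2_testlcgutils.sh "$ep" srm2_d0t1
done
```

Check SrmDev first, since not all four boxes are necessarily deployed at any given time.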
SAM-based test
A SAM-like test using lcg_utils is provided in svn.
Typical usage:
> srm2_testlcgutils.sh
Usage: ./srm2_testlcgutils.sh endpoint-name [spacetoken] [castor path]
> ./srm2_testlcgutils.sh srm-pps srm2_d1t0
#
# Executing "lcg-cp --verbose --nobdii -D srmv2 --vo dteam --dst srm2_d0t1 file:///etc/group srm://srm-pps:8443/srm/managerv2?SFN=/castor/cern.ch/grid/dteam/castordev/test-srm-pps_8443-srm2_d0t1-ed6b7013-5329-4f5b-aaba-0e1341f30663"
#
Using grid catalog type: UNKNOWN
Using grid catalog : (null)
VO name: dteam
Checksum type: None
Destination SE type: SRMv2
Destination SRM Request Token: 9145153
Source URL: file:/etc/group
File size: 2443
Source URL for copy: file:/etc/group
Destination URL: gsiftp://lxfsre5303.cern.ch:20886/7e8e0dad-e0fc-3105-e040-8a89c180035b
# streams: 1
2443 bytes 3.42 KB/sec avg 3.42 KB/sec inst
Transfer took 1070 ms
#
# Executing "lcg-ls --verbose --nobdii -D srmv2 --vo dteam -l srm://srm-pps:8443/srm/managerv2?SFN=/castor/cern.ch/grid/dteam/castordev/test-srm-pps_8443-srm2_d0t1-ed6b7013-5329-4f5b-aaba-0e1341f30663"
#
SE type: SRMv2
-rw-r----- 1 2 2 2443 ONLINE /castor/cern.ch/grid/dteam/castordev/test-srm-pps_8443-srm2_d0t1-ed6b7013-5329-4f5b-aaba-0e1341f30663
* Checksum: ()
* Space tokens: 48f34339-0000-1000-926f-8fd2f86a7650
#
# Executing "lcg-cp --verbose --nobdii -D srmv2 --vo dteam srm://srm-pps:8443/srm/managerv2?SFN=/castor/cern.ch/grid/dteam/castordev/test-srm-pps_8443-srm2_d0t1-ed6b7013-5329-4f5b-aaba-0e1341f30663 file:///tmp/test-group"
#
Using grid catalog type: UNKNOWN
Using grid catalog : (null)
VO name: dteam
Checksum type: None
Trying SURL srm://srm-pps:8443/srm/managerv2?SFN=/castor/cern.ch/grid/dteam/castordev/test-srm-pps_8443-srm2_d0t1-ed6b7013-5329-4f5b-aaba-0e1341f30663 ...
Source SE type: SRMv2
Source SRM Request Token: 9145156
Source URL: srm://srm-pps:8443/srm/managerv2?SFN=/castor/cern.ch/grid/dteam/castordev/test-srm-pps_8443-srm2_d0t1-ed6b7013-5329-4f5b-aaba-0e1341f30663
File size: 2443
Source URL for copy: gsiftp://lxfsrl6306.cern.ch:20024/7e8ce0cc-ae74-d981-e040-8a89c180035d
Destination URL: file:/tmp/test-group
# streams: 1
0 bytes 0.00 KB/sec avg 0.00 KB/sec inst
Transfer took 1010 ms
#
# Executing "lcg-gt --verbose --nobdii -D srmv2 --st srm2_d0t1 srm://srm-pps:8443/srm/managerv2?SFN=/castor/cern.ch/grid/dteam/castordev/test-srm-pps_8443-srm2_d0t1-ed6b7013-5329-4f5b-aaba-0e1341f30663 gsiftp"
#
gsiftp://lxfsre5303.cern.ch:20622/7e8e0da9-f72d-715b-e040-8a89c1800363
9145159
#
# Executing "lcg-gt --verbose --nobdii -D srmv2 --st srm2_d0t1 srm://srm-pps:8443/srm/managerv2?SFN=/castor/cern.ch/grid/dteam/castordev/test-srm-pps_8443-srm2_d0t1-ed6b7013-5329-4f5b-aaba-0e1341f30663 rfio"
#
rfio://castorpublic.cern.ch:9002//castor/cern.ch/grid/dteam/castordev/test-srm-pps_8443-srm2_d0t1-ed6b7013-5329-4f5b-aaba-0e1341f30663?svcClass=default&castorVersion=2
9145162
#
# Executing "lcg-gt --verbose --nobdii -D srmv2 --st srm2_d0t1 srm://srm-pps:8443/srm/managerv2?SFN=/castor/cern.ch/grid/dteam/castordev/test-srm-pps_8443-srm2_d0t1-ed6b7013-5329-4f5b-aaba-0e1341f30663 xroot"
#
root://castorpublic.cern.ch//castor/cern.ch/grid/dteam/castordev/test-srm-pps_8443-srm2_d0t1-ed6b7013-5329-4f5b-aaba-0e1341f30663?svcClass=default
9145168
#
# Executing "lcg-del --verbose --nobdii -D srmv2 --nolfc --vo dteam srm://srm-pps:8443/srm/managerv2?SFN=/castor/cern.ch/grid/dteam/castordev/test-srm-pps_8443-srm2_d0t1-ed6b7013-5329-4f5b-aaba-0e1341f30663"
#
VO name: dteam
SE type: SRMv2
srm://srm-pps:8443/srm/managerv2?SFN=/castor/cern.ch/grid/dteam/castordev/test-srm-pps_8443-srm2_d0t1-ed6b7013-5329-4f5b-aaba-0e1341f30663 - DELETED
#
# Done!
#
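All SURLs in the log above follow the same pattern: endpoint name, port 8443, the /srm/managerv2 service path and the Castor path in the SFN parameter. A minimal helper that builds such a SURL by hand (for instance, to feed the lcg-* commands directly) could look like the following sketch. The function name and the file name in the example call are made up for illustration.

```shell
#!/bin/sh
# Build an SRM v2 SURL from an endpoint name and a Castor path, following
# the pattern visible in the sample log (port 8443, /srm/managerv2).
# make_surl is a hypothetical helper, not part of srm2_testlcgutils.sh.
make_surl() {
    endpoint=$1
    castor_path=$2
    printf 'srm://%s:8443/srm/managerv2?SFN=%s\n' "$endpoint" "$castor_path"
}

# Example call; "my-test-file" is a placeholder name.
make_surl srm-pps /castor/cern.ch/grid/dteam/castordev/my-test-file
```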
S2 test suite
The S2 test suite has been developed by Flavia and can be executed on a 32-bit-enabled box. Details on S2 are on the SRMDev twiki.
A basic run of S2 is:
cd ~itglp/testsuite/srm/S2
source env.sh
cd basic
make test
The certification process includes running both the basic and the usecase test families. Note that they can take a substantial time to complete!
-- To be done -- Recompile the S2 framework
To interpret the outcome, you must take into account that a number of SRM requests are not supported by Castor SRM, so the corresponding basic tests fail. In addition, a number of use-case tests exercise special boundary conditions that are known to break. This is acceptable as long as the impact on users is known to be negligible.
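One way to keep expected failures from hiding new ones is to compare the failed tests of a run against a hand-maintained list of accepted failures. The sketch below assumes both lists are plain-text files with one test name per line. This format, the file names and the test names are hypothetical and are not S2's own output format.

```shell
#!/bin/sh
# Print failed tests that are NOT in the known-failures list.
# $1: file with one failed test name per line (hypothetical format)
# $2: file with one accepted failure per line (hypothetical, hand-maintained)
unexpected_failures() {
    # -x: whole-line match, -F: fixed strings, -v: keep non-matching lines
    grep -v -x -F -f "$2" "$1"
}

# Self-contained demonstration with made-up test names.
tmp=$(mktemp -d)
printf '%s\n' test_a test_b test_c > "$tmp/failed.txt"
printf '%s\n' test_b > "$tmp/known.txt"
unexpected_failures "$tmp/failed.txt" "$tmp/known.txt"
rm -rf "$tmp"
```

In the demonstration only test_a and test_c are printed, since test_b is listed as an accepted failure.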
Extra Castor SRM tests
These tests should be part of S2 at some stage. For the time being they are run by using the Castor SRM srm2_test* command-line clients.
- srmPrepareToGet|BoL of 2 files
- srmPrepareToGet|BoL of a directory
- srmPrepareToGet|BoL passing a space token
- GetStatusPartial{Ex,Ne} also for bringOnline
- srmPurgeFromSpace using a 'predefined' space token (so as not to rely on srmReserveSpace)
- srmPrepareToPut|Get cycles with a different protocol than gsiftp (rfio, xroot)
- srmPrepareToPut|Get (lcg-getturls) with a list of protocols, checking the order is respected
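For the last item, a simple check is to compare the scheme of the returned TURL with the first protocol in the requested list. The sketch below assumes that the first requested protocol is supported by the endpoint, so it does not cover the case where the server legitimately falls back to a later entry. The function names and the path in the example call are made up. The xroot to root:// mapping is taken from the sample lcg-gt output on this page.

```shell
#!/bin/sh
# Extract the scheme from a TURL, e.g. "gsiftp" from "gsiftp://host:port/path".
turl_scheme() {
    echo "${1%%://*}"
}

# Scheme expected for a requested protocol. The sample log on this page shows
# an "xroot" request answered with a root:// TURL; other names map to themselves.
expected_scheme() {
    case $1 in
        xroot) echo root ;;
        *)     echo "$1" ;;
    esac
}

# Succeed if the TURL uses the first protocol of the requested list.
# $1: returned TURL; remaining arguments: protocols in the requested order.
first_protocol_honoured() {
    turl=$1
    shift
    [ "$(turl_scheme "$turl")" = "$(expected_scheme "$1")" ]
}

# Example with a placeholder path.
first_protocol_honoured "root://castorpublic.cern.ch//castor/cern.ch/grid/dteam/example" \
    xroot rfio gsiftp && echo "order respected"
```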
Testing GFAL prestage
-- work in progress --
See https://twiki.cern.ch/twiki/bin/view/Sandbox/PreStagingTestsReferenceImplementationArch
Testing srmcp
-- work in progress --
For example:
export SRM_PATH=/afs/cern.ch/project/gd/LCG-share/3.2.8-0/d-cache/srm
$SRM_PATH/bin/srmcp -debug -srm_protocol_version 2 -space_token <spacetoken> SURL1 SURL2
Testing VOMS Roles
-- work in progress --
itglp@lxcastordev02:user/i/itglp> source /afs/cern.ch/project/gd/LCG-share/current_3.2/etc/profile.d/grid-env.sh
itglp@lxcastordev02:user/i/itglp> voms-proxy-init -voms dteam:/dteam/cern/Role=lcgadmin
Enter GRID pass phrase:
Your identity: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=lopresti/CN=626027/CN=Giuseppe Lo Presti
Creating temporary proxy .................................................................................... Done
Contacting lcg-voms.cern.ch:15004 [/DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch] "dteam" Done
Creating proxy ...................................................................................... Done
Your proxy is valid until Sat Jan 15 03:13:42 2011
itglp@lxcastordev02:user/i/itglp> voms-proxy-info -all
subject : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=lopresti/CN=626027/CN=Giuseppe Lo Presti/CN=proxy
issuer : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=lopresti/CN=626027/CN=Giuseppe Lo Presti
identity : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=lopresti/CN=626027/CN=Giuseppe Lo Presti
type : proxy
strength : 1024 bits
path : /tmp/x509up_u22103
timeleft : 11:58:22
=== VO dteam extension information ===
VO : dteam
subject : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=lopresti/CN=626027/CN=Giuseppe Lo Presti
issuer : /DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch
attribute : /dteam/cern/Role=lcgadmin/Capability=NULL
attribute : /dteam/cern/Role=NULL/Capability=NULL
attribute : /dteam/Role=NULL/Capability=NULL
timeleft : 11:58:22
uri : lcg-voms.cern.ch:15004
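Before running role-dependent tests, it can be useful to verify from a script that the proxy really carries the requested role. The following sketch scans the voms-proxy-info -all output for a matching attribute line. The helper name is made up, and the match relies on the output layout shown above.

```shell
#!/bin/sh
# Read `voms-proxy-info -all` output on stdin and succeed if an attribute
# line carries the given group and role.
# $1: VOMS group (e.g. /dteam/cern), $2: role name (e.g. lcgadmin)
# has_role is a hypothetical helper name.
has_role() {
    grep -q "^ *attribute *: *$1/Role=$2/"
}

# Intended usage (requires a valid VOMS proxy, so it is not run here):
#   voms-proxy-info -all | has_role /dteam/cern lcgadmin || echo "role missing"
```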
Stress tests
Only for major versions: tests are run against the castordev/srmcert5 CDB cluster (to become the srm-itdc.cern.ch endpoint); database schema is srm_itdc@srm-itdc-db.
- The load/stress test family of the S2 test suite
- The FTS load tests
The S2 stress test family
This test was performed in collaboration with Flavia. The scripts are being adapted to be able to use lxtest machines as clients and without any dependency on Flavia's grid certificate.
Description of the test
The test runs from multiple clients and aims at loading the endpoint with a large number of concurrent requests. Typical rates over a day follow:
[root@lxbrb2910 castor]# grep 'New Req' srmfed.log | awk '{print $10}' | sort | uniq -c
60 Type="srm__srmGetSpaceTokens"
2790 Type="srm__srmLs"
733 Type="srm__srmMkdir"
14011 Type="srm__srmPrepareToGet"
9226 Type="srm__srmPrepareToPut"
8210 Type="srm__srmPutDone"
4846 Type="srm__srmRm"
182 Type="srm__srmRmdir"
1792332 Type="srm__srmStatusOfGetRequest"
1427412 Type="srm__srmStatusOfPutRequest"
The hourly rate is between 100K and 220K requests.
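The per-type counts can be reduced to a daily total and an average hourly rate by summing the first column of the uniq -c output. A minimal sketch follows. The function name is made up, and it assumes the counts cover exactly 24 hours.

```shell
#!/bin/sh
# Sum the counts in `sort | uniq -c` output (count in the first column) and
# print the total followed by the average per hour over 24 hours.
# daily_and_hourly is a hypothetical helper name.
daily_and_hourly() {
    awk '{ total += $1 } END { printf "%d %d\n", total, total / 24 }'
}

# Intended usage on the frontend box (not run here):
#   grep 'New Req' srmfed.log | awk '{print $10}' | sort | uniq -c | daily_and_hourly
```

Applied to the listing above, this gives 3259802 requests in total, an average of about 136K requests per hour.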
Moreover, the same set of SURLs is reused from many clients in order to exercise race conditions: any given SURL is re-written, re-read and aborted many times concurrently.
- To be added to the stress test: multiple srmBringOnline requests on top of the prepareToGet|Put to unveil potential race conditions and/or deadlocks across all asynchronous stager requests.
When the stress test is ongoing, the standard S2 basic and use-case suites are run on top to assess whether the system continues to behave correctly under load.
How to assess the outcome of the test
By its nature, a stress test does not provide a red/green flag. Typical things to observe include:
- Core dumps due to race conditions
- Memory or socket leaks: check the lemon page for the box
- Oracle errors: check both the frontend and the backend daemons' logs and monitor the Oracle EM for bad execution plans, deadlocks, etc.
- High rate of INTERNAL_ERRORs
- Abnormally high processing times
- etc...
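For the INTERNAL_ERROR item, a rough rate can be extracted from the frontend log by relating error lines to incoming requests. The sketch below assumes that errors appear in the same log as lines containing the string INTERNAL_ERROR. That layout is an assumption and should be checked against the actual log. The 'New Req' marker is the one used in the grep shown earlier.

```shell
#!/bin/sh
# Read a log on stdin and print the number of lines mentioning INTERNAL_ERROR
# as a percentage of the number of 'New Req' lines.
# Assumption: both kinds of lines are found in the same log file.
internal_error_pct() {
    awk '/New Req/        { reqs++ }
         /INTERNAL_ERROR/ { errs++ }
         END {
             if (reqs > 0) printf "%.2f\n", 100 * errs / reqs
             else          print "n/a"
         }'
}

# Intended usage (not run here):
#   internal_error_pct < srmfed.log
```

What counts as a high rate is a judgment call. Comparing the figure with the one from the previous release under the same load is probably more informative than any fixed threshold.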
FTS load tests
To be done. We need to agree on and set up FTS channels between the srm-itdc and srm-pps endpoints. srm-public could be involved depending on availability and other concurrent production activities in its Castor backend (castorpublic).
History
In the FIO wiki.
-- GiuseppeLoPresti - Aug 2009