Preliminary results of the prestaging tests

Tests run so far:

| Run name | Date | SURL list | Site - storage | Ls optimization | Space token | VO |
| CERN-1 | May 5 2009 | surl_cerndev.txt | CERN - Castor | yes | - | CMS |
| CERN-2 | May 6 2009 | surl_cerndev.txt | CERN - Castor | no | - | CMS |
| IN2P3-1 | May 6-7 2009 | surl_in2p3.txt | IN2P3 - dCache | no | - | CMS |
| IN2P3-2 | May 8 2009 | surl_in2p3.txt | IN2P3 - dCache | no | CMS_DEFAULT | CMS |
| IN2P3-3 | May 13 2009 | lhcb-in2p3.txt | IN2P3 - dCache | no | - | LHCb |
| CERN-3 | May 8 2009 | surl_cerndev.txt | CERN - Castor | no | - | CMS |
| CERN-4 | May 8 2009 | surl_cerndev.txt | CERN - Castor with CMS stager version 2.1.7 | no | - | CMS |
| pic-1 | May 18 2009 | surl-pic-cms-100gb.txt | dCache | no | - | CMS |
| pic-2 | May 19 2009 | surl-pic-cms-100gb.txt | dCache | no | - | CMS |
| CNAF-1 | May 21 2009 | CNAF/surl_1_1TB.txt | CASTOR | no | - | CMS |
| CNAF-2 | May 22 2009 | CNAF/surl_1_1TB.txt | CASTOR | no | - | CMS |

Comments on every run:

run CERN-1

Test run on 428 CMS files, 166 on disk and 262 on tape. srmLs reports the locality of the files becoming ONLINE over time, and after 6800 s (less than 2 hours) all 428 files are ONLINE. The StatusOf command, however, reports the status 'done' only for the 166 files that were already on disk before the test started. For the remaining 262 files the status was still 'pending' after 14 hours, at which point we stopped the test.

run CERN-2

All files are deleted from the disk pool and the test is run again. srmLs correctly reports the number of files with locality ONLINE increasing from 0 to 428 (the total number of files in the request) in 1807.255 s (about 30 minutes). StatusOf reports all files as 'pending' until the end of the run (stopped after about 40 minutes).
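The mismatch between the two views can be made explicit by tallying each polling round. The following is only a minimal sketch: the function name `summarize_poll` and the plain-string inputs are assumptions for illustration, not part of the actual test scripts.

```python
from collections import Counter

def summarize_poll(localities, statuses):
    """Tally one polling round.

    localities: per-file locality strings as reported by srmLs
    statuses:   per-file status strings as reported by StatusOf
    """
    online = Counter(localities)["ONLINE"]
    done = Counter(statuses)["done"]
    # 'lag' is the number of files srmLs sees ONLINE that StatusOf
    # has not yet marked as done.
    return {"online": online, "done": done, "lag": online - done}

# Final poll of run CERN-2 as described above: all 428 files ONLINE,
# all 428 still 'pending' according to StatusOf.
final = summarize_poll(["ONLINE"] * 428, ["pending"] * 428)
print(final)  # {'online': 428, 'done': 0, 'lag': 428}
```

A lag that never shrinks, as here, points at the StatusOf side rather than at the staging itself.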

run IN2P3-1

Test run on 428 CMS job robot files, initially on disk. At the first polling, StatusOf correctly reports 428 pending and 0 done. Strangely, srmLs reports that only 411 files have locality ONLINE and 17 are NEARLINE. Later, StatusOf evolves as expected, reporting more and more files as done until only 16 are pending and 412 are done. After some time, the 16 pending files turn into errors (magenta points in the plot below). srmLs, on the other hand, keeps reporting the same result as at the first polling; then, 4 hours after the start of the test, it evolves in the opposite direction: files start to be released and their locality turns from ONLINE to NEARLINE. Eventually it stabilizes at 47 NEARLINE and 381 ONLINE until the end of the run.
What we do not understand here is the behavior of srmLs: why are 17 files initially NEARLINE, and why do some files become NEARLINE after 4 hours? The answer to the second question is most probably that they are garbage collected, because the duration of the request is longer than the pin lifetime of the files.
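One way to spot the files that are released during a run is to compare the srmLs results of two consecutive pollings. This is a sketch under assumptions: the helper `released_files`, the dictionary layout (SURL to locality) and the example SURLs are all hypothetical.

```python
def released_files(previous, current):
    """Return the SURLs whose locality went from ONLINE to NEARLINE
    between two polls (each poll is a dict: SURL -> locality)."""
    return sorted(
        surl
        for surl, locality in current.items()
        if locality == "NEARLINE" and previous.get(surl) == "ONLINE"
    )

# Hypothetical SURLs, for illustration only.
before = {
    "srm://example/f1": "ONLINE",
    "srm://example/f2": "ONLINE",
    "srm://example/f3": "NEARLINE",
}
after = {
    "srm://example/f1": "ONLINE",
    "srm://example/f2": "NEARLINE",
    "srm://example/f3": "NEARLINE",
}
print(released_files(before, after))  # ['srm://example/f2']
```

Files that were NEARLINE from the first polling onwards (like `f3` here) are deliberately not counted, since they were never seen ONLINE and so cannot have been released during the run.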

Done and online vs time.in2p3-1.png

run IN2P3-3

Done and online vs time.IN2P3-3.png

run CERN-3

Giuseppe corrected a bug in the development instance. The test is run again and statusOf now updates its information correctly. There is a slight delay with respect to the information given by srmLs, but this is intrinsic to the double polling mechanism needed to transmit the information from the CASTOR backend to SRM for the StatusOf command.

Done and online vs time.CERN-3.png

Giuseppe provided an explanation of this bug. It was initially present in SRM version 2.7 and was corrected in 2.8, but the new version of Castor then reintroduced the problem. So the only combination of SRM and Castor that presently does NOT show this problem is Castor 2.1.7 with SRM 2.8. The matrix below summarizes the situation:

| Castor version | SRM version | statusOf is fine? |
| 2.1.7 | 2.7 | buggy |
| 2.1.7 | 2.8 | works fine! |
| 2.1.8 | 2.7 | buggy |
| 2.1.8 | 2.8 | buggy, fix is under test |
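For scripts that need to decide whether statusOf results can be trusted, the matrix above can be encoded as a simple lookup. This is only a sketch: the names `STATUSOF_MATRIX` and `statusof_reliable` are hypothetical, and the entries reflect the situation at the time of these tests.

```python
# (Castor version, SRM version) -> statusOf behaviour, copied from the matrix above
STATUSOF_MATRIX = {
    ("2.1.7", "2.7"): "buggy",
    ("2.1.7", "2.8"): "works fine",
    ("2.1.8", "2.7"): "buggy",
    ("2.1.8", "2.8"): "buggy, fix is under test",
}

def statusof_reliable(castor, srm):
    """True only for combinations the matrix marks as working.
    Combinations not listed in the matrix are treated as unreliable."""
    return STATUSOF_MATRIX.get((castor, srm)) == "works fine"

good = [pair for pair in STATUSOF_MATRIX if statusof_reliable(*pair)]
print(good)  # [('2.1.7', '2.8')]
```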

run CERN-4

To confirm line 2 of the matrix, we made a run on the CMS stager with Castor version 2.1.7 and SRM 2.8. As the plot shows, statusOf behaves correctly.

Done and online vs time.CERN-4.jpg

run pic-1

The first attempt to run the test gives an error:
Cannot submit request
[SE][BringOnline] httpg:// CGSI-gSOAP:
Error reading token data: Connection reset by peer

[SE][AbortRequest] httpg:// request
contains no request token

Second attempt: all 58 files are staged, but at the last polling, when the status is 'done' for all of them, srmLs gives an error:

[SE][Ls] httpg:// CGSI-gSOAP: Error
reading token data: Connection reset by peer

NB: the files are actually staged; at the previous polling srmLs reported 58 files ONLINE. This looks like an intermittent error of the SRM front-end.

Done and online vs time.pic-1.png stagedGB vs time.pic-1.png rateOfStagedData vs time.pic-1.png

run pic-2

Same conditions as the pic-1 run, just to confirm the results: the staging procedure works fine, at a staging rate of about 100 MB/s. The sporadic connection error at the gSOAP level is observed again.

stagedGB vs time.pic-2.png rateOfStagedData vs time.pic-2.png
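An average staging rate like the one quoted for the pic runs can be estimated from the cumulative staged volume recorded at each polling. The sketch below assumes samples of the form (elapsed seconds, cumulative staged GB); the function name and the sample values are hypothetical and chosen only to illustrate the arithmetic, they are not measured data.

```python
def average_rate_mb_s(samples):
    """Average staging rate in MB/s.

    samples: list of (elapsed_seconds, cumulative_staged_GB), ordered by time.
    """
    (t0, gb0), (t1, gb1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples at increasing times")
    return (gb1 - gb0) * 1000.0 / (t1 - t0)  # using 1 GB = 1000 MB

# Hypothetical samples, not taken from the actual runs.
samples = [(0, 0.0), (500, 50.0), (1000, 100.0)]
rate = average_rate_mb_s(samples)
print(rate)  # 100.0
```

Only the first and last samples enter the average; a per-interval rate, as in the rateOfStagedData plots, would instead use the difference between each pair of consecutive samples.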

run CNAF-1

First official run at CNAF in the framework of STEP09 testing activity. See here for the results.

run CNAF-2

Second official test at CNAF in the framework of STEP09. See here for the results.

-- ElisaLanciotti - 08 May 2009

Topic revision: r12 - 2020-08-19 - TWikiAdminUser