Castor at CNAF

  • Fri Oct 5 2007

After last Wednesday's upgrade, no apparent improvement was observed. The test is repeated here to confirm the result of run castor-cnaf.test3.

Analysis of run castor-cnaf.test4 from Fri Oct 5 15:40 to Fri Oct 5 23:21 2007
Num of parallel proc.: 100 each with 100 requests
Increasing polling time, starting with t0=8, timeout for a request: 8176.0 s (136.3 min.)
Mean time per request in each process: 229.8 s ( 3.8 min.)
Average over all the run: (total req. done)/(total duration)= 0.362 Hz (one request every 2.76 s)
Total failures getting the request token: 0
Total failures getting the TURL: 489
Total failures in ptp => 489 over 10000 (4.89 %)
Comments:
90 failures are pure timeouts (no error reported).
The remaining 399 failures again show the error "stage_put error".
The upgrade does not seem to have improved performance.
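
As a cross-check of the summary figures, the request rate and the failure fraction can be recomputed from the run boundaries. The sketch below assumes that "total req. done" counts all 10000 submitted requests and that the start and end times quoted for castor-cnaf.test4 are exact to the minute; 'run_summary' is an illustrative name, not part of the test scripts.

```python
from datetime import datetime

def run_summary(start, end, n_requests, n_failures):
    """Recompute rate, mean spacing and failure fraction for one run."""
    duration = (end - start).total_seconds()
    rate_hz = n_requests / duration        # requests per second
    spacing_s = duration / n_requests      # seconds per request
    failure_pct = 100.0 * n_failures / n_requests
    return duration, rate_hz, spacing_s, failure_pct

# Figures quoted for castor-cnaf.test4 (times assumed exact to the minute).
start = datetime(2007, 10, 5, 15, 40)
end = datetime(2007, 10, 5, 23, 21)
duration, rate_hz, spacing_s, failure_pct = run_summary(start, end, 10000, 489)
print(f"{duration:.0f} s, {rate_hz:.3f} Hz, {spacing_s:.2f} s/request, {failure_pct:.2f} % failed")
```

With these inputs the rate comes out at about 0.36 requests per second, which suggests that the quotient (total req. done)/(total duration) is a frequency and that its inverse is the mean spacing between requests.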

  • Wed Oct 3 2007

Analysis of run castor-cnaf.test3 from Wed Oct 3 14:56 to Thu Oct 4 03:32 2007
Num of parallel proc.: 100 each with 100 requests
Increasing polling time, starting with t0=4, Max. timeout for a request: 4088.0 s ( 68.1 min.)
Mean time per request in each process: 312.0 s ( 5.2 min.)
Average over all the run: (total req. done)/(total duration)= 0.220 Hz (one request every 4.54 s)
Total failures getting the request token: 50
Total failures getting the TURL: 403
Total failures in ptp => 453 over 10000 (4.53 %)
Comments: The 50 failures to get the request token are due to a different problem with the database. The output of the 'ptp' command is reported below:

Sending PtP request to: httpg://srm-v2.cr.cnaf.infn.it:8443/
============================================================
Request status:
  statusCode="SRM_INTERNAL_ERROR"(14)
  explanation="Not able to find the version of castor in the database Original error was ORA-00904: "SCHEMAVERSION": invalid identifier
"
============================================================
SRM Response:
============================================================

About the 403 failures to get a TURL: 189 out of 403 are due to a pure timeout (no error occurred). In a further 217 cases the put cycle failed with the same error observed in all the previous runs: "stage_put error: Unknown internal error". See run castor-cnaf.test2 below for the whole error message.
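
The request-level status blocks printed by the client share a fixed statusCode/explanation layout, so failures can be tallied automatically from saved outputs. A minimal sketch, assuming each saved output holds one request-level status and that the explanation text of interest starts on the same line as 'explanation="'; 'parse_status' and 'tally' are hypothetical helper names, not part of the SRM client.

```python
import re
from collections import Counter

STATUS_RE = re.compile(r'statusCode="([A-Z_]+)"\((\d+)\)')
EXPLANATION_RE = re.compile(r'explanation="([^\n]*)')

def parse_status(output):
    """Return (status_code, explanation) for the first status block, or None."""
    code = STATUS_RE.search(output)
    if code is None:
        return None
    expl = EXPLANATION_RE.search(output, code.end())
    text = expl.group(1).strip() if expl else ""
    if text.endswith('"'):
        text = text[:-1]  # drop the closing quote of the explanation field
    return code.group(1), text

def tally(outputs):
    """Count saved outputs by (status_code, explanation)."""
    counts = Counter()
    for out in outputs:
        counts[parse_status(out) or ("UNPARSED", "")] += 1
    return counts

# Status block copied from the ptp output quoted on this page.
sample = '''Request status:
  statusCode="SRM_INTERNAL_ERROR"(14)
  explanation="Not able to find the version of castor in the database Original error was ORA-00904: "SCHEMAVERSION": invalid identifier
"
'''
for key, n in tally([sample, sample, "no status here"]).items():
    print(n, key)
```

Only the first status in each output is read, so for 'statusptp' outputs the per-file explanation (for example the stage_put error) would need a second pass over the file statuses.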

Castor upgrade, 14:40: Giuseppe Lore reports that Castor has been upgraded (to which version?). The problem with the database version mismatch should now be fixed.

  • Tue Oct 2 2007

Analysis of run castor-cnaf.test2 from Tue Oct 2 19:11:03 to Wed Oct 3 09:11:08 2007
Num of parallel proc.: 100 each with 100 requests
Increasing polling time, starting with t0=4, Max. timeout for a request: 4088.0 s ( 68.1 min.)
Mean time per request in each process: 363.3 s ( 6.1 min.)
Average over all the run: (total req. done)/(total duration)= 0.198 Hz (one request every 5.04 s)
Total failures getting the request token: 48
Total failures getting the TURL: 505
Total failures in ptp => 553 over 10000 (5.53 %)
Comments: The 48 failures to get the request token are all due to a database version mismatch. The output of the 'ptp' command is reported below:

Request status:
  statusCode="SRM_INTERNAL_ERROR"(14) 
  explanation="Version mismatch between the database and the code : "2_1_2_4" versus "2_1_3_8""

About the 505 failures in getting the TURL:
In 233 cases out of 505 the failure is due to an SRM internal error. The output of the 'statusptp' command is reported below:

Sending StatusPtP request to: httpg://srm-v2.cr.cnaf.infn.it:8443/
============================================================
Request status:
statusCode="SRM_FAILURE"(1)
explanation="No subrequests succeeded"
============================================================
 SRM Response:
remainingTotalRequestTime=0
arrayOfFileStatuses (size=1)
[0] SURL="srm://srm-v2.cr.cnaf.infn.it:8443/castor/cnaf.infn.it/grid/lcg/lhcb/elisa/response_tests/99/84"
[0] status: statusCode="SRM_FAILURE"(1)
                   explanation="stage_put error: Unknown internal error"

In 271 cases out of 505 the TURL was not obtained because the timeout elapsed. No error occurred from the SRM point of view: the request was still in status 'SRM_REQUEST_QUEUED' after the last poll.
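
The distinction drawn above between a pure timeout and an SRM failure can be sketched as a polling loop. This is an illustration under stated assumptions, not the actual test script: 'check_status' is a caller-supplied stand-in for a 'statusptp' call, the list of waits is left to the caller, and the grouping of status names into pending and failed sets is assumed.

```python
import time

# Status names from this page plus SRM_REQUEST_INPROGRESS; the grouping is an assumption.
PENDING = {"SRM_REQUEST_QUEUED", "SRM_REQUEST_INPROGRESS"}
FAILED = {"SRM_FAILURE", "SRM_INTERNAL_ERROR"}

def poll_for_turl(check_status, intervals, sleep=time.sleep):
    """Poll after each wait in `intervals` and classify the outcome.

    Returns (outcome, last_status, seconds_waited), where outcome is
    'ok', 'error' or 'timeout'.
    """
    waited = 0.0
    status = None
    for interval in intervals:
        sleep(interval)
        waited += interval
        status = check_status()  # hypothetical wrapper around statusptp
        if status in FAILED:
            return "error", status, waited
        if status not in PENDING:
            return "ok", status, waited
    # Schedule exhausted while the request was still pending: a pure timeout.
    return "timeout", status, waited

# Simulated request that never leaves the queue; sleeping is stubbed out.
stuck = poll_for_turl(lambda: "SRM_REQUEST_QUEUED", [4, 8, 16], sleep=lambda s: None)
print(stuck)
```

A request that is still queued when the schedule runs out is reported as 'timeout' with its last status, so it can be counted separately from requests that returned an explicit failure.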

  • Fri 28 Sep 2007

Still problems with the GPFS file system.

  • Thu 27 Sep 2007

Problems with the GPFS file system at CNAF. Test results are not reliable.

  • Wed 26 Sep 2007

Analysis of run castor-cnaf.test1
Num of parallel proc.: 100 each with 10 requests
Increasing polling time, starting with t0=2
Mean time per request: 228.068 s ( 3.801 min.)
Average over all the run: (total req. done)/(total duration)= 0.218 Hz (one request every 4.587 s)
Timeout for a request is: 2044.000 s ( 34.067 min.)
Total failures getting the TURL: 53
Total failures in ptp => 53 over 1000 (5.3 %)
Comments: all failures are due to timeout. Repeat the same run with a longer timeout.
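
The three timeouts quoted on this page (2044 s for t0=2, 4088 s for t0=4, 8176 s for t0=8) scale linearly with t0. One schedule that reproduces them, offered as an assumption rather than as a description of the actual test script, is nine polls with waits of t0 * 2^k for k = 1..9:

```python
def polling_schedule(t0, n_polls=9):
    """Waits of t0 * 2**k for k = 1..n_polls (an assumed schedule)."""
    return [t0 * 2 ** k for k in range(1, n_polls + 1)]

def max_timeout(t0, n_polls=9):
    """Total waiting time if every poll finds the request still pending."""
    return sum(polling_schedule(t0, n_polls))

for t0 in (2, 4, 8):
    total = max_timeout(t0)
    print(f"t0={t0}: {total} s ({total / 60:.1f} min)")
```

Under this assumption, doubling t0 doubles the maximum timeout, which matches the t0 values and timeouts quoted for the later runs.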

-- ElisaLanciotti - 26 Sep 2007

Topic revision: r7 - 2007-10-08 - ElisaLanciotti
 