Service functional tests after 31-Oct-2005 / 2-Nov-2005 intervention

Testing status of sites after intervention.

Unless specified, the channels are running third party copy.

Success rates are quoted on single file transfer attempts only (no retry).
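
As a rough illustration of how such a figure can be derived, the sketch below counts only the first attempt per file and ignores any retries. The record layout `(file_id, attempt_number, succeeded)` and the function name are hypothetical, chosen for this example; they are not the FTS data model.

```python
def first_attempt_success_rate(attempts):
    """Return the percentage of files whose first attempt succeeded.

    `attempts` is an iterable of (file_id, attempt_number, succeeded)
    tuples. This layout is an assumption made for the example.
    """
    first = {}
    for file_id, attempt_number, succeeded in attempts:
        # Keep only the lowest-numbered attempt seen for each file.
        if file_id not in first or attempt_number < first[file_id][0]:
            first[file_id] = (attempt_number, succeeded)
    if not first:
        return None  # no transfers recorded, so no rate can be quoted
    ok = sum(1 for _, succeeded in first.values() if succeeded)
    return 100.0 * ok / len(first)
```

With this definition a file that fails once and then succeeds on retry still counts as a failure.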

CERN-CERN channel

Over 90% of jobs complete at the first attempt. Most failures are timeouts due to stage-in (on FTS).

10.00 Wednesday

~100% success rate.

CERN-PIC channel

99% success rate.

10.00 Wednesday

~100% success rate.

CERN-ASCC channel

80% success rate. Remaining 20% mostly due to stage-in timeouts on FTS.

10.00 Wednesday

100% success rate.

CERN-INFN channel

70% success rate. The failures look like they come from one bad server:

FINAL:TRANSPORT: Transfer failed. ERROR the server sent an error response: 553 553 diskserv-rfio-1:/storage/fast900-1_sd3/zp/stage/esd.0002._010.1.18990: Host not known.

10.00 Wednesday

94% success rate. Remaining errors same as before.

CERN-NDGF channel

NDGF is currently scheduled to be off.

CERN-BNL channel

75% success rate using third party copy. Error on other 25% is mostly:

FINAL:TRANSPORT: Transfer failed. ERROR the server sent an error response: 425 425 Cannot open port: java.lang.Exception: Illegal Object received : dmg.cells.nucleus.NoRouteToCellException

Changed config to use SRM copy. After a 1 hour delay, transfers come back with:

FINAL:TRANSFER: Failed on SRM copy: Failed SRM copy context. put  on httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv1 ; id=-2147281782 Error is number of retries exceeded: org.dcache.srm.scheduler.NonFatalJobFailure: java.io.IOException: both from and to url are not local srm

10.00 Wednesday

Problems on SRM copy - switching back to 3rd party copy for now.

CERN-TRIUMF channel

50% success rate. All failures timing out on transfer. Dropped rate from 30 to 15.

10.00 Wednesday

74% success rate. Almost all errors are:

FINAL:TRANSPORT: Transfer failed. ERROR a system call failed (Connection timed out)

CERN-SARA channel

Connectivity problems reaching host srm.grid.sara.nl from the WAN area. CS see a problem with the new router and are investigating.

[root@lxshare026d glite-url-copy-sc3]# telnet srm.grid.sara.nl 8443
Trying 145.100.3.91...
[ hanging ]

[pcitgm01] /home/mccance > telnet srm.grid.sara.nl 8443
Trying 145.100.3.91...
Connected to srm.grid.sara.nl.
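
The same reachability check can be scripted so that a hanging connection is reported instead of blocking the terminal. This is a minimal sketch using only the Python standard library; the function name `can_connect` and the 10 second default timeout are arbitrary choices for the example.

```python
import socket

def can_connect(host, port, timeout=10.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        # create_connection resolves the host name and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts (the "hanging" case), refused connections and DNS failures.
        return False
```

Calling `can_connect("srm.grid.sara.nl", 8443)` from both machines would mirror the two telnet tests above. It only shows that the port accepts connections, not that the SRM service behind it is working.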

17.00 Tuesday

Network problem seems fixed.

However, there are now problems with the SARA SRM:

Many requests time out on put, then fail with:

FINAL:SRM_DEST: Failed on SRM put: Empty request status gotten. No protocol support?; also failing to do 'advisoryDelete' on target.

Overall 40% success rate.

10.00 Wednesday

~100% success rate.

15.00 Wednesday

Network contact lost to SARA.

CERN-RAL channel

100% failure rate. Almost all errors are:

FINAL:SRM_DEST: Failed on SRM put: Cannot Contact SRM Service. Error in srm__ping: SOAP-ENV:Client - CGSI-gSOAP: Error reading token data: Success

17.00 Tuesday

Problem fixed (cause not yet understood). 50% success rate, including the previous failures.

10.00 Wednesday

~100% success rate.

CERN-IN2P3 channel

100% failure rate. SRM is contactable, but times out on giving a destination TURL.

10.00 Wednesday

Still ~100% failure rate.

12.00 Wednesday

Found a networking problem: a route had not been enabled on the new router. Fixed.

Success rate is 83% on first pass. Remaining errors are:

FINAL:TRANSPORT: Transfer failed. ERROR the server sent an error response: 426 426 Transfer aborted, closing connection :java.util.ConcurrentModificationException

There are also various put errors and some get errors (on Castor).

CERN-GRIDKA channel

Original LHCb files (still in the system from last week), transferred using the standard 5 streams used on other channels, failed with:

FINAL:TRANSPORT: Transfer failed. ERROR the server sent an error response: 451 451 Local resource failure: malloc: Cannot allocate memory.

The test files used for other channels don't have this error, but all fail on the gridFTP transfer, after a long timeout.

Changed config to use SRM copy. SRM copy transfers are hanging in Pending (dCache reports an estimated start time for the transfers of 9.08 tomorrow morning).

10.00 Wednesday

Problems with SRM copy - switching back to 3rd party copy for now to get better logging.

See ~90% success rate. Remaining errors are malloc as above, a few timeouts on gridFTP, and failures where we retry a write to a file that already exists (the earlier transfer failed, and the delete attempt that followed also failed).
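
The error strings quoted on this page all begin with `FINAL:<category>:` (TRANSPORT, TRANSFER, SRM_DEST), and the gridFTP ones carry a three-digit reply code after "error response:". A tally of failures by category and code could be sketched as below. The function names are hypothetical, and the patterns are inferred only from the strings shown here, so other FTS messages may not match.

```python
import re
from collections import Counter

# Patterns inferred from the error lines quoted on this page (an assumption).
_FINAL_RE = re.compile(r"^FINAL:([A-Z_]+):")
_FTP_CODE_RE = re.compile(r"error response: (\d{3})\b")

def classify_error(line):
    """Return (category, ftp_code) for a FINAL error line, or None if it is not one."""
    line = line.strip()
    m = _FINAL_RE.match(line)
    if m is None:
        return None
    code = _FTP_CODE_RE.search(line)
    # ftp_code is None when the line carries no gridFTP reply code.
    return (m.group(1), int(code.group(1)) if code else None)

def tally_errors(lines):
    """Count error lines per (category, ftp_code) pair, skipping unrelated lines."""
    counts = Counter()
    for line in lines:
        key = classify_error(line)
        if key is not None:
            counts[key] += 1
    return counts
```

Grouping this way separates storage-side failures (for example the 451 malloc error) from SRM-level failures without reading each log line by hand.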

-- GavinMcCance - 01 Nov 2005

Topic revision: r5 - 2005-11-02 - GavinMcCance
 