Last page update: SteveTraylen, 2007-10-10
- Bad error message for dead transfers
Error message: The process serving the transfer (status = RUNNING) is no longer active (could not open file /proc/5734/cmdline)
A bug in the Globus client sometimes causes it to die (an explicit abort inside the library routine) when it receives bad messages from one of the gridFTP servers. This problem is currently affecting the CERN-PIC export channel.
Status: understood - the bad transfers are killed cleanly and should be retried.
Reported by: LHCb, CMS
Impact: cosmetic: it's a bad error message.
Action: the error message needs improving. The underlying problem, abort being called inside the Globus client, may be fixed in VDT 1.6 (to be tested).
> ftscp(6196): Reason: The process serving the transfer (status =
> RUNNING) is no longer active (could not open file
> /proc/5734/cmdline)
2007-06-20 13:49:44 ftscp(6196):
The error is the same one you get on FTS 1.5 servers: "No job found for this ID" (we have a pending improvement to make the error message useful).
What it means is that the dCache gridFTP server at PIC sent a message that does not comply with the gridFTP protocol to the Globus client running the transfer (actually, I think there is a race condition in the gridFTP 1 protocol itself). There is a bug (or misfeature) in the current VDT/Globus library we use: if it receives such a bad message, it calls abort() in the library routine rather than just failing, and this causes our transfer process to die uncleanly without being able to log much useful information. We should make this more obvious in the error message.
This is currently associated with one particular gridFTP server door at PIC (dc003, I think).
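The failure mode described above can be reproduced in miniature. The following is a minimal sketch, not FTS or Globus code: a child process calls abort() (here through Python's os.abort(), standing in for the abort inside the library routine), and the parent then reports which signal killed it instead of only noting that the process has gone. The helper names is_process_active, describe_exit and run_aborting_child are hypothetical, and the /proc check assumes a Linux host.

```python
import os
import signal
import subprocess
import sys


def is_process_active(pid):
    """Return True if /proc/<pid>/cmdline can be opened and is non-empty (Linux only)."""
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as handle:
            return len(handle.read()) > 0
    except OSError:
        # The entry is missing once the process has died and been reaped.
        return False


def describe_exit(returncode):
    """Turn a subprocess return code into a message that names the fatal signal."""
    if returncode < 0:
        # On POSIX, Popen reports death by signal N as return code -N.
        sig = signal.Signals(-returncode)
        return f"transfer process killed by {sig.name}"
    return f"transfer process exited with status {returncode}"


def run_aborting_child():
    """Run a child that calls abort(); a stand-in for the library's abort()."""
    child = subprocess.Popen([sys.executable, "-c", "import os; os.abort()"])
    child.wait()
    return child.pid, child.returncode


if __name__ == "__main__":
    pid, returncode = run_aborting_child()
    print(describe_exit(returncode))
    print("still active:", is_process_active(pid))
```

Reporting the signal name is one way a message could say why the process disappeared; whether the agent can obtain the exit status of its transfer processes in this way is an assumption of the sketch.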
Ready
- channel 'stuck'
Status: intermittent problem that causes one of the transfer channels to become blocked, currently affecting CERN-PIC
transfers. Bug fix in testing (trivial fix).
Reported by: LHCb, CMS
Impact: when the problem occurs, the bad channel stops transferring data until a manual intervention is made. The other export channels continue fine.
Presumably similar to the "Bad error message for dead transfers" issue. Another bad (but not quite the same) message from one of the gridFTP servers triggers another bug in the Globus gridFTP client library we use, causing it to go into an infinite loop deep inside the library copy call. This would normally not be bad (since we should just kill the bad process and retry), but a bug in our agent causes it to block while trying to kill such processes.
Our immediate course of action is to fix the problem in our agent, so that these bad transfers get killed cleanly and we don't block the channel. Then we can look at the problems with the Globus client library we use. Some of the bugs we know of are fixed in the latest VDT 1.6 version of Globus, so we plan to migrate to that as soon as we've run some stress tests with it on the FTS pilot service.
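The idea of killing a looping transfer process without blocking the channel can be illustrated as follows. This is a minimal sketch under stated assumptions, not the agent's code or the actual fix: it sends SIGTERM, waits for a bounded time, and escalates to SIGKILL. The names kill_cleanly, start_stuck_child and STUCK_CHILD are hypothetical, and the stuck child is simulated by a Python busy loop that ignores SIGTERM.

```python
import signal
import subprocess
import sys

# Simulated stuck transfer: ignores SIGTERM and spins forever.
STUCK_CHILD = (
    "import signal\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "while True:\n"
    "    pass\n"
)


def start_stuck_child():
    """Start the simulated stuck process and wait until it ignores SIGTERM."""
    child = subprocess.Popen(
        [sys.executable, "-c", STUCK_CHILD], stdout=subprocess.PIPE
    )
    child.stdout.readline()  # returns once the child has printed 'ready'
    return child


def kill_cleanly(process, grace_seconds=2.0):
    """Stop a child with bounded waits: SIGTERM first, then SIGKILL."""
    if process.poll() is not None:
        return process.returncode  # already dead, nothing to do
    process.terminate()  # polite request (SIGTERM)
    try:
        return process.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        process.kill()  # SIGKILL cannot be caught or ignored
        # Still bounded: raises TimeoutExpired rather than hanging forever.
        return process.wait(timeout=grace_seconds)


if __name__ == "__main__":
    child = start_stuck_child()
    returncode = kill_cleanly(child, grace_seconds=0.5)
    child.stdout.close()
    print("child return code:", returncode)
```

Every wait in the sketch has a timeout, so the caller regains control even if the child misbehaves; the channel could then retry the transfer.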
fts-support
Also, I tried to cancel a long-lasting transfer. It is still pending:
glite-transfer-status -s https://prod-fts-ws.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer 94df5e40-1f30-11dc-840f-a3448bce587f -l
About 10 minutes ago (2007-06-20 17:09:46 CERN time), it is still trying to cancel it (Canceling). Could you please have a look at this?
Thanks again, Yujun