Last Page Update: SteveTraylen, 2007-10-10

Current issues on CERN-PROD FTS 2.0 service

Some issues have been noticed on the new FTS 2.0 service. They are tracked below.

Bad error message for dead transfers

The process serving the transfer (status = RUNNING) is no longer active (could not open file /proc/5734/cmdline)

A bug in the Globus client sometimes causes it to die (an explicit abort inside the library routine) when it receives bad messages from one of the gridFTP servers. This problem is currently affecting the CERN-PIC export channel.

Status: understood - the bad transfers are killed cleanly and should be retried.

Reported by: LHCb, CMS

Impact: cosmetic: it's a bad error message.

Action: the message needs improving. The underlying problem of abort being called inside the Globus client may be fixed in VDT 1.6 (to be tested).

> ftscp(6196): Reason: The process serving the transfer (status = 
> RUNNING) is no longer active (could not open file
> /proc/5734/cmdline) 2007-06-20 13:49:44 ftscp(6196):

This is the same error that you get on FTS 1.5 servers as:

"No job found for this ID"

(we have a pending improvement to make the error message useful.)
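
The wording of the message suggests that the agent treats a transfer process as gone once it can no longer open /proc/<pid>/cmdline. A minimal Python sketch of that kind of liveness check is given here for illustration. It is an assumption based on the message text, not the FTS agent's code, and the name process_is_active is hypothetical. It relies on the Linux /proc filesystem.

```python
import os


def process_is_active(pid):
    """Return True if /proc/<pid>/cmdline can be opened (Linux only).

    Hypothetical helper: it mirrors the check implied by the error
    message, not the actual FTS agent implementation.
    """
    try:
        with open("/proc/%d/cmdline" % pid, "rb"):
            return True
    except OSError:
        # Covers a missing /proc entry (process gone) and permission errors.
        return False


if __name__ == "__main__":
    # The current process is necessarily active while this runs.
    print(process_is_active(os.getpid()))
```

A check like this can only say that the process has disappeared, not why, which is consistent with the message carrying so little information.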


What it means is that the dCache gridFTP server at PIC sent a message that does not comply with the gridFTP protocol
to the Globus client running the transfer (actually, I think there is a race condition in the gridFTP 1
protocol itself). There is a bug (or mis-feature) in the current VDT/Globus library we use:
if it receives such a bad message, it calls abort() in the library routine (rather than just failing),
and this causes our transfer process to die uncleanly without being able to log much useful
information. We should make this more obvious in the error message. This is currently
associated with one particular gridFTP server door at PIC (dc003, I think).
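
To illustrate why an abort() inside a library leaves so little to log: the process is terminated by SIGABRT without returning an error to its caller, so only the parent can observe what happened, through the exit status. The Python sketch below shows a parent detecting this and producing a clearer reason string. It is an illustration under stated assumptions, not FTS code: the child command, the helper names describe_exit and run_child, and the message wording are all hypothetical.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a human-readable reason."""
    if returncode == 0:
        return "transfer process exited normally"
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as -N.
        name = signal.Signals(-returncode).name
        return "transfer process was killed by signal %s" % name
    return "transfer process exited with status %d" % returncode


def run_child(code):
    """Run a Python snippet in a child process and return its return code."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        stderr=subprocess.DEVNULL,  # keep the sketch's output quiet
    )
    return result.returncode


if __name__ == "__main__":
    # Stand-in for a library routine calling abort():
    # os.abort() raises SIGABRT in the child.
    print(describe_exit(run_child("import os; os.abort()")))
```

Reporting the signal name in this way is one possible shape for a more useful message than "could not open file /proc/<pid>/cmdline".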

Agent hangup problem

FIXED IN PATCH 1232 (the first patch available for general release to T1 sites).

Symptom: files stuck in Ready - channel 'stuck'

Status: intermittent problem that causes one of the transfer channels to become blocked, currently affecting CERN-PIC transfers. Bug fix in testing (trivial fix).

Reported by: LHCb, CMS

Impact: when the problem occurs, the bad channel stops transferring data until a manual intervention is made. The other export channels continue fine.

Presumably similar to the "Bad error message for dead transfers" issue.

Another bad (but not quite the same) message from one of the gridFTP 
servers triggers another bug in the Globus gridFTP client library we use, causing 
it to go into an infinite loop deep inside the library copy call. This would 
normally not be bad (since we should just kill the bad process and retry), 
but a bug in our agent causes it to block while trying to kill such processes.


Our immediate course of action is to fix the problem in our agent
- so that these bad transfers get killed cleanly and we don't block the channel.

Then we can look at the problems with the Globus client library we use 
- some of the bugs we know are fixed in the latest VDT 1.6 version of Globus, 
so we plan to migrate to that - as soon as we've run some stress tests with it 
on the FTS pilot service.
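
One general way to avoid blocking while killing a hung child process is to bound every wait with a timeout and escalate from a polite termination request to a forced kill. The Python sketch below illustrates that pattern only. It is not the fix shipped in patch 1232, and the names stop_process, spawn_busy_loop and grace_seconds are hypothetical.

```python
import subprocess
import sys


def stop_process(proc, grace_seconds=1.0):
    """Stop a child process without ever waiting on it indefinitely.

    Returns the child's return code, or None if it could not be
    reaped within the allowed time.
    """
    if proc.poll() is not None:
        return proc.returncode        # already finished
    proc.terminate()                  # polite request (SIGTERM on POSIX)
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()                   # forced kill (SIGKILL on POSIX)
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        return None                   # give up for now; the caller moves on


def spawn_busy_loop():
    """Start a child stuck in an infinite loop (stand-in for a hung transfer)."""
    return subprocess.Popen([sys.executable, "-c", "while True: pass"])


if __name__ == "__main__":
    child = spawn_busy_loop()
    print(stop_process(child, grace_seconds=2.0))
```

Because both waits are bounded, a caller looping over several transfers is delayed by at most twice grace_seconds per stuck process, so one bad process cannot hold up the rest of the channel.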

Cancellation issue

Some problems have been noticed with canceling jobs in a timely manner.

Status: still to be investigated.

Reported by: CMS

Impact: uncertain.

Mail from Yujun [FNAL] on fts-support
Also, I tried to cancel a long-lasting transfer. It is still pending:

glite-transfer-status -s https://prod-fts-ws.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer 94df5e40-1f30-11dc-840f-a3448bce587f -l

about 10 minutes ago (2007-06-20 17:09:46 CERN Time), it is still trying to cancel it (Canceling).

Could you please have a look on this?


Thanks again,
Yujun


Last edit: SteveTraylen on 2007-10-10 - 10:19

Maintainer: GavinMcCance , SteveTraylen


Topic revision: r5 - 2007-10-10 - SteveTraylen
 