Last Page Update
GavinMcCance
2008-03-27

FTS version 2.0 Known Issues

This is where current known issues are tracked for the FTS release 2.0.

Could NOT load client credentials

Bug tracked at BUG:33449.

Symptoms: all the transfers from a certain user fail with the error 'SOURCE error during PREPARATION phase: [PERMISSION] [SrmPing] failed: SOAP-ENV:Client - CGSI-gSOAP: Could NOT load client credentials'.

Cause: corruption of the proxy certificate on disk.

Resolution:

  • delete credentials from the database

This can be done by the user, by running:

glite-delegation-destroy -s https://<server>:<port>/glite-data-transfer-fts/services/gridsite-delegation -v

or by the admin, by deleting the user's rows from the T_CREDENTIAL and T_CREDENTIAL_CACHE tables in the database.

  • delete credentials from disk

On every FTS agent machine, check whether the /tmp folder contains an x509up_ file for the user and delete it.

  • submit a new job
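To locate the on-disk proxies mentioned in the second step, a small helper along these lines may be useful. It is only a sketch: the function name list_proxy_files is made up here, it assumes a find that supports -maxdepth, and it only lists candidates so that the admin can check which file belongs to the affected user before deleting anything.

```shell
# List candidate proxy files (x509up_*) directly under a directory.
# Defaults to /tmp, the location named in the procedure above.
# Nothing is deleted: removal is left to the admin.
list_proxy_files() {
    dir=${1:-/tmp}
    find "$dir" -maxdepth 1 -type f -name 'x509up_*' -print
}
```

Run it on each agent machine (for example `list_proxy_files /tmp`), then remove the affected user's file by hand.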

Explanation: the proxy is only delegated when required (the condition is that its lifetime is below 4 hours). The delegation is performed by the glite-transfer-submit CLI. The first submit client that sees that the proxy needs to be redelegated is the one that does it, and the proxy then stays on the server for about 8 hours (the default lifetime is 12 hours). We found a race condition in the delegation: if two clients detect at the same time that the proxy needs to be renewed (as is likely), they both try to do it, and the delegation requests can get mixed up, so that what finally ends up in the DB is the certificate from one request and the key from the other. We do not detect this, and the proxy remains invalid for the next ~8 hours.

The real fix requires a server side update (ongoing).

The quick fix. There are two options:

a) Use the legacy myproxy mode that the 2.0 server still supports. Upload the proxy to myproxy-fts.cern.ch and add -p to the submit command, as before. I see CMS have started to do this on some jobs.

b) Run, ~every hour, per FTS server instance:

/opt/glite/bin/glite-delegation-init -f -s https://prod-fts-ws.cern.ch:8443/glite-data-transfer-fts/services/gridsite-delegation

Here the URL is the same as the FileTransfer service URL, with FileTransfer replaced by gridsite-delegation (i.e. sed 's/FileTransfer/gridsite-delegation/').

Make sure you run only one instance of this per server at a time, or you'll be open to the same race condition. It will ensure you always have a newish proxy on the server, so the transfer-submit commands will never attempt a delegation.
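One possible way to script option b) is sketched below. The function names delegation_url and run_exclusive, the lock file path and the exit status 75 are all choices made for this sketch, and it assumes the flock utility (util-linux) is installed. The first function applies the sed substitution described above; the second runs a command only if nobody else currently holds the lock.

```shell
# Derive the delegation endpoint from a FileTransfer endpoint
# (same sed substitution as in the text).
delegation_url() {
    printf '%s\n' "$1" | sed 's/FileTransfer/gridsite-delegation/'
}

# Run a command only if the lock file is free; otherwise skip it.
# Exit status 75 for "skipped" is a choice made for this sketch.
run_exclusive() {
    lock_file=$1
    shift
    (
        flock -n 9 || {
            echo "lock $lock_file is held, skipping" >&2
            exit 75
        }
        "$@"
    ) 9>"$lock_file"
}

# Example (not executed here; the lock path is hypothetical):
# run_exclusive /var/lock/fts-delegation.lock \
#     /opt/glite/bin/glite-delegation-init -f \
#     -s "$(delegation_url https://prod-fts-ws.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer)"
```

Note that a local lock file only serialises runs on one machine. If the refresh is started from several hosts against the same server, they would still need to be coordinated in some other way.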

Two active transfers for the same request

This problem has shown up very rarely, most likely after changing the channel type from urlcopy to srmcopy or vice versa (check the correct procedure to perform this operation at FtsChangeChannelType20). The software bug is tracked at BUG:31161.

Symptoms: all the jobs on a particular channel remain in the 'Submitted' state.

Cause: the agent found two active transfers for the same file and disabled the CheckState action.

To verify if this is actually what is happening, look at the channel agent log under /var/log/glite; you should see something like:

WARN   channel-action-tx-cache - File <12317106> has already an active transfer <CERN-GRIDKAPPS__2007-09-10-1637_JMB1YW:0>
ERROR  channel-action-CheckState - Logic Error in filling Active transfers cache. Reason is two active transfers for the same file
ERROR  channel-action-CheckState - LogicError in Executing action: two active transfers for the same file
WARN   glite-transfer-scheduler-transfer-channel-agent - LogicError in Executing Action. Reason is two active transfers for the same file
INFO   glite-transfer-scheduler-transfer-channel-agent - Action glite:CheckState will be Disabled for 300 seconds
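To pull the affected file and request IDs out of a log like the one above, something along these lines could be used. This is a sketch that assumes the WARN line always has exactly the format shown in the excerpt; the function name duplicate_active_transfers is made up here, and the ":0" suffix after the request ID is simply dropped.

```shell
# Print "<file_id> <request_id>" for every duplicate-active-transfer warning.
# Reads the named log files, or standard input if none are given.
# Assumes the message format shown in the log excerpt above.
duplicate_active_transfers() {
    sed -n 's/.*File <\([0-9][0-9]*\)> has already an active transfer <\([^:>]*\):[0-9]*>.*/\1 \2/p' "$@"
}
```

It would be pointed at the channel agent log under /var/log/glite; the request ID it prints is the one the agent was trying to check.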

Resolution:

Resolution requires a manual intervention in the database. This example shows how to solve the problem on the CERN-GRIDKAPPS channel whose logs are reported above.

  • stop the agent:
service transfer-agents --instance glite-transfer-channel-agent-srmcopy-CERN-GRIDKAPPS stop

  • find the request IDs and file IDs of the files that have more than one active transfer, by running this query in the database (thanks to Krzys):
select file_id,request_id from (
    select file_id,  transfer_state, request_id, (count(request_id) over (partition by file_id order by TRANSFER_STATE)) ACTIVE_TRANSFERS  
    from t_transfer
    where TRANSFER_STATE='Processing'
) where ACTIVE_TRANSFERS > 1;
The query gives the results:
FILE_ID REQUEST_ID
12317106 CERN-GRIDKAPPS__2007-09-10-1637_JMB1YW
12317106 CERN-GRIDKAPPS__2007-09-10-1631_Ue58ym

Looking at the logs, the agent was trying to check the state of CERN-GRIDKAPPS__2007-09-10-1637_JMB1YW, so the 'bad' transfer is the other one.
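When several rows come back, the same reasoning (keep the request the agent is checking, treat the others as 'bad') can be applied mechanically. The helper below is a sketch under that assumption; the name other_requests is made up here, and it expects the query output as whitespace-separated "FILE_ID REQUEST_ID" rows on standard input.

```shell
# Print the request IDs from the query output that differ from the one
# the agent is checking (given as the first argument).
# Rows whose first field is not a numeric file ID (e.g. the header) are ignored.
other_requests() {
    keep=$1
    awk -v keep="$keep" '$1 ~ /^[0-9]+$/ && $2 != keep { print $2 }'
}
```

The output should still be checked by eye against the agent log before any row is updated in the database.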

  • set the status of the 'bad' transfers to failed:

update t_transfer set 
  TRANSFER_STATE  ='Failed',
  REASON          ='Transfer state check failure',
  REASON_CLASS    ='INTERNAL_ERROR',
  ERROR_SCOPE     ='AGENT',
  ERROR_PHASE     ='TRANSFER_SERVICE'
  where REQUEST_ID='CERN-GRIDKAPPS__2007-09-10-1631_Ue58ym';

  • restart the agent:
service transfer-agents --instance glite-transfer-channel-agent-srmcopy-CERN-GRIDKAPPS start

Configuration issues

SRM Copy timeout configuration

There is a known issue with the SRM copy timeout configuration, i.e. the length of time that FTS waits for an open SRM copy request (on the dCache) before cancelling it.

In FTS 1.5 the YAIM setting for channel CHANNEL-NAME:

   FTA_CHANNEL_NAME_GUC_TRANSFERTIMEOUT=3600

would modify the value from its default of 1800 seconds (30 minutes).

Due to a bug in the deprecation of another variable, you should now add an extra line:

   FTA_CHANNEL_NAME_GUC_SRMCOPYTIMEOUT=1

So for a single agent for channel CHANNEL-NAME, the configuration is:

   FTA_CHANNEL_NAME_GUC_TRANSFERTIMEOUT=3600
   FTA_CHANNEL_NAME_GUC_SRMCOPYTIMEOUT=1

To do this for all SRMCOPY agents, the configuration is:

   FTA_TYPEDEFAULT_SRMCOPY_GUC_TRANSFERTIMEOUT=3600
   FTA_TYPEDEFAULT_SRMCOPY_GUC_SRMCOPYTIMEOUT=1
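The pair of lines can be generated rather than typed by hand. The sketch below assumes, from the CHANNEL-NAME / CHANNEL_NAME example above, that the YAIM variable name is the channel name with hyphens replaced by underscores; the function name srmcopy_timeout_lines is made up here.

```shell
# Print the two YAIM lines needed for a given channel (or type default)
# and timeout in seconds.
# Assumption: hyphens in the channel name become underscores in the variable name.
srmcopy_timeout_lines() {
    channel=$1
    timeout=$2
    var=$(printf '%s' "$channel" | tr '-' '_')
    printf 'FTA_%s_GUC_TRANSFERTIMEOUT=%s\n' "$var" "$timeout"
    printf 'FTA_%s_GUC_SRMCOPYTIMEOUT=1\n' "$var"
}
```

For example, `srmcopy_timeout_lines CHANNEL-NAME 3600` prints the single-agent pair, and `srmcopy_timeout_lines TYPEDEFAULT_SRMCOPY 3600` prints the pair for all SRMCOPY agents.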

Software bug is tracked here: BUG:29390

MyProxy error in bind()

The init.d scripts for the agent daemons source far too much into the daemons' environments (e.g. user profile scripts). If GLOBUS_TCP_PORT_RANGE ends up in the agent server's environment, it is typically mis-parsed by the MyProxy library, which then always tries to connect to MyProxy using the same outgoing port (often 20000). If more than one server tries this at the same time, you'll end up with failures like:

'Failed to get proxy certificate from myproxy-fts.cern.ch . Reason is Error in bind()'

Option 1 - try this first

To get round this, you should add into the file:

/etc/sysconfig/glite-data-transfer-agents

the following line:

export MYPROXY_TCP_PORT_RANGE=20000,25000

(note the comma).

See bug reference BUG:31169.

Note that this file is overwritten every time YAIM runs.
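Because the file is regenerated by YAIM, the line has to be put back after each run. A minimal, idempotent way to do that is sketched below; the helper name ensure_line is made up here, and when or how it is called after YAIM is left to the site.

```shell
# Append a line to a file unless an identical line is already present.
# Safe to call repeatedly: the line is never duplicated.
ensure_line() {
    file=$1
    line=$2
    grep -qxF -e "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example (not executed here):
# ensure_line /etc/sysconfig/glite-data-transfer-agents \
#     'export MYPROXY_TCP_PORT_RANGE=20000,25000'
```

The agents would presumably still need a restart before they see the new value.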

Option 2 - try this next

Some sites report that option 1 does not work - since the environment scripts are not fully under the FTS's control.

Neither MYPROXY_TCP_PORT_RANGE nor GLOBUS_TCP_PORT_RANGE actually needs to be set at all.

Edit the script:

/opt/glite/sbin/glite-data-config-service-wrapper

(owned by the glite-data-config-service RPM)

and add before the final exec:

   unset GLOBUS_TCP_PORT_RANGE

   unset MYPROXY_TCP_PORT_RANGE
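To check whether either variable still reaches a running agent, its environment can be inspected. The sketch below assumes a Linux /proc filesystem and enough permission to read the process's environ file; the function name port_range_vars is made up here. It takes the path to a NUL-separated environ file and prints any of the two variables it finds.

```shell
# Print any *_TCP_PORT_RANGE settings found in a NUL-separated
# environment file such as /proc/<pid>/environ.
# Returns non-zero if neither variable is present.
port_range_vars() {
    tr '\000' '\n' < "$1" | grep -E '^(GLOBUS|MYPROXY)_TCP_PORT_RANGE='
}
```

After the wrapper has been edited and the agents restarted, `port_range_vars /proc/<pid>/environ` (with the agent's process ID filled in) should print nothing.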

The fix

The fix will be in the upcoming SL4 version, which will not source any user profile scripts.


Last edit: Main.GavinMcCance on 2008-03-27 - 14:58

Maintainer: Main.PaoloTedesco


Topic revision: r10 - 2008-03-27 - GavinMcCance
 