FTS version 2.0 Known Issues
This is where current known issues are tracked for the FTS release 2.0.
Could NOT load client credentials
Bug tracked at BUG:33449.
Symptoms: all the transfers from a certain user fail with the error 'SOURCE error during PREPARATION phase: [PERMISSION] [SrmPing] failed: SOAP-ENV:Client - CGSI-gSOAP: Could NOT load client credentials'.
Cause: corruption of the proxy certificate on disk.
Resolution:
- delete credentials from the database
This can be done by the user, running:
glite-delegation-destroy -s https://<server>:<port>/glite-data-transfer-fts/services/gridsite-delegation -v
or by the admin, deleting the rows for the user from the T_CREDENTIAL and T_CREDENTIAL_CACHE tables in the database.
- delete credentials from disk
On all FTS agent machines, check whether the /tmp folder contains an x509up_ file for the user and delete it (see the sketch below).
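The disk clean-up can be scripted; the following is a minimal sketch, assuming password-less root ssh to the agent machines. The host list and the x509up_ file name are placeholders to be replaced with your agent hosts and the file belonging to the affected user.
#!/bin/bash
# Sketch only: remove a stale delegated proxy from /tmp on every FTS agent machine.
# AGENT_HOSTS and PROXY_FILE are placeholders, not real site values.
AGENT_HOSTS="fts-agent01 fts-agent02"
PROXY_FILE="x509up_u501"
for host in $AGENT_HOSTS; do
    echo "== $host =="
    # show the file (if present) and then remove it
    ssh root@"$host" "ls -l /tmp/$PROXY_FILE 2>/dev/null && rm -f /tmp/$PROXY_FILE"
done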
Explanation
The proxy is only delegated if required (the condition is lifetime < 4 hours). The delegation is performed by the glite-transfer-submit CLI. The first submit client that sees that the proxy needs to be redelegated is the one that does it; the proxy then stays on the server for ~8 hours or so (the default lifetime is 12 hours).
We found a race condition in the delegation: if two clients (as is likely) detect at the same time that the proxy needs to be renewed, they both try to do it, and this can result in the delegation requests being mixed up, so that what finally ends up in the DB is the certificate from one request and the key from the other. We don't detect this, and the proxy remains invalid for the next ~8 hours.
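A mismatched certificate/key pair can be confirmed with openssl once the corrupted proxy is on disk (e.g. one of the /tmp/x509up_ files on an agent machine). A minimal sketch, with a placeholder path: the two digests must be identical; if they differ, the certificate and the key come from different delegation requests.
PROXY=/tmp/x509up_u501   # placeholder: path to the suspect proxy file
openssl x509 -noout -modulus -in "$PROXY" | openssl md5   # modulus of the certificate
openssl rsa  -noout -modulus -in "$PROXY" | openssl md5   # modulus of the private key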
The real fix requires a server-side update (ongoing).
The quick fix: there are two options.
a) Use the legacy myproxy mode that the 2.0 server still supports. Upload the proxy to myproxy-fts.cern.ch and add -p to the submit, as before. I see CMS have started to do this on some jobs.
b) Run, ~every hour, per FTS server instance:
/opt/glite/bin/glite-delegation-init -f -s https://prod-fts-ws.cern.ch:8443/glite-data-transfer-fts/services/gridsite-delegation
Where the URL is the same as the FileTransfer one except for sed 's/FileTransfer/gridsite-delegation/'.
Make sure you run only one instance of this per server at a time, or you'll be open to the same race condition. It will ensure you always have a newish proxy on the server, so the transfer-submit commands will never attempt a delegation.
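One way to wire this up is a small wrapper run hourly from cron, guarded by a lock so that only one instance per server can run at a time. This is a sketch under assumptions: the lock file path is arbitrary, flock must be available, and the script has to run as the account that submits transfers, with its proxy environment in place.
#!/bin/bash
# Sketch: refresh the delegated proxy for one FTS server instance.
# Run ~hourly from cron by the submitting account; flock guarantees a single
# instance per server, avoiding the delegation race described above.
# The lock path is an example; the URL is the gridsite-delegation endpoint of your instance.
exec flock -n /var/lock/fts-delegation-refresh.lock \
    /opt/glite/bin/glite-delegation-init -f -s \
    https://prod-fts-ws.cern.ch:8443/glite-data-transfer-fts/services/gridsite-delegation
A crontab entry such as '0 * * * * /path/to/this/script' (under the submitting account) would then keep a fresh proxy on the server.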
Two active transfers for the same request
This problem has shown up very rarely, most likely after changing the channel type from urlcopy to srmcopy or vice versa (check the correct procedure to perform this operation at FtsChangeChannelType20). The software bug is tracked at BUG:31161.
Symptoms: all the jobs on a particular channel remain in the 'Submitted' state.
Cause: the agent found two active transfers for the same file and disabled the CheckState action.
To verify that this is actually what is happening, look at the channel agent log under /var/log/glite; you should see something like:
WARN channel-action-tx-cache - File <12317106> has already an active transfer <CERN-GRIDKAPPS__2007-09-10-1637_JMB1YW:0>
ERROR channel-action-CheckState - Logic Error in filling Active transfers cache. Reason is two active transfers for the same file
ERROR channel-action-CheckState - LogicError in Executing action: two active transfers for the same file
WARN glite-transfer-scheduler-transfer-channel-agent - LogicError in Executing Action. Reason is two active transfers for the same file
INFO glite-transfer-scheduler-transfer-channel-agent - Action glite:CheckState will be Disabled for 300 seconds
Resolution:
Resolution requires a manual intervention in the database. This example shows how to solve the problem on the CERN-GRIDKAPPS channel whose logs are reported above.
- stop the channel agent:
service transfer-agents --instance glite-transfer-channel-agent-srmcopy-CERN-GRIDKAPPS stop
- find the request ids and file ids for files having more than one active transfer by running this query in the database (thanks to Krzys); a sketch of running these statements with sqlplus follows these steps:
select file_id,request_id from (
select file_id, transfer_state, request_id, (count(request_id) over (partition by file_id order by TRANSFER_STATE)) ACTIVE_TRANSFERS
from t_transfer
where TRANSFER_STATE='Processing'
) where ACTIVE_TRANSFERS > 1;
The query gives the results:
FILE_ID  | REQUEST_ID
---------+----------------------------------------
12317106 | CERN-GRIDKAPPS__2007-09-10-1637_JMB1YW
12317106 | CERN-GRIDKAPPS__2007-09-10-1631_Ue58ym
Looking at the logs, the agent was trying to check the state of CERN-GRIDKAPPS__2007-09-10-1637_JMB1YW, so the 'bad' transfer is the other one.
- set the status of the 'bad' transfers to failed:
update t_transfer set
TRANSFER_STATE ='Failed',
REASON ='Transfer state check failure',
REASON_CLASS ='INTERNAL_ERROR',
ERROR_SCOPE ='AGENT',
ERROR_PHASE ='TRANSFER_SERVICE'
where REQUEST_ID='CERN-GRIDKAPPS__2007-09-10-1631_Ue58ym';
- restart the channel agent:
service transfer-agents --instance glite-transfer-channel-agent-srmcopy-CERN-GRIDKAPPS start
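For admins who prefer a scripted intervention, the SQL steps above can be run with sqlplus. The sketch below uses a placeholder account and connect string (fts_owner/secret@FTSDB) and the same example request id; both must be replaced with the real values for your instance.
#!/bin/bash
# Sketch: run the 'bad transfer' clean-up against the FTS database with sqlplus.
# fts_owner/secret@FTSDB and the REQUEST_ID below are placeholders.
sqlplus -S fts_owner/secret@FTSDB <<'EOF'
-- list files with more than one active transfer
select file_id, request_id from (
  select file_id, transfer_state, request_id,
         (count(request_id) over (partition by file_id order by TRANSFER_STATE)) ACTIVE_TRANSFERS
  from t_transfer
  where TRANSFER_STATE='Processing'
) where ACTIVE_TRANSFERS > 1;
-- mark the 'bad' transfer as failed (replace the request id accordingly)
update t_transfer set
  TRANSFER_STATE='Failed',
  REASON='Transfer state check failure',
  REASON_CLASS='INTERNAL_ERROR',
  ERROR_SCOPE='AGENT',
  ERROR_PHASE='TRANSFER_SERVICE'
where REQUEST_ID='CERN-GRIDKAPPS__2007-09-10-1631_Ue58ym';
commit;
EOF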
Configuration issues
SRM Copy timeout configuration
There is a known issue with the SRM copy timeout configuration, i.e. the length of time that FTS waits for an open SRM copy request (on the dCache) before canceling it.
In FTS 1.5, the YAIM setting for channel CHANNEL-NAME:
FTA_CHANNEL_NAME_GUC_TRANSFERTIMEOUT=3600
would modify the value from its default of 1800 seconds (30 minutes).
Due to a bug in the deprecation of another variable, you should now add an extra line:
FTA_CHANNEL_NAME_GUC_SRMCOPYTIMEOUT=1
So for a single agent for channel CHANNEL-NAME, the configuration is:
FTA_CHANNEL_NAME_GUC_TRANSFERTIMEOUT=3600
FTA_CHANNEL_NAME_GUC_SRMCOPYTIMEOUT=1
To do this for all SRMCOPY agents, the configuration is:
FTA_TYPEDEFAULT_SRMCOPY_GUC_TRANSFERTIMEOUT=3600
FTA_TYPEDEFAULT_SRMCOPY_GUC_SRMCOPYTIMEOUT=1
The software bug is tracked here: BUG:29390
MyProxy error in bind()
The init.d scripts for the agent daemons source far too much stuff into the daemons' environments (e.g. user profile scripts). If GLOBUS_TCP_PORT_RANGE ends up in the agent server's environment, it is typically mis-parsed by the MyProxy library, and the library then always tries to connect to MyProxy using the same outgoing port (often 20000). If more than one server tries this at the same time, you'll end up with failures like:
'Failed to get proxy certificate from myproxy-fts.cern.ch . Reason is Error in bind()'
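To check whether a running agent has actually picked up the variable, you can inspect its environment through /proc. A minimal sketch, assuming the process name pattern below matches your channel agent instances:
# Sketch: print any *_TCP_PORT_RANGE variables in the running agents' environments.
# The process name pattern is an example -- adjust it to your agent instances.
for pid in $(pgrep -f glite-transfer-channel-agent); do
    echo "== PID $pid =="
    tr '\0' '\n' < /proc/$pid/environ | grep TCP_PORT_RANGE
done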
Option 1 - try this first
To get round this, add the following line to the file /etc/sysconfig/glite-data-transfer-agents:
export MYPROXY_TCP_PORT_RANGE=20000,25000
(note the comma).
See bug reference BUG:31169.
Note that this file is overwritten every time YAIM runs.
Option 2 - try this next
Some sites report that option 1 does not work, since the environment scripts are not fully under the FTS's control. Neither MYPROXY_TCP_PORT_RANGE nor GLOBUS_TCP_PORT_RANGE actually needs to be set at all.
Edit the script /opt/glite/sbin/glite-data-config-service-wrapper (owned by the glite-data-config-service RPM) and add, before the final exec:
unset GLOBUS_TCP_PORT_RANGE
unset MYPROXY_TCP_PORT_RANGE
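For orientation only, the tail of the edited wrapper would then look roughly like the sketch below. The final exec line shown is purely illustrative, not the script's real content: keep whatever your installed glite-data-config-service-wrapper already execs and simply place the two unset lines just before it.
# ... existing content of /opt/glite/sbin/glite-data-config-service-wrapper ...
# keep these variables out of the daemon's environment (the actual change)
unset GLOBUS_TCP_PORT_RANGE
unset MYPROXY_TCP_PORT_RANGE
# illustrative placeholder -- the real script execs the configured daemon here
exec "$@"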
The fix
It will be in the upcoming SL4 version, which will not source any user profile scripts.
Last edit: Main.GavinMcCance on 2008-03-27 - 14:58
Maintainer: Main.PaoloTedesco