Last page update: GavinMcCance, 2008-02-20

FTS proxy corruption issue

Impact and Symptom

Total service failure for a given user on all channels on the FTS server: all transfers for a given user fail with the message "Could NOT load client credentials".

Cause

We now believe we understand the cause:

The proxy is delegated only when required (the condition is a remaining lifetime of less than 4 hours). The delegation is performed by the glite-transfer-submit CLI: the first submit client to see that the proxy needs to be redelegated is the one that does it. The proxy then stays on the server for about 8 hours before it is renewed (the default lifetime is 12 hours).
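The timing above can be sketched as a small shell check. This only illustrates the arithmetic and is not the logic inside glite-transfer-submit; the function and variable names are hypothetical, while the 4-hour threshold and the 12-hour default lifetime are taken from the text.

```shell
# Values from the text above, in seconds.
DEFAULT_LIFETIME=$((12 * 3600))      # default lifetime of a delegated proxy
DELEGATION_THRESHOLD=$((4 * 3600))   # redelegate when less than this remains

# needs_delegation REMAINING_SECONDS (hypothetical helper)
# Succeeds (exit status 0) when the remaining lifetime is below the threshold.
needs_delegation() {
    [ "$1" -lt "$DELEGATION_THRESHOLD" ]
}

# Hours a freshly delegated proxy stays in place before a client replaces it.
hours_between_delegations() {
    echo $(( (DEFAULT_LIFETIME - DELEGATION_THRESHOLD) / 3600 ))
}
```

With these values, needs_delegation 3600 succeeds, needs_delegation 18000 does not, and hours_between_delegations prints 8.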

We found a race condition in the delegation: if two submit clients for the same user detect at the same time that the proxy needs to be renewed, they both try to renew it, and the delegation requests can get mixed up, so that what finally ends up in the database is the certificate from one request and the key from the other (i.e. the proxy is corrupted). We don't detect this, and the proxy remains invalid for the next ~8 hours (i.e. until the proxy certificate expires, whereupon another delegation is attempted).
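To make the interleaving concrete, here is a toy model in shell. It assumes (this is not stated above) that each delegation is a two-step exchange in which the server first stores a new key and later stores the signed certificate, and that there is one key slot and one certificate slot per user. All names are hypothetical and no real FTS or delegation API is used.

```shell
# One key slot and one certificate slot per user (assumed storage model).
stored_key=""
stored_cert=""

# Step 1 of a request (assumed): the server stores a fresh key for it.
start_delegation() { stored_key="request-$1"; }
# Step 2 of a request (assumed): the server stores the certificate signed for it.
finish_delegation() { stored_cert="request-$1"; }

# Succeeds only when key and certificate come from the same request.
proxy_is_consistent() { [ -n "$stored_key" ] && [ "$stored_key" = "$stored_cert" ]; }

# Serial case: request A completes before anything else happens.
start_delegation A; finish_delegation A
proxy_is_consistent && serial_result="valid" || serial_result="corrupted"

# Racing case: B's first step lands between A's two steps, and A finishes last.
start_delegation A; start_delegation B; finish_delegation B; finish_delegation A
proxy_is_consistent && racing_result="valid" || racing_result="corrupted"

echo "serial: $serial_result, racing: $racing_result"
```

In the racing case the key slot holds the key of request B and the certificate slot holds the certificate of request A, which is the mismatch described above.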

Fix

The real fix requires a server side update.

This is being tracked in Savannah: https://savannah.cern.ch/bugs/?33641

Workaround on the client side

There are two options:

Use the legacy myproxy mode

Use the legacy myproxy mode that the FTS 2.0 server still supports. Upload the proxy to myproxy-fts.cern.ch and add the -p option to the glite-transfer-submit CLI, as before. The problem with this option is that only plain grid proxies can be used, i.e. the proxy the FTS gets will not be a VOMS proxy.

Delegate the certificate separately from the job submission

This is the recommended workaround.

Run the following about once an hour, for each FTS server instance and each user:

/opt/glite/bin/glite-delegation-init -f -s https://prod-fts-ws.cern.ch:8443/glite-data-transfer-fts/services/gridsite-delegation

where the URL is the FileTransfer service URL with FileTransfer replaced by gridsite-delegation (i.e. sed 's/FileTransfer/gridsite-delegation/').
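The substitution can be done directly in the shell. In this sketch the FileTransfer URL is the one implied by the delegation URL above, and the variable names are illustrative.

```shell
# FileTransfer endpoint implied by the delegation URL given above.
FTS_ENDPOINT="https://prod-fts-ws.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer"

# Replace the service name to obtain the delegation endpoint.
DELEGATION_ENDPOINT=$(printf '%s\n' "$FTS_ENDPOINT" | sed 's/FileTransfer/gridsite-delegation/')

echo "$DELEGATION_ENDPOINT"
```

For another FTS server, set FTS_ENDPOINT to that server's FileTransfer URL.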

Make sure you run only one instance of this per server, per user at a time, or you'll be open to the same race condition.

This will ensure you always have a fairly up-to-date proxy on the FTS server, so the transfer-submit commands will never attempt a delegation.
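One way to guarantee a single instance per user is to wrap the command in a lock. The sketch below uses a lock directory, since mkdir is atomic; the lock location, the function name and the exit status 75 for a skipped run are our own choices, and only the glite-delegation-init command line and the URL come from the text above.

```shell
#!/bin/sh
# Delegation endpoint from the text above; set one per FTS server instance.
DELEGATION_URL="${DELEGATION_URL:-https://prod-fts-ws.cern.ch:8443/glite-data-transfer-fts/services/gridsite-delegation}"
# Command from the text above; overridable so the wrapper can be tried without gLite.
DELEGATION_CMD="${DELEGATION_CMD:-/opt/glite/bin/glite-delegation-init}"
# Hypothetical lock location, one per user (add the server name if you use several).
LOCK_DIR="${LOCK_DIR:-${TMPDIR:-/tmp}/fts-delegation-$(id -un).lock}"

# delegate_once: run the delegation unless another instance holds the lock.
# Returns 75 if the run was skipped, otherwise the exit status of the command.
delegate_once() {
    # mkdir is atomic: only one concurrent caller can create the directory.
    if ! mkdir "$LOCK_DIR" 2>/dev/null; then
        echo "delegation already running, skipping" >&2
        return 75
    fi
    "$DELEGATION_CMD" -f -s "$DELEGATION_URL"
    status=$?
    # Release the lock whether or not the delegation succeeded.
    rmdir "$LOCK_DIR"
    return "$status"
}
```

Saved as a script that ends with a call to delegate_once, this could be run from the user's crontab once an hour. A lock left behind by a killed run has to be removed by hand before the next run can proceed.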

Workaround on the server side

We can implement a (nasty) cron job on the server side that looks for corrupted proxies and deletes them from the disk and the DB.

This is not nice, because all the jobs will fail until you submit another (since you've still got no valid proxy) - and then when you submit another, you risk the same race condition.

Assuming you continue to submit jobs most of the time, it will limit the damage of a bad delegation to several minutes.

Wednesday 20/02/08: This cron job has now been implemented on CERN-PROD's FTS-T0-EXPORT and FTS-T2-SERVICE.
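The check at the heart of such a cron job could look like the sketch below. It is not the script deployed at CERN: it assumes the certificate and the private key are available as PEM files and that the openssl command-line tool is installed, and the function name is hypothetical. How the proxies are located on disk and removed from the DB depends on the FTS schema and is not shown.

```shell
# proxy_pair_matches CERT_PEM KEY_PEM (hypothetical helper)
# Exit status: 0 if the certificate and key belong together,
#              1 if they do not (a corrupted pair),
#              2 if either file cannot be read or parsed.
proxy_pair_matches() {
    # Public key embedded in the certificate.
    cert_pub=$(openssl x509 -in "$1" -noout -pubkey 2>/dev/null) || return 2
    # Public key derived from the private key.
    key_pub=$(openssl pkey -in "$2" -pubout 2>/dev/null) || return 2
    # The pair is intact only if both public keys are identical.
    [ "$cert_pub" = "$key_pub" ]
}
```

A cron job would call this for each stored proxy and treat exit status 1 as the signal to delete that proxy.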

Topic revision: r1 - 2008-02-20 - GavinMcCance
 