FTS Notes

Introduction

The File Transfer Service (FTS) provides a controlled channel for reliable delivery of files from one SE to another. It maximises utilisation of the network links and aims to avoid overloading servers with excessive parallel requests.

See FtsServerDeploy13 for more information on the FTS installation.

The FTS participates in the following flows

Data

The FTS consists of

  • An Oracle database, GRID2

Configuration

Two FTS servers are planned for the CERN site, sharing a single database/queue. Because the file data itself does not pass through the FTS, the service acts as a scheduler rather than a data mover, so transfer volume does not overload it.

High Availability

The FTS server architecture is split into multiple decoupled components. Most components can continue working independently of the rest, although the components are chained. For example, if all the VO agents go down, the channel agents will continue to service Pending jobs on the network, but will eventually run out of jobs to service, since they rely on the VO agents to assign jobs to the given channel.
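The chaining described above can be illustrated with a small toy model. This is only a sketch, not FTS code: the function name `simulate`, the tick-based timing and the one-job-per-tick rate are all assumptions made for illustration.

```python
from collections import deque


def simulate(submitted, pending, vo_agents_up, channel_agents_up, ticks):
    """Toy model of the agent chain (hypothetical, not the FTS implementation).

    submitted: jobs accepted but not yet assigned to a channel
    pending:   jobs already assigned to a channel, waiting to be serviced
    Each tick, a running channel agent services one Pending job and a
    running VO agent assigns one submitted job to the channel.
    """
    submitted, pending, finished = deque(submitted), deque(pending), []
    for _ in range(ticks):
        if channel_agents_up and pending:
            finished.append(pending.popleft())
        if vo_agents_up and submitted:
            pending.append(submitted.popleft())
    return list(submitted), list(pending), finished


# VO agents down: the channel drains its Pending jobs, then runs dry,
# while the unassigned jobs stall (they are not lost).
stalled, still_pending, finished = simulate(
    ["j3", "j4"], ["j1", "j2"],
    vo_agents_up=False, channel_agents_up=True, ticks=5,
)
print(stalled, still_pending, finished)
```

With the VO agents switched back on in this model, the stalled jobs are assigned and complete normally, which matches the recovery behaviour described in the sections below.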

The three main classes of component are the database, the web-service and the agents.

Database down

If the FTS database is down, the impact is the following:

  • The same as the web-service being down [see below] (no new requests can be accepted and existing ones cannot be queried)
  • No new transfer requests can be put onto the network until the DB is back
  • Should a currently active transfer fail, it will not be retried until the DB is back

Note that existing active transfer attempts (i.e. those already running on the network) will finish (though failed ones will not be retried). No jobs will be lost.

Web-service down

If the FTS web-service is down, the impact is the following:

  • No new client jobs for transfers can be accepted
  • Clients cannot query the status of existing jobs.

Note that provided the agents are running, existing transfers will continue to process as normal. No existing jobs will be lost.

VO agents down

If a VO agent is down, the impact is as follows, for the given VO:

  • The VO's transfers, though they can be submitted, will not be assigned to channels
  • The VO transfers will not be retried
  • VO specific actions (e.g. catalog update) will not happen

Note that transfers may still be submitted into the system and queried. Existing transfers already assigned to a channel will be processed normally (provided the channel agents are running). The system should recover once the VO agents are back - i.e. no jobs will be lost - some jobs will just stall, recovering and completing normally when the agents return.

Channel agents down

If a Channel agent is down, the impact is as follows, for the given channel:

  • All new transfers on that channel will stop, including retries (i.e. bandwidth will be lost)

Note that transfers may still be submitted into the system and queried. Jobs that have already finished transfer will complete normally with whatever VO action is defined (provided the VO agents are running). The system should recover once the channel agents are back - i.e. no jobs will be lost - some jobs will just stall, recovering and completing normally when the agents return.
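The four failure cases above can be summarised as a lookup from failed component to lost capability. The sketch below is an illustrative summary only: the capability names (`submit`, `query`, and so on) and the function `available` are hypothetical labels, and the table records just the impacts stated in these notes.

```python
# Capabilities lost when each class of component is down, as listed above.
# All names here are illustrative labels, not FTS identifiers.
IMPACT = {
    "database": {"submit", "query", "start_transfer", "retry"},
    "web_service": {"submit", "query"},
    "vo_agent": {"assign_to_channel", "retry", "vo_action"},
    "channel_agent": {"start_transfer", "retry"},
}

ALL_CAPABILITIES = {
    "submit", "query", "assign_to_channel",
    "start_transfer", "retry", "vo_action",
}


def available(down):
    """Return the capabilities still working when the given components are down."""
    lost = set()
    for component in down:
        lost |= IMPACT[component]
    return ALL_CAPABILITIES - lost


# With only the web-service down, the agents keep processing existing jobs.
print(sorted(available(["web_service"])))
```

In every case the notes state that no jobs are lost, so the model tracks only which actions stall, not job loss.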

Approach FTS 1

The FTS will be constructed as

  • Front end servers performing the application
  • Backend databases stored on Oracle RAC providing reliable database services
  • DNS Load Balancing of the fts alias

Equipment required

Approach 1

Assuming 2 live FTSes,

| Component | Number | Purpose |
| Midrange Server | 2 | Masters and standby machines |
| FC HBA | 4 | Fibre channel connectivity |
| FC Switch Ports | 4 | Connectivity for the two servers |
| FC Disk space | 1000 | Storage for FTS database (on different disk subsystems) |

Engineering required

| Development | Purpose |
| Capacity Metric | Capacity metrics defined for: number of transfers/second |
| Lemon components availability test | A Lemon-aware sensor which can be used for reporting availability. Tests for GridFTP processes would be required. |
| Lemon FTS availability test | A Lemon-aware sensor which can be used for reporting availability. |
| Linux Heartbeat availability test | A Linux-HA aware sensor which would activate the procedure for automatic switch from master to slave |
| Quattor configuration for Linux-HA | NCM component to configure Linux-HA/Heartbeat |
| Start/Stop/Status procedure | Scripts for FTS operations |
| Switch procedure | Automatic switch from master to slave: changing the DNS alias, disabling the master, enabling the slave in its new master role |
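The heartbeat test and switch procedure listed above could be combined along the following lines. This is a minimal sketch under stated assumptions: the host names `fts-a` and `fts-b`, the 30 second timeout, and the functions `next_master` and `switch_plan` are all hypothetical, and the plan is returned as plain text steps rather than calling any real Linux-HA or DNS tooling.

```python
def next_master(current_master, standby, heartbeat_ages, timeout_s=30.0):
    """Decide which host should hold the master role.

    heartbeat_ages maps host name to seconds since its last heartbeat.
    A host with no entry, or an entry older than timeout_s, counts as down.
    The timeout value is an assumption, not a figure from these notes.
    """
    def alive(host):
        age = heartbeat_ages.get(host)
        return age is not None and age <= timeout_s

    if alive(current_master):
        return current_master
    if alive(standby):
        return standby
    # No healthy standby: leave the alias where it is rather than flap.
    return current_master


def switch_plan(old_master, new_master, alias="fts"):
    """List the switch steps in the order given in the table above."""
    if old_master == new_master:
        return []
    return [
        f"point DNS alias {alias} at {new_master}",
        f"disable master role on {old_master}",
        f"enable master role on {new_master}",
    ]


# Hypothetical hosts: fts-a has missed its heartbeats, fts-b is healthy.
target = next_master("fts-a", "fts-b", {"fts-a": 120.0, "fts-b": 3.0})
for step in switch_plan("fts-a", target):
    print(step)
```

Keeping the decision separate from the actions lets the same check feed both the availability sensor and the automatic switch.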

Questions

| Nr | Description | Status | Open Date | Who | Log |
| 1 | Is round robin feasible? Is the only state information in the database? | open | 2005/09/21 | Tim | |

-- TimBell - 21 Sep 2005

Topic revision: r3 - 2005-10-13 - TimBell