FTS Notes
Introduction
The File Transfer Service (FTS) provides a controlled channel for reliable delivery of files from one Storage Element (SE) to another. It aims to maximise utilisation of the network links while avoiding overloading servers with excessive parallel requests.
See FtsServerDeploy13 for more information on the FTS installation.
The FTS participates in the following flows
Data
The FTS consists of
Configuration
It is planned to have two FTS servers for the CERN site sharing a single database/queue. Because the file data itself does not pass through the FTS, the service acts as a scheduler rather than a data mover, so it is not expected to be overloaded by transfer traffic.
High Availability
The FTS server architecture is split into multiple decoupled components. Most components can continue working independently of the rest (although the components are chained). For example, if all the VO agents go down, the channel agents will continue to service Pending jobs on the network, but will eventually run out of jobs to service, since they rely on the VO agents to assign jobs to the given channel.
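The chained-but-decoupled behaviour described above can be illustrated with a toy model. This is only a sketch, not FTS code: the function name `run_ticks`, the job identifiers and the one-job-per-tick pacing are all assumptions made for illustration.

```python
from collections import deque


def run_ticks(unassigned, pending, ticks, vo_agent_up=True, channel_agent_up=True):
    """Toy model of the agent chain (hypothetical, not the real FTS logic).

    On each tick the VO agent, if up, assigns at most one job to the channel,
    and the channel agent, if up, completes at most one Pending job.
    Returns (done, still_unassigned, still_pending).
    """
    unassigned = deque(unassigned)
    pending = deque(pending)
    done = []
    for _ in range(ticks):
        if vo_agent_up and unassigned:
            # VO agent: move a submitted job onto the channel queue.
            pending.append(unassigned.popleft())
        if channel_agent_up and pending:
            # Channel agent: service the oldest Pending job.
            done.append(pending.popleft())
    return done, list(unassigned), list(pending)


if __name__ == "__main__":
    # VO agent down: the channel agent drains what is already Pending, then idles.
    print(run_ticks(["j3", "j4"], ["j1", "j2"], ticks=5, vo_agent_up=False))
```

With the VO agent down, the two Pending jobs complete and the two unassigned jobs stay queued; nothing is lost, and they complete once the function is run again with `vo_agent_up=True`.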
The three main classes are the database, web-service and the agents.
Database down
If the FTS database is down, the impact is the following:
- The same as the web-service being down [see below] (no new requests can be accepted and existing ones cannot be queried)
- No new transfer requests can be put onto the network until the DB is back
- Should a currently active transfer fail, it will not be retried until the DB is back
Note that existing active transfer attempts (i.e. those already running on the network) will finish (though failed ones will not be retried). No jobs will be lost.
Web-service down
If the FTS web-service is down, the impact is the following:
- No new client jobs for transfers can be accepted
- Clients cannot query the status of existing jobs.
Note that provided the agents are running, existing transfers will continue to process as normal. No existing jobs will be lost.
VO agents down
If a VO agent is down, the impact is as follows, for the given VO:
- The VO's transfers, though they can be submitted, will not be assigned to channels
- The VO transfers will not be retried
- VO specific actions (e.g. catalog update) will not happen
Note that transfers may still be submitted into the system and queried. Existing transfers already assigned to a channel will be processed normally (provided the channel agents are running). The system should recover once the VO agents are back - i.e. no jobs will be lost - some jobs will just stall, recovering and completing normally when the agents return.
Channel agents down
If a Channel agent is down, the impact is as follows, for the given channel:
- All new transfers on that channel will stop, including retries (i.e. bandwidth will be lost)
Note that transfers may still be submitted into the system and queried. Jobs that have already finished transfer will complete normally with whatever VO action is defined (provided the VO agents are running). The system should recover once the channel agents are back - i.e. no jobs will be lost - some jobs will just stall, recovering and completing normally when the agents return.
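The failure cases above can be summarised as a lookup from the component that is down to the capabilities that are lost. The sketch below only encodes the impacts listed in these notes; the names `Component`, `Capability`, `LOST` and `lost_capabilities` are hypothetical, and capabilities the notes do not mention for a given failure are simply not listed rather than assumed to be available.

```python
from enum import Enum


class Component(Enum):
    DATABASE = "database"
    WEB_SERVICE = "web-service"
    VO_AGENT = "VO agent"
    CHANNEL_AGENT = "channel agent"


class Capability(Enum):
    SUBMIT = "accept new jobs"
    QUERY = "query status of existing jobs"
    ASSIGN = "assign jobs to channels"
    START = "start new transfers on the network"
    RETRY = "retry failed transfers"
    VO_ACTION = "run VO-specific actions (e.g. catalog update)"


# Capabilities lost per failed component, as listed in the notes above.
# VO agent and channel agent losses apply only to the given VO or channel.
LOST = {
    Component.WEB_SERVICE: {Capability.SUBMIT, Capability.QUERY},
    Component.DATABASE: {Capability.SUBMIT, Capability.QUERY,
                         Capability.START, Capability.RETRY},
    Component.VO_AGENT: {Capability.ASSIGN, Capability.RETRY,
                         Capability.VO_ACTION},
    Component.CHANNEL_AGENT: {Capability.START, Capability.RETRY},
}

# In every case described above, no jobs are lost; affected jobs only stall.
JOBS_LOST = False


def lost_capabilities(down):
    """Return the union of capabilities lost for a set of failed components."""
    lost = set()
    for component in down:
        lost |= LOST[component]
    return lost


if __name__ == "__main__":
    for cap in sorted(lost_capabilities({Component.VO_AGENT}), key=lambda c: c.name):
        print(cap.value)
```

A table like this could serve as a starting point for an operations checklist, since it makes explicit that a database outage includes every effect of a web-service outage.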
Approach FTS 1
The FTS will be constructed as
- Front end servers performing the application
- Backend databases stored on Oracle RAC providing reliable database services
- DNS Load Balancing of the fts alias
Equipment required
Approach 1
Assuming 2 live FTSes:
| Component | Number | Purpose |
| Midrange Server | 2 | masters and standby machines |
| FC HBA | 4 | Fibre channel connectivity |
| FC Switch Ports | 4 | Connectivity for the two servers |
| FC Disk space | 1000 | Storage for FTS database (on different disk subsystems) |
Engineering required
| Development | Purpose |
| Start/Stop/Status procedure | Scripts for FTS operations |
| Lemon FTS availability test | A Lemon-aware sensor which can be used for reporting availability. |
| Lemon components availability test | A Lemon-aware sensor which can be used for reporting availability. Tests for GridFTP processes would be required. |
| Linux Heartbeat availability test | A Linux-HA aware sensor which would activate the procedure for automatic switch from master to slave |
| Switch procedure | Automatic switch from master to slave: changing the DNS alias, disabling the master, enabling the slave in its new master role |
| Capacity Metric | Capacity metrics defined for number of transfers/second |
| Quattor configuration for Linux-HA | NCM component to configure Linux-HA/Heartbeat |
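The switch procedure in the table above could be driven by a decision function along the following lines. This is a minimal sketch under stated assumptions, not Linux-HA or Lemon code: the `Node` type, the `plan_switch` function, the 30-second timeout and the host names are all hypothetical, and the returned steps are plain descriptions rather than real commands.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_heartbeat: float  # time of the last heartbeat seen, in seconds


def plan_switch(master, standby, now, timeout=30.0):
    """Return the ordered switch steps, or an empty list if no switch is made.

    A switch is planned only when the master's heartbeat is older than
    `timeout` and the standby's heartbeat is still fresh (assumed policy).
    """
    if now - master.last_heartbeat <= timeout:
        return []  # master still alive: nothing to do
    if now - standby.last_heartbeat > timeout:
        return []  # standby is stale too: leave it to an operator
    # Step order follows the switch procedure described in the table.
    return [
        f"change DNS alias to point at {standby.name}",
        f"disable master {master.name}",
        f"enable {standby.name} in its new master role",
    ]


if __name__ == "__main__":
    master = Node("fts-a", last_heartbeat=100.0)   # hypothetical host names
    standby = Node("fts-b", last_heartbeat=128.0)
    for step in plan_switch(master, standby, now=140.0):
        print(step)
```

Keeping the decision separate from the actions makes the policy easy to test without touching DNS or the services themselves.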
Questions
-- TimBell - 21 Sep 2005