FTS service review
A review of the FTS describing how to deploy the service to obtain maximum load-balancing and availability. The service impact of interventions and problems on different components is described.
Service components and service impact of outages
The FTS is split into three distinct components, so there are three main classes of availability to consider:
- (web-service): The ability to submit new jobs, query their status and administer the channels. This is determined by the availability of the FTS web-service.
- (data transfer): That the file transfer jobs currently in the system are being processed correctly. This is determined by the availability of the FTS agents and the other external services upon which the FTS depends.
- (monitoring): The monitoring system, in preparation for the FTS 2.0 branch. This is determined by the availability of the node (and apache server) that exports the monitoring information.
All three parts of the service are well factorised from each other, so one part of the service can be down while the others stay up.
All three components synchronise on the backend database: if this database is unavailable, all parts of the service are down.
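The dependency structure described above can be written down as a small model. This is only an illustrative sketch: the component labels ("web-service node", "agents", "external services", "monitoring node", "database") are names chosen here for the example, not identifiers used by the FTS.

```python
# Illustrative model of the three service classes and what each depends on.
# Component labels are assumptions made for this sketch.
DEPENDS_ON = {
    "web-service": {"web-service node", "database"},
    "data transfer": {"agents", "external services", "database"},
    "monitoring": {"monitoring node", "database"},
}


def available_classes(components_up):
    """Return the service classes whose dependencies are all up."""
    up = set(components_up)
    return sorted(name for name, deps in DEPENDS_ON.items() if deps <= up)
```

With the database missing from the set of running components, the function returns an empty list whatever else is up, which mirrors the statement that a database outage takes down all parts of the service.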
FTS web-service
The FTS web-service is used:
- by clients to submit new jobs and query the status of existing ones
- by VO administrators to cancel jobs and change the priorities of existing ones
- by site administrators to change the properties of the transfer channels affecting their site
It is a stateless SOAP-based RPC web-service running inside the Tomcat 5.0 J2EE container.
Because it is stateless, it can be trivially load-balanced and this procedure also increases the availability. The currently recommended deployment scheme is to use DNS load-balancing. Upon loss of one node, the DNS will be made to point at remaining nodes only; this is either automatic or an operator procedure depending on the type of failure.
Impact of downtime of one of the nodes:
- Class 1 service (web-service): will run in degraded mode (potentially exhibiting overload on the remaining nodes in the cluster); excess requests should be cleanly refused.
- Class 2 service (data transfer): no impact
- Class 3 service (monitoring): no impact.
Intervention type:
- Automatic - the DNS load-balancing dropout should be set up to happen automatically if a node fails or the web-service on it becomes unresponsive. Subsequent recovery of the failed node is manual.
Resilience to glitches:
- Poor. Short glitches will be noticed by clients since the DNS propagation is not fast enough to hide them.
Recovery:
- Upon restart of the problematic node (after DNS propagation) the service is fully back up. No state will be lost (job submission requests are atomic).
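The automatic dropout from the DNS alias could be driven by a periodic health check along the following lines. This is a minimal sketch, not the mechanism actually deployed: the port number, the timeout and the rule of keeping the full list when every probe fails are all assumptions made for the example, and a plain TCP connect does not detect a web-service that accepts connections but no longer answers requests.

```python
import socket


def tcp_probe(host, port=8443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    The port is an assumption for this sketch; use the port the
    web-service actually listens on.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def alias_members(nodes, probe=tcp_probe):
    """Return the nodes that should remain behind the DNS alias."""
    healthy = [node for node in nodes if probe(node)]
    # Assumed policy: if every probe fails, keep the full list, since the
    # fault is then more likely in the prober than in all nodes at once.
    return healthy or list(nodes)
```

The result of `alias_members` would be handed to whatever tool updates the DNS alias; restoring a recovered node remains a manual step, as described above.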
FTS agents
The FTS agent daemons are responsible for processing the jobs (i.e. for doing the file copies). There are three different types of agent:
- Channel agent. You run one of these daemons for every channel defined in the system. Each is responsible for running the transfers on its associated channel. Typically several tens.
- VO agent. You run one of these daemons for every VO defined in the system. Each is responsible for handling the requests belonging to its VO. Typically under 10.
- Proxyrenewal agent. This renews expired proxies from MyProxy. You run one of these.
For load-balancing, the configuration allows the agent daemons to be spread arbitrarily over multiple nodes.
The agents are completely factorised from each other - loss of one of them will not affect the correct operation of the others.
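Since the agents are independent, any placement over the available nodes is valid. As one illustration, the sketch below spreads them in round-robin order. The agent and node names are hypothetical, and the real placement is made in the FTS configuration, not by code like this.

```python
from itertools import cycle


def spread_agents(agents, nodes):
    """Assign each agent to a node in round-robin order."""
    if not nodes:
        raise ValueError("at least one node is required")
    placement = {node: [] for node in nodes}
    for agent, node in zip(agents, cycle(nodes)):
        placement[node].append(agent)
    return placement
```

For example, `spread_agents(["channel-A", "channel-B", "vo-x"], ["node1", "node2"])` puts `channel-A` and `vo-x` on `node1` and `channel-B` on `node2`. A real deployment may prefer to weight the placement by expected channel load.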
Channel agents
Downtime of a channel agent will result in all transfers on that channel being suspended. The currently running transfers will finish, but no new transfers will be put on the network. Work queued on the channel will stay queued until the channel agent is back up to serve it again. Symptom: unstarted jobs will stay in Ready state, partially served jobs will stay in Active state.
Impact of downtime of a channel agent:
- Class 1 service (web-service): no impact.
- Class 2 service (data transfer): 100% stoppage for the given channel only. There is no impact on the service provided by the other channel agents.
- Class 3 service (monitoring): no impact.
Intervention type:
- Manual. There is no automatic fail-over supported by the software.
Resilience to glitches:
- a few minutes - currently running (i.e. on the network) transfers will continue to run while the agent is down, so a short stoppage of the agent will result in no noticeable impact on the hourly throughput.
Recovery:
- once the agent restarts, the transfers will start again; no jobs or job-state will be lost.
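The symptom of a stopped channel agent (jobs stuck in Ready or Active) suggests a simple heuristic check. The sketch below is an assumption-laden illustration: the input format, the channel names and the 30-minute threshold are invented for the example, and a long-running transfer on a healthy channel can produce a false alarm.

```python
def suspect_channels(jobs, stale_after_min=30):
    """Flag channels whose Ready/Active jobs show no recent state change.

    jobs: iterable of (channel, state, minutes_since_last_state_change).
    The threshold is an assumed value, to be tuned per deployment.
    """
    newest = {}      # most recent state change seen per channel, in minutes
    waiting = set()  # channels that have work in Ready or Active state
    for channel, state, age in jobs:
        newest[channel] = min(age, newest.get(channel, age))
        if state in ("Ready", "Active"):
            waiting.add(channel)
    return sorted(c for c in waiting if newest[c] >= stale_after_min)
```

A channel is reported only if it has pending work and none of its jobs has changed state within the threshold, so a channel that is merely idle is not flagged.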
VO agents
Downtime of a VO agent will result in no new jobs being assigned for that VO to any channel, and jobs which have finished not being moved to their Finished state. Jobs currently assigned to a channel will continue running (so the data export will continue) but eventually the channel queues will become exhausted (for the given VO) since no new jobs from that VO are being assigned to them. The symptom is that all new jobs will be stuck in Submitted state and all running jobs will be stuck in Done or Failed state.
Impact of downtime of a VO agent:
- Class 1 service (web-service): no impact.
- Class 2 service (data transfer): gradual degradation for the given VO only. The other VOs are not affected. The state machine for transfers which have finished will not be updated while the agent is down, so clients will not 'see' finished jobs as finished. Job cancellation will not function reliably.
- Class 3 service (monitoring): no impact.
Intervention type:
- Manual. There is no automatic fail-over supported by the software.
Resilience to glitches:
- several tens of minutes to several hours, depending on the depth of the VO's queue in the FTS: provided there are jobs assigned to channels, they will process at the normal export rate.
Recovery:
- once the VO agent restarts, the agent will assign all new jobs to the correct channel and update the state of 'finished' jobs. No jobs or job state will be lost.
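How long a VO agent can be down before the export rate drops follows from the data already assigned to channels and the export rate. A rough estimate, assuming a constant rate and decimal units (1 GB = 1000 MB), can be sketched as follows; the figures in the usage note are illustrative only, not measurements.

```python
def minutes_until_drained(queued_gb, export_rate_mb_s):
    """Estimate how long already-assigned work keeps the channels busy.

    queued_gb: data already assigned to channels for the VO, in GB.
    export_rate_mb_s: sustained export rate for the VO, in MB/s.
    """
    if export_rate_mb_s <= 0:
        raise ValueError("export rate must be positive")
    seconds = queued_gb * 1000 / export_rate_mb_s
    return seconds / 60
```

For instance, 360 GB of assigned work at 100 MB/s lasts about 60 minutes, which sits within the range quoted above.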
FTS monitoring component
The (to be deployed) FTS monitoring component will run on a standard apache2 (httpd) server.
Impact of downtime:
- Class 1 service (web-service): no impact.
- Class 2 service (data transfer): no impact.
- Class 3 service (monitoring): no impact.
Intervention type:
Resilience to glitches:
Recovery:
Service review template from SCM
For each service need current status of:
- Power supply (redundant including power feed? Critical?)
- Servers (single or multiple? DNS load-balanced? HA Linux? RAC? Other?)
- Network (are servers connected to separate network switches?)
- Middleware (can middleware transparently handle loss of one or more servers?)
- Impact (what is the impact on other services and / or users of a loss / degradation of service?)
- Quiesce / recovery (can the service be cleanly paused? Is there built-in recovery? (e.g. buffers) What length of interruption?)
- Tested (have interventions been made transparently using the above features?)
- Documented (operations procedures, service information)
-- GavinMcCance - 10 Jul 2007