Last Page Update
GavinMcCance
2007-07-10

FTS service review

A review of the FTS describing how to deploy the service to obtain maximum load-balancing and availability. The service impact of interventions and problems on different components is described.

Service components and service impact of outages

The FTS is split into three distinct components, so there are three main classes of availability to consider:

  1. (web-service): The ability to submit new jobs, query their status and administer the channels. This is determined by the availability of the FTS web-service.
  2. (data transfer): That the file transfer jobs currently in the system are being processed correctly. This is determined by the availability of the FTS agents and the other external services upon which the FTS depends.
  3. (monitoring): The monitoring system - in preparation for the FTS 2.0 branch. This is determined by the availability of the node (and apache server) that exports the monitoring information.

All three parts of the service are well factorised from each other, so one part of the service can be down while the others stay up.

All three components synchronise on the backend database - if this database is unavailable, all parts of the service are down.
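
The dependency structure described above can be sketched as a small model. This is only an illustration: the class and function names are hypothetical and are not part of FTS, and the "data transfer" class is simplified to ignore the external services it also depends on.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ComponentState:
    """Up/down flags for the components named in this review (hypothetical model)."""
    database_up: bool
    web_service_up: bool
    agents_up: bool
    monitoring_node_up: bool


def service_classes(state: ComponentState) -> Dict[str, bool]:
    """Return the availability of each of the three service classes.

    Every class requires the backend database; beyond that, each class
    depends only on its own component.
    """
    db = state.database_up
    return {
        "web-service": db and state.web_service_up,
        "data transfer": db and state.agents_up,
        "monitoring": db and state.monitoring_node_up,
    }


if __name__ == "__main__":
    # Web-service down: the other two classes are unaffected.
    print(service_classes(ComponentState(True, False, True, True)))
    # Database down: all three classes are down.
    print(service_classes(ComponentState(False, True, True, True)))
```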

FTS web-service

The FTS web-service is used:

  • by clients to submit new jobs and query the status of existing ones
  • by VO administrators to cancel jobs and change the priorities of existing ones
  • by site administrators to change the properties of the transfer channels affecting their site

It is a stateless SOAP-based RPC web-service running inside the Tomcat 5.0 J2EE container.

Because it is stateless, it can be trivially load-balanced and this procedure also increases the availability. The currently recommended deployment scheme is to use DNS load-balancing. Upon loss of one node, the DNS will be made to point at remaining nodes only; this is either automatic or an operator procedure depending on the type of failure.

In a load-balanced configuration, upon loss of one of the nodes, the class 1 service (web-service) will run in degraded mode (potentially exhibiting overload on the remaining nodes in the cluster); excess requests should be cleanly refused. Neither the class 2 service (data transfer) nor the class 3 service (monitoring) will be affected at all by the loss of a web-service node.
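
As a rough sketch of the alias handling described above, the following function picks the nodes that a DNS alias should point at and reports the resulting mode of the class 1 service. The function name, the mode labels and the host names are assumptions made for illustration; the actual DNS update mechanism is not described in this review and is not modelled here.

```python
from typing import Dict, List, Tuple


def plan_alias(health: Dict[str, bool]) -> Tuple[List[str], str]:
    """Pick the nodes the DNS alias should point at and report the service mode.

    health maps a node name to True (healthy) or False (lost).
    """
    members = sorted(node for node, ok in health.items() if ok)
    if not members:
        mode = "down"        # no web-service node left
    elif len(members) < len(health):
        mode = "degraded"    # remaining nodes carry the full load
    else:
        mode = "normal"
    return members, mode


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    nodes = {"fts-ws-1.example.org": True, "fts-ws-2.example.org": False}
    print(plan_alias(nodes))
```

Only the class 1 service mode is computed, since the loss of a web-service node does not affect the other two classes.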

FTS agents

The FTS agent daemons are responsible for processing the jobs (i.e. for doing the file copies). There are three different types of agent:

  • Channel agent. You run one of these daemons for every channel defined in the system. Each is responsible for running the transfers on its associated channel. Typically several tens.
  • VO agent. You run one of these daemons for every VO defined in the system. Each is responsible for handling the requests belonging to its VO. Typically under 10.
  • ProxyRenewal agent. This renews expired proxies from MyProxy. You run one of these.

For load-balancing the configuration allows the agent daemons to be spread arbitrarily over multiple nodes.

The agents are completely factorised from each other - loss of one of them will not affect the correct operation of the others.

Channel agents

Downtime of a channel agent will result in all transfers on that channel being suspended. The currently running transfers will finish, but no new transfers will be put on the network. Work queued on the channel will stay queued until the channel agent is back up to serve it again. Symptom: unstarted jobs will stay in Ready state, partially served jobs will stay in Active state.

Recovery: once the agent restarts, the transfers will start again; no jobs or job-state will be lost.

Impact of downtime: 100% stoppage on the class-2 service (data transfer) for the given channel only. There is no impact on the service provided by the other channel agents. There is no impact on the class 1 (web-service) or class 3 (monitoring) service.

VO agents

Downtime of a VO agent will result in...

Service review template from SCM

For each service, we need the current status of:
Power supply (redundant including power feed? Critical?)
Servers (single or multiple? DNS load-balanced? HA Linux? RAC? Other?)
Network (are servers connected to separate network switches?)
Middleware (can middleware transparently handle loss of one or more servers?)
Impact (what is the impact on other services and / or users of a loss / degradation of service?)
Quiesce / recovery (can the service be cleanly paused? Is there built-in recovery? (e.g. buffers) What length of interruption?)
Tested (have interventions been made transparently using the above features?)
Documented (operations procedures, service information)

Previous notes

Dump from previous note on the subject:

FTS Availability

Availability class

There are two classes of availability with FTS:

1.   The ability to submit new jobs and query their status. This is determined by the availability of the FTS web-service.
2.   That the jobs currently in the system are being processed correctly. This is determined by the availability of the FTS agents and the other services upon which the FTS depends.

The components are fully decoupled, so a failure in one component will not immediately affect the other.

Failure impact of Tier-0 to Tier-1 export service

Database: if the DB goes down, all components of the FTS will stop working, leading to 100% service unavailability in both classes. All existing transfer state is persistently cached until it can be committed to the database once it is back. Running transfers will finish but the state will not be updated until the DB is back.

The DB is running on an Oracle 10g RAC cluster. All applications use Transparent Application Failover drivers; consequently, if the currently used node in the RAC goes down, the driver should fail over automatically to another node without losing any client state.


FTS web-service: the web-service runs on multiple nodes (currently two) load-balanced with DNS. Upon loss of one node, the DNS will be made to point at one node only; this is either automatic or an operator procedure depending on the type of failure. The class 1 service will then potentially run in degraded mode due to the excess load on the single FTS web-service – some requests may be cleanly refused if the service is overloaded. It also takes some time for the DNS changes to reach all clients (up to a few hours) during which the service will exhibit a 50% failure for these clients. The class 2 service will not be affected at all.
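
The 50% figure above follows from simple arithmetic, sketched here under the assumption that a client with stale DNS data spreads its requests evenly over all addresses it still knows about. The function name is hypothetical.

```python
from fractions import Fraction


def stale_dns_failure_fraction(total_nodes: int, lost_nodes: int) -> Fraction:
    """Fraction of requests that fail for a client still using the old DNS data.

    Assumes the client spreads requests evenly over all total_nodes addresses,
    of which lost_nodes no longer answer.
    """
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    if not 0 <= lost_nodes <= total_nodes:
        raise ValueError("lost_nodes must be between 0 and total_nodes")
    return Fraction(lost_nodes, total_nodes)


if __name__ == "__main__":
    # Two load-balanced nodes, one lost: the case described in this note.
    print(stale_dns_failure_fraction(2, 1))
```

Under the same assumption, adding a third web-service node would reduce the failure fraction for such clients to one third after the loss of a single node.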


FTS VO agents: the VO agents currently run on a single machine. Upon loss of this node, no new jobs will be assigned to channels. This will lead to a 100% class 2 service failure roughly 2 hours after the incident due to queue exhaustion (i.e. when the channel agents run out of work). There is a manual expert procedure to reconfigure the spare agent node as a backup for the agent node. There is currently no automatic failover to a pre-configured ‘spare’ node. There is no impact on the class 1 service.
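
The time to queue exhaustion can be estimated as the work already assigned to channels divided by the rate at which the channel agents drain it. The sketch below assumes a constant drain rate; the function name and the numbers in the example are illustrative only and are not measurements from the service.

```python
def hours_until_queue_exhaustion(queued_transfers: int, transfers_per_hour: float) -> float:
    """Estimate how long the channel agents keep working after the VO agents stop.

    queued_transfers: transfers already assigned to channels at the time of the loss.
    transfers_per_hour: aggregate rate at which the channel agents complete transfers.
    """
    if queued_transfers < 0:
        raise ValueError("queued_transfers must not be negative")
    if transfers_per_hour <= 0:
        raise ValueError("transfers_per_hour must be positive")
    return queued_transfers / transfers_per_hour


if __name__ == "__main__":
    # Illustrative numbers: 2000 queued transfers drained at 1000 per hour.
    print(hours_until_queue_exhaustion(2000, 1000.0))
```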

FTS transfer agents: the Tier0 to Tier1 export transfer agents are split over two nodes. If one of these nodes is lost, 50% of the transfers will stop immediately; the others will continue unaffected. There is a manual expert procedure to reconfigure the spare agent node as a backup for one of the transfer agent nodes. There is currently no automatic failover to a pre-configured ‘spare’ node. There is no impact on the class 1 service.
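
The fraction of transfers that stop depends on how the channel agents are distributed over the nodes. A minimal sketch, assuming every channel carries the same share of the transfers; the function, channel and node names are hypothetical.

```python
from typing import Dict


def stopped_fraction(assignment: Dict[str, str], lost_node: str) -> float:
    """Fraction of channels whose agent ran on the lost node.

    assignment maps a channel name to the node running its channel agent.
    """
    if not assignment:
        raise ValueError("assignment must not be empty")
    stopped = sum(1 for node in assignment.values() if node == lost_node)
    return stopped / len(assignment)


if __name__ == "__main__":
    # Hypothetical layout: four channels split evenly over two nodes.
    layout = {
        "channel-1": "agent-node-a",
        "channel-2": "agent-node-a",
        "channel-3": "agent-node-b",
        "channel-4": "agent-node-b",
    }
    print(stopped_fraction(layout, "agent-node-a"))
```

With an uneven split, the impact of losing a node differs from 50% accordingly.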



Topic revision: r1 - 2007-07-10 - GavinMcCance
 