Last Page Update: GavinMcCance 2007-07-10
Ready state, partially served jobs will stay in Active state.
Recovery: once the agent restarts, the transfers will start again; no jobs or job state will be lost.
Impact of downtime: 100% stoppage of the class 2 service (data transfer) for the given channel only. There is no impact on the service provided by the other channel agents. There is no impact on the class 1 (web-service) or class 3 (monitoring) service.
For each service we need the current status of:

- Power supply (redundant, including power feed? Critical?)
- Servers (single or multiple? DNS load-balanced? HA Linux? RAC? Other?)
- Network (are servers connected to separate network switches?)
- Middleware (can middleware transparently handle the loss of one or more servers?)
- Impact (what is the impact on other services and/or users of a loss or degradation of service?)
- Quiesce / recovery (can the service be cleanly paused? Is there built-in recovery, e.g. buffers? What length of interruption can be tolerated?)
- Tested (have interventions been made transparently using the above features?)
- Documented (operations procedures, service information)
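One way to keep the answers to this checklist comparable across services is a small structured record per service. The following is a minimal Python sketch only: the `ServiceStatus` class and its field names are hypothetical and not part of FTS, and `None` marks an item whose status has not yet been established.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ServiceStatus:
    """One record per service; None means 'status not yet established'."""
    name: str
    power_supply: Optional[str] = None
    servers: Optional[str] = None
    network: Optional[str] = None
    middleware: Optional[str] = None
    impact: Optional[str] = None
    quiesce_recovery: Optional[str] = None
    tested: Optional[str] = None
    documented: Optional[str] = None

    def open_items(self) -> List[str]:
        """Return the checklist items that still need an answer."""
        return [
            f.name
            for f in fields(self)
            if f.name != "name" and getattr(self, f.name) is None
        ]


# Hypothetical example entry: only the 'servers' item has been filled in.
web_service = ServiceStatus(
    name="FTS web-service",
    servers="multiple nodes, DNS load-balanced",
)
print(web_service.open_items())
```

Listing the open items per service makes it easy to see which parts of the checklist still have to be investigated before the status can be considered complete.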
Dump from previous note on the subject:

FTS Availability

Availability class

There are two classes of availability with FTS:

1. The ability to submit new jobs and query their status. This is determined by the availability of the FTS web-service.
2. That the jobs currently in the system are being processed correctly. This is determined by the availability of the FTS agents and the other services upon which the FTS depends.

The components are fully decoupled, so a failure in one component will not immediately affect the other.

Failure impact of Tier-0 to Tier-1 export service

Database: if the DB goes down, all components of the FTS will stop working, leading to 100% service unavailability in both classes. All existing transfer state is persistently cached until it can be committed to the database once it is back. Running transfers will finish, but their state will not be updated until the DB is back. The DB is running on an Oracle 10g RAC cluster. All applications use Transparent Application Failover drivers; consequently, if the currently used node in the RAC goes down, the driver should fail over automatically to another node without losing any client state.

FTS web-service: the web-service runs on multiple nodes (currently two), load-balanced with DNS. Upon loss of one node, the DNS will be made to point at the remaining node only; this is either automatic or an operator procedure, depending on the type of failure. The class 1 service will then potentially run in degraded mode due to the excess load on the single FTS web-service: some requests may be cleanly refused if the service is overloaded. It also takes some time for the DNS changes to reach all clients (up to a few hours), during which the service will exhibit a 50% failure rate for these clients. The class 2 service will not be affected at all.

FTS VO agents: the VO agents currently run on a single machine. Upon loss of this node, no new jobs will be assigned to channels. This will lead to a 100% class 2 service failure roughly 2 hours after the incident due to queue exhaustion (i.e. when the channel agents run out of work). There is a manual expert procedure to reconfigure the spare agent node as a backup for the agent node. There is currently no automatic failover to a pre-configured ‘spare’ node. There is no impact on the class 1 service.

FTS transfer agents: the Tier-0 to Tier-1 export transfer agents are split over two nodes. If one of these nodes is lost, 50% of the transfers will stop immediately; the others will continue unaffected. There is a manual expert procedure to reconfigure the spare agent node as a backup for one of the transfer agent nodes. There is currently no automatic failover to a pre-configured ‘spare’ node. There is no impact on the class 1 service.
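The percentages and delays quoted above follow from simple arithmetic, sketched here in Python. This is an illustration under stated assumptions, not an FTS tool: the function names are hypothetical, load is assumed to be spread evenly over the nodes, and the queue depth and drain rate in the example are made-up numbers chosen only to reproduce the "around 2 hours" figure.

```python
def lost_fraction(total_nodes: int, lost_nodes: int) -> float:
    """Fraction of work hitting lost nodes, assuming an even spread.

    Applies both to web-service requests from clients with stale DNS
    and to transfers split over the transfer agent nodes.
    """
    if total_nodes <= 0 or not 0 <= lost_nodes <= total_nodes:
        raise ValueError("need 0 <= lost_nodes <= total_nodes and total_nodes > 0")
    return lost_nodes / total_nodes


def hours_until_queue_exhaustion(queued_jobs: float, jobs_per_hour: float) -> float:
    """Time until the channel agents run out of work once no new jobs
    are assigned, assuming a constant drain rate."""
    if jobs_per_hour <= 0:
        raise ValueError("jobs_per_hour must be positive")
    return queued_jobs / jobs_per_hour


# Two web-service nodes, one lost: failure rate for clients with stale DNS.
print(lost_fraction(total_nodes=2, lost_nodes=1))  # 0.5

# Two transfer agent nodes, one lost: share of transfers that stop.
print(lost_fraction(total_nodes=2, lost_nodes=1))  # 0.5

# Hypothetical queue depth and drain rate (not measured values).
print(hours_until_queue_exhaustion(queued_jobs=1000, jobs_per_hour=500))  # 2.0
```

The same helper covers both the web-service and the transfer agent case because, under the even-spread assumption, losing one of two nodes removes half of the capacity in either tier.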