FTS service review for CERN PROD
A review of the FTS describing how to deploy the service for maximum load-balancing and availability, and the service impact of interventions and problems on the different components.
The general considerations are described in FtsServiceReview20, including the effect on the overall service of the loss of, or glitches in, sub-components or dependencies.
Service review template from SCM
For each service need current status of:
- Power supply (redundant including power feed? Critical?)
- Servers (single or multiple? DNS load-balanced? HA Linux? RAC? Other?)
- Network (are servers connected to separate network switches?)
- Middleware? (can middleware transparently handle loss of one or more servers?)
- Impact (what is the impact on other services and / or users of a loss / degradation of service?)
- Quiesce / recovery (can the service be cleanly paused? Is there built-in recovery? (e.g. buffers) What length of interruption?)
- Tested (have interventions been made transparently using the above features?)
- Documented (operations procedures, service information)
Specifics for the CERN-PROD FTS service
This section covers the Tier-0 export service.
Server layout
The server layout for CERN-PROD is described in FtsTier0Deployment. The suggested general FTS deployment model is described in FtsServiceDeploymentModel.
Power supply
Power supply (redundant including power feed? Critical?)
Servers
Servers (single or multiple? DNS load-balanced? HA Linux? RAC? Other?)
- The web-services are DNS load-balanced over 3 nodes. Service monitoring checks each node every minute and will drop a problematic node out of DNS load-balancing (a sketch of such a per-node check is given at the end of this section).
- The channel agent daemons are balanced over 3 nodes. There is no redundancy. The agents are monitored and the standard (automatic) procedure is to restart them if something looks bad (this is backed by a more detailed procedure in OPM; the restart logic is sketched at the end of this section). The service is well partitioned, so one channel agent being unavailable does not affect the transfers on any other channel. In case of node failure, the spare node should be brought up in its place (a manual procedure).
- The VO agent daemons are all on one node. There is no redundancy although the nodes are monitored as for channel agents. In case of node failure, the spare node should be brought up in its place (a manual procedure).
- The monitoring server is on a single node, but is not critical to the service operations.
- The Oracle database uses the LCG Oracle RAC.
Refer to FtsServiceReview20 for more details on what happens if different parts of the service go down.
Network
Network (are servers connected to separate network switches?)
- The service daemons are randomly distributed over the hardware, which is connected to a number of different switches (bad).
- We should re-distribute them for higher resilience to internal switch failure, particularly the web-service nodes. See FtsServiceReview20 for more details.
Middleware
Middleware? (can middleware transparently handle loss of one or more servers?)
- Yes, it can. The service components are well de-coupled from each other, so they will keep running even if another part of the service is down. See FtsServiceReview20 for more details.
Recovery
Quiesce / recovery (can the service be cleanly paused? Is there built-in recovery? (e.g. buffers) What length of interruption?)
- In case of SRM failure, a channel can (and should) be paused (set Inactive) to avoid fruitlessly attempting transfers; an example is sketched after this list.
- In case of internal component failure, generally no state is lost. See FtsServiceReview20 'Recovery' notes.
- There are procedures (automatic, triggered by the LAS monitoring and backed by manual procedures in OPM) to attempt to 'restart' misbehaving servers.
'Transparent' interventions
Tested (have interventions been made transparently using the above features?)
Documentation
Documented (operations procedures, service information).
--
GavinMcCance - 18 Jul 2007