Last Page Update: PaoloTedesco - 2009-03-11

FTS service review for CERN PROD

A review of the FTS describing how the service is deployed to obtain maximum load-balancing and availability. The service impact of interventions and problems on the different components is described.

The general considerations are described in FtsServiceReview20, including the effect on the overall service of the loss of, or glitches in, sub-components or dependencies.

Service review template from SCM

For each service need current status of:

  • Power supply (redundant including power feed? Critical?)
  • Servers (single or multiple? DNS load-balanced? HA Linux? RAC? Other?)
  • Network (are servers connected to separate network switches?)
  • Middleware (can the middleware transparently handle the loss of one or more servers?)
  • Impact (what is the impact on other services and / or users of a loss / degradation of service?)
  • Quiesce / recovery (can the service be cleanly paused? Is there built-in recovery? (e.g. buffers) What length of interruption?)
  • Tested (have interventions been made transparently using the above features?)
  • Documented (operations procedures, service information)

Specifics for the CERN-PROD FTS service

This review covers the Tier-0 export service.

Server layout

The server layout for CERN-PROD is described in FtsTier0Deployment.

The general FTS deployment suggestion is described in FtsServiceDeploymentModel.

Power supply redundancy

Power supply (redundant including power feed? Critical?)

  • Unknown at the node level.
  • Unknown at the rack level.

Servers

Servers (single or multiple? DNS load-balanced? HA Linux? RAC? Other?)

  • The web-services are DNS load-balanced over 3 nodes. Service monitoring checks each node every minute and will drop a problematic node out of DNS load-balancing.
  • The channel agent daemons are balanced over 3 nodes. There is no redundancy. The agents are monitored, and the standard (automatic) procedure is to restart them if a problem is detected (this is backed by a more detailed procedure in OPM). The service is well partitioned, so one channel agent being unavailable does not affect the transfers on any other channel. In case of node failure, the spare node should be brought up in its place (a manual procedure).
  • The VO agent daemons all run on one node. There is no redundancy, although the node is monitored in the same way as the channel agent nodes. In case of node failure, the spare node should be brought up in its place (a manual procedure).
  • The monitoring server is on a single node, but is not critical to service operation.
  • The Oracle database uses the LCG Oracle RAC.

Refer to FtsServiceReview20 for more details on what happens if different parts of the service go down.
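As an illustration of the monitoring behaviour described above, the following sketch shows how a per-node health check can drop a failing node from a DNS load-balanced pool. The node names and the probe are hypothetical; in the real service these checks are performed by the service monitoring every minute, not by code like this.

```python
# Hypothetical sketch of the DNS load-balancing health check described above:
# each web-service node is probed periodically and unhealthy nodes are
# removed from the set of addresses the load-balanced alias resolves to.

def update_dns_pool(nodes, is_healthy):
    """Return the subset of nodes that should stay in DNS rotation."""
    return [n for n in nodes if is_healthy(n)]

web_nodes = ["fts-ws-01", "fts-ws-02", "fts-ws-03"]  # hypothetical names

# Simulate one node failing its health probe.
failed = {"fts-ws-02"}
pool = update_dns_pool(web_nodes, lambda n: n not in failed)
print(pool)  # the problematic node is dropped; the alias keeps resolving
```

Because the web-service is stateless across nodes, dropping one node from the alias is transparent to clients, which simply resolve to one of the remaining nodes.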

Network

Network (are servers connected to separate network switches?)

  • The service daemons are randomly distributed over the hardware, which is connected to a number of switches (bad).
  • We should re-distribute them for higher resilience to internal switch failure, particularly the web-service nodes. See FtsServiceReview20 for more details.
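A quick way to spot the switch-level single points of failure mentioned above is to count how many critical nodes share each switch. The node-to-switch mapping below is invented for illustration; the real topology would come from the network database.

```python
# Flag any switch hosting more than one web-service node: a failure of that
# switch would take out multiple DNS-balanced endpoints at once.
# The mapping here is hypothetical.
from collections import Counter

node_switch = {
    "fts-ws-01": "sw-a",
    "fts-ws-02": "sw-a",  # shares a switch with fts-ws-01 (bad)
    "fts-ws-03": "sw-b",
}

counts = Counter(node_switch.values())
at_risk = [sw for sw, c in counts.items() if c > 1]
print(at_risk)  # switches whose failure affects more than one node
```

Redistributing so that no two web-service nodes share a switch makes the check return an empty list.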

Middleware

Middleware (can the middleware transparently handle the loss of one or more servers?)

  • Yes, it can. The service components are well de-coupled from each other, so each will keep running even if another part of the service is down. See FtsServiceReview20 for more details.

Recovery

Quiesce / recovery (can the service be cleanly paused? Is there built-in recovery? (e.g. buffers) What length of interruption?)

  • In case of SRM failure, a channel can (and should) be paused (set Inactive) to avoid fruitlessly attempting transfers.
  • In case of internal component failure, generally no state is lost. See FtsServiceReview20 'Recovery' notes.
  • There are procedures (automatic, triggered by the LAS monitoring and backed by manual procedures in OPM) to attempt to 'restart' misbehaving servers.
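The quiesce behaviour above can be sketched as a toy model: setting a channel Inactive stops new transfer attempts without losing any queued state, so transfers resume once the channel is set Active again. This is illustrative Python, not the FTS implementation; in practice the channel state is changed with the FTS administrative tools.

```python
# Toy model of pausing a channel: an Inactive channel attempts no transfers,
# but its queue is preserved, so nothing is lost across the pause.

class Channel:
    def __init__(self, name):
        self.name = name
        self.state = "Active"
        self.queue = []

    def submit(self, job):
        self.queue.append(job)          # submissions are always accepted

    def next_transfer(self):
        if self.state != "Active":      # paused: attempt nothing
            return None
        return self.queue.pop(0) if self.queue else None

ch = Channel("CERN-RAL")                # hypothetical channel name
ch.submit("job-1")
ch.state = "Inactive"                   # e.g. the destination SRM is down
assert ch.next_transfer() is None       # no fruitless transfer attempts
ch.state = "Active"                     # SRM recovered, channel re-enabled
assert ch.next_transfer() == "job-1"    # queued state survived the pause
```

The important property is the last assertion: pausing the channel only defers work, it does not discard it, matching the "no state is lost" recovery notes above.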

'Transparent' interventions

Tested (have interventions been made transparently using the above features?)

Documentation

Documented (operations procedures, service information)

-- GavinMcCance - 18 Jul 2007
