Last Page Update
GavinMcCance
2008-09-16

Moving the FTA agents to another box for Release 2.0

What is it?

Use this procedure when you want to shut down a box for maintenance. It is also the manual failover procedure.

This is the manual procedure that describes how to move the agents from that machine to the backup machine, such that no job state is lost, whilst minimising the agent downtime.

The procedure for an emergency move is also described.

Service impact

Some state is held on the server file system: the state of transfers which are currently running. The channel agent is responsible for sweeping up the state of finished or failed transfer attempts into the database. Consequently, you have to wait until all active transfers on a channel are finished (i.e. have either completed or failed) before moving the channel agent to another machine.

The average time for an SC4 job with 1 gigabyte files is around 3-4 minutes. Consequently, draining a server of active jobs leads to a downtime of around 5 minutes for the given channel, during which no new transfers will run on the network. As soon as the agent is stopped, it may be started on the backup machine, and will immediately start serving jobs again.

The emergency move procedure incurs no significant downtime, but all currently running transfers will be marked as failed (the transfers will subsequently be retried according to the given VO's retry policy).

See FtsServiceReview20 for more details on the impact of unscheduled interventions.

Procedure for a full clean move

This procedure is for moving all agents on one node to the backup node. It is not optimal, since it waits until all daemons are finished before moving them, rather than moving them one by one.

1. The procedure assumes you have configured the agents to run on the backup machine. It also assumes you have reconfigured the primary box to remove all the agent configuration files.

2. Find which channels are running on the box.

ps aux | grep glite-transfer-channel-agent

3. Set these channels Inactive, so that new transfers will not be started. The downtime starts here. e.g.

for i in `ps aux | grep glite-transfer-channel-agent | grep edguser | \
  awk '{print $11}' | sed 's/glite-transfer-channel-agent-urlcopy-//g'` ; \
do 
   glite-transfer-channel-set -S Inactive $i ; \
done

4. Wait until there are no jobs running, i.e. grep the process table for processes of the form CHANNEL-NAME__* until none remain:

ps aux | grep CERN-CERN__

5. Disable the agents such that they will not start on the next reboot:

ps aux | grep glite-transfer-channel-agent | grep edguser | awk '{print $11}' > /etc/glite-data-transfer-agents.disabled

6. Stop the agents:

service transfer-agents stop

7. Move to the backup machine. Check that the file /etc/glite-data-transfer-agents.disabled does not exist and start the agents:

rm -f /etc/glite-data-transfer-agents.disabled
service transfer-agents start

8. Set all the channels Active again. The downtime ends here.

for i in `ps aux | grep glite-transfer-channel-agent | grep edguser | \
  awk '{print $11}' | sed 's/glite-transfer-channel-agent-urlcopy-//g'` ; \
do 
   glite-transfer-channel-set -S Active $i ; \
done
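Step 4 above can be scripted rather than polled by hand. The following is a minimal sketch, not part of the FTS tooling: the function names, the list_processes hook and the POLL_INTERVAL variable are all hypothetical, and it assumes (as the procedure does) that transfer processes show up in the process table with names of the form CHANNEL-NAME__*.

```shell
# Hypothetical hook that prints the process table; override it to try the
# helpers without a live box.
list_processes() {
    ps aux
}

# Strip the agent prefix from process names on stdin, leaving channel names
# (same sed expression as in the loops above).
agent_to_channel() {
    sed 's/glite-transfer-channel-agent-urlcopy-//g'
}

# Count running transfer processes for one channel (names like CHANNEL__*).
# grep -v grep drops the grep processes themselves from the count.
count_transfers() {
    list_processes | grep -v grep | grep -c -F -- "${1}__"
}

# Block until none of the given channels has a running transfer process.
# POLL_INTERVAL (seconds) is an assumed tunable, defaulting to 10.
wait_for_drain() {
    for channel in "$@"; do
        while [ "$(count_transfers "$channel")" -gt 0 ]; do
            sleep "${POLL_INTERVAL:-10}"
        done
    done
}
```

For example, after setting the channels Inactive, something like `wait_for_drain CERN-CERN` would return once no CERN-CERN__* processes remain, at which point steps 5 and 6 can proceed.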

Procedure for a partial or staged clean move

This is the procedure to move a single agent to another machine. It may be iterated to perform a staged move of all agents to another machine - doing this minimises the service downtime, but takes a lot longer to perform.

1. The procedure assumes you have configured the agent to run on the backup machine. It also assumes you have reconfigured the primary box to remove the agent configuration file.

2. Identify the agent you want to move.

ps aux | grep glite-transfer-channel-agent-urlcopy-CERN-CERN

3. Set the associated channel Inactive. The agent downtime starts here.

glite-transfer-channel-set -S Inactive CERN-CERN

4. Wait until all the transfer processes for this agent have finished. e.g.

ps aux | grep CERN-CERN__

5. Disable this agent by adding its name to the /etc/glite-data-transfer-agents.disabled file:

echo "glite-transfer-channel-agent-urlcopy-CERN-CERN" >> /etc/glite-data-transfer-agents.disabled

6. Stop the agent (the instance name is the full process name).

service transfer-agents stop --instance glite-transfer-channel-agent-urlcopy-CERN-CERN

7. Move to the backup machine. Check that the file /etc/glite-data-transfer-agents.disabled (if it exists) does not contain the name of the agent you are about to start, then start the new agent.

grep glite-transfer-channel-agent-urlcopy-CERN-CERN /etc/glite-data-transfer-agents.disabled
service transfer-agents start --instance glite-transfer-channel-agent-urlcopy-CERN-CERN

8. Set the channel Active again. The downtime stops here.

glite-transfer-channel-set -S Active CERN-CERN
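When the staged move is iterated over several agents, the disabled file is edited repeatedly (step 5) and checked on the backup machine (step 7). The helpers below are a hedged sketch of how that bookkeeping might be wrapped: the function names and the DISABLED_FILE override are hypothetical, and the sketch assumes the file holds one agent name per line, which is what the ps/awk commands in this page produce.

```shell
# Path from the procedure above; overridable so the helpers can be tried
# on a scratch file first.
DISABLED_FILE="${DISABLED_FILE:-/etc/glite-data-transfer-agents.disabled}"

# Succeeds if the exact agent name is listed in the disabled file.
# -x matches whole lines only, -F treats the name as a fixed string.
agent_is_disabled() {
    [ -f "$DISABLED_FILE" ] && grep -q -x -F -- "$1" "$DISABLED_FILE"
}

# Append the agent name unless it is already listed, so that entries
# written for previously moved agents are preserved.
disable_agent() {
    agent_is_disabled "$1" || echo "$1" >> "$DISABLED_FILE"
}
```

On the backup machine, `agent_is_disabled glite-transfer-channel-agent-urlcopy-CERN-CERN` returning success would indicate that the name still has to be removed from the file before the agent is started.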

Procedure for an emergency move

This is the procedure for an emergency move to another machine.

In all cases below, the procedure assumes that the backup machine is configured with the agents ready to run.

If you can log onto the problematic machine

1. Write the agent process names to the FTA disabled file to prevent them from being restarted should the box reboot:

ps aux | grep glite-transfer-channel-agent | grep edguser | awk '{print $11}' > /etc/glite-data-transfer-agents.disabled

2. Stop the agents:

service transfer-agents stop

3. Start the agents on the backup machine.

service transfer-agents start

If you cannot log onto the machine, but it is still up

The critical thing is to get the running agents to drop the DB locks, since you will not be able to start the same agent on a different machine until this has happened (the lock is present in order to prevent DB corruption).

1. Attempt to start the agents on the backup machine.

service transfer-agents start

The startup script will block for 1 minute attempting to take the lock from the agent running on the primary machine.

2. You should help it along - kill all the DB sessions from the primary machine, with particular emphasis upon the agent that is currently trying to start.

To do this, log onto the DB session manager - see FtsProcedureDropDBLock20. There is only one DB session per agent. Soon after you kill the DB session from the primary machine, the agent should start OK on the backup.

3. Attempt to cleanly stop the agent on the primary machine as soon as possible. It will continue to attempt to re-establish the connection to the DB while it is still running.
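Because the start attempt in step 1 may time out before the DB session has been killed, it can be convenient to repeat it automatically. The following is only a sketch: the retry function is hypothetical, and it assumes the startup script returns a non-zero exit status when it fails to take the lock, which this page does not confirm.

```shell
# Run a command up to MAX times, pausing DELAY seconds between attempts.
# Usage: retry MAX DELAY command [args...]
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        "$@" && return 0
        attempt=$((attempt + 1))
        # Only sleep if another attempt will follow.
        [ "$attempt" -le "$max" ] && sleep "$delay"
    done
    return 1
}
```

Under that assumption, `retry 5 10 service transfer-agents start` on the backup machine would keep trying while you kill the DB sessions from the primary machine.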

If the machine is not up

1. Start the agents on the backup machine.

service transfer-agents start

If an agent daemon does not start within a few seconds, follow the procedure above for "If you cannot log onto the machine, but it is still up".


Maintainer: GavinMcCance


Topic revision: r4 - 2008-09-16 - GavinMcCance
 