Moving the FTA agents to another box

What is it?

Use this procedure when you want to shut down a box for maintenance. It also serves as the manual failover procedure.

It describes how to move the agents from that machine to the backup machine so that no job state is lost, whilst minimising agent downtime.

The procedure for an emergency move is also described.

Service impact

There is some state held on the server file system - the state of transfers which are currently running. The channel agent is responsible for sweeping up the state of finished or failed transfer attempts into the database. Consequently, you have to wait until all active transfers are finished (i.e. have either completed or failed) on a channel before moving the channel agent to another machine.

The average time for an SC4 job with 1 gigabyte files is around 3-4 minutes. Consequently, draining a server of active jobs leads to a downtime of around 5 minutes for the given channel, during which no new transfers will run on the network. As soon as the agent is stopped, it may be started on the backup machine, where it will immediately start serving jobs again.

The emergency move procedure incurs no significant downtime, but all currently running transfers will be marked as failed (the transfers will subsequently be retried according to the given VO's retry policy).

Procedure for a full clean move

This procedure is for moving all agents on one node to the backup node. It is not optimal, since it waits until all daemons are finished before moving them, rather than moving them one by one.

1. The procedure assumes you have configured the agents to run on the backup machine. It also assumes you have reconfigured the primary box to remove all the agent configuration files.

2. Find which channels are running on the box.

ps aux | grep glite-transfer-channel-agent

3. Set these channels Inactive, so that new transfers will not be started. The downtime starts here. e.g.

for i in `ps aux | grep glite-transfer-channel-agent | grep edguser | \
  awk '{print $11}' | sed 's/glite-transfer-channel-agent-urlcopy-//g'` ; \
do 
   glite-transfer-channel-set -S Inactive $i ; \
done

4. Wait until there are no jobs running. i.e. grep the process table for processes of the form CHANNEL-NAME__*

ps aux | grep CERN-CERN__

5. Stop the agents.

service transfer-agents stop

6. Move to the backup machine. Start the agents:

service transfer-agents start

7. Set all the channels Active again. The downtime ends here.

for i in `ps aux | grep glite-transfer-channel-agent | grep edguser | \
  awk '{print $11}' | sed 's/glite-transfer-channel-agent-urlcopy-//g'` ; \
do 
   glite-transfer-channel-set -S Active $i ; \
done
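
Step 4 is a manual check that has to be repeated until the process list is empty. A minimal sketch of automating that wait is shown below, assuming (as step 4 states) that running transfers appear in the process table as CHANNEL-NAME__*. The names count_transfers, wait_for_drain, PS_CMD and POLL_SECONDS are illustrative and not part of the FTS tooling.

```shell
# Hypothetical helper: block until a channel has no running transfer processes.
# Assumption: transfer processes show up in "ps aux" as CHANNEL-NAME__*.

# Command used to list processes; overridable, defaults to "ps aux".
PS_CMD=${PS_CMD:-"ps aux"}
# Seconds to sleep between checks (an arbitrary choice).
POLL_SECONDS=${POLL_SECONDS:-10}

count_transfers() {
    # $1 = channel name, e.g. CERN-CERN
    # The [_] bracket keeps grep from matching its own command line.
    # grep exits non-zero when the count is 0, hence "|| true".
    $PS_CMD | grep -c -- "${1}[_]_" || true
}

wait_for_drain() {
    channel=$1
    while [ "$(count_transfers "$channel")" -gt 0 ]; do
        echo "Transfers still running on $channel, waiting..."
        sleep "$POLL_SECONDS"
    done
    echo "Channel $channel has no running transfers"
}
```

For example, running `wait_for_drain CERN-CERN` after step 3 returns once the channel has drained, at which point step 5 can proceed.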

Procedure for a partial or staged clean move

This is the procedure to move a single agent to another machine. It may be iterated to perform a staged move of all agents to another machine. Doing this minimises the service downtime, but takes a lot longer to perform.

1. The procedure assumes you have configured the agent to run on the backup machine. It also assumes you have reconfigured the primary box to remove the agent configuration file.

2. Identify the agent you want to move.

ps aux | grep glite-transfer-channel-agent-urlcopy-CERN-CERN

3. Set the associated channel Inactive. The agent downtime starts here.

glite-transfer-channel-set -S Inactive CERN-CERN

4. Wait until all the transfer processes for this agent have finished. e.g.

ps aux | grep CERN-CERN__

5. Stop the agent (the instance name is the full process name).

service transfer-agents stop --instance glite-transfer-channel-agent-urlcopy-CERN-CERN

6. Move to the backup machine and start the new agent.

service transfer-agents start --instance glite-transfer-channel-agent-urlcopy-CERN-CERN

7. Set the channel Active again. The downtime stops here.

glite-transfer-channel-set -S Active CERN-CERN
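
When the staged move is iterated over many channels, the primary-machine half (steps 3 to 5) can be wrapped in a small function. The following is only a sketch: the function names run and drain_and_stop, the DRY_RUN switch and the 10 second poll interval are assumptions made for illustration, while the commands themselves are the ones used in the steps above.

```shell
# Hypothetical wrapper for steps 3 to 5 of a single-channel move.
# Set DRY_RUN=1 to print the commands instead of executing them.

# The instance name is the full process name: this prefix plus the channel.
AGENT_PREFIX=glite-transfer-channel-agent-urlcopy-

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

drain_and_stop() {
    channel=$1
    # Step 3: stop new transfers from being started on this channel.
    run glite-transfer-channel-set -S Inactive "$channel" || return 1
    # Step 4: wait for CHANNEL__* processes to finish (skipped in dry-run mode).
    # The [_] bracket keeps grep from matching its own command line.
    while [ "${DRY_RUN:-0}" != "1" ] && ps aux | grep -q -- "${channel}[_]_"; do
        sleep 10
    done
    # Step 5: stop only this channel's agent.
    run service transfer-agents stop --instance "${AGENT_PREFIX}${channel}"
}
```

Calling `DRY_RUN=1 drain_and_stop CERN-CERN` first shows what would be executed without touching the channel. Steps 6 and 7 still have to be carried out on the backup machine.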

Procedure for an emergency move

This is the procedure for an emergency move to another machine.

In all cases below, the procedure assumes that the backup machine is configured with the agents ready to run.

If you can log onto the problematic machine

1. Write the agent process names to the FTA disabled file to prevent them from being restarted should the box reboot:

ps aux | grep glite-transfer-channel-agent | grep edguser | awk '{print $11}' > /etc/glite-data-transfer-agents.disabled

2. Stop the agents:

service transfer-agents stop

3. Start the agents on the backup machine.

service transfer-agents start
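
The pipeline in step 1 overwrites the disabled file even if it finds no agents. A slightly more defensive variant could look like the sketch below. It assumes, like the original command, that the agents run as edguser and that the process name is the 11th column of `ps aux`; the function names agent_names and write_disabled_file and the AGENT_USER variable are hypothetical.

```shell
# Hypothetical variant of step 1: list the agent process names and only
# write the disabled file when at least one agent was found.

agent_names() {
    # Reads "ps aux" style output on stdin and prints the command column
    # for agent processes owned by the agent user (default: edguser).
    awk -v user="${AGENT_USER:-edguser}" \
        '$1 == user && $11 ~ /glite-transfer-channel-agent/ { print $11 }'
}

write_disabled_file() {
    # $1 = target file; defaults to the path used in step 1.
    target=${1:-/etc/glite-data-transfer-agents.disabled}
    names=$(ps aux | agent_names)
    if [ -z "$names" ]; then
        echo "No agent processes found; leaving $target untouched" >&2
        return 1
    fi
    printf '%s\n' "$names" > "$target"
}
```

Matching on the user column in awk also avoids picking up the grep process itself, which the plain `grep` pipeline can do when it is run as edguser.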

If you cannot log onto the machine, but can shut down the machine (e.g. from the console).

1. Shut down the machine.

2. Start the agents on the backup machine:

service transfer-agents start

3. Be aware that when you reboot the problematic machine, the agents will attempt to restart but will fail since they will detect that the same agent is already running from a different machine.

If you cannot log onto the node and you cannot or do not want to shut down the machine

The critical thing is to get the running agents to drop the DB lock, since you will not be able to start the same agent on a different machine until this has happened (the lock is present in order to prevent DB corruption).

1. Log onto the DB session manager and kill all sessions from the problematic machine - see FtsProcedure15DropDBLock.

2. Start the agents on the backup machine.

service transfer-agents start

If the machine is not running

1. Start the agents on the backup machine.

service transfer-agents start

If an agent daemon does not start within a few seconds, follow the procedure above for "If you cannot log onto the node and you cannot or do not want to shut down the machine".
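
One way to check whether an agent daemon has come up on the backup machine is to poll the process table for a few seconds. This is a sketch only: it assumes that a daemon which fails to obtain the DB lock does not stay in the process table, and that the agent appears in column 11 of `ps aux` under its full process name. The names agent_running, wait_for_agent, PS_CMD and POLL_SECONDS are illustrative.

```shell
# Hypothetical check: has the named agent appeared in the process table?

agent_running() {
    # $1 = full agent process name,
    #      e.g. glite-transfer-channel-agent-urlcopy-CERN-CERN
    # PS_CMD is overridable and defaults to "ps aux".
    ${PS_CMD:-ps aux} | awk -v name="$1" \
        '$11 == name { found = 1 } END { exit !found }'
}

wait_for_agent() {
    name=$1
    tries=${2:-10}    # number of checks before giving up
    while [ "$tries" -gt 0 ]; do
        agent_running "$name" && return 0
        sleep "${POLL_SECONDS:-1}"
        tries=$((tries - 1))
    done
    return 1
}
```

For example, `wait_for_agent glite-transfer-channel-agent-urlcopy-CERN-CERN || echo "agent did not start"` run after `service transfer-agents start` flags an agent that has not appeared after about ten seconds.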


Last edit: GavinMcCance on 2006-04-05 - 13:33

Maintainer: GavinMcCance

