FTS Third Level procedure
The following describes the 3rd level procedure for checking the FTS nodes. It contains more detailed procedures for checking and fixing problems, along with general debugging hints, and is intended for 3rd level support staff.
It should be followed mostly in sequence.
The code examples below are designed, as much as possible, to be cut and pasted into the terminal window. Please report any errors you see.
Whenever action is needed, the general procedure is:
1) Get a clear picture of the current situation from 2nd level support
Ask the following questions:
- Did the backup FTS host get switched on and the DNS flipped?
- Is the problem with the primary machine, the backup machine or visible on both?
- Has any other action been taken so far (server reboots, daemon restarts, config file changes, etc.)?
- Did the smoke test reveal anything unusual in any of the logfiles (ORA errors, Java stack traces, etc.) for which the 2nd level support procedures indicate no action?
- Anything else that looks out of place or behaves differently from what you've seen before?
2) Be VERY sure that you are not running daemons on both the primary and backup FTS servers.
Check the daemons on both the primary and backup FTS servers. There should only be daemons running on one of the servers.
If daemons are running on both:
- Kill the daemons on the primary machine.
- Restart the daemons on the backup machine.
- Attempt to fix any problems on the backup.
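The check in this step can be sketched as a few shell functions. This is a minimal, hypothetical helper, not part of the FTS distribution. It assumes root can ssh to both servers without a password and that the FTS daemons can be recognised by the string `glite-transfer` in their command line. The host names `fts-primary.example.org` and `fts-backup.example.org` and that pattern are placeholders: set `PRIMARY`, `BACKUP` and `PATTERN` to the real values before use.

```shell
# Hypothetical helper: count FTS daemons on the primary and backup hosts.
# ASSUMPTIONS: passwordless root ssh to both hosts, and the daemons'
# command lines contain $PATTERN. Host names are placeholders.
PRIMARY=${PRIMARY:-fts-primary.example.org}
BACKUP=${BACKUP:-fts-backup.example.org}
PATTERN=${PATTERN:-glite-transfer}

# Read one process command line per line on stdin and print how many match.
count_daemons() {
    grep -c -- "$PATTERN" || true
}

# Print the number of matching processes on host $1. Fail if ssh fails,
# so an unreachable host is not mistaken for "no daemons running".
remote_count() {
    procs=$(ssh -o BatchMode=yes "root@$1" 'ps -eo pid,args') || {
        echo "ERROR: cannot list processes on $1" >&2
        return 1
    }
    printf '%s\n' "$procs" | count_daemons
}

# Report both hosts and return non-zero if daemons run on both.
check_both_hosts() {
    p=$(remote_count "$PRIMARY") || return 1
    b=$(remote_count "$BACKUP") || return 1
    echo "$PRIMARY: $p matching process(es)"
    echo "$BACKUP: $b matching process(es)"
    if [ "$p" -gt 0 ] && [ "$b" -gt 0 ]; then
        echo "WARNING: daemons are running on BOTH hosts" >&2
        return 1
    fi
}
```

To use it, paste the functions into a shell on a machine that can reach both servers and run `check_both_hosts`. If it prints the WARNING line and exits with a non-zero status, follow the kill and restart steps above.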
3) Repeat the FTS smoke test
Run the FTSSmokeTest if you haven't already. You may see something along the way that was missed the first time.
In the morning, check whether the SecondLevelProcedure or FTSSmokeTest can be updated to clarify what to look out for.
4) Search for weird things
Bad software versions, bad usernames, bad certificates, etc.
Log into the box as root, then download and run the fts-check.sh script at the bottom of the FtsServerInstall112 page:
wget https://twiki.cern.ch/twiki/pub/LCG/FtsServerInstall/fts-check.sh
chmod u+x fts-check.sh
./fts-check.sh
TODO: make a proper suite of these diagnosis scripts and deploy them.
5) Log into the box and prod around
You're the expert. If you're not, call someone who is.
This section will be expanded as things occur to me. At the moment, it relies on the software developers' knowledge of the software.
Make sure that any fixes you find are added back into the FTSSmokeTestAndActions or FTSThirdLevelProcedure pages.
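While prodding around, you can quickly look for the kinds of problem the smoke test is meant to catch (ORA errors, Java stack traces) by grepping the logs. The sketch below is hypothetical. The default directory `/var/log/glite` is an assumption, so pass the directory where your FTS services actually write their logs. It also assumes a grep that supports `-r` (GNU grep does).

```shell
# Hypothetical log scan for Oracle errors and Java stack traces.
# ASSUMPTION: /var/log/glite is a placeholder; pass the real log directory.

# Print file:line:text for every line that looks like an Oracle error
# (ORA-nnnnn), mentions an Exception, or is a Java stack frame ("at x.y(").
scan_logs() {
    grep -rnE -- 'ORA-[0-9]{5}|Exception|^[[:space:]]+at [A-Za-z0-9_.$]+\(' \
        "${1:-/var/log/glite}"
}

# Count suspicious lines per file, most affected file first.
summarise_logs() {
    scan_logs "$1" | cut -d: -f1 | sort | uniq -c | sort -rn
}
```

Start with `summarise_logs /path/to/fts/logs` to see which files are affected, then read the matches themselves with `scan_logs /path/to/fts/logs | less`.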
6) Repeat the FTS clean procedure
This will give you a clean environment to start debugging.
Follow the FtsMeanAndClean15 procedure.
Useful notes
Log into the DB server FTS account and look around
You may need access to the DB schema to check for corruption or other problems. Look in the file:
/opt/glite/etc/config/glite-file-transfer-service-oracle.cfg.xml
The DB* attributes at the top give you the DB hostname, DB service name, username and password. You can use these to log into the DB server:
sqlplus username/password@dbhostname/dbservicename
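Reading the config file and logging into the DB can be sketched as two shell helpers. They are hypothetical conveniences, not part of FTS. `show_db_settings` simply greps the config file for lines containing `DB`, without assuming anything about its XML layout. `connect_string` builds the connect string from the sqlplus command above, but without the password. When the password is omitted, sqlplus prompts for it, which keeps it out of `ps` output and the shell history.

```shell
# Hypothetical helpers for reaching the FTS DB schema.
CFG=/opt/glite/etc/config/glite-file-transfer-service-oracle.cfg.xml

# Show (with line numbers) every line of the config file mentioning DB,
# which includes the DB* attributes near the top.
show_db_settings() {
    grep -n 'DB' "${1:-$CFG}"
}

# Build an Easy Connect string user@host/service (no password), e.g. for
#   sqlplus "$(connect_string username dbhostname dbservicename)"
connect_string() {
    printf '%s@%s/%s\n' "$1" "$2" "$3"
}
```

Once you are connected, a generic Oracle check such as `SELECT object_name, object_type FROM user_objects WHERE status = 'INVALID';` lists schema objects that Oracle marks as invalid. This is a standard data dictionary query, not an FTS-specific diagnostic, so an empty result is only a first sanity check.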
--
GavinMcCance - 14 Jul 2005