Procedure in case of problems with FTS
Important:
Remember to call the operators (75011) before rebooting any machine, so that they (temporarily) ignore the alarms and don't start a parallel intervention.
Lemon monitoring
Note that the two FTS daemons are monitored by Lemon: if either FTS daemon is down for any reason, it will be restarted automatically. After 3 unsuccessful restarts, an alarm is raised and the operators call the support phones.
You can check the list of alarms in LemonAlarmList.
TODO: Lemon monitoring is not yet written, although there is a primitive watchdog script
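The restart policy described above (restart automatically, raise an alarm after 3 unsuccessful restarts) can be sketched in shell as follows. This is only an illustration of the logic: it is neither the Lemon sensor nor the existing watchdog script, and the function name watch_daemon, the MAX_RESTARTS variable and the message texts are hypothetical. The check and restart commands are passed in as arguments because the source does not say how the daemons are checked.

```shell
#!/bin/sh
# Hypothetical sketch of a restart-with-limit watchdog.
# Not the actual Lemon sensor or the existing watchdog script.

MAX_RESTARTS=3   # from the text: alarm after 3 unsuccessful restarts

# watch_daemon CHECK_CMD RESTART_CMD
# Returns 0 if the daemon is up (or comes back after a restart),
# 1 if it is still down after MAX_RESTARTS restart attempts.
watch_daemon() {
    check_cmd=$1
    restart_cmd=$2
    attempt=0

    # Nothing to do if the daemon is already up.
    if sh -c "$check_cmd" >/dev/null 2>&1; then
        return 0
    fi

    while [ "$attempt" -lt "$MAX_RESTARTS" ]; do
        attempt=$((attempt + 1))
        echo "daemon down, restart attempt $attempt of $MAX_RESTARTS" >&2
        # The restart command's own exit status is ignored here;
        # only the follow-up check decides whether it worked.
        sh -c "$restart_cmd" >/dev/null 2>&1 || true
        if sh -c "$check_cmd" >/dev/null 2>&1; then
            return 0
        fi
    done

    # Stand-in for raising the alarm: just report it and fail.
    echo "ALARM: daemon still down after $MAX_RESTARTS restarts" >&2
    return 1
}
```

A call would pass a site-specific check command as the first argument and the restart command (for example service gLite start) as the second. A non-zero return corresponds to the point at which the alarm would be raised.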
This page describes some of the problems that may occur, together with possible solutions. It is not meant to be exhaustive and is updated constantly. If you can solve the problem, then:
- Log your action in the intervention log
- Email the Second Level Support mailing list: hep-service-sc-level2@cern.ch
If you cannot solve the problem, try to use the backup FTS host:
- Figure out which host is the backup: /afs/cern.ch/project/gd/SC3/machines_list.txt
- Stop the service on the bad node: service gLite stop. To be sure, kill -9 any processes owned by users sc3 and tomcat4.
- Start the service on the new node: service gLite start.
- Run the FTSSmokeTest and make sure this host is working correctly
- If the backup is working:
If the backup is not working, then you are experiencing a more general problem:
- Think of general problems such as network connections, firewalls, the connection to the database, etc., and look at the appropriate log files (on both the current service machine and the backup) for hints (/var/log/glite/glite-transfer-agent-*.log and /usr/share/tomcat/logs/org.glite.data are the best to look in). The FTSSmokeTest is continually being improved with tests that try to identify problems in any of the services it depends on.
- If you are still having problems
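For the log check mentioned above, a small helper such as the following may save some time. It is a minimal sketch, not part of the FTS tooling: the function name recent_errors and the keywords it searches for (error, fatal, exception) are assumptions, and the actual log format may call for different patterns.

```shell
#!/bin/sh
# Hypothetical helper: show the most recent error-like lines of each log file.
# The keyword list is an assumption, not taken from the FTS documentation.

# recent_errors COUNT FILE...
# For every readable FILE, print a header line followed by the last
# COUNT lines that match one of the keywords (case-insensitive).
recent_errors() {
    count=$1
    shift
    for f in "$@"; do
        # Skip files that do not exist or cannot be read.
        [ -r "$f" ] || continue
        echo "== $f"
        # grep exits non-zero when nothing matches; that is not an error here.
        grep -E -i 'error|fatal|exception' "$f" | tail -n "$count" || true
    done
}
```

For example, recent_errors 20 /var/log/glite/glite-transfer-agent-*.log prints up to 20 matching lines per transfer agent log. Running it on both the current service machine and the backup makes the two easier to compare.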
-- GavinMcCance - 07 Jul 2005