What to log when there is a problem with the service
Whenever there is a problem with the LFCs that needs to be understood, you should log some data.
The first thing to do is to check whether a process is fully occupying one of the CPUs. For a first indication you can look at the load average reported by the command:
uptime
If there is, and nothing is being written to /var/log/lfc/log, then the service is stuck and you should collect and report the logs.
Start by creating a directory named after the date (YYMMDD) and put the following files in it:
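The check above can be sketched as a small script. This is only an illustration: the 300 second threshold and the helper names log_age and is_stale are assumptions, not part of the LFC tools, and stat -c %Y assumes GNU stat on Linux.

```shell
#!/bin/sh
# Sketch: print the load average and report whether the LFC log is still being written.
# MAX_AGE is an illustrative threshold, not a value defined by the service.
LOG=${LOG:-/var/log/lfc/log}
MAX_AGE=${MAX_AGE:-300}

log_age() {
    # Print the number of seconds since the file was last modified (GNU stat).
    now=$(date +%s)
    mtime=$(stat -c %Y "$1") || return 1
    echo $((now - mtime))
}

is_stale() {
    # Succeed when the file has not been modified for more than $2 seconds.
    age=$(log_age "$1") || return 2
    [ "$age" -gt "$2" ]
}

uptime
if is_stale "$LOG" "$MAX_AGE"; then
    echo "no log activity for more than ${MAX_AGE}s: the service may be stuck"
fi
```

A high load average together with a stale log is the situation described above; either one alone is not conclusive.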
- Log the last lines of the LFC log file with:
tail -999 /var/log/lfc/log > log-YYMMDD.txt
- Log all the processes running with:
ps auxwwwfm > ps-YYMMDD.txt
- Get the lfcdaemon process number with:
ps auxwww | grep lfc
and log the list of open files with:
lsof -p lfcdaemon_process_number > lsof-YYMMDD.txt
- Log the processes using the most CPU:
top -b -n 1 > top-YYMMDD.txt
- Log the network connections with netstat:
netstat -a > netstat-YYMMDD.txt
- Obtain a core image with:
gcore lfcdaemon_process_number
- Log a list of the mapped memory regions via:
cat /proc/lfcdaemon_process_number/maps > maps-YYMMDD.txt
- Log gdb output:
gdb -p lfcdaemon_process_number
(gdb) info threads
(gdb) set logging file gdb-YYMMDD.txt
(gdb) set logging on
Call the macro that dumps the stack of all threads (42 is the number of threads to log/check):
(gdb) allt 42
(gdb) quit
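The steps in the list above can be gathered into one script, sketched here under a few assumptions: the daemon's process name is exactly lfcdaemon, pgrep is installed, and gdb is run in batch mode with the standard "thread apply all bt" command as a stand-in for the allt macro. The collect helper and the OUTDIR variable are hypothetical names used only for this example.

```shell
#!/bin/sh
# Sketch: collect the diagnostics into a directory named after the date.
# Assumes the process name is exactly "lfcdaemon" and that pgrep is available.

stamp=$(date +%y%m%d)            # YYMMDD, as in the file names above
outdir=${OUTDIR:-$stamp}

collect() {
    # collect NAME COMMAND...: run COMMAND and save its output as NAME-YYMMDD.txt.
    name=$1; shift
    "$@" > "$outdir/$name-$stamp.txt" 2>&1 || echo "warning: $name failed" >&2
}

mkdir -p "$outdir"
collect log     tail -999 /var/log/lfc/log
collect ps      ps auxwwwfm
collect top     top -b -n 1
collect netstat netstat -a

pid=$(pgrep -o -x lfcdaemon || true)   # oldest process with that exact name
if [ -n "$pid" ]; then
    collect lsof lsof -p "$pid"
    collect maps cat "/proc/$pid/maps"
    # gcore normally writes core.<pid> into the current directory.
    (cd "$outdir" && gcore "$pid")
    # Batch-mode stand-in for the interactive session; "thread apply all bt"
    # replaces the allt macro, which is not a built-in gdb command.
    collect gdb gdb -p "$pid" -batch -ex "info threads" -ex "thread apply all bt"
else
    echo "lfcdaemon not found, skipping per-process data" >&2
fi
```

A command that is missing or fails only produces a warning, so the remaining files are still written.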
After running gdb, make sure that the lfcdaemon process has not been left stopped (state T in the STAT column) by checking it with ps.
If it is stopped, make the daemon continue with the command:
kill -CONT lfcdaemon_process_number
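This final check can also be scripted. The sketch below assumes a ps that supports -o stat= and -p (as procps does) and a process named exactly lfcdaemon; is_stopped and resume_if_stopped are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: resume the daemon if it was left in the stopped (T) state.

is_stopped() {
    # Succeed when the state reported by ps for PID $1 starts with "T".
    state=$(ps -o stat= -p "$1" 2>/dev/null | tr -d ' ')
    [ -n "$state" ] || return 2      # no such process, or ps failed
    case $state in
        T*) return 0 ;;
        *)  return 1 ;;
    esac
}

resume_if_stopped() {
    # Send SIGCONT only when the process is actually stopped.
    if is_stopped "$1"; then
        kill -CONT "$1"
    fi
}

pid=$(pgrep -o -x lfcdaemon || true)
if [ -n "$pid" ]; then
    resume_if_stopped "$pid"
fi
```

Sending SIGCONT to a process that is already running is harmless, so the state check mainly serves to make the situation visible.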
-- Main.dcollado - 18 Dec 2006