VOMS Post Mortem 24th October 2008
Problem and Impact
The node voms113, one half of the voms.cern.ch alias, was removed from the alias at 20:15. It is not clear exactly when the service failed, but I expect it stopped accepting voms-proxy-init connections at around the same time the load balancer could no longer see it. Consequently roughly half of the voms-proxy-init commands will have failed for around 5 minutes before 20:15. Full service was restored at 22:32, and in the intervening period of a little over two hours the single node left behind voms.cern.ch will have provided adequate service.
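For context, voms.cern.ch is a DNS load-balanced alias over two nodes, so while a dead node remains registered roughly half of the clients resolve to it. The sketch below only illustrates that effect and is not the production probe; the port number and timeout are assumptions made for the example.

<verbatim>
#!/usr/bin/env python3
"""Minimal sketch (not the production probe): resolve a round-robin alias and
attempt a TCP connect to every address behind it, mimicking what a
voms-proxy-init style client sees when one node behind the alias is dead.
The port and timeout below are assumptions for illustration only."""
import socket

ALIAS = "voms.cern.ch"
PORT = 15001          # assumed VOMS port, for the example only
TIMEOUT = 5           # seconds

def check_alias(alias, port):
    addresses = sorted({info[4][0] for info in
                        socket.getaddrinfo(alias, port, socket.AF_INET,
                                           socket.SOCK_STREAM)})
    for addr in addresses:
        try:
            socket.create_connection((addr, port), timeout=TIMEOUT).close()
            print(f"{addr}:{port} OK")
        except OSError as err:
            # With two addresses in the alias, roughly half of the clients
            # would have picked this one while it was dead.
            print(f"{addr}:{port} FAILED ({err})")

if __name__ == "__main__":
    check_alias(ALIAS, PORT)
</verbatim>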
Significance of Security Scan Starting at 19:40
The main indicator that the security scan was the trigger is essentially the coincidence in time. IPTables was clearly confused and was blocking packets that are perfectly normal and part of the regular service, e.g. Taipei requesting a grid-mapfile from voms-admin.
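Most of the evidence in the time line below comes from iptables reject messages in the system log. As a rough aid, the following sketch summarises such lines by source, destination and port; the log path and the REJECT/DROP keywords are assumptions about how logging might be configured, not a statement of the actual setup on voms113.

<verbatim>
#!/usr/bin/env python3
"""Minimal sketch: summarise iptables reject/drop lines from the syslog.

Assumptions (not taken from the voms113 configuration): iptables logs to
/var/log/messages and the log prefix contains the word REJECT or DROP, with
the usual SRC=/DST=/DPT=/PROTO= fields produced by the LOG target."""
import re
from collections import Counter

LOGFILE = "/var/log/messages"   # assumed location
FIELD = re.compile(r"(SRC|DST|PROTO|DPT)=(\S+)")

def summarise(path=LOGFILE):
    counts = Counter()
    with open(path, errors="replace") as fh:
        for line in fh:
            if "REJECT" not in line and "DROP" not in line:
                continue
            fields = dict(FIELD.findall(line))
            key = (fields.get("SRC", "?"), fields.get("DST", "?"),
                   fields.get("PROTO", "?"), fields.get("DPT", "?"))
            counts[key] += 1
    return counts

if __name__ == "__main__":
    for (src, dst, proto, dpt), n in summarise().most_common(20):
        print(f"{n:6d}  {src} -> {dst}  {proto}/{dpt}")
</verbatim>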
Follow Up to Security Scan
I have contacted computer.security, mainly to request that this kind of scan is not started at 19:00 on a Friday. But basically this is an OS problem; I am not the only one this has happened to. I have also asked how often they run the security scan.
Alarm to Operators.
- The failure was first detected by a lemon metric at 21:13 and shortly afterwards it should ideally have raised an alarm with the Operators. But:
- An actuator script runs 3 times first to try and fix the detected problem. Unfortunately the second iteration hung, and with no timeout configured on the actuator the 3rd attempt was never reached. The hang would have been due to a lack of network, though in principle voms itself should not hang in such situations. A timeout has now been added. A quick grep across all services suggests timeout=0 is very common; it is the default. (A minimal sketch of running an actuator under a timeout appears after this list.)
- This should have been raised even earlier though. The node could not be queried by the load balancer from 20:15, so something should have noticed, i.e. something should be made to notice.
- The remote_probe/ping test is now enabled.
- It is not clear (to me) why the heartbeat alarm was never triggered.
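As referenced above, here is a hedged sketch of running an actuator command under a hard timeout, so that a hung attempt cannot block the remaining retries. The command, the 20-minute limit and the retry count are placeholders for illustration; this is not the lemon configuration syntax itself.

<verbatim>
#!/usr/bin/env python3
"""Minimal sketch: run an actuator command with a hard timeout so a hung
attempt cannot block the remaining retries (the situation described above,
where attempt two hung forever and attempt three was never reached).

The command and numbers are placeholders, not the real lemon configuration."""
import subprocess

ACTUATOR = ["/etc/init.d/voms", "restart"]   # placeholder actuator command
TIMEOUT = 1200                               # seconds; 0 would mean "never"
ATTEMPTS = 3

def run_actuator():
    for attempt in range(1, ATTEMPTS + 1):
        try:
            rc = subprocess.call(ACTUATOR, timeout=TIMEOUT or None)
        except subprocess.TimeoutExpired:
            # A hung attempt is killed instead of blocking the next one.
            print(f"attempt {attempt}: timed out after {TIMEOUT}s")
            continue
        if rc == 0:
            print(f"attempt {attempt}: actuator succeeded")
            return True
        print(f"attempt {attempt}: actuator exited with {rc}")
    # All attempts failed or hung: this is the point where an alarm
    # should be raised to the Operators.
    return False

if __name__ == "__main__":
    if not run_actuator():
        print("raising operator alarm (placeholder)")
</verbatim>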
Detailed Time Line.
This includes some earlier events from the daytime as well, which may or may not be significant. Some random odd things from the logs are also included, which again may not be significant.
- 15:32
- voms113 was set to maintenance mode to confirm that SLS notifications were being forwarded via SMS to my GSM phone. They were.
- 16:05
- voms113 was set back to production mode.
- 19:40:30
- Routine security scan started from 137.138.137.111
- 19:40:35
- IPTables on voms113 blocked an outgoing UDP connection on port 514 (syslog) to lxfs6074.cern.ch (not obvious why, since lxfs6074 is not among the configured syslog targets).
- 20:12
- IPTables on voms113 blocked an outgoing TCP connection on port 8443 (FTS?) to w-fts01.grid.sinica.edu.tw.
- This is Taiwan asking for a grid-mapfile. This ACK should not normally have been blocked, suggesting iptables is in a bad way.
- 20:15
- The Load Balancing service can no longer see results from voms113 and removes it from the DNS.
- 20:26
- tomcat on voms113 begins to get into difficulty due to being unable to create any more threads.
- All start-up attempts are successful in the sense that the start-up script returns 0 for use by lemon, and at that point the logs look fine. However, after a short time in operation tomcat writes the error to the log file, but by then it is too late for lemon. Solution: add a sleep to the end of the actuator so lemon will pick up the failed start. (A sketch of such a check appears after this time line.)
- In reality tomcat_outofmemory and tomcat_server_log were not added to the central DB by lemon, so it might never have made a difference. The re-request has now been made, but I should have checked it was there.
- 20:26:40
- IPTables rejected contact from sure04.cern.ch, UDP on port 4242. (to be checked)
- 20:26:42
- IPTables rejected contact from 10.5.178.156, UDP on port 623. (to be checked)
- 20:26:46
- IPTables rejected contact from sure04.cern.ch, UDP on port 4242. (to be checked)
- 20:26:47
- IPTables rejected contact from 10.5.178.156, UDP on port 623. (to be checked)
- 20:26:47
- IPTables rejected contact from sure04.cern.ch, UDP on port 4242. (to be checked)
- 21:13
- voms service on voms113 cannot be contacted locally on the box.
- Lemon metric(809) = -1, enough for voms113 to be removed from the voms.cern.ch alias. This would probably have happened now had it not already happened above.
- SLS scripts also notice that metric(809) is bad and SLS is updated accordingly.
- SMS sent to me that SLS is showing problems with the VOMS service.
- 21:14
- Lemon exception(30096) actuator runs because of metric(809), finishes at 21:34 with exit code 1.
- 21:36
- Lemon exception(30096) actuator runs because of metric(809) (or is this -1 relating to the previous run?).
- This actuator apparently never finishes according to the lemon logs.
- So since the three runs of the actuator never happen, the alarm is never raised.
- I have a Timeout=0, which I presume means never time out?
- 21:40
- I start looking at the voms113 service. voms113 cannot be pinged or logged into, but it is still shown as up on lemonweb.cern.ch.
- 22:12
- I set voms113 to maintenance.
- This stops all the service restart attempts.
- The node seems okay, I can now login.
- This suggests all the service restarts are doing more harm than good in this case.
- 22:19
- Last obvious contact of the security scan from 137.138.137.111.
- 22:21
- Load Balancer is now able to see voms113 again but with a value of -1 since it is in maintenance mode, i.e. it is not added back to the voms.cern.ch alias.
- 22:26
- The voms service on voms113 apparently recovers (i.e. metric(809) > 0).
- I check and restart services by hand.
- At this point it looks like the service may have recovered by itself. Unfortunately I do not notice this and proceed as planned.
- I was planning to put voms113 into maintenance and reboot the node via ipmi.
- Metric(809) was bad from 21:12 to 22:26, but no alarm seems to have made it to LAS and the Operators. This is being followed up.
- 22:15
- Expecting this all to be iptables confusion, I choose to reboot the node.
- 22:19
- Node begins to reboot.
- 22:31
- Everything looks okay on voms113. Set the node back to production.
- 22:32
- Load Balancer notices the node is now back in production and adds it back to voms.cern.ch. Everything is working.
- 22:40ish
- SLS notices and I am SMSed again.
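As referenced in the 20:26 entry above, here is a hedged sketch of a restart actuator that sleeps after restarting tomcat and re-checks the service before reporting success, so that a start-up which only fails a short time later is still seen as a failure by the monitoring. The init script path, port and delay are placeholders, not the actual voms113 configuration.

<verbatim>
#!/usr/bin/env python3
"""Minimal sketch of a restart actuator that waits and re-checks before
reporting success to the monitoring system.  Without the sleep, tomcat can
return 0 from its init script and only fail (e.g. out of threads or memory)
a little later, which the monitoring then misses.

Paths, port and timings are placeholders, not the real voms113 setup."""
import socket
import subprocess
import sys
import time

INIT_SCRIPT = ["/etc/init.d/tomcat5", "restart"]   # placeholder
HEALTH_PORT = 8443                                  # placeholder
SETTLE_SECONDS = 60                                 # placeholder

def restart_and_verify():
    rc = subprocess.call(INIT_SCRIPT)
    if rc != 0:
        return rc
    # Give tomcat time to run into any delayed failure before the restart
    # is declared a success.
    time.sleep(SETTLE_SECONDS)
    try:
        # Simple TCP connect check as a stand-in for a real health probe.
        socket.create_connection(("localhost", HEALTH_PORT), timeout=30).close()
    except OSError as err:
        print(f"service not healthy after restart: {err}", file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(restart_and_verify())
</verbatim>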
Load Graph for Period
Conditions on Host Under Which Crash Happened.
Since part of the FTS service also failed at the same time, it is worth tabulating some of the properties of the two machines at the time the crash happened.
See also PostMortemFts24Oct08.
| Host | OS | Running Kernel | Arch | Hardware | Network Driver | Listening Services | Kernel Tuning |
| voms113 | 4.7 | 2.6.9-78.0.1.EL.cernsmp | i386 | dl_07_2 | bnx2 1.6.9 478DCC31494367381918DC4 | edg-voms, snmpd, ssh, tomcat, notd | net.ipv4.conf.default.rp_filter = 1, net.ipv4.conf.default.accept_source_route = 0, kernel.core_uses_pid = 1, kernel.shmmax = 134217728, net.ipv4.tcp_keepalive_intvl = 10, net.ipv4.tcp_keepalive_probes = 9, net.ipv4.tcp_keepalive_time = 500, dispatch = false |
| fts112 | 3.0.8 | 2.4.21-52.EL.cernsmp | i386 | ch620-636-24 | tg3 | ssh, notd | net.ipv4.conf.default.rp_filter = 1, kernel.core_uses_pid = 1, kernel.shmmax = 134217728, net.ipv4.tcp_keepalive_intvl = 10, net.ipv4.tcp_keepalive_probes = 9, net.ipv4.tcp_keepalive_time = 500, dispatch = false |
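The Kernel Tuning column above is mostly a list of sysctl settings. As a rough aid for comparing nodes, the sketch below reads the corresponding /proc/sys entries and reports any value that differs from the tabulated ones; the expected values are copied from the voms113 row, it assumes a Linux /proc layout, and the dispatch = false entry (which does not look like a sysctl key) is skipped.

<verbatim>
#!/usr/bin/env python3
"""Minimal sketch: compare the sysctl values listed in the table above with
what a node actually has in /proc/sys.  Expected values are copied from the
voms113 row; the non-sysctl "dispatch = false" entry is left out."""
import os

EXPECTED = {
    "net.ipv4.conf.default.rp_filter": "1",
    "net.ipv4.conf.default.accept_source_route": "0",
    "kernel.core_uses_pid": "1",
    "kernel.shmmax": "134217728",
    "net.ipv4.tcp_keepalive_intvl": "10",
    "net.ipv4.tcp_keepalive_probes": "9",
    "net.ipv4.tcp_keepalive_time": "500",
}

def current_value(key):
    path = os.path.join("/proc/sys", *key.split("."))
    with open(path) as fh:
        return fh.read().strip()

if __name__ == "__main__":
    for key, expected in sorted(EXPECTED.items()):
        try:
            actual = current_value(key)
        except OSError as err:
            print(f"{key}: unreadable ({err})")
            continue
        status = "OK" if actual == expected else f"DIFFERS (expected {expected})"
        print(f"{key} = {actual}  {status}")
</verbatim>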
--
SteveTraylen - 28 Oct 2008