VOMS Post Mortem 24th October 2008

Problem and Impact

The node voms113 half of the voms.cern.ch was removed from the alias at 20:15. It is not clear when the service actually failed but I expect it stopped accepting voms-proxy-init connections at the same time the load balancer could no longer see it. Consequently half of the voms-proxy-init commands will have failed for around 5 minutes before 20:15.

Full service was restored at 22:32 and in the two hour period the one node behind voms.cern.ch will have provided adequate service.

Significance of Security Scan Starting at 19:40

The main indicator that the security scan was the trigger is essentially the time period. IPTables was clearly confused and blocking packets which are perfectly normal and part of the normal service. e.g Taipei requesting a grid-mapfile from voms-admin.

Follow Up to Security Scan

Have contacted computer.security mainly to request not to do this kind of thing at 19:00 on a Friday. But basically this is an OS problem really, I'm not the only one for who this happens. Have also ask how often they run the security scan.

Alarm to Operators.

  • The failure was first detected by a lemon metric at 21:13 and shortly afterwards it should have ideally raised an alarm with the Operators. But:
    • An actuator script runs 3 times first to try and fix the detected problem, unfortunately the second iteration hung and with no timeout configured on the actuator the 3rd attempt was never reached. The hang would have been due to a lack of network though in principal voms itself should not hang in such situations. A timeout will, and has been added. A quick grep across all services suggest timeout=0 is very common. It's the default.
  • This should have been raised even earlier though. The node could not be queried by the load balancer from 20:15 so something should have noticed. i.e something should be made to notice.
    • The remote_probe/ping test is now enabled.
  • It is not clear (tome) why the heartbeat alarm was never triggered.

Detailed Time Line.

This includes some earlier events in the day time as well which may or may not be significant. Also some random odd things from the logs are also included again which may not be significant.

15:32
voms113 was set to maintenance mode to confirm that SLS notifications were being forward via SMS to my GSM. They were.
16:05
voms113 was set back to production mode.
19:40:30
Routine security scan started from 137.138.137.111
19:40:35
IPTables on voms113 blocked an outgoing UDP connection on 514 (syslog) to lxfs6074.cern.ch (not obvious lxfs6074 is not in the configured syslog)
20:12
IPTables on voms113 blocked an outgoing TCP connection on 8443 (FTS?) to w-fts01.grid.sinica.edu.tw .
  • This is taiwan asking for a grid-mapfile. This ack should not have been blocked normally suggesting iptables is in a bad way.
20:15
The Load Balancing service can now not see results from voms113 and removes it from the DNS.
20:26
tomcat on voms113 begins to get into difficulty due to being unable to create any more threads.
  • All start up attempts are successful in the sense that the start up script returns 0 for use by lemon and at that point the logs look fine, however after a short time in operation tomcat spits out the error to the log file but this is to late for lemon. Solution add a sleep to the end of actuator so lemon will pick up the failed start.
  • In reality tomcat_outofmemory and tomcat_server_log were not added to the central DB by lemon so it might never have made a difference. Re-request now made but I should have checked it was there.
21:13
voms service on voms113 cannot be contacted locally on box.
  • Lemon metric (809) = -1, enough for voms113 to be removed from the voms.cern.ch alias. This would probably have happened if it had not already happened above.
  • SLS Scripts also notice that metric(809) is bad and SLS updated accordingly.
  • SMS sent to me that SLS is showing problems with the VOMS service.
21:14
Lemon exception(30096) actuator runs because of metric(809) , finishes at 21:34 with exit code 1.
21:36
Lemon exception(30096) actuator runs because of metric(809) (or is -1 to previous run?).
  • This actuator apparently never finishes according to the lemon logs.
  • So since three runs of the actuator never happen the alarm is never raised. frown
  • I have a Timeout=0 which I presume means never timeout?
20:26:40
IPTables rejected contact from sure04.cern.ch UDP on port 4242 (to be checked)
20:26:42
IPTables rejected contact from 10.5.178.156 , UDP on port 623. (to be checked)
20:26:46
IPTables rejected contact from sure04.cern.ch UDP on port 4242. (to be checked)
20:26:47
IPTables rejected contact from 10.5.178.156 , UDP on port 623. (to be checked)
20:26:47
IPTables rejected contact from sure04.cern.ch UDP on port 4242. (to be checked)
21:40
I start looking at the voms113 service. voms113 cannot be pinged or logged into but is still up on lemonweb.cern.ch
22:12
I set voms113 to maintenance.
  • This stops all the service restart attempts.
  • The node seems okay, I can now login.
    • This suggests all the service restarts are doing more harm than good in this case.
22:19
Last obvious contact of the security scan from 137.138.137.111.
22:21
Load Balancer is now able to see voms113 again but with a value of -1 since it is in maintenance mode. i.e not added back to voms.cern.ch alias.
22:26
The voms service on voms113 apparently recovers. (i.e metric(809) > 0 )
  • I check and restart services by hand.
  • At this point it looks like the service may have recovered by itself. Unfortunately I do not notice and proceed.
  • I was planning to put voms113 into maintenance and reboot the node via ipmi.
  • Metric (809) was bad from 21:12 to 22:26 but no alarm seems to have made it to LAS and the Operators. Being followed up.
22:15
Expecting this all to be an iptables confusion I choose to reboot the node.
22:19
Node begins to reboot.
22:31
Everything looks okay on voms113. Set the node back to production.
22:32
Load Balancer notices the node is now back in production and adds it back to voms.cern.ch. Everything is working.
22:40ish
SLS noticies, I am SMSed again.

Load Graph for Period

Load.png

Conditions on Host Under Which Crash Happened.

Since part of the FTS service also failed at the same time it is worth tabulating some of the properties of the two machines at the time the crash happened. PostMortemFts24Oct08

Host OS Running Kernel Arch Hardware Network Driver Listening Services Kernel Tuning
voms113 4.7 2.6.9-78.0.1.EL.cernsmp i386 dl_07_2 bnx2 1.6.9 478DCC31494367381918DC4 edg-voms, snmpd, ssh, tomcat, notd net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.core_uses_pid = 1
kernel.shmmax = 134217728
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 500
dispatch = false
fts112 3.0.8 2.4.21-52.EL.cernsmp i386 ch620-636-24 tg3 ssh and notd net.ipv4.conf.default.rp_filter = 1
kernel.core_uses_pid = 1
kernel.shmmax = 134217728
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 500
dispatch = false

-- SteveTraylen - 28 Oct 2008

Topic attachments
I Attachment History Action Size Date Who Comment
PNGpng Load.png r1 manage 7.5 K 2008-10-28 - 17:35 SteveTraylen  
Edit | Attach | Watch | Print version | History: r5 < r4 < r3 < r2 < r1 | Backlinks | Raw View | WYSIWYG | More topic actions
Topic revision: r5 - 2008-11-03 - SteveTraylen
 
    • Cern Search Icon Cern Search
    • TWiki Search Icon TWiki Search
    • Google Search Icon Google Search

    LCG All webs login

This site is powered by the TWiki collaboration platform Powered by PerlCopyright & 2008-2019 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback