Two OVM servers were rebooted by the high availability system
Description
During what was supposed to be routine maintenance, the OVS agent rebooted two nodes (dbsrvg3213 and dbsrvg3506), causing the hosted machines to be restarted elsewhere.
All the virtual machines restarted, but on some the networking was broken because the emulated network devices were not "connected" to the correct ethX interfaces. As a result, a number of machines were unable to connect to the NAS.
In particular, this affected DEVDB11.
Impact
Primarily users of DEVDB11, which was brought back a few hours later on a physical host.
Time line of the incident
10:15 Two nodes are rebooted
Analysis
Currently waiting for a response from Oracle (SR 3-4816118881).
See https://twiki.cern.ch/twiki/bin/viewauth/DB/Private/YetAnotherMysteryReboot for some preliminary thoughts.
Follow up
It seems that a transparent upgrade of OVM2 will not be possible, so several days of downtime involving a complete halt of all systems will be needed.
The guest networking issue might be solved by adding alias ethX entries to /etc/modprobe.conf.
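As a rough way to check whether a guest already has such entries, the sketch below scans the text of a modprobe.conf file for lines of the standard form `alias <name> <module>` and reports which expected ethN interfaces have none. It is only an illustration: the function names, the sample contents and the module names (`example_net_driver`, `example_blk_driver`) are hypothetical placeholders, and the real driver name for each interface has to come from the guest itself.

```python
import re

# Matches lines such as "alias eth0 <module>".
# Comment lines and aliases for other device names do not match.
ALIAS_RE = re.compile(r"^\s*alias\s+(eth\d+)\s+(\S+)")


def find_eth_aliases(conf_text):
    """Return {interface: module} for every 'alias ethN <module>' line."""
    aliases = {}
    for line in conf_text.splitlines():
        match = ALIAS_RE.match(line)
        if match:
            aliases[match.group(1)] = match.group(2)
    return aliases


def missing_aliases(conf_text, expected_interfaces):
    """Return the expected interfaces that have no alias line."""
    found = find_eth_aliases(conf_text)
    return [name for name in expected_interfaces if name not in found]


# Hypothetical file contents, for illustration only.
SAMPLE = """\
# hypothetical guest configuration
alias eth0 example_net_driver
alias scsi_hostadapter example_blk_driver
"""

print(missing_aliases(SAMPLE, ["eth0", "eth1"]))  # prints ['eth1']
```

On a real guest the text would be read from /etc/modprobe.conf, and the list of expected interfaces would be taken from the virtual machine's configuration.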
In short, OVM2 is not fit for use as a virtualisation platform.
-- EwanRoche - 26-Oct-2011