Two OVM servers were rebooted by the high availability system

Description

During what was supposed to be routine maintenance, the OVS agent rebooted two nodes (dbsrvg3213 and dbsrvg3506), causing the hosted machines to be restarted elsewhere.

All the virtual machines restarted, but on some of them networking was broken because the emulated network devices were not "connected" to the correct ethX interfaces. As a result, a number of machines were unable to connect to the NAS.

In particular, this affected DEVDB11.

Impact

Primarily users of DEVDB11, which was brought back a few hours later on a physical host.

Time line of the incident

10:15 Two nodes are rebooted

Analysis

Currently waiting for a response from Oracle (SR 3-4816118881).

See https://twiki.cern.ch/twiki/bin/viewauth/DB/Private/YetAnotherMysteryReboot for some preliminary thoughts.

Follow up

It seems that it will not be possible to perform a transparent upgrade of OVM2, so several days of downtime involving a complete halt of all systems will be needed.

The guest networking issue might be solved by adding alias ethX entries to /etc/modprobe.conf.
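
As a rough illustration of how such entries could be checked, the sketch below scans the text of a modprobe.conf file for "alias ethN module" lines and reports which expected interfaces have none. It is only a sketch under assumptions: the function names are hypothetical, the module name xennet in the sample is an assumed paravirtualised network driver rather than something confirmed for these guests, and whether the alias actually fixes the problem is still untested.

```python
import re

# Matches lines of the form "alias eth0 <module>" (standard modprobe.conf alias syntax).
ALIAS_RE = re.compile(r"^\s*alias\s+(eth\d+)\s+(\S+)")


def eth_aliases(conf_text):
    """Return {interface: module} for every 'alias ethN module' line."""
    aliases = {}
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0]  # ignore comments
        match = ALIAS_RE.match(line)
        if match:
            aliases[match.group(1)] = match.group(2)
    return aliases


def missing_aliases(conf_text, expected):
    """Return the interfaces in 'expected' that have no alias line."""
    present = eth_aliases(conf_text)
    return [iface for iface in expected if iface not in present]


if __name__ == "__main__":
    # Sample content; 'xennet' and 'xenblk' are assumed module names, for illustration only.
    sample = (
        "alias eth0 xennet\n"
        "# alias eth1 xennet\n"
        "alias scsi_hostadapter xenblk\n"
    )
    print(missing_aliases(sample, ["eth0", "eth1"]))
```

On a guest, the same functions could be given the contents of /etc/modprobe.conf and the list of interfaces the machine is expected to have.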

In short, OVM2 is not fit for use as a virtualisation platform.

-- EwanRoche - 26-Oct-2011

Topic revision: r1 - 2011-10-26 - EwanRoche