Intervention on the physics power in the vault area

Description

  • A planned intervention on the physics power in the vault area resulted in an outage of nine blade servers in one enclosure.

Impact

  • All virtual servers running on these blades were rebooted and some did not come back.

Timeline of the incident

  • 2-April-12 - 11:07 Planned intervention on physics power to the racks for RAC10 and GEN3.
  • 2-April-12 - 11:07 Partial loss of servers in blade enclosure encSX1101: blades dbsrvg35[01-09].
  • 2-April-12 - 11:07 DFM and other alarms from GEN3 storage and RAC10 storage regarding missing power.
  • 2-April-12 - 11:19 "No contact" alarms from dbvrtg067 and others.
  • 2-April-12 - 11:50 Power reports checked on site in the vault area.
  • 2-April-12 - 13:22 Power restored to all affected blade servers in encSX1101.
  • 3-April-12 - 13:41 Planned intervention on physics power completed.

Analysis

  • A power intervention was announced in the C5 minutes. This intervention affected the physics power to all our systems in the vault area, i.e. itrac10 and GEN3.

In order to minimize disruption to IT-DB services during the planned power outage, IT-CF moved equipment that was exclusively on physics power to a mix of physics and critical power. This was preceded by a visual check to ensure that all power supplies to the equipment were functioning correctly; no problems were noted.

Unfortunately, one blade enclosure for GEN3, containing a large number of virtual servers, suffered a power outage.

Each blade enclosure for GEN3 and RAC10 has a total of six power supplies (PSUs): three on physics power and three on critical power. When the physics power was cut, two of the three remaining powered PSUs (on critical power) failed, leaving only one PSU powering the whole enclosure. On a positive note, five blades did stay up in the enclosure during this period. Power was restored to the enclosure by moving all PSUs to critical power.

Currently no alarms are raised, directly or indirectly, by any monitoring of the enclosures. The problem has been reported to the procurement team.
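
A check along the following lines could be hooked into the regular monitoring so that a degraded enclosure is flagged immediately. This is only a minimal sketch, not an existing production script: it assumes the PSU status is readable via ipmitool from a management host, and the threshold of four healthy PSUs is an assumption; depending on the enclosure model the same information may instead have to be pulled from the enclosure's own management module, in which case the parsing would need to be adapted.

<verbatim>
#!/usr/bin/env python
# Minimal sketch of an enclosure PSU check (not an existing production script).
# Assumes PSU sensors are visible via ipmitool; adapt if the data has to come
# from the enclosure management module instead.
import subprocess
import sys

EXPECTED_PSUS = 6   # 3 on physics power + 3 on critical power
MIN_HEALTHY = 4     # alarm threshold - an assumption, to be agreed with CF

def healthy_psu_count():
    """Count PSUs that report presence and no failure."""
    out = subprocess.check_output(
        ["ipmitool", "sdr", "type", "Power Supply"], universal_newlines=True)
    healthy = 0
    for line in out.splitlines():
        if "Presence detected" in line and "Failure detected" not in line:
            healthy += 1
    return healthy

if __name__ == "__main__":
    n = healthy_psu_count()
    if n < MIN_HEALTHY:
        print("ALARM: only %d of %d PSUs healthy" % (n, EXPECTED_PSUS))
        sys.exit(2)   # non-zero exit so the alarm system can pick it up
    print("OK: %d of %d PSUs healthy" % (n, EXPECTED_PSUS))
</verbatim>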

OVM problems

Three server pools were badly affected (G0, G1, G-Pool): when power returned, a number of the hypervisors were unable to connect to the OVM manager or to mount their storage. Because these are the old, inconsistently configured systems, there is little that can be done for them.

Note: these are the machines that management had decided not to upgrade in order to reduce the amount of time spent on OVM2.

The principal problems observed with OVM were:

  • Hypervisors unable to contact the NAS for unknown reasons (a basic connectivity check is sketched after this list)

  • OVM manager trying to start VMs on broken servers (despite them being in maintenance mode!)

  • Removal of broken machines from pools blocked because they were still the preferred host of a VM

  • Agents losing connection with the OVM manager for unknown reasons
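
For the first of these points, a basic per-hypervisor check would at least distinguish "NAS not reachable" from "NAS reachable but repository not mounted". The sketch below is illustrative only: the filer hostname is a placeholder, and /OVS is the default OVM2 repository mount point, which may differ on these pools.

<verbatim>
#!/usr/bin/env python
# Rough sketch of a per-hypervisor storage check for the failure mode above.
# NAS_HOST is a placeholder, not the real filer from this incident.
import os
import socket
import sys

NAS_HOST = "nas-filer.example.cern.ch"   # placeholder hostname
NFS_PORT = 2049
REPO_PATH = "/OVS"                       # default OVM2 repository mount point

def nas_reachable(host, port, timeout=5):
    """Return True if a TCP connection to the NFS port succeeds."""
    try:
        sock = socket.create_connection((host, port), timeout)
        sock.close()
        return True
    except socket.error:
        return False

if __name__ == "__main__":
    problems = []
    if not nas_reachable(NAS_HOST, NFS_PORT):
        problems.append("cannot reach %s:%d" % (NAS_HOST, NFS_PORT))
    if not os.path.ismount(REPO_PATH):
        problems.append("%s is not mounted" % REPO_PATH)
    if problems:
        print("PROBLEM: " + "; ".join(problems))
        sys.exit(2)
    print("OK: NAS reachable and repository mounted")
</verbatim>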

Follow up

  • Improve communications: when work is being carried out, the responsible people should be informed directly, rather than relying on the C5 channel.
  • More machines have been added to GEN3B in order to maintain/migrate services. These were taken from the broken systems and reinstalled with the correct configuration.
  • The CF side is investigating further to understand the power supply issue on the enclosures.
  • In collaboration with CF and the sysadmins, a suitable power test should be carried out to verify that alarms are correctly raised in case of PSU failure; an outline of such a test is sketched below.
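
One possible shape for that power test, assuming the PSU check sketched in the Analysis section has been deployed (referred to here as check_enclosure_psus.py, a hypothetical name): run the check once with all PSUs connected and once with a physics-side feed deliberately disconnected, and verify that it flips from OK to alarm.

<verbatim>
#!/usr/bin/env python
# Illustrative outline of the proposed power test, not an agreed procedure.
# check_enclosure_psus.py refers to the sketch in the Analysis section and is
# not an existing production script.
import subprocess
import sys

def psu_check_ok():
    """True if the PSU check script exits 0 (i.e. no alarm raised)."""
    return subprocess.call(["python", "check_enclosure_psus.py"]) == 0

if __name__ == "__main__":
    # Run with "baseline" before the test and "degraded" with one feed pulled.
    phase = sys.argv[1] if len(sys.argv) > 1 else "baseline"
    ok = psu_check_ok()
    if phase == "baseline" and not ok:
        sys.exit("FAIL: alarm raised before the test even started")
    if phase == "degraded" and ok:
        sys.exit("FAIL: no alarm raised with a PSU feed disconnected")
    print("PASS: phase '%s' behaved as expected" % phase)
</verbatim>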

-- PaulSmith - 04-Apr-2012
