Description

On 11 August 2010, following instructions from NetApp Support (NetApp Log #2001633149), we tried to change the shelf ID loop identifier of two disk shelves: one added three weeks earlier and one added the previous Monday (9/Aug/2010).

Impact

  • Only one aggregate went offline, so only the file systems on that aggregate suffered the longer downtime (~15 minutes); the remaining file systems had a downtime of ~3 minutes. Although most databases and application servers are in a non-production state, certain file systems are in production, i.e. the AIS application servers for test and development as well as the production Introscope setup. Some of the databases did not notice the reboot, as the storage cluster reboots in less than 3 minutes and the database is able to keep I/O in memory during that time.

Time line of the incident

  • 12:10 Ruben Gaspar (IT/DB): Following the intervention notice (http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/100811renumber.htm), I started the intervention. Ten minutes later I realised that the NetApp procedure did not work for the first disk shelf.
  • 12:30 Ruben Gaspar (IT/DB): Applying the same procedure to the second disk shelf caused a reboot of the node that was taking over the cluster activity. Two minutes after the reboot, several file systems on the failover node were offline because of 4 apparently faulty disks. The disks were not actually faulty: instabilities induced by powering the disk shelf off and on (part of the NetApp procedure) made the FC loop flag them as problematic.
  • 12:35 Ruben Gaspar (IT/DB): After checking with Nilo and Eric, we decided to power off the cluster to recover all file systems as soon as possible. This also caused downtime for the file systems that were still working. Mod informed.
  • 12:41 mail to ais-incidents to indicate that a problem was encountered
  • 12:45 Ruben Gaspar (IT/DB): System is back online. Mod informed.
  • 13:15 Luigi restarts the virtual machines
  • 15:55 mail to ais-incidents indicating that the platform should be available again
  • 16:00 and 16:01 notification from Lucy and Ada that some of the application servers are not reachable
  • 16:15 OS of some AIS application servers restarted by Luigi
  • 16:22 and 16:32 confirmation from Bartosz and Ada that all is correct
  • 17:01 confirmation from Dmitry that all of the infrastructure is running fine
  • 17:15 discussion about Introscope which needed to be restarted (by AIS)

Analysis

See above.

Follow up

A case is ongoing with NetApp Support.

-- RubenGaspar - 11-Aug-2010

Topic revision: r3 - 2010-08-12 - EricGrancher
 