ACCLOG activity suspended due to outage of one NAS Filer.

Description

Database activity in ACCLOG was suspended when NFS access to one of the filers was cut because of problems during the upgrade to ONTAP 8.1.3.

Impact

  • The ACCLOG service was not operational: new data could not be loaded because the failing filer held the current tablespace.

Time line of the incident

  • 23-July-2013 11:00 Started the ONTAP upgrade
  • 23-July-2013 11:22 After the successful upgrade of dbnasa401, we initiated the takeover of dbnasa402 - the filer then went into a bad state
  • 23-July-2013 11:47 Contacted a NetApp engineer in Lausanne for a first analysis
  • 23-July-2013 12:30 Opened a Priority 1 case (2004452244) with NetApp Support - the problem is related to dirty NVRAM contents that prevent the boot sequence from completing
  • 23-July-2013 15:30 Support found a method to bypass the NVRAM content check and get the boot going; the update steps to 8.1.3 were completed
  • 23-July-2013 15:35 The dNFS links cleared automatically when the filer completed the reboot, and the DB service became operational
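
The intervals implied by the timeline above can be worked out with a short sketch. The timestamps are copied from the entries above; the dictionary keys and the helper `minutes_between` are hypothetical names introduced only for this illustration. It also assumes that the outage began at 11:22, when dbnasa402 went into a bad state; the report does not state the exact moment NFS access was lost.

```python
from datetime import datetime

# Timestamps copied from the timeline above (23 July 2013).
# The key names are illustrative labels, not terms from the report.
timeline = {
    "upgrade_started": datetime(2013, 7, 23, 11, 0),
    "filer_bad_state": datetime(2013, 7, 23, 11, 22),
    "netapp_contacted": datetime(2013, 7, 23, 11, 47),
    "case_opened": datetime(2013, 7, 23, 12, 30),
    "boot_bypass_found": datetime(2013, 7, 23, 15, 30),
    "service_restored": datetime(2013, 7, 23, 15, 35),
}


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes elapsed between two timeline entries."""
    delta = timeline[end] - timeline[start]
    return int(delta.total_seconds() // 60)


# Assumption: the outage starts when dbnasa402 goes into a bad state.
outage_minutes = minutes_between("filer_bad_state", "service_restored")
escalation_minutes = minutes_between("filer_bad_state", "case_opened")

print(f"Outage: {outage_minutes // 60} h {outage_minutes % 60} min")
print(f"Time until the support case was opened: {escalation_minutes} min")
```

Under that assumption the service was unavailable for a little over four hours, most of which was spent between opening the support case and finding the NVRAM bypass.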

Analysis

The intervention was an upgrade to ONTAP 8.1.3 on dbnasa401 and dbnasa402. The two filers form a clustered pair with fail-over features that normally permit a transparent upgrade.

The upgrade was from ONTAP 8.0.2P1.

The upgrade completed successfully on dbnasa401. On dbnasa402, the takeover action launched from dbnasa401 did not complete, even though we used the correct takeover syntax. dbnasa402 did not manage to shut down gracefully and was left in an undefined state. Subsequent attempts to reboot it failed because the NVRAM contents were still "dirty", so ONTAP considered it unsafe to continue with the upgrade steps that follow the reboot.

With the help of NetApp engineers, we bypassed the initial NVRAM check and completed the upgrade. The NVRAM contents could be replayed safely because the block structure in the NVRAM is compatible between 8.0.2 and 8.1.3. DB activity resumed when the dNFS links came back, and no data appears to have been lost, as confirmed by the ACCLOG managers.

Follow up

  • We have sent all the available trace and core dump files to Support for further analysis. The problem is related to a known bug in the 8.0 releases in which ssh connections are not closed, which blocks the takeover process. It is fixed in the current ONTAP 8.1.3 release.
Topic revision: r2 - 2013-08-08 - NiloChinchilla