RAC52 storage issue

Description

Database services running on RAC52 were generally unavailable to users. In some cases existing connections continued to work, but no new connections were allowed.

Impact

  • All databases that had volumes (RDBMS or CRS) on RAC52 storage. These are all production database services.
    • List available in the outage record: https://cern.service-now.com/service-portal/view-outage.do?from=CSP-Service-Status-Board&n=OTG0040966
    • Final list of databases that had volumes on the two affected aggregates, as seen from the storage side. However, application resiliency might have limited the actual impact: ACCCON, ACCLOG, ACCMEAS, ADCR, AISBIP, ALICESTG, ATLARC, ATLASSTG, ATLIMT, ATLR, ATONR, CASTORNS, CERNDBU, CMSARC, CMSONR, CMSR, CMSSTG, COMPR, CSDB, ENCVORCL, GISDBR, LASER, LCGR, LHCBONR, LHCBR, LHCBSTG, PDBR, PUBSTG, QPSR, REPACKDB, SCADAR, SUSI, TIM, ZORA

Time line of the incident

  • 19.Nov 20:45: first filer crashed
  • 19.Nov 21:11: partner node became unresponsive and rebooted
  • 19.Nov 21:45: partner node was brought online
  • 19.Nov 22:00: first filer was brought back online and databases were restarted
  • 19.Nov 22:20: databases were back to normal operation

Analysis

On Sunday 19.Nov at approximately 20:45, one of the production storage filers hosting several database volumes crashed with what appears to be an internal data processing timeout. As per HA policy, data access was immediately switched to the surviving partner filer; however, the problem re-appeared while replaying the journal, and as a consequence the partner node became unavailable as well. The impact was wide, but thanks to the policy of not keeping related volumes on a single filer, many databases were only degraded rather than completely unavailable.

Root cause of the internal storage problem is being followed up with vendor, case 2007155241. (https://mysupport.netapp.com/cssportal/faces/oracle/webcenter/portalapp/pages/css/casesparts/CaseDetailsLanding.jspx?_adf.no-new-window-redirect=true&caseNumber=2007155241)

Database services running on RAC52 were generally unavailable to users, although in some cases existing connections kept working while no new connections were allowed. New connections were not possible because, even though the RDBMS volumes were fine (e.g. volumes still on RAC50, such as SCADAR, QPSR, ACCLOG, ACCCON, ACCMEAS), the CRS volumes were on RAC52, so the listeners (managed by CRS) would not accept new connections.
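The dependency described above can be sketched with a small model: existing sessions only need the RDBMS volumes, while new connections also need the listeners, which depend on the CRS volumes. The filer/volume assignments below are illustrative, based on the examples in this report, not an authoritative inventory.

```python
# Sketch of why some databases kept existing sessions but refused new
# connections during the incident. Assignments are illustrative only.

FILER_UP = {"RAC50": True, "RAC52": False}  # filer state during the incident

# (rdbms_filer, crs_filer) per database -- example data, not the real inventory
DATABASES = {
    "SCADAR": ("RAC50", "RAC52"),
    "QPSR":   ("RAC50", "RAC52"),
    "ATLR":   ("RAC52", "RAC52"),
}

def session_status(db):
    """Return (existing_sessions_ok, new_connections_ok) for a database."""
    rdbms, crs = DATABASES[db]
    existing_ok = FILER_UP[rdbms]            # existing sessions touch only RDBMS volumes
    new_ok = existing_ok and FILER_UP[crs]   # new sessions also need CRS (listeners)
    return existing_ok, new_ok

for db in sorted(DATABASES):
    existing_ok, new_ok = session_status(db)
    print(f"{db}: existing={'ok' if existing_ok else 'down'}, "
          f"new connections={'ok' if new_ok else 'refused'}")
```

With this model, SCADAR and QPSR keep existing sessions but refuse new ones, matching the observed behaviour, while ATLR loses both.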

CRS was in an unmanageable and unresponsive state: CRS commands to check/manage it did not work, although the OS processes were all present. Most systems recovered by themselves when the CRS volumes became responsive again. After the storage was back to normal, rebooting the hosts restored the remaining services (the ones that had not recovered by themselves) to normal operation.

In the case of some database hosts, such as ATLR, which have routable IP addresses, the volumes could not be mounted without manual intervention, which left those databases unavailable for a longer period than the others.

Not everyone is aware that the CERN Service Status Board (cern.ch/ssb) is the place to get quick and reliable information on the status of services provided by IT. We should remind users of this.

Follow up

  • There should be an easy way to get a list of the databases and services that are unavailable
    • This should be a script that can be executed from db-manager
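A minimal sketch of what such a script could look like, assuming db-manager can provide a mapping from each database to its volumes' (filer, aggregate) pairs. The inventory format, names, and data below are hypothetical, used only to illustrate the filtering logic.

```python
# Hypothetical sketch: given a volume inventory (database -> list of
# (filer, aggregate) pairs), list the databases affected by an outage of
# a given filer or aggregate. On db-manager this would be fed from the
# real inventory; the data here is made up for illustration.

INVENTORY = {
    "ATLR":   [("RAC52", "aggr1"), ("RAC52", "aggr2")],
    "SCADAR": [("RAC50", "aggr1"), ("RAC52", "aggr2")],
    "PDBR":   [("RAC51", "aggr3")],
}

def affected_databases(inventory, filer=None, aggregate=None):
    """Return the sorted list of databases with at least one volume
    matching the given filer and/or aggregate."""
    hit = set()
    for db, volumes in inventory.items():
        for vol_filer, vol_aggr in volumes:
            if filer is not None and vol_filer != filer:
                continue
            if aggregate is not None and vol_aggr != aggregate:
                continue
            hit.add(db)
    return sorted(hit)

print(affected_databases(INVENTORY, filer="RAC52"))
```

During an incident like this one, a single call such as `affected_databases(real_inventory, filer="RAC52")` would immediately give the list that had to be assembled by hand for the outage record.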

  • The current email deluge is highly inefficient, especially during a major outage
    • It should be easier to disable monitoring for a large number of databases for the duration of the outage, via a web interface as opposed to git
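One possible shape for such a bulk operation, sketched in Python: a single suppression window covering many databases at once, instead of a git edit per database. The data model and function names are hypothetical; this only illustrates the idea a web interface could implement.

```python
from datetime import datetime, timedelta

# Hypothetical sketch of a bulk monitoring "downtime": one record that
# mutes alarms for many databases for the duration of an outage.

def make_suppression(databases, start, duration_hours, reason):
    """Build one suppression record covering all given databases."""
    return {
        "databases": sorted(set(databases)),
        "start": start,
        "end": start + timedelta(hours=duration_hours),
        "reason": reason,
    }

def is_suppressed(suppression, database, at):
    """True if alarms for `database` should be muted at time `at`."""
    return (database in suppression["databases"]
            and suppression["start"] <= at < suppression["end"])

outage = make_suppression(
    ["ATLR", "SCADAR", "QPSR"],
    start=datetime(2017, 11, 19, 20, 45),
    duration_hours=2,
    reason="RAC52 storage outage (OTG0040966)",
)
print(is_suppressed(outage, "ATLR", datetime(2017, 11, 19, 21, 30)))
```

The key design point is that the suppression is a single object with an explicit end time, so monitoring re-enables itself automatically after the outage window instead of requiring a second round of per-database edits.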

  • In addition to the SSB entry we should consider informing the operators in case of a big outage.

  • We should add an SSB entry as soon as possible, but not be explicit about the affected services until a full list is available. We could use a generic message similar to "Some database services unavailable. We are investigating the issue."

  • Computer Center operator procedures should be reviewed
    • The operator did not contact the person on piquet but someone else from IT-DB instead
    • Feedback indicated that the operator did not have a clear idea of how to proceed in this situation
    • update from Monday, 27 Nov 2017
      • there was a meeting with Benoit Clement and Fabio Trevisani about this. They will follow up with the operator on why he did not call the person on piquet.
      • Operator Procedures were reviewed and they were correct for this situation. The operator did follow the procedure in case of a phone call ("Phone Call from a user").

  • We should remind our users how they should proceed in such situations: always call the Computer Center operator first, and then send an email/open a ticket with more details that would help solve the issue

  • We should also remind our users that the CERN Status Board (cern.ch/ssb) is the place to get quick and reliable information on the status of services provided by IT.

Vendor analysis

  • More than 30 GB of core dump files were uploaded to NetApp support for analysis
  • Support engineers finally matched the problem to: https://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=865444
    • Excerpt from the above bug: Certain applications, such as SQL Server and SAP HANA, issue large truncations to files as a part of reclaim operations. When a large file that is sparse, compressed, or cloned is truncated, ONTAP defers the deallocation of the file blocks by assigning them to a separate file, and calculates the block counts of each piece of the original file. In a very large file with holes, compressed or cloned blocks, the calculation of the block counts can cause a delay in the consistency point (CP). If the delay exceeds the ten-minute limit, it leads to a controller disruption.

Permanent solution

  • Vendor recommends upgrading to at least ONTAP 9.1, where this problem is fixed. This is planned during YETS 2017.
Topic revision: r8 - 2017-11-30 - MiroslavPotocky
 