Multiple node restarts after power cut

Description

Several databases were affected following the power issue (cf. the CF C5 report) that hit 24 servers in RAC50/barn at 00:22 on Saturday 7th June. All clusters reconfigured automatically (moving the services to the remaining node(s)). This led to sessions being dropped and transient unavailability of the databases during the reconfiguration. A manual restart of some of the instances was required.

Impact

  • Service degraded (node shutdown): adcr, atonr, atlr, atlarc, zora, acclog, accon, enccorcl, susi, lcgr, cmsr, cmsonr_sdg, alicestg, atlasstg, pubstg, csdb, lemonrac, itcore

Time line of the incident

  • Begin: Fri, Jun 06, 2014 23:50
  • Resolution: Sat, Jun 07, 2014 02:15
  • Node restarts occurred in the range 00:20 - 00:30 on Saturday morning

Analysis

  • Root cause: one of the PDUs of critical rack BF08 tripped on 06/06 at 23:50 because of a defective power supply in a server. All boxes in the rack, except the management switch, stayed up. Unfortunately, the operator then switched off the second PDU by mistake, which stopped all boxes in the rack. The database service was strongly affected.

  • Additional findings: some of the DB services did not handle the failure gracefully (graceful failover was the behaviour for the majority); see details below

Issues, follow-up actions and ideas for additional follow-up actions

  • SUSI:
    • the sqlnet.ora file on the affected instance was not correct and prevented the instance from restarting
    • manual action and fix: copy sqlnet.ora from the surviving node (which had a good copy) and restart
    • additional follow-up idea: automatic configuration and checks could include sqlnet.ora
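    For illustration, on a RAC cluster the sqlnet.ora under $ORACLE_HOME/network/admin should normally be identical across nodes, which is why the surviving node's copy could be used as-is. A minimal sketch of such a file (the parameter values below are hypothetical examples, not the actual SUSI configuration):

    ```text
    # $ORACLE_HOME/network/admin/sqlnet.ora -- illustrative example only;
    # the point is that every node of the cluster should carry a
    # consistent, syntactically valid copy of this file.
    NAMES.DIRECTORY_PATH = (TNSNAMES, EZCONNECT)
    SQLNET.EXPIRE_TIME = 10
    ```

    When one node's copy is corrupted, restoring the surviving node's copy (e.g. with scp) and restarting the instance, as was done here, is the quickest fix.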
  • ITCORE
    • The affected instance (on node itrac50023) did not start up automatically, throwing ORA-00119 ("invalid specification for system parameter %s")
    • Manual fix: set remote_listener to null and then set it to the 'old syntax' with the two SCAN VIPs
    • Follow-up: the root cause is not understood
  • Backup filesystems on some systems were not mounted because of a missing Puppet configuration
    • manually fixed; permanent fix via the Puppet configuration
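    A hedged sketch of the remote_listener fix described above, run from SQL*Plus on the affected instance (the hostnames and port below are placeholders, not the actual values used on itrac50023):

    ```sql
    -- Clear the invalid remote_listener value that triggered ORA-00119
    ALTER SYSTEM SET remote_listener='' SID='*' SCOPE=BOTH;
    -- Re-set it using the 'old' address-list syntax with the two SCAN VIPs
    -- (scan-vip1/scan-vip2 and port 1521 are hypothetical placeholders)
    ALTER SYSTEM SET remote_listener=
      '(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=scan-vip1)(PORT=1521))
                     (ADDRESS=(PROTOCOL=TCP)(HOST=scan-vip2)(PORT=1521)))'
      SID='*' SCOPE=BOTH;
    ```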
  • envorcl
    • the instance would not start up because of ORA-16188: "LOG_ARCHIVE_CONFIG settings inconsistent with previously started instance"
    • manual fix required: investigation of the issue on MOS and then setting LOG_ARCHIVE_CONFIG
    • Follow-up: we learned that when deconfiguring ADG, log_archive_config should either be left untouched or modified to DG_CONFIG=(NODG_CONFIG)
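    The follow-up recommendation above can be sketched as the following SQL*Plus command (a sketch based on this report; the appropriate value depends on the Data Guard setup of each database):

    ```sql
    -- After deconfiguring ADG, keep log_archive_config consistent across
    -- all instances: either leave it untouched, or set it explicitly, e.g.
    ALTER SYSTEM SET log_archive_config='DG_CONFIG=(NODG_CONFIG)' SCOPE=BOTH;
    -- A value inconsistent with an already-started instance raises
    -- ORA-16188 at instance startup, as seen in this incident.
    ```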
  • cmsonr ADG: the second instance went into mount mode
    • after the crash of instance 1, instance 2 went into mount mode ("Close the database due to aborted recovery session")
    • secondary issue: problems with DB links to an instance that had been opened from mount mode
    • Follow-up: reproduce the possible bug causing the ADG to go into mount mode, then open an SR
    • Follow-up: do not re-open DBs that went from open to mount; rather shutdown and open
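    Per the follow-up above, the preferred recovery from the mount state is a clean restart rather than opening the instance directly from MOUNT. A sketch of the SQL*Plus sequence on the affected ADG instance:

    ```sql
    -- Opening straight from MOUNT caused DB-link problems here,
    -- so bounce the instance instead:
    SHUTDOWN IMMEDIATE
    STARTUP
    -- On an Active Data Guard standby the database then comes up
    -- read only and managed recovery can be restarted as usual.
    ```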
  • castor: issues with application reconnection
    • Follow-up: Kate will work with the developers to reproduce the issue
  • csdb: issues with sending emails
    • this revealed an email misconfiguration: only node itrac50001 was configured to send email, and after the reboot the service moved to itrac50053
    • this has since been fixed
    • additional follow-up: CS asked to fail over the service without the force option (they preferred to restart the app)
  • Follow-up question:
    • should services be relocated after such incidents? Should they be relocated with the force option?
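    For reference, relocating a RAC service is done with srvctl; the -f (force) option disconnects existing sessions, which is the trade-off behind the question above (database, service, and instance names below are placeholders):

    ```shell
    # Relocate a service to another instance; without -f, sessions
    # already connected to the old instance are left in place.
    srvctl relocate service -d MYDB -s MYSVC -i MYDB1 -t MYDB2
    # With -f, sessions on the old instance are disconnected.
    srvctl relocate service -d MYDB -s MYSVC -i MYDB1 -t MYDB2 -f
    ```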

  • Additional point
    • Good configuration on our side: all affected clusters had at least one node up during the power cut.

-- LucaCanali - 16 Jun 2014

Topic revision: r4 - 2014-06-19 - LucaCanali