Spontaneous reboots of different nodes of the CMSR production database

Description

During the last 3 weeks the CMS offline production database was affected by numerous reboots of different cluster nodes. The reboots affected applications critical for CMS data analysis activity.

Impact

  • Applications connected to the rebooted nodes lost their sessions and any transactions in progress. This disturbs CMS data analysis activities.

Time line of the incident

  • 15-Sep-2011 13:20 - First reboot (node number 2). Node evicted by the clusterware. No indication of the root cause.
  • 21-Sep-2011 12:20 - Another reboot of node number 2, followed by reboots of nodes 3 and 4. Again, no indication of the root cause in the logs.
  • 22-Sep-2011 14:20 - Node number 1 reboots.
  • 22-Sep-2011 - OSWatcher software deployed to gather more information.
  • 30-Sep-2011 14:30 - Node 3 reboots. Nothing in the logs, including the OSWatcher logs.
  • 02-Oct-2011 - Service Request (SR) opened with Oracle Support.
  • 06-Oct-2011 8:20 - Reboot of node 2, followed by a reboot of node 1.
  • 06-Oct-2011 17:20 - Node 3 reboots.
  • 06-Oct-2011 - First feedback from Oracle Support: they requested setting the diagwait timeout at the clusterware level.
  • 07-Oct-2011 16:00 - The whole database was stopped in order to enable diagwait.

Analysis

The cause of the reboots is still unknown. We suspect a bug introduced by one of the patches applied on 29 August. With diagwait enabled, the logs should hopefully contain more information. The problem is being analyzed by Oracle Support.

Follow up

Enabling diagwait seems to have solved the issue. Since diagwait has no side effects, the database can safely stay in this mode (most RAC databases at CERN already run with diagwait enabled permanently).
Topic revision: r3 - 2011-11-07 - unknown