Services were not failed over properly after node eviction on ATLR, June 26th 2010

Description

The 4th node of the ATLAS offline database (ATLR) suffered from high load and was evicted from the cluster. Unfortunately the failover mechanism, which is normally instantaneous, did not react properly and manual intervention was needed.

Impact

  • Applications using the following 9 Oracle services could not connect to the database between 15:10 and 16:00 (50 minutes) on Saturday 26th June: atlas_ami, atlas_atlog, atlas_authdb, atlas_config, atlas_coolwrite, atlas_dd, atlas_larcalib, atlas_oksprod, atlas_prodsys
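
As an illustration only, the status of the affected services could be checked one by one with Oracle's srvctl utility. The sketch below merely builds the command lines and does not run them. It assumes that "ATLR" is the database name registered with the clusterware, and the helper name status_commands is hypothetical; neither the commands nor the exact manual intervention are recorded in this report.

```python
# Minimal sketch: build (but do not execute) "srvctl status service" commands
# for the services listed in the Impact section.
# Assumption: "ATLR" is the database name known to the clusterware.

AFFECTED_SERVICES = [
    "atlas_ami", "atlas_atlog", "atlas_authdb", "atlas_config",
    "atlas_coolwrite", "atlas_dd", "atlas_larcalib", "atlas_oksprod",
    "atlas_prodsys",
]


def status_commands(db_name, services):
    """Return one srvctl argument list per service (hypothetical helper)."""
    return [
        ["srvctl", "status", "service", "-d", db_name, "-s", service]
        for service in services
    ]


if __name__ == "__main__":
    # Print the commands so a DBA can review them before running anything.
    for cmd in status_commands("ATLR", AFFECTED_SERVICES):
        print(" ".join(cmd))
```

Building argument lists rather than shell strings keeps the commands easy to review and to pass to subprocess.run later without quoting problems.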

Time line of the incident

  • 26.06.2010 15:10 - The 4th node of the ATLR database was rebooted by the clusterware, but the Oracle service failover mechanism did not relocate the affected services properly.
  • 26.06.2010 15:30 - DBA on shift started investigation
  • 26.06.2010 16:00 - Problem fixed

Analysis

  • Investigation showed that the incident may have been caused by one of the known bugs described on metalink.oracle.com in documents 8392418 and 7188878.
  • A service request has been opened with Oracle Support to confirm this. No reply yet.

Follow up

  • Follow up depends on the reply from Oracle Support.

-- JacekWojcieszuk - 07-Jun-2010


Topic revision: r1 - 2010-06-29 - unknown
 