Services were not failed over properly after node eviction on ATLR, June 26th 2010
Description
The 4th node of the ATLAS offline database (ATLR) suffered from high load and was evicted from the cluster. Unfortunately, the service failover mechanism, which is normally instantaneous, did not react properly and manual intervention was needed.
Impact
- Applications using the following 9 Oracle services could not connect to the database between 15:10 and 16:00 on Saturday 26th June: atlas_ami, atlas_atlog, atlas_authdb, atlas_config, atlas_coolwrite, atlas_dd, atlas_larcalib, atlas_oksprod, atlas_prodsys
Time line of the incident
- 26.06.2010 15:10 - The 4th node of the ATLR database was rebooted by the clusterware, but the Oracle service failover mechanism did not relocate the affected services properly.
- 26.06.2010 15:30 - DBA on shift started investigation
- 26.06.2010 16:00 - Problem fixed
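The durations implied by the time line above can be worked out with a short sketch. The timestamps are taken from the entries above; the dictionary keys and the helper name `minutes_between` are illustrative and not part of any tooling used during the incident.

```python
from datetime import datetime

# Timestamp format used in the time line entries (day.month.year hour:minute).
FMT = "%d.%m.%Y %H:%M"

# Timestamps copied from the time line; the key names are illustrative.
timeline = {
    "node_rebooted": datetime.strptime("26.06.2010 15:10", FMT),
    "investigation_started": datetime.strptime("26.06.2010 15:30", FMT),
    "problem_fixed": datetime.strptime("26.06.2010 16:00", FMT),
}


def minutes_between(start, end):
    """Return the whole number of minutes elapsed between two datetimes."""
    return int((end - start).total_seconds() // 60)


# Time from the node reboot until the DBA on shift started investigating.
time_to_response = minutes_between(
    timeline["node_rebooted"], timeline["investigation_started"]
)
# Time from the start of the investigation until the problem was fixed.
time_to_repair = minutes_between(
    timeline["investigation_started"], timeline["problem_fixed"]
)
# Total time during which the affected services were unavailable.
total_outage = minutes_between(
    timeline["node_rebooted"], timeline["problem_fixed"]
)

print(f"Time to response: {time_to_response} min")
print(f"Time to repair:   {time_to_repair} min")
print(f"Total outage:     {total_outage} min")
```

Under these timestamps the services were unavailable for 50 minutes in total: 20 minutes before the investigation started and 30 minutes until the fix.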
Analysis
- Investigation showed that the incident may have been caused by one of the known bugs described on metalink.oracle.com in documents 8392418 and 7188878.
- A service request has been opened with Oracle Support to confirm this. No reply yet.
Follow up
- Follow up depends on the reply from Oracle Support.
--
JacekWojcieszuk - 07-Jun-2010
Topic revision: r1 - 2010-06-29