Description
Instance 4 of the Atlas offline database was rebooted 4 times over a period of 8 days. All reboots happened between 4AM and 5AM and occurred every second day.
Impact
- The reboots affected COOL sessions connected to instance 4 of the ATLR database. The database remained fully accessible during each reboot, and each time services were properly relocated between cluster nodes, which minimized the service impact.
Timeline of the incident
- Sunday 28.11 4:21AM atlr4 reboot - node rebooted by cluster, services relocated to remaining nodes.
- Tuesday 30.11 4:41AM atlr4 reboot - node rebooted by cluster, services relocated to remaining nodes.
- Thursday 02.12 4:21AM atlr4 reboot - node rebooted by cluster, services relocated to remaining nodes.
- Saturday 04.12 4:28AM atlr4 reboot - node rebooted by cluster, services relocated to remaining nodes.
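The regularity of the reboots listed above can be checked with a short script. This is only an illustrative sketch: the report does not state the year, so 2010 is assumed here because 28.11 falls on a Sunday in that year, and the variable names are hypothetical.

```python
from datetime import datetime

# Reboot times taken from the timeline. The year is NOT given in the
# report; 2010 is an assumption (28.11 is a Sunday in that year).
ASSUMED_YEAR = 2010
reboots = [
    datetime(ASSUMED_YEAR, 11, 28, 4, 21),
    datetime(ASSUMED_YEAR, 11, 30, 4, 41),
    datetime(ASSUMED_YEAR, 12, 2, 4, 21),
    datetime(ASSUMED_YEAR, 12, 4, 4, 28),
]

# Calendar-day gap between each pair of consecutive reboots.
day_gaps = [(b.date() - a.date()).days for a, b in zip(reboots, reboots[1:])]

# True if every reboot started inside the 04:00-05:00 window.
all_in_window = all(r.hour == 4 for r in reboots)

print("day gaps:", day_gaps)
print("all between 4AM and 5AM:", all_in_window)
```

A constant gap of two days combined with a fixed one-hour window points towards a scheduled job rather than an application-driven load pattern, which is consistent with the conclusion of the analysis.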
Analysis
Different symptoms were observed for each reboot, so the root cause was not clear for a few days. Initially the COOL application was suspected and a lot of effort went into investigating it. In parallel, all database jobs specific to this instance or to the COOL application were carefully reviewed. To provide detailed diagnostic information, OSWatcher was deployed together with additional database jobs that created AWR snapshots every 2 minutes between 4AM and 5AM. Thanks to this additional monitoring information, an internal script for rotating Oracle log and trace files was identified as the root cause of the problem. This script is deployed on every DB machine and had never caused this kind of problem before.
Follow up
After a clean-up of the local directories the problem has not reappeared. We are still working on this issue to fully understand it and to prevent future occurrences.