ADCR database suffered from lost writes

Description

In the period from 9th November 2012 to 15th January 2013 the ADCR production database suffered from several lost writes. In total 5 incidents were registered in which an old version of a database block was retrieved from disk when a newer one was expected, meaning that some of the block write operations never made it to disk, hence the name "lost writes".

Impact

The lost writes resulted in the corruption of 2 indexes, 2 index partitions and 1 table partition in the ADCR production database (one object in each incident). There was no permanent data loss, as we were able to fix all corruptions: the corrupted indexes were rebuilt, and the corrupted table partition was fixed with manual deletes of the excess rows (the lost write affected a delete operation).
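
For illustration only, repairs of this kind boil down to statements of the following form; the exact commands used during the incidents are not recorded here, the object names are taken from the timeline below purely as examples, and the :excess_rowid bind is hypothetical:

    -- Rebuild a corrupted non-partitioned index; ONLINE keeps it usable during the rebuild
    ALTER INDEX ATLAS_DQ2.IDX_LOCATION_CONS REBUILD ONLINE;

    -- Rebuild a single corrupted partition of a partitioned index
    ALTER INDEX ATLAS_PANDA.PART_FILESTABLE4_PK REBUILD PARTITION PART_INITIAL_01011970;

    -- Remove an excess row left behind by the lost delete
    -- (:excess_rowid is a hypothetical bind for a row identified during the analysis)
    DELETE FROM ATLAS_LFC.CNS_FILE_REPLICA WHERE ROWID = :excess_rowid;
    COMMIT;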

As a result of the incidents the ADCR Active Data Guard (ADG) standby was stale for 10 to 65 hours after each incident. This is because the standby detects lost writes by validating the redo data received from the primary against its own copy of the data blocks, and it stops further synchronization when an inconsistency is detected.

Timeline of the incidents

  • 09-11-2012 21:00 – Index ATLAS_DQ2.IDX_LOCATION_CONS: ADG stale for ~40 hrs
  • 11-11-2012 05:22 – Index SYS.WRI$_ADV_EXECS_IDX_02
  • 16-11-2012 21:59 – Index partition ATLAS_PANDA.PART_FILESTABLE4_PK(PART_INITIAL_01011970): ADG stale for ~65 hrs
  • 01-12-2012 21:07 – Index partition ATLAS_PANDAARCH.FILES_ARCH_PANDAID_IDX(FILESTABLE_ARCH_NOV_2012): ADG stale for ~20 hrs
  • 15-01-2013 06:06 – Table partition ATLAS_LFC.CNS_FILE_REPLICA(ID_LESS_1000000000): ADG stale for ~11 hrs

Analysis

The problem was investigated in cooperation with Oracle and NetApp support. We identified several issues that could have caused or contributed to the lost writes, but the true root cause was never fully confirmed.

The most likely cause was related to the NFS storage configuration. The "-p" flag was added to the execution of the "reallocate" operation (used to redistribute data over the available disks) just two weeks before the first incident, and no new incidents were observed after the flag was removed on 15th January. This cannot be proven to be the root cause, as we could not reproduce the problem elsewhere, but we cannot take the risk of re-enabling the "-p" option on the critical ADCR database.
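
For reference, a Data ONTAP reallocation job is started roughly as follows; the volume name is illustrative and the exact invocation used on our filers is not recorded here. The first form is the default logical reallocation, the second is the "-p" (physical) variant that was removed after 15th January:

    reallocate start /vol/adcr_data01
    reallocate start -p /vol/adcr_data01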

For the last incident on 15th January a RAC-related bug was also considered, as two instances had written the same block at the same time/SCN and neither of those writes made it to disk. The Oracle RAC development team investigated the possibility of an in-RDBMS clash between those two writes resulting in lost I/Os, but did not reach any final conclusion. We were asked to set some additional database parameters to enable further analysis in case of new incidents, but we could not enable them on ADCR due to their high performance overhead; moreover, there have been no new incidents since then.

The previous incidents did not follow the same write pattern; for example, only one instance was writing the block in the incident of 1st December. Also, in all incidents apart from the last one index blocks were affected, which might suggest that the last incident had a different root cause.

We also observed a large number of Direct NFS (dNFS) read errors on the affected systems. They did not appear to be fatal and happened mostly during redo log archival operations (the ARC process or RMAN sessions), but were also visible in a small number of end-user/application sessions. The cause of the dNFS errors was identified as DNS load balancing of the storage server IP addresses. It cannot be excluded that some of these errors caused lost writes, although there is no direct evidence for this either.
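
For illustration, one way to keep Direct NFS off a load-balanced DNS alias is to list the storage paths explicitly in the oranfstab file; the server name, IP addresses, export and mount point below are purely illustrative, and this sketch is not necessarily the exact change that was applied on our systems:

    server: nas-server-01
    local: 10.1.1.10
    path: 10.1.2.20
    local: 10.1.1.11
    path: 10.1.2.21
    export: /vol/adcr_data01 mount: /ORA/dbs03/ADCR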

Similar problems were also observed on the recovery servers, which run automatic test recoveries from backups. Recovery-specific bugs were found and addressed with patches, but some of the failed recoveries might have been caused by the same factors as the ADCR lost writes. In total we observed 14 failed recovery attempts between 2nd October 2012 and 13th February 2013.

Follow-up

Although the root cause of the lost writes has not been fully confirmed, we have identified several problems and applied the following corrective actions:

  • The "-p" flag was removed from the NetApp reallocate command
  • Recent Oracle CPU patches and some one-off patches were applied
  • dNFS errors have been largely avoided by disabling DNS load balancing for the IP aliases of the NAS servers
  • The newest Linux kernel was deployed
  • Storage disk shelf firmware upgrades were performed

After applying these corrective measures we have seen no new production incidents since 15th January 2013 and no new failed recoveries since 13th February 2013.

With the technologies we use, and due to the nature of the lost write problem, full protection against this kind of issue is not possible. To minimize the chances of hitting lost writes in the future we should:

  • always install the most recent software updates/patches for all system components
  • avoid using unproven/untested features, such as the NetApp reallocate command with the "-p" option

To improve detection of lost writes we should set the "db_lost_write_protect" database parameter to TYPICAL for all databases for which a Data Guard standby or automatic recovery is configured.
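
A minimal sketch of that change, assuming it is applied dynamically on a RAC database (db_lost_write_protect is a dynamic parameter and SID='*' targets all instances); for standby-based detection it needs to be set on both the primary and the standby:

    ALTER SYSTEM SET db_lost_write_protect = TYPICAL SCOPE=BOTH SID='*';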

References

1. Oracle SR#3-6427355311: ORA-00600 [kdsgrp1] on primary and ORA-00600 [3020] on both standbys
2. Oracle SR#3-652123744: spin off SR 3-6427355311 - Lost Write on RAC/Direct NFS Configuration
3. NetApp case #2003879656: Suspicion of lost write (Oracle RAC 11.2.0.3 - Direct NFS)
