ADCR_ADG_RAC7 database suffered from multiple block corruptions after disk failures

Description

In the last 2 weeks of December 2013 we observed multiple read failures on 5 ASM disks. 3 of those disks subsequently failed completely and were ejected from ASM. After the disk failures many data blocks appeared corrupted, causing unavailability of objects in 4 data files: 11, 18, 83 and 127. One more disk was found to be faulty (causing corruptions) on 8th Jan and was manually dropped from ASM. Some of the corrupted blocks were fixed by the "Automatic Block Media Recovery" (ABMR) mechanism, while others appeared as logical corruptions (lost writes) and required a manual fix. The corruptions appeared in data files only (ADCR_DATADG1 disk group) and seem to have been caused by the aforementioned disk failures.

Impact

The ADCR Active Data Guard DB was stale or partially inaccessible from 28th Dec 09:20 until 31st Dec 14:00, when the decision was taken to move the centrally distributed TNS aliases to the ADCR_DG_RAC8 standby and thereby switch end users to the 2nd, healthy standby.

The following tablespaces have been affected:

ATLAS_DQ2_DATA02 - index blocks required a manual fix; they were repaired by marking the blocks corrupt on the standby and rebuilding the index on the primary; ABMR was still reporting corrupted blocks after that fix, so finally the whole datafile was copied from the RAC8 standby.

ATLAS_DQ2_DATA01 - several blocks of a very large IOT index were corrupted; an incremental datafile backup from production was attempted first, but more corruptions surfaced afterwards; to fix the problem a whole-datafile backup was taken on the RAC8 standby and copied to the RAC7 standby.

ATLAS_PS_ARCH07 - 1 table block corruption was observed; an incremental datafile backup from production was used to fix the problem.

ATLAS_LFC_DATA01 - after the 2nd observed corruption a full datafile backup was taken from the primary database; this fixed the problem, but then another corruption/lost write was detected and the file was copied again (this time from the RAC8 standby); 2 days after the 2nd restore the file became corrupted again, indicating a persistent H/W problem. One more disk was suspected to be faulty (causing corruptions) and was therefore force dropped from the storage array.

Full functionality of the RAC7 standby was restored on 9th Jan 2014 and no new corruptions have been observed since then. TNS aliases were switched back to the RAC7 standby on xx Jan 2014.
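For reference, the sketch below shows how a corrupted block reported in the alert log can be mapped to its owning segment (e.g. to check whether it belongs to an index that can be rebuilt) before choosing a fix. It is illustrative only; the file and block numbers are taken from the example further down and the exact commands used during the incident may have differed.

    RMAN> VALIDATE DATAFILE 11;                       -- populates v$database_block_corruption
    SQL>  SELECT file#, block#, blocks, corruption_type
            FROM v$database_block_corruption;
    SQL>  -- map a reported block to its owning segment
    SQL>  SELECT owner, segment_name, segment_type
            FROM dba_extents
           WHERE file_id = 11
             AND 2995481 BETWEEN block_id AND block_id + blocks - 1;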

Timeline of the incidents and corrective actions

  • 17-12-2013 [20:00] - ASM disk 56 fails, causing some write errors "ORA-27061: waiting for async I/Os failed; Linux-x86_64 Error: 5: Input/output error"; this is a likely cause of lost writes.
  • 28-12-2013 [09:18 – 16:02] - 7 blocks of the logical index in ATLAS_DQ2_DATA02 tablespace (file#127) were detected to be inconsistent with REDO (suffered lost writes); the blocks were marked as corrupt and the index was rebuilt on the primary system to fix the problem; in the meantime more blocks of the same file were detected as physically corrupt and were automatically repaired (fetched from the primary) by the Automatic Block Media Recovery feature.
  • 28-12-2013 [19:34] – 1 block of a very large IOT primary index in ATLAS_DQ2_DATA01 tablespace (file#83) was detected to be inconsistent with REDO; as IOT primary indexes cannot be rebuilt, an incremental backup of the datafile was taken from the primary system and the standby file was manually recovered with the backup (see the RMAN sketch after this timeline); MRP was resumed and the standby was able to catch up, becoming fully functional again; in the meantime file#127 was reporting and automatically fixing several physical block corruptions with ABMR.
  • 30-12-2013 [03:28] - ASM disk 82 fails.
  • 30-12-2013 [16:44] - ASM disk 45 fails, causing multiple write errors "ORA-27061: waiting for async I/Os failed; Linux-x86_64 Error: 5: Input/output error"; this is a likely cause of lost writes.
  • 30-12-2013 [16:47] - 1 block of a table in ATLAS_PS_ARCH07 tablespace (file#18) was detected to be inconsistent with REDO; the corruption was fixed by taking an incremental datafile backup on the primary and recovering the file on the standby from that backup.
  • 30-12-2013 [17:58] - 2 blocks of the IOT primary index in ATLAS_DQ2_DATA01 tablespace (file#83) - the same object that was corrupted on 28th Dec - were detected to be inconsistent with REDO; since this was the 2nd incident with the same datafile, a full "backup as copy" was taken on the RAC8 standby and copied to the RAC7 ADG; the file was switched to the copy, recovered and finally onlined; in the meantime file#11 was reporting multiple corrupted blocks that were repaired with the ABMR feature.
  • 01-01-2014 [15:40-21:53] - 2 blocks of an index in ATLAS_LFC_DATA01 tablespace were detected to be inconsistent with REDO; an online index rebuild plus marking the block corrupt on the standby was attempted first, but upon discovery of the 2nd corruption the whole file was backed up "as copy" from the primary to the RAC7 standby; the file was switched to the copy, recovered and onlined; in the meantime file#11 and also file#127 were both reporting corrupted blocks being fixed with ABMR.
  • 02-01-2014 [19:52] - 1 block of an index in ATLAS_LFC_DATA01 tablespace was detected to be inconsistent with REDO (not the same index as before); an online index rebuild on the primary was performed and the block was marked corrupt on the standby to fix the problem; in the meantime file#127 was reporting block corruptions being fixed with ABMR; afterwards more blocks of file#11 were reported as corrupt but not fixed, for example: ORA-01578: ORACLE data block corrupted (file # 11, block # 2995481).
  • 06-01-2014 - file#11 was again copied from a healthy source (RAC8 standby) due to more blocks being reported as corrupt in the alert log; at the same time corrupted blocks were also being reported on file#127.
  • 07-01-2014 - file#127 was copied from the RAC8 standby due to errors spotted earlier in the alert log; after the copy no new corrupt blocks were found in this file.
  • 08-01-2014 [03:10-03:14] - 2 blocks of an index in ATLAS_LFC_DATA01 tablespace were detected to be inconsistent with REDO; since this happened for the 3rd time in this particular file, and the file had already been recovered twice from a healthy source, a persistent H/W issue constantly corrupting the blocks was suspected; further investigation showed that both corrupt blocks were located on disk /dev/mapper/itstor718_3p1, which had been reporting errors on the storage side just before the incident - the disk was force dropped to avoid further corruptions.
  • 08-01-2014 [xx:xx] - a disk was discovered to be faulty (although not ejected by ASM) and was manually dropped. No new corruptions were reported afterwards.
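For reference, a minimal sketch of the incremental-backup repair used above for single corrupted datafiles (the file number, SCN placeholder and paths are illustrative; the exact commands used during the incident may have differed):

    -- On the failed standby: stop managed recovery and note the file checkpoint SCN
    SQL>  ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
    SQL>  SELECT checkpoint_change# FROM v$datafile WHERE file# = 83;
    -- On the primary: take an incremental backup of the file starting from that SCN
    RMAN> BACKUP INCREMENTAL FROM SCN <scn_from_step_above> DATAFILE 83 FORMAT '/tmp/df83_%U';
    -- Transfer the backup pieces to the standby host, then on the standby:
    RMAN> CATALOG START WITH '/tmp/df83_';
    RMAN> RECOVER DATAFILE 83;
    -- Resume managed recovery
    SQL>  ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;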

Follow up and lessons learned

  1. We have to extend the RAC8 warranty as we still have several important production services on this hardware, including the 2nd standby for ADCR; the current warranty expires in Jan 2014.
  2. We have to check disk-level monitoring and the automatic creation of tickets when disks fail or I/O errors happen; in the past we used to receive such notifications, but this is no longer the case.
  3. Our environment has to be ready to support fast and transparent switching of database users from one standby to another, or from a standby to the primary. To achieve this we should implement functional IP aliases (+ relocation procedures) for all production services, meaning:
    • Functional IP aliases for SCAN VIPs, round robin for all hosts and also a single IP alias for individual hosts, e.g. ADCR-ADG, ADCR-ADG1, ADCR-ADG2; note that the round robin aliases used for SCAN in the CRS configuration cannot be reused as functional aliases.
    • Make sure only functional aliases to SCAN VIPs are used in the centrally distributed TNS file, instead of machine names or IP addresses.
    • Document on Twiki the script which moves functional IP aliases from one server/IP to another; this would be used to quickly redirect users to a different standby or to the primary system. [Emil to coordinate and document the procedure]
  4. In case of massive corruptions involving multiple blocks, especially those related to H/W failures, or if any of the conditions listed below apply, we should not mark blocks as corrupt on the standby without at least completing some additional checks first:
    • The failed standby is the only standby, or all standbys reported corruptions of the same blocks.
    • There is a "lost write on primary" warning in the standby alert log.
    • The object concerned is not an index; note that table data blocks or IOT index blocks cannot be rebuilt and therefore cannot be repaired by marking blocks corrupt on the standby.
    • The object is an index or index partition, but it is very big (and so very slow to rebuild online); everything >100GB should be considered too big.
  5. If more than 1 standby exists and the problem was only reported on 1 of them, we can safely assume that the problem is local to that failed standby. In such a case all corrective actions on the failed standby are allowed. Detailed instructions for dealing with each of those cases will be documented on Twiki in the "Stuck Recovery" article: https://twiki.cern.ch/twiki/bin/viewauth/DB/Private/FixingORA600-3020
  6. If multiple corruptions are present or expected in a single datafile, and especially after a prior H/W issue, we should use a datafile "backup as copy" from a healthy source to fix the broken file (see the sketch after this list); some important considerations for datafile copies to standbys with ASM storage:
    • For an ASM to ASM direct (over the network) copy with RMAN you need 4x file_size of space in the target ASM; this is because RMAN wrongly allocates 2x more space than needed (likely a bug), which is then mirrored by ASM, making a total space requirement of 4x; if there is not enough space, consider deleting the original (corrupted) file on the target before starting the copy. The extra space is finally deallocated when the copy is finished.
    • If there is not enough space in the target ASM disk group even after deleting the corrupted file, consider a backup as copy from a NAS based system (if such exists), as it does not require any excess space.
    • If there is not enough space in the target ASM disk group and there is no healthy NAS based source, but there is space on some ASM based source, use a local backup as copy on the ASM source and then transfer the file to the failed standby using the DBMS_FILE_TRANSFER package (a sketch of this path is included at the end of this page). Note that in order to copy a file this way to a standby, you need to open the standby in R/W mode for the duration of the copy, meaning converting it to SNAPSHOT STANDBY. Once the copy is finished, and before the file is renamed or switched to copy, the database should be converted back to PHYSICAL STANDBY.
  7. If the datafile was copied from a healthy source it should be put in place by offlining and dropping the corrupted file, followed by a datafile rename of the copied file to its original name (see the sketch below); note that the standby database needs to be temporarily put in MANUAL standby_file_management mode for the rename to work.
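A minimal sketch of points 6 and 7 above (the local copy-then-transfer variant); the file number, ASM paths and disk group names are illustrative and the exact commands used during the incident may have differed:

    -- On a healthy source (e.g. the RAC8 standby), with RMAN connected as target:
    RMAN> BACKUP AS COPY DATAFILE 11 FORMAT '+DATADG1';           -- or a NAS/NFS path
    -- make the resulting copy available in the failed standby's ASM disk group

    -- On the failed standby (RAC7), once the healthy copy is in place:
    SQL>  ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
    SQL>  ALTER SYSTEM SET standby_file_management = MANUAL;
    SQL>  ALTER DATABASE DATAFILE 11 OFFLINE;                     -- offline the corrupted file
    SQL>  ALTER DATABASE RENAME FILE '+ADCR_DATADG1/adcr/datafile/corrupted_file11.dbf'
                                  TO '+ADCR_DATADG1/adcr/datafile/healthy_copy_file11.dbf';
    RMAN> RECOVER DATAFILE 11;                                    -- roll the copy forward
    SQL>  ALTER DATABASE DATAFILE 11 ONLINE;
    SQL>  ALTER SYSTEM SET standby_file_management = AUTO;
    SQL>  ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;
    -- (alternatively, the copy can be cataloged with CATALOG DATAFILECOPY and
    --  "SWITCH DATAFILE 11 TO COPY" used in RMAN instead of the rename)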
Points 1-2 will be handled by Giacomo, while points 3-7 will be worked out by Emil.
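For completeness, a minimal sketch of the DBMS_FILE_TRANSFER path mentioned in point 6; the directory objects, database link and file names are hypothetical, and a directory object plus a database link towards the healthy source are assumed to exist or be created on the respective databases:

    -- On the failed standby (RAC7): stop MRP and convert to snapshot standby (opens R/W)
    SQL>  ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
    SQL>  ALTER DATABASE CONVERT TO SNAPSHOT STANDBY;
    SQL>  ALTER DATABASE OPEN;
    -- Destination directory object pointing at the local ASM location (illustrative)
    SQL>  CREATE DIRECTORY dst_dir AS '+ADCR_DATADG1/ADCR/DATAFILE';
    -- Pull the healthy datafile copy over the database link from the source
    SQL>  BEGIN
            DBMS_FILE_TRANSFER.GET_FILE(
              source_directory_object      => 'SRC_DIR',
              source_file_name             => 'file11_copy.dbf',
              source_database              => 'HEALTHY_SRC_LINK',
              destination_directory_object => 'DST_DIR',
              destination_file_name        => 'file11_copy.dbf');
          END;
          /
    -- Convert back to physical standby before the file is renamed or switched to copy
    SQL>  SHUTDOWN IMMEDIATE
    SQL>  STARTUP MOUNT
    SQL>  ALTER DATABASE CONVERT TO PHYSICAL STANDBY;
    SQL>  SHUTDOWN IMMEDIATE
    SQL>  STARTUP MOUNT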