WARNING: This web is not used anymore. Please use PDBService.RaidLostPDB instead!
 

T3 lost RAID 0 - Intervention log

Author: Magnus Lübeck

Date: Jan 6, 2004

Description: This is a summary of what was done on the T3 array "dbsct37", which lost a RAID 0 volume on New Year's Eve.

Sun Support Case ID: 37032948

Two disks were marked as bad in the T3 array, which caused the whole array to stop functioning as intended. The problem was resolved with help from Sun Support.


 

From: Darie Duclos [mailto:Darie.Duclos@Sun.COM]
Sent: Tuesday, January 06, 2004 12:29 PM
To: Magnus Lubeck

Subject: case 37032948


Hi Magnus,

I am getting back to your T3 problem. Sorry for the delay. It's been crazy. I'm writing up
what we have done so far so that we can proceed methodically. Does this look complete to you?

- on both cluster nodes:
   vxdisk offline c4t1d0
   vxdisk offline c4t1d1
   vxdisk rm c4t1d0
   vxdisk rm c4t1d1
 
   vxdisk list [check that c4t1d0-1 are not present]
 
- on the T3:
   vol unmount v0
   vol remove v0 [this deletes all slices]
 

- on both cluster nodes:
   devfsadm -C -v
   scgdevs
   scdidadm -C
   
- on the T3:
   vol add v0 data u1d1-8 raid 5 standby u1d9
   vol init v0 data
   volslice add (add and init both slices: 50GB and 186GB)
   vol mount v0
   
- on both cluster nodes:
   devfsadm -v
   scgdevs
   scdidadm -r
   scdidadm -l [check that c4t1d0-1 are listed]
   format [check that luns are visible on both nodes, label them on one node]
   vxdctl enable
   vxdisk list [check that luns are seen and have error status]
   
- on master cluster node:
   vxdiskadm -> option 5 to replace both failed disks

I am missing some parts of the syslog because it got rotated again. Can you send me the
URL where I can pick up the syslog files from yesterday?


Now I'm looking at the syslog errors during the mirror rebuild. These are bad blocks discovered during reading or writing. The T3 marks them bad and re-allocates the data elsewhere. This isn't a problem.

  

Jan 05 13:01:53 ISR1[1]: N: u1d2 SCSI Disk Error Occurred (path = 0x1)
Jan 05 13:01:53 ISR1[1]: N: Sense Key = 0x1, Asc = 0x18, Ascq = 0x2
Jan 05 13:01:53 ISR1[1]: N: Sense Data Description = Recovered Data - Data Auto-Reallocated
Jan 05 13:01:53 ISR1[1]: N: Valid Information = 0x426b41
Jan 05 13:17:27 ISR1[1]: N: u1d6 SCSI Disk Error Occurred (path = 0x0)
Jan 05 13:17:27 ISR1[1]: N: Sense Key = 0x1, Asc = 0x18, Ascq = 0x2
Jan 05 13:17:27 ISR1[1]: N: Sense Data Description = Recovered Data - Data Auto-Reallocated
Jan 05 13:17:27 ISR1[1]: N: Valid Information = 0x4d5aa7

The drive worked hard to get the data, so it reallocated it.

All 0x1/0x18 are read conditions.
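For going through a longer syslog than the excerpt above, the entries can be pulled apart mechanically. The following is a minimal Python sketch, not a Sun tool; it assumes each disk error is logged as the four-message group seen on this page (error line, sense key line, description line, "Valid Information" line). The names DiskError and parse_disk_errors are made up for the illustration.

```python
import re
from typing import NamedTuple


class DiskError(NamedTuple):
    timestamp: str    # as logged; the syslog lines carry no year
    level: str        # severity letter as logged (N or W in the excerpts)
    drive: str        # e.g. "u1d2"
    sense_key: int
    asc: int
    ascq: int
    description: str
    info: int         # the "Valid Information" value


HEX = r"0x[0-9a-fA-F]+"
STAMP = r"\w{3} \d{2} \d{2}:\d{2}:\d{2}"

# One match per four-message group. Whitespace between messages may be a
# newline or a space, so the pattern also copes with lines that were joined.
ERROR_RE = re.compile(
    rf"(?P<ts>{STAMP}) \S+ (?P<level>[A-Z]): "
    rf"(?P<drive>u\d+d\d+) SCSI Disk Error Occurred.*?"
    rf"Sense Key = (?P<key>{HEX}), Asc = (?P<asc>{HEX}), Ascq = (?P<ascq>{HEX}).*?"
    rf"Sense Data Description = (?P<desc>.*?)\s+"
    rf"{STAMP} \S+ [A-Z]: Valid Information = (?P<info>{HEX})",
    re.DOTALL,
)


def parse_disk_errors(text):
    """Return one DiskError per 'SCSI Disk Error Occurred' group in text."""
    return [
        DiskError(
            timestamp=m["ts"],
            level=m["level"],
            drive=m["drive"],
            sense_key=int(m["key"], 16),
            asc=int(m["asc"], 16),
            ascq=int(m["ascq"], 16),
            description=" ".join(m["desc"].split()),  # undo line wrapping
            info=int(m["info"], 16),
        )
        for m in ERROR_RE.finditer(text)
    ]


# The two entries quoted above, one message per line.
SAMPLE = """\
Jan 05 13:01:53 ISR1[1]: N: u1d2 SCSI Disk Error Occurred (path = 0x1)
Jan 05 13:01:53 ISR1[1]: N: Sense Key = 0x1, Asc = 0x18, Ascq = 0x2
Jan 05 13:01:53 ISR1[1]: N: Sense Data Description = Recovered Data - Data Auto-Reallocated
Jan 05 13:01:53 ISR1[1]: N: Valid Information = 0x426b41
Jan 05 13:17:27 ISR1[1]: N: u1d6 SCSI Disk Error Occurred (path = 0x0)
Jan 05 13:17:27 ISR1[1]: N: Sense Key = 0x1, Asc = 0x18, Ascq = 0x2
Jan 05 13:17:27 ISR1[1]: N: Sense Data Description = Recovered Data - Data Auto-Reallocated
Jan 05 13:17:27 ISR1[1]: N: Valid Information = 0x4d5aa7
"""

for err in parse_disk_errors(SAMPLE):
    print(err.timestamp, err.drive, hex(err.sense_key), hex(err.asc), err.description)
```

A group that is cut off by log rotation (for example, missing its sense key line) is not handled specially here and could be merged with the following group, so the output should be checked against the raw log.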

Jan 05 14:13:58 ISR1[1]: W: u1d7 SCSI Disk Error Occurred (path = 0x0)
Jan 05 14:13:58 ISR1[1]: W: Sense Key = 0x3, Asc = 0x11, Ascq = 0x0
Jan 05 14:13:58 ISR1[1]: W: Sense Data Description = Unrecovered Read Error
Jan 05 14:13:58 ISR1[1]: W: Valid Information = 0x759268
Jan 05 14:13:58 ISR1[1]: N: u1d7 SVD_DONE: Command Error = 0x3
Jan 05 14:13:58 ISR1[1]: N: u1d7 sid 56985 stype 815 disk error 3

The 0x3 Sense key indicates the command terminated with a
non-recovered error condition. If 0x3 sense key occurs once or
twice a month, then that should be ok. The T3 firmware, during
normal read or "vol verify fix" (with 2.1.3 or 1.18.3 or higher
firmware revisions) operation, corrects the bad sector on the drive by
reconstructing the data (assuming a RAID configuration) and writes it
back to the drive, which in turn writes it to a spare sector. If the
occurrence of the 0x3 sense key is more frequent, then it is highly
recommended that the drive be replaced.
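The "once or twice a month" rule of thumb can be written down as a small check. This is only an illustrative Python sketch: the event tuples are transcribed by hand from the log excerpts on this page, the function name drives_to_replace and the constant MAX_PER_MONTH are made up, and since the syslog lines carry no year the sketch groups by month name only.

```python
from collections import Counter

# (month, drive, sense_key), transcribed by hand from the excerpts on this page.
EVENTS = [
    ("Jan", "u1d2", 0x1),
    ("Jan", "u1d6", 0x1),
    ("Jan", "u1d7", 0x3),
    ("Jan", "u1d7", 0x1),
]

# Assumption: "once or twice a month" is read as "at most 2 per month is ok".
MAX_PER_MONTH = 2


def drives_to_replace(events, limit=MAX_PER_MONTH):
    """Return drives with more than `limit` 0x3 sense keys in any one month."""
    counts = Counter(
        (month, drive) for month, drive, sense_key in events if sense_key == 0x3
    )
    return sorted({drive for (_month, drive), n in counts.items() if n > limit})


print(drives_to_replace(EVENTS))
```

With the events listed here, u1d7 has a single 0x3 entry, so no drive is flagged. Recovered errors (sense key 0x1) are deliberately not counted.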

Jan 05 14:13:59 ISR1[1]: N: u1d7 SCSI Disk Error Occurred (path = 0x0)
Jan 05 14:13:59 ISR1[1]: N: Sense Key = 0x1, Asc = 0xc, Ascq = 0x1
Jan 05 14:13:59 ISR1[1]: N: Sense Data Description = Write Error - Recovered With Auto Reallocation
Jan 05 14:13:59 ISR1[1]: N: Valid Information = 0x759268

This is a good news message, and what it says is that the drive
had a write error but it was recovered and re-allocated.
Normally if the feedback signal from the write is not strong,
then the drive reallocates.

So the second error condition indicates that the data from the first error was recovered and re-allocated.
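The pairing described here (an unrecovered read followed by a recovered write carrying the same "Valid Information" value on the same drive) can be checked for every 0x3 entry in a log. A minimal Python sketch follows. It assumes that "Valid Information" identifies the affected block, which is how the two u1d7 entries above (both 0x759268) are read, and that events are given in log order. SenseEvent and unrepaired_reads are hypothetical names, and the two events are transcribed by hand from the excerpts.

```python
from typing import NamedTuple


class SenseEvent(NamedTuple):
    drive: str
    sense_key: int
    asc: int
    info: int  # the "Valid Information" value, assumed to identify the block


# The two u1d7 entries quoted above, in log order.
EVENTS = [
    SenseEvent("u1d7", 0x3, 0x11, 0x759268),  # Unrecovered Read Error
    SenseEvent("u1d7", 0x1, 0x0C, 0x759268),  # Write Error - Recovered With Auto Reallocation
]


def unrepaired_reads(events):
    """Return 0x3/0x11 events with no later 0x1/0xc event on the same drive and block."""
    pending = {}
    for ev in events:
        where = (ev.drive, ev.info)
        if ev.sense_key == 0x3 and ev.asc == 0x11:
            pending[where] = ev          # unrecovered read: wait for the rewrite
        elif ev.sense_key == 0x1 and ev.asc == 0x0C:
            pending.pop(where, None)     # recovered write at the same place
    return list(pending.values())


print(unrepaired_reads(EVENTS))
```

An empty result means every unrecovered read in the input was followed by a recovered, re-allocated write at the same place. Anything left in the list is worth raising with support.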

Have there been any more errors since then?

Thanks.

Darie

Topic revision: r1 - 2005-12-07 - unknown
 