Disk Replacement on PDB Cluster

Intervention on the PDB Cluster on Saturday, 12.03.2005:
1. To stop the flood of e-mails, I commented out the following line
in the '/etc/mail/aliases' file:

root: pdb-sysadmin@cern.ch,root@
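Step 1 can be scripted so that the alias is easy to restore once the intervention is over. This is a minimal sketch, assuming sendmail reads its aliases from '/etc/mail/aliases' as on Solaris; the function names are hypothetical. Sendmail uses the compiled aliases database rather than the text file, so 'newaliases' has to be run after each edit for the change to take effect.

```shell
#!/bin/sh
# Hypothetical helpers for step 1: silence the "root:" alias and restore it later.
# The file is passed as an argument so the functions can be tried on a copy first.

comment_out_root_alias() {
    aliases_file=${1:-/etc/mail/aliases}
    # Keep an untouched backup; restore_root_alias copies it back
    cp -p "$aliases_file" "$aliases_file.bak" || return 1
    # Prefix the active "root:" line with '#'; every other line is copied unchanged
    sed 's/^root:/#root:/' "$aliases_file.bak" > "$aliases_file"
}

restore_root_alias() {
    aliases_file=${1:-/etc/mail/aliases}
    # Put the original file (with its permissions) back and drop the backup
    cp -p "$aliases_file.bak" "$aliases_file" && rm -f "$aliases_file.bak"
}

# Usage as root; newaliases rebuilds the database that sendmail actually reads:
#   comment_out_root_alias && newaliases
#   ... intervention ...
#   restore_root_alias && newaliases
```

Restoring from the backup, rather than un-commenting with a second 'sed', avoids accidentally re-enabling other '#root:' lines that were already commented out before the intervention.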

2. To check whether the problem was really caused by a disk failure,
I connected as root to dbsct37 (the T3 storage array mentioned in the
e-mails) and ran the following commands:

dbsct37:/:<56>vol stat

v0          u1d1  u1d2  u1d3  u1d4  u1d5  u1d6  u1d7  u1d8  u1d9
unmounted   0     0     0     0     4     0     0     0     0

dbsct37:/:<57>fru list
ID      TYPE                VENDOR       MODEL         REVISION       SERIAL
------  -----------------   -----------  -----------   -------------  --------
u1ctr   controller card     0301         501-5710-02(  0200/020106    116028
u1d1    disk drive          SEAGATE      ST336605FSUN  A638           3FP1RL19
u1d2    disk drive          SEAGATE      ST336605FSUN  A638           3FP1RSJJ
u1d3    disk drive          SEAGATE      ST336605FSUN  A638           3FP1RPZM
u1d4    disk drive          SEAGATE      ST336605FSUN  A638           3FP1RRX6
u1d5    disk drive
u1d6    disk drive          SEAGATE      ST336605FSUN  A638           3FP106ZV
u1d7    disk drive          SEAGATE      ST336605FSUN  A638           3FP1RS2L
u1d8    disk drive          SEAGATE      ST336605FSUN  A638           3FP1RRW9
u1d9    disk drive          SEAGATE      ST336605FSUN  A638           3FP1RRRJ
u1l1    loop card           SLR-MI       375-0085-01-  5.02 Flash     098201
u1l2    loop card           SLR-MI       375-0085-01-  5.02 Flash     098928
u1pcu1  power/cooling unit  TECTROL-CAN  300-1454-04(  0000           055393
u1pcu2  power/cooling unit  TECTROL-CAN  300-1454-04(  0000           076139
u1mpn   mid plane           SLR-MI       370-3990-02-  0000           053952

dbsct37:/:<58>fru stat
CTLR    STATUS   STATE       ROLE        PARTNER  TEMP
------  -------  ----------  ----------  -------  ----
u1ctr   ready    enabled     master      -        40.0

DISK    STATUS   STATE       ROLE        PORT1      PORT2      TEMP  VOLUME
------  -------  ----------  ----------  ---------  ---------  ----  ------
u1d1    ready    enabled     data disk   ready      ready      39    v0
u1d2    ready    enabled     data disk   ready      ready      37    v0
u1d3    ready    enabled     data disk   ready      ready      36    v0
u1d4    ready    enabled     data disk   ready      ready      36    v0
u1d5    fault    enabled     data disk   bypass     bypass     -     v0
u1d6    ready    enabled     data disk   ready      ready      35    v0
u1d7    ready    enabled     data disk   ready      ready      35    v0
u1d8    ready    enabled     data disk   ready      ready      36    v0
u1d9    ready    enabled     data disk   ready      ready      35    v0

LOOP    STATUS   STATE       MODE     CABLE1     CABLE2     TEMP
------  -------  ----------  -------  ---------  ---------  ----
u1l1    ready    enabled     master   -          -          33.5
u1l2    ready    enabled     slave    -          -          36.0

POWER   STATUS   STATE      SOURCE  OUTPUT  BATTERY  TEMP    FAN1    FAN2
------  -------  ---------  ------  ------  -------  ------  ------  ------
u1pcu1  ready    enabled    line    normal  normal   normal  normal  normal
u1pcu2  ready    enabled    line    normal  normal   normal  normal  normal
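The diagnosis in step 2 comes down to two checks: a non-zero code under a disk in 'vol stat', and a disk whose STATUS is not 'ready' in 'fru stat'. Because these commands run on the array itself, the sketch below applies both checks to saved copies of the outputs. The function names are hypothetical, and treating every non-zero 'vol stat' code as suspect is an assumption; the T3 documentation defines what each individual code means.

```shell
#!/bin/sh
# Hypothetical helpers for step 2: scan saved T3 output for suspect disks.
# Save the output without the "dbsct37:/:<NN>" prompt line.

# "vol stat": print "<disk> <code>" for every non-zero code.
# Expects pairs of non-blank lines: volume and disk names, then the volume
# state followed by one code per disk.
vol_stat_nonzero() {
    awk 'NF == 0 { next }
         !have_names { n = split($0, name); have_names = 1; next }
         {
             for (i = 2; i <= NF && i <= n; i++)
                 if ($i != "0") print name[i], $i
             have_names = 0        # the next non-blank line starts a new volume
         }'
}

# "fru stat": print "<disk> <status>" for every disk whose STATUS is not "ready".
# Only rows between the DISK header and the next blank line are checked.
fru_stat_bad_disks() {
    awk '$1 == "DISK" { in_disk = 1; next }
         NF == 0      { in_disk = 0; next }
         in_disk && $1 !~ /^-/ && $2 != "ready" { print $1, $2 }'
}

# Usage with the outputs of step 2 saved to files:
#   vol_stat_nonzero   < volstat.txt    # reports: u1d5 4
#   fru_stat_bad_disks < frustat.txt    # reports: u1d5 fault
```

Both functions split on whitespace, so they work whether or not the columns are aligned.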

3. The output clearly showed that one of the disks had failed (u1d5: STATUS 'fault'
with both ports in 'bypass' in 'fru stat', and code 4 in 'vol stat'), so I contacted
Sun Support, who opened a new case.
Sun Support contacts (as of March 2005):
* Phone number: +41 848 786 002 (from shift phones dial 333 and then 0 848...)
* Contract ID: CH X050 GEN - (19)90/1

4. After Sun Support had replaced the disk, I verified the status of
the disk array:

dbsct37:/:<27>vol stat

v0          u1d1  u1d2  u1d3  u1d4  u1d5  u1d6  u1d7  u1d8  u1d9
unmounted   0     0     0     0     0     0     0     0     0

dbsct37:/:<28>fru list
ID      TYPE                VENDOR       MODEL         REVISION       SERIAL
------  -----------------   -----------  -----------   -------------  --------
u1ctr   controller card     0301         501-5710-02(  0200/020106    116028
u1d1    disk drive          SEAGATE      ST336605FSUN  A638           3FP1RL19
u1d2    disk drive          SEAGATE      ST336605FSUN  A638           3FP1RSJJ
u1d3    disk drive          SEAGATE      ST336605FSUN  A638           3FP1RPZM
u1d4    disk drive          SEAGATE      ST336605FSUN  A638           3FP1RRX6
u1d5    disk drive          SEAGATE      ST336605FSUN  A838           3FP0K5W1
u1d6    disk drive          SEAGATE      ST336605FSUN  A638           3FP106ZV
u1d7    disk drive          SEAGATE      ST336605FSUN  A638           3FP1RS2L
u1d8    disk drive          SEAGATE      ST336605FSUN  A638           3FP1RRW9
u1d9    disk drive          SEAGATE      ST336605FSUN  A638           3FP1RRRJ
u1l1    loop card           SLR-MI       375-0085-01-  5.02 Flash     098201
u1l2    loop card           SLR-MI       375-0085-01-  5.02 Flash     098928
u1pcu1  power/cooling unit  TECTROL-CAN  300-1454-04(  0000           055393
u1pcu2  power/cooling unit  TECTROL-CAN  300-1454-04(  0000           076139
u1mpn   mid plane           SLR-MI       370-3990-02-  0000           053952

dbsct37:/:<29>fru stat
CTLR    STATUS   STATE       ROLE        PARTNER  TEMP
------  -------  ----------  ----------  -------  ----
u1ctr   ready    enabled     master      -        41.5

DISK    STATUS   STATE       ROLE        PORT1      PORT2      TEMP  VOLUME
------  -------  ----------  ----------  ---------  ---------  ----  ------
u1d1    ready    enabled     data disk   ready      ready      40    v0
u1d2    ready    enabled     data disk   ready      ready      38    v0
u1d3    ready    enabled     data disk   ready      ready      38    v0
u1d4    ready    enabled     data disk   ready      ready      38    v0
u1d5    ready    enabled     data disk   ready      ready      38    v0
u1d6    ready    enabled     data disk   ready      ready      37    v0
u1d7    ready    enabled     data disk   ready      ready      38    v0
u1d8    ready    enabled     data disk   ready      ready      38    v0
u1d9    ready    enabled     data disk   ready      ready      38    v0

LOOP    STATUS   STATE       MODE     CABLE1     CABLE2     TEMP
------  -------  ----------  -------  ---------  ---------  ----
u1l1    ready    enabled     master   -          -          36.5
u1l2    ready    enabled     slave    -          -          38.5

POWER   STATUS   STATE      SOURCE  OUTPUT  BATTERY  TEMP    FAN1    FAN2
------  -------  ---------  ------  ------  -------  ------  ------  ------
u1pcu1  ready    enabled    line    normal  normal   normal  normal  normal
u1pcu2  ready    enabled    line    normal  normal   normal  normal  normal


5. Then, as root on both cluster nodes, I ran the following commands to remove the old LUN entries (c4t1d0 and c4t1d1) from VxVM and the device tree:

vxdisk offline c4t1d0
vxdisk offline c4t1d1
vxdisk rm c4t1d0
vxdisk rm c4t1d1

vxdisk list [check that c4t1d0-1 are not present]

devfsadm -C -v
scgdevs
scdidadm -C
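The removal sequence of step 5 can be wrapped in a small script so that it runs identically on both nodes. This is a sketch only: the DRY_RUN switch and the function names are hypothetical, and the LUN names are the two used in this intervention. With DRY_RUN=1 the commands are printed instead of executed, so the order can be reviewed before running them for real. The absence check assumes that the device name is the first column of 'vxdisk list' output.

```shell
#!/bin/sh
# Hypothetical wrapper for step 5, run as root on each cluster node.
# LUNS holds the VxVM device names used in this intervention.
LUNS="c4t1d0 c4t1d1"

# Execute a command, or only print it when DRY_RUN=1
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Take both LUNs offline first, then remove them from VxVM
remove_old_luns() {
    for lun in $LUNS; do run vxdisk offline "$lun" || return 1; done
    for lun in $LUNS; do run vxdisk rm "$lun" || return 1; done
}

# Succeed only if no line of "vxdisk list" output (on stdin) starts with one
# of the LUN names, optionally followed by a slice suffix such as s2
luns_absent() {
    pattern=$(echo "$LUNS" | sed 's/ /|/g')
    ! grep -Eq "^($pattern)([^0-9]|\$)"
}

# Remove stale /dev links, update the global devices, clean up DID instances
cleanup_device_tree() {
    run devfsadm -C -v && run scgdevs && run scdidadm -C
}

# Usage:
#   DRY_RUN=1 remove_old_luns            # review the commands first
#   remove_old_luns && vxdisk list | luns_absent && cleanup_device_tree
```

The pattern requires a non-digit (or end of line) after the LUN name, so that, for example, c4t1d1 does not also match a c4t1d10.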

6. Next, on the T3 (dbsct37), I mounted volume v0:

dbsct37:/:<30>vol mount v0


7. On both cluster nodes:
devfsadm -v
scgdevs
scdidadm -r
scdidadm -l [check that c4t1d0-1 are listed]
format [check that the LUNs are visible on both nodes; label them on one node only]
vxdctl enable
vxdisk list [check that the LUNs are listed and show 'error' status]
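The last check of step 7 can be automated by extracting the STATUS of the two LUNs from 'vxdisk list'. This sketch assumes the usual column layout DEVICE TYPE DISK GROUP STATUS, with the status running from the fifth column to the end of the line. The function name is hypothetical, and whether device names carry a slice suffix such as s2 depends on the VxVM version, so the sketch strips one if present.

```shell
#!/bin/sh
# Hypothetical check for step 7: every LUN given as an argument must appear in
# "vxdisk list" output (read on stdin) with a STATUS of exactly "error".
# Assumed layout: DEVICE TYPE DISK GROUP STATUS (STATUS may span several words).
luns_in_error() {
    awk -v luns="$*" '
        BEGIN { n = split(luns, want, " ") }
        {
            dev = $1
            sub(/s[0-9]+$/, "", dev)          # drop a slice suffix: c4t1d0s2 -> c4t1d0
            status = $5
            for (i = 6; i <= NF; i++) status = status " " $i
            st[dev] = status
        }
        END {
            rc = 0
            for (i = 1; i <= n; i++) {
                if (!(want[i] in st)) { print want[i] ": not listed"; rc = 1 }
                else if (st[want[i]] != "error") {
                    print want[i] ": status is \"" st[want[i]] "\""; rc = 1
                }
            }
            exit rc
        }'
}

# Usage on each node:
#   vxdisk list | luns_in_error c4t1d0 c4t1d1 && echo "LUNs ready for vxdiskadm"
```

The function exits non-zero and names the offending LUN if one is missing or shows any status other than 'error'.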


8. On the master cluster node:
vxdiskadm -> option 5 ("Replace a failed or removed disk") to replace both failed disks

9. Re-creating the mirrors takes several hours; progress can be followed with the vmsa (Volume Manager Storage Administrator) tool.
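As a command-line alternative to watching vmsa, running VxVM tasks such as mirror resynchronisation can be followed with 'vxtask list'. The loop below is a sketch under stated assumptions. It takes the 'vxtask list' header to start with TASKID and counts every other non-blank line as one running task, and it treats the work as finished once no task is left (which also ignores what kind of task it is). The function names, the POLL_SECONDS interval, and the VXTASK_CMD override are hypothetical; the override exists only so the loop can be tried without VxVM.

```shell
#!/bin/sh
# Hypothetical helper: poll "vxtask list" until no VxVM task is left running.
# Assumes a header line starting with TASKID, followed by one line per task.
VXTASK_CMD=${VXTASK_CMD:-"vxtask list"}
POLL_SECONDS=${POLL_SECONDS:-600}

# Number of task lines in the current output (header and blank lines ignored)
running_tasks() {
    $VXTASK_CMD | awk 'NF && $1 != "TASKID"' | wc -l | tr -d ' '
}

wait_for_resync() {
    while :; do
        n=$(running_tasks)
        [ "$n" -eq 0 ] && break
        echo "$(date '+%Y-%m-%d %H:%M') $n VxVM task(s) still running"
        sleep "$POLL_SECONDS"
    done
    echo "No VxVM tasks left running"
}

# Usage on the master node, e.g. in a screen session:
#   wait_for_resync
```

Polling every ten minutes is an arbitrary choice; since the resync takes several hours, a coarse interval keeps the log short.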


Topic revision: r1 - 2005-12-07 - unknown