Unavailability of the CMS offline production DB (CMSR), 15th March 2011
Description
The CMS offline production database (CMSR) went down due to problems with the SAN. A failover to a standby system was necessary. The service was unavailable for about one and a half hours.
Impact
- The database was completely down for approx. 1.5 hours
- Affected services include: CMS_ALCT, CMS_ANODE_BOARDS, CMS_C2K, CMS_COND, CMS_CSC, CMS_DBS, CMS_DBS_WRITER, CMS_EMU_CERN, CMS_EMU_FAST, CMS_EMU_HV, CMS_HCL, CMS_INTEGRATION, CMS_LUMI_PROD_OFFLINE, CMS_PVSS, CMS_PXL, CMS_SSTRACKER, CMS_T0AST, CMS_TEC_LYON, CMS_TESTBEAM, CMS_TRANSFERMGMT, CMS_TRANSFERMGMT_SC, CMS_TRANSFERMGMT_TEST, CMSR_APEX, CMSR_BACKUP, CMSR_LB
Timeline of the incident
- Following the incident on 11th March, the database was running on standby hardware on RAC8.
- 15-Mar-11 11:45 - The database went down due to a multi-disk failure.
- 15-Mar-11 11:50 - The PDB team started to investigate the issue. The database could not be restarted.
- 15-Mar-11 12:40 - The decision was taken to fail over to the standby system (the former primary).
- 15-Mar-11 13:20 - The database was fully open to users.
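As a cross-check of the "approx. 1.5 hours" figure, the outage length can be computed from the first and last timestamps in the timeline. This is only a small sketch: the function name `outage_minutes` is illustrative, and it assumes an English/C locale so that the month abbreviation "Mar" parses.

```python
from datetime import datetime

# Timestamp layout used in the timeline, e.g. "15-Mar-11 11:45".
TIMESTAMP_FORMAT = "%d-%b-%y %H:%M"


def outage_minutes(down, restored):
    """Return the whole minutes between two timeline timestamps."""
    start = datetime.strptime(down, TIMESTAMP_FORMAT)
    end = datetime.strptime(restored, TIMESTAMP_FORMAT)
    return int((end - start).total_seconds() // 60)


# Database down at 11:45, fully open again at 13:20.
print(outage_minutes("15-Mar-11 11:45", "15-Mar-11 13:20"))  # 95
```

The result, 95 minutes, agrees with the outage of roughly one and a half hours stated in the impact summary.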
Analysis
- Investigation showed that the disks reported by the SAN as failed were not actually broken.
- Nevertheless, because many of them became unavailable at the same time, the ASM diskgroup was corrupted and all datafiles were lost.
- The SAN issue was most likely related to a bug in the HBA driver (integrated with Linux kernel 2.6.18-194.26.1.el5).
- A series of tests was run and confirmed that the problem is fixed in kernel 2.6.18-238.1.1.el5.
Follow up
- Update the kernel on machines that currently run kernel 2.6.18-194.26.1.el5.
- Improve kernel validation procedures.
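One possible way to find the machines covered by the first follow-up item is to compare each host's `uname -r` output with the affected release. The sketch below is not the procedure the PDB team used: the function `hosts_to_update` and the host-to-kernel mapping it takes are hypothetical, and it matches only the exact release string named in the analysis, so kernel variants with a different suffix would not be flagged.

```python
import platform

# Kernel release named in the analysis as carrying the suspected HBA driver bug.
AFFECTED_KERNEL = "2.6.18-194.26.1.el5"


def hosts_to_update(kernel_by_host):
    """Return the sorted host names whose reported kernel equals AFFECTED_KERNEL.

    kernel_by_host maps a host name to its `uname -r` output
    (a hypothetical inventory, gathered by whatever means is available).
    """
    return sorted(
        host
        for host, release in kernel_by_host.items()
        if release.strip() == AFFECTED_KERNEL
    )


if __name__ == "__main__":
    # Check only the local machine; platform.release() gives the kernel release string.
    local = {platform.node(): platform.release()}
    print(hosts_to_update(local))
```

Matching the exact string keeps the check conservative: it flags only the release identified in the analysis and makes no claim about other kernel versions.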
Topic revision: r1 - 2011-03-31