CMSR database hung following vendor mistake during broken disk replacement
Description
The CMS offline production database (CMSR) hung at around 14:00 on 27-Sep-2011 following a mistake made by the vendor during the replacement of a broken disk.
Impact
- The whole database was unavailable between 14:00 and 15:10. Affected services: CMS_ALCT, CMS_ANODE_BOARDS, CMS_C2K, CMS_COND, CMS_CSC, CMS_DBS, CMS_DBS_WRITER, CMS_EMU_CERN, CMS_EMU_FAST, CMS_EMU_HV, CMS_HCL, CMS_INTEGRATION, CMS_LUMI_PROD_OFFLINE, CMS_PVSS, CMS_PXL, CMS_SSTRACKER, CMS_T0AST, CMS_TEC_LYON, CMS_TESTBEAM, CMS_TRANSFERMGMT, CMS_TRANSFERMGMT_SC, CMS_TRANSFERMGMT_TEST, CMSR_APEX, CMSR_BACKUP, CMSR_LB
- The CMS Tier-0 system was unavailable between 14:00 and 21:30.
Timeline of the incident
- 27-Sep-2011 14:00 - An actively used disk was removed from the disk array by the vendor. The database hung.
- 27-Sep-2011 14:10 - Alarm received. Database support team started working on the issue.
- 27-Sep-2011 15:10 - All DB instances restarted. Database available again.
- 27-Sep-2011 16:51 - CMS restarted the Tier-0 system. The restart failed because the database could not open a datafile.
- 27-Sep-2011 19:45 - CMS opened TEAM ticket GGUS:74709.
- 27-Sep-2011 ~21:00 - The ticket was escalated to ALARM.
- 27-Sep-2011 ~21:30 - The offline datafile issue was fixed and the Tier-0 export system was successfully restarted.
Analysis
- Despite the vendor's mistake, the Oracle ASM software should have been able to handle such a situation transparently.
- The Oracle stack hung because of a hang in the SCSI layer.
- Other databases installed with the same Oracle software version, OS, kernel and hardware type do not show similar issues.
- The main suspect is the controller of the disk array hosting the mistakenly replaced disk.
- A datafile went offline as a consequence of the I/O errors generated by the wrong disk replacement; the standard recovery procedure is sketched below.
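The offline datafile is what blocked the Tier-0 restart until ~21:30. As an illustration only, and not the exact commands run by the support team, the usual Oracle procedure for recovering a datafile taken offline after I/O errors is the following (run in SQL*Plus as SYSDBA; the file path is a placeholder):

  -- Apply media recovery to the affected file (placeholder path,
  -- not the actual CMSR datafile name).
  RECOVER DATAFILE '/path/to/datafile.dbf';

  -- Once recovery completes, bring the file back online.
  ALTER DATABASE DATAFILE '/path/to/datafile.dbf' ONLINE;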
Follow up
- The disk array will be drained and tested.
- Database monitoring has been enhanced to raise an alarm whenever a datafile goes offline; an illustrative check is sketched below.
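The actual monitoring implementation is not described here; as a minimal illustration, an offline-datafile check can be built on the V$DATAFILE and V$RECOVER_FILE views, raising an alarm whenever the query below returns any rows:

  -- Illustrative probe only: list datafiles that are not online,
  -- together with the recovery error reported by Oracle (if any).
  SELECT d.file#,
         d.name,
         d.status,
         r.error
    FROM v$datafile d
    LEFT JOIN v$recover_file r
      ON r.file# = d.file#
   WHERE d.status NOT IN ('ONLINE', 'SYSTEM');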
--
MarcinBlaszczyk - 20-Sep-2010