LHCBR database got stuck
Description
The LHCB offline production database (LHCBR) hung completely following a disk failure.
Impact
The database was not available to end users for about 1 hour and 10 minutes. The list of affected Oracle services includes: LCG_LFC_LHCB, LHCB_AMGABOOKKEEPING, LHCB_BOOKKEEPING, LHCB_BOOKKEEPING_NEW, LHCB_COOL, LHCB_DIRACBOOKKEEPING, LHCB_DSS, LHCB_ECAL, LHCB_INTEGRATION, LHCB_MUON_CONTROL, LHCB_MUON_MWPC, LHCB_MUON_PNPI, LHCB_RICHHPD, LHCB_SCANBOOK, LHCBR_BACKUP, LHCBR_LB, LHCBR_NOLB
Timeline of the incident
- 26-Feb-11 00:57 - disk in slot 2 of itstor730 failed and the whole database hung
- 26-Feb-11 01:00 - the person on shift started working on the issue
- 26-Feb-11 02:00 - Oracle processes on nodes 1 and 2 of the RAC restarted, which made the third instance start working properly again
- 26-Feb-11 02:10 - database fully available
- 26-Feb-11 03:55 - itstor730 rebooted spontaneously and was evicted from the ASM diskgroup
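As a rough cross-check of the outage length quoted under Impact, the interval between the timeline entries can be computed as in the sketch below. It is illustrative only: the variable names are hypothetical, and it assumes the outage is measured from the disk failure (00:57) to the "database fully available" entry (02:10). The `%b` month parsing also assumes an English/C locale.

```python
from datetime import datetime

# Timestamp format used in the timeline entries, e.g. "26-Feb-11 00:57".
FMT = "%d-%b-%y %H:%M"

# Assumption: outage runs from the disk failure to full availability.
hang_start = datetime.strptime("26-Feb-11 00:57", FMT)
fully_available = datetime.strptime("26-Feb-11 02:10", FMT)

outage = fully_available - hang_start
minutes = int(outage.total_seconds() // 60)

print(f"{minutes // 60} h {minutes % 60} min")  # prints "1 h 13 min"
```

Under these assumptions the result is 73 minutes, which is consistent with the "about 1 hour and 10 minutes" stated under Impact.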
Analysis
- The analysis showed that the issue was most likely related to a failure of the controller of itstor730.
Follow up
- itstor730 has been completely removed from the configuration and is being checked by the sysadmins