DB outage caused by problems in filer dbnasr1132
Description
- Several databases (PDBR, LHCBR, ATLR, ATONR, SUSI-TEST, ZORA-TEST, TIM-DB) remained in a "suspended" state because the NFS paths to some of their volumes were no longer available (see the probe sketch below). A couple of instances crashed but could be restarted properly later.
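The following is only a minimal sketch of how unresponsive NFS mounts can be detected from a database host; the volume paths and helper names are hypothetical, not the actual mount points or procedure used here. A statvfs() call on a hard-mounted NFS export blocks while the filer is unreachable, so each probe runs in a separate process with a timeout.

    # Minimal sketch: probe NFS-mounted volume paths with a timeout.
    # VOLUME_PATHS below are hypothetical placeholders, not the real mount points.
    import multiprocessing
    import os

    VOLUME_PATHS = ["/ORA/dbs01/PDBR", "/ORA/dbs01/ATLR"]  # assumed mount points

    def _probe(path):
        os.statvfs(path)  # blocks if the NFS server stops answering

    def nfs_path_available(path, timeout=5):
        proc = multiprocessing.Process(target=_probe, args=(path,))
        proc.start()
        proc.join(timeout)
        if proc.is_alive():           # still blocked after `timeout` seconds
            proc.terminate()
            proc.join()
            return False
        return proc.exitcode == 0     # non-zero exit code: the stat itself failed

    if __name__ == "__main__":
        for p in VOLUME_PATHS:
            print(p, "OK" if nfs_path_available(p) else "NOT RESPONDING")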
Impact
- Access to the databases was not possible during the outage; there was no data loss.
Timeline of the incident
- 24-July-2013 18:50 - Filer dbnasr1132 shows hardware problems with a network card; the cluster software does not manage to complete a takeover to the partner (dbnasr1131) and access to the volumes on dbnasr1132 is lost.
- 24-July-2013 20:10 - Eric is already checking, Marcin is contacted by ATLAS, and several support tickets are created by users (LHCb, ATLAS, AMS).
- 24-July-2013 20:20 - Ruben is contacted and takes the case.
- 24-July-2013 21:45 - P1 case opened by Ruben with NetApp support.
- 24-July-2013 23:45 - After long iterations with support, a boot process is initiated on dbnasr1132.
- 25-July-2013 00:25 - The dbnasr1132 boot process completes (~40 minutes to complete!), the partner (dbnasr1131) manages to complete a takeover, and service is restored from one filer only. Support suggests performing a giveback; full cluster functionality is restored and DB services recover automatically.
- 25-July-2013 01:30 - Nilo checked that all DB instances were working. The crashed instances (zora-test and atonr1) were restarted (a quick post-recovery check is sketched below).
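As a rough illustration of such a post-recovery check, the sketch below looks for the Oracle PMON background process of each instance (ora_pmon_<SID>); the SID names are assumptions for illustration and this is not necessarily the check that was actually performed.

    # Minimal sketch: confirm Oracle instances are back by listing their
    # PMON background processes. EXPECTED_SIDS is a hypothetical list.
    import subprocess

    EXPECTED_SIDS = {"ATONR1", "ZORATEST1"}  # assumed instance names

    ps = subprocess.run(["ps", "-eo", "args"], capture_output=True, text=True)
    running = set()
    for line in ps.stdout.splitlines():
        name = line.strip()
        if name.startswith("ora_pmon_"):
            running.add(name[len("ora_pmon_"):])

    print("running instances:", sorted(running))
    print("still down:", sorted(EXPECTED_SIDS - running) or "none")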
Analysis
- NetApp support found information in the trace files Ruben sent that points to a hardware issue. A new motherboard and new network cards will be sent.
Follow up
- Motherboard and network card replaced.