Unavailability of the CMS offline production DB (CMSR), 15th March 2011

Description

The CMS offline production database (CMSR) went down due to problems with the SAN. A failover to a standby system was necessary. The service was unavailable for about an hour and a half.

Impact

  • The database was completely down for approx. 1.5 hours
  • Affected services include: CMS_ALCT, CMS_ANODE_BOARDS, CMS_C2K, CMS_COND, CMS_CSC, CMS_DBS, CMS_DBS_WRITER, CMS_EMU_CERN, CMS_EMU_FAST, CMS_EMU_HV, CMS_HCL, CMS_INTEGRATION, CMS_LUMI_PROD_OFFLINE, CMS_PVSS, CMS_PXL, CMS_SSTRACKER, CMS_T0AST, CMS_TEC_LYON, CMS_TESTBEAM, CMS_TRANSFERMGMT, CMS_TRANSFERMGMT_SC, CMS_TRANSFERMGMT_TEST, CMSR_APEX, CMSR_BACKUP, CMSR_LB

Timeline of the incident

  • Following the incident on 11th March, the database was running on standby hardware on RAC8.
  • 15-Mar-11 11:45 - The database went down due to a multi-disk failure.
  • 15-Mar-11 11:50 - The PDB team started to investigate the issue. The database could not be restarted.
  • 15-Mar-11 12:40 - The decision was taken to fail over to the standby system (the former primary).
  • 15-Mar-11 13:20 - The database was fully open for users.
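
As a cross-check of the "approx. 1.5 hours" figure, the intervals between the timestamps above can be computed directly. This is a minimal sketch: the event labels and variable names are illustrative and not part of the original report, and it assumes the default (English) locale for parsing month abbreviations.

```python
from datetime import datetime

# Timestamp format used in the timeline, e.g. "15-Mar-11 11:45"
FMT = "%d-%b-%y %H:%M"

# Timestamps copied from the timeline; the labels are illustrative only
events = {
    "down": "15-Mar-11 11:45",
    "investigation": "15-Mar-11 11:50",
    "failover_decision": "15-Mar-11 12:40",
    "open": "15-Mar-11 13:20",
}
t = {name: datetime.strptime(stamp, FMT) for name, stamp in events.items()}

def minutes_between(start, end):
    """Whole minutes elapsed between two named events."""
    return int((t[end] - t[start]).total_seconds() // 60)

outage_minutes = minutes_between("down", "open")
decision_minutes = minutes_between("down", "failover_decision")
failover_minutes = minutes_between("failover_decision", "open")

print(outage_minutes, decision_minutes, failover_minutes)
```

The total comes to 95 minutes, of which 55 minutes passed before the failover decision and 40 minutes between the decision and the database being open again.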

Analysis

  • Investigation showed that the disks reported by the SAN as failed were not actually broken.
  • Nevertheless, because many of them became unavailable at the same time, the ASM diskgroup was corrupted and all datafiles were lost.
  • The SAN issue was most likely related to a bug in the HBA driver (integrated with the Linux kernel 2.6.18-194.26.1.el5).
  • A series of tests was run and confirmed that the problem is fixed in kernel 2.6.18-238.1.1.el5.

Follow up

  • Update the kernel on machines that currently run kernel 2.6.18-194.26.1.el5.
  • Improve kernel validation procedures.
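
The first follow-up item could be supported by a small check over an inventory of kernel release strings (as printed by `uname -r`). The sketch below is only an illustration: the host names and the `inventory` mapping are hypothetical, and treating any release that compares greater than or equal to 2.6.18-238.1.1.el5 as carrying the fix is an assumption, not something the tests described above confirmed.

```python
import re

AFFECTED = "2.6.18-194.26.1.el5"   # kernel named in the analysis
VALIDATED = "2.6.18-238.1.1.el5"   # kernel in which the tests confirmed the fix

def kernel_key(release):
    """Turn '2.6.18-194.26.1.el5' into (2, 6, 18, 194, 26, 1) for comparison."""
    numeric = release.split(".el")[0]          # drop the distribution suffix
    return tuple(int(part) for part in re.split(r"[.-]", numeric))

def needs_update(release):
    """True only for the exact kernel the report identifies as affected."""
    return release == AFFECTED

def is_validated(release):
    """Assumption: releases at or above the validated kernel keep the fix."""
    return kernel_key(release) >= kernel_key(VALIDATED)

# Hypothetical inventory: host name -> output of `uname -r` on that host
inventory = {
    "node-a": "2.6.18-194.26.1.el5",
    "node-b": "2.6.18-238.1.1.el5",
    "node-c": "2.6.18-194.32.1.el5",
}

to_update = sorted(host for host, rel in inventory.items() if needs_update(rel))
not_validated = sorted(host for host, rel in inventory.items() if not is_validated(rel))

print("update required:", to_update)
print("not on a validated kernel:", not_validated)
```

Keeping the exact-match check separate from the ordering check avoids implying that kernels other than 2.6.18-194.26.1.el5 are known to be affected; the second list only shows which hosts are not yet on a release at or above the validated one.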
Topic revision: r1 - 2011-03-31 - unknown
 