CMSR database hung following vendor mistake during broken disk replacement

Description

The CMS offline production database (CMSR) hung at around 14:00 on 27-Sep-2011 after the vendor made a mistake while replacing a broken disk.

Impact

  • The whole database was unavailable between 14:00 and 15:10. Affected services: CMS_ALCT, CMS_ANODE_BOARDS, CMS_C2K, CMS_COND, CMS_CSC, CMS_DBS, CMS_DBS_WRITER, CMS_EMU_CERN, CMS_EMU_FAST, CMS_EMU_HV, CMS_HCL, CMS_INTEGRATION, CMS_LUMI_PROD_OFFLINE, CMS_PVSS, CMS_PXL, CMS_SSTRACKER, CMS_T0AST, CMS_TEC_LYON, CMS_TESTBEAM, CMS_TRANSFERMGMT, CMS_TRANSFERMGMT_SC, CMS_TRANSFERMGMT_TEST, CMSR_APEX, CMSR_BACKUP, CMSR_LB
  • The CMS Tier-0 system was unavailable between 14:00 and 21:30.

Timeline of the incident

  • 27-Sep-2011 14:00 - The vendor removed an actively used disk from the disk array. The database hung.
  • 27-Sep-2011 14:10 - Alarm received. The database support team started working on the issue.
  • 27-Sep-2011 15:10 - All DB instances restarted. Database available again.
  • 27-Sep-2011 16:51 - CMS restarted the Tier-0 system, which failed to open a datafile.
  • 27-Sep-2011 19:45 - CMS opened TEAM ticket GGUS:74709. The ticket was escalated to ALARM at ~21:00. The issue was fixed at ~21:30 and the Tier-0 export system was successfully restarted.

Analysis

  • Despite the vendor's mistake, the Oracle ASM software should have been able to handle such a situation transparently.
  • The Oracle stack hung because of a hang in the SCSI layer.
  • Other databases installed with the same Oracle software version, OS, kernel and hardware type do not show similar issues.
  • The main suspect is the controller of the disk array hosting the mistakenly replaced disk.
  • A datafile went offline as a consequence of the I/O errors generated by the incorrect disk replacement.

Follow up

  • The disk array will be drained and tested.
  • Database monitoring has been enhanced in order to get alarms whenever a file goes offline.
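
The offline-datafile alarm mentioned above could look roughly like the following sketch. This report does not describe the actual monitoring implementation, so everything here is an assumption: the query against the standard Oracle view V$DATAFILE, the choice of ONLINE and SYSTEM as the only healthy statuses, and the helper names files_needing_alarm and format_alarm, which are hypothetical. The rows would normally come from a database cursor; here they are passed in as plain tuples so that the logic can be run without a database.

```python
# Hypothetical sketch of an "alarm when a datafile goes offline" check.
# V$DATAFILE and its STATUS column are standard Oracle; the real CMSR
# monitoring query is not given in this report and may differ.
OFFLINE_DATAFILES_SQL = """
SELECT file#, name, status
  FROM v$datafile
 WHERE status NOT IN ('ONLINE', 'SYSTEM')
"""

# Assumption: any status other than these two should raise an alarm
# (for example OFFLINE or RECOVER).
HEALTHY_STATUSES = {"ONLINE", "SYSTEM"}


def files_needing_alarm(rows):
    """Return the (file_id, name, status) rows whose status is not healthy."""
    return [row for row in rows if row[2].upper() not in HEALTHY_STATUSES]


def format_alarm(row):
    """Build a one-line alarm message for a single datafile row."""
    file_id, name, status = row
    return f"ALARM: datafile {file_id} ({name}) has status {status}"


if __name__ == "__main__":
    # Made-up sample rows standing in for the result of the query above.
    sample_rows = [
        (1, "/hypothetical/system01.dbf", "SYSTEM"),
        (7, "/hypothetical/data07.dbf", "RECOVER"),
        (8, "/hypothetical/data08.dbf", "ONLINE"),
    ]
    for bad_row in files_needing_alarm(sample_rows):
        print(format_alarm(bad_row))
```

Filtering again on the client side, in addition to the WHERE clause, keeps the check usable if the rows come from a broader query or from a cached snapshot.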

-- MarcinBlaszczyk - 20-Sep-2010

Topic revision: r2 - 2011-09-29 - EvaDafonte
 