LHCBR database got stuck

Description

The LHCB offline production database (LHCBR) hung completely following a disk failure.

Impact

The database was not available to end users for about 1 hour and 10 minutes. The list of affected Oracle services includes: LCG_LFC_LHCB, LHCB_AMGABOOKKEEPING, LHCB_BOOKKEEPING, LHCB_BOOKKEEPING_NEW, LHCB_COOL, LHCB_DIRACBOOKKEEPING, LHCB_DSS, LHCB_ECAL, LHCB_INTEGRATION, LHCB_MUON_CONTROL, LHCB_MUON_MWPC, LHCB_MUON_PNPI, LHCB_RICHHPD, LHCB_SCANBOOK, LHCBR_BACKUP, LHCBR_LB, LHCBR_NOLB

Timeline of the incident

  • 26-Feb-11 00:57 - the disk in slot 2 of itstor730 failed and the whole database hung
  • 26-Feb-11 01:00 - person on shift started to work on the issue
  • 26-Feb-11 02:00 - Oracle processes on nodes 1 and 2 of the RAC restarted, which allowed the third instance to start working properly again
  • 26-Feb-11 02:10 - database fully available
  • 26-Feb-11 03:55 - itstor730 rebooted spontaneously and was evicted from the ASM diskgroup
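
As a quick cross-check of the outage duration quoted under Impact, the interval between the disk failure and full availability can be computed from the timeline entries above. This is only a minimal sketch: the variable names are illustrative, and reading "11" as the year 2011 is an assumption.

```python
from datetime import datetime

# Timestamps taken from the timeline above.
# Assumption: "26-Feb-11" means 26 February 2011.
disk_failure = datetime(2011, 2, 26, 0, 57)     # database hung
fully_available = datetime(2011, 2, 26, 2, 10)  # database fully available

# Length of the user-visible outage, in whole minutes.
outage = fully_available - disk_failure
outage_minutes = int(outage.total_seconds() // 60)

print(f"Outage: {outage_minutes} minutes")  # prints "Outage: 73 minutes"
```

The result, 73 minutes, agrees with the "about 1 hour and 10 minutes" stated in the Impact section.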

Analysis

  • The analysis showed that the issue was most likely related to a failure of the controller of itstor730.

Follow up

  • itstor730 has been completely removed from the configuration and is being checked by sysadmins.
Topic revision: r2 - 2013-01-16 - EvaDafonte
 