Description

The ATLAS offline database, ADCR, went down due to a disk failure on 23 November 2011 at around 21:45. After an emergency failover to standby hardware, the ADCR database reported a corrupted block which affected one table of the PANDA application. Therefore all services except PANDA were started at around 23:00. After some investigation and consultation with application experts, a workaround to skip reading the corrupted block was applied, and the PANDA services were started at around 23:45. Unfortunately, the standby database was missing the ATLAS_LFC service; this was discovered and fixed the next morning at around 09:00.

Impact

  • All services running on the ADCR database were affected, in particular:
    • All ADCR services except for ATLAS_PANDA, ATLAS_PANDAMON and ATLAS_LFC were unavailable for 1:15h.
    • ATLAS_PANDA and ATLAS_PANDAMON were unavailable for 1:55h.
    • ATLAS_LFC was unavailable for around 11h.

Time line of the incident

  • 23-Nov-11 21:45 - ADCR database went down
  • 23-Nov-11 22:30 - ADCR database was restarted on standby hardware, corrupted block reported
  • 23-Nov-11 23:00 - All services except for ATLAS_PANDA, ATLAS_PANDAMON and the missing ATLAS_LFC were started
  • 23-Nov-11 23:45 - Workaround to skip reading the corrupted block was applied and the ATLAS_PANDA and ATLAS_PANDAMON services were started
  • 24-Nov-11 08:45 - ATLAS_LFC service reported as unavailable
  • 24-Nov-11 08:55 - ATLAS_LFC service added and started
  • 24-Nov-11 15:30 - Corrupted block was fixed

Analysis

  • The database crash was caused by a single disk failure which happened while the underlying ASM rebalancing process, which followed another disk failure a few days earlier, was still running. Failover to standby hardware was the fastest way to bring the system back. The delay in starting the PANDA services was caused by a corrupted block which the ADCR database reported after being started on the standby hardware. Given that the corrupted block in question affected only 8 rows of one application table, a workaround to skip reading this single block was applied and the PANDA services were started. Due to a human error, the ATLAS_LFC service had not been configured on the standby database, so when all services were started it remained unnoticed that this particular service was missing. The problem was reported only in the morning and was fixed immediately.

Follow up

  • The procedure for manual failover has been updated with a step that compares the number of services on the original database with the number on the standby.
  • The standby database for ADCR will be recreated and the failed disks will be replaced.
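
The service comparison step added to the failover procedure could be sketched roughly as follows. This is only an illustration, not the actual procedure: the function name `missing_services` and the two example lists are hypothetical, and in practice the lists would have to be obtained from the original and the standby database by whatever means the procedure prescribes.

```python
def missing_services(primary, standby):
    """Return the service names present on the primary but absent on the standby."""
    # Assumption: differences in case or surrounding whitespace are not
    # significant, so names are normalised before they are compared.
    primary_names = {name.strip().upper() for name in primary}
    standby_names = {name.strip().upper() for name in standby}
    return sorted(primary_names - standby_names)


# Hypothetical example lists, loosely modelled on this incident.
primary_services = ["ATLAS_PANDA", "ATLAS_PANDAMON", "ATLAS_LFC"]
standby_services = ["ATLAS_PANDA", "ATLAS_PANDAMON"]

missing = missing_services(primary_services, standby_services)
if missing:
    print("Missing on standby:", ", ".join(missing))
else:
    print("All services are configured on the standby.")
```

Comparing the names rather than only the counts also shows which service is missing, and it would catch the case where both sides have the same number of services but different ones.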

-- MarcinBlaszczyk - 24-Nov-2011

Topic revision: r3 - 2011-11-25 - EvaDafonte
 