Unavailability of the CMS offline production DB (CMSR), 11th March 2011

Description

The CMS offline production database (CMSR) went down due to a local power cut in the CERN CC. A failover to a standby system was necessary. The service was unavailable for about 2 hours. Applications connecting from outside CERN could not reach the database until approx. 3:10pm because of a firewall misconfiguration.

Impact

  • The database was completely down for approx. 2 hours
  • For applications connecting from outside CERN the total downtime was approx. 4 hours 50 minutes

Time line of the incident

  • 11-Mar-11 10:20 - A short local power cut in the critical area of the computer center affected certain RAC5 and RAC7 machines. For the majority of machines the power cut was transparent, but several disk arrays used by the CMSR database went down. This caused the database outage.
  • 11-Mar-11 10:30 - The PDB team started to investigate the issue. The database could not be restarted.
  • 11-Mar-11 11:00 - The reason preventing the database from starting up was understood: due to the power cut, 2 disks belonging to two different disk arrays had failed, compromising data integrity.
  • 11-Mar-11 11:10 - The decision to fail over to the standby system was taken.
  • 11-Mar-11 12:30 - The failover operation completed. Users on the CERN GPN network could connect to the database.
  • 11-Mar-11 14:47 - A user connecting from outside CERN reported an issue. Investigation started.
  • 11-Mar-11 (between 14:47 and 15:10) - The issue was traced to the configuration of the CERN central firewall. The Security Team was asked to open port 10121 for selected machines.
  • 11-Mar-11 15:10 - The firewall configuration was changed. The database was fully available.
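
The downtime figures quoted under Impact can be cross-checked against the timestamps above. The following is a small illustrative sketch, not part of the original report: it takes the power cut (10:20), the end of the failover (12:30) and the firewall change (15:10) as the relevant events, and the function names are chosen here for illustration only. All three events fall on 11-Mar-11, so only the time of day is parsed.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M"  # all events happened on the same day (11-Mar-11)


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes elapsed between two same-day HH:MM times."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return int(delta.total_seconds() // 60)


def format_duration(minutes: int) -> str:
    """Render a number of minutes as 'H h MM min'."""
    hours, mins = divmod(minutes, 60)
    return f"{hours} h {mins:02d} min"


# Timestamps taken from the time line of the incident.
POWER_CUT = "10:20"
FAILOVER_DONE = "12:30"
FIREWALL_FIXED = "15:10"

internal_outage = minutes_between(POWER_CUT, FAILOVER_DONE)
external_outage = minutes_between(POWER_CUT, FIREWALL_FIXED)

print("Inside CERN: ", format_duration(internal_outage))   # 2 h 10 min
print("Outside CERN:", format_duration(external_outage))   # 4 h 50 min
```

The result for users inside CERN (2 h 10 min) is consistent with the "approx. 2 hours" quoted in the report, and the result for external users matches the quoted 4 hours 50 minutes.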

Analysis

  • The power cut in the critical area brought several disk arrays down because the physics power they are also connected to was switched off at the power bar, so they had no second power source to fall back on. Most likely the machines had been in this state since December 18th, when the last general power cut in the CC took place. The issue was not noticed by the available monitoring tools.
  • The outage of several disk arrays brought down the whole database, as the system was designed to survive only single-point failures.
  • After the disk arrays were restarted, the database could not be started because 2 disks were broken. Again, the system was designed to survive any single-point failure but not multi-point failures.
  • Connections from outside CERN to the standby system were blocked by the central firewall because such exceptions have to be explicitly requested. Typically the PDB team requests an exception in advance. For the RAC8 hardware this was not done.

Follow up

  • Improve monitoring to discover problems with power provisioning.
  • Improve hardware deployment procedures to request appropriate central firewall configuration in advance.
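
One way to support the second follow-up item is a simple reachability check, run from a machine outside CERN before new hardware is declared ready. The sketch below is only an illustration and does not describe an existing PDB tool: the function names are hypothetical, and the only value taken from this report is port 10121. It attempts a plain TCP connection to each host and records whether the connection succeeds, which shows whether the firewall lets traffic through but says nothing about the state of the database itself.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def check_listeners(hosts, port: int = 10121, timeout: float = 5.0) -> dict:
    """Check one port on every host; returns a {host: reachable} mapping.

    Port 10121 is the one mentioned in the incident time line; the host
    list must be supplied by the caller.
    """
    return {host: port_is_reachable(host, port, timeout) for host in hosts}
```

Calling `check_listeners` with the list of standby host names from an external network would return a mapping in which any `False` entry points at a missing firewall exception or an unreachable host, to be followed up before the standby is needed.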

-- MarcinBlaszczyk - 20-Sep-2010

Topic revision: r2 - 2011-03-14 - unknown
 