Cinder database restored from backup

Description

The Cinder database, hosted on a DBOD cluster managed by Oracle Clusterware (CRS), was relocated to the second node of the cluster, which triggered InnoDB log sequence number errors. Restarting the DB did not solve the problem, so we finally had to restore the data from a backup.

Impact

  • User visible: yes. OpenStack BlockStorage service down.
  • Duration of OpenStack service outage: from 17:36h to 18:41h
  • Duration of Cinder DB outage: from 15:56h to 18:08h
  • Data loss: yes, from 14:25:57h (time of the backup used for the restore) to 18:08h. We recovered from the binlogs the DB statements executed during that interval, so that the BlockStorage service managers can analyse them and, if needed, take action to leave their service in a consistent state.

Timeline of the incident

  • 07-Jan-16 15:57 - The Oracle Clusterware (DBOD HA solution) detects that the Cinder DB is down on the node where it should be running (itrac5142) and tries to relocate it to the second node (itrac5114).
  • 07-Jan-16 15:59 - The Cinder service is moved by the CRS to node itrac5114, but unfortunately a mysqld process has been running there since 10-Nov-2015, which starts triggering error messages suggesting that the DB may be corrupt or that the InnoDB tablespace may have been copied without the InnoDB log files.
  • 07-Jan-16 16:01 - DBOD service managers are notified by the DBOD Monitoring Service that the Cinder DB is down.
  • 07-Jan-16 17:00 - The monitoring alert is handled, the problem is understood, and the OpenStack BlockStorage service managers are contacted to propose relocating the service and restarting the DB.
  • 07-Jan-16 17:39 - Green light to restart the DB is received from the OpenStack BlockStorage service manager.
  • 07-Jan-16 17:51 - The relocation and restart are performed but do not solve the problem; the Cinder DB may be corrupted. We inform the OpenStack BlockStorage service managers and suggest restoring the DB from a backup taken on the same day at 14:25:57h.
  • 07-Jan-16 18:03 - Green light is received to attempt a dump of the DB and to perform the DB restore.
  • 07-Jan-16 18:11 - The dump cannot be performed as the DB is rejecting the operations. We restore from the backup and inform the OpenStack BlockStorage service manager that their DB instance is operational, but with data from 14:25:57h.
  • 07-Jan-16 18:46 - The OpenStack BlockStorage service manager confirms that Cinder is working again, but notes that the list of volumes taken before the restart shows a significant number of differences compared to the current list from the restored backup.
  • 08-Jan-16 11:56 - We recover from the MySQL binlogs the sequence of statements executed against Cinder from Jan 7th at 14:25:57h (the restore point) to Jan 7th at 15:57h (when the problem started), and even a few minutes after. These are sent to the OpenStack BlockStorage service manager, as they may be useful to recover their service from the data loss after the restore (see the sketch after this list).
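
This kind of extraction is typically done with mysqlbinlog, which can select a time window from the binary logs directly. The command below is only a sketch of how such a window could be selected; the binlog location and file names, the schema name and the exact stop time are placeholders, not necessarily those actually used.

# Sketch: extract the statements recorded between the backup time and shortly
# after the failure so they can be reviewed by the service managers.
# Binlog path/file names, schema name and stop time are illustrative.
/usr/local/mysql/mysql-5.6.17/bin/mysqlbinlog \
  --start-datetime="2016-01-07 14:25:57" \
  --stop-datetime="2016-01-07 16:05:00" \
  --database=cinder \
  /path/to/binlogs/binlog.00* > /tmp/cinder_statements_after_backup.sql

The resulting file contains the statements in execution order, so the BlockStorage service managers can inspect them and decide what, if anything, to replay.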

Analysis

  • The Oracle Clusterware (DBOD HA solution) detected that the Cinder DB was down on the node where it should be running (itrac5142) and tried to relocate it to the second node (itrac5114).
  • The Cinder service was moved by the CRS to node itrac5114, but unfortunately a mysqld process had been running there since 10-Nov-2015, which triggered error messages like the following:
160107 15:59:43 mysqld_safe A mysqld process already exists
2016-01-07 16:03:37 7f977ba55700 InnoDB: Error: page 1115 log sequence number 22443906836
InnoDB: is in the future! Current system log sequence number 19370816814.
InnoDB: Your database may be corrupt or you may have copied the InnoDB
InnoDB: tablespace but not the InnoDB log files. See
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: for more information.
  • We suggested to the OpenStack BlockStorage service managers that the Cinder service be relocated and restarted on the second node to fix the problem. This was accepted and performed, but it didn't solve the problem: messages like the one above appeared on the second node too.
  • We then suggested to the OpenStack BlockStorage service managers that the DB be restored from a backup taken on the same day at 14:25:57h. This was accepted. We first tried to perform a dump of the DB, but didn't succeed as the DB was rejecting the operation with messages like:
/usr/local/mysql/mysql-5.6.17/bin/mysqldump -u dod_xxx -p -S/var/lib/mysql/mysql.sock.cinder.xxxx --lock-tables=false --single-transaction --all-databases > /tmp/all-database.sql
Enter password:
mysqldump: Couldn't execute 'SELECT /*!40001 SQL_NO_CACHE */ * FROM `quota_usages`': Lost connection to MySQL server during query (2013)

  • The DB was restored from the backup taken at 14:25:57h and the OpenStack BlockStorage service manager was informed that their DB instance was operational again.
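
For reference, the InnoDB message quoted above links to MySQL's forced-recovery documentation; starting the instance with innodb_force_recovery is the diagnostic step described there. It is shown below only as an illustration of that option (the defaults file is a placeholder, the socket name is copied from the dump command above); the path chosen in this incident was the restore from backup.

# Illustration only: start the instance with forced InnoDB recovery so it can
# be inspected or dumped. Defaults file is a placeholder.
/usr/local/mysql/mysql-5.6.17/bin/mysqld_safe \
  --defaults-file=/etc/mysql/cinder.cnf \
  --innodb_force_recovery=1 \
  --socket=/var/lib/mysql/mysql.sock.cinder.xxxx &
# innodb_force_recovery accepts values 1 to 6; while it is greater than 0,
# InnoDB blocks INSERT/UPDATE/DELETE, and values of 4 or higher risk further
# damage, so such a start is only meant for inspection and taking a dump.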

Follow up

  • The DBOD service managers should be alerted when a DB in an HA cluster is running on the two nodes at the same time. A sensor should be written and deployed for this (a minimal sketch follows this list).
  • The DBOD service managers will study how the CRS status check can be modified to avoid starting MySQL on a second node if it is already running on one.
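
A minimal sketch of such a sensor is shown below. It assumes the check runs from a monitoring host and looks, over SSH, for a mysqld process belonging to the instance on each node of its cluster; the instance name, node list and process-match pattern are illustrative and would have to follow the real DBOD naming.

#!/bin/bash
# Sketch: alert if the mysqld of a given DBOD instance is running on more
# than one node of its HA cluster at the same time.
INSTANCE="cinder"              # illustrative instance name
NODES="itrac5142 itrac5114"    # the two cluster nodes
count=0
for node in $NODES; do
  if ssh -o ConnectTimeout=5 "$node" "pgrep -f 'mysqld .*${INSTANCE}'" >/dev/null 2>&1; then
    count=$((count + 1))
  fi
done
if [ "$count" -gt 1 ]; then
  echo "CRITICAL: ${INSTANCE} mysqld found on ${count} nodes"
  exit 2
fi
echo "OK: ${INSTANCE} mysqld found on ${count} node(s)"
exit 0

The exit codes follow the usual 0/2 monitoring convention so the check could be hooked into the existing alerting.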

-- DavidCollados - 2016-01-08
