Unavailability of the MySQL backend used by Drupal, 21st July 2011

Description

The MySQL VMs are dbvrtg046 and dbvrtg047.

Impact

No user access to Drupal sites, including the SSB, from ~4:10 am until ~9:45 am. Instabilities are still present.

Time line and analysis

/var/log/messages on dbvrtg047, showing the moments before the reboot and the subsequent reboots:

Jul 20 22:33:53 dbvrtg047 ntpd[2656]: synchronized to 137.138.17.69, stratum 2
Jul 20 22:34:58 dbvrtg047 ntpd[2656]: no servers reachable
Jul 20 22:37:10 dbvrtg047 ntpd[2656]: synchronized to 137.138.16.69, stratum 2
Jul 20 22:56:19 dbvrtg047 ntpd[2656]: synchronized to 137.138.17.69, stratum 2
Jul 20 23:00:47 dbvrtg047 nscd: nss_ldap: reconnected to LDAP server ldap://xldap.cern.ch/
Jul 20 23:11:24 dbvrtg047 ntpd[2656]: synchronized to 137.138.16.69, stratum 2
Jul 20 23:19:57 dbvrtg047 ntpd[2656]: no servers reachable
Jul 20 23:24:11 dbvrtg047 ntpd[2656]: synchronized to 137.138.17.69, stratum 2
Jul 20 23:47:46 dbvrtg047 ntpd[2656]: synchronized to 137.138.16.69, stratum 2
Jul 20 23:49:07 dbvrtg047 syslogd 1.4.1: restart.
Jul 21 02:21:54 dbvrtg047 syslogd 1.4.1: restart.
Jul 21 08:29:15 dbvrtg047 syslogd 1.4.1: restart.
Jul 21 09:08:01 dbvrtg047 syslogd 1.4.1: restart.
According to the CRS log on node dbvrtg047, it was evicted by dbvrtg046 (this is usually related to missing heartbeats):
2011-07-20 23:51:17.255
[cssd(7204)]CRS-1601:CSSD Reconfiguration complete. Active nodes are dbvrtg046 dbvrtg047 .
2011-07-20 23:51:17.795
[crsd(5900)]CRS-1012:The OCR service started on node dbvrtg047.
2011-07-20 23:51:17.805
[evmd(5545)]CRS-1401:EVMD started on node dbvrtg047.
2011-07-20 23:51:18.999
[crsd(5900)]CRS-1201:CRSD started on node dbvrtg047.
2011-07-21 01:38:24.884

We have observed that when a VM reboots, the underlying storage still believes the system is alive and the locks on its files are kept. This may be a consequence of the type of virtualization used (hardware virtualised).

In this situation the members of the cluster, dbvrtg046 and dbvrtg047, start competing to bring the database back up. This fails because locks from the old mysqld daemon are still held.
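
One way to check this hypothesis is to probe the advisory lock on the InnoDB data file from one of the nodes. The sketch below is not part of the actual setup: it assumes that mysqld takes an fcntl advisory lock on the InnoDB data files under the datadir /ORA/dbs03/DRUPAL/mysql seen in the mysqld_safe log, that the system tablespace has the default name ibdata1, and that over NFS such a lock can outlive a client that disappears without releasing it.

#!/usr/bin/env python
# Hedged sketch: check whether a mysqld (possibly one from a VM that is
# already gone) still holds the fcntl advisory lock on the InnoDB data file
# sitting on the NAS. Assumptions: datadir /ORA/dbs03/DRUPAL/mysql (from the
# mysqld_safe log above) and the default system tablespace name ibdata1.
import fcntl
import os
import sys

DATAFILE = "/ORA/dbs03/DRUPAL/mysql/ibdata1"   # assumed default file name

def lock_is_free(path):
    """Try to take and immediately release a non-blocking exclusive lock.
    If that fails, some other NFS client still holds the lock."""
    fd = os.open(path, os.O_RDWR)
    try:
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except (IOError, OSError):
            return False                       # lock still held elsewhere
        fcntl.lockf(fd, fcntl.LOCK_UN)         # release it again
        return True
    finally:
        os.close(fd)

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else DATAFILE
    if lock_is_free(path):
        print("%s: no lock held, mysqld should be able to start here" % path)
    else:
        print("%s: lock still held on the NAS, a mysqld start will fail" % path)

If the probe reports the lock as held while no mysqld is running on either node, the holder is the lock left behind by the rebooted VM.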

dbvrtg047 is evicted and dbvrtg046 takes over:
110720 23:54:23 [Note] /usr/sbin/mysqld: ready for connections. Version: '5.5.9-log' socket: '/var/lib/mysql/mysql.sock' port: 5500 MySQL Community Server (GPL)

dbvrtg046 is evicted and dbvrtg047 takes over:
110721 1:42:06 [Note] /usr/sbin/mysqld: ready for connections. Version: '5.5.9-log' socket: '/var/lib/mysql/mysql.sock' port: 5500 MySQL Community Server (GPL)

At Jul 21 02:21:54, both nodes try to take over:

dbvrtg046: 110721 02:23:44 mysqld_safe Starting mysqld daemon with databases from /ORA/dbs03/DRUPAL/mysql
dbvrtg047: 110721 02:25:42 mysqld_safe Starting mysqld daemon with databases from /ORA/dbs03/DRUPAL/mysql
dbvrtg047: 110721 02:25:52 mysqld_safe Starting mysqld daemon with databases from /ORA/dbs03/DRUPAL/mysql
Both keep trying to restart until they finally give up. This left stale locks on the NAS.
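
To avoid the two nodes racing against such a stale lock, the start action could first wait for the lock to be released before launching mysqld_safe. The following is only a sketch under the same assumptions as above (ibdata1 under /ORA/dbs03/DRUPAL/mysql, mysqld_safe assumed to live in /usr/bin), not the action script the cluster actually uses.

#!/usr/bin/env python
# Hedged sketch, not the actual CRS action script: refuse to start mysqld
# while the InnoDB data file on the NAS is still locked, so that the two
# cluster nodes do not keep restarting against a lock left by a dead VM.
import fcntl
import os
import subprocess
import sys
import time

DATAFILE = "/ORA/dbs03/DRUPAL/mysql/ibdata1"   # assumed default file name
TIMEOUT = 300                                  # seconds before giving up

def wait_for_lock(path, timeout):
    """Poll the fcntl lock until it is free or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        fd = os.open(path, os.O_RDWR)
        try:
            try:
                fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                fcntl.lockf(fd, fcntl.LOCK_UN)
                return True                    # free: safe to start mysqld
            except (IOError, OSError):
                pass                           # still held: retry later
        finally:
            os.close(fd)
        time.sleep(10)
    return False

if __name__ == "__main__":
    if not wait_for_lock(DATAFILE, TIMEOUT):
        sys.exit("lock on %s still held after %ds, not starting mysqld"
                 % (DATAFILE, TIMEOUT))
    subprocess.call(["/usr/bin/mysqld_safe",            # assumed location
                     "--datadir=/ORA/dbs03/DRUPAL/mysql"])

Waiting with a timeout rather than failing immediately keeps a clean failover working (the lock is released quickly when the other node shuts down cleanly) while preventing the endless restart loop seen above.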

At around 4 in the morning, dbvrtg046 is definitively shut down:

[2011-07-21 04:07:54 7088] INFO (XendDomainInfo:1924) Domain has shutdown: name=843_dbvrtg046 id=523 reason=reboot.
[2011-07-21 04:09:42 7088] INFO (image:553) 843_dbvrtg046 device model terminated

At 8:34 am dbvrtg047 reboots and tries to get the instance back. Finally, after a manual intervention, the system is back at:

110721  9:44:34 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.9-log'  socket: '/var/lib/mysql/mysql.sock'  port: 5500  MySQL Community Server (GPL)
From ~4:10 until 9:44 mysqld was down.

It reboots again at:

Jul 21 10:45:25 dbvrtg047 syslogd 1.4.1: restart.
Back again:
110721 10:59:03 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.9-log'  socket: '/var/lib/mysql/mysql.sock'  port: 5500  MySQL Community Server (GPL)

Follow-up

We are trying to contact the OVM product managers in the USA and Europe, since so far we have not managed to open an SR on the Oracle support web site due to a license propagation issue.

The service started yesterday to prepare two physical machines. Our analysis is that the instabilities are linked to the CRS + virtualization layer. We expect to do the migration next week.

-- RubenGaspar - 21-Jul-2011
