Instability of nodes 2 and 4 of the CMSR database affecting online-to-offline replication
Description
Nodes of the CMSR database rebooted three times on Monday evening, 13.09.2010 (node 4 twice, node 2 once), causing Streams replication to hang.
Impact
- The reboots directly affected the CMS PhEDEx, Frontier and CMS Dataset Bookkeeping applications, which are the main users of those nodes. The applications lost their connections and any transactions in progress. They could reconnect to the database immediately thanks to the service failover mechanism.
- The Streams replication hangs could potentially have affected re-processing activities that rely on the streaming of conditions data.
Timeline of the incident
All events described below happened on 13.09.2010.
- 22:20 - cmsr4 reboots -> node rebooted by cluster, services relocated to remaining nodes.
- 23:20 - high load on cmsr2 due to user processes consuming all the memory of the server. The server was completely overloaded, which affected new connections to the cmsr cluster.
- 23:23 - cmsr2 reboots -> node rebooted by cluster, services relocated to remaining nodes.
- 23:30 - high load on cmsr4 due to user processes consuming all the memory of the server. The server was completely overloaded, which affected new connections to the cmsr cluster.
- 23:33 - cmsr4 reboots -> node rebooted by cluster, services relocated to remaining nodes.
- 23:45 - PVSS replication unstuck and fully functional -> required manual intervention.
- 23:50 - conditions replication unstuck and fully functional -> required manual intervention.
Analysis
Thanks to the new PGA monitoring tool, the SQL queries responsible for the high memory utilization were identified. The reason why the node reboots affected Streams replication is not clear.
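The PGA monitoring tool itself is not described in this report. As a rough illustration of the kind of check such a tool might perform, the sketch below ranks sessions by allocated PGA memory. It is an assumption, not the tool's actual implementation: the query text only shows one plausible way to obtain the figures from the standard Oracle views v$session and v$process, and the names SessionPga, top_pga_consumers, threshold_bytes and limit are hypothetical.

```python
from typing import Iterable, List, NamedTuple

# Illustrative query only (not executed here): joins each session to its
# server process to expose per-session PGA allocation and the current SQL_ID.
PGA_BY_SESSION_SQL = """
SELECT s.sid, s.username, s.sql_id, p.pga_alloc_mem
FROM   v$session s
JOIN   v$process p ON p.addr = s.paddr
WHERE  s.username IS NOT NULL
"""


class SessionPga(NamedTuple):
    """One row of per-session PGA data (hypothetical structure)."""
    sid: int
    username: str
    sql_id: str
    pga_alloc_bytes: int


def top_pga_consumers(rows: Iterable[SessionPga],
                      threshold_bytes: int,
                      limit: int = 5) -> List[SessionPga]:
    """Return up to `limit` sessions at or above the threshold, largest first."""
    heavy = [r for r in rows if r.pga_alloc_bytes >= threshold_bytes]
    heavy.sort(key=lambda r: r.pga_alloc_bytes, reverse=True)
    return heavy[:limit]
```

Fed with rows fetched periodically on each node, a function like this would point at the sessions, and through sql_id the statements, that hold the most memory before the server becomes overloaded.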
Follow up
The SQL queries responsible for the high load were identified and are now under investigation in cooperation with CMS.
--
MarcinBlaszczyk - 21-Sep-2010