Instability of nodes 2 and 4 of the CMSR database affecting online-to-offline replication

Description

CMSR nodes rebooted three times on the evening of Monday 13.09.2010 (node 4 twice, node 2 once), causing Streams replication to hang.

Impact

  • The reboots directly affected the CMS PhEDEx, Frontier and CMS Dataset Bookkeeping applications, which are the main users of those nodes. The applications lost connections and in-progress transactions, but could reconnect to the database immediately thanks to the service failover mechanism.
  • Streams replication hangs could potentially affect re-processing activities that rely on streaming of conditions data.

Timeline of the incident

All events described below happened on 13.09.2010.
  • 22:20 - cmsr4 reboots -> node rebooted by the cluster, services relocated to the remaining nodes.
  • 23:20 - high load on cmsr2 due to user processes consuming all the memory of the server. The server was completely overloaded, which affected new connections to the CMSR cluster.
  • 23:23 - cmsr2 reboots -> node rebooted by the cluster, services relocated to the remaining nodes.
  • 23:30 - high load on cmsr4 due to user processes consuming all the memory of the server. The server was completely overloaded, which affected new connections to the CMSR cluster.
  • 23:33 - cmsr4 reboots -> node rebooted by the cluster, services relocated to the remaining nodes.
  • 23:45 - PVSS replication unstuck and fully functional -> required manual intervention.
  • 23:50 - conditions replication unstuck and fully functional -> required manual intervention.

Analysis

Thanks to the new PGA monitoring tool, the SQL queries responsible for the high memory utilization were identified. The reason why the node reboots affected Streams replication is not clear.
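
The internals of the PGA monitoring tool are not described here. Purely as an illustration of the kind of ranking such a tool might perform, the sketch below sums sampled per-session PGA usage by node and SQL statement and reports the largest consumers. The function name, the row layout and the sample values are all hypothetical and are not taken from the incident data; on a real system, rows of this shape could for instance be obtained by joining V$SESSION and V$PROCESS.

```python
from collections import defaultdict


def top_pga_consumers(samples, limit=3):
    """Sum PGA bytes per (node, sql_id) and return the largest consumers.

    `samples` is an iterable of (node, sql_id, pga_bytes) tuples, one per
    sampled session. This row layout is an assumption made for this sketch.
    """
    totals = defaultdict(int)
    for node, sql_id, pga_bytes in samples:
        totals[(node, sql_id)] += pga_bytes
    # Largest total first; ties keep their first-seen order (sort is stable).
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Made-up sample rows, for illustration only: (node, sql_id, PGA bytes in use).
samples = [
    ("cmsr2", "sql_a", 6 * 1024**3),
    ("cmsr2", "sql_b", 200 * 1024**2),
    ("cmsr4", "sql_a", 5 * 1024**3),
    ("cmsr2", "sql_a", 1 * 1024**3),
]

for (node, sql_id), pga_bytes in top_pga_consumers(samples):
    print(f"{node} {sql_id}: {pga_bytes / 1024**3:.2f} GiB")
```

Grouping by statement rather than by session is a design choice of this sketch: it makes a single query that is run from many sessions stand out even when no individual session is large.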

Follow up

The SQL queries responsible for the high load were identified and are now under investigation in cooperation with CMS.

-- MarcinBlaszczyk - 21-Sep-2010

Topic revision: r1 - 2010-09-21 - MarcinBlaszczyk
 