A few short interruptions of replication of CMS data from online to offline
Description
On Tuesday 13th July at around 1:30 AM the replication of CMS data failed due to a LogMiner error caused by memory fragmentation. The monitoring software automatically restarted the Streams processes. Later, at around 9:00 AM, the replication failed again as a result of a DBA mistake.
Impact
The impact on CMS was minimal since in both cases the replication was restarted very quickly. The failures slightly increased replication latency (up to 15 minutes).
Time line of the incident
- Tuesday 13th July 1:30 AM - replication failed due to memory issues. A few minutes later it was automatically restarted.
- Tuesday 13th July 8:15 AM - replication was intentionally stopped by the weekly job performing cleanup operations
- Tuesday 13th July 8:30 AM - the DBA did not realize that replication had been stopped intentionally and restarted it manually
- Tuesday 13th July 9:00 AM - replication failed again because one of the Streams packages got invalidated by the cleanup
- Tuesday 13th July 9:15 AM - replication restarted manually
Analysis
The memory issue that caused the first failure is a known issue and can be worked around by increasing the amount of physical memory installed on the CMSONR servers.
The DBA mistake was related to a documentation issue. The documentation contained no information about the job that performs the cleanup of Streams tables.
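The second failure suggests a simple check before any manual restart: is the stop explained by a scheduled job, and are any packages currently invalid? The following is only a minimal sketch of such a check, not the procedure actually used. The names `MaintenanceWindow`, `safe_to_restart` and `WEEKLY_CLEANUP` are hypothetical, the 45-minute duration of the cleanup job is an assumption (the report only gives its start time), and the list of invalid packages is expected to be supplied by the caller, for instance from a data dictionary query on `DBA_OBJECTS` filtered on `STATUS = 'INVALID'`.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import Iterable, Sequence, Tuple


@dataclass(frozen=True)
class MaintenanceWindow:
    """A weekly window during which replication is stopped on purpose."""
    weekday: int          # Monday == 0, same convention as datetime.weekday()
    start: time
    duration: timedelta   # assumed value; not stated in the incident report
    label: str

    def contains(self, moment: datetime) -> bool:
        # Find the most recent date falling on this window's weekday,
        # then test whether `moment` lies inside [opened, opened + duration).
        days_back = (moment.weekday() - self.weekday) % 7
        opened = datetime.combine(moment.date() - timedelta(days=days_back), self.start)
        return opened <= moment < opened + self.duration


def safe_to_restart(
    now: datetime,
    windows: Sequence[MaintenanceWindow],
    invalid_packages: Iterable[str],
) -> Tuple[bool, str]:
    """Return (ok, reason) for a proposed manual restart of replication."""
    for window in windows:
        if window.contains(now):
            return False, f"replication is stopped on purpose: {window.label}"
    broken = sorted(invalid_packages)
    if broken:
        return False, "recompile invalid packages first: " + ", ".join(broken)
    return True, "no known reason to hold off"


# Hypothetical entry for the weekly job from the time line (Tuesday, 8:15 AM).
WEEKLY_CLEANUP = MaintenanceWindow(
    weekday=1,
    start=time(8, 15),
    duration=timedelta(minutes=45),  # assumption for illustration only
    label="weekly Streams cleanup job",
)
```

With these assumed values, `safe_to_restart(datetime(2010, 7, 13, 8, 30), [WEEKLY_CLEANUP], [])` would refuse the 8:30 AM restart because it falls inside the cleanup window, and a later call would also refuse as long as the caller reports any invalid package.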
Follow up
A memory extension has been agreed with CMS and will be implemented soon.
The documentation problem has been fixed.
--
JacekWojcieszuk - 16-Jul-2010
Topic revision: r1 - 2010-07-16