Replication of ATLAS data to Tier1 sites stopped because the capture process repeatedly crashed

Description

On Monday, 23rd of August at about 8.15 AM, the ATLAS data capture process failed to restart after the periodic, automatic Streams maintenance job (the weekly shrink of the LogMiner tables). Every manual restart attempt crashed with the following Oracle error: ORA-01346: LogMiner processed redo beyond specified reset log scn. With the capture process down, data was not propagated from the ATLAS offline database to the Tier1 sites.

Impact

Condition data from the ATLAS offline DB was not propagated to the Tier1 sites for the duration of the incident, about 35 hours.

Timeline of the incident

  • Monday, 23rd of August 8.15AM - failed restart of the capture process
  • Tuesday, 24th of August 7PM - capture process started and data streaming resumed

Actions

  • Manual restarts of the capture process - no results
  • Restart of the database - no results
  • Identification of archive logs suspected to be corrupted and their replacement with copies from the ATLR database - no results
  • Opening of service request no. 3-2045177871. The SR was escalated to severity 1.
  • Application of the solution proposed by Oracle support - no results, but the root cause was understood
  • Further investigation of LogMiner's internal tables and application of a solution developed by us finally allowed the capture process to start
  • Follow up: the instructions for opening the standby will be updated to highlight the step that resets log_archive_dest
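
The follow-up item could also be backed by an automated check run before the standby is opened. The following is a minimal Python sketch, not the procedure that was actually used: it assumes the standby's initialization parameters have already been read into a dictionary, and the service name atlas_downstream is a hypothetical placeholder, since the real alias of the downstream database is not given in this report.

```python
import re

# Hypothetical net service name of the downstream capture database;
# the real alias is not stated in this report.
DOWNSTREAM_SERVICE = "atlas_downstream"


def risky_archive_destinations(params, forbidden_service=DOWNSTREAM_SERVICE):
    """Return the enabled log_archive_dest_n parameters that ship redo
    to the forbidden service.

    params maps parameter names to their string values, e.g. as read
    from the standby's parameter file (how they are read is out of scope).
    """
    # Normalise names so that lookups are case-insensitive.
    params = {name.lower(): value for name, value in params.items()}
    risky = []
    for name, value in params.items():
        match = re.fullmatch(r"log_archive_dest_(\d+)", name)
        if not match or not value:
            continue
        # A destination only ships redo when its state is ENABLE
        # (the default when log_archive_dest_state_n is not set).
        state = params.get(f"log_archive_dest_state_{match.group(1)}") or "enable"
        if state.lower() != "enable":
            continue
        service = re.search(r'service\s*=\s*"?([\w.\-]+)', value, re.IGNORECASE)
        if service and service.group(1).lower() == forbidden_service.lower():
            risky.append(name)
    return sorted(risky)
```

A non-empty result would mean that the standby must not be opened until the listed destinations are reset or deferred.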

Analysis

The ORA-01346 error indicates that LogMiner has detected a new branch whose resetlogs SCN precedes redo that it has already mined. The source database had never been opened with resetlogs, so it became clear that LogMiner had received redo logs from the same database but from a different incarnation. It turned out that, during the previous week, the ATLAS offline standby database had been opened in order to export some data from it. The standby database has the same settings as the primary, including the log_archive_dest parameter that points to the ATLAS downstream database. Because of human error, this setting was not changed before the standby was opened, so while the ATLR standby was open, some archive logs were sent to the ATLAS downstream database and registered there. The LogMiner process later identified them as logs coming from another incarnation of the ATLR database. The problem was solved by removing all traces of these problematic archive logs from the affected LogMiner internal tables.
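
The idea behind spotting the offending logs can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the report does not say which LogMiner tables or columns were inspected, so the RegisteredLog record, its field names and the sample values below are hypothetical. The sketch assumes that each archive log registered at the downstream database carries the resetlogs SCN of the incarnation that produced it, and flags every log whose value differs from that of the primary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisteredLog:
    """Hypothetical record of an archive log registered downstream."""
    name: str
    sequence: int
    resetlogs_scn: int  # resetlogs SCN of the incarnation that wrote the log


def foreign_incarnation_logs(logs, expected_resetlogs_scn):
    """Return the logs that were not produced by the expected incarnation."""
    return [log for log in logs if log.resetlogs_scn != expected_resetlogs_scn]


# Illustrative values only; none of them come from the incident.
logs = [
    RegisteredLog("arch_1001.arc", 1001, resetlogs_scn=1),
    RegisteredLog("arch_1002.arc", 1002, resetlogs_scn=1),
    RegisteredLog("arch_0001.arc", 1, resetlogs_scn=5_000_000),
]
suspects = foreign_incarnation_logs(logs, expected_resetlogs_scn=1)
```

Here suspects contains only arch_0001.arc, the one log written after the (hypothetical) standby was opened as a new incarnation.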

-- PrzemyslawRadowiecki - 26-Aug-2010

Topic revision: r3 - 2010-08-26 - LucaCanali
 