ACCelerators LOGging database blocked due to lack of space in archivelog area

Description

Due to a lack of space in the archive log area, the archiving of redo logs stopped for the ACCLOG database, an Oracle RAC database.

Impact

Some applications relying on ACCLOG, such as Timber, did not work for about 120 minutes, but there was no data loss.

Time analysis

On 8th August:

06:18 Logging database ACCLOG stopped
06:58 LHC operators cannot retrieve logged data via TIMBER
08:24 Chris Roderick (BE/CO/DM) calls IT/DB Oracle support (Nilo Segura)
08:38 Nilo finds the root cause and provides a temporary fix by adding the remaining space available in the controller file system
08:42 Logging resumed, including the backlog of buffered data

Root cause

The accelerators logging backup had been suffering from problems since 14/07/2011, when we first reported issues (INC052661). The database has been moved to a 10 Gbps TSM server located in building 613.
This move was expected to improve performance when sending and retrieving data, and also to provide a location separate from the database server in case of a disaster.

We have had several iterations with TSM support in order to correct this situation, but problems kept arising intermittently. Some of them have been reported via support cases (INC055600, INC056733, INC057682).

Since last weekend we have again suffered issues while backing up ACCMEAS and ACCLOG, the two databases on this TSM server. Last Sunday we started to transfer archivelogs from ACCMEAS to our big SATA storage because its archivelog area (~730 GB) was getting full: the archivelog production of this database went up almost 1 TB during the weekend. During this period problems were also observed on ACCLOG, but the database always managed to send its data to TSM in the end, so our attention was focused on ACCMEAS.

Our monitoring system sent an email warning about this situation on Tuesday morning.
The ping check of the databases, which alerts by SMS, kept succeeding because connectivity was not lost; only a check that performs a commit would have detected this issue. This is being improved.
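
To illustrate the difference between a connectivity ping and a commit-based check, here is a minimal sketch in Python. It is not the actual IT/DB monitoring code. The function name write_probe, the heartbeat table and its columns are hypothetical, and connect is assumed to be any callable returning a DB-API 2.0 connection (for Oracle that would be a driver connection, not shown here). The timeout is there because a blocked write may hang rather than fail with an error.

```python
import concurrent.futures


def write_probe(connect, timeout_s=10.0):
    """Return True if a small UPDATE commits within timeout_s seconds.

    `connect` is a hypothetical callable returning a DB-API 2.0 connection.
    The `heartbeat` table (id, beats) is an assumption for this sketch and
    must already contain a row with id = 1.
    """
    def attempt():
        # Connect inside the worker thread so the whole round trip is timed.
        conn = connect()
        try:
            cur = conn.cursor()
            cur.execute("UPDATE heartbeat SET beats = beats + 1 WHERE id = 1")
            conn.commit()  # a plain ping never exercises this step
            return cur.rowcount == 1
        finally:
            conn.close()

    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(attempt)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Timeout, connection error or SQL error: report the probe as failed.
        return False
    finally:
        # Do not wait for a possibly hung worker; it keeps running in the
        # background until the database call returns.
        pool.shutdown(wait=False)
```

A monitoring job could call write_probe on a schedule and send the SMS when it returns False. Note that a hung worker thread is not cancelled in this sketch, so a production version would need to handle that case.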

Solution

The low performance of TSM609 was due to many streams from several databases trying to catch up on the backups of previous days. Some of these streams were stopped on Tuesday so that ACCLOG could catch up. The situation is now fine.

We have also changed the file system where archivelogs are retained until they are sent to tape, so with the current archivelog generation rate we can keep at least 4 days of archivelogs without sending them to tape.
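
The retention figure above is a simple ratio of file system capacity to daily archivelog volume. The following Python sketch shows that arithmetic; the function names and the sample numbers are illustrative assumptions, since the actual file system size and generation rate are not stated in this report.

```python
def retention_days(capacity_gb, daily_archivelog_gb):
    """Days of archivelogs that fit in the area if nothing goes to tape."""
    if daily_archivelog_gb <= 0:
        raise ValueError("daily_archivelog_gb must be positive")
    return capacity_gb / daily_archivelog_gb


def capacity_needed_gb(days, daily_archivelog_gb, headroom=0.25):
    """Capacity needed to retain `days` of archivelogs plus some headroom.

    The 25% default headroom is an arbitrary value chosen for this sketch.
    """
    return days * daily_archivelog_gb * (1 + headroom)


# Hypothetical figures: a 400 GB area filled at 100 GB per day.
print(retention_days(400, 100))      # days available without tape
print(capacity_needed_gb(4, 100))    # GB needed for 4 days with headroom
```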

-- RubenGaspar - 10-Aug-2011

Topic revision: r1 - 2011-08-10 - RubenGaspar
 