ACCelerators LOGging (ACCLOG) database blocked due to lack of space in archivelog area

Description

ACCLOG, an Oracle RAC database, froze because the archive log area ran out of space and the redologs could no longer be archived.

Impact

Some applications relying on ACCLOG, such as TIMBER, did not work for about 120 minutes. There was no data loss. This database is considered critical for the accelerators.

Archiving of the redolog files is an essential component of the Oracle database architecture. It ensures that transactions can be replayed, either after a host/database crash or to retrieve the state of the database just before an error was made. It is also used to "roll forward" transactions starting from a restored copy of the database. Archiving is not a backup in itself but an essential part of the database architecture: every archived redolog file is needed in order to replay transactions.

In the IT-DB setup, we force a switch of the redolog files every 15 minutes. The redolog files are first copied to disk and then sent to TSM through the TDPO layer; once they have been sent to TSM, the archived redolog files are removed from the disk. A buffer of space is available on the disk arrays that are part of the database setup, so that the database can continue working even if it cannot send files to TSM.
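
The flow described above can be sketched as a small simulation. This is only an illustration under stated assumptions: the function name, the buffer size and the volume per switch are hypothetical and are not the IT-DB configuration, and the model simply assumes that a successful transfer to TSM empties the disk buffer.

```python
def first_failed_switch(buffer_gb, gb_per_switch, tsm_available):
    """Return the index of the first log switch that cannot be archived.

    tsm_available holds one boolean per 15-minute switch interval.
    Returns None if the buffer never fills up.
    """
    used_gb = 0.0
    for i, tsm_up in enumerate(tsm_available):
        # Each forced switch writes one archived redolog to the disk buffer.
        if used_gb + gb_per_switch > buffer_gb:
            return i  # no room left: archiving stops and the database freezes
        used_gb += gb_per_switch
        if tsm_up:
            # Files sent to TSM are removed from disk afterwards.
            used_gb = 0.0
    return None


# Hypothetical numbers: 100 GB buffer, 10 GB per switch, TSM down throughout.
print(first_failed_switch(100, 10, [False] * 20))
```

As long as TSM is reachable the buffer stays nearly empty; once it is unreachable, the time to a freeze depends only on the buffer size and the rate at which redologs are produced.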

Time analysis

On 6th August 2011:

On 9th August 2011:

  • 06:18 Logging database ACCLOG stopped
  • 06:58 LHC operators cannot retrieve logged data via TIMBER
  • 08:24 Chris Roderick (BE/CO/DM) calls the IT/DB Oracle support (Nilo Segura)
  • 08:38 Nilo found the root cause and provided a temporary fix by adding the remaining space available in the controller file system.
  • 08:42 Logging resumed, including the backlog of buffered data
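
The intervals between the events listed above can be worked out with a few lines of Python. This is just arithmetic on the timestamps in the list; the dictionary keys are shorthand labels chosen for this sketch.

```python
from datetime import datetime

# Timestamps copied from the timeline of 9th August 2011.
EVENTS = {
    "database stopped": "06:18",
    "TIMBER retrieval failing": "06:58",
    "IT/DB support called": "08:24",
    "root cause found": "08:38",
    "logging resumed": "08:42",
}


def minutes_between(start, end):
    """Minutes elapsed between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


stop_to_resume = minutes_between(EVENTS["database stopped"], EVENTS["logging resumed"])
stop_to_call = minutes_between(EVENTS["database stopped"], EVENTS["IT/DB support called"])
call_to_resume = minutes_between(EVENTS["IT/DB support called"], EVENTS["logging resumed"])

print(f"database stopped -> logging resumed: {stop_to_resume} min")
print(f"database stopped -> support called:  {stop_to_call} min")
print(f"support called   -> logging resumed: {call_to_resume} min")
```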

Root cause:

The ACCLOG database was moved around two months ago to TSM609, a 10 Gbps TSM server located in building 613. In general, the availability and speed of backup/restore processes have improved a lot since then. Besides better performance, this TSM server is in a different location from the database server, which is an important advantage in a disaster scenario. Nevertheless, the accelerators logging backup has been suffering from problems since 14/07/2011 (https://cern.service-now.com/service-portal/view-incident.do?n=INC052661), when we first reported issues to TSM support.

We have gone through several iterations with TSM support and have worked closely with them to correct this situation, but problems kept arising intermittently. Some of them have been reported via support cases (https://cern.service-now.com/service-portal/view-incident.do?n=INC055600, https://cern.service-now.com/service-portal/view-incident.do?n=INC056733, https://cern.service-now.com/service-portal/view-incident.do?n=INC057682, https://cern.service-now.com/service-portal/view-incident.do?n=INC057758, https://cern.service-now.com/service-portal/view-incident.do?n=INC057671, https://cern.service-now.com/service-portal/view-incident.do?n=INC056638). Cooperation with TSM support has been productive.

Since last weekend we have again suffered issues while backing up ACCMEAS and ACCLOG, the two databases on this TSM server, TSM609. Last Sunday, we started to transfer archivelogs from ACCMEAS to our big SATA storage, as its archivelog area (~730 GB) was getting full; the archivelog production of this database went up to almost 1 TB per day during the weekend. During this period, problems were also observed on ACCLOG, but the database always managed to send its data to TSM in the end, so our attention was focused on ACCMEAS. Early on Tuesday morning TSM609 was not available and, as a result, ACCLOG failed to transfer its archivelogs. This, together with increased activity on the database, led to the exhaustion of the database archivelog area, which froze the database.

Production of archivelogs per day in ACCLOG:

DAY                     SIZE_GB
-------------------- ----------
11-JUL-2011 00:00:00      409.3
12-JUL-2011 00:00:00      456.9
13-JUL-2011 00:00:00      474.1
14-JUL-2011 00:00:00      476.7
15-JUL-2011 00:00:00      480.3
16-JUL-2011 00:00:00      478.1
17-JUL-2011 00:00:00        523
18-JUL-2011 00:00:00      549.4
19-JUL-2011 00:00:00        529
20-JUL-2011 00:00:00      536.9
21-JUL-2011 00:00:00      535.2
22-JUL-2011 00:00:00      514.7
23-JUL-2011 00:00:00      481.4
24-JUL-2011 00:00:00      480.3
25-JUL-2011 00:00:00      484.7
26-JUL-2011 00:00:00      476.5
27-JUL-2011 00:00:00      471.7
28-JUL-2011 00:00:00      473.3
29-JUL-2011 00:00:00      480.6
30-JUL-2011 00:00:00      478.5
31-JUL-2011 00:00:00      453.4
01-AUG-2011 00:00:00      420.9
02-AUG-2011 00:00:00        487
03-AUG-2011 00:00:00      474.3
04-AUG-2011 00:00:00      471.6
05-AUG-2011 00:00:00      474.5
06-AUG-2011 00:00:00      480.5
07-AUG-2011 00:00:00      488.6
08-AUG-2011 00:00:00      977.3
09-AUG-2011 00:00:00      867.5
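
The jump on 8 and 9 August stands out against the preceding days. A minimal sketch of how such a jump could be flagged automatically, using the August rows of the listing above; the median baseline and the factor of 1.5 are assumptions made for this illustration, not part of the existing monitoring.

```python
from statistics import median

# August rows copied from the listing above.
ROWS = """\
01-AUG-2011 00:00:00      420.9
02-AUG-2011 00:00:00        487
03-AUG-2011 00:00:00      474.3
04-AUG-2011 00:00:00      471.6
05-AUG-2011 00:00:00      474.5
06-AUG-2011 00:00:00      480.5
07-AUG-2011 00:00:00      488.6
08-AUG-2011 00:00:00      977.3
09-AUG-2011 00:00:00      867.5
"""


def parse_rows(text):
    """Turn 'DAY TIME SIZE_GB' lines into (day, size_gb) tuples."""
    data = []
    for line in text.splitlines():
        day, _time, size = line.split()
        data.append((day, float(size)))
    return data


def flag_spikes(data, factor=1.5):
    """Return the days whose volume exceeds factor times the median volume."""
    baseline = median(size for _day, size in data)
    return [(day, size) for day, size in data if size > factor * baseline]


spikes = flag_spikes(parse_rows(ROWS))
for day, size in spikes:
    print(day, size)
```

With these rows the median is 480.5 GB, so only 08-AUG-2011 and 09-AUG-2011 exceed the threshold.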

Our monitoring system sent email warnings about this situation on Tuesday morning.
A separate monitoring tool, which regularly pings the databases and alerts the DBAs by SMS, did not send an SMS because connectivity was not lost.

Solution

The low performance of TSM609 was due to many streams from several databases trying to catch up on the previous days' backups. Some of these streams were stopped on Tuesday so that ACCLOG could catch up. The situation is now back to normal.

The monitoring script which sends the SMS was enhanced on Thursday, August 11th, to include transactions and thus better detect such cases.
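
The difference between a connectivity check and a transaction check can be illustrated as follows. This is a sketch of the general idea only, not the actual monitoring script: `check_database`, `connect` and `run_test_transaction` are hypothetical names, the callables are supplied by the caller, and the timeout values are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def check_database(connect, run_test_transaction, timeout_s=30.0):
    """Classify a database as 'ok', 'unreachable', 'hung' or 'failed'.

    connect and run_test_transaction are hypothetical callables provided
    by the caller; nothing here is specific to Oracle.
    """
    try:
        conn = connect()  # roughly what a ping-style check verifies
    except Exception:
        return "unreachable"

    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(run_test_transaction, conn)
    try:
        future.result(timeout=timeout_s)
        return "ok"
    except FutureTimeout:
        return "hung"  # connected, but the transaction did not complete in time
    except Exception:
        return "failed"
    finally:
        pool.shutdown(wait=False)  # do not block on a stuck worker


# Stubs standing in for a healthy and a frozen database (illustrative only).
healthy = check_database(lambda: object(), lambda conn: None, timeout_s=1.0)
frozen = check_database(lambda: object(), lambda conn: time.sleep(0.5), timeout_s=0.1)
print(healthy, frozen)
```

A check that only connects would report both stub databases as reachable; adding a transaction with a timeout is what separates the frozen one.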

We have also changed the file system where archivelogs are retained until they are sent to tape, so that at the current archivelog rate we can keep more than 4 days of archivelogs without sending them to tape. At the next technical stop, at the end of August, a new storage system with higher capacity will be used, in order to provide more autonomy in case TSM is unavailable. It should be noted that while the redolog files are stored only on the disk setup, the database is at risk of not being able to recover from a user error or a host/database incident.
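
As a rough sizing aid, the relation between disk capacity, daily archivelog volume and autonomy is a simple ratio. The sketch below uses two daily figures taken from the ACCLOG listing in the root cause section (480.5 GB on 06-AUG-2011 and 977.3 GB on 08-AUG-2011); the function names are illustrative, and the actual capacity of the new file system is not stated here, so it is left as a parameter.

```python
def autonomy_days(capacity_gb, daily_archivelog_gb):
    """Days of archivelogs that fit on disk when nothing is sent to tape."""
    return capacity_gb / daily_archivelog_gb


def capacity_needed_gb(days, daily_archivelog_gb):
    """Disk space needed to hold the given number of days of archivelogs."""
    return days * daily_archivelog_gb


TYPICAL_GB_PER_DAY = 480.5  # 06-AUG-2011
PEAK_GB_PER_DAY = 977.3     # 08-AUG-2011

print(capacity_needed_gb(4, TYPICAL_GB_PER_DAY))  # space for 4 days at the typical rate
print(capacity_needed_gb(4, PEAK_GB_PER_DAY))     # space for 4 days at the peak rate
```

Four days at the typical rate is about 1.9 TB, while four days at the peak rate is about 3.9 TB, so the autonomy obtained from a given capacity roughly halves during a burst like the one on 8 August.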

We will be proposing an alternative solution for ensuring higher availability, based on the disk subsystem.

-- RubenGaspar - 11-Aug-2011

Topic revision: r3 - 2011-08-11 - RubenGaspar
 