ACCelerators LOGging database blocked due to lack of space in archivelog area

Description

ACCLOG, an Oracle RAC database, froze because its archive log area ran out of space and the redo logs could no longer be archived.

Impact

Some applications that rely on ACCLOG, such as TIMBER, did not work for about 120 minutes. There was no data loss. This database is considered critical for the accelerators.

Time analysis

On 8th August:

06:18 Logging database ACCLOG stopped
06:58 LHC operators cannot retrieve logged data via TIMBER
08:24 Chris Roderick (BE/CO/DM) calls the IT/DB Oracle support (Nilo Segura)
08:38 Nilo found the root cause and provided a temporary fix by adding the remaining space available in the controller file system.
08:42 Logging resumed, including the backlog of buffered data

Root cause

The ACCLOG database was moved about two months ago to TSM609, a 10 Gbps TSM server located in building 613. In general, the availability and speed of backup/restore processes have improved a lot since then. Besides better performance, the new TSM server is in a different location from the database server, which is an important advantage in a disaster scenario. Nevertheless, the accelerators logging backup has been suffering from problems since 14/07/2011 (INC052661), when we first reported issues to TSM support.

We have had several iterations with TSM support, working closely with them to correct this situation, but problems arose intermittently. Some of them were reported via support cases (INC055600, INC056733, INC057682). Cooperation with TSM support has been productive.

Since last weekend we have again had issues while backing up ACCMEAS and ACCLOG, the two databases on this TSM server, TSM609. Last Sunday we started to transfer archivelogs from ACCMEAS to our big SATA storage because its archivelog area (~730 GB) was getting full: the archivelog production of this database went up to almost 1 TB during the weekend. During this period problems were also observed on ACCLOG, but the database always managed to send its data to TSM in the end, so our attention was focused on ACCMEAS. Early on Tuesday morning TSM609 was not available and, as a result, ACCLOG failed to transfer its archivelogs. This, together with increased activity on the database, led to exhaustion of the database archivelog area, which froze the database.

Production of archivelogs per day in ACCLOG:

DAY                     SIZE_GB
-------------------- ----------
11-JUL-2011 00:00:00      409.3
12-JUL-2011 00:00:00      456.9
13-JUL-2011 00:00:00      474.1
14-JUL-2011 00:00:00      476.7
15-JUL-2011 00:00:00      480.3
16-JUL-2011 00:00:00      478.1
17-JUL-2011 00:00:00        523
18-JUL-2011 00:00:00      549.4
19-JUL-2011 00:00:00        529
20-JUL-2011 00:00:00      536.9
21-JUL-2011 00:00:00      535.2
22-JUL-2011 00:00:00      514.7
23-JUL-2011 00:00:00      481.4
24-JUL-2011 00:00:00      480.3
25-JUL-2011 00:00:00      484.7
26-JUL-2011 00:00:00      476.5
27-JUL-2011 00:00:00      471.7
28-JUL-2011 00:00:00      473.3
29-JUL-2011 00:00:00      480.6
30-JUL-2011 00:00:00      478.5
31-JUL-2011 00:00:00      453.4
01-AUG-2011 00:00:00      420.9
02-AUG-2011 00:00:00        487
03-AUG-2011 00:00:00      474.3
04-AUG-2011 00:00:00      471.6
05-AUG-2011 00:00:00      474.5
06-AUG-2011 00:00:00      480.5
07-AUG-2011 00:00:00      488.6
08-AUG-2011 00:00:00      977.3
09-AUG-2011 00:00:00      867.5
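
The jump on 08-AUG in the figures above can be quantified with a simple comparison against the preceding week. The following is only a minimal sketch, not part of the existing monitoring: the function name and the alert threshold are hypothetical, and the values are copied from the table (01-AUG to 08-AUG).

```python
from statistics import median

# Daily archivelog volume in GB, copied from the table (01-AUG to 08-AUG 2011).
daily_gb = [420.9, 487.0, 474.3, 471.6, 474.5, 480.5, 488.6, 977.3]

# Hypothetical threshold: alert when a day exceeds 1.5x the recent median.
ALERT_FACTOR = 1.5


def spike_ratio(history, today):
    """Return today's volume divided by the median of the preceding days."""
    return today / median(history)


ratio = spike_ratio(daily_gb[:-1], daily_gb[-1])
print(f"08-AUG is {ratio:.2f}x the 7-day median; alert={ratio > ALERT_FACTOR}")
```

With these figures the ratio comes out at about 2.06, i.e. archivelog production roughly doubled compared with the previous seven days.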

Our monitoring system sent email warnings about this situation on Tuesday morning.
A monitoring tool that regularly pings the databases and sends an SMS to the DBAs kept succeeding because connectivity was not lost; only an actual commit would have detected this issue. This is being improved.
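
As an illustration of a check based on a commit rather than on connectivity, here is a minimal sketch of a write probe with a timeout. It is not the actual monitoring tool: `write_probe` and `do_write` are hypothetical names, and `do_write` stands for whatever callable inserts a heartbeat row and commits on the monitored database.

```python
import threading


def write_probe(do_write, timeout_s=10.0):
    """Run one write-and-commit attempt; return 'ok', 'error: ...' or 'timeout'."""
    result = {}

    def worker():
        try:
            do_write()  # hypothetical: INSERT a heartbeat row, then COMMIT
            result["status"] = "ok"
        except Exception as exc:
            result["status"] = f"error: {exc}"

    # Daemon thread: a commit that hanges forever must not block the monitor.
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout_s)
    if t.is_alive():
        return "timeout"  # the write did not finish in time
    return result["status"]
```

A probe of this kind would report `timeout` when the write cannot complete, even though a plain connection test still succeeds. The SMS alert would then be sent on any status other than `ok`.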

Solution

The low performance of TSM609 was due to many streams from several databases trying to catch up on the previous days' backups. Some of these streams were stopped on Tuesday so that ACCLOG could catch up. The situation is now fine.

We have also changed the file system where archivelogs are retained until they are sent to tape, so that at the current archivelog rate we can keep at least 4 days of archivelogs without sending them to tape.
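
The retention figure is a matter of simple arithmetic: capacity divided by daily archivelog volume. The sketch below is an assumption-laden illustration only. The size of the new file system is not stated here, so the code works out what 4 days would require at two rates taken from the table (07-AUG as a normal day, 08-AUG as the peak). The function names are hypothetical.

```python
def capacity_needed_gb(days, daily_gb):
    """Space needed to hold `days` of archivelogs at `daily_gb` per day."""
    return days * daily_gb


def retention_days(capacity_gb, daily_gb):
    """Days of archivelogs that fit in `capacity_gb` at `daily_gb` per day."""
    return capacity_gb / daily_gb


NORMAL_GB = 488.6  # 07-AUG-2011, from the table
PEAK_GB = 977.3    # 08-AUG-2011, from the table

normal_4d = capacity_needed_gb(4, NORMAL_GB)
peak_4d = capacity_needed_gb(4, PEAK_GB)
print(f"4 days at normal rate: {normal_4d:.1f} GB")
print(f"4 days at peak rate:   {peak_4d:.1f} GB")
# An area sized for 4 normal days lasts only about 2 days at the peak rate.
print(f"normal-sized area at peak rate: {retention_days(normal_4d, PEAK_GB):.1f} days")
```

The point of the comparison is that the retention window shrinks in proportion to the archivelog rate, so the rate used for sizing matters as much as the capacity itself.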

-- RubenGaspar - 10-Aug-2011

Topic revision: r2 - 2011-08-10 - RubenGaspar
 