Extended downtime of the LHCb online DB following the power cut of August 9th, 2010

Description

The LHCb online database was down from Monday 9-8-2010 at 21:41 until Tuesday 10-8 at 14:15. The original cause was a power cut at the LHCb online pit; when power was restored, the database was found to be corrupted. The service was restored with a switchover to the standby on Tuesday 10-8 at around 14:15.

Impact

  • The LHCBONR database was not available during the power cut and the time subsequently needed to recover the DB. This blocked LHCb online activities, as the experiment could not set up the detector after power was restored without a working database. No physics data could be gathered. Transaction loss was minimal thanks to the use of the standby DB (estimated at less than 1 minute of transactions lost).

Timeline of the incident

Sequence of events:
  • 21:41 - Power is cut
    • storage arrays a16f01 (with BBU) and a16f02 (without BBU) remain on UPS
    • storage array a16f03 (without BBU) loses all cache contents (around 1.26 GB, as the cache is kept 63% full and the cache size is 2 GB)
    • storage array a16f04 is powered off, but its cache data is protected by the BBU
    • DB servers (lbrac01-lbrac04) were protected by UPS but rebooted later, mainly due to losing access to the Oracle Clusterware voting disks
    • ... neither IT-DB-DSA nor LHCb is aware at this point that some of the storage arrays do not have BBUs

  • 21:44 - IT-DB piquet starts receiving emails and SMSes from the monitoring systems reporting that the LHCBONR cluster nodes are not reachable. Alerts continue every 6 minutes (the monitoring run interval).

  • 22:02 - IT-DB piquet sends an email to Lbonline.Support@cern.ch to check whether this is a GPN network connectivity problem (LHCb had been experiencing those before). No reply - email cannot be retrieved in the pit without network connectivity, which was lost in the power cut. The LHCb TWiki is not accessible either, so there is no way to check their contacts and procedures.

  • around 23:00 - IT-DB piquet calls the LHCb online piquet to check what is happening in the pit. Notified about the power cut and the fact that power and connectivity may not return for 1-2 hours. Nothing else to do at that point; waiting to be notified. Computer Operations notified at 23:53.

  • 02:55 (10th Aug) - Phone call from Niko (LHCb): power was restored some time ago, but the DB will not start up. Assessing the situation: DB not able to start, some storage may not be visible correctly - clean reboot of all 4 nodes to start from a fresh storage state.

  • 03:15 - 03:40 - checking all nodes after the restart: storage visible, clusterware started, but the DB won't start. The ASM recovery diskgroup holding archived redo logs and backups is damaged beyond repair - a lost write in the metadata - and must be recreated. The LUNs holding the disaster-recovery compressed backupsets are fine, but require manual mounting, as the clusterware cannot mount them without a working DB. The ASM data diskgroup seems fine, as it mounted without problems. Niko and Pdb.Service notified about the findings. Scratching the damaged ASM diskgroup, recreating it, and duplicating the controlfile.
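
    For reference, scratching and recreating a damaged ASM diskgroup and putting a copy of the controlfile back on it follows roughly the sketch below; the diskgroup, disk, and file names are illustrative assumptions, not the actual LHCBONR configuration:

      -- on the ASM instance (names are illustrative)
      SQL> DROP DISKGROUP RECOVERY INCLUDING CONTENTS;
      SQL> CREATE DISKGROUP RECOVERY EXTERNAL REDUNDANCY
             DISK '/dev/mpath/reco_lun1', '/dev/mpath/reco_lun2';

      -- from RMAN: restore a controlfile copy onto the recreated diskgroup
      RMAN> RESTORE CONTROLFILE FROM '+DATA/lhcbonr/controlfile/current.260.1';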

  • 03:40 - it appears that recovering the DB requires restoring some archived redo logs from backup and from the standby. Using the ext3 compressed backups and the standby DB.
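
    Restoring archived redo logs from the disk backups with RMAN follows roughly this pattern (the backup path and log sequence numbers are illustrative):

      RMAN> CATALOG START WITH '/backup/lhcbonr/';   -- register the ext3 backupsets
      RMAN> RESTORE ARCHIVELOG FROM SEQUENCE 12345 UNTIL SEQUENCE 12360;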

  • ... in the meantime - trying to force the DB to perform a typical crash recovery, which should take only minutes, but it crashes with errors about inconsistent datafiles ('WARNING! Recovering data file 18 from a fuzzy backup. It might be an online backup taken without entering the begin backup command.'). It also complains about datafile 54, and the undo tablespace throws errors in RMAN. Difficult to diagnose - a very unusual case.
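
    The 'typical crash recovery' referred to here is what Oracle normally performs on startup after an instance crash; forcing it by hand amounts roughly to this sketch:

      SQL> STARTUP MOUNT;
      SQL> RECOVER DATABASE;     -- roll forward using the online redo logs;
                                 -- this step produced the warnings quoted above
      SQL> ALTER DATABASE OPEN;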

  • 04:11 - update to Niko and Pdb.Service

  • ... still trying to force the DB to do a proper recovery. Looking at different scenarios.

  • around 04:28 - Found corrupted data in the online redo logs; datafile 54 was reporting errors that prevented recovery (corruption of this datafile suspected) - decision to restore datafile 54 from backup. Still no knowledge of the missing BBUs at this point.
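
    Restoring and recovering a single datafile from backup is a standard RMAN operation (a minimal sketch of the generic pattern, not the exact commands used):

      RMAN> RESTORE DATAFILE 54;   -- this is the step that ran for 1h19m
      RMAN> RECOVER DATAFILE 54;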

  • 04:52 - update to Computer Operations

  • 05:55 - restore of datafile 54 completed (after 1h19m); re-running recovery, which fails again.

  • 06:15 - As a last resort, forcing recovery with 'ALTER DATABASE RECOVER database using backup controlfile'. Some datafiles recovered, but then 'ORA-10567: Redo is inconsistent with data block (file# 153, block# 2872412)' - together with the previous problems, a clear indication of widespread corruption.
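
    The last-resort command above is part of the usual incomplete-recovery sequence with a backup controlfile (sketch):

      SQL> RECOVER DATABASE USING BACKUP CONTROLFILE UNTIL CANCEL;
      -- supply archived/online redo logs when prompted; type CANCEL to stop
      SQL> ALTER DATABASE OPEN RESETLOGS;
      -- in this incident the recovery step itself aborted with ORA-10567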

  • ... in the meantime, some investigation around the storage arrays and the discovery of the missing BBUs! We can no longer trust the data on the ASM data diskgroup. Read-only datafiles should not be corrupted, but the rest cannot be trusted.

  • around 7 a.m. - IT-DB piquet consults the other DBAs in the team - decision to do a full restore from the ext3 compressed backups. Before this we copy all read-only datafiles to the recovery diskgroup (in order to speed up the restore) and then scratch the data diskgroup.
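
    Copying the read-only datafiles aside before scratching the data diskgroup can be done with RMAN image copies, roughly as follows (the file number and diskgroup name are illustrative):

      RMAN> BACKUP AS COPY DATAFILE 17 FORMAT '+RECOVERY';  -- repeat per read-only datafile
      RMAN> SWITCH DATAFILE 17 TO COPY;                     -- point the controlfile at the copy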

  • around 8:20 - we meet at work to discuss restore scenarios and I hand over all my findings and the steps taken so far.

  • around 9 a.m. - the read-only datafiles are copied and the data diskgroup is re-created; we are preparing for a full restore. After further discussion we launch the full restore around 9:50. Estimated time: up to 4 hours.
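
    The full restore itself is a standard RMAN operation; with several disk channels it looks roughly like this (channel count illustrative):

      RMAN> RUN {
              ALLOCATE CHANNEL c1 DEVICE TYPE DISK;
              ALLOCATE CHANNEL c2 DEVICE TYPE DISK;
              RESTORE DATABASE;
            }

    Note that restoring compressed backupsets is CPU-intensive on decompression, which is consistent with the restore later turning out to be CPU-bound (see the entry around noon).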

  • ... in the meantime, discussing with Niko which scenario to use and how recent he would like the recovered DB to be. Restoring to just before the crash can take more time, so we decided to use the standby DB for that.

  • around noon - the restore is running much slower than expected (CPU-bound on the server nodes). We consider using the standby for disaster recovery. We will need some help from LHCb to convert all the TNS entries to new DNS aliases that we will introduce. Restore estimates show that it may finish only just before 22:00. In the meantime we prepare the standby DB and open it read-only to test whether data can be retrieved from the TN in the pit - tests successful. We decide to open the standby read/write once it is ready. IT MOD updated with the details.
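
    Opening a physical standby read-only for such a test is a standard operation (sketch):

      SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;  -- stop redo apply
      SQL> ALTER DATABASE OPEN READ ONLY;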

  • 13-14h - we are preparing the standby for opening by recovering it to the latest possible time. DNS aliases are prepared and assigned to machines in the IT CC. In the meantime we prepare new TNS entries pointing at the DNS aliases.
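
    A TNS entry repointed at one of the new DNS aliases has this general shape (the alias, hostname, and service name are illustrative assumptions):

      LHCBONR =
        (DESCRIPTION =
          (ADDRESS = (PROTOCOL = TCP)(HOST = lhcbonr-dr.cern.ch)(PORT = 1521))
          (CONNECT_DATA = (SERVICE_NAME = lhcbonr.cern.ch))
        )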

  • 14:15 - the standby DB is ready, recovered to a point in time less than 1 minute before the power cut (we may have lost only a few tens of seconds of data). The new TNS entries are updated on AFS and propagated to DFS. Preparing scripts to create all the DB services on the standby. Services created and started.
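
    Opening the standby read/write amounts to a failover-style activation, and the DB services can be scripted with DBMS_SERVICE, roughly as below (the service name is illustrative):

      SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
      SQL> ALTER DATABASE ACTIVATE STANDBY DATABASE;
      SQL> ALTER DATABASE OPEN;

      SQL> EXEC DBMS_SERVICE.CREATE_SERVICE(service_name => 'LHCBONR_APP', network_name => 'LHCBONR_APP');
      SQL> EXEC DBMS_SERVICE.START_SERVICE('LHCBONR_APP');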

  • 15:00 - some minor issues with the listener and the memory configuration found and fixed. All services are now up and running, and we see applications in the pit connecting to the opened standby DB. The LHCBONR DB is fully functional again on the standby servers. Only applications not using the DFS or AFS tnsnames.ora (locally cached copies) cannot connect. Niko is investigating to identify and fix them.

  • 15:20 - update to IT MOD. Started working on Oracle Streams replication for the LHCb databases.

  • 16:30 - Oracle Streams replication is fixed for LHCb. MOD notified.
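
    Fixing the Streams replication after the failover typically ends with restarting the capture and apply processes, e.g. (the process names are illustrative):

      SQL> EXEC DBMS_CAPTURE_ADM.START_CAPTURE('LHCB_CAPTURE');
      SQL> EXEC DBMS_APPLY_ADM.START_APPLY('LHCB_APPLY');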

Analysis

  • The origin of the issue that corrupted the LHCBONR database after the power cut has been identified: 2 missing BBUs in 2 out of the 4 storage arrays, meaning that the write caches of those arrays were not protected against power failure. This has caused major data and metadata corruption. After follow-up with LHCb it appears that the BBUs were missing because they were not put back in place after previous maintenance work performed by LHCb on their storage arrays. Moreover, the cache configuration of the arrays in question had been left set to the write-back policy, which relies on the write cache for performance.

Follow up

  • We will make sure that the BBU units are put back into the storage arrays, and we will ask the LHCb sysadmins to implement monitoring of the BBUs and of the cache configuration.
    • update - the BBUs were put back on the 12th of August at 4 p.m.; all controllers are now protected and the caches are working in write-back mode.
  • DB services will be switched back to the hardware in the pit once it is restored to full functionality. Currently scheduled for the week of August 16th.
    • update - LHCBONR switched back successfully to the DB in pit at 10:55 on 16th of Aug 2010.

Documents

-- DawidWojcik - 12-Aug-2010

Topic attachments
  • 2009-02-05-Online-Db-Service-v15-LHCb.docx (29.7 K, 2010-08-16, LucaCanali) - agreement with LHCb online on DB support