Qualiac and other long-term export files deleted from TSM

Description

Following a deletion request sent to the TSM team, the itrac1112 node was deleted along with old, unused filesystem backup nodes. This node had also been used to store export dump archives generated after successful recoveries. These archives provided long-term retention for some databases: mostly 1 or 2 years, but 10 years for the Qualiac and Castor Name Server databases.

Impact

  • Most of the files older than 7 months have been lost; only around 20% of them have been restored. Files newer than 7 months are still available, as they are archived elsewhere.

Time line of the incident

  • 18-05-2015 11:11:01 - Deletion request created in SNOW (RQF0457689)
  • 26-05-2015 11:19:00 - JIRA ticket (AISIT-1749) created to restore some Qualiac data going back 3 years
  • 26-05-2015 12:41:00 - After failed attempts to restore the data, SNOW ticket (INC0793069) opened with the TSM team
  • 26-05-2015 14:00:12 - Deletion request ticket resolved by TSM team (backup nodes deleted)
  • 26-05-2015 16:02:32 - TSM team pinged about INC0793069
  • 26-05-2015 16:28:47 - TSM team replied that ticket INC0793069 had not been handled because nobody was on support
  • 26-05-2015 16:36:00 - TSM team realized that the data had been deleted and informed us; analysis of the situation started
  • 26-05-2015 afternoon/evening - checked whether the data could be restored from other sources (filesystem backups, offsite server etc.)
  • 26-05-2015 22:39:03 - TSM team asked to confirm whether archives are deleted along with filesystem backups when a node is deleted, without their retention being checked
  • 27-05-2015 09:20:18 - TSM team confirmed that archives were deleted; discussions started on how to restore the files
  • 27-05-2015 ~11:00:00 - AIS people warned about the possibility of data loss
  • 27-05-2015 14:54:04 - SNOW ticket (INC0793919) to restore deleted files created after discussion with TSM team
  • 27-05-2015 16:23:48 - TSM team replied that the tapes used to back up the files cannot be identified from the logs, which do not go back before September 2014
  • 28-05-2015 morning - further discussions with TSM team about possibilities to restore the data
  • 28-05-2015 12:29:55 - TSM started a metadata restore to allow us to restore the deleted files
  • 28-05-2015 16:45 - commented out the job deleting old export dumps from the offsite server, to keep them there for a longer period
  • 29-05-2015 09:26:13 - metadata restore still ongoing; it later turned out to have finished, even though the import did not complete properly (it appeared to hang)
  • 29-05-2015 ~10:00 - informed by TSM team that restores should be possible; working with them on configuring a server for that purpose
  • 29-05-2015 ~11:00 - started restores of Qualiac files still residing on tapes not yet reused; second session started for files of other databases
  • 29-05-2015 13:54 - 16 Qualiac files restored, the oldest file from 13.03.2013
  • 29-05-2015 15:00-18:00 - restored files copied to offsite backup server
  • 29-05-2015 ~17:00 - second restore session still has much work to do; discussed with TSM possible ways to make it faster; restarted as 2 separate sessions
  • 30-05-2015 08:39:20 - sessions failed and were restarted a few times, but finally 56 additional files were restored; TSM team informed so that they can do the cleanup on their side
  • 31-05-2015 ~12:50 - copying of the other files to the offsite backup server started, as well as archiving to the proper TSM node (for all files)
  • 01-06-2015 - offsite copy of the other files finished, as well as archiving to tapes
  • 01-06-2015 ~12:00 - corrected recovery system code to generate and log proper metadata about the TSM server to which files have been archived
  • 02-06-2015 ~12:30 - metadata about the restored files that were manually sent to TSM entered into the database

Analysis

  • Exports taken after successful recoveries have been used for many years to provide long-term backups where requested by users. For short periods they were kept on the offsite backup server disks; for the long term they were archived to TSM, with an agreed retention set for each file independently. Unfortunately, all the files were sent to TSM nodes that were also used for filesystem backups. These nodes were different for each recovery server and were configured and managed by another team. The fact that the nodes were needed for more than filesystem backups was missed: during the migration of recovery servers a cleanup removed the old filesystem backup nodes, which also deleted the archives kept there. In November 2014 this setup had already been changed by creating a node dedicated to archiving export files, DBRECOVERY_EXP. However, the Oracle B&R team was not aware that old data was being deleted; the request for a new dedicated TSM node was made only to ease management while configuring new recovery servers. For around 7 months all files have been sent there, but the older ones are lost, except those we were able to restore with the help of the TSM team.
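
The missing safeguard described above (checking a node for archives whose retention has not expired before deleting it) could look roughly like the sketch below. It is only an illustration: the `ArchiveRecord` structure, the function names, the file names and the sample inventory are all hypothetical and are not taken from the recovery system or from TSM; in practice the records would have to come from the recovery system's own metadata or from a query against the TSM server.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ArchiveRecord:
    """Hypothetical metadata row for one archived export dump."""
    node: str            # TSM node name the file was archived to
    filename: str        # export dump file name (made up here)
    archived_on: date    # date the file was archived
    retention_days: int  # retention agreed for this particular file


def unexpired_archives(records, node, today):
    """Return the records on `node` whose retention has not yet expired."""
    return [
        r for r in records
        if r.node.upper() == node.upper()
        and r.archived_on + timedelta(days=r.retention_days) > today
    ]


def safe_to_delete(records, node, today):
    """A node is safe to delete only if it holds no unexpired archives."""
    return not unexpired_archives(records, node, today)


# Illustrative, made-up inventory (not the real archive metadata).
records = [
    ArchiveRecord("ITRAC1112", "qualiac_exp_a.dmp", date(2013, 6, 1), 3650),
    ArchiveRecord("ITRAC1112", "other_exp_b.dmp", date(2013, 1, 10), 365),
    ArchiveRecord("OLDFSNODE", "other_exp_c.dmp", date(2012, 5, 1), 365),
]
today = date(2015, 5, 18)

for node in ("ITRAC1112", "OLDFSNODE"):
    blocking = unexpired_archives(records, node, today)
    if blocking:
        print(f"{node}: do NOT delete, {len(blocking)} archive(s) still under retention")
    else:
        print(f"{node}: no unexpired archives found")
```

A check of this kind only helps if the inventory of archived files is kept outside the node being deleted, which is one reason the correctness of the archive metadata matters.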

Follow up

  • Restore of 72 files performed (around 20% of deleted files)
  • Restored files copied to the offsite server and sent back to the proper TSM node (DBRECOVERY_EXP)
  • Decision made to keep all exports on the offsite backup server disks as well, until another solution is agreed and implemented (if any) -> job cleaning old exports from the offsite server disabled, NAS volume resized
  • Found and corrected a bug that caused the metadata of sent exports to record the wrong TSM server (this concerns only the correctness of the metadata, as the files were sent to the correct place)
  • All files archived to DBRECOVERY_EXP have been restored and sent to the offsite server

-- SzymonSkorupinski - 2015-06-02

Topic revision: r2 - 2015-06-19 - SzymonSkorupinski
 