TWiki> DB Web>PostMortems>PostMortem22May12 (revision 3)EditAttachPDF

CMSONR database blockage during snapshot taking (in progress)

Description

On Tuesday a snapshot requested by CMS was finishing export from CMSONR database. The job was not able to finish and caused library cache locks on the database. First connections timeouts to cms_pvss_confdb were reported by phone by CMS. It was checked, but the DB was at the time working properly, and problem was decided to be a transient issue and a reconnection problem. After 40 minutes another timeouts appeared, it was not possible to connect to database and already connected sessions were experiencing locks. Due to those problems CMS was not able to configure their run for 1.5h

Impact

  • CMS run was affected. They were not able to start the run for 1.5h

Time line of the incident

  • 22-May-12 09:10 - K. Dziedziedziniewicz started snapshot job
  • 22-May-12 10:25 - Export dumpfile was created for snapshot
  • 22-May-12 10:25 - Racmon DB monitoring email reported timeouts connecting to cms_pvss_conddb service
  • 22-May-12 10:30 - Frank Glege from CMS Onlice called K. Dziedziniewicz reporting connection problems to PVSS
  • 22-May-12 10:35 - K. Dziedziniewicz checked DB connectivity and perfomance, they were fine, DB was accessible. DB monitoring was also not reporting any issues.
  • 22-May-12 10:40 - K. Dziedziniewicz called Frank Glege, who confirmed DB is back and clients are working on reconnection issue
  • 22-May-12 11:20 - Z. Baranowski with K.Dziedziniewicz were investigating issued with snapshots of online schemas, that seemed to be hanging. At the same time message from DB monitoring arrived, that connections to CMSONR database timeout. DB team started investigation.
  • 22-May-12 11:30 - Z. Baranowski found out that there is library cache lock in the database, on dbms_logrep_util package. Lock was caused by sys user and the snapshot job trying to access the package (undocumented, extrelmy powerfull) at the same time.
  • 22-May-12 11:32 - Z. Baranowski killed session creating snapshot, DB unblocked and started to work properly

Analysis

  • Data pump job used to export snapshot data didn't finish properly. Dump file was created but job cleanup hang on cleaning queues. There was library cache lock on dbms_logrep_util package cased by activity of the datapump jobs and additional maintenance job which is run once a week (Tuesday 9.30). The apperance of library cache locks has extremly bad influence on DB performance and creates an avalanche effect.
Due to this lock session already existing got blocked and new sessions were not able to connect to a number of services (but that was transient). It is to be investigated with Oracle why this lock appeared.

Follow up

  • It was decided with CMS that snapshots will not be take during the run until there is the clear explanation of why this issue happened.
  • The weekly maintenance job was disabled, since Streams replication has been removed from CMSONR it is no longer needed.
  • There was an attempt to reproduce the problem with datapump exporting and same maintenance job running together but no contention observed afterwards.
  • There is an SR (3-5748470351) opened with Oracle Support to explain this issue.

-- KateDziedziniewicz - 24-May-2012

Edit | Attach | Watch | Print version | History: r7 | r5 < r4 < r3 < r2 | Backlinks | Raw View | Raw edit | More topic actions...
Topic revision: r3 - 2012-05-25 - ZbigniewBaranowski
 
    • Cern Search Icon Cern Search
    • TWiki Search Icon TWiki Search
    • Google Search Icon Google Search

    DB All webs login

This site is powered by the TWiki collaboration platform Powered by PerlCopyright & 2008-2021 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
or Ideas, requests, problems regarding TWiki? use Discourse or Send feedback