CMSONR database blockage during snapshot taking (in progress)
Description
On Tuesday a snapshot requested by CMS was finishing its export from the CMSONR database. The job was not able to finish and caused library cache locks on the database. The first connection timeouts to cms_pvss_confdb were reported by CMS by phone. The database was checked and was working properly at the time, so the problem was assumed to be a transient issue and a client reconnection problem.
After 40 minutes further timeouts appeared: it was not possible to connect to the database, and already connected sessions were experiencing locks. Due to these problems CMS was not able to configure their run for 1.5 hours.
Impact
- The CMS run was affected: it could not be started for 1.5 hours
Time line of the incident
- 22-May-12 09:10 - K. Dziedziniewicz started the snapshot job
- 22-May-12 10:25 - The export dump file for the snapshot was created
- 22-May-12 10:25 - A Racmon DB monitoring email reported timeouts connecting to the cms_pvss_conddb service
- 22-May-12 10:30 - Frank Glege from CMS Online called K. Dziedziniewicz reporting connection problems to PVSS
- 22-May-12 10:35 - K. Dziedziniewicz checked DB connectivity and performance; both were fine and the DB was accessible. DB monitoring was also not reporting any issues.
- 22-May-12 10:40 - K. Dziedziniewicz called Frank Glege, who confirmed the DB was reachable again and that the clients were working on a reconnection issue
- 22-May-12 11:20 - Z. Baranowski and K. Dziedziniewicz were investigating issues with snapshots of online schemas, which seemed to be hanging. At the same time a message arrived from DB monitoring that connections to the CMSONR database were timing out. The DB team started the investigation.
- 22-May-12 11:30 - Z. Baranowski found that there was a library cache lock in the database on the dbms_logrep_util package. The lock was caused by the sys user and the snapshot job trying to access the package (undocumented, extremely powerful) at the same time.
- 22-May-12 11:32 - Z. Baranowski killed the session creating the snapshot; the DB unblocked and started to work properly
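Killing the blocking session was presumably done with a standard ALTER SYSTEM command along these lines (a sketch; the sid and serial# values are purely illustrative):

   ALTER SYSTEM KILL SESSION '123,45678' IMMEDIATE;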
Analysis
- The Data Pump job used to export the snapshot data did not finish properly. The dump file was created, but the job cleanup hung while cleaning up its queues. There was a library cache lock on the dbms_logrep_util package caused by the activity of the Data Pump jobs together with an additional maintenance job which runs once a week (Tuesday 9:30). The appearance of library cache locks has an extremely bad influence on DB performance and creates an avalanche effect.
Due to this lock, already existing sessions got blocked and new sessions were not able to connect to a number of services (though this was transient). Why this lock appeared is to be investigated with Oracle.
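Contention of this kind can typically be seen directly in V$SESSION; a minimal diagnostic sketch using standard columns (no site-specific views assumed):

   -- sessions waiting on library cache locks/pins and, where resolvable,
   -- the session blocking them
   SELECT sid, serial#, username, program, event,
          blocking_session, seconds_in_wait, sql_id
     FROM v$session
    WHERE event IN ('library cache lock', 'library cache pin')
    ORDER BY seconds_in_wait DESC;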
When the library cache contention occurred on CMSONR, the following sessions were involved in it:
- 3 from datapump
- 1 from weekly_logmnr_shrink
- 1 or 2 MMON slaves - every 30 minutes when AWR snapshots were starting (between 10:00 and 11:53 no AWR snapshot finished successfully!). The AWR snapshot also collects some information about buffered queues (using objects dependent on DBMS_LOGREP_UTIL), which is probably why they were hanging and receiving timeouts
- 5-6 from DBSNMP and STRMMON - victims of the exclusive pin on DBMS_LOGREP_UTIL - any query on the dba_capture view was hanging
The most interesting interaction was between the 3 Data Pump sessions:
- The first one had been waiting since the end of the export on a library cache lock while executing 'select 'x' from dba_queues...'
- The second one was waiting for the first one on the event 'pin S wait on X' while running sys.kupc$que_int.detach_queues
- The third one was waiting on the unusual event 'reliable message' while executing sys.kupc$que_int.delete_queues
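The per-session picture above can be reconstructed by joining V$SESSION to V$SQL; a sketch, filtering on the module name that Data Pump sessions set:

   -- current wait event and SQL text of the Data Pump master/worker sessions
   SELECT s.sid, s.serial#, s.module, s.event, s.seconds_in_wait, q.sql_text
     FROM v$session s
     LEFT JOIN v$sql q
       ON q.sql_id = s.sql_id
      AND q.child_number = s.sql_child_number
    WHERE s.module LIKE 'Data Pump%';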
Follow up
- A snapshot mechanism allowing snapshotting from Active Data Guard was implemented. Snapshots are now taken not from the critical CMSONR database but from the ADG copy in the CC.
- The weekly maintenance job was disabled; since Streams replication has been removed from CMSONR, it is no longer needed.
- There was an attempt to reproduce the problem with a Data Pump export and the same maintenance job running together, but no contention was observed.
- There is an SR (3-5748470351) open with Oracle Support to explain this issue.
- Checks on SQL response time and on changes in the number of active sessions were added to the DB monitoring system (see the sketch below)
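A minimal sketch of that kind of probe (the actual Racmon checks and thresholds are not reproduced here; only standard V$SESSION / V$SYSMETRIC queries are assumed):

   -- number of active user sessions (a sudden jump indicates piling-up waiters)
   SELECT COUNT(*) AS active_sessions
     FROM v$session
    WHERE type = 'USER' AND status = 'ACTIVE';

   -- instance-wide SQL response time metric (centiseconds per call)
   SELECT metric_name, value, group_id
     FROM v$sysmetric
    WHERE metric_name = 'SQL Service Response Time';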
--
KateDziedziniewicz - 24-May-2012