Local filesystem filled up on two LHCBR nodes

Description

  • The local filesystem ORA/dbs00 filled up on the first and second LHCBR nodes.

Impact

  • The LHCBR database did not accept new connections for services running on the affected nodes.

Time line of the incident

  • Tuesday 02-Nov-2010 00:45 - Streams monitoring reported Capture latency on LHCBDSC.
  • Tuesday 02-Nov-2010 01:15 - 100% filesystem usage was noticed on two LHCBR nodes.
  • Tuesday 02-Nov-2010 01:30 - Old trace files were removed to free space on the filesystem. Services were relocated to their appropriate nodes.

Analysis

  • The Oracle lmd trace file grows unreasonably large. Even after it is deleted, it continues to occupy space on the local filesystem because the Oracle process is constantly writing to it. This is a known bug that was fixed in the April 2010 PSU patchset. However, because that patch had to be rolled back in an emergency (in May 2010), the issue was still affecting production systems. The log rotation script, which manages the log files, created a copy of the lmd file during its compression phase. As a result, the filesystem filled up on two machines at the same time.
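
The reason a deleted trace file keeps occupying space can be illustrated with a minimal Python sketch. It assumes a POSIX filesystem and uses a temporary file as a stand-in for the lmd trace file; the function name deleted_but_open is hypothetical and is not part of any Oracle or CERN tooling.

```python
import os
import tempfile


def deleted_but_open(size=64 * 1024):
    """Unlink a file while it is still open and report what remains.

    Returns (name_exists, bytes_still_held). POSIX semantics are assumed:
    removing the name does not release the data while a descriptor is open.
    """
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"x" * size)
        f.flush()
        os.unlink(path)                       # the directory entry is gone
        name_exists = os.path.exists(path)
        held = os.fstat(f.fileno()).st_size   # data is still reachable via the handle
    # Only after the handle is closed can the filesystem reclaim the space.
    return name_exists, held


if __name__ == "__main__":
    print(deleted_but_open())
```

On a typical Linux system the name disappears while the data stays allocated until the writing process closes the file, which matches the behaviour described above.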

Follow up

  • A workaround for deleting the lmd trace file without affecting the Oracle database was applied on all production machines. The monitoring tool has been extended with a feature that alerts the DBA on shift when a filesystem is getting full.
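
This report does not describe the workaround or the monitoring tool in detail. The sketch below is therefore only an illustration under stated assumptions: it shows one common way to reclaim space from a file that a process keeps open (truncating it in place instead of deleting it) and a simple usage check. The helper names truncate_in_place, usage_percent and needs_alert, as well as the 90% threshold, are hypothetical.

```python
import os
import shutil


def truncate_in_place(path):
    """Empty a file without unlinking it; returns the size it had before.

    The writer's open handle stays valid. Assumption: if the writer does not
    append, its next write may leave a sparse hole at the start of the file.
    """
    before = os.path.getsize(path)
    os.truncate(path, 0)
    return before


def usage_percent(mount_point):
    """Percentage of the filesystem containing mount_point that is in use."""
    total, used, _free = shutil.disk_usage(mount_point)
    return 100.0 * used / total if total else 0.0


def needs_alert(mount_point, threshold=90.0):
    """True when usage has reached the (hypothetical) alert threshold."""
    return usage_percent(mount_point) >= threshold


if __name__ == "__main__":
    print(f"/ is {usage_percent('/'):.1f}% full, alert: {needs_alert('/')}")
```

Truncating rather than deleting frees the blocks immediately, whereas a deleted file that is still open releases its space only when the process closes it.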

-- MarcinBlaszczyk - 10-Nov-2010

Topic revision: r1 - 2010-11-10 - MarcinBlaszczyk
 