DBoD VOMS database unavailable

Description

The DBoD VOMS database was unavailable on Saturday 18th October from 05:54 to 17:10.

Impact

  • The CERN VOMS server could no longer issue proxies because its database was unavailable.

Time line of the incident

  • 18-Oct-14 05:54 - Virtual machine dbvrtg4037 hosting 3 databases (VOMS, lxr, alvmysql) crashed. Databases could not be re-started as files on NAS volumes were locked or marked as locked.
  • 18-Oct-14 10:47 - DBoD team is notified that there is a problem with the database (SNOW ticket). DBoD service is only covered during working hours.
  • 18-Oct-14 17:30 - Ignacio from the DBoD team starts investigating the problem and how it could be resolved.
  • 18-Oct-14 19:00 - After the standard workaround failed, Ignacio recreated the locked datafiles as new files with new inodes, no locks, and the same contents as the originals.
  • 18-Oct-14 19:06 - Eva restarts the database through the GUI; the restart succeeds thanks to the work done by Ignacio.
  • 18-Oct-14 19:12 - Eva updates the GGUS ticket asking someone to check if the database is now accessible.
  • 18-Oct-14 19:25 - Got confirmation that the issue seems to be solved.
  • 18-Oct-14 20:36 - GGUS ticket is updated: everything is fine on the server side.
  • 20-Oct-14 09:00 - Re-check of lock status on all the affected NAS volumes, and clean-up of the temporary files created while duplicating the affected datafiles during the intervention.

Analysis

  • The Oracle VM guest dbvrtg4037 crashed:
    • The guest was actually killed following a SIGSEGV (i.e. a segmentation fault in the qemu process), as we can see in the hypervisor logs:
      [2014-10-18 05:54:52 4572] WARNING (image:490) domain 2130_dbvrtg4037: device model failure: pid 1619: died due to signal 11; see /var/log/xen/qemu-dm-2130_dbvrtg4037.log 
      [2014-10-18 05:54:53 4572] WARNING (XendDomainInfo:1907) Domain has crashed: name=2130_dbvrtg4037 id=2.
      
    • The guest was correctly restarted about 5 seconds later at 05:54:58:
      [2014-10-18 05:54:58 4572] DEBUG (XendDomainInfo:117) XendDomainInfo.create_from_dict({ [...] 'name_label': '2130_dbvrtg4037' [...] })
      [2014-10-18 05:54:59 4572] DEBUG (XendDomainInfo:2327) XendDomainInfo.constructDomain
      
    • No other evidence has been found in either the hypervisor or the guest logs and kernel messages.
    • Note that this was the first time we have experienced a SIGSEGV crash of a guest in over 4 years running Oracle VM in production. No core dumps were captured as the behaviour on crash is to restart the guest as soon as possible.
  • Lock files issue:
    • This is a known issue which happens on NFS volumes when the host suddenly crashes, leaving behind file locks on said volumes.
    • Typically, after a restart the locks can be removed with already implemented procedures. In this case they failed because they were run with insufficient privileges. This was unknown to Ignacio at the time of the intervention and has since been added to the operational documentation.
    • (What follows is slightly theoretical.) In the case of virtual machines like dbvrtg4037, which use paravirtualization drivers for network access, the locks may be held closer to the hypervisor level, which may make them harder to remove. A hard reset of the virtual machine might have helped, but it wasn't possible at the time.
    • The solution which was implemented is as follows: as the locks on the affected datafiles refer to their inode entries, each file is recreated by renaming it (renaming preserves the locked inode) and then creating a new file with the old name and exactly the same contents and metadata as the original (same data blocks, new unlocked inodes).
  • DBoD service coverage is limited to working hours. This is clearly defined in the manifesto which is signed before a DBoD instance is created. The issue was fixed on Saturday thanks to Ignacio; otherwise it might have had to wait until Monday morning.
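
The rename-and-recreate workaround described under the lock files issue can be illustrated with a minimal Python sketch. It is not the procedure the DBoD team actually ran: the function name, the ".locked" suffix and the use of a plain byte copy are assumptions made for illustration. On the NAS the copy may well have been done with a storage-level mechanism, and a plain copy like this one does not preserve file ownership.

```python
import os
import shutil


def recreate_with_new_inode(path, suffix=".locked"):
    """Give `path` a fresh inode while keeping its contents.

    Hypothetical helper for illustration only. The original file is
    renamed, so it keeps its inode (and any stale locks that refer to
    that inode). A copy is then created under the original name, which
    necessarily gets a different inode.

    Returns the path of the renamed original, to be cleaned up later.
    """
    old_inode = os.stat(path).st_ino
    renamed = path + suffix  # assumed naming convention

    # Renaming within the same filesystem keeps the inode unchanged.
    os.rename(path, renamed)

    # copy2 copies the data plus permission bits and timestamps.
    # It does NOT preserve owner/group; that would need os.chown.
    shutil.copy2(renamed, path)

    if os.stat(path).st_ino == old_inode:
        raise RuntimeError("new file unexpectedly reuses the old inode")
    return renamed
```

The renamed originals correspond to the temporary files that were cleaned up once the lock status of the volumes had been re-checked; the database must be stopped while the files are being swapped.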

Follow up

  • DBoD operations instructions updated with the procedure to fix file locks in case this happens again.
  • We are evaluating whether to enable the capturing of crash dumps on the guests.

-- EvaDafonte - 21 Oct 2014

Topic revision: r5 - 2014-10-22 - unknown
 