Unavailability of the CMS offline production DB (CMSR), 28th April 2011

Description

The CMS offline production database (CMSR) became unresponsive on Thursday 28th April 2011 at around 23:45. The service was unavailable for about an hour and a half.

Impact

  • The database was unresponsive for approx. 1.5 hours
  • List of affected services includes: CMS_ALCT, CMS_ANODE_BOARDS, CMS_C2K, CMS_COND, CMS_CSC, CMS_DBS, CMS_DBS_WRITER, CMS_EMU_CERN, CMS_EMU_FAST, CMS_EMU_HV, CMS_HCL, CMS_INTEGRATION, CMS_LUMI_PROD_OFFLINE, CMS_PVSS, CMS_PXL, CMS_SSTRACKER, CMS_T0AST, CMS_TEC_LYON, CMS_TESTBEAM, CMS_TRANSFERMGMT, CMS_TRANSFERMGMT_SC, CMS_TRANSFERMGMT_TEST, CMSR_APEX, CMSR_BACKUP, CMSR_LB

Timeline of the incident

  • 28-Apr-11 23:45 - First error messages from the monitoring tool; the database appears unresponsive.
  • 28-Apr-11 23:55 - DB team starts investigating the issue: ssh connection possible, sqlplus response very slow.
  • 29-Apr-11 00:20 - CMS computing run coordinator informed (he had already opened a GGUS ticket), experiment contacts informed, call received from the computer operator because of the GGUS ticket, notification sent to the computer operator for the SSB.
  • 29-Apr-11 00:40 - Instance number 4 no longer responds and seems to block the other instances. Instance number 4 is restarted by the DBA.
  • 29-Apr-11 01:00 - The database is fully open for users. Root cause not yet understood. CMS computing coordinator informed (he sent us the link to the GGUS ticket to be updated); experiment contacts and SSB updated.

Analysis

  • Investigation showed that all sessions were waiting on events such as "library cache lock" and "cursor: pin S wait on X".
  • sqlplus connection to instance number 4 was not possible.
  • Nothing was found in the logs.
  • Due to the database state, it was not possible to determine exactly what was causing the locks. We suspect that instance number 4 was the cause.
  • Once instance number 4 was killed, the locks disappeared and the database started to respond again.
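
The kind of wait-event summary described above can be sketched in Python. This is only an illustration: the report does not say how the wait events were collected, so the sketch assumes a snapshot of session rows shaped like the INST_ID, SID, EVENT and BLOCKING_INSTANCE columns of Oracle's GV$SESSION view. The sample rows and the function names are hypothetical and are not data from the incident.

```python
from collections import Counter

# Hypothetical snapshot rows: (inst_id, sid, event, blocking_instance).
# The values are made up for illustration; they are NOT data from the incident.
SNAPSHOT = [
    (1, 101, "library cache lock", 4),
    (1, 102, "cursor: pin S wait on X", 4),
    (2, 201, "library cache lock", 4),
    (3, 301, "library cache lock", None),
    (3, 302, "cursor: pin S wait on X", 4),
]


def summarize_waits(rows):
    """Count sessions per wait event and per reported blocking instance."""
    events = Counter(event for _, _, event, _ in rows)
    blockers = Counter(b for _, _, _, b in rows if b is not None)
    return events, blockers


def likely_blocker(rows):
    """Return the instance most often reported as blocker, or None."""
    _, blockers = summarize_waits(rows)
    if not blockers:
        return None
    return blockers.most_common(1)[0][0]


if __name__ == "__main__":
    events, blockers = summarize_waits(SNAPSHOT)
    for event, count in events.most_common():
        print(f"{count:3d} sessions waiting on {event}")
    print("most frequently reported blocking instance:", likely_blocker(SNAPSHOT))
```

A summary of this kind only points at a suspect instance. As in this incident, it does not by itself establish the root cause.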

Follow up

  • The SSB was not updated with the messages sent by the DB team, even though the computer operator confirmed that the message had been posted. Being followed up by Anthony Grossir.
    • Update: it seems that there was a problem with the procedure used by the operator to post the message to the SSB. Fixed on Friday morning (29.04).
  • The GGUS ticket did not reach the DB team directly (we learned of it only through the CMS computing run coordinator). Being followed up.
  • The operator updated the GGUS ticket saying that CASTOR support had been called. This was a mistake: DB support was called (confirmed by the operator).
  • Further investigations on Friday morning did not reveal the root cause. With no evidence in the logs, we cannot determine the cause of the problem.

-- EvaDafonte - 29-Apr-2011

Topic revision: r4 - 2011-05-09 - EvaDafonte
 