Unavailability of the CMS offline production DB (CMSR), 28th April 2011
Description
The CMS offline production database (CMSR) became unresponsive on Thursday 28th April 2011 at around 23:45. The service was unavailable for about an hour and a half.
Impact
- The database was unresponsive for approximately 1.5 hours
- Affected services include: CMS_ALCT, CMS_ANODE_BOARDS, CMS_C2K, CMS_COND, CMS_CSC, CMS_DBS, CMS_DBS_WRITER, CMS_EMU_CERN, CMS_EMU_FAST, CMS_EMU_HV, CMS_HCL, CMS_INTEGRATION, CMS_LUMI_PROD_OFFLINE, CMS_PVSS, CMS_PXL, CMS_SSTRACKER, CMS_T0AST, CMS_TEC_LYON, CMS_TESTBEAM, CMS_TRANSFERMGMT, CMS_TRANSFERMGMT_SC, CMS_TRANSFERMGMT_TEST, CMSR_APEX, CMSR_BACKUP, CMSR_LB
Time line of the incident
- 28-Apr-11 23:45 - First error messages from the monitoring tool; the database appears not to respond.
- 28-Apr-11 23:55 - The DB team started investigating the issue; ssh connections were possible, but sqlplus responded very slowly.
- 29-Apr-11 00:20 - CMS computing run coordinator informed (he had already opened a GGUS ticket); experiment contacts informed; call received from the computer operator due to the GGUS ticket; notification sent to the computer operator for the SSB.
- 29-Apr-11 00:40 - Instance number 4 no longer responds and seems to be blocking the other instances. Instance number 4 is restarted by the DBA.
- 29-Apr-11 01:00 - The database is fully open to users. The root cause is not yet understood. CMS computing coordinator informed (he sent us the link to the GGUS ticket to be updated); experiment contacts and the SSB updated.
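The restart of instance number 4 could be sketched as follows. This is a hypothetical illustration using Oracle Clusterware's srvctl utility; the database and instance names (CMSR, CMSR4) are assumptions based on the report, not commands taken from the incident itself:

```shell
# Check the status of all instances of the CMSR RAC database
# (database/instance names are assumed for illustration)
srvctl status database -d CMSR

# Stop the stuck instance with the abort option, since a normal
# shutdown would hang waiting on the blocked sessions
srvctl stop instance -d CMSR -i CMSR4 -o abort

# Start the instance again once it is down
srvctl start instance -d CMSR -i CMSR4
```

Using "-o abort" rather than a normal or immediate shutdown is the usual choice when an instance is wedged, at the cost of requiring crash recovery on restart.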
Analysis
- Investigation showed that all sessions were waiting on events such as library cache lock, cursor pin S wait on X, etc.
- An sqlplus connection to instance number 4 was not possible.
- Nothing was found in the logs.
- Due to the database state, it was not possible to determine exactly what was causing the locks. We suspected that instance number 4 could be the cause.
- Once instance number 4 was killed, the locks disappeared and the database started to respond again.
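The wait-event diagnosis above can be sketched with a query of this kind. This is an illustrative example against Oracle's standard gv$session view (available on RAC databases of this era), not the exact query used during the incident:

```sql
-- Count non-idle sessions per wait event across all RAC instances,
-- to spot widespread waits such as 'library cache lock'
-- or 'cursor pin S wait on X'
SELECT inst_id, event, COUNT(*) AS sessions
FROM   gv$session
WHERE  wait_class <> 'Idle'
GROUP  BY inst_id, event
ORDER  BY sessions DESC;
```

Grouping by inst_id is what would point the finger at a single instance: if the blocking sessions all cluster on instance 4, that supports the suspicion recorded above.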
Follow up
- The SSB was not updated with the messages sent by the DB team, even though we got confirmation from the computer operator that the message had been posted. Being followed up by Anthony Grossir.
- Update: there was a problem with the procedure used by the operator to post messages in the SSB. Fixed Friday (29.04) morning.
- The GGUS ticket was not received by the DB team (only via the CMS computing run coordinator). Being followed up.
- The operator updated the GGUS ticket saying that CASTOR support had been called. This was a mistake; DB support was called (confirmed by the operator).
- Further investigation on Friday morning did not reveal the root cause. With no evidence in the logs, we cannot determine the cause of the problem.
--
EvaDafonte - 29-Apr-2011