Castor Name Server not available due to problems with one of the database datafiles.

Description

The Castor software was blocked because it could not modify data in the database: one database datafile was in OFFLINE status.

Impact

  • All Castor operations were blocked. A single Name Server database is used by all the Castor clients (physics experiments), so its unavailability affects effectively everybody. This blocks all the Stager and SRM operations.

Time line of the incident

  • 29-Nov-2010 13:32:43 - The CASTORNS1 instance detects a problem in the cluster database, as it cannot communicate with the second instance, CASTORNS2.
  • 29-Nov-2010 14:31 - The SMON background process detects a problem with a datafile in the CASTOR_INDX tablespace: the datafile is offline, which causes several Name Server sessions to become blocked while trying to modify data. The database sessions were blocked in row lock contention. The client sessions were getting:
ORA-00376: file 21 cannot be read at this time
ORA-01110: data file 21: '/ORA/dbs03/CASTORNS/datafile/o1_mf_castor_i_314h19fs_.dbf'
  • 29-Nov-2010 14:57 - The Clusterware evicts the second node and forces a reconfiguration of the cluster database. Instance 1 recovers Instance 2.
  • 29-Nov-2010 15:17 - A restart of the first instance does not fix the problem with the datafile, which is still offline.
  • 29-Nov-2010 15:30 - Decided to go for recovery of the datafile, database restarted in mount state for RMAN operation.
  • 29-Nov-2010 15:33 - Recovery completes
  • 29-Nov-2010 15:42 - After some final checks, the service is back online with no data loss.
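The durations implied by the time line can be worked out with a short calculation. This is only a sketch: it assumes the client-visible outage started at 14:31 (when the offline datafile was detected and sessions became blocked) and treats 15:30 as the start of the recovery; the event labels are made up for illustration.

```python
from datetime import datetime

# Times copied from the time line above; all events are on 29-Nov-2010,
# so only the time of day is parsed. The labels are illustrative names.
EVENTS = {
    "datafile_offline_detected": "14:31",
    "node2_evicted": "14:57",
    "instance1_restarted": "15:17",
    "recovery_decided": "15:30",
    "recovery_completed": "15:33",
    "service_online": "15:42",
}


def minutes_between(start_label, end_label):
    """Return the whole minutes elapsed between two labelled events."""
    start = datetime.strptime(EVENTS[start_label], "%H:%M")
    end = datetime.strptime(EVENTS[end_label], "%H:%M")
    return int((end - start).total_seconds() // 60)


# Assumption: the outage is counted from the detection of the offline datafile.
outage_minutes = minutes_between("datafile_offline_detected", "service_online")
recovery_minutes = minutes_between("recovery_decided", "recovery_completed")

print(f"Outage: {outage_minutes} min, of which recovery: {recovery_minutes} min")
```

Under these assumptions the recovery itself is a small part of the outage; most of the time went into diagnosis and the cluster reconfiguration.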

Analysis

The root cause is not clear. The second node (dbsrvc204) was definitely in limbo because its local filesystems were in READ ONLY mode, but it was not serving any clients, as all the client connections were on the first node (dbsrvc202), so it should not be held responsible for the datafile problem. We may have faced two unrelated problems that happened to coincide in time. The filesystems on dbsrvc204 were fixed by a reboot. The state of dbsrvc204 was very strange because no monitoring tool (lemon/oms) detected anything.

The database datafile issue could be explained by a momentary glitch in the communication between the server and the storage box. If the database loses sight of a certain datafile for a certain period of time it will set it to OFFLINE to prevent further problems.
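The client errors quoted in the time line already identify the affected datafile. As an illustration, the following sketch extracts the file number and path from ORA-01110 messages so that such errors could be flagged quickly in a log. The function name and the regular expression are assumptions made for this example, not part of any Castor or Oracle tooling.

```python
import re

# Error text copied verbatim from the time line above.
ERROR_LINES = """ORA-00376: file 21 cannot be read at this time
ORA-01110: data file 21: '/ORA/dbs03/CASTORNS/datafile/o1_mf_castor_i_314h19fs_.dbf'"""

# Matches the ORA-01110 message format as it appears in this report.
ORA_01110 = re.compile(r"^ORA-01110: data file (\d+): '([^']+)'$")


def affected_datafiles(text):
    """Map datafile number to path for every ORA-01110 line in the text."""
    found = {}
    for line in text.splitlines():
        match = ORA_01110.match(line.strip())
        if match:
            found[int(match.group(1))] = match.group(2)
    return found


datafiles = affected_datafiles(ERROR_LINES)
for number, path in datafiles.items():
    print(f"datafile {number}: {path}")
```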

The recovery is needed because the datafile gets out of sync with the rest of the database with respect to the database's internal transaction clock (the System Change Number, SCN). The problem affected only one datafile, so the glitch must indeed have been very short.

There is no info in the Storage logs that could point to a problem in the storage either.

We are checking for a potential hardware problem in the second node that could explain why the local filesystems were put into READ ONLY mode. This is normally due to the OS detecting a problem with the journalling filesystem or with the underlying hardware. A similar case happened on another server, and it was fixed with a firmware update. However, initial tests have not detected any issue with the local disks in the server.
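Since no monitor reported the READ ONLY filesystems, a simple check of the mount table could catch this state. The sketch below parses text in the Linux /proc/mounts format and lists local filesystems mounted with the "ro" option. The function name, the set of filesystem types and the sample data are assumptions for illustration; this is not how lemon or oms work.

```python
def read_only_mounts(mounts_text, fstypes=("ext3", "ext4", "xfs")):
    """Return mount points of the given filesystem types mounted read-only.

    mounts_text is expected in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    read_only = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mountpoint, fstype, options = fields[1], fields[2], fields[3]
        # Compare whole options, so "errors=remount-ro" is not mistaken for "ro".
        if fstype in fstypes and "ro" in options.split(","):
            read_only.append(mountpoint)
    return read_only


# Hypothetical sample data, not taken from dbsrvc204.
SAMPLE = """/dev/sda1 / ext3 ro,errors=remount-ro 0 0
/dev/sda2 /var ext3 rw 0 0
proc /proc proc rw 0 0
"""

print(read_only_mounts(SAMPLE))
```

On a Linux host the same function could be fed the contents of /proc/mounts, and a non-empty result raised as an alarm.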

Follow up

  • Continue to monitor whether the problem in the second node reappears. If it does, perform the same firmware upgrade or prepare a migration to new hardware.

-- NiloChinchilla - 01-Dec-2010

Topic revision: r1 - 2010-12-01 - NiloChinchilla