CASTOR storage controllers upgrade

Description

As recommended by NetApp support, an upgrade of the CASTOR NAS boxes was performed successfully on all FAS3140 and FAS3240 controllers (8 controllers). The upgrade of the FAS3040 controllers had issues which led to the c2castordevdb, c2repackdb and castor (name server) databases being unavailable for a period of time.

The intervention was discussed with the CASTOR operations team and announced on the IT status board https://itssb.web.cern.ch/planned-intervention/upgrade-operating-system-castor-nas-controllers/09-07-2012 ("The intervention is expected to be transparent but due to its nature databases and application services are at risk. In case of problem users will lose their database session and a downtime of about 30 minutes depending on services may occur.")

Impact

  • Service degraded or not accessible for about 15 minutes on the repack and CASTOR name server databases, and for about 45 minutes on castordev.

Timeline of the incident

  • 9-July-2012 at 9:42: After successfully upgrading one node (dbnasc201), we start to upgrade the second node in the cluster (dbnasc202). The shutdown of this machine hangs. Errors displayed on the console point to a possible problem with the PSU.
  • 9-July-2012 at 10:00: NetApp support is called to figure out how to force the controller to boot or to force a takeover without data impact. dbnasc202 is not serving data, which affects c2castordevdb.
  • 9-July-2012 at 10:00: Visual inspection of the PSUs at the CC. Everything looks OK.
  • 9-July-2012 at 10:20: A shutdown is forced from the console. This finally unblocks the situation and dbnasc202 is taken over by dbnasc201.
  • 9-July-2012 at 10:34: The dbnasc201/dbnasc202 cluster is operational again.
  • 9-July-2012: In the meantime the FAS3140 and FAS3240 clusters have been upgraded. This worked out as expected, with no impact on database services.
  • 9-July-2012: The upgrade of the first node of a FAS3040 cluster works as expected. The second node shows the same problem and also requires a forced restart.
  • 9-July-2012: Service availability (no new sessions accepted, transactions waiting on commit) was impacted for the following databases: the repack database from ~11:55 until 12:04 and the CASTOR NS from ~11:30 until 11:43. A simple connection probe of the kind sketched after this list makes such windows visible.
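
The availability windows above were observed from the database side. As an illustration only, a small probe like the following can record when a database stops accepting new sessions. This is a sketch assuming Python with the cx_Oracle driver; the DSN, account, password and interval are placeholders, not the real CASTOR connection settings.

import time
import cx_Oracle

def probe(dsn, user, password, interval=30):
    # Try to open a new session every <interval> seconds and report how long it takes.
    while True:
        start = time.time()
        try:
            conn = cx_Oracle.connect(user=user, password=password, dsn=dsn)
            cur = conn.cursor()
            cur.execute("select 1 from dual")  # one round trip to confirm the instance answers
            cur.fetchone()
            conn.close()
            print("%s OK   new session opened in %.1fs" % (time.ctime(), time.time() - start))
        except cx_Oracle.DatabaseError as exc:
            print("%s FAIL after %.1fs: %s" % (time.ctime(), time.time() - start, exc))
        time.sleep(interval)

if __name__ == "__main__":
    # Placeholder settings, not the actual CASTOR name server ones.
    probe("castorns-probe.example.cern.ch/probe_service", "probe_user", "probe_password")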

Analysis

Errors reported on the dbnasc201/dbnasc202 cluster during the shutdown of node dbnasc202:


dbnasc202*> Mon Jul  9 09:44:33 CEST [dbnasc202: cf.fsm.nfo.startingGracefulShutdown:notice]: Negotiated failover: starting graceful shutdown.
Setting boot image to image2.
.
Uptime: 236d15h42m18s
Mon Jul  9 09:44:50 CEST [dbnasc202: kern.shutdown:notice]: System shut down because : "D-blade Shutdown".
Takeover/Sendhome no longer inhibited by pending halt/reboot.
rpc failed: RPC: Unable to receive; errno = Undefined error: 0
rpc failed: RPC: Unable to send; errno = Undefined error: 0
Mon Jul  9 09:55:06 CEST [dbnasc202: kern.time.rpc.error:ALERT]: Unable to read updated Timekeeping options.
Mon Jul  9 10:00:00 CEST [dbnasc202: kern.uptime.filer:info]:  10:00am up 236 days, 15:43 1763765739 NFS ops, 0 CIFS ops, 0 HTTP ops, 0 FCP ops, 0 iSCSI ops
Mon Jul  9 10:00:02 CEST [dbnasc202: monitor.chassisFan.removed:ALERT]: Chassis fan SYS_FAN_1 is removed
Mon Jul  9 10:00:02 CEST [dbnasc202: callhome.c.fan.fru.rm:error]: Call home for CHASSIS FAN FRU REMOVED: SYS_FAN_1
Mon Jul  9 10:00:02 CEST [dbnasc202: monitor.chassisFan.removed:ALERT]: Chassis fan SYS_FAN_2 is removed
Mon Jul  9 10:00:02 CEST [dbnasc202: callhome.c.fan.fru.rm:error]: Call home for CHASSIS FAN FRU REMOVED: SYS_FAN_2
Mon Jul  9 10:00:02 CEST [dbnasc202: monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Shutting down. Number of Failed chassis fans more than tolerable limit
Mon Jul  9 10:00:02 CEST [dbnasc202: callhome.fans.failed:EMERGENCY]: Call home for MULTIPLE FAN FAILURE
Mon Jul  9 10:00:05 CEST [dbnasc202: monitor.chassisPower.degraded:notice]: Chassis power is degraded: sensor PSU1 Present
Mon Jul  9 10:00:05 CEST [dbnasc202: callhome.chassis.power:error]: Call home for CHASSIS POWER DEGRADED: sensor PSU1 Present
Mon Jul  9 10:00:08 CEST [dbnasc202: cf.takeover.disabled:warning]: Controller Failover is licensed but takeover of partner is disabled due to reason : version mismatch.

This looks like a problem specific to this particular controller architecture: only the FAS3040 controllers were affected.

The first boot of the controller after the problem is hit generates a core dump. All evidence has been forwarded to NetApp.
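
For reference, pulling the relevant events out of a saved console capture can be done with a simple filter like the sketch below. This is only an illustration written in Python against the ONTAP console format shown above (keeping ALERT and EMERGENCY events); it is not part of any official procedure.

import re
import sys

# ONTAP console events look like: "[host: event.name:SEVERITY]: message"
EVENT = re.compile(r"\[(?P<host>[^:]+): (?P<event>[^:]+):(?P<sev>ALERT|EMERGENCY)\]: (?P<msg>.*)")

def triage(path):
    # Collect and print every ALERT/EMERGENCY event found in the console capture.
    hits = []
    with open(path) as log:
        for line in log:
            match = EVENT.search(line)
            if match:
                hits.append((match.group("sev"), match.group("event"), match.group("msg").strip()))
    for sev, event, msg in hits:
        print("%-9s %-35s %s" % (sev, event, msg))
    return hits

if __name__ == "__main__":
    triage(sys.argv[1])

Applied to the capture above, it would list the timekeeping alert, the two fan-removal alerts, and the two emergency events (shutdown and multiple fan failure).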

Follow up

  • NetApp case: 2003315054 (update on July 12th: NetApp is investigating the content of the core files).
  • Data consistency is being verified for the CASTORNS database (update on July 12th: no corruption detected by the backup); see the sketch below for one way such a check can be scripted.
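
One common way to verify data consistency is to run an RMAN VALIDATE (or BACKUP VALIDATE) pass and then inspect v$database_block_corruption. The sketch below only illustrates the reporting step under that assumption; it is not necessarily the check that was actually performed, and the connection settings are placeholders.

import cx_Oracle

def report_corruption(dsn, user, password):
    # List any corrupt blocks recorded by the last RMAN VALIDATE / BACKUP VALIDATE run.
    conn = cx_Oracle.connect(user=user, password=password, dsn=dsn)
    cur = conn.cursor()
    cur.execute("select file#, block#, blocks, corruption_type "
                "from v$database_block_corruption")
    rows = cur.fetchall()
    conn.close()
    if not rows:
        print("No corrupted blocks reported.")
    for file_no, block_no, nblocks, ctype in rows:
        print("file %d, starting block %d (%d blocks): %s" % (file_no, block_no, nblocks, ctype))
    return rows

if __name__ == "__main__":
    report_corruption("castorns-probe.example.cern.ch/probe_service", "probe_user", "probe_password")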

-- RubenGaspar - 09-Jul-2012
