FTS3 database down due to a hardware problem
Description
The FTS3 database lost connectivity during the weekend of 4-5 October 2014.
Impact
FTS3 users were unable to connect to the database. Any application relying on the FTS3 database was affected.
Timeline of the incident
- 04-10-2014 18:30: IT-DB monitoring detected a lack of connectivity to the FTS3 DBoD instance.
- 04-10-2014 20:01:18: INC0649797: The operator tried to restart the server programmatically, without success. See the incident for details.
- 04-10-2014 20:53:11: The operator logged a severe hardware problem. A vendor case was opened.
- 04-10-2014 21:12: Ruben (DBoD admin) checked the situation and opened an IT/SSB incident (https://cern.service-now.com/service-portal/view-outage.do?n=OTG0014622). Nothing else could be done during the weekend as it was a hardware problem, so a solution was postponed until Monday.
- 05-10-2014 13:40: A developer opened an incident (INC0649909). In the ensuing dialog, Ruben re-explained the Manifesto and the working-hours nature of the service. He also requested a statement about the importance of this service, but did not get a clear answer. See the incident for details.
- 05-10-2014 17:56:34: Ruben decided not to wait until Monday and to install a new server, as in the meantime Giacomo Tenaglia (IT-DB-IMS section) had managed to free a server in the same rack.
- 05-10-2014 22:10:51: After the server re-installation, done by Giacomo, the instance was recovered on the new server and opened. Service was re-established.
Analysis
The server hosting the instance suffered a severe hardware failure.
Follow up
Permanent solution implemented.
--
RubenGaspar - 2014-10-29