Storage intervention severely affecting OVM infrastructure
Description
A clean-up action on the storage took several volumes used by the production OVM pool offline.
Impact
- All DBoD instances and application servers running on that OVM pool.
Timeline of the incident
- Fri Feb 22 09:34:04 CET: apps_ovm_g3 set offline
- Fri Feb 22 09:36:05 CET: apps_ovm2gen3a, apps_ovm2gen3b set offline
- ~ Fri Feb 22 09:42:00 CET: volumes set back online.
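As a rough cross-check of the timeline, the time each volume spent offline can be computed from the timestamps above. This is only a sketch: the year 2013 is taken from the dates in the follow-up incident URLs, and the restore time is the approximate value quoted in the timeline, so the results are approximate as well.

```python
from datetime import datetime

# Timestamps copied from the timeline (CET). The year is an assumption
# based on the dates in the follow-up incident URLs.
OFFLINED_AT = {
    "apps_ovm_g3": datetime(2013, 2, 22, 9, 34, 4),
    "apps_ovm2gen3a": datetime(2013, 2, 22, 9, 36, 5),
    "apps_ovm2gen3b": datetime(2013, 2, 22, 9, 36, 5),
}

# Approximate time at which the volumes were set back online.
BACK_ONLINE_AT = datetime(2013, 2, 22, 9, 42, 0)


def offline_seconds(volume: str) -> int:
    """Return roughly how many seconds the given volume was offline."""
    return int((BACK_ONLINE_AT - OFFLINED_AT[volume]).total_seconds())


if __name__ == "__main__":
    for name in OFFLINED_AT:
        minutes, seconds = divmod(offline_seconds(name), 60)
        print(f"{name}: about {minutes} min {seconds} s offline")
```

The volumes themselves were offline for roughly six to eight minutes; the downtime seen by the services on top of them was longer.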
Analysis
For safety reasons, a clean-up operation on the storage consists of two steps:
- offlining the volumes: NFS access is blocked.
- destroying the volumes a few days later.
Destruction is only carried out a few days later, and only if no unexpected impact has been detected. Unfortunately, due to a misunderstanding, the admin in charge selected the wrong volumes for the clean-up.
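The two-step procedure described above can be sketched as follows. This is a minimal illustration, not the actual storage tooling: the Volume class, the function names and the three-day grace period are all hypothetical (the report only says "a few days").

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical grace period; the report only says "a few days".
GRACE_PERIOD = timedelta(days=3)


@dataclass
class Volume:
    """Hypothetical in-memory model of a storage volume."""
    name: str
    online: bool = True
    offlined_at: Optional[datetime] = None
    destroyed: bool = False


def offline_volume(vol: Volume, now: datetime) -> None:
    """Step 1: block NFS access but keep the data, so the step is reversible."""
    vol.online = False
    vol.offlined_at = now


def restore_volume(vol: Volume) -> None:
    """Roll back step 1 when an unexpected impact is detected."""
    vol.online = True
    vol.offlined_at = None


def can_destroy(vol: Volume, now: datetime, impact_detected: bool) -> bool:
    """Step 2 is allowed only after the grace period and with no reported impact."""
    if vol.online or vol.offlined_at is None or impact_detected:
        return False
    return now - vol.offlined_at >= GRACE_PERIOD


def destroy_volume(vol: Volume, now: datetime, impact_detected: bool) -> bool:
    """Step 2: destroy the volume if the safety conditions hold."""
    if not can_destroy(vol, now, impact_detected):
        return False
    vol.destroyed = True
    return True
```

In this incident the first step acted as the safety net it is meant to be: the impact was noticed within minutes, the volumes were set back online, and no data was destroyed.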
Follow up
Two incidents were opened to follow up on this issue:
https://itssb.web.cern.ch/service-incident/several-db-demand-instances-down/22-02-2013
&
http://itssb.web.cern.ch/service-incident/itdb-virtualisation-not-available/22-02-2013
Downtime for the affected services varied from 30 to 60 minutes.