Replication for ATLAS conditions and LHCB conditions to SARA stopped
Description
The database at SARA had been unavailable since a storage corruption issue on 18.08. Replication to SARA for ATLAS conditions and LHCB conditions stopped and was excluded from the main setup. The SARA database was restored with point-in-time recovery on 01.09. This long delay was caused by datafile corruption observed after database recovery: a service request and further collaboration with Oracle Support were necessary to restore the database as a single instance on a separate machine. Replication to SARA was finally restored on 08.09 around 14:00. The procedure used another Tier1 site (RAL) as the source of the read-only data. Because the main Capture had been split after the SARA database corruption, it was merged back on 10.09.
Impact
- No replication to SARA for ATLAS conditions and LHCB conditions between 18.08 (12:23) and 08.09 (14:00)
- No replication to RAL for ATLAS conditions and LHCB conditions between 06.09 (12:00) and 08.09 (14:00)
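The length of each interruption follows directly from the timestamps above. A minimal sketch of that arithmetic, assuming the year 2010 (taken from the date of this report) and using the start and end times exactly as given in the Impact list; the names `OUTAGES` and `outage_hours` are illustrative only:

```python
from datetime import datetime

# Start/end of each replication gap, as listed under Impact.
# The year 2010 is an assumption based on the date of this report.
OUTAGES = {
    "SARA": (datetime(2010, 8, 18, 12, 23), datetime(2010, 9, 8, 14, 0)),
    "RAL": (datetime(2010, 9, 6, 12, 0), datetime(2010, 9, 8, 14, 0)),
}


def outage_hours(start, end):
    """Return the length of the gap between start and end in hours."""
    return (end - start).total_seconds() / 3600


for site, (start, end) in OUTAGES.items():
    delta = end - start
    hours, remainder = divmod(delta.seconds, 3600)
    print(f"{site}: {delta.days} days, {hours} h {remainder // 60} min "
          f"({outage_hours(start, end):.1f} h in total)")
```

With these inputs the SARA gap comes to 21 days, 1 h 37 min, and the RAL gap to 2 days, 2 h.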
Time line of the incident
- Wednesday, 18th of August, 12:23 - SARA database crash with datafile corruption
- Friday 20th of August, 17:00 - replication to SARA excluded from main setup
- Wednesday 1st of September - SARA database restored as a single instance database on a different machine
- Monday 6th of September, 14:00 - replication to RAL stopped to serve as a data source for SARA recovery
- Wednesday 8th of September, 14:00 - SARA replication was restored (for both ATLAS and LHCB)
- Friday 10th of September, 16:30 - SARA replication added back to the main setup
Analysis
The Streams replication restore procedure (Tier0->Tier1) is well documented, and to recover from such a failure the DBA should simply follow the recovery procedure. In this particular situation the main problem was the database corruption at SARA (Tier1), which was resolved in collaboration with Oracle Support. CERN DBAs provided advice during the database recovery phase and acted as the main coordinators of the Streams setup recovery process.
Follow up
Since it is essential to collaborate effectively with Tier1 administrators when recovering from such failures, this issue is being followed up by including it in the "Incident review" session at the 3D workshop in November 2010. The recovery scenario will also be discussed in the "Backup and recovery and actions to reduce risk DB loss" part. Agenda:
http://indico.cern.ch/conferenceDisplay.py?confId=111194
--
MarcinBlaszczyk - 21-Sep-2010