LCGR extended downtime during db upgrade to 10.2.0.5

Description

The LCGR database was down on Tuesday 25.01.2011 from 09:30 to 17:00 because of problems encountered during the database upgrade to 10.2.0.5. The scheduled and expected downtime was 2.5 hours.
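As a quick check of the figures above, the actual downtime and the overrun against the scheduled window can be computed as follows. This is only a sketch: the times are taken from this report, while the variable names and the date format string are illustrative choices.

```python
from datetime import datetime, timedelta

# Start and end of the outage as stated in this report.
FMT = "%d-%m-%Y %H:%M"  # format chosen for this sketch only
start = datetime.strptime("25-01-2011 09:30", FMT)
end = datetime.strptime("25-01-2011 17:00", FMT)

# Scheduled downtime as stated in this report.
scheduled = timedelta(hours=2.5)


def hours(delta):
    """Convert a timedelta to a number of hours."""
    return delta.total_seconds() / 3600


actual = end - start
overrun = actual - scheduled

print(f"actual downtime: {hours(actual):.1f} h")   # 7.5 h
print(f"overrun:         {hours(overrun):.1f} h")  # 5.0 h
```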

Impact

  • All services running on the LCGR database were unavailable from 09:30 to 17:00
  • List of affected services: ALICE_DASHBOARD, ATLAS_DASHBOARD, CMS_DASHBOARD, LCG_DASHBOARD, LCG_FCR, LCG_FTS, LCG_FTS_MONITOR, LCG_FTS_T2, LCG_FTS_T2_W, LCG_FTS_W, LCG_GRIDMAP, LCG_GRIDOPS, LCG_GRIDVIEW2, LCG_LFC, LCG_OPS, LCG_SAM_PI, LCG_SAM_PORTAL, LCG_SAM_PPS, LCG_SAME, LCG_SITEMON, LCG_VOMS, LHCB_DASHBOARD

Time line of the incident

  • 25-Jan-2011 09:30 - Scheduled intervention to upgrade LCGR database to Oracle 10.2.0.5 starts as planned.
  • 25-Jan-2011 09:35 - During database shutdown, LCGR cluster goes into an "unknown" status and does not respond.
  • 25-Jan-2011 09:50 - We discovered that the issue was caused by the firewall: a denial of service attack (the /var/log/messages file contains the error "ip_conntrack: table full, dropping packet"). We disabled iptables. The LCGR cluster was restarted and cleanly shut down again.
  • 25-Jan-2011 10:20 - LCGR database upgrade starts.
  • 25-Jan-2011 12:30 - The upgrade process appears to hang (the catupgrd.sql script usually takes about 10 min to run and has been running for 30 min).
  • 25-Jan-2011 14:00 - The LCGR standby database is prepared to be opened while the problem is investigated. The upgrade process is progressing very slowly.
  • 25-Jan-2011 14:15 - We found an Oracle note describing the same symptoms (metalink 785689.1). With it we can estimate how long the upgrade will take to finish.
  • 25-Jan-2011 15:00 - We propose to the user community to allow the upgrade process to continue until 16:15.
  • 25-Jan-2011 16:10 - The upgrade process finishes. We continue with the last upgrade steps. iptables is re-enabled.
  • 25-Jan-2011 17:00 - LCGR database is back in production. Users are notified.
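The firewall symptom seen at 09:50 can be looked for in a system log with a few lines of Python. This is a minimal sketch, not the procedure used during the incident: only the message text comes from this report, while the function name, host name and sample log lines are invented for illustration.

```python
import re

# Message text quoted in this report.
CONNTRACK_FULL = re.compile(r"ip_conntrack: table full, dropping packet")


def count_conntrack_drops(lines):
    """Return how many lines report a full connection-tracking table."""
    return sum(1 for line in lines if CONNTRACK_FULL.search(line))


# Invented sample lines in syslog style, for illustration only.
sample = [
    "Jan 25 09:34:58 node1 kernel: ip_conntrack: table full, dropping packet.",
    "Jan 25 09:35:01 node1 sshd[4242]: Accepted publickey for oracle",
    "Jan 25 09:35:02 node1 kernel: ip_conntrack: table full, dropping packet.",
]

print(count_conntrack_drops(sample))  # 2
```

On a real host the same function could be given an open file handle, for example `open("/var/log/messages")`, since it only iterates over lines.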

Analysis

  • The problem observed with the firewall is not yet understood.
  • The problem with the upgrade process was caused by an excessive number of messages in the AQ tables, which have to be removed during the upgrade. Oracle proposed two solutions for this issue: (1) allow the upgrade process to run to completion, or (2) clean up the messages manually. The second option was initially discarded because it required recovery of the instance and would have taken longer than the estimated time for the upgrade to finish.
  • IT SSB updates were posted with some delays.

Follow up

  • The cause of the excessive messages in the AQ tables is still being investigated (26.01.2011).
    • Follow up (8.02.2011):
      • Caused by bugs 7379282 and 7494199 (not yet fixed)
      • FAN (Fast Application Notification) has two methods for publishing events to clients: the Oracle Notification Service (ONS), which is used by Java Database Connectivity (JDBC) clients including the Oracle Application Server 10g, and Oracle Streams Advanced Queueing, which is used by Oracle Call Interface (OCI) and Oracle Data Provider for .NET (ODP.NET) clients. When using Advanced Queueing, you must enable the service to use the queue by setting AQ_HA_NOTIFICATIONS to true.
      • 8 of the 23 services in the LCGR database had aq_ha_notifications='YES'. These services were causing orphan rows in the sys.reg$ table.
    • Workaround applied: AQ_HA_NOTIFICATIONS disabled for the identified services as they do not respond to FAN events. Orphan rows deleted from the sys.reg$ table.

  • The issue of information flow was followed up with all groups and discussed at T1 service coordination meetings.
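For the AQ_HA_NOTIFICATIONS workaround described above, the affected services can be identified by filtering on that flag. The sketch below assumes the flag is read from the AQ_HA_NOTIFICATIONS column of the DBA_SERVICES view; the query string, function name and sample rows are illustrative only and do not reproduce the actual LCGR data or the commands that were run.

```python
# Assumed query; the report does not state how the services were listed.
SERVICES_QUERY = "SELECT name, aq_ha_notifications FROM dba_services"


def services_with_aq_notifications(rows):
    """Return the names of services whose AQ HA notifications flag is YES.

    `rows` is an iterable of (service_name, flag) pairs, such as the rows
    returned by SERVICES_QUERY. A NULL flag (None) is treated as disabled.
    """
    return [name for name, flag in rows if (flag or "").upper() == "YES"]


# Hypothetical rows for illustration; these are not real LCGR services.
sample_rows = [
    ("SERVICE_A", "YES"),
    ("SERVICE_B", "NO"),
    ("SERVICE_C", None),
    ("SERVICE_D", "yes"),
]

print(services_with_aq_notifications(sample_rows))  # ['SERVICE_A', 'SERVICE_D']
```

How the flag is then switched off depends on how the services are managed, so it is deliberately left out of the sketch.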

-- EvaDafonte - 26-Jan-2011

Topic revision: r4 - 2013-01-16 - EvaDafonte
 