Database services unavailable or degraded after network intervention

Description

Some database services were unavailable or degraded as a side effect of the network intervention (http://itssb.web.cern.ch/planned-intervention/computer-centre-router-upgrade-gpnlcg/18-09-2013).

Impact

  • Database Services degraded:
    • Accelerator Databases:
      • ACCCON
      • ACCMEAS
      • ENCVORCL
      • LASER
      • ACCLOGTEST
      • SUSI
      • TIM
      • ZORA
    • Physics Databases:
      • adcr_adg_rac7
      • alionr_adg_rac9
      • atlarc
      • atonr_adg_rac7
      • cmsarc
      • cmsonr_dg_rac9
      • compr_dg_rac9
      • d3r
      • dgtest_adg_rac7
      • dgtest
      • dwsdb
      • int2r
      • int6r
      • int9r
      • int12r
      • intr
      • lhcbonr
      • lhcbr
      • pdbr_adg_rac9
      • pdbr
      • test2

  • Database Services unavailable:
    • AIS Databases:
      • AISDBD
      • AISDBT
      • EDHD
      • EDHP_DG_GEN3
    • CASTOR Databases:
      • ALICESTG
      • ATLASSTG
      • CASTORNS
      • CMSSTG
      • ITDCSTG
      • LHCBSTG
      • PUBSTG
      • REPACKDB
    • IT Databases:
      • CSDB
      • CSR
      • ITCORE
      • LEMONRAC
    • Physics Databases:
      • cmsr
      • cmsonr_adg_rac10
      • compr
      • int11r
      • int8r
      • lcgr
      • lhcbonr_dg_rac7
      • rdtest1

  • List of virtual production systems affected by the network intervention:
    • Databases on demand:
      • dbvrts1014 - DRUPAL Production Slave/Master MySql database
      • dbvrts1016 - CDSDEVDB and PACMAN
      • dbvrts1017 - AFSCONSO
      • dbvrts1022 - GNIDBD
      • dbvrts1027 - DRUPAL MySql databases
      • dbvrts1028 - DRUPAL MySql databases
      • dbvrts1042 - SFTDB
      • dbvrts1049 - SCALACM
      • dbvrts1053 - MBSD
      • dbvrts1054 - LEMONMGR
      • dbvrts1055 - ASIMPERF
      • dbvrts1057 - OPTODB
    • Oracle databases:
      • dbvrts1023 - GIS DB
      • dbvrts1024 - AUDITP
    • AIS Applications:
      • dbvrts1036 - Qualiac payment transfer

Timeline of the incident

  • 18-Sept-13 05:00 - Computer Center GPN routers processed from 05:00 to 06:00.
  • 18-Sept-13 07:00 - Computer Center LCG routers processed from 07:00 to 10:00.
  • 18-Sept-13 07:30 - Eric Grancher passed by the Computer Center and checked with CS people that the intervention was going well and on schedule.
  • 18-Sept-13 08:00 - We were informed that the access control system was having problems, but that its databases (susidb/zoradb/pdb) were fine. However, a number of other databases had one or more instances stopped.
  • 18-Sept-13 08:05 - Przemek noticed that the LCGR and CMSR hosts had no storage mounted and that their database instances had not started.
  • 18-Sept-13 08:08 - Work was distributed among the DBAs to check and manually restart the databases.
  • 18-Sept-13 12:10 - All databases were back.

Analysis

  • Some switches were rebooted before the others had come back. This explains why the storage/interconnect network was cut altogether, which unfortunately led to virtualisation issues, RAC issues, etc.
  • Databases were further affected in 2 cases:
    • (#1) On some machines, CRS did not fully restart because of the network heartbeat.
    • (#2) On some machines, CRS did not keep trying to restart.

Follow up

  • (#1) This was found to be a bug in CRS; see "11gR2 GI Node May not Join the Cluster After Private Network is Functional After Eviction due to Private Network Problem" (Doc ID 1479380.1). The bug is fixed only in 12.1.0.2, but there is a workaround for 11g.
  • (#2) The following sequence fixes the issue: crsctl stop crs, kill the CRS processes, crsctl start crs, crsctl start cluster.
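
The manual recovery sequence for case #2 could be scripted roughly as follows. This is a minimal sketch, not the procedure the DBAs actually ran. It assumes that "start crs" and "start cluster" in the item above mean `crsctl start crs` and `crsctl start cluster`. The `Step`, `recovery_plan` and `run_plan` names are hypothetical. The report does not name the CRS processes to kill, so that step is left as a manual action rather than guessed at.

```python
import shlex
import subprocess
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    description: str
    command: Optional[List[str]]  # None marks a step the operator does by hand


def recovery_plan() -> List[Step]:
    """Ordered steps for case #2, taken from the follow-up item above."""
    return [
        Step("stop the clusterware stack", ["crsctl", "stop", "crs"]),
        # Assumption: the report does not list the process names, so this
        # step is not automated here.
        Step("kill the CRS processes", None),
        Step("start the clusterware stack", ["crsctl", "start", "crs"]),
        Step("start the cluster", ["crsctl", "start", "cluster"]),
    ]


def run_plan(steps: List[Step], dry_run: bool = True) -> List[str]:
    """Return a log of the steps; execute them only when dry_run is False."""
    log = []
    for number, step in enumerate(steps, start=1):
        if step.command is None:
            log.append(f"{number}. MANUAL: {step.description}")
            if not dry_run:
                # Wait for the operator before moving on to the next command.
                input(f"Step {number}: {step.description}; press Enter when done ")
            continue
        log.append(f"{number}. {shlex.join(step.command)}  # {step.description}")
        if not dry_run:
            # check=True aborts the sequence if a command fails.
            subprocess.run(step.command, check=True)
    return log


if __name__ == "__main__":
    # Dry run by default: prints the plan without touching the cluster.
    for line in run_plan(recovery_plan()):
        print(line)
```

Defaulting to a dry run lets the plan be reviewed on an affected node before anything is executed; passing `dry_run=False` would run the commands in order and stop at the first failure.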
Topic revision: r6 - 2013-09-19 - KateDziedziniewicz
 