ATLAS offline production database (ATLR) high load
Description
The ATLAS offline database (ATLR) suffered high load and instance reboots during the nights of 11 and 12 October 2011. The high load arrives as spikes from Tier 0 jobs.
Impact
- All services running on the ATLR database, mainly:
  - Tier 0 reconstruction and access to conditions data
  - Tier 0 monitoring and control processes
Timeline of the incident
- 11-Oct-11 17:30 - First high load spikes observed on the ATLR database.
- 12-Oct-11 00:21 - ATLR database becomes unresponsive. Alarms are triggered by the database monitoring tools.
- 12-Oct-11 01:00 - Node 3 (ATLR) rebooted by the cluster, services relocated to the remaining nodes. This increases the load on the remaining nodes.
- 12-Oct-11 01:20 - The DBA informs ATLAS about the problem affecting the database: an enormous number of connections to the ATLAS_COOL_READER_TZ account.
- 12-Oct-11 01:40 - The DBA locks the ATLAS_COOL_READER_TZ account to stabilize the database and lowers the maximum number of sessions for the account from 600 to 300. The account is then unlocked again.
- 12-Oct-11 02:00 - The number of sessions increases again, making the database unresponsive. The DBA decides to lock the account to ensure database availability for other applications and informs ATLAS of the decision.
- 12-Oct-11 02:10 - The database still has problems with inbound connections and listeners. Analysis is ongoing. The listeners are restarted.
- 12-Oct-11 03:30 - Instances are rebooted manually, in a rolling fashion, to fix the problem.
- 12-Oct-11 08:30 - Node 3 (ATLR) rebooted by the cluster, services relocated to the remaining nodes.
- 12-Oct-11 09:00 - The ATLAS_COOL_READER_TZ account is unlocked. Its maximum number of sessions is lowered to 150.
- 12-Oct-11 19:30 - High load observed again on nodes 2 and 3 of ATLR.
- 12-Oct-11 19:49 - Node 3 (ATLR) rebooted by the cluster, services relocated to the remaining nodes.
- 12-Oct-11 21:30 - New errors reported by the database: "TNS:listener does not currently know of service requested in connect descriptor".
- 12-Oct-11 21:50 - ATLAS limits job submission to guarantee database availability during the night.
- 12-Oct-11 23:00 - The listener errors may be caused by sniped sessions left behind by jobs that failed while the database was unavailable. The DBA cleans them up manually and enables an automatic job that cleans sniped sessions every 5 minutes.
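
The sniped-session cleanup in the last entry can be sketched as follows. This is a minimal illustration under stated assumptions, not the job actually deployed on ATLR: it assumes the session rows have already been read from Oracle's gv$session view as (inst_id, sid, serial#, username, status) tuples, and it only builds the ALTER SYSTEM KILL SESSION statements a DBA might then run. The function name and the sample rows are hypothetical.

```python
from typing import Iterable, List, Tuple

# (inst_id, sid, serial#, username, status), as would be selected from gv$session
SessionRow = Tuple[int, int, int, str, str]


def kill_statements_for_sniped(rows: Iterable[SessionRow], username: str) -> List[str]:
    """Build one ALTER SYSTEM KILL SESSION statement per sniped session of `username`."""
    statements = []
    for inst_id, sid, serial, user, status in rows:
        # Only sessions of the given account that Oracle has marked SNIPED are targeted.
        if user == username and status == "SNIPED":
            statements.append(
                f"ALTER SYSTEM KILL SESSION '{sid},{serial},@{inst_id}' IMMEDIATE"
            )
    return statements


# Hypothetical sample rows, for illustration only.
sample = [
    (3, 101, 4711, "ATLAS_COOL_READER_TZ", "SNIPED"),
    (3, 102, 4712, "ATLAS_COOL_READER_TZ", "ACTIVE"),
    (2, 205, 18, "OTHER_APP", "SNIPED"),
]
for stmt in kill_statements_for_sniped(sample, "ATLAS_COOL_READER_TZ"):
    print(stmt)
```

Running such a helper on a schedule (every 5 minutes in the incident) would keep sniped sessions from accumulating; how the real job was scheduled is not described here.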
Analysis
- The high load affecting the ATLAS offline database (ATLR) comes as spikes from Tier 0 jobs; such spikes have been observed frequently since the end of August.
- There have been several meetings to understand the job submission internals and to optimize the number of sessions that the database can handle during those spikes.
- ATLAS is actively testing the use of Frontier to reduce this load.
- During the night of 11 to 12 October, the database was unavailable due to high load. The ATLAS_COOL_READER_TZ account was locked to ensure database availability for other applications.
- With so many very short connections, each lasting less than a second (connect, fetch, disconnect), the database load increases and throughput degrades, so the jobs themselves are delayed as well.
- Although the maximum number of sessions was lowered twice, Oracle does not appear to enforce the limit strictly.
- A high number of sniped sessions, probably due to job failures, were causing problems with the database listener (resource consumption).
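
The cost of the connect, fetch, disconnect pattern noted in the analysis can be illustrated with a small sketch that counts logons. It is not a description of how the Tier 0 jobs or the COOL client are implemented: CountingConnection is a hypothetical stand-in for a database connection, and the item names are made up. It only shows that fetching N items with one connection each costs N logons, while reusing a single session costs one.

```python
class CountingConnection:
    """Hypothetical stand-in for a database connection; counts logons."""

    logons = 0

    def __init__(self):
        # Each new connection corresponds to one logon on the database side.
        CountingConnection.logons += 1

    def fetch(self, key):
        return f"payload for {key}"

    def close(self):
        pass


def fetch_per_connection(keys):
    """Connect, fetch, disconnect for every single item."""
    results = []
    for key in keys:
        conn = CountingConnection()
        results.append(conn.fetch(key))
        conn.close()
    return results


def fetch_reusing_connection(keys):
    """One logon for the whole job; all fetches share the same session."""
    conn = CountingConnection()
    try:
        return [conn.fetch(key) for key in keys]
    finally:
        conn.close()


keys = [f"folder_{i}" for i in range(50)]

CountingConnection.logons = 0
fetch_per_connection(keys)
per_item_logons = CountingConnection.logons

CountingConnection.logons = 0
fetch_reusing_connection(keys)
reused_logons = CountingConnection.logons

print(per_item_logons, reused_logons)
```

For the 50 hypothetical items, the first variant performs 50 logons and the second performs 1. A caching layer in front of the database reduces the number of direct database sessions in a similar spirit.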
Follow up
- Meeting with ATLAS on 13 October to further analyze the problem.
- As a temporary solution, 2 extra nodes were added to the ATLR database to alleviate the load.
- ATLAS deployed a Frontier server (2 nodes) dedicated to the T0 jobs in the first week of October. The T0 group is currently testing it by comparing the output of direct Oracle access with that of Frontier access. A concrete date for switching the jobs to Frontier depends on the outcome of these tests. The aim is to switch as soon as possible.
--
EvaDafonte - 13-Oct-2011