Performance problems on the Opendays ticketing application infrastructure

Description

From August 15th to August 20th, 2013, the Opendays ticketing application infrastructure suffered from poor performance, which resulted in slow request serving times.

Impact

  • All users trying to book tickets with the application experienced slow page load times, peaking at 1 to 2 minutes during the busiest periods (i.e. when new tickets were made available).

Timeline of the incident

  • 2013-08-15, afternoon - Opendays application goes live, performance problems are immediately experienced
  • 2013-08-16, lunchtime - APEX Java Listener parameters changed to allow a larger connection pool (see the sketch after this timeline)
  • 2013-08-16, late afternoon - Front-end technology is changed from APEX Java listener to OHS mod_plsql + extra security rules
  • 2013-08-16, evening - External users experience "403 Forbidden" errors when accessing the application
  • 2013-08-17, morning - Extra security rules fixed to solve the "403 Forbidden" errors
  • 2013-08-19, morning - The application still suffers performance problems
  • 2013-08-19, afternoon - Extra security rules and application code are optimised to speed things up
  • 2013-08-20, morning - The application still suffers performance problems
  • 2013-08-20, morning - The front-end application infrastructure capacity is increased by a factor of ~6.66 (from 150 to 1000 threads), solving the performance problems
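
For reference, the connection pool of the APEX Java Listener is controlled by the jdbc.* parameters in its configuration (defaults.xml). The sketch below is only illustrative: the values shown are assumptions, not the ones actually deployed.

    <!-- Hedged sketch of APEX Java Listener connection-pool settings in
         defaults.xml; the values are illustrative, not the ones we used. -->
    <entry key="jdbc.InitialLimit">10</entry>   <!-- connections opened at start-up -->
    <entry key="jdbc.MaxLimit">100</entry>      <!-- pool ceiling; the stock default is low -->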

Analysis

In the end, it was "only" a load problem. However, debugging it was far from straightforward, for two main reasons:

  • The setup of the Opendays application is very different from that of standard CERN APEX applications;
  • Two of the middleware experts were on leave at the time, and the third had to take sick leave.

Taking the first reason: the Opendays ticketing application differs from a standard APEX application because it needs to be accessible from external accounts. This cannot be enabled for just one APEX application; it is an "all or nothing" setting. To solve this, we deployed (for the first time at CERN) an "APEX Java Listener" front-end, which allowed us to configure a different SSO filter at the container level, pointing to the "Opendays authentication" rather than to the standard "CERN authentication" (used, for example, on apex-sso).
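
To illustrate the container-level mechanism: with the APEX Java Listener deployed in a Java servlet container, the SSO filter is declared in the web application's web.xml and applied to the whole application. The sketch below is hypothetical: the filter class and the endpoint are invented names, not the actual CERN SSO components.

    <!-- Hypothetical sketch of a container-level SSO filter declaration.
         The filter class and the endpoint are assumed names, not the
         actual CERN SSO components. -->
    <filter>
      <filter-name>SSOAuthenticationFilter</filter-name>
      <filter-class>ch.cern.sso.AuthenticationFilter</filter-class>
      <init-param>
        <!-- Point to the "Opendays authentication" rather than the
             standard "CERN authentication" -->
        <param-name>authenticationEndpoint</param-name>
        <param-value>https://opendays-auth.example.ch/</param-value>
      </init-param>
    </filter>
    <filter-mapping>
      <filter-name>SSOAuthenticationFilter</filter-name>
      <url-pattern>/*</url-pattern>
    </filter-mapping>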

This new setup was initially thought to be the root cause of the problem, so the first major action we took was to move the application to a standard APEX front-end, based on OHS (Oracle HTTP Server, essentially a modified Apache 2.2) with mod_plsql. To do so, we had to put in place some extra security rules to avoid "publishing" all the APEX applications via the Opendays ticketing system URL. This "extra security check" was implemented at the front-end level, in two steps, as the first setup resulted in the above-mentioned "403 Forbidden" errors between the afternoon of 2013-08-16 and the morning of 2013-08-17.
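
A rule of this kind can be expressed with mod_rewrite at the OHS level. The snippet below is a hedged sketch of the idea, not our production rules: the APEX application id (100) and the DAD path are illustrative assumptions. An over-strict version of such a pattern is exactly the kind of thing that can produce the "403 Forbidden" errors mentioned above.

    # Hypothetical sketch of the "extra security check": only the Opendays
    # APEX application (application id assumed to be 100 here) may be
    # reached through this front-end; anything else gets a 403.
    RewriteEngine On
    # APEX requests carry the application id as the first component of the
    # "p" query-string parameter, e.g. f?p=100:1:<session>...
    RewriteCond %{REQUEST_URI} ^/pls/apex/f$
    RewriteCond %{QUERY_STRING} !^p=100([:&].*)?$ [NC]
    RewriteRule .* - [F]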

The "extra security check" was then thought to be the root cause of the problem, together with the presence of "slow requests" to non-existing items on the web application. The rules have been then optimised, and the application modified by GS/AIS.

With the application still performing badly, we started looking at other causes. Interestingly, the system side was fine: the machines were lightly loaded, with a very limited number of Apache processes (a point that turned out to be more relevant than we realised at the time). Similarly, the database, where most of the code runs (as is the case for all APEX applications), was also lightly loaded. Whilst discussing various theories with one of the middleware experts, who was kindly available on the phone during his holidays (as he had also been during the operations of the previous days), we realised that OHS was running the MPM in "worker" mode, which means that requests are served by threads rather than by processes (hence the low Apache process count noted earlier), and that the web server was therefore running at its maximum configured capacity. We then increased the thread capacity by a factor of ~6.66 (1000 threads instead of the configured 150), and this fixed the problem. For the record, at the moment we are serving about 250 concurrent requests, peaking at 450 to 500 when new tickets are released.
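
For reference, in the worker MPM the serving capacity is the number of child processes times the threads per child, capped by MaxClients; the stock Apache 2.2 worker default for MaxClients is precisely 150. Below is a hedged sketch of the capacity increase: the exact process/thread split is illustrative, only the 150 and 1000 figures are ours.

    # Hedged sketch of the worker-MPM capacity increase on Apache 2.2 / OHS.
    # Capacity = child processes x ThreadsPerChild, capped by MaxClients.
    <IfModule mpm_worker_module>
        ServerLimit          40   # up to 40 child processes
        StartServers          4
        ThreadsPerChild      25   # 40 x 25 = 1000 threads maximum
        MaxClients         1000   # was 150, the stock worker default
        MinSpareThreads      25
        MaxSpareThreads     250
    </IfModule>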

The reason why the different configuration mode was not spotted earlier is twofold: on the one hand, we first focussed on performance problems in the non-standard parts of the application; on the other, we do not yet have much production experience with OHS, as at the moment we mostly run "standard" Apache installations in MPM "prefork" mode (56 installations), against only 4 installations of OHS.
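
As a side note, on builds where the MPM is compiled into the binary (as it is in Apache 2.2), the mode in use can be checked in seconds:

    # "httpd -l" lists the modules compiled into the binary; the MPM in use
    # shows up as worker.c or prefork.c.
    $ httpd -l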

Follow up

  • Check all OHS installations and increase the capacity. Done.