VOMS Post Mortem for Downtime 14th of December 2008

Impact

All VOMS Core Services on voms.cern.ch and lcg-voms.cern.ch were down for around 7 hours from 00:01 on 14th December 2008.

Operators and Notifications

  • The operators correctly received notifications of the failures and tried service restarts without success.
  • SLS was reporting the downtime correctly.
  • SMS messages were sent to the service managers but were not read until after the service had recovered.

The service apparently recovered by itself at around 07:00.

Post Analysis at the VOMS level

The VOMS logs clearly showed:
ORA-01652: unable to extend temp segment by 128 in tablespace TEMP. Preparing : SELECT version FROM version"
Consequently, the database admins were asked to investigate what might have happened.
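
As an illustration of how such a log line could be picked up automatically, here is a minimal Python sketch that extracts the reported amount and the tablespace name from a message in the format quoted above. The function name `parse_ora_1652` and the regular expression are assumptions made for this example, derived from that single log line; they are not part of VOMS or of any monitoring tool in use.

```python
import re

# Pattern for the ORA-01652 message as it appears in the VOMS log quoted above.
# It is an assumption based on that one line, not an official message grammar.
ORA_1652 = re.compile(
    r"ORA-01652: unable to extend temp segment by (?P<amount>\d+) "
    r"in tablespace (?P<tablespace>\w+)"
)


def parse_ora_1652(line):
    """Return (amount, tablespace) if the line reports ORA-01652, else None."""
    match = ORA_1652.search(line)
    if match is None:
        return None
    return int(match.group("amount")), match.group("tablespace")


if __name__ == "__main__":
    sample = ("ORA-01652: unable to extend temp segment by 128 "
              "in tablespace TEMP. Preparing : SELECT version FROM version")
    print(parse_ora_1652(sample))
```

A check of this kind would only flag the symptom on the VOMS side; it says nothing about which database user filled the tablespace.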

Post Analysis at the DB level

Miguel reported:

The problem was due to full usage of the temporary tablespace on the LCG database. This space is used for big sorts and also to temporarily store large objects (LOBs) before they are written to the correct data file. The temporary space is shared among all database users, which is why VOMS was affected.

A leak exists on either the Oracle or the application side, triggered by some queries of the SAM application. Because of this leak, the temporary space used by those queries is not freed at the end of a transaction but only when the connection is closed. The problem over the weekend was resolved when the SAM developers restarted their Tomcat server at 6:30 am on Sunday (thus restarting all the Oracle connections).

As the problematic queries are issued through a connection pool, it is not easy to identify the responsible query.
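
The mechanism described above can be illustrated with a toy model in Python. It is only a sketch under stated assumptions: `ToyConnection` and `ToyPool` are hypothetical classes invented for this example, the space units are arbitrary, and nothing here talks to Oracle or reproduces its real behaviour. It shows why long-lived pooled connections accumulate temporary space when space is released only on connection close, and why restarting the pool frees it.

```python
class ToyConnection:
    """Hypothetical stand-in for one pooled database connection."""

    def __init__(self):
        self.temp_held = 0  # temporary space still held by this session

    def run_leaky_query(self, temp_needed):
        # Models the reported leak: the space is NOT returned at the end
        # of the transaction, so it keeps adding up on the session.
        self.temp_held += temp_needed

    def close(self):
        # Closing the session is the only point where space is returned.
        freed = self.temp_held
        self.temp_held = 0
        return freed


class ToyPool:
    """Hypothetical fixed-size pool that keeps its connections open."""

    def __init__(self, size):
        self.connections = [ToyConnection() for _ in range(size)]

    def temp_in_use(self):
        return sum(conn.temp_held for conn in self.connections)

    def restart(self):
        """Close every connection and open fresh ones (like a server restart)."""
        freed = sum(conn.close() for conn in self.connections)
        self.connections = [ToyConnection() for _ in self.connections]
        return freed


if __name__ == "__main__":
    pool = ToyPool(size=3)
    for _day in range(12):  # 12 days without a restart, as in the report
        for conn in pool.connections:
            conn.run_leaky_query(temp_needed=10)
    print("held before restart:", pool.temp_in_use())
    print("freed by restart:", pool.restart())
    print("held after restart:", pool.temp_in_use())
```

In this model the held space grows linearly with the time since the last restart, which is consistent with a weekly restart having masked the problem earlier.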

Extra monitoring was set up to better understand the problem, and the SAM developers will restart the Tomcat server once a week, as they were doing before the error happened. This was the first time they had left Tomcat running for 12 days without a restart.
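
As a sketch of what attributing temporary space to database users could look like, the Python function below totals the space held per user. It assumes rows of (username, tablespace, blocks), for instance as they might be fetched from an Oracle view such as V$TEMPSEG_USAGE; the function name, the row layout and the sample values are hypothetical and do not describe the monitoring the DB team actually deployed.

```python
from collections import defaultdict


def temp_blocks_by_user(rows, tablespace="TEMP"):
    """Sum temporary blocks per user for one tablespace, largest first.

    rows: iterable of (username, tablespace, blocks) tuples. This layout
    is an assumption made for the example.
    """
    totals = defaultdict(int)
    for username, row_tablespace, blocks in rows:
        if row_tablespace == tablespace:
            totals[username] += blocks
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    sample_rows = [
        ("APP_A", "TEMP", 512),
        ("APP_B", "TEMP", 128),
        ("APP_A", "TEMP", 256),
        ("APP_A", "OTHER", 999),
    ]
    for user, blocks in temp_blocks_by_user(sample_rows):
        print(user, blocks)
```

Because the queries go through a connection pool, a per-user total like this can point at the application but not at the individual query.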

A service request may be opened with Oracle once we have evidence that the leak is not on the application side.

Followup

Ask the DB admins whether this will at least be detected on their side in future as a consequence of the extra monitoring, i.e. will it influence the SLS status, cause a call-out via the operators, etc.?

Response

From Miguel:

The extra monitoring is just to better analyse the problem; it does not generate any alarms. The SAM people should now be restarting their Tomcat weekly, which solves the problem. At the same time they are investigating their code for a leak.

-- SteveTraylen - 18 Dec 2008

Topic revision: r1 - 2008-12-18 - SteveTraylen
 