VOMS Post Mortem for Downtime 14th of December 2008
Impact
All VOMS Core Services on voms.cern.ch and lcg-voms.cern.ch were down for around 7 hours from 00:01 on 14th December 2008.
Operators and Notifications
- The operators correctly received notifications of the failures and tried service restarts without success.
- SLS was reporting the downtime correctly.
- SMS messages were sent to the service managers but were not read until after the service had recovered.
The service apparently corrected itself at around 07:00.
Post Analysis at the VOMS level
The VOMS logs clearly showed:
ORA-01652: unable to extend temp segment by 128 in tablespace TEMP. Preparing : SELECT version FROM version"
Consequently the database admins were asked to investigate what may have happened.
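As a rough illustration of how this error could be picked out of a log automatically, the sketch below scans log lines for the ORA-01652 message quoted above. The function name and the idea of feeding it a list of lines are assumptions made for the example; they are not part of VOMS or of the tooling actually used here.

```python
import re

# Pattern based on the ORA-01652 message quoted in the log excerpt above.
ORA_01652 = re.compile(
    r"ORA-01652: unable to extend temp segment by (?P<blocks>\d+) "
    r"in tablespace (?P<tablespace>\w+)"
)


def find_temp_exhaustion(lines):
    """Return (line_number, tablespace, blocks) for each ORA-01652 hit.

    Hypothetical helper: `lines` is any iterable of log lines.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        match = ORA_01652.search(line)
        if match:
            hits.append(
                (number, match.group("tablespace"), int(match.group("blocks")))
            )
    return hits
```

A hit on a shared tablespace such as TEMP suggests the cause may lie outside the application that logged the error, which is why the question was passed to the database admins.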
Post Analysis at the DB level
Miguel reported:
The problem was due to full usage of the temporary tablespace on the LCG database. This space is used for doing big sorts and also to temporarily store large objects (LOBs) before they are written to the correct data file. The temporary space is shared among all database users, which is why VOMS was affected.
A leak exists on either the Oracle or the application side, generated by some queries of the SAM application. This leak means the temporary space used by those queries is not freed at the end of a transaction but only when a connection finishes. The problem over the weekend was solved when the SAM developers restarted their Tomcat server at 6:30 am on Sunday (thus restarting all the Oracle connections).
As the problematic queries are done using a connection pool, it is not easy to discover the responsible query.
Extra monitoring was set up to try to better understand the problem, and the SAM developers will restart the Tomcat server once a week as they were doing before the error happened. This was the first time they had left Tomcat running for 12 days without a restart.
Possibly a service request will be opened with Oracle once we have evidence that the leak is not on the application side.
Followup
Ask the DB admins whether this will at least be detected on their side in the future as a consequence of the extra monitoring, i.e. will it influence SLS status, cause a call-out via the operators, etc.
Response
From Miguel:
The extra monitoring is just to better analyse the problem; it does not
generate any alarms. SAM people should now be restarting their Tomcat
weekly, which solves the problem. At the same time they are trying to
investigate their code for some leak.
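
Since the monitoring does not raise alarms, the sketch below shows one way per-user temporary space usage might be summarised and checked against a threshold. It is only an illustration under stated assumptions: the query text assumes Oracle's V$TEMPSEG_USAGE and V$SESSION views and should be verified against the database version in use, and the function names, the 50% threshold and the row layout are hypothetical, not what the DB admins actually deployed.

```python
from collections import defaultdict

# Assumed query (not taken from the report): one row per temp segment,
# joined to the owning session. Verify the view and column names against
# the Oracle version in use before relying on it.
TEMP_USAGE_SQL = """
SELECT s.username, s.program, u.blocks
FROM v$tempseg_usage u
JOIN v$session s ON s.saddr = u.session_addr
"""


def temp_blocks_by_user(rows):
    """Sum temp blocks per database user.

    `rows` is an iterable of (username, program, blocks) tuples, for
    example the rows fetched after running TEMP_USAGE_SQL.
    """
    totals = defaultdict(int)
    for username, _program, blocks in rows:
        totals[username] += blocks
    return dict(totals)


def users_over_threshold(rows, total_blocks, fraction=0.5):
    """Return users holding more than `fraction` of the temp tablespace."""
    limit = total_blocks * fraction
    usage = temp_blocks_by_user(rows)
    return sorted(user for user, blocks in usage.items() if blocks > limit)
```

Because the leaked space is only released when a connection closes, a per-user total that keeps growing between Tomcat restarts would be one possible sign that the leak is still present.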
--
SteveTraylen - 18 Dec 2008