Instead of accessing the real ConditionDB, MC simulation jobs at non-T1 sites access an SQLite file available in the shared area. Very often this access does not succeed, and this is
usually evident when the Gauss logs contain lines like:


GiGa                            INFO Stacking Action Object is not required to be loaded
COOLConfSvc                     INFO Persistency Connection Retrial Period set to 60s
COOLConfSvc                     INFO Persistency Connection Retrial Time-Out set to 900s
Persistency/RelationalPlugi...  ERROR SQLiteStatement::prepare 5 database is locked
Persistency/RelationalPlugi...  ERROR SQLiteStatement::fetchNext 21 database is locked
DDDB                            ERROR Problems opening database
DDDB                            ERROR cool::DatabaseDoesNotExist: The database does not exist
GiGa.TrackSeq.RichG4P...        FATAL  Exception with tag=DetectorDataSvc is caught
GiGa.TrackSeq.RichG4P...        ERROR DetectorDataSvc GaudiException in loadObject() /dd StatusCode=FAILURE


In the following we suggest a template that one can copy and paste into the GGUS ticket for the site, explaining how this problem can arise and what basic investigation and measures can be adopted to tackle it.

One of the most recurrent problems with SQLite on NFS is caused by the OS file-locking mechanism.
In general, once the lockd server fails to complete an RPC,
it waits forever, preventing any further locking from working. All clients
that try to take a lock will stall, time out or crash, depending on their
configuration and software versions.

The site admins should try to answer the following questions:
are the lockd services started correctly? Are there any firewall rules
blocking the lockd requests? Can anybody try the "lockfile" command to
see if it hangs? A quick check that the relevant RPC services are registered is sketched below.
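
As a first check (a minimal sketch, not part of the original recipe; it assumes the rpcinfo command is available on the node), one can verify that the NFS lock manager and status daemon are registered with the portmapper:

import subprocess

# Query the local portmapper; "rpcinfo -p <nfs-server>" can be used to check the server side too.
out = subprocess.check_output(["rpcinfo", "-p"])

for service in ("nlockmgr", "status"):
    if service in str(out):
        print("%s is registered" % service)
    else:
        print("WARNING: %s is NOT registered with the portmapper" % service)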

The simplest way (that I know of) to test this is the following Python code:

import fcntl

# Try to take an exclusive, non-blocking lock on a file in the NFS-mounted directory.
fp = open("lock-test.txt", "a")
fcntl.lockf(fp.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)

If that code does not work in an NFS-mounted directory (which obviously
must be writable for this particular test), the lock daemon
is stuck (or locking is configured incorrectly). A slightly more verbose variant that reports the result is sketched below.
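
For convenience (again a sketch, not part of the original text; the file name and messages are illustrative), the same test can be wrapped so that it reports the hostname and a clear result, which makes it easy to run on every worker node:

import fcntl
import socket
import sys

fname = "lock-test.txt"  # must live on the NFS share under test
try:
    fp = open(fname, "a")
    fcntl.lockf(fp.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
    fcntl.lockf(fp.fileno(), fcntl.LOCK_UN)
    print("%s: NFS locking OK" % socket.gethostname())
except IOError as err:
    # Note that a stuck lockd can also show up as a hang here rather than an exception.
    print("%s: NFS locking FAILED (%s)" % (socket.gethostname(), err))
    sys.exit(1)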

Note that where several load-balanced NFS servers are in use, one of them
may have the problem while the others do not, so the test will fail only on
some nodes.
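
If the site wants to survey all worker nodes at once, a trivial loop over ssh can do it (a sketch only: the node names and the script path are placeholders, and password-less ssh to the nodes is assumed):

import subprocess

nodes = ["wn001", "wn002", "wn003"]  # placeholder worker-node names
for node in nodes:
    # Run the lock test above on each node; the script path is a placeholder.
    subprocess.call(["ssh", node, "python", "/path/to/nfs-lock-test.py"])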

A work-around can be to mount the NFS share with the option "nolock".
With this option the client pretends that locking works: locks are handled locally and are never sent to the NFS server.
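
As an illustration only (the server name and mount point are placeholders, not taken from the original text), the corresponding /etc/fstab entry could look like:

nfsserver:/export/shared   /shared   nfs   rw,nolock   0 0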

-- RobertoSantinel - 17 Dec 2008
