Information System Troubleshooting Guide

Which BDII version?

This page is about BDII v4 used on gLite 3.1. For BDII v5 used on gLite 3.2 please refer to:

https://twiki.cern.ch/twiki/bin/view/EGEE/InfoTrouble

Troubleshooting Steps

Before attempting to troubleshoot problems with the information system, it is important to have a general overview of the information system, in particular working knowledge of the BDII and GIP.

Information flows from the resource level BDII to the top level BDII via the site level BDII. For this reason troubleshooting follows a top-down approach, starting at the top level BDII and working down.

If the information in the top level BDII is correct then there is usually no problem. If the information is correct in the site level BDII but not correct in the top level BDII, then the problem is probably with the top level BDII. Similarly, if the information is correct in the resource level BDII but not correct in the site level BDII, then the problem is probably with the site level BDII.
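One way to carry out this comparison is to dump two levels with ldapsearch and compare the entry DNs. The following is only a minimal sketch and is not part of the BDII distribution: the function names ldif_dns and missing_dns are hypothetical, and it assumes that both levels publish the same entries under different suffixes, which are passed as arguments so that they can be stripped before comparing. The comparison is case- and whitespace-sensitive.

```shell
# Hypothetical helpers (not shipped with the BDII) for comparing two LDIF dumps.

# ldif_dns FILE SUFFIX
# Print the sorted DNs found in FILE, with a trailing ",SUFFIX" removed.
# Wrapped LDIF lines (continuations start with a single space) are unfolded.
ldif_dns() {
    awk -v suffix=",$2" '
        function emit(l,    n) {
            if (l !~ /^dn: /) return
            l = substr(l, 5)                       # drop the leading "dn: "
            n = length(l) - length(suffix)
            if (n > 0 && substr(l, n + 1) == suffix) l = substr(l, 1, n)
            print l
        }
        /^ / { line = line substr($0, 2); next }   # continuation line
        { emit(line); line = $0 }
        END { emit(line) }
    ' "$1" | LC_ALL=C sort
}

# missing_dns UPPER_FILE UPPER_SUFFIX LOWER_FILE LOWER_SUFFIX
# Print entries present in the lower-level dump but absent from the upper one.
missing_dns() {
    upper=$(mktemp) && lower=$(mktemp) || return 1
    ldif_dns "$1" "$2" > "$upper"
    ldif_dns "$3" "$4" > "$lower"
    LC_ALL=C comm -13 "$upper" "$lower"
    rm -f "$upper" "$lower"
}
```

The dumps themselves can be produced with a command of the form ldapsearch -LLL -x -h <host> -p 2170 -b <base> > file.ldif. Any entry printed by missing_dns exists at the lower level but did not reach the upper level, which points at the upper-level BDII.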

The BDII obtains information by running the GIP. To test this step, run the GIP by hand as the same user that the BDII runs as. If the GIP returns the information correctly, the problem is with the BDII; otherwise the problem is in the GIP.

If the problem is with the BDII, check the BDII log file for error messages. The BDII stores the output from querying each URL in its temporary directory, in a file named after that URL. This file can be checked to see whether the problem lies in the query or in inserting the result into the BDII.

If the problem is with the GIP, check that all the providers and plugins used by the GIP are running correctly. It is important to run these as the same user that the BDII runs as, in order to spot permission problems. The GIP stores the output of each one in the GIP cache directory, in a file named after the provider or plugin. This file can be checked to see whether the problem lies in running the providers and plugins or in the GIP itself.
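Checking the providers and plugins one by one can be scripted. The following is a minimal sketch, not part of the GIP: the function name check_executables is hypothetical, and the directory to scan is passed as an argument because its location depends on the installation.

```shell
# Hypothetical helper: run every executable in a directory and report whether
# it exited cleanly and whether it wrote anything to standard output.
# Error messages from the executables are left visible on standard error.
check_executables() {
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] && [ -x "$f" ] || continue
        if out=$("$f"); then status=ok; else status="exit $?"; fi
        if [ -n "$out" ]; then produced=output; else produced=empty; fi
        printf '%s: %s, %s\n' "$(basename "$f")" "$status" "$produced"
    done
}
```

Run it as the user the BDII runs as, once for the provider directory and once for the plugin directory. On a gLite 3.1 installation these are assumed here to be /opt/glite/etc/gip/provider and /opt/glite/etc/gip/plugin; adjust the paths if your layout differs. A line reporting a non-zero exit or empty output identifies the provider or plugin to investigate.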

Common Problems

BDII fails to start

If the BDII fails to start, this could be an underlying problem with the LDAP database. Try to start the slapd server with the default slapd.conf file.

/usr/sbin/slapd -f /etc/openldap/slapd.conf -d 255

If this fails, there is a problem with the LDAP installation. Note that this has been experienced when using virtual machines. Online forums for OpenLDAP and for the OS distribution can be useful in solving this kind of problem.

If the LDAP installation has been verified, the slapd.conf file used by the BDII should be tested.

/usr/sbin/slapd -f /opt/bdii/var/2171/bdii-slapd.conf -d 255

If this fails there is a problem with the BDII slapd.conf file.

Unable to initialize mutex Error

The following error has been reported:

...
bdb_db_init: Initializing BDB database
bdb(o=grid): unable to initialize mutex: Function not implemented
bdb(o=grid): /opt/bdii/var/2171/__db.001: unable to initialize environment lock: Function not implemented
...

This issue may be fixed by following the FAQ provided by Oracle: http://www.oracle.com/technology/products/berkeley-db/faq/db_faq.html#12

Entries missing in the BDII

If invalid LDIF is produced, the entry will be rejected when it is inserted into the LDAP database. Rejected entries are recorded in the BDII log file.
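Gross syntax errors in the LDIF can be caught before the data reaches the BDII. The following is a rough sketch, not a full LDIF validator and not part of the BDII: the function name ldif_lint is hypothetical, and it only checks that each entry starts with a dn: line and that every other line looks like "attribute: value". It does not check the entries against the schema, which is another reason an entry can be rejected.

```shell
# Hypothetical helper: flag lines that cannot be valid LDIF.
# Reads LDIF on standard input, prints "line N: reason" for each problem
# found, and returns non-zero if there was at least one.
ldif_lint() {
    awk '
        /^#/ { next }                                  # comment
        /^$/ { in_entry = 0; next }                    # blank line ends the entry
        /^ / {                                         # continuation line
            if (!in_entry) { print "line " NR ": continuation outside an entry"; bad = 1 }
            next
        }
        !in_entry && /^version:/ { next }              # optional LDIF version header
        !in_entry {
            in_entry = 1
            if ($0 !~ /^dn:/) { print "line " NR ": entry does not start with dn:"; bad = 1 }
            next
        }
        $0 !~ /^[A-Za-z][A-Za-z0-9;-]*:/ { print "line " NR ": not an attribute: value line"; bad = 1 }
        END { exit bad + 0 }
    '
}
```

Assuming the GIP writes its LDIF to standard output, its output can be piped into ldif_lint; the reported line numbers then refer to lines of that output.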

Default values shown instead of dynamic values

The dynamic plugin might have a problem, or there may be a mismatch between the DNs. Check that the DNs produced by the dynamic plugin are the same as those in the static LDIF file. The dynamic plugin should be executed as the same user that the BDII runs as, in order to spot permission problems. Run the following command; standard output is discarded so that only error messages are displayed.

su edguser /opt/glite/libexec/glite-info-wrapper > /dev/null

Run the following commands to ensure that the permissions are correct.

chown -R edguser:edguser /opt/glite/var/lock/gip/
chown -R edguser:edguser /opt/glite/var/tmp/gip/
chown -R edguser:edguser /opt/glite/var/cache/gip/
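Before changing ownership it can be useful to see which files are actually affected. The following is a minimal sketch; the function name not_owned_by is hypothetical.

```shell
# Hypothetical helper: list anything under the given directories that is not
# owned by the given user. Prints nothing when ownership is already correct.
not_owned_by() {
    user=$1
    shift
    find "$@" ! -user "$user" -print
}
```

For example, not_owned_by edguser /opt/glite/var/lock/gip/ /opt/glite/var/tmp/gip/ /opt/glite/var/cache/gip/ lists the files that the chown commands above would change.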

BDII started but no response from port 2170

Run netstat -l to see whether the slapd processes are listening. You should see at least one port in the range 2171 - 2173 listening. These are the ports that the LDAP servers listen on.

tcp        0      0 localhost.localdomain:2171  *:*    LISTEN
tcp        0      0 localhost.localdomain:2172  *:*    LISTEN

Try running an ldapsearch locally on each of these ports. For example, run the following if you see 2171 and 2172 listed.

ldapsearch -x -h localhost -p 2171 -b mds-vo-name=local,o=grid
ldapsearch -x -h localhost -p 2172 -b mds-vo-name=local,o=grid

If those return LDAP information but there is no response from port 2170, the problem might be with the bdii-fwd process. Check that the bdii-fwd process is running (ps aux | grep bdii-fwd).
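The port check described in this section can be scripted by filtering the netstat output. The following is a minimal sketch; the function name bdii_listening_ports is hypothetical, and it assumes the usual netstat layout in which the local address is the fourth column and the state is the last one.

```shell
# Hypothetical helper: read netstat -l style output on standard input and
# print the BDII-related ports (2170 to 2173) that are in the LISTEN state.
bdii_listening_ports() {
    awk '
        $NF == "LISTEN" {
            port = $4
            sub(/^.*:/, "", port)                  # keep what follows the last colon
            if (port ~ /^[0-9]+$/ && port + 0 >= 2170 && port + 0 <= 2173) print port
        }
    ' | sort -u
}
```

For example, netstat -ln | bdii_listening_ports prints one port per line. If ports from the range 2171 - 2173 are listed but 2170 is not, that is consistent with a problem in the bdii-fwd process.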

If this doesn't solve the problem and the bdii-fwd process IS running, try this

The BDII is overloaded with queries

Due to the critical nature of the information system with respect to the operation of the grid, the BDII should be installed as a stand-alone service to ensure that problems with other services do not affect the BDII. Under no circumstances should the BDII be co-hosted with a service which has the potential to generate a high load. If there are too many queries to a BDII and the load is too high, multiple instances of the BDII can be deployed as a DNS load-balanced BDII service behind a "round robin" DNS alias.

To evaluate the load generated by slapd alone, a stand-alone slapd can be run on port 2170.

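# Stop the BDII so that port 2170 is free
/etc/rc.d/init.d/bdii stop
# Dump the content of a running BDII (here the host lcg-bdii) into an LDIF file
ldapsearch -LLL -x -h lcg-bdii -p 2170 -b o=grid > dump.ldif
# Load the dump into the local database
slapadd -c -f /opt/bdii/var/2171/bdii-slapd.conf -l dump.ldif
# Start a stand-alone slapd on port 2170 as edguser
slapd -f /opt/bdii/var/2171/bdii-slapd.conf -h ldap://`hostname -f`:2170 -u edguser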

To enable detailed slapd logging of the incoming queries:

Add to slapd.conf, before the database section:
  loglevel 256

Add to /etc/syslog.conf:
 local4.*                /var/log/slapd.log

Restart the syslog daemon:
   service syslog restart

Restart slapd.

The log file can be parsed by this script which will generate a summary
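As an illustration of what such a summary can look like, here is a minimal sketch. It is not the script referred to above: the function name summarize_slapd_log is hypothetical, and it assumes the usual slapd statistics log lines of the form "conn=N fd=N ACCEPT from IP=address:port ..." and "conn=N op=N SRCH base="..." ...". It counts connections per client address and searches per search base.

```shell
# Hypothetical stand-in for a log summary: count connections per client
# address and searches per search base in a slapd log written with loglevel 256.
summarize_slapd_log() {
    awk '
        match($0, /ACCEPT from IP=[^ ]+/) {
            ip = substr($0, RSTART + 15, RLENGTH - 15)    # drop "ACCEPT from IP="
            sub(/:[0-9]+$/, "", ip)                       # drop the client port
            clients[ip]++
        }
        match($0, /SRCH base="[^"]*"/) {
            bases[substr($0, RSTART + 5, RLENGTH - 5)]++  # keep base="..."
        }
        END {
            for (ip in clients) print clients[ip], "client", ip
            for (b in bases)    print bases[b], "search", b
        }
    ' "$@" | LC_ALL=C sort -k2,2 -k1,1nr
}
```

With the syslog configuration shown above it would be called as summarize_slapd_log /var/log/slapd.log. Each output line gives a count, the kind of item, and the client address or search base, with the busiest clients and most frequent search bases listed first within each kind.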

BDB backend dies on memory allocation error (slapd doesn't start)

This issue has been seen on a virtual machine with limited memory.

slapd -f /opt/bdii/var/2171/bdii-slapd.conf -d 25

bdb_db_open: dbenv_open(/opt/bdii/var/2171)
bdb_db_open: dbenv_open(/opt/bdii/var/2171/infosys)
bdb(o=infosys): mmap: Cannot allocate memory
bdb(o=infosys): PANIC: Cannot allocate memory
bdb_db_open: dbenv_open failed: DB_RUNRECOVERY: Fatal error, run database recovery (-30978)
backend_startup: bi_db_open(1) failed! (-30978)
slapd shutdown: initiated
====> bdb_cache_release_all
====> bdb_cache_release_all
bdb(o=infosys): DB_ENV->lock_id_free interface requires an environment configured for the locking subsystem
slapd shutdown: freeing system resources.
bdb(o=infosys): txn_checkpoint interface requires an environment configured for the transaction subsystem
bdb_db_destroy: txn_checkpoint failed: Invalid argument (22)

The solution is to reduce the cache memory allocation specified in /opt/bdii/etc/DB_CONFIG, which is 1 GB by default:
# format : N_GBytes N_Bytes N_segments

# the default : 
cache 1 0 1

# 50 MB
cache 0 50000000 1
Topic revision: r15 - 2009-12-05 - MaartenLitmaath
 