Information System Troubleshooting Guide
Which BDII version?
This page is about BDII v4 used on gLite 3.1. For BDII v5 used on gLite 3.2 please refer to:
https://twiki.cern.ch/twiki/bin/view/EGEE/InfoTrouble
Troubleshooting Steps
Before attempting to troubleshoot problems with the information system, it is important to have a general
overview of the information system, in particular working knowledge of the BDII and GIP.
Information flows from the resource level BDII to the top level BDII via the site level BDII. For this reason a top-down approach to troubleshooting is followed.
If the information in the top level BDII is correct then there is usually no problem. If the information is correct in the site level BDII but not correct in the top level BDII, then the problem is probably with the top level BDII. Similarly, if the information is correct in the resource level BDII but not correct in the site level BDII, then the problem is probably with the site level BDII.
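The top-down reasoning above can be written down as a small decision helper. This is only a sketch: the function name locate_fault and the ok/bad arguments are hypothetical, and judging whether the information at a given level is correct is still done by hand, for example by comparing ldapsearch output.

```shell
# Sketch: map "is the information correct at each level?" to the component
# to inspect first. Arguments are "ok" or "bad" for the top, site and
# resource level BDII, in that order (hypothetical convention).
locate_fault() {
    top=$1 site=$2 resource=$3
    if [ "$top" = ok ]; then
        echo "usually no problem"
    elif [ "$site" = ok ]; then
        echo "top level BDII"
    elif [ "$resource" = ok ]; then
        echo "site level BDII"
    else
        # Nothing upstream is correct: start at the resource level.
        echo "resource level BDII or its GIP"
    fi
}
```

For example, locate_fault bad ok ok points at the top level BDII, because the site level already holds correct information.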
The BDII obtains information by running the GIP. The GIP should be executed as the same user that the BDII runs as. If the GIP returns the information correctly, the problem is with the BDII; otherwise the problem is in the GIP.
If the problem is with the BDII, check the BDII log file for error messages. The BDII stores the output from querying each URL in the BDII's temporary directory, in a file named after that URL. This file can be checked to see whether the problem lies in the query or in inserting the result into the BDII.
If the problem is with the GIP, check that all the providers and plugins used by the GIP are running correctly. It is important to run these as the same user that the BDII runs as, in order to spot permission problems. The GIP stores the output from each of these in the GIP cache directory, in a file named after the provider or plugin. This file can be checked to see whether the problem lies in running the providers/plugins or in the GIP itself.
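As a first pass over the GIP cache directory, it can help to list the cache files that are empty. The sketch below assumes that an empty file points at a provider or plugin that produced no output. That is a heuristic, not something the GIP itself reports. The function name is hypothetical and the directory is passed as an argument.

```shell
# Sketch: list empty files below a cache directory.
# Assumption: an empty cache file suggests that the corresponding
# provider or plugin produced no output when it was last run.
empty_cache_files() {
    dir=$1
    # -size 0 matches only files that contain no data at all.
    find "$dir" -type f -size 0 | sort
}
```

With the cache path used elsewhere on this page, this would be called as empty_cache_files /opt/glite/var/cache/gip/ and the files it prints are the ones to look at first.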
Common Problems
BDII fails to start
If the BDII fails to start, this could be an underlying problem with the LDAP database. Try to start the slapd server with the default slapd.conf file.
/usr/sbin/slapd -f /etc/openldap/slapd.conf -d 255
If this fails, there is a problem with the LDAP installation. Note that this has been experienced when using virtual machines. Online forums related to LDAP and the OS distribution can be useful for solving this problem.
If the LDAP installation has been verified, the slapd.conf file used by the BDII should be tested.
/usr/sbin/slapd -f /opt/bdii/var/2171/bdii-slapd.conf -d 255
If this fails there is a problem with the BDII slapd.conf file.
Unable to initialize mutex Error
The following error has been reported:
...
bdb_db_init: Initializing BDB database
bdb(o=grid): unable to initialize mutex: Function not implemented
bdb(o=grid): /opt/bdii/var/2171/__db.001: unable to initialize environment lock: Function not implemented
...
This issue may be fixed using the FAQ provided by Oracle:
http://www.oracle.com/technology/products/berkeley-db/faq/db_faq.html#12
Entries missing in the BDII
If invalid LDIF is produced, then the entry will be rejected when it is being inserted into the LDAP database. Rejected entries will be recorded in the BDII log file.
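A rough structural check can catch some invalid LDIF before it reaches the BDII. The sketch below is deliberately simplified and is not a full LDIF parser or a schema check: it only verifies that each entry starts with a dn: line and that every other line looks like "attribute: value". It does not know about an optional leading "version:" line, which it would flag. The function name check_ldif is hypothetical.

```shell
# Sketch: minimal structural check of LDIF read from files or stdin.
# Exit status is non-zero if any problem was reported.
check_ldif() {
    awk '
        BEGIN { in_entry = 0; bad = 0 }
        /^#/ { next }                    # skip comment lines
        /^$/ { in_entry = 0; next }      # a blank line closes the current entry
        /^ / { next }                    # continuation of the previous line
        !in_entry {
            in_entry = 1
            if ($0 !~ /^dn:/) {
                print "line " NR ": entry does not start with dn:"
                bad = 1
            }
            next
        }
        $0 !~ /^[A-Za-z][A-Za-z0-9;.-]*:/ {
            print "line " NR ": not in attribute: value form"
            bad = 1
        }
        END { exit bad }
    ' "$@"
}
```

It can be run on provider output saved to a file, for example check_ldif provider-output.ldif (a hypothetical file name). Entries that pass this check can still be rejected by the LDAP server, so the BDII log file remains the authoritative place to look.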
Default values shown instead of dynamic values
The dynamic plugin might have a problem, or there is a mismatch between the DNs. Check that the DNs produced by the dynamic plugin are the same as those in the static LDIF file. The dynamic plugin should be executed as the same user that the BDII runs as, in order to spot permission problems. Run the following command to show any errors:
su edguser /opt/glite/libexec/glite-info-wrapper > /dev/null
Run the following commands to ensure that the permissions are correct.
chown -R edguser:edguser /opt/glite/var/lock/gip/
chown -R edguser:edguser /opt/glite/var/tmp/gip/
chown -R edguser:edguser /opt/glite/var/cache/gip/
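The DN comparison described in this section can be sketched as follows. The function names and the two file arguments are hypothetical. The comparison is an exact string match after removing the "dn:" prefix, and DNs folded over several lines are not handled.

```shell
# Sketch: print the DNs found in an LDIF file, one per line, sorted.
list_dns() {
    grep '^dn:' "$1" | sed 's/^dn:[ ]*//' | sort -u
}

# Print DNs that appear in the plugin output ($2) but not in the
# static LDIF file ($1). Any line printed is a DN with no static entry.
unmatched_dns() {
    tmp_static=$(mktemp)
    tmp_dynamic=$(mktemp)
    list_dns "$1" > "$tmp_static"
    list_dns "$2" > "$tmp_dynamic"
    # comm -13 keeps only lines unique to the second file.
    comm -13 "$tmp_static" "$tmp_dynamic"
    rm -f "$tmp_static" "$tmp_dynamic"
}
```

No output means that every DN produced by the plugin also exists in the static file. Because the match is exact, differences in case or spacing are reported too and have to be judged by hand.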
BDII started but no response from port 2170
Run netstat -l to see if the slapd ports are listening. You should see at least one port from the range 2171 - 2173 listening. These are the ports that the LDAP servers listen on.
tcp 0 0 localhost.localdomain:2171 *:* LISTEN
tcp 0 0 localhost.localdomain:2172 *:* LISTEN
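To pick the relevant ports out of a longer netstat listing, a small filter along these lines may help. It is a sketch that assumes the column layout shown above (local address in the fourth column, LISTEN in the last) and numeric port output. The function name is hypothetical.

```shell
# Sketch: read netstat output on stdin and print the listening ports
# in the range 2170-2173, one per line.
listening_bdii_ports() {
    awk '$NF == "LISTEN" {
        n = split($4, parts, ":")
        port = parts[n] + 0          # non-numeric service names become 0
        if (port >= 2170 && port <= 2173) print port
    }' | sort -u
}
```

A typical call would be netstat -ltn | listening_bdii_ports, where -t restricts the listing to TCP and -n forces numeric ports.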
Try running an ldapsearch locally on both of these ports. For example, run the following if you see 2171 and 2172 listed.
ldapsearch -x -h localhost -p 2171 -b mds-vo-name=local,o=grid
ldapsearch -x -h localhost -p 2172 -b mds-vo-name=local,o=grid
If those return LDAP information but there is no response from port 2170, the problem might be with the bdii-fwd process. Check that the bdii-fwd process is running (ps aux | grep bdii-fwd).
If this doesn't solve the problem and the bdii-fwd process IS running, try this
The BDII is overloaded with queries
Due to the critical nature of the information system with respect to the operation of the grid, the BDII should be installed as a stand-alone service to ensure that problems with other services do not affect the BDII. Under no circumstances should the BDII be co-hosted with a service that has the potential to generate a high load. If there are too many queries to a BDII and the load is too high, multiple instances of the BDII can be deployed as a DNS load-balanced BDII service behind a "round robin" DNS alias.
To evaluate the load of slapd alone, it can be run as a stand-alone slapd on port 2170.
/etc/rc.d/init.d/bdii stop
ldapsearch -LLL -x -h lcg-bdii -p 2170 -b o=grid > dump.ldif
slapadd -c -f /opt/bdii/var/2171/bdii-slapd.conf -l dump.ldif
slapd -f /opt/bdii/var/2171/bdii-slapd.conf -h ldap://`hostname -f`:2170 -u edguser
To enable detailed slapd logging of the incoming queries:
Add the following to slapd.conf, before the database section.
loglevel 256
Add the following to /etc/syslog.conf
local4.* /var/log/slapd.log
Restart the syslog daemon.
service syslog restart
Restart slapd.
The log file can be parsed by this script, which will generate a summary.
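As a rough stand-in for a summary, the number of connections per client can be counted directly from the log. This sketch is not the script referred to above. It assumes the usual slapd connection lines of the form "conn=N fd=M ACCEPT from IP=address:port" written at loglevel 256, and it handles IPv4 client addresses only. The function name is hypothetical.

```shell
# Sketch: count accepted connections per client IPv4 address.
# Output: "<count> <address>", busiest client first.
connections_per_client() {
    sed -n 's/.* ACCEPT from IP=\([^: ]*\):.*/\1/p' "$@" \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'     # normalise the padding added by uniq -c
}
```

With the syslog configuration above this would be run as connections_per_client /var/log/slapd.log. A client that dominates the list is a good candidate to contact before deploying additional BDII instances.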
BDB backend dies on memory allocation error (slapd doesn't start)
This issue has been seen on a virtual machine with limited memory.
slapd -f /opt/bdii/var/2171/bdii-slapd.conf -d 25
bdb_db_open: dbenv_open(/opt/bdii/var/2171)
bdb_db_open: dbenv_open(/opt/bdii/var/2171/infosys)
bdb(o=infosys): mmap: Cannot allocate memory
bdb(o=infosys): PANIC: Cannot allocate memory
bdb_db_open: dbenv_open failed: DB_RUNRECOVERY: Fatal error, run database recovery (-30978)
backend_startup: bi_db_open(1) failed! (-30978)
slapd shutdown: initiated
====> bdb_cache_release_all
====> bdb_cache_release_all
bdb(o=infosys): DB_ENV->lock_id_free interface requires an environment configured for the locking subsystem
slapd shutdown: freeing system resources.
bdb(o=infosys): txn_checkpoint interface requires an environment configured for the transaction subsystem
bdb_db_destroy: txn_checkpoint failed: Invalid argument (22)
The solution is to reduce the cache memory allocation specified in /opt/bdii/etc/DB_CONFIG, which is 1 GB by default:
# format : N_GBytes N_Bytes N_segments
# the default :
cache 1 0 1
# 50 MB
cache 0 50000000 1