XLDAP / nscd / nss_ldap issue
Description
A standard Quattor node reconfig run on lxbatch caused a very high load on the XLDAP servers, which triggered problems with the nscd daemon on other SLC5 nodes.
Impact
- Logins to lxplus were not possible for a period of around 2 hours
- Checkins to SVN and CVS were affected for a period of around 2 hours
- Lxbatch lost several hundred jobs (failing on job startup) - batch queues were set inactive for around 1.5 hours.
Background
The ncm-authconfig component is responsible for configuring password "file" resolution on the CERN Quattor-managed hosts and for configuring nscd (the name service cache daemon), which reduces the load on the LDAP servers that provide the password file to Linux hosts. A workaround introduced into the component at NIKHEF/SARA some time ago (flushing the nscd cache after a daemon restart) caused a very high load on the LDAP servers.
This in turn caused the nscd daemon on unrelated hosts to stop responding to getpwnam library calls. This caused logins to fail and caused random issues for any program that makes use of that call. The reason why nscd stops responding on completely unrelated hosts when the load on the LDAP servers is high is not understood.
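The "stuck" symptom described above can be probed without hanging the monitoring process itself. The following is a minimal sketch, not the check used during the incident: it runs a single getpwnam lookup in a child process and treats a timeout as "stuck". The function name probe_getpwnam, the 5 second default timeout and the exit code convention are all assumptions made for illustration.

```python
import subprocess
import sys

# Child script (illustrative): exit 0 if the user resolves,
# exit 2 if the lookup completes but the user is unknown.
_CHILD = (
    "import pwd, sys\n"
    "try:\n"
    "    pwd.getpwnam(sys.argv[1])\n"
    "except KeyError:\n"
    "    sys.exit(2)\n"
)


def probe_getpwnam(user, timeout=5.0, child_code=_CHILD):
    """Return 'ok', 'unknown', 'stuck' or 'error' for one passwd lookup.

    The lookup runs in a separate interpreter so that a hung name
    service call cannot block the caller; subprocess.run kills the
    child when the timeout expires.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", child_code, user],
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # The lookup did not return in time: the symptom seen on lxplus.
        return "stuck"
    if result.returncode == 0:
        return "ok"
    if result.returncode == 2:
        return "unknown"
    return "error"
```

Calling probe_getpwnam with a known account name on a node would distinguish a lookup that hangs from one that fails cleanly, which is the distinction that mattered for the lxplus logins.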
This specific trigger (a batch-wide Quattor node reconfig) has been performed once or twice every week for the last year. The version of ncm-authconfig with the cache flush has been in place since that time and has caused no problems until now. We are not aware of any other change in the underlying infrastructure, so the reason why this standard operation caused the issue now is not understood.
Time line of the incident
All times CEST, May 24th.
- 10.55 Standard Quattor reconfig of lxbatch applied:
  - nc-client --cl lxbatch --tag spma_ncm
- 11.00 Logins start to fail on lxplus; the service manager manually restarts all 'stuck' nscd daemons
- 11.15 Batch jobs start to fail as the LSF job startup code fails in getpwnam
- 11.30 Batch queues suspended
- 14.00 Load on XLDAP servers subsides; batch queues re-opened
Follow up
Meeting:
- ncm-authconfig will be changed to avoid flushing the nscd cache immediately after restart: this was applied and had no effect.
- Stuck nscd daemons should be escalated to Linux support for analysis and reported to Red Hat (nothing in Red Hat's public tracker).
- Batch reconfig (from not.d) should be spread over a longer time to reduce potential impact. See https://savannah.cern.ch/bugs/index.php?83125
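One way the last follow-up item could be approached is for each node to wait a host-specific delay before running its reconfig. This is a sketch under assumptions, not a description of how the existing notification mechanism works: the function name reconfig_delay, the one hour window and the lxb-style host names are hypothetical.

```python
import hashlib


def reconfig_delay(hostname, window_seconds):
    """Map a hostname to a stable delay in [0, window_seconds).

    Hashing the name (rather than drawing a random number) gives each
    host the same slot on every run, while spreading the hosts roughly
    evenly across the window.
    """
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_seconds


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    hosts = ["lxb%04d" % i for i in range(1, 11)]
    for host in hosts:
        print(host, reconfig_delay(host, 3600))
```

A node would sleep for its computed delay before starting the reconfig, so that the LDAP lookups triggered by the restart arrive over the whole window instead of within a few minutes.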
To understand:
- Why the nscd daemons get stuck and stop responding
- What change caused a standard weekly operation to suddenly start causing problems for the Linux password resolution mechanism. This is now understood to be the deployment of the glExec worker node software, which performs a full LDAP scan via getent passwd.
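The difference between a keyed lookup and a full enumeration can be illustrated with Python's pwd module. This is a sketch for illustration only and is not taken from the glExec software: pwd.getpwnam asks for a single entry, whereas pwd.getpwall walks the whole password database, which is the same kind of request as getent passwd with no key. The function names resolve_one and resolve_by_scan are hypothetical.

```python
import pwd


def resolve_one(user):
    """Keyed lookup: asks the name service for a single entry."""
    return pwd.getpwnam(user).pw_uid


def resolve_by_scan(user):
    """Enumeration: fetches every entry and filters locally.

    On a host whose password database is served from LDAP, this pulls
    the complete user list for each call.
    """
    for entry in pwd.getpwall():
        if entry.pw_name == user:
            return entry.pw_uid
    raise KeyError(user)
```

Both functions return the same uid for an existing account, but the second does so at the cost of a full scan of the password database, which is the access pattern identified above.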
Links