(2014-11-20, TWikiAdminUser)
<!--
   * Set ALLOWTOPICVIEW = Main.AllUsersGroup
   * Set ALLOWTOPICCHANGE = Main.AllUsersGroup
-->
---+ XLDAP / nscd / nss_ldap issue

%TOC%

---++ Description

A standard Quattor node reconfig run on lxbatch caused a very high load on the XLDAP servers, which triggered problems with the =nscd= daemon on other SLC5 nodes.

---++ Impact

   * Logins to lxplus were not possible for around 2 hours.
   * Checkins to SVN and CVS were affected for around 2 hours.
   * Lxbatch lost several hundred jobs (failing on job startup); batch queues were set inactive for around 1.5 hours.

---++ Background

The =ncm-authconfig= component configures password "file" resolution on the CERN Quattor-managed hosts. It also configures =nscd= (the name service cache daemon), which reduces the load on the LDAP servers that provide the password file to Linux hosts.

A workaround introduced into the component at NIKHEF/SARA some time ago, which flushes the =nscd= cache after a daemon restart, caused a very high load on the LDAP servers. This in turn caused the =nscd= daemon on unrelated hosts to stop responding to =getpwnam= calls. Logins failed as a result, and any program relying on that call saw random issues. Why =nscd= stops responding on completely unrelated hosts when the LDAP servers are under high load is not understood.

This specific trigger (a batch-wide Quattor node reconfig) has been performed once or twice every week for the last year. The version of =ncm-authconfig= with the cache flush has been in place throughout that time and had caused no problems until now. We are not aware of any other change in the underlying infrastructure, so why this standard operation caused the issue now is not understood.

---++ Time line of the incident

All times CEST, May 24th 2011.
   * 10.55 - Standard Quattor reconfig of lxbatch applied to pick up configuration changes
      * =nc-client --cl lxbatch --tag spma_ncm=
   * 11.00 - Logins start to fail on lxplus - service manager manually restarts all 'stuck' =nscd= daemons
   * 11.15 - Batch jobs start to fail as the LSF job startup code fails in =getpwnam=
   * 11.30 - Batch queues suspended
   * 14.00 - Load on XLDAP servers subsides - batch queues re-opened

---++ Follow up

Meeting:
   * =ncm-authconfig= will be changed to avoid flushing the =nscd= cache immediately after restart: this was applied and had no effect.
   * Stuck =nscd= daemons should be escalated to Linux support for analysis and reported to Red Hat (nothing in Red Hat's public tracker).
   * Batch reconfig (from =not.d=) should be spread over a longer time to reduce potential impact. See https://savannah.cern.ch/bugs/index.php?83125

To understand:
   * Why the =nscd= daemons get stuck and stop responding.
   * What change caused a standard weekly operation to suddenly start causing problems for the Linux password resolution mechanism. This is now understood: the cause was the deployment of the glExec worker node software, which performs a full LDAP scan via =getent passwd=.

---++ Links
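
---++ Example: spreading a batch-wide reconfig over a time window

The follow-up item about spreading the batch reconfig over a longer time can be illustrated with a small sketch. This is not how =not.d= or =nc-client= work; it only shows one possible approach, assuming each node can be given a start delay before it runs its reconfig. The function names, the example hostnames and the one-hour window are all hypothetical. The delay is derived from a hash of the hostname, so it is stable between runs and roughly evenly distributed across the window.

```python
import hashlib


def splay_delay(hostname: str, window_seconds: int) -> int:
    """Return a stable delay in [0, window_seconds) derived from the hostname.

    Hypothetical helper: hashing the name gives each node the same delay on
    every run, and spreads nodes roughly evenly over the window.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_seconds


def schedule(hostnames, window_seconds):
    """Return (delay_seconds, hostname) pairs, earliest start first."""
    return sorted((splay_delay(h, window_seconds), h) for h in hostnames)


# Hypothetical node names and a one-hour window, for illustration only.
hosts = [f"node{i:04d}.example.org" for i in range(1, 11)]
plan = schedule(hosts, window_seconds=3600)

for delay, host in plan:
    print(f"{host}: start reconfig after {delay} s")
```

With a window of an hour or more, the nodes no longer restart =nscd= and repopulate their caches at the same moment, which should flatten the load peak on the LDAP servers. Whether this would have been enough to avoid the incident is not known.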
Topic revision: r6 - 2014-11-20 - TWikiAdminUser