XLDAP / nscd / nss_ldap issue
Description
A standard Quattor node reconfig run on lxbatch caused a very high load on the XLDAP servers, which triggered problems with the nscd daemon on other SLC5 nodes.
Impact
- Logins to lxplus were not possible for a period of around 2 hours
- Checkins to SVN and CVS were affected for a period of around 2 hours
- Lxbatch blackholed a large number of jobs (exact count tbd); batch queues were set inactive for around 1.5 hours.
Background
...
Time line of the incident
All times CEST, May 24th.
10.55 - Standard Quattor reconfig of lxbatch applied to pick up reconfig
nc-client --cl lxbatch --tag spma_ncm
Analysis
Follow up
Meeting:
- ncm-authconfig will be changed to avoid flushing its cache immediately after restart
- Stuck 'nscd' daemons should be escalated to Linux support for analysis and reported to Red Hat (nothing in Red Hat's public tracker).
- Batch reconfig (from not.d) should be spread over a longer time to reduce potential impact
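
As an illustration of the last point only, the sketch below shows one generic way to spread a reconfig over a time window: each node derives a stable delay from a hash of its own hostname and waits that long before starting. This is not how Quattor or not.d implement it; the function name `splay_delay`, the node names, and the two-hour window are all assumptions made for the example.

```python
import hashlib


def splay_delay(hostname: str, window_seconds: int) -> int:
    """Return a deterministic delay in [0, window_seconds) for this host.

    Hashing the hostname gives every node a stable, roughly uniform offset,
    so the nodes do not all hit a shared service at the same moment.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % window_seconds


if __name__ == "__main__":
    # Hypothetical node names and window, for illustration only.
    nodes = [f"node{i:03d}" for i in range(5)]
    window = 2 * 60 * 60  # assumed two-hour window
    for node in nodes:
        print(node, splay_delay(node, window))
```

A deterministic offset keeps the schedule reproducible across runs; a random delay would spread the load equally well if reproducibility is not needed.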
Links