XLDAP / nscd / nss_ldap issue

Description

A standard Quattor node reconfig run on lxbatch caused a very high load on the XLDAP servers, which triggered problems with the nscd daemon on other SLC5 nodes.

Impact

  • Logins to lxplus were not possible for a period of around 2 hours
  • Checkins to SVN and CVS were affected for a period of around 2 hours
  • Lxbatch lost several hundred jobs, which failed on job startup. Batch queues were set inactive for around 1.5 hours.

Background

The ncm-authconfig component is responsible for configuring password "file" resolution on the CERN Quattor-managed hosts and for configuring nscd (the name service cache daemon), which reduces the load on the LDAP servers that provide the password file to Linux hosts. The component contains a workaround, introduced at NIKHEF/SARA some time ago, that flushes the nscd cache after a daemon restart. This flush caused a very high load on the LDAP servers.

This in turn caused the nscd daemon on unrelated hosts to stop responding to getpwnam calls. As a result, logins failed and any program that relies on that call ran into seemingly random problems. Why a high load on the LDAP servers causes nscd to stop responding on completely unrelated hosts is not understood.
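A stuck resolver of this kind can be detected by running a lookup with a deadline. The following is a minimal sketch, not the check used during the incident: the function name lookup_responds, the default timeout and the injectable lookup parameter are all assumptions made for illustration.

```python
import pwd
import threading


def lookup_responds(name, timeout=5.0, lookup=pwd.getpwnam):
    """Return True if lookup(name) finishes within `timeout` seconds.

    A missing user (KeyError) still counts as a response: the resolver
    answered, it just had no entry. Only a hang counts as a failure.
    """
    finished = threading.Event()

    def worker():
        try:
            lookup(name)
        except KeyError:
            pass  # the resolver answered: no such user
        finally:
            finished.set()

    # Daemon thread, so a hung lookup cannot block interpreter exit.
    threading.Thread(target=worker, daemon=True).start()
    return finished.wait(timeout)
```

A monitoring script could call lookup_responds("root") periodically and raise an alarm (or restart nscd) when it returns False. The lookup runs in a thread because a blocked getpwnam call cannot be interrupted from within the calling thread.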

This specific trigger (a batch-wide Quattor node reconfig) has been performed once or twice every week for the last year. The version of ncm-authconfig with the cache flush has been in place throughout that time and has caused no problems until now. We are not aware of any other change in the underlying infrastructure, so the reason why this standard operation caused the issue now is not understood.

Time line of the incident

All times CEST, May 24th.

  • 10.55 - Standard Quattor reconfig of lxbatch applied to pick up configuration changes
    • nc-client --cl lxbatch --tag spma_ncm

  • 11.00 - Logins start to fail on lxplus; service manager manually restarts all 'stuck' nscd daemons

  • 11.15 - Batch jobs start to fail as the LSF job startup code fails in getpwnam

  • 11.30 - Batch queues suspended

  • 14.00 - Load on XLDAP servers subsides; batch queues re-opened

Follow up

Meeting:

  • ncm-authconfig will be changed to avoid flushing the nscd cache immediately after restart. Update: this was applied and had no effect.
  • Stuck nscd daemons should be escalated to Linux support for analysis and reported to Red Hat (nothing was found in Red Hat's public tracker).
  • Batch reconfig (from not.d) should be spread over a longer time to reduce potential impact. See https://savannah.cern.ch/bugs/index.php?83125
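One way to spread a batch-wide reconfig over a longer time is to give each node a stable offset inside a fixed window. The sketch below is only an illustration of that idea and not the change tracked in the Savannah ticket: the function names, the hash-based offset and the one-hour window are assumptions.

```python
import hashlib


def start_delay(hostname, window_seconds):
    """Map a hostname to a stable whole-second delay in [0, window_seconds)."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer and fold it
    # into the window; the same host always gets the same slot.
    return int.from_bytes(digest[:8], "big") % window_seconds


def schedule(hostnames, window_seconds):
    """Return (delay, hostname) pairs sorted by start time."""
    return sorted((start_delay(h, window_seconds), h) for h in hostnames)


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    for delay, host in schedule(["node001", "node002", "node003"], 3600):
        print(f"{host}: start after {delay} s")
```

Deriving the delay from the hostname rather than from a random number keeps the schedule reproducible between runs, which makes it easier to correlate a load spike with the nodes that were reconfiguring at that moment.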

To understand:

  • Why the nscd daemons get stuck and stop responding
  • What change caused a standard weekly operation to suddenly start causing problems to the Linux password resolution mechanism. This is now understood: the cause was the deployment of the glExec worker node software, which performs a full LDAP scan via getent passwd.
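The difference between a full scan and a keyed lookup can be shown with Python's standard pwd module. This is a sketch for illustration only: the helper names and the injectable getall/getone parameters are assumptions, and nothing here reflects how glExec itself is implemented. To our understanding nscd caches keyed lookups but does not serve enumeration requests, so a full scan goes to the LDAP backend every time.

```python
import pwd


def enumerate_all(getall=pwd.getpwall):
    """Full enumeration, the in-process analogue of `getent passwd`.

    Every entry of the password database is fetched from the backend.
    """
    return [entry.pw_name for entry in getall()]


def resolve_one(name, getone=pwd.getpwnam):
    """Keyed lookup, the analogue of `getent passwd <name>`.

    Only one entry is requested, and the answer can come from a cache.
    Returns the numeric uid, or None if the user does not exist.
    """
    try:
        return getone(name).pw_uid
    except KeyError:
        return None
```

With a large LDAP-backed password database, enumerate_all() on every worker node at the same time multiplies the number of entries by the number of nodes, whereas resolve_one() costs one query per node at most.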

Links

Topic revision: r6 - 2014-11-20 - TWikiAdminUser