[CentOS] nscd segfaulting on centos 4.5

Wed Oct 10 16:52:08 UTC 2007
Craig White <craig at tobyhouse.com>

On Wed, 2007-10-10 at 10:19 -0500, jlee wrote:
> 
> Craig White wrote:
> > On Wed, 2007-10-10 at 08:16 -0400, Andy Harrison wrote:
> >>
> >>
> >>
> >> On 10/9/07, jlee  wrote:
> >>> output from /var/log/messages
> >>> Oct  9 12:56:38 lyra kernel: nscd[11660]: segfault at 0000002b401fee8b rip 000000552aab7966 rsp 00000000408029e0 error 4
> >>> Oct  9 13:16:38 lyra kernel: nscd[12540]: segfault at 0000002b401fee8b rip 000000552aab7966 rsp 00000000408029e0 error 4
> >>
> >> I'm starting to have this problem as well.  I have two mail servers
> >> running courier and postfix.  They've been up for a couple of weeks,
> >> but I only put them into production on Monday of this week, two days ago.
> >>
> >> Oct  9 07:34:49 ash kernel: nscd[3455]: segfault at 0000000040201000 rip 0000555555563274 rsp 00000000401a1df0 error 6
> >> Oct  9 07:35:20 ash nscd: 27206 invalid persistent database file "/var/db/nscd/passwd": verification failed
> >>
> >>
> >> Oct 10 07:33:37 oak kernel: nscd[25051]: segfault at 0000000040201000 rip 0000555555563274 rsp 00000000401a73a0 error 6
> >> Oct 10 07:33:48 oak nscd: 29526 invalid persistent database file "/var/db/nscd/passwd": verification failed
> >>
> >> The first time it had happened, I was using the stock /etc/nscd.conf
> >> file.  The second time it happened on the other server, I had doubled
> >> the max-db-size passwd value to 67108864.
> >>
> >> Both servers are running CentOS 5, with the firewall disabled and no SELinux.
> >>
> >> Linux ash 2.6.18-8.el5 #1 SMP Thu Mar 15 19:46:53 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
> >>
> >> (24)[11:58am] # yum list nscd
> >> nscd.x86_64                              2.5-12                 installed
> >>
> >>
> >>
> >> # ls -la /etc/ldap*
> >> lrwxrwxrwx 1 root root   18 Sep 27 15:14 /etc/ldap.conf -> openldap/ldap.conf
> >> lrwxrwxrwx 1 root root   20 Sep 27 15:14 /etc/ldap.secret -> openldap/ldap.secret
> >> # ls -la /etc/openldap/ldap.*
> >> -rw-r--r-- 1 root root 8974 Sep 27 13:55 /etc/openldap/ldap.conf
> >> -rw------- 1 root root   10 Sep 27 13:56 /etc/openldap/ldap.secret
> >>
> >>
> >> My ldap.conf
> >> # grep '^[^#]' /etc/ldap.conf
> >> base dc=xxxxxxx,dc=xxx
> >> uri ldap://ldap-1.xxxxxxx.xxx
> >> binddn cn=foo,ou=bar,dc=xxxxxxx,dc=xxx
> >> bindpw xxxxxxxx
> >> rootbinddn cn=foo,ou=bar,dc=xxxxxxx,dc=xxx
> >> scope sub
> >> timelimit 30
> >> bind_timelimit 30
> >> bind_policy soft
> >> idle_timelimit 3600
> >> pam_check_host_attr yes
> >> nss_base_passwd dc=xxxxxxx,dc=net?sub
> >> nss_base_shadow dc=xxxxxxx,dc=net?sub
> >> pam_password clear
> >> nss_base_group          ou=Group,dc=xxxxxxx,dc=xxx?one
> >> TLS_REQCERT request
> >> TLS_CACERT /usr/local/etc/openldap/certs/cacert.pem
> >>
> >> The two previous servers did not have this particular problem.  The
> >> hardware was not identical, but the OS install and config were.
> >>
> >> Any clues?
> > ---
> > I don't generally use nscd any longer, but since its cache is rebuilt
> > dynamically, why not just stop nscd, delete the db files, and then
> > restart the nscd service? It is certain to recreate them (or perhaps
> > move them out of the way to be safe)...
> > 
> > /sbin/service nscd stop
> > mv /var/db/nscd/* /tmp
> > /sbin/service nscd start
> > 

> 
> I tried deleting the db files on one of the boxes after seeing this on the web, but nscd
> segfaulted less than half an hour later. This problem seems to happen only
> on x86_64 boxes. Another box here is x86_32 and has no issues with nscd.
> 
> I would like to drop this service but there are critical apps that require
> it since authentication comes through openldap. It does not seem to be hardware specific
> since the two x86_64 boxes have different mobo, one abit and one asus.
> 
> Logging is turned on for nscd but nothing in the logs looks unusual, and it has been
> difficult to find which pid precedes the segfault.
> 
> Can malformed addresses cause nscd to segfault?
----
I don't know the answer to that, but it would seem that if that were the
case, the problem would also exist with the i386 version.
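
For what it's worth, the "error N" field in those kernel lines can be
decoded, which at least tells you what kind of access faulted. This is
only a rough sketch: it assumes the field is the x86 page-fault error
code printed in hex, and that the log line ends with the code as in the
lines quoted above. The function names are made up for illustration.

```shell
# Decode an x86 page-fault error code (numeric argument).
# Bit 0 (1):  0 = page not present, 1 = protection violation
# Bit 1 (2):  0 = read, 1 = write
# Bit 2 (4):  0 = kernel mode, 1 = user mode
# Bit 4 (16): 1 = instruction fetch
decode_fault() {
    code=$1
    if [ $(( code & 16 )) -ne 0 ]; then
        access="instruction fetch"
    elif [ $(( code & 2 )) -ne 0 ]; then
        access="write"
    else
        access="read"
    fi
    if [ $(( code & 1 )) -ne 0 ]; then
        cause="protection violation"
    else
        cause="page not present"
    fi
    if [ $(( code & 4 )) -ne 0 ]; then
        mode="user"
    else
        mode="kernel"
    fi
    echo "$mode-mode $access, $cause"
}

# Extract the trailing hex error code from a kernel segfault line and decode it.
# Returns non-zero if the line does not look like "... segfault at ... error N".
decode_segfault_line() {
    hex=$(printf '%s\n' "$1" |
        sed -n 's/.*segfault at .* error \([0-9a-fA-F][0-9a-fA-F]*\) *$/\1/p')
    [ -n "$hex" ] || return 1
    decode_fault $(( 0x$hex ))
}
```

With this, "decode_fault 4" prints "user-mode read, page not present"
and "decode_fault 6" prints "user-mode write, page not present", so both
reports are user-space accesses to unmapped addresses.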

I suppose you will have to attach strace to the nscd pid (strace -p <pid>)
and then create a bugzilla entry with the strace output attached, probably
with the upstream provider.
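
Something along these lines might do for capturing the trace. Treat it
as a sketch: the helper name and the /tmp/nscd.strace output path are
just placeholders, it assumes pidof and strace are installed, and it
relies on the strace options -f (follow threads/children), -tt
(timestamps), -o (output file) and -p (attach to pid).

```shell
# Build an strace command line that attaches to the given pids.
# First argument: output file; remaining arguments: one or more pids.
# Prints the command instead of running it so it can be reviewed first.
build_strace_cmd() {
    out=$1
    shift
    [ $# -gt 0 ] || return 1      # nothing to attach to
    cmd="strace -f -tt -o $out"
    for pid in "$@"; do
        cmd="$cmd -p $pid"
    done
    echo "$cmd"
}

# Example (as root), attaching to whatever nscd is running right now:
#   eval "$(build_strace_cmd /tmp/nscd.strace $(pidof nscd))"
```

Leave it attached until the next segfault shows up in
/var/log/messages; the tail of the output file is the part worth
attaching to the bug report.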

As for 'critical apps that require' nscd...I don't personally know of
any, and if we are talking about CentOS-5, which ships openldap
2.3.27...the 2.3.x versions are very fast and I'm not certain that
nscd is of all that much benefit (but I don't know because I have never
tested it out).
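
If someone wanted to test it, a crude comparison could be to time a
batch of lookups with nscd stopped and again with it running. This is an
untested idea rather than a benchmark: the helper name and the user name
"someuser" are placeholders, and the timing is only good to whole
seconds.

```shell
# Run a command COUNT times and print the elapsed time in whole seconds.
# First argument: COUNT; remaining arguments: the command to run.
time_lookups() {
    count=$1
    shift
    start=$(date +%s)
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" > /dev/null 2>&1
        i=$(( i + 1 ))
    done
    end=$(date +%s)
    echo $(( end - start ))
}

# Example (as root; "someuser" is a placeholder for a real LDAP account):
#   /sbin/service nscd stop;  time_lookups 1000 getent passwd someuser
#   /sbin/service nscd start; time_lookups 1000 getent passwd someuser
```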

-- 
Craig White <craig at tobyhouse.com>