[CentOS] DJBDNS: very weird dnscache issue

Tue Jan 13 15:53:28 UTC 2015
Boris Epstein <borepstein at gmail.com>

Hello all,

We have put a DNS server online running DJBDNS v1.06
(ndjbdns-1.06-1.el6.x86_64) on a 64-bit CentOS 6.6 server. We did some
limited testing on the machine beforehand, which it passed - i.e., dnscache
was talking to tinydns, the queries went through fine, etc.

As soon as we put it online and subjected it to live load, the following
happened:

1) Within a short time (about a minute) the dnscache process reached 100%
CPU utilisation.

2) The process would then die, writing the following message to the log:

dnscache: BUG: out of in progress slots

NOTE: Random sampling indicates that at no sampled point did the load
exceed 200 requests per second. In tests conducted earlier the DNS server
successfully handled tens of thousands of requests per second.

We then proceeded to edit the following parameters in dnscache.conf, as
they were the only ones that seemed relevant: DATALIMIT and CACHESIZE.
They are described as limits (in bytes) on the total data memory
allocation and on the cache size; the default values are 80000000 and
50000000 respectively.
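
For anyone wanting to check what a given dnscache.conf actually sets, here
is a minimal Python sketch that reads the two limits and prints the
difference between them. It assumes the file uses plain KEY=VALUE lines
with "#" comments; that layout, the parse_limits name, and the sample text
are assumptions for illustration and not taken from the ndjbdns
documentation.

```python
# Hypothetical helper: assumes dnscache.conf holds plain KEY=VALUE lines.
def parse_limits(text):
    """Return the integer settings found in KEY=VALUE lines."""
    limits = {}
    for line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip()
        if value.isdigit():
            limits[key] = int(value)
    return limits


# Sample text using the default values quoted in this message.
sample = """
# limits in bytes
DATALIMIT=80000000
CACHESIZE=50000000
"""

limits = parse_limits(sample)
# Memory left over for everything other than the cache.
headroom = limits["DATALIMIT"] - limits["CACHESIZE"]
print(limits, headroom)
```

To run it against a real file, replace the sample text with the contents
of the dnscache.conf in use.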

Playing with these demonstrated some highly counterintuitive results:

1) Setting the values lower (say, an order of magnitude lower) made the
dnscache process run longer.

2) Narrowing the relative gap between the two values (for instance,
setting DATALIMIT to 52000 and CACHESIZE to 50000) made it run for about
an hour vs about 1 minute, under what seemed to be about the same load.

3) Running it with DATALIMIT not set was possible, though it eventually
failed anyway.

4) Running it with CACHESIZE not set was not possible at all.
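
To make "relative gap" in item 2 concrete, the short Python sketch below
compares the default pair of values with the 52000/50000 pair. The
relative_gap function is only an illustrative definition (headroom divided
by cache size), not anything dnscache itself computes.

```python
def relative_gap(datalimit, cachesize):
    """Headroom above the cache, as a fraction of the cache size."""
    return (datalimit - cachesize) / cachesize


# Defaults quoted above: 80000000 and 50000000 bytes.
default_gap = relative_gap(80000000, 50000000)

# The narrowed setting from item 2: 52000 and 50000 bytes.
narrow_gap = relative_gap(52000, 50000)

print(f"default: {default_gap:.0%}, narrowed: {narrow_gap:.0%}")
```

With the defaults the data limit sits 60% above the cache size; with the
narrowed setting it sits only 4% above it.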

So the issue is still unresolved and we are stuck.

Any advice will be much appreciated.

Cheers,

Boris.