[CentOS] Apparent BIND problem doing RBL lookups for Postfix - PartII

Mon May 17 18:10:17 UTC 2010
listserv.traffic at sloop.net <listserv.traffic at sloop.net>

Recap of config (There's a "New" section below that covers new
data...)

---
Current config:

CentOS 5, running BIND 9.3.6

*** (We updated everything to most recent versions when this was
initially posted, mid April, and it made no difference in the
symptoms.)

i386

 Hardware:
 P4, 2.8 GHz, 1 GB memory
 SATA drives, not mirrored.

 Load is light, usually under 0.1

 --
 This box is running Postfix as our mail server, along with BIND 9.3.6 [latest].

 --
 Problem:
 Postfix is doing RBL lookups on zen.spamhaus.org.
 Everything goes along groovy - but then lookups start failing.

  Early in the process, we get stuff like this: [We have a "successful"
 lookup, and then a failure...]
 ---
 Apr 14 14:25:05 mail postfix/smtpd[22281]: NOQUEUE: reject: RCPT
 from bzq-79-183-5-119.red.bezeqint.net[79.183.5.119]: 554 5.7.1
 Service unavailable; Client host [79.183.5.119] blocked using
 zen.spamhaus.org; from=xxx
 to=yyy proto=SMTP
 helo=<bzq-79-183-5-119.red.bezeqint.net>

 Apr 14 14:25:07 mail postfix/smtpd[22804]: warning:
 33.229.242.205.zen.spamhaus.org: RBL lookup error: Host or domain
 name not found. Name service error for
 name=33.229.242.205.zen.spamhaus.org type=A: Host not found, try again
 ---
 As you can see, we had a lookup succeed and then just right after,
 one fail - claiming it got no answer from BIND. I get others after
 this that SUCCEED - so it's not in 100% failure mode yet.
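
 For readers unfamiliar with how the query name in that warning is formed:
 a DNSBL lookup reverses the octets of the client IP and appends the list
 zone. A minimal Python sketch follows; the helper name dnsbl_query_name is
 made up for illustration and is not anything Postfix exposes.

```python
import ipaddress


def dnsbl_query_name(client_ip: str, zone: str = "zen.spamhaus.org") -> str:
    """Build the DNSBL query name for an IPv4 client: reversed octets + zone.

    Hypothetical helper for illustration only.
    """
    # Validate the input; raises ValueError if it is not an IPv4 address.
    addr = ipaddress.IPv4Address(client_ip)
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"
```

 So the failing lookup for 33.229.242.205.zen.spamhaus.org corresponds to a
 connection from 205.242.229.33.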

 Eventually all [or nearly all] of the zen queries fail.
 [After around 4 hours, nearly all queries to zen are failing.]
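
 One way to put numbers on that drift is to count, per hour, how many smtpd
 log lines show a zen block versus a zen lookup error. The sketch below
 assumes each log entry is on a single line (the samples above are wrapped
 by the mail client) and that the log lives somewhere like /var/log/maillog;
 the function name tally_by_hour is made up for illustration.

```python
import re
from collections import Counter

# Start of a syslog line, e.g. "Apr 14 14:25:05 mail postfix/smtpd[22281]: ..."
# Group 1 is "Mon DD HH" (used as the hour bucket); group 2 is the message.
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}):\d{2}:\d{2}\s+\S+\s+postfix/smtpd\[\d+\]:\s+(.*)$"
)


def tally_by_hour(lines, zone="zen.spamhaus.org"):
    """Return {hour: Counter(blocked=..., error=...)} for one RBL zone."""
    hours = {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not an smtpd line
        hour, msg = m.groups()
        if "RBL lookup error" in msg and zone in msg:
            kind = "error"      # resolver gave no usable answer
        elif f"blocked using {zone}" in msg:
            kind = "blocked"    # lookup worked and the client was listed
        else:
            continue
        hours.setdefault(hour, Counter())[kind] += 1
    return hours

# Example use (path is an assumption):
#   with open("/var/log/maillog") as fh:
#       for hour, counts in sorted(tally_by_hour(fh).items()):
#           print(hour, dict(counts))
```

 Note that lookups which succeed but find the client unlisted leave no log
 line, so this only compares blocks against errors; it is not a true
 failure rate.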

 A bind restart fixes the problem. [Hmmm...]
 ---

 First, someone's going to ask - perhaps Zen's blocking you. I don't
 think so. Here's why.
 -We're non-commercial, using the definition set by Spamhaus,
 -total mail connections are well under 100K a day (under 10K in actuality),
 -and thus having more than 300K queries is pretty unlikely.
 -Also, let me remind you that a restart of the bind service seems to
 make the failures go away for a while, so if zen were blocking our
 queries, I'd think that wouldn't make a difference.

 [Also, from the updates below, we can run an alternate distro as a
 dedicated DNS box, and it queries zen just fine. So, we're NOT being
 rate limited.]

 ---
 I certainly suspect a problem with BIND, but I can't find it, and
 have no idea where to go from here.
 I simply don't know where to look any more. If BIND were having a
 problem, say allocating memory, or something, shouldn't it be in a debug level 5 log?

=====
New information:

Tried running a separate DNS box on Fedora 12 - again with all the
current patches.

We then pointed the DNS server on the Postfix box at our stand-alone
Fedora 12 box.

The exact same symptoms occur on the Fedora 12 box.

---
Next, tried an Ubuntu box also running the latest patches and pointed
the Postfix box there. Problem solved - or at least mostly so. [We
still get around a 2% failure rate - timeouts - but it is always
quite low, and stays at a constant level.]

So, as was suggested in this thread, it appears to be an RH-specific
implementation bug.

I have a WAG that it might be related to UDP fragmentation on DNSSEC
packets - but I have no idea if that's realistic or not. [Part of why
I lean this way is that this isn't widely reported as a problem, so
I'd assume it's a combination-of-effects bug - perhaps related to
how our firewall passes fragmented UDP replies.]
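
A rough way to probe the fragmentation guess is to send the same query by
hand with and without an EDNS0 OPT record (which advertises a larger UDP
buffer and can set the DNSSEC "DO" bit) and see whether replies come back.
The Python sketch below builds the packet with the standard library only.
The function names and the server address are assumptions for illustration,
and a timeout here would only hint at a path problem, not prove one.

```python
import socket
import struct


def build_query(name, qtype=1, edns_size=None, do_bit=False, txid=0x1234):
    """Build a raw DNS query; append an EDNS0 OPT record if edns_size is set."""
    arcount = 1 if edns_size is not None else 0
    # Header: ID, flags (RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, arcount)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QCLASS = IN
    opt = b""
    if edns_size is not None:
        # The OPT "TTL" field holds ext-rcode, version, and flags; DO is 0x8000
        ttl = 0x00008000 if do_bit else 0
        # OPT RR: root name, TYPE=41, CLASS=UDP payload size, TTL, RDLEN=0
        opt = b"\x00" + struct.pack("!HHIH", 41, edns_size, ttl, 0)
    return header + question + opt


def probe(server, name, timeout=3.0, **kwargs):
    """Send one UDP query; return the reply size in bytes, or None on timeout."""
    packet = build_query(name, **kwargs)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(packet, (server, 53))
        try:
            reply, _ = sock.recvfrom(65535)
        except socket.timeout:
            return None
    return len(reply)

# Example use (192.0.2.53 is a placeholder resolver address):
#   probe("192.0.2.53", "org", qtype=48)                              # plain
#   probe("192.0.2.53", "org", qtype=48, edns_size=4096, do_bit=True)  # EDNS0+DO
```

If the plain query answers but the EDNS0 one with a 4096-byte buffer times
out from behind the firewall, that would at least be consistent with large
or fragmented UDP replies being dropped somewhere along the way.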

I obviously have more testing to do, but I welcome any comments...

TIA
-Greg