Recap of config (there's a "New information" section below that covers new data...)

---

Current config:

CentOS 5, i386, running BIND 9.3.6
*** (We updated everything to the most recent versions when this was initially posted, mid April, and it made no difference in the symptoms.)

Hardware: P4, 2.8 GHz, 1 GB memory, SATA drives (non-mirrored). Load is light, usually under 0.1.

This box is running Postfix as our mail server, alongside BIND 9.3.6 (latest).

--

Problem:

Postfix is doing RBL lookups on zen.spamhaus.org. Everything goes along groovy, but then lookups start failing. Early in the process, we get log entries like this (a "successful" lookup, and then a failure):

---
Apr 14 14:25:05 mail postfix/smtpd[22281]: NOQUEUE: reject: RCPT from bzq-79-183-5-119.red.bezeqint.net[79.183.5.119]: 554 5.7.1 Service unavailable; Client host [79.183.5.119] blocked using zen.spamhaus.org; from=xxx to=yyy proto=SMTP helo=<bzq-79-183-5-119.red.bezeqint.net>
Apr 14 14:25:07 mail postfix/smtpd[22804]: warning: 33.229.242.205.zen.spamhaus.org: RBL lookup error: Host or domain name not found. Name service error for name=33.229.242.205.zen.spamhaus.org type=A: Host not found, try again
---

As you can see, one lookup succeeded and then, right after, one failed, claiming it got no answer from BIND. Others after this SUCCEED, so it's not in 100% failure mode yet. Over time, all (or almost all) of the zen queries fail; after around 4 hours, nearly every query to zen is failing. A BIND restart fixes the problem. [Hmmm...]

---

First, someone's going to ask whether Zen is blocking us. I don't think so. Here's why:

-We're non-commercial, using the definition set by Spamhaus,
-mail connects TOTAL are well under 100K a day (under 10K in actuality),
-and thus having more than 300K queries is pretty unlikely.
-Also, let me remind you that a restart of the BIND service makes the failures go away for a while; if zen were blocking our queries, I'd think a restart wouldn't make a difference.

[Also, from the updates below, we can run an alternate distro as a dedicated DNS box, and it queries zen just fine. So, we're NOT being rate limited.]

---

I certainly suspect a problem with BIND, but I can't find it, and I have no idea where to look any more. If BIND were having a problem, say allocating memory, shouldn't it show up in a debug level 5 log?

=====

New information:

Tried running a separate DNS box on Fedora 12, again with all the current patches, and pointed DNS on the Postfix box at that stand-alone box. The exact same symptoms occur on the Fedora 12 box.

---

Next, tried an Ubuntu box, also running the latest patches, and pointed the Postfix box there. Problem solved, or at least mostly so. [We still get around a 2% failure rate (timeouts), but it is always quite low and stays at a constant level.]

So, as was suggested in this thread, it appears to be a RH-specific implementation bug. I have a WAG that it might be related to UDP fragmentation on DNSSEC packets, but I have no idea if that's realistic or not. [Part of why I lean this way is that this isn't widely reported as a problem, so I'd assume it's a combination-of-effects bug, perhaps related to how our firewall passes fragmented UDP replies.]

I obviously have more testing to do, but I welcome any comments...

TIA
-Greg
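
---

For anyone wanting to track the failure rate outside of Postfix, here is a minimal sketch of how the query names in the log above are formed (the client IP with its octets reversed, prepended to the zone) and how a temporary resolver failure could be told apart from a plain "not listed" answer. The function names and result labels are made up for illustration, and it assumes the lookup goes through the system resolver, i.e. the same BIND instance the box points at; it is not what Postfix does internally.

```python
import socket

ZEN_ZONE = "zen.spamhaus.org"


def dnsbl_query_name(ip, zone=ZEN_ZONE):
    """Build the DNSBL query name: reversed IPv4 octets + zone."""
    octets = ip.split(".")
    if len(octets) != 4 or not all(o.isdigit() and int(o) <= 255 for o in octets):
        raise ValueError("not a dotted-quad IPv4 address: %r" % ip)
    return ".".join(reversed(octets)) + "." + zone


def classify_lookup(ip, zone=ZEN_ZONE):
    """Resolve the DNSBL name via the system resolver and label the outcome.

    Labels (hypothetical, for this sketch only):
      "listed"     - an A record came back
      "not listed" - the name does not exist
      "tempfail"   - temporary failure, the analogue of "try again"
      "error"      - anything else
    """
    name = dnsbl_query_name(ip, zone)
    not_found = {socket.EAI_NONAME, getattr(socket, "EAI_NODATA", socket.EAI_NONAME)}
    try:
        socket.getaddrinfo(name, None, socket.AF_INET)
        return "listed"
    except socket.gaierror as exc:
        if exc.errno == socket.EAI_AGAIN:
            return "tempfail"
        if exc.errno in not_found:
            return "not listed"
        return "error"
```

Calling classify_lookup("205.242.229.33") queries the same name that failed in the second log line. Running it over a list of recent client IPs every few minutes and counting the "tempfail" results would show whether the failure rate really climbs over the roughly 4 hours after a BIND restart.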