[CentOS] CentOS based router dropping connections

Tue Jul 24 21:27:15 UTC 2007
Jesse Cantara <jesse_cantara at esupport.com>

To reply to myself: I'm pulling my hair out over this one. Here's some 
more information:

I've reduced the problem to simply downloading files from the server at 
the hosting facility. No iptables, no port forwarding, just downloading 
a file through apache directly from the server. I was still getting 
errors even doing that from the Dell 860 server, which (along with all 
the other things I tested and read about) made me think the problem was 
that server (well, the driver on the server).

So yesterday, I built up a simple/cheap replacement server to stand in 
while I fix this one, went to the hosting facility, pulled out the 
"problem" server, and brought it back to the office. Everything seemed 
to work fine with the replacement server, confirming my suspicions that 
it was the tg3 driver... but only for a couple of hours. Now I'm right 
back to square one, dropping connections! The replacement server is 
having the exact same problems! Argh!

The problem only seems to exhibit itself when the server is "busy" 
(which is most of the time, so it's hard to diagnose). Right after I'd 
replaced the "problem" server, the site stayed non-busy for a few hours, 
and everything seemed to work just fine. Just FYI, it's a 10 Mbit drop 
from the hosting facility, and during the daytime we're at around 100% 
use from about 10AM to 8PM.
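
As a rough check on the "around 100% use" figure, here is a minimal 
sketch in Python of the arithmetic involved. It assumes two readings of 
an interface byte counter (for example the byte totals that ifconfig 
prints) taken a known number of seconds apart. The function name and the 
sample numbers are made up for illustration, and counter wraparound is 
not handled.

```python
def utilization_percent(bytes_before, bytes_after, seconds, link_mbit=10.0):
    """Percentage of the link used between two byte-counter readings.

    bytes_before / bytes_after: counter values at the start and end of
    the interval. link_mbit: nominal link speed in megabits per second
    (10 for the drop described above).
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    bits_sent = (bytes_after - bytes_before) * 8
    link_bits_per_second = link_mbit * 1e6
    return 100.0 * bits_sent / (seconds * link_bits_per_second)


if __name__ == "__main__":
    # Hypothetical readings taken 60 seconds apart.
    print(utilization_percent(1000000, 71000000, 60.0))
```

A value that sits near 100 for hours suggests the link itself is 
saturated, in which case some stalled or dropped transfers would be 
expected whatever the driver is doing.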

So basically, from all of the evidence at this point, the problem is 
either:

1. The default network configuration in CentOS isn't right for what I'm 
   doing (it can't handle the traffic or the number of connections). I 
   get a decent amount of traffic, maxing out a 10 Mbit connection all 
   day long. I don't know exactly where to look to diagnose whether this 
   is the case, though. Can anybody point me to where I can find things 
   like the system's network usage (memory, buffers, number of 
   connections, etc.)? The things I know to check look normal, but 
   that's basically just ifconfig and the standard /var/log/messages 
   and dmesg logs.
2. The network drop from the hosting facility is "bad" somehow: either 
   the cable physically, or the way in which they are limiting me to 
   10 Mbit.
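
For reference, on Linux the kernel exposes per-interface counters in 
/proc/net/dev and socket totals in /proc/net/sockstat. The following is 
a minimal sketch in Python, not a definitive diagnostic tool, that 
parses both files. The function names are hypothetical, and the field 
order is assumed to follow the standard /proc/net/dev header (eight 
receive counters followed by eight transmit counters).

```python
RX_FIELDS = ["rx_bytes", "rx_packets", "rx_errs", "rx_drop",
             "rx_fifo", "rx_frame", "rx_compressed", "rx_multicast"]
TX_FIELDS = ["tx_bytes", "tx_packets", "tx_errs", "tx_drop",
             "tx_fifo", "tx_colls", "tx_carrier", "tx_compressed"]


def parse_net_dev(text):
    """Return {interface: {counter_name: int}} from /proc/net/dev text."""
    stats = {}
    # The first two lines are column headers.
    for line in text.splitlines()[2:]:
        if ":" not in line:
            continue
        # Older kernels may print "eth0:12345" with no space after the colon.
        iface, counters = line.split(":", 1)
        values = [int(v) for v in counters.split()]
        stats[iface.strip()] = dict(zip(RX_FIELDS + TX_FIELDS, values))
    return stats


def parse_sockstat(text):
    """Return {protocol: {key: int}} from /proc/net/sockstat text."""
    result = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        proto, rest = line.split(":", 1)
        tokens = rest.split()
        # Tokens alternate between a key and its integer value.
        result[proto.strip()] = {
            key: int(value) for key, value in zip(tokens[0::2], tokens[1::2])
        }
    return result


if __name__ == "__main__":
    try:
        with open("/proc/net/dev") as f:
            for iface, counters in parse_net_dev(f.read()).items():
                print(iface, "rx errs/drop:", counters["rx_errs"],
                      counters["rx_drop"], "tx errs/drop:",
                      counters["tx_errs"], counters["tx_drop"])
        with open("/proc/net/sockstat") as f:
            print(parse_sockstat(f.read()))
    except OSError as exc:
        print("could not read /proc/net files:", exc)
```

Error or drop counters that climb while the server is busy would point 
toward the NIC, driver, or cable. Counters that stay flat while 
connections still fail would point more toward the upstream link or its 
rate limiting.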

Any ideas?

Thanks for all your help, and any help in advance,

Jesse Cantara wrote:
> Actually, I spoke too soon.
> Setting the NIC to 100 Mbit did not fix the issue, I just happened to 
> misdiagnose a fix, because it seemed to be working for quite some time, 
> but it is back to the old problems.
> Basically, I'm at my wits' end right now. I'm going to go down to the 
> colocation and see if they can test the network drop into our cabinet. 
> If it's not that, then I'm convinced it's the tg3 driver.
> -Jesse
> Jesse Cantara wrote:
>> The problem ended up being the "tg3" Broadcom NIC kernel module 
>> driver. It doesn't work properly at Gigabit speeds. Turning it down to 
>> 100 Megabit fixed the issue. Does anybody know where I should report 
>> this bug?
>> Thanks for all your help,
>> -Jesse
>> William L. Maltby wrote:
>>> On Fri, 2007-07-20 at 12:29 -0400, Jesse Cantara wrote:
>>>> Hi Bob,
>>>> <snip>
>>>> The issue I'm having is that external traffic is being forwarded 
>>>> properly, BUT that it drops the connection occasionally. It's not 
>>>> consistent (maybe 2 out of 5 downloads from the internet through the 
>>>> router to the webserver will drop), and the connections are being 
>>>> made, so it's not a fundamental configuration issue. It's something 
>>>> more sneaky. I'm thinking that there's something in the kernel or 
>>>> network driver that isn't functioning properly, or maybe a buffer 
>>>> that is becoming full and abandoning the connection?
>>>> <snip>
>>>> -Jesse
>>>> Bob Chiodini wrote:
>>>>> Jesse Cantara wrote:
>>>>>> Hello,
>>>>>> I am trying to figure out a problem I'm having using CentOS on a 
>>>>>> machine as a router. The short story is: any traffic routed 
>>>>>> through the router seems to get disconnected at random occasionally.
>>>>>> <snip>
>>> Someone posted a thread about a similar complaint to the lists
>>> recently. IIRC, the [SOLVED] post mentioned a problem with the MTU being
>>> smaller than some of the packets received at one point, causing
>>> fragmentation, and the next step not being able to reassemble the packet
>>> because of a certain flag being set.
>>> I don't remember which bit the flag was and know little about this, but I
>>> remember the general gist.
>>> Maybe your problem is similar?
>>> HTH
>>> -- 
>>> Bill
>>> _______________________________________________
>>> CentOS mailing list
>>> CentOS at centos.org
>>> http://lists.centos.org/mailman/listinfo/centos