Hello,
I am trying to figure out a problem I'm having using CentOS on a machine as a router. The short story is: traffic routed through the router gets disconnected at random from time to time.
The hardware setup is: I have two switches with the router sitting between them, and the webserver is on the LAN switch. The machine I'm using for the router is a Dell 860 1U rackmount with two NICs, one on the internet and one on the LAN.
The routing setup is: I'm using iptables for routing, with the following command:
iptables -t nat -A PREROUTING -p tcp -m tcp -i eth1 --dport 6680 -j DNAT --to 192.168.1.10:80
Basically, I'm forwarding port 6680 on to the webserver (.10) on the LAN.
What I have tested so far: If I'm at the router, I can download files from the webserver just fine, so the webserver setup and physical connection are OK. If I'm at the router, I can download files from the internet just fine, so the physical connection to the outside is OK as well. If I'm on the outside of the router (on the internet), I can download files directly from the router just fine.
The issue is when I try to download a file from the webserver via the router (port 6680). It works sometimes, but other times it disconnects me at random points during the download.
Watching the traffic on a packet-sniffer shows that right before the download fails, my client computer trying to download the file keeps resending "ACK" messages, the router keeps sending the next sequence of packets, and eventually the router sends a bunch of "RST" packets.
There aren't any strange messages in /var/log/messages or dmesg on either the router or the webserver.
I need some help diagnosing this problem. Here's some info about the router:
CentOS 5, latest kernel 2.6.18-8.1.8.el5
iptables v1.3.5
I've tried testing as much as I can before asking for help, but I'm at the end of what I know to try. Any leads as to where to look to diagnose, or what might cause this would help.
Thanks in advance, -Jesse
CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos
Jesse,
What IP address are you using when you try to access the webserver (via port 6680) from the router, the public or the private?
If I read the iptables man page correctly, I would not expect the router to mangle locally generated packets in the PREROUTING chain, since those packets are not "really" arriving at the eth1 interface. Maybe the real question is why some packets are getting through at all. What happens if you try to access the webserver from a machine on the LAN, but using the public IP address and port 6680?
Why not use port 80 and the private IP when accessing the webserver from the router and anywhere else on the LAN, and address the webserver via 6680 when coming in from the internet? If I read your test scenarios correctly, both of those conditions work correctly, and I assume that is your intent.
Bob...
Hi Bob,
When I was on the router testing from there, the IP I was using was the private IP.
That's not a big concern of mine, though; I'm aware that locally generated traffic won't be "forwarded" correctly.
The issue I'm having is that external traffic is being forwarded properly, BUT that it drops the connection occasionally. It's not consistent (maybe 2 out of 5 downloads from the internet through the router to the webserver will drop), and the connections are being made, so it's not a fundamental configuration issue. It's something more sneaky. I'm thinking that there's something in the kernel or network driver that isn't functioning properly, or maybe a buffer that is becoming full and abandoning the connection?
The part I added about connecting to the webserver from the router was just to prove that I had tested that the connection at least physically works like that, when taking the router out of the equation.
-Jesse
On Fri, 2007-07-20 at 12:29 -0400, Jesse Cantara wrote:
Hi Bob,
<snip>
Someone recently posted a thread about a similar complaint to the lists. IIRC, the [SOLVED] post mentioned a problem with the MTU being smaller than some of the packets received at one point, causing fragmentation, and the next hop not being able to reassemble the packet because of a certain flag being set.
I don't remember which bit the flag was and know little about this, but I remember the general gist.
Maybe your problem is similar?
HTH -- Bill
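For anyone wanting to test the MTU theory Bill describes, here is a minimal sketch. It assumes the Linux iputils ping, whose -M do option sets the Don't Fragment bit; the helper names icmp_payload_for_mtu and probe_mtu and the example address are illustrative only and do not come from the thread.

```shell
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header (no options) minus 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Send a few pings with the Don't Fragment bit set (iputils "ping -M do").
# If some hop has a smaller MTU, the ping fails instead of being fragmented.
probe_mtu() {
    host=$1
    mtu=${2:-1500}
    ping -c 3 -M do -s "$(icmp_payload_for_mtu "$mtu")" "$host"
}

# Example (hypothetical target, standard Ethernet MTU):
#   probe_mtu 192.168.1.10 1500
```

If a smaller MTU does turn up somewhere on the path, one commonly used workaround on a Linux router is to clamp the TCP MSS with the TCPMSS target (iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu). Whether that applies to this setup is not established in the thread.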
The problem ended up being the "tg3" Broadcom NIC kernel module driver. It doesn't work properly at Gigabit speeds. Turning it down to 100 Megabit fixed the issue. Does anybody know where I should report this bug?
Thanks for all your help, -Jesse
Actually, I spoke too soon.
Setting the NIC to 100 Mbit did not fix the issue. I misdiagnosed it as a fix because things seemed to be working for quite some time, but it is back to the old problems.
Basically, I'm at my wits' end right now. I'm going to go down to the colocation facility and see if they can test the network drop into our cabinet. If it's not that, then I'm convinced it's the tg3 driver.
-Jesse
Jesse Cantara spake the following on 7/23/2007 11:39 AM:
<snip>
I have seen some problems with the tg3 driver and some newer hardware. Does the manufacturer of the NIC/motherboard have a Linux driver available? It might fix the issue.
Did anyone suggest replacing the possibly bad NIC (or disabling it) with a known Linux-supported, properly functioning NIC?
- rh
Hi Jesse,
FWIW, I have an IBM346 server at a client running RHEL4 with the Broadcom NICs and the tg3 driver, and I have not experienced any dropped packets over the past 18 months.
ChrisG
To reply to myself: I'm pulling my hair out about this one, so here's some more information.
I've simplified the problem into just simply wanting to download files from the server at the hosting facility. No iptables, no port forwarding, just download a file through apache directly from the server. I was still getting errors even trying to do that from the Dell 860 server, which (among all the other things I tested and read about) made me think it was that server (well, the driver on the server).
So yesterday, I built up a simple/cheap replacement server to stand in while I fix this one, went to the hosting facility, pulled out the "problem" server, and brought it back to the office. Everything seemed to work fine with the replacement server, confirming my suspicions that it was the TG3 driver... but only for a couple of hours. Now I'm right back to square-one, dropping connections! The replacement server is having the exact same problems! Arg!
The problem only seems to exhibit itself when the server is "busy" (which is most of the time, so it's hard to diagnose). Right after I'd replaced the "problem" server, the site stayed non-busy for a few hours, and everything seemed to work just fine. Just FYI, it's a 10 Mbit drop from the hosting facility, and during the daytime we're at around 100% use from about 10AM to 8PM.
So basically, what I can figure from all of the evidence at this point is that the problem is one of two things.
Either: the default network configuration in CentOS isn't right for what I'm doing (it can't handle the traffic or the number of connections). I get a decent amount of traffic, maxing out a 10 Mbit connection all day long. I don't know exactly where to check to diagnose whether this is the case, though. Can anybody point me to where to find things like the system's network usage (memory, any buffers, number of connections, etc.)? The things I know to check look normal, but that's basically just ifconfig and the standard /var/log/messages and dmesg logs.
Or: the network drop from the hosting facility is "bad" somehow, either the cable physically or the way in which they are limiting me to 10 Mbit.
Any ideas?
Thanks for all your help, and any help in advance, -Jesse
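On the question of where to look for connection counts and buffer usage, the following is a rough sketch using standard read-only tools (netstat and /proc/net/sockstat). The function name tcp_state_summary is made up for illustration, and the sketch assumes the usual "netstat -ant" column layout with the TCP state in the sixth column.

```shell
# Summarize TCP connections by state from "netstat -ant" output on stdin.
# Prints "<count> <STATE>" lines, highest count first.
tcp_state_summary() {
    awk '$1 ~ /^tcp/ { states[$6]++ }
         END { for (s in states) print states[s], s }' | sort -rn
}

# Typical use on the router or the webserver:
#   netstat -ant | tcp_state_summary
#   netstat -s                 # per-protocol counters (retransmits, resets, ...)
#   cat /proc/net/sockstat     # sockets in use and TCP memory pages
```

Unusually large counts in states such as SYN_RECV or TIME_WAIT during the busy hours would point toward connection load, while clean counters would shift suspicion back to the link itself.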
Check with the facility to see if that drop is 10 Mbit HALF duplex, and if so, make sure your server's NIC is configured the same way.
I had a problem like this in a colo many years ago, with a much older Linux version.
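To see what the NIC actually negotiated, ethtool can be asked for the current link settings. The sketch below condenses its output to one line; link_summary is a hypothetical helper name, and the parsing assumes the usual "Speed:", "Duplex:" and "Auto-negotiation:" lines that ethtool prints.

```shell
# Extract speed, duplex and auto-negotiation state from "ethtool <iface>"
# output on stdin and print them on a single line.
link_summary() {
    awk -F': *' '
        /^[ \t]*Speed:/            { speed = $2 }
        /^[ \t]*Duplex:/           { duplex = $2 }
        /^[ \t]*Auto-negotiation:/ { autoneg = $2 }
        END { printf "speed=%s duplex=%s autoneg=%s\n", speed, duplex, autoneg }'
}

# Use as root; eth1 is the internet-facing NIC in the original post:
#   ethtool eth1 | link_summary
```

If the result differs from what the facility says its port is set to, that mismatch is worth fixing before looking any further at drivers.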
On 7/24/07, John R Pierce pierce@hogranch.com wrote:
<snip>
While not the exact same issue, I had a similar problem between two switches: one was a Cisco 4006 and the other was a 3Com 3300. They were linked through a media converter that ran 10 Mb over fiber, and for some reason the 3Com would not negotiate properly with the media converter it was plugged into. It kept jumping between full and half duplex, and sometimes it would try to go to 100 Mb. As soon as I turned off auto-negotiation and set it to 10 Mb full duplex, all the dropped packets disappeared. The conditions were similar: everything was fine under low load, but as soon as the link was running close to its maximum it would drop packets repeatedly, and the link would seem to fail until the load dropped off (because people thought it was down). Then it would be stable again until the traffic went back up.
John's suggestion above looks like a solid one. If the 'problem' server is behaving fine in your office, I would really look at this as a probable solution.
Rob
(PS: Hopefully it clears it up. In our case the problem had been happening for over a year, and the connection fed an elementary school. I found out about the problem about a month into working at the place and had it fixed within a day or two. The previous outsourced IT department could never track it down because they were never there when it happened. They would come in after school was out, it would work fine for them without the high packet load, and they would just claim it was user error.)
Is it dropping packets or Ethernet frames?
Iptables may be dropping packets. Check
cat /proc/net/ip_conntrack | wc -l
to see how many connections iptables is keeping track of. The default maximum varies heavily depending on how much memory you have: with 128 MB of RAM you get 8192 possible entries, and with 256 MB of RAM you get 16376 entries. You can read and set the limit through /proc/sys/net/ipv4/ip_conntrack_max.
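As a small sketch of that check, the helpers below compare the number of tracked connections with the configured maximum. The function names are invented for illustration; the /proc paths are the ones named above and apply to kernels of this era (2.6.18).

```shell
# Percentage of the connection-tracking table in use.
# Arguments: current entry count, configured maximum.
conntrack_pct() {
    echo $(( $1 * 100 / $2 ))
}

# Read the live values from /proc and print a one-line report.
conntrack_report() {
    count=$(wc -l < /proc/net/ip_conntrack)
    max=$(cat /proc/sys/net/ipv4/ip_conntrack_max)
    echo "conntrack: $count of $max entries ($(conntrack_pct "$count" "$max")%)"
}

# Run as root on the router during the busy period:
#   conntrack_report
```

A table that is close to 100% full during the busy hours would be a strong hint. A full table is normally also reported in the kernel log as "ip_conntrack: table full, dropping packet", so a clean dmesg makes this cause less likely, though not impossible.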
Check to see if there are any errors on the Ethernet device (ifconfig)
RX packets:29024644 errors:0 dropped:0 overruns:0 frame:0
TX packets:28064715 errors:0 dropped:0 overruns:0 carrier:0
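To avoid eyeballing those counters, a minimal filter along these lines prints only the non-zero ones. The name nonzero_if_errors is hypothetical, and the parsing assumes the "key:value" layout of the ifconfig lines shown above.

```shell
# Print every non-zero error counter found on ifconfig-style
# "RX packets"/"TX packets" lines read from stdin.
# Prints nothing when all counters are zero.
nonzero_if_errors() {
    awk '/(RX|TX) packets/ {
        dir = $1
        # Field 2 is the packet count itself; the error counters start at field 3.
        for (i = 3; i <= NF; i++) {
            split($i, kv, ":")
            if (kv[2] + 0 > 0) print dir, kv[1] "=" kv[2]
        }
    }'
}

# Example: ifconfig eth1 | nonzero_if_errors
```

Growing frame or carrier counts under load are a typical sign of a duplex mismatch.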
Craig
From: centos-bounces@centos.org On Behalf Of Rob Lines
Sent: Thursday, July 26, 2007 8:36 AM
To: CentOS mailing list
Subject: Re: [CentOS] CentOS based router dropping connections
<snip>
Hmmmmm, sounds like a possible duplex issue.
I try to add this at the bottom of my
/etc/sysconfig/network-scripts/ifcfg-eth0 file:
ETHTOOL_OPTS="speed 100 duplex full autoneg off"
That way the interface always comes up with the settings we want.
You can modify the values as you need, or run ethtool by hand; see:
man ethtool
- rh
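The same settings can also be applied to a running interface with ethtool -s, without waiting for a reboot. The wrapper below is only a sketch: force_link and the DRY_RUN variable are made-up names, and the 10/half values in the example are an assumption about what the facility's port might be set to, not something confirmed in the thread.

```shell
# Force a fixed speed and duplex with auto-negotiation off.
# With DRY_RUN=1 the ethtool command is only printed, not executed.
force_link() {
    iface=$1 speed=$2 duplex=$3
    cmd="ethtool -s $iface speed $speed duplex $duplex autoneg off"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$cmd"
    else
        $cmd    # needs root; may briefly drop the link
    fi
}

# Preview first, then apply (hypothetical values):
#   DRY_RUN=1 force_link eth1 10 half
#   force_link eth1 10 half
```

Both ends of the link need to agree: forcing one side while the other still auto-negotiates usually leaves the auto-negotiating side at half duplex. Doing this over a remote session carries the risk of cutting off your own access.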