Hi all,
I have a box with a quad port Netxen NIC running Centos 5. All four interfaces are slaves of bond0 and bond0 is used by two vlan interfaces.
All was working fine until recently, when everything just stopped working. ethtool reports all the individual interfaces are fine. The switch is not complaining either. But I cannot ping anywhere, nor can others 'see' the box. Any ideas?
I have turned off iptables, rebooted the switches but still the thing won't work.
Christopher
Christopher Chan wrote:
And now the thing is working again...
It's not working again.
Running tcpdump -i vlan seems to trigger something to get the network working again but as soon as I stop tcpdump...nada, zip, zilch.
Any ideas? I see no errors in the logs whether of the switch or the box, just about everything reports fine. Would the loading of the kernel bridge module cause this?
Chan Chung Hang Christopher wrote:
Christopher Chan wrote:
And now the thing is working again...
It's not working again.
Running tcpdump -i vlan seems to trigger something to get the network working again but as soon as I stop tcpdump...nada, zip, zilch.
Any ideas? I see no errors in the logs whether of the switch or the box, just about everything reports fine. Would the loading of the kernel bridge module cause this?
Running tcpdump would put the interface in promiscuous mode. Does your setup need this to work?
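A minimal sketch (Python, not from the original posters) of one way to check whether an interface currently has the promiscuous bit set. It assumes the kernel exposes the interface flags as a hex string in /sys/class/net/<iface>/flags and that IFF_PROMISC is 0x100, as in Linux's <linux/if.h>; the interface name in the usage comment is only an example.

```python
IFF_PROMISC = 0x100  # IFF_PROMISC bit from <linux/if.h>


def is_promiscuous(flags_text):
    """Return True if a hex flags string such as '0x1103' has IFF_PROMISC set."""
    return bool(int(flags_text.strip(), 16) & IFF_PROMISC)


def read_flags(iface):
    """Read the raw flags for an interface.

    Assumption: sysfs provides /sys/class/net/<iface>/flags on this kernel.
    """
    with open("/sys/class/net/%s/flags" % iface) as fh:
        return fh.read()


# Usage on the affected box (interface name is an example):
#   is_promiscuous(read_flags("bond0"))
```

Checking bond0, each slave and the vlan interfaces before, during and after a tcpdump run would show whether the flag really is the thing that changes.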
Les Mikesell wrote:
Chan Chung Hang Christopher wrote:
Christopher Chan wrote:
And now the thing is working again...
It's not working again.
Running tcpdump -i vlan seems to trigger something to get the network working again but as soon as I stop tcpdump...nada, zip, zilch.
Any ideas? I see no errors in the logs whether of the switch or the box, just about everything reports fine. Would the loading of the kernel bridge module cause this?
Running tcpdump would put the interface in promiscuous mode. Does your setup need this to work?
I don't think so. The thing was working fine since December last year until this morning. Then poof! I just realized I forgot to boot older kernels to check for the same problem...
On Tuesday, July 06, 2010 09:21 PM, Chan Chung Hang Christopher wrote:
Les Mikesell wrote:
Chan Chung Hang Christopher wrote:
Christopher Chan wrote:
And now the thing is working again...
It's not working again.
Running tcpdump -i vlan seems to trigger something to get the network working again but as soon as I stop tcpdump...nada, zip, zilch.
Any ideas? I see no errors in the logs whether of the switch or the box, just about everything reports fine. Would the loading of the kernel bridge module cause this?
Running tcpdump would put the interface in promiscuous mode. Does your setup need this to work?
I don't think so. The thing was working fine since December last year until this morning. Then poof! I just realized I forgot to boot older kernels to check for the same problem...
Box behaving for the moment after tcpdump was run on one of the interfaces and then stopped. I'll just wait for the next weirdo event.
On 06/07/10 22:48, Les Mikesell wrote:
Chan Chung Hang Christopher wrote:
Christopher Chan wrote:
And now the thing is working again...
It's not working again.
Running tcpdump -i vlan seems to trigger something to get the network working again but as soon as I stop tcpdump...nada, zip, zilch.
If you have two machines on the same network with the same IP address you get behaviour like this. Had this happen once when an engineer reset a UPS and it took on the IP address of a main switch. arpwatch is your friend.
K
On Thursday, July 08, 2010 09:26 AM, Kahlil Hodgson wrote:
On 06/07/10 22:48, Les Mikesell wrote:
Chan Chung Hang Christopher wrote:
Christopher Chan wrote:
And now the thing is working again...
It's not working again.
Running tcpdump -i vlan seems to trigger something to get the network working again but as soon as I stop tcpdump...nada, zip, zilch.
If you have two machines on the same network with the same IP address you get behaviour like this. Had this happen once when an engineer reset a UPS and it took on the IP address of a main switch. arpwatch is your friend.
Unfortunately all addresses, both internal and Internet, on this box are static and assigned so there is no hope of a collision. The dhcp server does not serve any address in the same range that the box uses internally.
On 08/07/10 14:58, Christopher Chan wrote:
If you have two machines on the same network with the same IP address you get behaviour like this. Had this happen once when an engineer reset a UPS and it took on the IP address of a main switch. arpwatch is your friend.
Unfortunately all addresses, both internal and Internet, on this box are static and assigned so there is no hope of a collision. The dhcp server does not serve any address in the same range that the box uses internally.
I was referring to the case where another box (or network device) on the same network (i.e. plugged into the same switch/router/hub) has been given a static IP address the same as that used by the problem box. This could be a new server, a printer, a UPS, or any number of other network devices. It could also be a device being reset to factory settings which conflicts with the problem box.
If you have another Linux machine on the same network that is not having the same problem, try installing arpwatch. It should pick up the conflict within 30 minutes or so.
K
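To make the duplicate-address idea concrete, here is a rough Python sketch of the kind of check arpwatch performs. It is not arpwatch's actual code; it just flags any IP address that has been seen with more than one MAC address. The (ip, mac) pairs are assumed to come from some capture or from the ARP tables of neighbouring machines, and the addresses in the usage comment are made up.

```python
from collections import defaultdict


def find_ip_conflicts(observations):
    """Return {ip: sorted list of MACs} for every IP seen with more than one MAC.

    observations: iterable of (ip, mac) pairs, e.g. collected from ARP replies.
    """
    seen = defaultdict(set)
    for ip, mac in observations:
        seen[ip].add(mac.lower())  # normalise case so AA:.. and aa:.. match
    return {ip: sorted(macs) for ip, macs in seen.items() if len(macs) > 1}


# Usage with made-up addresses:
#   find_ip_conflicts([("192.0.2.10", "00:11:22:33:44:55"),
#                      ("192.0.2.10", "66:77:88:99:aa:bb")])
```

An empty result over a reasonable observation window would argue against the conflict theory.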
On Thursday, July 08, 2010 01:32 PM, Kahlil Hodgson wrote:
On 08/07/10 14:58, Christopher Chan wrote:
If you have two machines on the same network with the same IP address you get behaviour like this. Had this happen once when an engineer reset a UPS and it took on the IP address of a main switch. arpwatch is your friend.
Unfortunately all addresses, both internal and Internet, on this box are static and assigned so there is no hope of a collision. The dhcp server does not serve any address in the same range that the box uses internally.
I was referring to the case where another box (or network device) on the same network (i.e. plugged into the same switch/router/hub) has been given a static IP address the same as that used by the problem box. This could be a new server, a printer, a UPS, or any number of other network devices. It could also be a device being reset to factory settings which conflicts with the problem box.
No new boxes. Not possible for any other box to be assigned the same ip internally via dhcp and definitely not the same Internet ip. Perhaps you care to explain why BOTH vlan interfaces stopped working? The odd chance that two other boxes each took one of the other ip address?
If you have another Linux machine on the same network that is not having the same problem, try installing arpwatch. It should pick up the conflict within 30 minutes or so.
The box with the problem just so happens to be the only box using bonding, 802.1q and a four port Qlogic Netxen NIC. I think a problem between these three is more likely than some 'ghost' boxes getting assigned the same ip addresses when I am the only admin around.
On 08/07/10 15:41, Christopher Chan wrote:
No new boxes. Not possible for any other box to be assigned the same ip internally via dhcp and definitely not the same Internet ip.
Exactly. DHCP server would check for a conflict before assigning an address and is definitely not the source of the problem.
Perhaps you care to explain why BOTH vlan interfaces stopped working? The odd chance that two other boxes each took one of the other ip address?
Did not know that both had stopped. Conflicting IP addresses was just a suggestion. May not be the problem at all. With bonding, breaking one might break both down at the MAC level ...
Hmmm ... which bond mode are you using?
The box with the problem just so happens to be the only box using bonding, 802.1q and a four port Qlogic Netxen NIC. I think a problem between these three is more likely than some 'ghost' boxes getting assigned the same ip addresses when I am the only admin around.
If you are the only admin, then it's not that likely. Then again, I once had a power spike reset a wireless router on my network without me knowing. Default settings were close but not quite right, and it took me a couple of days to track down the problem :-(
If it was working, then suddenly stops, then something must have changed. I gather you have some configuration and change management system in place? Backups of conf files?
K
Did not know that both had stopped. Conflicting IP addresses was just a suggestion. May not be the problem at all. With bonding, breaking one might break both down at the MAC level ...
Hmmm ... which bond mode are you using?
Why mode 4 of course.
The box with the problem just so happens to be the only box using bonding, 802.1q and a four port Qlogic Netxen NIC. I think a problem between these three is more likely than some 'ghost' boxes getting assigned the same ip addresses when I am the only admin around.
If you are the only admin, then it's not that likely. Then again, I once had a power spike reset a wireless router on my network without me knowing. Default settings were close but not quite right, and it took me a couple of days to track down the problem :-(
Too bad there are no defaults that use the subnet assigned to the school or the 192.168.0.0/16 (no, not my idea - inherited)
If it was working, then suddenly stops, then something must have changed. I gather you have some configuration and change management system in place? Backups of conf files?
Hahaha, that was the best part. It just stopped. And stayed that way too after a reboot, reboot of switches and only started working again when I ran tcpdump for some reason.
But another colleague did find this in the iLo report:
Repaired Network 07/06/2010 12:35 07/06/2010 12:00 2 Network Adapters Redundancy Reduced (Slot 10, Port 3)
Repaired Network 07/06/2010 12:35 07/06/2010 12:00 2 Network Adapters Redundancy Reduced (Slot 10, Port 4)
Repaired Network 07/06/2010 12:35 07/06/2010 12:00 2 Network Adapters Redundancy Reduced (Slot 10, Port 1)
Repaired Network 07/06/2010 12:01 07/06/2010 12:00 1 Network Adapter Link Down (Slot 10, Port 2)
Time to ask the HP chap what this is all about.
On 07/08/2010 05:08 PM, Christopher Chan wrote:
Hmmm ... which bond mode are you using?
Why mode 4 of course.
Ouch. Never used that mode.
<snip>
mode=4 (802.3ad) IEEE 802.3ad Dynamic link aggregation. Creates aggregation groups that share the same speed and duplex settings. Utilizes all slaves in the active aggregator according to the 802.3ad specification.
Pre-requisites:
1. Ethtool support in the base drivers for retrieving the speed and duplex of each slave.
2. A switch that supports IEEE 802.3ad Dynamic link aggregation. Most switches will require some type of configuration to enable 802.3ad mode.
</snip>
So I gather the bonding on the CentOS box is cooperating with the switches in some non-trivial fashion.
Too bad there are no defaults that use the subnet assigned to the school or the 192.168.0.0/16 (no, not my idea - inherited)
That is a big network. Might make sense in a school though. How many nodes on it? Any chance a <ahem> staff member plugged an unauthorised piece of hardware in somewhere?
If it was working, then suddenly stops, then something must have changed. I gather you have some configuration and change management system in place? Backups of conf files?
Hahaha, that was the best part. It just stopped. And stayed that way too after a reboot, reboot of switches and only started working again when I ran tcpdump for some reason.
tcpdump is probably putting your interface into promiscuous mode which is triggering something. Perhaps ARP packets.
I think something (perhaps obscure) has changed, you may just not be aware of it. Comparing your event timeline against your configuration change management systems may help.
But another colleague did find this in the iLo report:
You're the only admin but you have a colleague with access to an iLo report? That puts a big question mark over a previous assertion :-)
Repaired Network 07/06/2010 12:35 07/06/2010 12:00 2 Network Adapters Redundancy Reduced (Slot 10, Port 3)
Repaired Network 07/06/2010 12:35 07/06/2010 12:00 2 Network Adapters Redundancy Reduced (Slot 10, Port 4)
Repaired Network 07/06/2010 12:35 07/06/2010 12:00 2 Network Adapters Redundancy Reduced (Slot 10, Port 1)
Repaired Network 07/06/2010 12:01 07/06/2010 12:00 1 Network Adapter Link Down (Slot 10, Port 2)
Time to ask the HP chap what this is all about.
Looks like the bonding failover process is doing what it should.
A bit more info on your setup might help.
1. What is the purpose of the box with the fat network?
2. Are all 4 interfaces being used?
3. Are they plugged into the same switch?
4. You've got at least 2 networks, plus 2 vlans, plus a public internet connection to this box?
K
On Thursday, July 08, 2010 05:09 PM, Kahlil Hodgson wrote:
On 07/08/2010 05:08 PM, Christopher Chan wrote:
Hmmm ... which bond mode are you using?
Why mode 4 of course.
Ouch. Never used that mode.
Huh? Like why? It's the recommended mode unless the switch does not support it or the boards don't.
<snip>
mode=4 (802.3ad) IEEE 802.3ad Dynamic link aggregation. Creates aggregation groups that share the same speed and duplex settings. Utilizes all slaves in the active aggregator according to the 802.3ad specification.
Pre-requisites:
1. Ethtool support in the base drivers for retrieving the speed and duplex of each slave.
2. A switch that supports IEEE 802.3ad Dynamic link aggregation. Most switches will require some type of configuration to enable 802.3ad mode.
</snip>
So I gather the bonding on the CentOS box is cooperating with the switches in some non-trivial fashion.
And it works just fine thank you very much.
Too bad there are no defaults that use the subnet assigned to the school or the 192.168.0.0/16 (no, not my idea - inherited)
That is a big network. Might make sense in a school though. How many nodes on it? Any chance a <ahem> staff member plugged an unauthorised piece of hardware in somewhere?
Nada, zip, zilch. School is closed, and it is now very reliably demonstrated that running tcpdump makes the box behave and that the network is gone the moment you stop tcpdump. So there are no external factors to this problem. Been on the phone with HP. I will be upgrading the hp packages to the latest version to see if that fixes things.
If it was working, then suddenly stops, then something must have changed. I gather you have some configuration and change management system in place? Backups of conf files?
Hahaha, that was the best part. It just stopped. And stayed that way too after a reboot, reboot of switches and only started working again when I ran tcpdump for some reason.
tcpdump is probably putting your interface into promiscuous mode which is triggering something. Perhaps ARP packets.
Yeah, it is triggering something alright.
I think something (perhaps obscure) has changed, you may just not be aware of it. Comparing your event timeline against your configuration change management systems may help.
No changes have been made to the box whether by me or by my colleague at the HQ. I checked the logs too. No reboot prior to the manifestation of the problem. Stumped really here...
But another colleague did find this in the iLo report:
You're the only admin but you have a colleague with access to an iLo report? That puts a big question mark over a previous assertion :-)
He is not physically on site so he cannot add anything. Nor have the logs shown anything done by him.
Repaired Network 07/06/2010 12:35 07/06/2010 12:00 2 Network Adapters Redundancy Reduced (Slot 10, Port 3)
Repaired Network 07/06/2010 12:35 07/06/2010 12:00 2 Network Adapters Redundancy Reduced (Slot 10, Port 4)
Repaired Network 07/06/2010 12:35 07/06/2010 12:00 2 Network Adapters Redundancy Reduced (Slot 10, Port 1)
Repaired Network 07/06/2010 12:01 07/06/2010 12:00 1 Network Adapter Link Down (Slot 10, Port 2)
Time to ask the HP chap what this is all about.
Looks like the bonding failover process is doing what it should.
A bit more info on your setup might help.
- What is the purpose of the box with the fat network?
Besides being able to saturate the network, what other reason can there be?
- are all 4 interfaces being used?
Oh yes!
- are they plugged into the same switch?
Yup.
- you've got at least 2 networks, plus 2 vlans, plus a public internet connection to this box?
The vlans use bond0 as their phy interface. One vlan is internal and the other is the Internet subnet.
Christopher Chan wrote:
On Thursday, July 08, 2010 05:09 PM, Kahlil Hodgson wrote:
On 07/08/2010 05:08 PM, Christopher Chan wrote:
Hmmm ... which bond mode are you using?
Why mode 4 of course.
Ouch. Never used that mode.
Huh? Like why? It's the recommended mode unless the switch does not support it or the boards don't.
Oh sorry, got a bit grouchy there. I don't like overtime and was getting tired too. Did not read your mail properly.
Chan Chung Hang Christopher wrote:
Christopher Chan wrote:
On Thursday, July 08, 2010 05:09 PM, Kahlil Hodgson wrote:
On 07/08/2010 05:08 PM, Christopher Chan wrote:
Hmmm ... which bond mode are you using?
Why mode 4 of course.
Ouch. Never used that mode.
Huh? Like why? It's the recommended mode unless the switch does not support it or the boards don't.
Oh sorry, got a bit grouchy there. I don't like overtime and was getting tired too. Did not read your mail properly.
I think some bridge or vlan scenarios require promiscuous mode (and the corresponding disabling of hardware acceleration). Maybe the real issue is that something accidentally disabled it and you now only work when tcpdump re-enables it. I'm not sure how this is supposed to be managed atomically when multiple programs may manipulate it and it needs to be propagated across multiple bonded nics, but maybe something went wrong there. At least some things log the change so maybe you can get a hint about when it was turned on and off.
-- Les Mikesell lesmikesell@gmail.com
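On the point about logged changes, a small Python sketch (not from the thread) that pulls promiscuous-mode transitions out of kernel log lines. It assumes the kernel writes messages of the form "device eth0 entered promiscuous mode" and "device eth0 left promiscuous mode"; the exact wording and the log file location may differ on a given kernel, so treat the pattern as an assumption to verify.

```python
import re

# Assumed kernel message wording; verify against the box's own logs.
PROMISC_RE = re.compile(r"device (\S+) (entered|left) promiscuous mode")


def promisc_events(log_lines):
    """Return a list of (device, 'entered' | 'left') tuples in log order."""
    events = []
    for line in log_lines:
        match = PROMISC_RE.search(line)
        if match:
            events.append((match.group(1), match.group(2)))
    return events


# Usage (log path is an assumption for a syslog-based system):
#   with open("/var/log/messages") as fh:
#       for device, action in promisc_events(fh):
#           print(device, action)
```

Lining these events up against the times the network dropped out would show whether something other than tcpdump has been toggling the mode.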
On Thu, 2010-07-08 at 07:51 -0500, Les Mikesell wrote:
I think some bridge or vlan scenarios require promiscuous mode (and the corresponding disabling of hardware acceleration). Maybe the real issue is that something accidentally disabled it and you now only work when tcpdump re-enables it. I'm not sure how this is supposed to be managed atomically when multiple programs may manipulate it and it needs to be propagated across multiple bonded nics, but maybe something went wrong there. At least some things log the change so maybe you can get a hint about when it was turned on and off.
---
Check out /proc/net/bonding/YOUR_BOND (for example /proc/net/bonding/bond0). Make sure each slave's Aggregator ID is the same as the active Aggregator ID. If not, it will cause the problem you're having. Bad NIC hardware is also possible; it's failing over for a reason, as the log showed.
John
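A rough Python sketch of that check (not from the thread). It assumes the 802.3ad layout of /proc/net/bonding/<bond>, where the first "Aggregator ID:" line belongs to the active aggregator and each later one follows a "Slave Interface:" line. Field names can vary between bonding driver versions, so treat the parsing as an assumption.

```python
def parse_aggregators(text):
    """Return (active_aggregator_id, {slave_name: aggregator_id})."""
    active = None
    slaves = {}
    current = None  # name of the slave section being read, if any
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Slave Interface:"):
            current = line.split(":", 1)[1].strip()
        elif line.startswith("Aggregator ID:"):
            agg_id = int(line.split(":", 1)[1])
            if current is None:
                active = agg_id  # seen before any slave section
            else:
                slaves[current] = agg_id
    return active, slaves


def mismatched_slaves(text):
    """List slaves whose aggregator ID differs from the active aggregator."""
    active, slaves = parse_aggregators(text)
    return sorted(name for name, agg in slaves.items() if agg != active)


# Usage (bond name is an example):
#   with open("/proc/net/bonding/bond0") as fh:
#       print(mismatched_slaves(fh.read()))
```

An empty list means every slave has joined the active aggregator; any name in the list is a port that is up but not part of the working LACP group.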
JohnS wrote:
On Thu, 2010-07-08 at 07:51 -0500, Les Mikesell wrote:
I think some bridge or vlan scenarios require promiscuous mode (and the corresponding disabling of hardware acceleration). Maybe the real issue is that something accidentally disabled it and you now only work when tcpdump re-enables it. I'm not sure how this is supposed to be managed atomically when multiple programs may manipulate it and it needs to be propagated across multiple bonded nics, but maybe something went wrong there. At least some things log the change so maybe you can get a hint about when it was turned on and off.
Check out /proc/net/bonding/YOUR_BOND (for example /proc/net/bonding/bond0). Make sure each slave's Aggregator ID is the same as the active Aggregator ID. If not, it will cause the problem you're having. Bad NIC hardware is also possible; it's failing over for a reason, as the log showed.
Okay, I'll take a look tomorrow when I get in to work.
On Thursday, July 08, 2010 09:40 PM, JohnS wrote:
On Thu, 2010-07-08 at 07:51 -0500, Les Mikesell wrote:
I think some bridge or vlan scenarios require promiscuous mode (and the corresponding disabling of hardware acceleration). Maybe the real issue is that something accidentally disabled it and you now only work when tcpdump re-enables it. I'm not sure how this is supposed to be managed atomically when multiple programs may manipulate it and it needs to be propagated across multiple bonded nics, but maybe something went wrong there. At least some things log the change so maybe you can get a hint about when it was turned on and off.
Check out /proc/net/bonding/YOUR_BOND (for example /proc/net/bonding/bond0). Make sure each slave's Aggregator ID is the same as the active Aggregator ID. If not, it will cause the problem you're having. Bad NIC hardware is also possible; it's failing over for a reason, as the log showed.
They check out. What did help besides running tcpdump forever was to do a 'service network restart'. That made the network behave. I wonder what's going on...
At Fri, 09 Jul 2010 10:30:06 +0800 CentOS mailing list centos@centos.org wrote:
On Thursday, July 08, 2010 09:40 PM, JohnS wrote:
On Thu, 2010-07-08 at 07:51 -0500, Les Mikesell wrote:
I think some bridge or vlan scenarios require promiscuous mode (and the corresponding disabling of hardware acceleration). Maybe the real issue is that something accidentally disabled it and you now only work when tcpdump re-enables it. I'm not sure how this is supposed to be managed atomically when multiple programs may manipulate it and it needs to be propagated across multiple bonded nics, but maybe something went wrong there. At least some things log the change so maybe you can get a hint about when it was turned on and off.
Check out /proc/net/bonding/YOUR_BOND (for example /proc/net/bonding/bond0). Make sure each slave's Aggregator ID is the same as the active Aggregator ID. If not, it will cause the problem you're having. Bad NIC hardware is also possible; it's failing over for a reason, as the log showed.
They check out. What did help besides running tcpdump forever was to do a 'service network restart'. That made the network behave. I wonder what's going on...
Are there 'services' that the network 'depends' on, but which are started *later* than network? Running 'service network restart' as a cure suggests this. Do you have any special or custom init scripts relating to your bonding (maybe something that loads special kernel modules or something like that)?
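One way to answer that question, sketched in Python (not from the thread): list which SysV init scripts start after the network script in a runlevel directory. It assumes the usual SNNname symlink naming (for example S10network) in a directory such as /etc/rc3.d; the runlevel and the sample names in the usage comment are assumptions.

```python
import re

# Start links look like "S10network": 'S', two-digit priority, service name.
RC_RE = re.compile(r"^S(\d\d)(.+)$")


def started_after(entries, service="network"):
    """Return services whose start priority is higher than that of `service`.

    entries: names from a runlevel directory, e.g. os.listdir("/etc/rc3.d").
    Results are ordered by start priority, then by name.
    """
    priority = {}
    for name in entries:
        match = RC_RE.match(name)
        if match:
            priority[match.group(2)] = int(match.group(1))
    if service not in priority:
        return []
    base = priority[service]
    later = [name for name, prio in priority.items() if prio > base]
    return sorted(later, key=lambda name: (priority[name], name))


# Usage (runlevel directory is an assumption):
#   import os
#   print(started_after(os.listdir("/etc/rc3.d")))
```

Most services will appear in the output, so the useful part is scanning it for anything that touches interfaces, bridges or kernel modules.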
Are there 'services' that the network 'depends' on, but which are started *later* than network? Running 'service network restart' as a cure suggests this. Do you have any special or custom init scripts relating to your bonding (maybe something that loads special kernel modules or something like that)?
Hmm, now that you mention it, I highly suspect the qemu/libvirt network but I have already shot down these two services along with dnsmasq. What else will set up the 192.168.122.0 space?
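As an aside, a short Python sketch (not from the thread) showing that the 192.168.122.0 space falls inside the 192.168.0.0/16 range mentioned earlier as the inherited internal network. The /24 prefix for the libvirt-style network is an assumption; the thread only gives the base address. Whether this overlap has anything to do with the outage is not established here.

```python
import ipaddress


def networks_overlap(a, b):
    """Return True if the two CIDR networks share any addresses."""
    return ipaddress.ip_network(a).overlaps(ipaddress.ip_network(b))


site_range = "192.168.0.0/16"    # internal range named in the thread
virt_range = "192.168.122.0/24"  # assumption: /24 prefix for the virtual network

print(networks_overlap(site_range, virt_range))  # prints True
```

If a virtual bridge is bringing up addresses in that space, the routing table on the box is worth a look for a more specific route that shadows part of the internal range.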
Les Mikesell wrote:
Chan Chung Hang Christopher wrote:
Christopher Chan wrote:
On Thursday, July 08, 2010 05:09 PM, Kahlil Hodgson wrote:
On 07/08/2010 05:08 PM, Christopher Chan wrote:
Hmmm ... which bond mode are you using?
Why mode 4 of course.
Ouch. Never used that mode.
Huh? Like why? It's the recommended mode unless the switch does not support it or the boards don't.
Oh sorry, got a bit grouchy there. I don't like overtime and was getting tired too. Did not read your mail properly.
I think some bridge or vlan scenarios require promiscuous mode (and the corresponding disabling of hardware acceleration). Maybe the real issue is that something accidentally disabled it and you now only work when tcpdump re-enables it. I'm not sure how this is supposed to be managed atomically when multiple programs may manipulate it and it needs to be propagated across multiple bonded nics, but maybe something went wrong there. At least some things log the change so maybe you can get a hint about when it was turned on and off.
/me wonders if the loading of the bridge and another related module has anything to do with this.
I'll prepare a list of targets for rmmod.
Hi Christopher,
On 08/07/10 10:25, Christopher Chan wrote:
Why mode 4 of course.
Huh? Like why? It's the recommended mode unless the switch does not support it or the boards don't.
I never realised this is the recommended mode. Do you have pointers where it is recommended so that I can read on why?
Cheers
Hakan Koseoglu wrote:
Hi Christopher,
On 08/07/10 10:25, Christopher Chan wrote:
Why mode 4 of course.
Huh? Like why? It's the recommended mode unless the switch does not support it or the boards don't.
I never realised this is the recommended mode. Do you have pointers where it is recommended so that I can read on why?
Maybe 'the recommended' is a bit too much. But here is a read.
http://useopensource.blogspot.com/2010/02/linux-nic-teaming-recommendations....