[CentOS] Networking just stopped working

Thu Jul 8 09:25:25 UTC 2010
Christopher Chan <christopher.chan at bradbury.edu.hk>

On Thursday, July 08, 2010 05:09 PM, Kahlil Hodgson wrote:
> On 07/08/2010 05:08 PM, Christopher Chan wrote:
>>> Hmmm ... which bond mode are you using?
>>
>> Why mode 4 of course.
>
> Ouch.  Never used that mode.

Huh? Like why? It's the recommended mode unless the switch does not 
support it or the boards don't.

>
> <snip>
> mode=4 (802.3ad)
> IEEE 802.3ad Dynamic link aggregation. Creates aggregation groups that
> share the same speed and duplex settings. Utilizes all slaves in the
> active aggregator according to the 802.3ad specification.
>
> 	Pre-requisites:
> 	1. Ethtool support in the base drivers for retrieving
> 	the speed and duplex of each slave.
> 	2. A switch that supports IEEE 802.3ad Dynamic link
> 	aggregation.
> 	Most switches will require some type of configuration
> 	to enable 802.3ad mode.
> </snip>
>
> So I gather the bonding on the CentOS box is cooperating with the
> switches in some non-trivial fashion.

And it works just fine thank you very much.
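
One way to check the 802.3ad state from the Linux side is to read the
bonding driver's status file, /proc/net/bonding/bond0. The following is
only a minimal sketch: the helper names are made up for illustration, and
it assumes the usual "Key: value" layout of that file, which can differ
between driver versions. It flags slaves whose link is down or whose
aggregator ID differs from the active aggregator.

```python
def parse_bonding_status(text):
    """Split bonding status text into (bond_info, list_of_slave_dicts).

    Assumes the usual "Key: value" layout of /proc/net/bonding/<bond>;
    the exact set of keys depends on the bonding driver version.
    """
    bond, slaves = {}, []
    current = bond
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            # Every "Slave Interface" line starts a new per-slave section.
            current = {"Slave Interface": value}
            slaves.append(current)
        else:
            # Keep the first value seen for a key within a section.
            current.setdefault(key, value)
    return bond, slaves


def check_8023ad(text):
    """Return a list of human-readable problems (empty if none found)."""
    bond, slaves = parse_bonding_status(text)
    problems = []
    if "802.3ad" not in bond.get("Bonding Mode", ""):
        problems.append("bond is not in 802.3ad mode")
    # The first "Aggregator ID" seen before any slave section is taken
    # to be the active aggregator (an assumption about the file layout).
    active = bond.get("Aggregator ID")
    for slave in slaves:
        name = slave["Slave Interface"]
        if slave.get("MII Status") != "up":
            problems.append(f"{name}: link is {slave.get('MII Status')}")
        if slave.get("Aggregator ID") != active:
            problems.append(
                f"{name}: aggregator {slave.get('Aggregator ID')} "
                f"!= active {active}"
            )
    return problems


def check_bond(name="bond0"):
    """Read the live status file for the given bond and check it."""
    with open(f"/proc/net/bonding/{name}") as fh:
        return check_8023ad(fh.read())
```

Calling check_bond("bond0") on the box itself returns an empty list when
every slave is up and sits in the active aggregator. A slave stuck in a
different aggregator usually means the switch side is not negotiating
LACP on that port.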


>
>> Too bad there are no defaults that use the subnet assigned to the school
>> or the 192.168.0.0/16 (no, not my idea - inherited)
>
> That is a big network.  Might make sense in a school though.  How many
> nodes on it?  Any chance a <ahem> staff member plugged an unauthorised
> piece of hardware in somewhere?

Nada, zip, zilch. School is closed, and it is now very reliably 
demonstrated that running tcpdump makes the network behave and that it 
is gone the moment you stop tcpdump. So there are no external factors 
in this problem. Been on the phone with HP. I will be upgrading the HP 
packages to the latest version to see if that fixes things.
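
For scale, the size of the 192.168.0.0/16 network mentioned above can be
worked out with Python's standard ipaddress module. This is just the
arithmetic, not a statement about how many nodes are actually in use.

```python
import ipaddress

# The inherited network from the thread.
net = ipaddress.ip_network("192.168.0.0/16")

# A /16 leaves 32 - 16 = 16 host bits, so 2**16 addresses in total.
total_addresses = net.num_addresses

# Subtract the network and broadcast addresses to get usable hosts.
usable_hosts = total_addresses - 2

print(net.netmask, total_addresses, usable_hosts)
```

All of those addresses share a single broadcast domain unless the
network is split up further, for example with VLANs.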

>
>>> If it was working, then suddenly stops, then something must have
>>> changed.  I gather you have some configuration and change management
>>> system in place?  Backups of conf files?
>>
>> Hahaha, that was the best part. It just stopped. And stayed that way too
>> after a reboot, reboot of switches and only started working again when I
>> ran tcpdump for some reason.
>
> tcpdump is probably putting your interface into promiscuous mode which
> is triggering something. Perhaps ARP packets.

Yeah, it is triggering something alright.
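
To test the promiscuous-mode theory, one could check the IFF_PROMISC bit
(0x100 in <linux/if.h>) in the interface flags while tcpdump is running
and again after it stops. A minimal sketch, assuming the kernel exposes
the flags as a hex string in /sys/class/net/<iface>/flags; the function
names are made up for illustration.

```python
IFF_PROMISC = 0x100  # promiscuous-mode bit from <linux/if.h>


def is_promiscuous(flags_text):
    """Decode the hex string found in /sys/class/net/<iface>/flags."""
    return bool(int(flags_text.strip(), 16) & IFF_PROMISC)


def iface_is_promiscuous(iface):
    """Read the flags of a live interface such as "bond0" or "eth0"."""
    with open(f"/sys/class/net/{iface}/flags") as fh:
        return is_promiscuous(fh.read())
```

Running tcpdump with -p, which asks it not to put the interface into
promiscuous mode, is another way to tell the two cases apart: if the
network also comes back with -p, promiscuous mode is probably not the
trigger.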


>
> I think something (perhaps obscure) has changed, you may just not be
> aware of it.  Comparing your event timeline against your configuration
> change management systems may help.

No changes have been made to the box, whether by me or by my colleague 
at the HQ. I checked the logs too. No reboot prior to the manifestation 
of the problem. Really stumped here...


>
>> But another colleague did find this in the iLo report:
>
> You're the only admin but you have a colleague with access to an iLo
> report?  That puts a big question mark over a previous assertion :-)

He is not physically on site so he cannot add anything. Nor have the 
logs shown anything done by him.

>
>> Repaired Network 07/06/2010 12:35 07/06/2010 12:00 2 Network Adapters
>> Redundancy Reduced (Slot 10, Port 3)
>>
>> Repaired Network 07/06/2010 12:35 07/06/2010 12:00 2 Network Adapters
>> Redundancy Reduced (Slot 10, Port 4)
>>
>> Repaired Network 07/06/2010 12:35 07/06/2010 12:00 2 Network Adapters
>> Redundancy Reduced (Slot 10, Port 1)
>>
>> Repaired Network 07/06/2010 12:01 07/06/2010 12:00 1 Network Adapter
>> Link Down (Slot 10, Port 2)
>>
>> Time to ask the HP chap what this is all about.
>
> Looks like the bonding failover process is doing what it should.
>
> A bit more info on your setup might help.
>
> 1. What is the purpose of the box with the fat network?

Besides being able to saturate the network, what other reason can there be?


> 2. are all 4 interfaces being used?

Oh yes!


> 3. are they plugged into the same switch?

Yup.


> 4. you've got at least 2 networks, plus 2 vlans, plus a public internet
> connection to this box?
>

The vlans use bond0 as their phy interface. One vlan is internal and the 
other is the Internet subnet.