[CentOS] bonding driver with arp detection

Wed Sep 9 21:04:54 UTC 2009
nate <centos at linuxpowered.net>

Hello -

Was wondering if anyone is running the bonding network driver in
active/backup mode using arp_validate?

I'm trying to deal with really crappy network switches from Dell,
and I thought I could work around their faults in the short term
by switching from link monitoring to arp monitoring. But I just
ran into a situation where it seems even arp monitoring isn't
enough.

This is my config for the system:

CentOS 5.2 base
2.6.18-128.1.10.el5 kernel (I think it's a 5.3 kernel)

Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
ARP Polling Interval (ms): 1000
ARP IP target/s (n.n.n.n form): 10.16.1.1, 10.16.1.254

Slave Interface: eth0
MII Status: up
Link Failure Count: 2
Permanent HW addr: 00:21:9b:8d:f1:0c

Slave Interface: eth1
MII Status: up
Link Failure Count: 1
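For reference, on CentOS 5 a status like the above would typically come from bonding module options along these lines (a sketch; the bond name and putting the options in /etc/modprobe.conf are assumptions, the mode/interval/targets are from the status output):

```shell
# /etc/modprobe.conf -- sketch of the bonding options behind the status above;
# mode, arp_interval, and arp_ip_target match this post, the rest is assumed
alias bond0 bonding
options bond0 mode=active-backup arp_interval=1000 arp_ip_target=10.16.1.1,10.16.1.254
```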


--

Both IPs are supposed to be redundant: .1 is a stacked pair of
piece-of-shit Dell gigabit switches, and .254 is a pair of
F5 LTM load balancers.

The system had been running fine for about the past 9 days since
I enabled this stuff, and then for some reason it could no longer
talk to 10.16.1.254. Looking at tcpdump I saw the system almost
flooding the link with arp requests for that address and getting
maybe one in ten answered. Communication with 10.16.1.1 was fine
by contrast. 40 other systems on the same LAN communicate with
both addresses constantly, so I know both were more or less OK;
it was something with the switch itself (I have seen behavior on
multiple Dell switches where they decide to stop forwarding
traffic, which is what prompted me to switch from link monitoring
to arp monitoring).

At the time the system was running on eth0, so I brought that
interface down and it immediately failed over to eth1, and
things were OK again. I have since failed it back to eth0 and
things are still fine.
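For reference, the failover can also be forced by hand through sysfs instead of downing the interface (a sketch; it assumes the bond is named bond0 and that the sysfs bonding interface is available in this kernel):

```shell
# switch the active slave to eth1 without taking eth0 down
echo eth1 > /sys/class/net/bond0/bonding/active_slave
# confirm which slave is now active
cat /sys/class/net/bond0/bonding/active_slave
```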

What I'd like to do, if possible, is configure the bonding driver
to fail over if either of the arp targets stops responding; as far
as I can see, the default is that if even one succeeds the driver
considers the link OK. Looking at the arp_validate option, it
seems it only applies to slaves, not to the active link.

Is there anything I can do to make the driver fail over if even
one of the two addresses is not responding?
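For context, the arp_validate values this driver version documents are none, active, backup, and all, which choose where ARP replies are validated rather than how many targets must answer. A sketch of enabling it on every slave (option placement in /etc/modprobe.conf is assumed; as far as I can tell this still treats one answering target as success):

```shell
# /etc/modprobe.conf -- validate ARP probes on all slaves, active and backup;
# this does NOT appear to require every arp_ip_target to respond
options bond0 mode=active-backup arp_interval=1000 \
    arp_ip_target=10.16.1.1,10.16.1.254 arp_validate=all
```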

I have noticed that in some cases the failover does work;
checking several systems, they all have at least one link
failure detected on each interface.
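The quick check I did across systems can be scripted; a minimal sketch that just parses the standard /proc/net/bonding output:

```shell
# print "interface failure-count" for each slave of bond0
awk '/^Slave Interface:/ {s = $3}
     /^Link Failure Count:/ {print s, $4}' /proc/net/bonding/bond0
```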

Longer term my goal is to replace the switches entirely; I've
been pushing for that for about a month now.

Just goes to show you get what you pay for when you buy crap
equipment (wasn't my idea), sigh.

thanks

nate