[CentOS] bnx2 losing connectivity

Mon Dec 14 22:50:46 UTC 2009
nate <centos at linuxpowered.net>

Hoping someone else has seen this before.

I have a few dozen Dell R610 systems running CentOS 5.2 with
kernels from 5.3 and 5.4 (2.6.18-128.1.10.el5 & 2.6.18-164.6.1.el5)
that randomly lose layer 2 network connectivity, either partially
or totally. Running tcpdump on the interface reveals only ARP
broadcasts and no responses. The switch reports no packets being
received on the interface.

Systems can run for days, weeks, or even months without an issue and
then drop off the network. At first I thought it was the Dell switches,
which we had lots of problems with, but it has happened on two other
brands of switches as well (Cisco and Extreme), so I no longer believe
it's the switch but rather the systems.

The workaround is to restart the network on the system. I have even
configured the bonding driver to send ARP requests and fail over to
the backup link when they go unanswered, but that wasn't successful
either, as both links can go down and/or the system can go into a
"degraded" state where it can reach some systems but not others.
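
Until the root cause is found, the restart workaround could be automated. Below is a minimal sketch, not something I have running in production: it pings a gateway and restarts networking after several consecutive failures. The names (`gateway_reachable`, `restart_network`, `watch`), the threshold, and the use of `service network restart` as the restart command are all assumptions for illustration. It also assumes a Python interpreter with the `subprocess` module and the iputils `ping` flags `-c` and `-W`.

```python
import os
import subprocess
import time


def gateway_reachable(gateway, count=3, timeout_s=2):
    """Return True if ping gets at least one reply from `gateway`."""
    devnull = open(os.devnull, "w")
    try:
        # iputils ping exits 0 when at least one reply was received.
        rc = subprocess.call(
            ["ping", "-c", str(count), "-W", str(timeout_s), gateway],
            stdout=devnull,
            stderr=devnull,
        )
    finally:
        devnull.close()
    return rc == 0


def restart_network():
    """Assumed restart command for a CentOS 5 style init system."""
    subprocess.call(["service", "network", "restart"])


def next_failures(failures, reachable):
    """Consecutive-failure counter: reset on success, increment on failure."""
    if reachable:
        return 0
    return failures + 1


def watch(gateway, probe=gateway_reachable, restart=restart_network,
          threshold=3, interval_s=30, max_checks=None, sleep=time.sleep):
    """Probe `gateway` repeatedly; restart after `threshold` failures in a row.

    Returns the number of restarts performed (only returns if `max_checks`
    is set; otherwise it loops forever).
    """
    failures = 0
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        failures = next_failures(failures, probe(gateway))
        if failures >= threshold:
            restart()
            restarts += 1
            failures = 0  # give the link a fresh chance after restarting
        checks += 1
        sleep(interval_s)
    return restarts
```

Calling `watch("192.0.2.1")` (a placeholder address) would run indefinitely. Since the "degraded" state only breaks reachability to some hosts, a single gateway probe can miss it; probing several targets would be a natural extension.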

I have ESXi systems running on the same hardware and to date have not
seen any of them drop off the same way.

A system can be under high traffic load at the time or completely
idle; it doesn't seem to make a difference. There are no log entries
indicating what might be going on.

I have a case open with Dell but am not expecting a whole lot from
them, though maybe I'll get lucky. They asked me to upgrade the NIC
firmware, which I did on a batch of systems to no avail (the release
notes for the firmware said nothing about any fixes that sounded
like my issue).

Driver versions:
ESXi (vSphere):
Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v1.6.9 (December 8, 2007)

Most Linux systems (5.3 kernel):
Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v1.7.9-1 (July 18, 2008)

Some Linux systems (5.4 kernel):
Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v1.9.3 (March 17, 2009)

Happens across at least a dozen systems spread over 4 data centers.

Never seen this sort of behavior before in the hundreds and hundreds
of systems I've run. These systems are all new, the R610 hardware
was released around May 2009, and we've been having issues since
day 1, but only recently have been able to rule the switches out as
the cause.

The latest driver on Broadcom's site is 1.9.20b, which seems odd since
CentOS 5.4 seems to come with 1.9.3 (the date on the Broadcom site is
more recent than the date on the Linux kernel driver in 5.4). Most of
the fixes in the recent driver versions seem to focus on iSCSI,
which I'm not using.
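
When comparing driver versions across a fleet, note that plain string comparison gets versions like 1.9.20b and 1.9.3 the wrong way round. Here is a rough sketch of pulling the driver name, version, and date out of banner lines like the ones listed above and ordering the versions numerically. The helper names `parse_banner` and `version_key` are made up for illustration, and the key simply ignores letter suffixes such as the "b" in 1.9.20b, which is an assumption about how those suffixes should be treated.

```python
import re

# Matches e.g. "... Driver bnx2 v1.7.9-1 (July 18, 2008)"
BANNER = re.compile(
    r"Driver (?P<driver>\S+) v(?P<version>\S+) \((?P<date>[^)]+)\)"
)


def parse_banner(line):
    """Return (driver, version, date) from a driver banner, or None."""
    m = BANNER.search(line)
    if m is None:
        return None
    return m.group("driver"), m.group("version"), m.group("date")


def version_key(version):
    """Numeric sort key: '1.7.9-1' -> (1, 7, 9, 1), '1.9.20b' -> (1, 9, 20).

    Letter suffixes are ignored (an assumption, see the text above).
    """
    return tuple(int(n) for n in re.findall(r"\d+", version))


banners = [
    "Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v1.6.9 (December 8, 2007)",
    "Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v1.7.9-1 (July 18, 2008)",
    "Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v1.9.3 (March 17, 2009)",
]
versions = [parse_banner(b)[1] for b in banners]
newest = max(versions + ["1.9.20b"], key=version_key)
```

With this key, 1.9.20b sorts after 1.9.3, which matches the release dates on Broadcom's site.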

lspci says:
02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
        Subsystem: Dell Unknown device 0236
        Flags: bus master, fast devsel, latency 0, IRQ 114
        Memory at dc000000 (64-bit, non-prefetchable) [size=32M]
        Capabilities: [48] Power Management version 3
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/4
        Capabilities: [a0] MSI-X: Enable- Mask- TabSize=9
        Capabilities: [ac] Express Endpoint IRQ 0
        Capabilities: [100] Device Serial Number c9-dc-93-fe-ff-9b-21-00
        Capabilities: [110] Advanced Error Reporting
        Capabilities: [150] Power Budgeting
        Capabilities: [160] Virtual Channel

I suppose I could go build the latest driver from their site and see
how it goes.