[CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18
adi at ddns.com.au
Tue Jan 31 00:12:09 UTC 2017
On 31/01/17 10:49, Kevin Stange wrote:
> You said 3.x kernels specifically. The kernel on Xen Made Easy now is a
> 4.4 kernel. Any chance you have tested with that one?
Not yet, however the future Xen nodes we'll deploy will run CentOS 7 and
Xen with kernel 4.4.
> Did you ever try without MTU=9000 (default 1500 instead)?
Yes, along with all sorts of configuration combinations such as LACP rate
slow/fast, "options ixgbe LRO=0,0" and so on. No improvement.
> I am having certain issues on certain hardware where there's no shutting
> down the affected NICs. Trying to do so or unload the igb module hangs
> the entire box. But in that case they're throwing AER errors instead of
> just unit hangs:
> pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0000
> igb 0000:04:00.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal),
> type=Transaction Layer, id=0401(Requester ID)
> igb 0000:04:00.1: device [8086:10a7] error status/mask=00004000/00000000
> igb 0000:04:00.1:  Completion Timeout (First)
> igb 0000:04:00.1: broadcast error_detected message
> igb 0000:04:00.1: broadcast slot_reset message
> igb 0000:04:00.1: broadcast resume message
> igb 0000:04:00.1: AER: Device recovery successful
This is interesting. We've never had any problems with the 1Gb NICs, but
we're only using 10Gb for the storage network. Could it be a common
problem with either the adapters or the drivers, one that only
reproduces when running the Xen-enabled kernel?
> Switching to Broadcom would be a possibility, though it's tricky because
> two of the NICs are onboard, so we'd need to replace the dual-port 1G
> card with a quad-port 1G card. Since you're saying you're all 10G,
> maybe you don't know, but if you have any specific Broadcom 1G cards
> you've had good fortune with, I'd be interested in knowing which models.
> Broadcom cards are rarely labeled as such which makes finding them a
> bit more difficult than Intel ones.
We've purchased a number of servers with Broadcom BCM957810A1008G, sold
by Dell as QLogic 57810 dual 10Gb Base-T adapters, and none of them has
gone up & down like a yo-yo so far.
> So far the one hypervisor with pci=nomsi has been quiet but that doesn't
> mean it's fixed. I need to give it 6 weeks or so. :)
It'd be more like 6-9 months for us, which makes it terrible to debug :-/
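
Since the wait between occurrences is so long, it may help to tally the
errors automatically rather than watch for them by hand. Below is a rough
sketch, not something either of us is running: it counts "PCIe Bus Error"
lines like the ones quoted above, per driver, PCI address and severity.
The function name and the regular expression are my own assumptions, and
the pattern only covers the message format shown in this thread.

```python
import re
from collections import Counter

# Matches kernel log lines of the form quoted in this thread, e.g.
#   igb 0000:04:00.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal),
# Other drivers or kernel versions may word this differently (assumption).
BUS_ERROR = re.compile(
    r"(?P<driver>\S+) "
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCIe Bus Error: severity=(?P<severity>[^,]+)"
)


def count_bus_errors(lines):
    """Tally 'PCIe Bus Error' lines per (driver, PCI address, severity)."""
    counts = Counter()
    for line in lines:
        match = BUS_ERROR.search(line)
        if match:
            key = (
                match.group("driver"),
                match.group("bdf"),
                match.group("severity").strip(),
            )
            counts[key] += 1
    return counts
```

Feeding it an open log file or sys.stdin (for example the output of
dmesg piped in) gives one counter entry per affected device, which can
be compared between hosts booted with and without pci=nomsi.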