[CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

Tue Jan 31 00:12:09 UTC 2017
Adi Pircalabu <adi at ddns.com.au>

On 31/01/17 10:49, Kevin Stange wrote:
> You said 3.x kernels specifically. The kernel on Xen Made Easy now is a
> 4.4 kernel.  Any chance you have tested with that one?

Not yet. However, the future Xen nodes we'll deploy will run CentOS 7 and 
Xen with kernel 4.4.

> Did you ever try without MTU=9000 (default 1500 instead)?

Yes, also with all sorts of configuration combinations like LACP rate 
slow/fast, "options ixgbe LRO=0,0" and so on. No improvement.

> I am having certain issues on certain hardware where there's no shutting
> down the affected NICs.  Trying to do so or unload the igb module hangs
> the entire box.  But in that case they're throwing AER errors instead of
> just unit hangs:
> pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0000
> igb 0000:04:00.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal),
> type=Transaction Layer, id=0401(Requester ID)
> igb 0000:04:00.1:   device [8086:10a7] error status/mask=00004000/00000000
> igb 0000:04:00.1:    [14] Completion Timeout     (First)
> igb 0000:04:00.1: broadcast error_detected message
> igb 0000:04:00.1: broadcast slot_reset message
> igb 0000:04:00.1: broadcast resume message
> igb 0000:04:00.1: AER: Device recovery successful

This is interesting. We've never had any problems with the 1Gb NICs, but 
we're only using 10Gb for the storage network. Could it be a common 
problem with either the adapters or the drivers, one which only 
reproduces when running the Xen-enabled kernel?
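
To compare notes across hosts, it may help to tally these events per 
device. Below is a minimal sketch in Python, assuming the kernel log 
lines look like the AER output quoted above. The names AER_RE and 
count_aer_errors are made up for illustration, and the pattern only 
covers the "PCIe Bus Error" lines, not the unit hang messages.

```python
import re
from collections import Counter

# Assumed line format, taken from the AER output quoted in this thread:
#   igb 0000:04:00.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal),
AER_RE = re.compile(
    r"(?P<driver>\S+) "
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCIe Bus Error: severity=(?P<severity>[^,]+)"
)


def count_aer_errors(lines):
    """Count AER bus errors keyed by (driver, PCI address, severity)."""
    counts = Counter()
    for line in lines:
        match = AER_RE.search(line)
        if match:
            key = (
                match["driver"],
                match["bdf"],
                match["severity"].strip(),
            )
            counts[key] += 1
    return counts


# Sample input copied from the quoted log; real input would be the
# kernel log of the affected host.
sample = [
    "pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0000",
    "igb 0000:04:00.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal),",
    "igb 0000:04:00.1:    [14] Completion Timeout     (First)",
]

for (driver, bdf, severity), n in count_aer_errors(sample).most_common():
    print(f"{n:5d}  {driver} {bdf}  {severity}")
```

The function takes any iterable of lines, so it could be fed an open 
log file or saved dmesg output instead of the sample list.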

> Switching to Broadcom would be a possibility, though it's tricky because
> two of the NICs are onboard, so we'd need to replace the dual-port 1G
> card with a quad-port 1G card.  Since you're saying you're all 10G,
> maybe you don't know, but if you have any specific Broadcom 1G cards
> you've had good fortune with, I'd be interested in knowing which models.
> Broadcom cards are rarely labeled as such which makes finding them a
> bit more difficult than Intel ones.

We've purchased a number of servers with Broadcom BCM957810A1008G, sold 
by Dell as QLogic 57810 dual 10Gb Base-T adapters, and none of them has 
been going up & down like a yo-yo so far.

> So far the one hypervisor with pci=nomsi has been quiet but that doesn't
> mean it's fixed.  I need to give it 6 weeks or so. :)

It'd be more like 6-9 months for us, which makes it terrible to debug :-/

Adi Pircalabu