On 31/01/17 10:49, Kevin Stange wrote:
You said 3.x kernels specifically. The kernel on Xen Made Easy now is a 4.4 kernel. Any chance you have tested with that one?
Not yet, however the future Xen nodes we'll deploy will run CentOS 7 and Xen with kernel 4.4.
Did you ever try without MTU=9000 (default 1500 instead)?
Yes, also with all sorts of configuration combinations like LACP rate slow/fast, "options ixgbe LRO=0,0" and so on. No improvement.
I am having certain issues on certain hardware where there's no shutting down the affected NICs. Trying to do so or unload the igb module hangs the entire box. But in that case they're throwing AER errors instead of just unit hangs:
pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0000
igb 0000:04:00.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0401(Requester ID)
igb 0000:04:00.1: device [8086:10a7] error status/mask=00004000/00000000
igb 0000:04:00.1: [14] Completion Timeout (First)
igb 0000:04:00.1: broadcast error_detected message
igb 0000:04:00.1: broadcast slot_reset message
igb 0000:04:00.1: broadcast resume message
igb 0000:04:00.1: AER: Device recovery successful
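When events like these are weeks or months apart, it can help to tally them per device from saved kernel logs. The following is only a rough sketch, assuming the "PCIe Bus Error" lines keep the format shown above; the function name `count_aer_errors` and the regex are illustrative assumptions, not part of any existing tool.

```python
import re
from collections import Counter

# Assumed line format (taken from the log excerpt above):
#   igb 0000:04:00.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0401(Requester ID)
AER_RE = re.compile(
    r"(?P<driver>\S+) "
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCIe Bus Error: severity=(?P<severity>[^,]+), type=(?P<type>[^,]+)"
)


def count_aer_errors(lines):
    """Count 'PCIe Bus Error' lines, keyed by (driver, PCI address, severity, type)."""
    counts = Counter()
    for line in lines:
        match = AER_RE.search(line)
        if match:
            key = (
                match["driver"],
                match["bdf"],
                match["severity"].strip(),
                match["type"].strip(),
            )
            counts[key] += 1
    return counts
```

It could be fed the lines of a saved `dmesg` or syslog file, e.g. `count_aer_errors(open(path))` with `path` pointing at whatever log file is kept on the box. Lines that do not match the assumed format are simply skipped.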
This is interesting. We've never had any problems with the 1Gb NICs, but we're only using 10Gb for the storage network. Could it be a common problem with either the adapters or the drivers, one that only reproduces when running the Xen-enabled kernel?
Switching to Broadcom would be a possibility, though it's tricky because two of the NICs are onboard, so we'd need to replace the dual-port 1G card with a quad-port 1G card. Since you're saying you're all 10G, maybe you don't know, but if you have any specific Broadcom 1G cards you've had good fortune with, I'd be interested in knowing which models. Broadcom cards are rarely labeled as such which makes finding them a bit more difficult than Intel ones.
We've purchased a number of servers with Broadcom BCM957810A1008G, sold by Dell as QLogic 57810 dual 10Gb Base-T adapters, none of them going up & down like a yo-yo so far.
So far the one hypervisor with pci=nomsi has been quiet but that doesn't mean it's fixed. I need to give it 6 weeks or so. :)
It'd be more like 6-9 months for us, which makes it terrible to debug :-/
Adi Pircalabu