On 31/01/17 10:49, Kevin Stange wrote:
> You said 3.x kernels specifically. The kernel on Xen Made Easy now is a
> 4.4 kernel. Any chance you have tested with that one?

Not yet, however the future Xen nodes we'll deploy will run CentOS 7 and
Xen with kernel 4.4.

> Did you ever try without MTU=9000 (default 1500 instead)?

Yes, also with all sorts of configuration combinations like LACP rate
slow/fast, "options ixgbe LRO=0,0" and so on. No improvement.

> I am having certain issues on certain hardware where there's no shutting
> down the affected NICs. Trying to do so or unload the igb module hangs
> the entire box. But in that case they're throwing AER errors instead of
> just unit hangs:
>
> pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0000
> igb 0000:04:00.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal),
> type=Transaction Layer, id=0401(Requester ID)
> igb 0000:04:00.1: device [8086:10a7] error status/mask=00004000/00000000
> igb 0000:04:00.1: [14] Completion Timeout (First)
> igb 0000:04:00.1: broadcast error_detected message
> igb 0000:04:00.1: broadcast slot_reset message
> igb 0000:04:00.1: broadcast resume message
> igb 0000:04:00.1: AER: Device recovery successful

This is interesting. We've never had any problems with the 1Gb NICs, but
we're only using 10Gb for the storage network. Could it be a common
problem with either the adapters or the drivers, one that only
reproduces when running the Xen-enabled kernel?

> Switching to Broadcom would be a possibility, though it's tricky because
> two of the NICs are onboard, so we'd need to replace the dual-port 1G
> card with a quad-port 1G card. Since you're saying you're all 10G,
> maybe you don't know, but if you have any specific Broadcom 1G cards
> you've had good fortune with, I'd be interested in knowing which models.
> Broadcom cards are rarely labeled as such which makes finding them a
> bit more difficult than Intel ones.

We've purchased a number of servers with Broadcom BCM957810A1008G, sold
by Dell as QLogic 57810 dual 10Gb Base-T adapters. None of them has been
going up & down like a yo-yo so far.

> So far the one hypervisor with pci=nomsi has been quiet but that doesn't
> mean it's fixed. I need to give it 6 weeks or so. :)

It'd be more like 6-9 months for us, which makes it terrible to debug :-/

Adi Pircalabu