On 01/30/2017 04:17 PM, Adi Pircalabu wrote:
On 28/01/17 05:21, Kevin Stange wrote:
On 01/27/2017 06:08 AM, Karel Hendrych wrote:
Have you tried to eliminate all power management features all over?
I've been trying to find and disable all power management features, but that has done relatively little to solve the problems. Stabbing in the dark, I've tried different ACPI settings, including completely disabling it, disabling CPU frequency scaling, and setting pcie_aspm=off on the kernel command line. Are there other kernel options that might be useful to try?
May I chip in here? In our environment we're randomly seeing:
Welcome. It's a relief to know someone else has been having a similar nightmare! Perhaps that's not encouraging...
Jan 17 23:40:14 xen01 kernel: ixgbe 0000:04:00.1 eth6: Detected Tx Unit Hang
Jan 17 23:40:14 xen01 kernel: Tx Queue <0>
Jan 17 23:40:14 xen01 kernel: TDH, TDT <9a>, <127>
Jan 17 23:40:14 xen01 kernel: next_to_use <127>
Jan 17 23:40:14 xen01 kernel: next_to_clean <98>
Jan 17 23:40:14 xen01 kernel: ixgbe 0000:04:00.1 eth6: tx_buffer_info[next_to_clean]
Jan 17 23:40:14 xen01 kernel: time_stamp <218443db3>
Jan 17 23:40:14 xen01 kernel: jiffies <218445368>
Jan 17 23:40:14 xen01 kernel: ixgbe 0000:04:00.1 eth6: tx hang 1 detected on queue 0, resetting adapter
Jan 17 23:40:14 xen01 kernel: ixgbe 0000:04:00.1 eth6: Reset adapter
Jan 17 23:40:15 xen01 kernel: ixgbe 0000:04:00.1 eth6: PCIe transaction pending bit also did not clear.
Jan 17 23:40:15 xen01 kernel: ixgbe 0000:04:00.1: master disable timed out
Jan 17 23:40:15 xen01 kernel: bonding: bond1: link status down for interface eth6, disabling it in 200 ms.
Jan 17 23:40:15 xen01 kernel: bonding: bond1: link status definitely down for interface eth6, disabling it
[...] repeated every second or so.
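For anyone who wants to tally these events across several hosts, here is a minimal Python sketch that counts "Detected Tx Unit Hang" lines per interface and computes how many jiffies elapsed between the stuck buffer's time_stamp and the hang report. The function names and the sample list are hypothetical. The regular expressions assume the exact message layout shown in the excerpt above, which may differ between ixgbe driver versions.

```python
import re
from collections import Counter

# Patterns assume the message layout from the log excerpt above.
HANG_RE = re.compile(r"ixgbe (\S+) (\S+): Detected Tx Unit Hang")
HEX_FIELD_RE = re.compile(r"(time_stamp|jiffies)\s+<([0-9a-fA-F]+)>")


def count_tx_hangs(lines):
    """Count hang reports, keyed by (PCI address, interface name)."""
    counts = Counter()
    for line in lines:
        match = HANG_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


def jiffies_since_stamp(lines):
    """Return jiffies - time_stamp from one hang report, or None."""
    fields = {}
    for line in lines:
        match = HEX_FIELD_RE.search(line)
        if match:
            # The driver prints both values in hexadecimal.
            fields[match.group(1)] = int(match.group(2), 16)
    if "time_stamp" in fields and "jiffies" in fields:
        return fields["jiffies"] - fields["time_stamp"]
    return None


# Hypothetical sample: three lines copied from the excerpt above.
SAMPLE = [
    "Jan 17 23:40:14 xen01 kernel: ixgbe 0000:04:00.1 eth6: Detected Tx Unit Hang",
    "Jan 17 23:40:14 xen01 kernel: time_stamp <218443db3>",
    "Jan 17 23:40:14 xen01 kernel: jiffies <218445368>",
]

hangs = count_tx_hangs(SAMPLE)
delta = jiffies_since_stamp(SAMPLE)
```

Converting the jiffies delta to seconds requires the kernel's HZ value, which the log does not state, so the sketch leaves it as a raw count.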
Are the devices connected to the same network infrastructure?
There are two onboard NICs and two NICs on a dual-port card in each server. All devices connect to a Cisco switch pair in VSS and the links are paired in LACP.
We've been experiencing ixgbe stability issues on CentOS 6.x with various 3.x kernels for years with different ixgbe driver versions and, to date, the only way to completely get rid of the issue was to switch from Intel to Broadcom. Just like in your case, the problem pops up randomly and the only reliable temporary fix is to reboot the affected Xen node. Another temporary fix that worked several times, but not always, was to migrate / shut down the domUs, deactivate the volume groups, log out of all the iSCSI targets, "ifdown bond1" and "modprobe -r ixgbe" followed by "ifup bond1".
The setup is:
- Intel Dual 10Gb Ethernet - either X520-T2 or X540-T2
- Tried Xen kernels from both xen.crc.id.au and CentOS 6 Xen repos
- LACP bonding to connect to the NFS & iSCSI storage using Brocade VDX6740T fabric. MTU=9000
You said 3.x kernels specifically. The kernel on Xen Made Easy now is a 4.4 kernel. Any chance you have tested with that one?
Did you ever try without MTU=9000 (default 1500 instead)?
I am having certain issues on certain hardware where there's no shutting down the affected NICs. Trying to do so or unload the igb module hangs the entire box. But in that case they're throwing AER errors instead of just unit hangs:
pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0000
igb 0000:04:00.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0401(Requester ID)
igb 0000:04:00.1: device [8086:10a7] error status/mask=00004000/00000000
igb 0000:04:00.1: [14] Completion Timeout (First)
igb 0000:04:00.1: broadcast error_detected message
igb 0000:04:00.1: broadcast slot_reset message
igb 0000:04:00.1: broadcast resume message
igb 0000:04:00.1: AER: Device recovery successful
Spammed continuously.
Switching to Broadcom would be a possibility, though it's tricky because two of the NICs are onboard, so we'd need to replace the dual-port 1G card with a quad-port 1G card. Since you're saying you're all 10G, maybe you don't know, but if you have any specific Broadcom 1G cards you've had good fortune with, I'd be interested in knowing which models. Broadcom cards are rarely labeled as such which makes finding them a bit more difficult than Intel ones.
There has to be something common.
The NICs having issues are running a native VLAN, a tagged VLAN, iSCSI and NFS traffic, as well as some basic management stuff over SSH, and they are configured with an MTU of 9000 on the native VLAN. It's a lot of features, but I can't really turn them off and then actually have enough load on the NICs to reproduce the issue. Several of these servers were installed and being burned in for 3 months without ever having an issue, but suddenly collapsed when I tried to bring 20 or so real-world VMs up on them.
There "appears" to be some sort of load-dependent pattern here too, but it's impossible to confirm it. The only stability improvement I was able to get came from using "dom0_max_vcpus=1 dom0_vcpus_pin". Haven't tried pci=nomsi yet.
So far the one hypervisor with pci=nomsi has been quiet but that doesn't mean it's fixed. I need to give it 6 weeks or so. :)
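When comparing hosts over a soak period like this, it helps to confirm which options each kernel actually booted with. The following is a rough Python sketch, with hypothetical function names, that parses a command-line string and reports which wanted options are absent. It only covers Linux kernel options such as pcie_aspm=off and pci=nomsi. The dom0_max_vcpus and dom0_vcpus_pin settings go on the Xen hypervisor command line, so they would not be expected in the dom0 kernel's /proc/cmdline.

```python
def parse_cmdline(text):
    """Map each option name to the list of values given for it.

    Bare flags map to [None]. Comma-separated values such as
    pci=nomsi,noaer are split into separate entries.
    """
    options = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        values = value.split(",") if sep else [None]
        options.setdefault(key, []).extend(values)
    return options


def missing_options(text, wanted):
    """Return the entries of `wanted` that are not present in `text`."""
    options = parse_cmdline(text)
    missing = []
    for token in wanted:
        key, sep, value = token.partition("=")
        expected = value if sep else None
        if expected not in options.get(key, []):
            missing.append(token)
    return missing


# Hypothetical command line for illustration. On a live host the string
# could be read from /proc/cmdline instead.
SAMPLE_CMDLINE = "ro root=/dev/mapper/vg-root pcie_aspm=off quiet"

absent = missing_options(SAMPLE_CMDLINE, ["pcie_aspm=off", "pci=nomsi"])
```

With the sample string above, `absent` contains only "pci=nomsi", since pcie_aspm=off is already set.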
Thanks for your input on the issue!