I have three different types of CentOS 6 Xen 4.4-based hypervisors (grouped by hardware) that are experiencing stability issues which I haven't been able to track down. All three types seem to be having issues with their NICs and/or PCIe. In most cases, the issues are unrecoverable and require a hard boot to resolve. All have Intel NICs.
Often the systems will remain stable for days or weeks, then suddenly encounter one of these issues. I have yet to tie the error to any specific action on the systems and can't reproduce it reliably.
- Supermicro X8DT3, Dual Xeon E5620, 2x 82575EB NICs, 2x 82576 NICs
Kernel messages upon failure:
pcieport 0000:00:03.0: AER: Multiple Corrected error received: id=0018
pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, id=0018(Receiver ID)
pcieport 0000:00:03.0: device [8086:340a] error status/mask=00002000/00001001
pcieport 0000:00:03.0: [13] Advisory Non-Fatal
pcieport 0000:00:03.0: Error of this Agent(0018) is reported first
igb 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0400(Receiver ID)
igb 0000:04:00.0: device [8086:10a7] error status/mask=00002001/00002000
igb 0000:04:00.0: [ 0] Receiver Error (First)
igb 0000:04:00.1: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0401(Receiver ID)
igb 0000:04:00.1: device [8086:10a7] error status/mask=00002001/00002000
igb 0000:04:00.1: [ 0] Receiver Error (First)
This spams to the console continuously until hard booting. (See the lspci sketch after the hardware list for decoding the PCIe IDs above.)
- Supermicro X9DRD-iF/LF, Dual Xeon E5-2630, 2x I350, 2x 82575EB
igb 0000:82:00.0: Detected Tx Unit Hang
  Tx Queue             <1>
  TDH                  <43>
  TDT                  <50>
  next_to_use          <50>
  next_to_clean        <43>
buffer_info[next_to_clean]
  time_stamp           <12e6bc0b6>
  next_to_watch        <ffff880006aa7440>
  jiffies              <12e6bc8dc>
  desc.status          <1c8210>
This spams to the console continuously until hard booting.
- Supermicro X9DRT, Dual Xeon E5-2650, 2x I350, 2x 82571EB
e1000e 0000:04:00.0 eth2: Detected Hardware Unit Hang:
  TDH                  <ff>
  TDT                  <33>
  next_to_use          <33>
  next_to_clean        <fd>
buffer_info[next_to_clean]:
  time_stamp           <138230862>
  next_to_watch        <ff>
  jiffies              <138231ac0>
  next_to_watch.status <0>
MAC Status             <80383>
PHY Status             <792d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
On this type of system, the NIC automatically recovers and I don't need to reboot.
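An aside for anyone wanting to decode those AER messages: the id= values are PCIe requester IDs (bus/device/function in hex), so they map straight back to lspci addresses. A minimal sketch using the values from the first log above:

# id=0018 decodes to bus 00, device 03, function 0 -- the root port
lspci -s 00:03.0 -vvv | grep -A3 'Advanced Error Reporting'
# id=0400/0401 are the two igb ports; -nn should show [8086:10a7]
lspci -s 04:00.0 -nn
lspci -s 04:00.1 -nn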
So far I tried using pcie_aspm=off to see if that would help, but it appears that the 3.18 kernel disables ASPM by default on these systems after probing the BIOS. The change did not resolve the stability issues.
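For reference, the ASPM state the kernel actually settled on can be checked without rebooting (a sketch; output formats vary a little by kernel version):

# global policy the kernel chose ([default], performance, powersave)
cat /sys/module/pcie_aspm/parameters/policy
# per-link negotiated state; look for "ASPM Disabled" vs "ASPM L0s/L1"
lspci -vv | grep 'LnkCtl:'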
On the latter system type I also turned off all offloading settings. Stability appeared to improve slightly, but the problem wasn't fully resolved. I haven't adjusted offload settings on the first two server types yet.
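For the record, the offloads were toggled with ethtool along these lines (a sketch; eth2 and the exact feature list are illustrative, and the settings don't persist across reboots):

ethtool -k eth2   # show current offload settings (lowercase -k)
ethtool -K eth2 tso off gso off gro off sg off rx off tx off   # rx/tx are the checksum offloads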
I suspect this problem is related to the 3.18 kernel used by the virt SIG: we ran these boxes with Xen on CentOS 5's kernel with no issues for years, and systems of these types used elsewhere in our facility are stable under CentOS 6's standard kernel. This affects more than one server of each type, so I don't believe it is a hardware failure, unless it's a hardware design flaw.
Has anyone experienced similar issues with this configuration, and if so, does anyone have tips on how to resolve the issues?
Kevin Stange, it could be the kernel; otherwise, try updating the NIC driver or the firmware of the NIC card. Hope that helps!
Xlord

-----Original Message-----
From: CentOS-virt [mailto:centos-virt-bounces@centos.org] On Behalf Of Kevin Stange
Sent: Tuesday, January 24, 2017 1:04 AM
To: centos-virt@centos.org
Subject: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18
<snip>
On Tue, Jan 24, 2017 at 09:29:39PM +0800, -=X.L.O.R.D=- wrote:
<snip>
Has anyone experienced similar issues with this configuration, and if so, does anyone have tips on how to resolve the issues?
Honestly I would email Intel and see if they can help. This looks like the NIC decides something is wrong, throws a PCIe error, and then resets itself.
It could also be an error in the Linux stack which would "eat" an interrupt when migrating interrupts (which was fixed upstream, see below). Are you running irqbalance? Could you try turning it off?
Did you have these issues with an earlier kernel?
The fix was:

commit ff1e22e7a638a0782f54f81a6c9cb139aca2da35
Author: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Date:   Fri Mar 18 10:11:07 2016 -0400

    xen/events: Mask a moving irq

and then there was a fix to this fix:

commit f0f393877c71ad227d36705d61d1e4062bc29cf5
Author: Ross Lagerwall <ross.lagerwall@citrix.com>
Date:   Tue May 10 16:11:00 2016 +0100

    xen/events: Don't move disabled irqs
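For anyone checking whether a given kernel already carries these fixes, in an upstream git checkout (a sketch; it assumes the commits landed under drivers/xen/events/, per the subject lines):

git tag --contains ff1e22e7a638a0782f54f81a6c9cb139aca2da35   # releases containing the first fix
git log --oneline --grep='Mask a moving irq' -- drivers/xen/events/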
On 01/24/2017 09:10 AM, Konrad Rzeszutek Wilk wrote:
<snip>
Has anyone experienced similar issues with this configuration, and if so, does anyone have tips on how to resolve the issues?
Honestly I would email Intel and see if they can help. This looks like the NIC decides something is wrong, throws a PCIe error, and then resets itself.
This happens for several different NICs. Is there a good contact at Intel for this kind of thing, or should I just try to reach them through their web site?
It could also be an error in the Linux stack which would "eat" an interrupt when migrating interrupts (which was fixed upstream, see below). Are you running irqbalance? Could you try turning it off?
irqbalance is enabled on these servers. I'll try disabling it.
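(For the archives, on CentOS 6 with SysV init that's:)

service irqbalance stop      # stop it immediately
chkconfig irqbalance off     # keep it off across reboots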
Did you have these issues with an earlier kernel?
The last kernel these boxes ran was 2.6.18-412.el5xen under CentOS 5, and they were very stable. However, the differences between 2.6.18 and 3.18 are immense, especially around features like ASPM and other power management code. We've run into ASPM issues before when moving systems from CentOS 5 to CentOS 6's 2.6.32 kernel, but not on this particular hardware, which is why my first thought was to look at ASPM.
They've all been upgraded to CentOS 6 and are running the virt SIG kernel kernel-3.18.44-20.el6.x86_64. I haven't run any previous versions of 3.18 or tried any other kernels.
It surprises me that we would be having all these issues without there being a more widespread problem, considering the hardware is fairly mainstream and covers a lot of NIC chips.
On 01/24/2017 11:16 AM, Kevin Stange wrote:
<snip>
irqbalance is enabled on these servers. I'll try disabling it.
I had stopped irqbalance yesterday afternoon, but a hypervisor's NICs failed anyway early this morning, so I'm pretty sure this is not the right tree to bark up.
On 01/25/2017 11:49 AM, Kevin Stange wrote:
<snip>
Here is a set of drivers/firmware from Intel for those NICs:
https://downloadcenter.intel.com/download/15817/Intel-Network-Adapter-Driver...
I will see if I can get a CentOS-6 build of the latest version of that from our older SRPM:
http://vault.centos.org/6.7/xen4/Source/SPackages/e1000e-2.5.4-3.10.68.2.el6...
I am currently very busy with several c5, c6, c7 updates and the i686 altarch c7 tree .. but I have this on my list. In the meantime, maybe someone else could also see if those drivers help you (or you could try to compile / install it).
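For anyone impatient, a rough sketch of rebuilding that SRPM locally (it assumes kernel-devel for the running 3.18 kernel and the usual rpm-build tooling are installed; the filename is whatever the vault provides):

rpmbuild --rebuild e1000e-*.src.rpm
rpm -ivh ~/rpmbuild/RPMS/x86_64/e1000e-*.rpm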
Do you have another machine that you can use to see if you can duplicate the issue NOT running the xen.gz hypervisor boot, but just the straight kernel?
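To spell out that test on CentOS 6: the same kernel can be booted both ways from /boot/grub/grub.conf, since the Xen entry loads xen.gz with the kernel as a module while the plain entry boots the kernel directly (a sketch; paths, versions, and memory values are illustrative):

title CentOS 6 with Xen (3.18.44-20.el6.x86_64)
        root (hd0,0)
        kernel /xen.gz dom0_mem=2048M,max:2048M
        module /vmlinuz-3.18.44-20.el6.x86_64 ro root=/dev/mapper/vg_root
        module /initramfs-3.18.44-20.el6.x86_64.img
title CentOS 6 straight kernel (3.18.44-20.el6.x86_64)
        root (hd0,0)
        kernel /vmlinuz-3.18.44-20.el6.x86_64 ro root=/dev/mapper/vg_root
        initrd /initramfs-3.18.44-20.el6.x86_64.img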
Thanks, Johnny Hughes
On 01/26/2017 09:32 AM, Johnny Hughes wrote:
<snip>
Actually .. I think this is the driver for you:
https://downloadcenter.intel.com/download/13663
And this explains how to make it work:
http://www.intel.com/content/www/us/en/support/network-and-i-o/ethernet-prod...
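(The usual flow for Intel's out-of-tree drivers is roughly the following -- a sketch, with the version placeholder standing in for whatever the download page provides; note that reloading the module drops the links briefly:)

tar xzf igb-<version>.tar.gz
cd igb-<version>/src
make install                  # builds against the running kernel's build tree
rmmod igb && modprobe igb     # or reboot to pick up the new module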
On 01/26/2017 09:35 AM, Johnny Hughes wrote:
<snip>
Do you have another machine that you can use to see if you can duplicate the issue NOT running the xen.gz hypervisor boot, but just the straight kernel?
I can't actually reproduce this problem reliably. It happens randomly when the servers are up and running anywhere between a few hours and a month or more, and I haven't been able to isolate any specific way to cause it to happen. As a result I can't really test different solutions on different servers to see what helps. I was hoping other people were seeing it so that I could get some direction. If I can reproduce it, it won't take me very long to identify what the cause is. Right now if I do upgrade the drivers on the systems I won't really know if it's fixed until I don't see another issue for several months.
Actually .. I think this is the driver for you:
https://downloadcenter.intel.com/download/13663
And this explains how to make it work:
http://www.intel.com/content/www/us/en/support/network-and-i-o/ethernet-prod...
The different combinations of NICs overlap both the e1000e and igb drivers, but the most egregious issues have been with the igb ones. I'll try to give this a shot and report back if I still see issues with a server after doing so, but it might be a week or two before I find out.
On 01/26/2017 02:08 PM, Kevin Stange wrote:
<snip>
The NICs giving issues were in most cases the ones using the igb driver. I've tried replacing the drivers on some HVs with the version you suggested, but it doesn't seem to have helped with stability. Any other ideas?
Have you tried eliminating all power management features across the board?
Are the devices connected to the same network infrastructure?
There has to be something common.
I've been using Intel NICs with Xen/CentOS for ages with no issues.
Karel
On 27.1.2017 02:57, Kevin Stange wrote:
<snip>
On 01/27/2017 06:08 AM, Karel Hendrych wrote:
Have you tried eliminating all power management features across the board?
I've been trying to find and disable all power management features, but with relatively little luck in solving the problems. Stabbing in the dark, I've tried different ACPI settings, including completely disabling ACPI, disabling CPU frequency scaling, and setting pcie_aspm=off on the kernel command line. Are there other kernel options that might be useful to try?
Are the devices connected to the same network infrastructure?
There are two onboard NICs and two NICs on a dual-port card in each server. All devices connect to a cisco switch pair in VSS and the links are paired in LACP.
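For context, the bonding is the stock CentOS 6 LACP configuration, along these lines (a sketch; device names and options are illustrative):

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BONDING_OPTS="mode=802.3ad miimon=100 lacp_rate=slow"
MTU=9000
BOOTPROTO=none
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-eth2 (one per slave)
DEVICE=eth2
MASTER=bond0
SLAVE=yes
ONBOOT=yes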
There has to be something common.
The NICs having issues carry a native VLAN, a tagged VLAN, iSCSI and NFS traffic, plus some basic management traffic over SSH, and they are configured with an MTU of 9000 on the native VLAN. It's a lot of features, but I can't really turn them off and still have enough load on the NICs to reproduce the issue. Several of these servers were installed and burned in for 3 months without ever having an issue, but suddenly collapsed when I tried to bring 20 or so real-world VMs up on them.
The other NICs in the system that are connected don't exhibit issues and run only VM network interfaces. They are also in LACP and running VLAN tags, but normal 1500 MTU.
So far it seems to correlate with the NICs on the expansion cards, but those also happen to be the ones carrying the storage and management traffic. I'm trying to swap some of this load to the onboard NICs to see if the issues migrate over with it, or if they stay with the expansion cards.
If the issue exists on both NIC types, then it rules out the specific NIC chipset as the culprit. It could point to the driver, but upgrading it to a newer version did not help and actually appeared to make everything worse. This issue might actually be more to do with the PCIe bridge than the NICs, but these are still different motherboards with different PCIe bridges (5520 vs C600) experiencing the same issues.
I've been using Intel NICs with Xen/CentOS for ages with no issues.
I figured that must be so. Everyone uses Intel NICs. If this was a common issue, it would probably be causing a lot of people a lot of trouble.
Are there other kernel options that might be useful to try?
pci=nomsi
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173/comments/13
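(Worth noting for a Xen dom0: the option belongs on the dom0 kernel's module line in grub.conf, not on the xen.gz line itself -- the fragment below is illustrative:)

kernel /xen.gz dom0_mem=2048M,max:2048M
module /vmlinuz-3.18.44-20.el6.x86_64 ro root=/dev/mapper/vg_root pci=nomsi
module /initramfs-3.18.44-20.el6.x86_64.img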
On 27 January 2017 at 18:21, Kevin Stange kevin@steadfast.net wrote:
<snip>
On 01/30/2017 03:18 AM, Jinesh Choksi wrote:
Are there other kernel options that might be useful to try?
pci=nomsi
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173/comments/13
Incidentally, I had already found that one, and I'm currently trying it on one of the boxes. So far there have been no issues, but it's only been since Friday.
Also, I found this:
https://xen.crc.id.au/support/guides/install/
There's a 4.4 kernel here built for Xen Dom0, which I'm giving a whirl to see how stable it is, also only since Friday. I'm not using anything else he's packaged from his repo.
On a related note, does the SIG have plans to replace the 3.18 kernel which is marked as projected EOL of January 2017 (https://www.kernel.org/category/releases.html)?
On 01/30/2017 12:59 PM, Kevin Stange wrote:
<snip>
On a related note, does the SIG have plans to replace the 3.18 kernel which is marked as projected EOL of January 2017 (https://www.kernel.org/category/releases.html)?
I am currently working on a 4.4 kernel as a replacement for the 3.18 kernel. I have it working well on el7, but not yet working well on el6. I hope to have something to release for testing in the first two weeks of February.
On 01/30/2017 02:15 PM, Johnny Hughes wrote:
<snip>
What kind of issues are you having with 4.4? Since I'm testing that "Xen Made Easy" build of 4.4, are there any things I should watch out for? Might be worth looking at what he did for his builds to see if that helps get yours working better.
http://au1.mirror.crc.id.au/repo/el6/SRPM/
On 28/01/17 05:21, Kevin Stange wrote:
<snip>
May I chip in here? In our environment we're randomly seeing:
Jan 17 23:40:14 xen01 kernel: ixgbe 0000:04:00.1 eth6: Detected Tx Unit Hang
Jan 17 23:40:14 xen01 kernel:   Tx Queue             <0>
Jan 17 23:40:14 xen01 kernel:   TDH, TDT             <9a>, <127>
Jan 17 23:40:14 xen01 kernel:   next_to_use          <127>
Jan 17 23:40:14 xen01 kernel:   next_to_clean        <98>
Jan 17 23:40:14 xen01 kernel: ixgbe 0000:04:00.1 eth6: tx_buffer_info[next_to_clean]
Jan 17 23:40:14 xen01 kernel:   time_stamp           <218443db3>
Jan 17 23:40:14 xen01 kernel:   jiffies              <218445368>
Jan 17 23:40:14 xen01 kernel: ixgbe 0000:04:00.1 eth6: tx hang 1 detected on queue 0, resetting adapter
Jan 17 23:40:14 xen01 kernel: ixgbe 0000:04:00.1 eth6: Reset adapter
Jan 17 23:40:15 xen01 kernel: ixgbe 0000:04:00.1 eth6: PCIe transaction pending bit also did not clear.
Jan 17 23:40:15 xen01 kernel: ixgbe 0000:04:00.1: master disable timed out
Jan 17 23:40:15 xen01 kernel: bonding: bond1: link status down for interface eth6, disabling it in 200 ms.
Jan 17 23:40:15 xen01 kernel: bonding: bond1: link status definitely down for interface eth6, disabling it
[...] repeated every second or so.
Are the devices connected to the same network infrastructure?
There are two onboard NICs and two NICs on a dual-port card in each server. All devices connect to a cisco switch pair in VSS and the links are paired in LACP.
We've been experiencing ixgbe stability issues on CentOS 6.x with various 3.x kernels for years, with different ixgbe driver versions, and to date the only way to completely get rid of the issue was to switch from Intel to Broadcom. Just like in your case, the problem pops up randomly and the only reliable temporary fix is to reboot the affected Xen node. Another temporary fix that worked several times, but not always, was to migrate / shut down the domUs, deactivate the volume groups, log out of all the iSCSI targets, "ifdown bond1" and "modprobe -r ixgbe" followed by "ifup bond1".
The setup is:
- Intel Dual 10Gb Ethernet - either X520-T2 or X540-T2
- Tried Xen kernels from both xen.crc.id.au and CentOS 6 Xen repos
- LACP bonding to connect to the NFS & iSCSI storage using Brocade VDX6740T fabric, MTU=9000
<snip>
There "appears" to be some sort of load-dependent pattern here too, but it's impossible to confirm it. The only stability improvement I was able to use "dom0_max_vcpus=1 dom0_vcpus_pin". Haven't tried pci=nomsi yet.
<snip>
Adi Pircalabu
On 01/30/2017 04:17 PM, Adi Pircalabu wrote:
<snip>
May I chip in here? In our environment we're randomly seeing:
Welcome. It's a relief to know someone else has been having a similar nightmare! Perhaps that's not encouraging...
<snip>
You said 3.x kernels specifically. The kernel on Xen Made Easy now is a 4.4 kernel. Any chance you have tested with that one?
Did you ever try without MTU=9000 (default 1500 instead)?
I am having certain issues on certain hardware where there's no shutting down the affected NICs: trying to do so, or to unload the igb module, hangs the entire box. But in that case they're throwing AER errors instead of just unit hangs:
pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0000
igb 0000:04:00.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0401(Requester ID)
igb 0000:04:00.1: device [8086:10a7] error status/mask=00004000/00000000
igb 0000:04:00.1: [14] Completion Timeout (First)
igb 0000:04:00.1: broadcast error_detected message
igb 0000:04:00.1: broadcast slot_reset message
igb 0000:04:00.1: broadcast resume message
igb 0000:04:00.1: AER: Device recovery successful
Spammed continuously.
Switching to Broadcom would be a possibility, though it's tricky because two of the NICs are onboard, so we'd need to replace the dual-port 1G card with a quad-port 1G card. Since you're saying you're all 10G, maybe you don't know, but if you have any specific Broadcom 1G cards you've had good fortune with, I'd be interested in knowing which models. Broadcom cards are rarely labeled as such which makes finding them a bit more difficult than Intel ones.
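A quick way to see exactly which chip and driver each port uses when comparing candidate cards (a sketch; eth0 is illustrative):

lspci -nn | grep -i ethernet   # chip names plus [vendor:device] IDs
ethtool -i eth0                # driver (igb/e1000e/bnx2/tg3), version, firmware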
<snip>
So far the one hypervisor with pci=nomsi has been quiet but that doesn't mean it's fixed. I need to give it 6 weeks or so. :)
Thanks for your input on the issue!
On 31/01/17 10:49, Kevin Stange wrote:
You said 3.x kernels specifically. The kernel on Xen Made Easy now is a 4.4 kernel. Any chance you have tested with that one?
Not yet, however the future Xen nodes we'll deploy will run CentOS 7 and Xen with kernel 4.4.
Did you ever try without MTU=9000 (default 1500 instead)?
Yes, also with all sorts of configuration combinations like LACP rate slow/fast, "options ixgbe LRO=0,0" and so on. No improvement.
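(For anyone repeating this: with Intel's out-of-tree ixgbe the module options go in a modprobe config, one value per port -- a sketch:)

# /etc/modprobe.d/ixgbe.conf
options ixgbe LRO=0,0
# then reload the module (drops the links) or reboot:
modprobe -r ixgbe && modprobe ixgbe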
<snip>
This is interesting. We've never had any problems with the 1Gb NICs, but we're only using 10Gb for the storage network. Could it be a common problem with either the adapters or the drivers that only manifests when running the Xen-enabled kernel?
Switching to Broadcom would be a possibility, though it's tricky because two of the NICs are onboard, so we'd need to replace the dual-port 1G card with a quad-port 1G card. Since you're saying you're all 10G, maybe you don't know, but if you have any specific Broadcom 1G cards you've had good fortune with, I'd be interested in knowing which models. Broadcom cards are rarely labeled as such which makes finding them a bit more difficult than Intel ones.
We've purchased a number of servers with Broadcom BCM957810A1008G, sold by Dell as QLogic 57810 dual 10Gb Base-T adapters, none of them going up & down like a yo-yo so far.
So far the one hypervisor with pci=nomsi has been quiet but that doesn't mean it's fixed. I need to give it 6 weeks or so. :)
It'd be more like 6-9 months for us, making it terrible to debug it :-/
Adi Pircalabu
On 01/30/2017 06:12 PM, Adi Pircalabu wrote:
<snip>
Not yet, however the future Xen nodes we'll deploy will run CentOS 7 and Xen with kernel 4.4.
I'll keep you (and others here) posted on my own experiences with that 4.4 build over the next few weeks to report on any issues. I'm hoping something happened between 3.18 and 4.4 that fixed underlying problems.
Did you ever try without MTU=9000 (default 1500 instead)?
Yes, also with all sorts of configuration combinations like LACP rate slow/fast, "options ixgbe LRO=0,0" and so on. No improvement.
Alright, I'll assume that probably won't help then. I tried it on one box which hasn't had the issue again yet, but that doesn't guarantee anything.
<snip>
Since I've never run the 3.18 kernel on a box of this type without running in a dom0 and since I can't reproduce this kind of issue without a fair amount of NIC load over a tremendous period of time, it's impossible to test if it's tied to Xen.
However, I know this hardware works well under 2.6.32-*.el6 and 3.10.0-*.el7 kernels without stability problems, as it did with 2.6.18-*.el5xen (Xen 3.4.4).
I suspect the above errors are actually due to something PCIe related, and I have a subset of boxes which are actually being impacted by two distinct problems with equivalent impact, which increases the likelihood that the boxes will die. Another set of boxes only ever sees the unit hangs which seem unrecoverable even unloading/reloading the driver. A third set has random recoverable unit hangs only. With so much diversity, it's even harder to pin any specific causes to the problems.
The fact we're both pushing NFS and iSCSI traffic over these links makes me wonder if there's something about that kind of traffic that increases the chances of causing these issues. When I put VM network traffic over the same NICs, they seem a lot less prone to failures, but also end up pushing less traffic in general.
<snip>
It'd be more like 6-9 months for us, making it terrible to debug it :-/
I had a bunch of these on relatively light VM load for 3 months for "burn in" with no issues but they've been pretty aggressively failing since I started to try to put real loads on them. Still, it's odd because some of the boxes with identical hardware and similar VM loads have not yet blown up after 3 or more weeks, and maybe they won't for several months.
On 01/30/2017 06:41 PM, Kevin Stange wrote:
<snip>
I was able to discover something new. It might not conclusively prove anything, but it at least seems to rule out the pci=nomsi kernel option as being effective.
I had one server booted with that option as well as MTU 1500. It was stable for quite a long time, so I decided to try turning the MTU back to 9000 and within 12 hours, the interface on the expansion NIC with the jumbo MTU failed.
The other NIC in the LACP bundle is onboard and didn't fail. The other NIC on the dual-port expansion card also didn't fail. This leads me to believe that ONE of the bugs I'm experiencing is related to 82575EB + jumbo frames.
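(For reference, the MTU flip itself is nothing exotic; a sketch with hypothetical interface names, eth2 standing in for the expansion-card port:)

# runtime change
ip link set dev eth2 mtu 9000
# persistent across reboots, in /etc/sysconfig/network-scripts/ifcfg-eth2
MTU="9000"

Note that with bonded interfaces the MTU is normally set on the bond device and inherited by the slaves.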
I still think I'm also having a PCI-e issue that is separate from and on top of that one, but it hasn't reared its head recently, making it difficult to gather any new data.
One of the things I've done that seemed to help a lot with stability was to balance the LACP so that one onboard NIC and one expansion-card NIC are in each LAG. Previously we just had the first LAG onboard and the second on the expansion card. This way, at least, given the expansion NIC's propensity to fail first, I don't have to crash the server and all running VMs to recover.
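(To illustrate the rebalancing, a sketch of the EL6-style bonding config; interface names here are hypothetical, with eth0/eth1 onboard and eth2/eth3 on the expansion card:)

# /etc/sysconfig/network-scripts/ifcfg-bond0 -- one onboard + one expansion port per LAG
DEVICE=bond0
BONDING_OPTS="mode=802.3ad miimon=100 lacp_rate=slow"
ONBOOT=yes
BOOTPROTO=none

# ifcfg-eth0 (onboard) and ifcfg-eth2 (expansion) are both slaved to bond0;
# eth1 and eth3 would be slaved to bond1 the same way
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none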
I've seen absolutely no issues yet with the 4.4 kernel either, but I'm not willing to call that a win yet, because even the servers with no tweaks applied have been quiet.
I will continue the story as I have more material! :)
Kevin Stange, good attempt. Jumbo frames are extremely important when hosting IaaS, not to mention for other providers who need such features for network-specific applications.
Xlord -----Original Message----- From: CentOS-virt [mailto:centos-virt-bounces@centos.org] On Behalf Of Kevin Stange Sent: Saturday, February 11, 2017 3:30 AM To: centos-virt@centos.org Subject: Re: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18
[...]
On 11/02/17 06:29, Kevin Stange wrote:
[...]
Thanks for the heads-up Kevin, appreciated. One thing I need to clarify, though: what kernel was this machine running at the time?
Adi Pircalabu
On 02/12/2017 05:07 PM, Adi Pircalabu wrote:
[...]
Thanks for the heads-up Kevin, appreciated. One thing I need to clarify, though: what kernel was this machine running at the time?
Kernel running at the time was the Virt SIG's 3.18.44-20 kernel.
As a further note, within an additional 24 hours, the onboard Intel 82576 that was switched to enable jumbo frames also failed and we had to reboot the server. The expansion and onboard ports without jumbo frames did not fail. Since reboot, it's on the 4.4.47 kernel from Xen Made Easy now with jumbo frames and has not exhibited issues since Friday.
Kevin Stange, sounds interesting.
Xlord
-----Original Message----- From: CentOS-virt [mailto:centos-virt-bounces@centos.org] On Behalf Of Kevin Stange Sent: Tuesday, February 14, 2017 2:09 AM To: Adi Pircalabu adi@ddns.com.au; Discussion about the virtualization on CentOS centos-virt@centos.org Subject: Re: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18
[...]
On 14-02-2017 5:08, Kevin Stange wrote:
[...]
As a further note, within an additional 24 hours, the onboard Intel 82576 that was switched to enable jumbo frames also failed and we had to reboot the server. The expansion and onboard ports without jumbo frames did not fail. Since reboot, it's on the 4.4.47 kernel from Xen Made Easy now with jumbo frames and has not exhibited issues since Friday.
Fingers crossed for 4.4.47 kernel, thanks and keep us posted :) Cheers,
--- Adi Pircalabu
On 30 January 2017 at 22:17, Adi Pircalabu adi@ddns.com.au wrote:
May I chip in here? In our environment we're randomly seeing:
Jan 17 23:40:14 xen01 kernel: ixgbe 0000:04:00.1 eth6: Detected Tx Unit Hang
Someone in this thread: https://sourceforge.net/p/e1000/bugs/530/#2855 reported that *"With these kernels I was only able to work around the issue by disabling tx-checksumming offload with ethtool."*
However, that was reported for Kernels 4.2.6 / 4.2.8 / 4.4.8 and 4.4.10. I just thought it could be something you could rule out and hence mentioned it:
ethtool --offload eth6 rx off tx off
Another thing to rule out, in case it's a regression with Intel NICs and TSO:
# tso => tcp-segmentation-offload
# gso => generic-segmentation-offload
# gro => generic-receive-offload
# sg  => scatter-gather
# ufo => udp-fragmentation-offload (Cannot change)
# lro => large-receive-offload (Cannot change)
ethtool -K eth6 tso off gso off gro off sg off
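(If anyone wants those settings to persist across reboots on EL6, the initscripts accept multiple ';'-separated ethtool argument sets; an untested sketch, with eth6 as the example interface:)

# /etc/sysconfig/network-scripts/ifcfg-eth6
ETHTOOL_OPTS="-K eth6 tso off gso off gro off sg off; --offload eth6 rx off tx off"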
On 31/01/17 21:00, Jinesh Choksi wrote:
[...]
Nice, useful information. I've just disabled tx & rx checksumming on all the 10Gb interfaces on the affected servers to see how it goes. But as I said yesterday, in our environment it takes months to replicate.
Thanks,
Adi Pircalabu
On 01/23/2017 11:04 AM, Kevin Stange wrote:
[...]
Kevin,
Please try the 4.9.11-22 kernel that I just released for CentOS-6 (along with the newer linux-firmware packages and xfsprogs).
If you enable the xen-testing repository in your CentOS-Xen.repo file (assuming it is pointing to xen-44 and not xen-46) then a 'yum upgrade' should replace all the needed packages.
The actual path is here for the packages:
https://buildlogs.centos.org/centos/6/virt/x86_64/xen-44/
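(For anyone else who wants to test: the stanza to enable would look roughly like this. The exact section name and options in your CentOS-Xen.repo may differ, so treat this as a sketch rather than the canonical repo definition:)

[xen-testing]
name=CentOS-6 xen-44 testing packages
baseurl=https://buildlogs.centos.org/centos/6/virt/x86_64/xen-44/
enabled=1
gpgcheck=0

# then pull everything in:
yum clean all && yum upgrade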
Hopefully this helps.
Thanks, Johnny Hughes
On 02/21/2017 11:47 AM, Johnny Hughes wrote:
[...]
I should have said .. 'just released for testing' :)
I have been using this for 4 or 5 days with no issues in production, but it needs testing before final release :)
On 02/21/2017 11:50 AM, Johnny Hughes wrote:
[...]
Currently I've moved most of my servers onto the 4.4 kernel from Xen Made Easy and they've been stable. I'm seeing indications of an issue with one of my 3.18 servers right now that required it to be rebooted, so I'm going to bring the 4.9 kernel up on that server to see how it does. It may take a few weeks or more to draw any conclusions.
On 02/21/2017 05:32 PM, Kevin Stange wrote:
[...]
Currently running 4.9.11 on a few servers and they've been working fine. No new issues have come up so far, anyway.
I still can't rest assured the NIC issue is fixed, but no 4.4 or 4.9 server has yet had a NIC issue, with some being up almost a full month. It looks promising! (I'm knocking on all the wood everywhere, though.)
On 02/24/2017 11:51 AM, Kevin Stange wrote:
[...]
I'm ready to call this conclusive. The problems I was having across the board seemed to be caused by something seriously broken in 3.18. Most of my servers are now on 4.9.13 or newer and everything has been working very well.
I'm not going to post any further updates unless something breaks. Thanks to everyone that provided tips and suggestions along the way.
On 03/16/2017 04:22 PM, Kevin Stange wrote:
[...]
Do you mind sharing what hardware you have been running the 4.9 kernel on, other than "Supermicro X9DRT, Dual Xeon E5-2650, 2x I350, 2x 82571EB" and "Supermicro X9DRD-iF/LF, Dual Xeon E5-2630, 2x I350, 2x 82575EB", if any? Are you using any SATA/SAS controllers?
Thanks, Sarah
On 03/25/2017 02:35 PM, Sarah Newman wrote:
[...]
We have no expansion cards installed except for the dual-port gigabit NICs. We're using the onboard SATA controller only for the local Dom0 OS, with iSCSI and NFS providing storage for VMs and images.
On 03/27/2017 04:03 PM, Kevin Stange wrote:
[...]
We've got some other motherboards as well; I think this list is exhaustive:
Supermicro X8DT3
Supermicro X8DT6
Supermicro X9DRD-iF/LF
Supermicro X9DRT
Supermicro X9SCL/X9SCM
These are -F variants, which means they include a BMC chip with a separate NIC. A few of the X8DT3 boards are the LN4 variant, which has 4 onboard NICs, so we did not use an expansion NIC in those.
On 28-03-2017 8:12, Kevin Stange wrote:
[...]
FYI, here's one of our machines, which just crapped itself earlier today without being subjected to any significant load. Before the crash it was running kernel-3.18.44-20.el6.x86_64; now it's on kernel-4.9.13-22.el6.x86_64:
- Dell PowerEdge R620
- 2 x Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz, 6 cores each
- Dual Intel 10-Gigabit X540-AT2 (rev 01)
Before the crash, both em interfaces (members of bond1, which connects to the storage network) had tx & rx checksumming turned off.
xen_commandline: dom0_mem=1536M,max:2048M dom0_max_vcpus=1 dom0_vcpus_pin cpuinfo com1=115200,8n1 console=com1,tty loglvl=all guest_loglvl=all
--- Adi Pircalabu