Re: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

24 Jan 2017

      On Tue, Jan 24, 2017 at 09:29:39PM +0800, -=X.L.O.R.D=- wrote:
...
Kevin Stange,
The cause could be the kernel itself; otherwise, try updating the NIC driver
or the firmware of the NIC card. Hope that helps!
Xlord
-----Original Message-----
From: CentOS-virt [mailto:centos-virt-bounces@centos.org] On Behalf Of Kevin
Stange
Sent: Tuesday, January 24, 2017 1:04 AM
To: centos-virt@centos.org
Subject: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 /
Linux 3.18
I have three different types of CentOS 6 Xen 4.4 based hypervisors (by
hardware) that are experiencing stability issues which I haven't been able
to track down.  All three types seem to be having issues with NIC and/or
PCIe.  In most cases, the issues are unrecoverable and require a hard boot
to resolve.  All have Intel NICs.
Often the systems will remain stable for days or weeks, then suddenly
encounter one of these issues.  I have yet to tie the error to any specific
action on the systems and can't reproduce it reliably.

Supermicro X8DT3, Dual Xeon E5620, 2x 82575EB NICs, 2x 82576 NICs

Kernel messages upon failure:
pcieport 0000:00:03.0: AER: Multiple Corrected error received: id=0018
pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, id=0018(Receiver ID)
pcieport 0000:00:03.0:   device [8086:340a] error status/mask=00002000/00001001
pcieport 0000:00:03.0:    [13] Advisory Non-Fatal
pcieport 0000:00:03.0:   Error of this Agent(0018) is reported first
igb 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0400(Receiver ID)
igb 0000:04:00.0:   device [8086:10a7] error status/mask=00002001/00002000
igb 0000:04:00.0:    [ 0] Receiver Error         (First)
igb 0000:04:00.1: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0401(Receiver ID)
igb 0000:04:00.1:   device [8086:10a7] error status/mask=00002001/00002000
igb 0000:04:00.1:    [ 0] Receiver Error         (First)
This spams to the console continuously until hard booting.
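
For readers following along, the status/mask pairs in the AER lines above can be decoded bit by bit. The following is a minimal Python sketch, not part of the original report: the function name decode_correctable is made up for illustration, and the bit names follow the Correctable Error Status register layout in the PCIe specification.

```python
# Bit positions of the PCIe AER Correctable Error Status register.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_correctable(status_mask):
    """Decode a 'status/mask' hex pair as printed in the kernel AER log.

    Returns a list of (bit, name, masked) tuples for every status bit
    that is set. 'masked' is True when the same bit is set in the mask,
    meaning the device is configured not to report that error.
    """
    status_hex, mask_hex = status_mask.split("/")
    status = int(status_hex, 16)
    mask = int(mask_hex, 16)
    decoded = []
    for bit, name in sorted(CORRECTABLE_BITS.items()):
        if status & (1 << bit):
            decoded.append((bit, name, bool(mask & (1 << bit))))
    return decoded


if __name__ == "__main__":
    # Values copied from the log above: root port, then the igb function.
    for pair in ("00002000/00001001", "00002001/00002000"):
        print(pair, decode_correctable(pair))
```

Read this way, the root port reports only an unmasked Advisory Non-Fatal error, while each igb function reports an unmasked Receiver Error plus a masked Advisory Non-Fatal bit, which is consistent with the kernel printing only "[ 0] Receiver Error" for the igb devices.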

Supermicro X9DRD-iF/LF, Dual Xeon E5-2630, 2x I350, 2x 82575EB

igb 0000:82:00.0: Detected Tx Unit Hang
 Tx Queue             <1>
 TDH                  <43>
 TDT                  <50>
 next_to_use          <50>
 next_to_clean        <43>
buffer_info[next_to_clean]
 time_stamp           <12e6bc0b6>
 next_to_watch        <ffff880006aa7440>
 jiffies              <12e6bc8dc>
 desc.status          <1c8210>
This spams to the console continuously until hard booting.

Supermicro X9DRT, Dual Xeon E5-2650, 2x I350, 2x 82571EB

e1000e 0000:04:00.0 eth2: Detected Hardware Unit Hang:
  TDH                  <ff>
  TDT                  <33>
  next_to_use          <33>
  next_to_clean        <fd>
buffer_info[next_to_clean]:
  time_stamp           <138230862>
  next_to_watch        <ff>
  jiffies              <138231ac0>
  next_to_watch.status <0>
MAC Status             <80383>
PHY Status             <792d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
On this type of system, the NIC recovers automatically and I don't need to
reboot.
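
As a reading aid for the two hang dumps above, the register values can be turned into "descriptors still pending" and "how long the oldest one has been stuck". This is a rough Python sketch under stated assumptions: the dump values are hexadecimal, the Tx ring has 256 descriptors, and the kernel runs at HZ=1000. Neither the ring size nor HZ is given in the report, and hang_summary is a made-up helper name.

```python
def hang_summary(tdh, tdt, time_stamp, jiffies, ring_size=256, hz=1000):
    """Summarise a Tx hang dump.

    tdh/tdt are the hardware head and tail indices of the Tx ring;
    time_stamp and jiffies are taken from the same dump.
    ring_size and hz are ASSUMPTIONS, not values from the report.
    Returns (pending_descriptors, stalled_jiffies, stalled_seconds).
    """
    # The tail may have wrapped past the end of the ring, hence the modulo.
    pending = (tdt - tdh) % ring_size
    stalled_jiffies = jiffies - time_stamp
    return pending, stalled_jiffies, stalled_jiffies / hz


if __name__ == "__main__":
    # igb dump (X9DRD-iF/LF): TDH <43>, TDT <50>
    print(hang_summary(0x43, 0x50, 0x12E6BC0B6, 0x12E6BC8DC))
    # e1000e dump (X9DRT): TDH <ff>, TDT <33>
    print(hang_summary(0xFF, 0x33, 0x138230862, 0x138231AC0))
```

Under those assumptions the igb queue had 13 descriptors outstanding for 2086 jiffies, and the e1000e queue had 52 outstanding for 4702 jiffies, so in both cases the hardware head had stopped advancing for on the order of seconds.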
So far I have tried using pcie_aspm=off to see whether that would help, but it
appears that the 3.18 kernel turns off ASPM by default on these due to probing
the BIOS.  Stability issues were not resolved by the change.
On the latter system type I also turned off all offload settings.  It
appears that stability increased slightly, but it didn't fully resolve the
problem.  I haven't adjusted offload settings on the first two server types
yet.
I suspect this problem is related to the 3.18 kernel used by the virt SIG,
as we had these running Xen on CentOS 5's kernel with no issues for years,
and systems of these types used elsewhere in our facility are stable under
CentOS 6's standard kernel.  This affects more than one server of each type,
so I don't believe it is a hardware failure, or else it's a hardware design
flaw.
Has anyone experienced similar issues with this configuration, and if so,
does anyone have tips on how to resolve the issues?

Honestly, I would email Intel and see if they can help. This looks like
the NIC decides something is wrong, throws a PCIe error, and
then resets itself.
It could also be an error in the Linux stack which would "eat" an
interrupt when migrating interrupts (which was fixed
upstream, see below). Are you running irqbalance? Could you try
turning it off?
Did you have these issues with an earlier kernel?
The fix was:
commit ff1e22e7a638a0782f54f81a6c9cb139aca2da35
Author: Boris Ostrovsky boris.ostrovsky@oracle.com
Date:   Fri Mar 18 10:11:07 2016 -0400
    xen/events: Mask a moving irq
and then there was a fix to this fix:
commit f0f393877c71ad227d36705d61d1e4062bc29cf5
Author: Ross Lagerwall ross.lagerwall@citrix.com
Date:   Tue May 10 16:11:00 2016 +0100
    xen/events: Don't move disabled irqs
...
--
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
kevin@steadfast.net | www.steadfast.net
_______________________________________________
CentOS-virt mailing list
CentOS-virt@centos.org
https://lists.centos.org/mailman/listinfo/centos-virt
