[CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

Tue Feb 21 17:47:46 UTC 2017
Johnny Hughes <johnny at centos.org>

On 01/23/2017 11:04 AM, Kevin Stange wrote:
> I have three different types of CentOS 6 Xen 4.4 based hypervisors (by
> hardware) that are experiencing stability issues which I haven't been
> able to track down.  All three types seem to be having issues with NIC
> and/or PCIe.  In most cases, the issues are unrecoverable and require a
> hard boot to resolve.  All have Intel NICs.
> 
> Often the systems will remain stable for days or weeks, then suddenly
> encounter one of these issues.  I have yet to tie the error to any
> specific action on the systems and can't reproduce it reliably.
> 
> - Supermicro X8DT3, Dual Xeon E5620, 2x 82575EB NICs, 2x 82576 NICs
> 
> Kernel messages upon failure:
> 
> pcieport 0000:00:03.0: AER: Multiple Corrected error received: id=0018
> pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected,
> type=Transaction Layer, id=0018(Receiver ID)
> pcieport 0000:00:03.0:   device [8086:340a] error
> status/mask=00002000/00001001
> pcieport 0000:00:03.0:    [13] Advisory Non-Fatal
> pcieport 0000:00:03.0:   Error of this Agent(0018) is reported first
> igb 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Physical
> Layer, id=0400(Receiver ID)
> igb 0000:04:00.0:   device [8086:10a7] error status/mask=00002001/00002000
> igb 0000:04:00.0:    [ 0] Receiver Error         (First)
> igb 0000:04:00.1: PCIe Bus Error: severity=Corrected, type=Physical
> Layer, id=0401(Receiver ID)
> igb 0000:04:00.1:   device [8086:10a7] error status/mask=00002001/00002000
> igb 0000:04:00.1:    [ 0] Receiver Error         (First)
> 
> This spams to the console continuously until hard booting.
> 
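
For reading the AER lines above, here is a minimal sketch that decodes the
"error status/mask" pairs into bit names. It assumes the values are the PCIe
AER Correctable Error Status and Mask registers (which matches
severity=Corrected) and uses the bit names from the PCI Express Base
Specification; the function name is made up for illustration.

```python
# Bit positions of the PCIe AER Correctable Error Status/Mask registers.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_correctable(field):
    """Decode a 'status/mask' hex pair such as '00002001/00002000'.

    Hypothetical helper: returns the names of the bits set in each half.
    """
    status_hex, mask_hex = field.split("/")
    status, mask = int(status_hex, 16), int(mask_hex, 16)

    def names(value):
        return [name for bit, name in sorted(CORRECTABLE_BITS.items())
                if value & (1 << bit)]

    return {"status": names(status), "mask": names(mask)}


if __name__ == "__main__":
    # Values copied from the quoted kernel messages.
    for label, field in [("pcieport 0000:00:03.0", "00002000/00001001"),
                         ("igb 0000:04:00.0", "00002001/00002000")]:
        print(label, decode_correctable(field))
```

Read this way, both igb functions report a Receiver Error plus an Advisory
Non-Fatal Error, and the root port reports only the Advisory Non-Fatal Error,
which agrees with the "[ 0]" and "[13]" lines the kernel printed.
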
> - Supermicro X9DRD-iF/LF, Dual Xeon E5-2630, 2x I350, 2x 82575EB
> 
> igb 0000:82:00.0: Detected Tx Unit Hang
>  Tx Queue             <1>
>  TDH                  <43>
>  TDT                  <50>
>  next_to_use          <50>
>  next_to_clean        <43>
> buffer_info[next_to_clean]
>  time_stamp           <12e6bc0b6> next_to_watch        <ffff880006aa7440>
>  jiffies              <12e6bc8dc>
>  desc.status          <1c8210>
> 
> This spams to the console continuously until hard booting.
> 
> - Supermicro X9DRT, Dual Xeon E5-2650, 2x I350, 2x 82571EB
> 
> e1000e 0000:04:00.0 eth2: Detected Hardware Unit Hang:
>   TDH                  <ff>
>   TDT                  <33>
>   next_to_use          <33>
>   next_to_clean        <fd>
> buffer_info[next_to_clean]:
>   time_stamp           <138230862>
>   next_to_watch        <ff>
>   jiffies              <138231ac0>
>   next_to_watch.status <0>
> MAC Status             <80383>
> PHY Status             <792d>
> PHY 1000BASE-T Status  <3c00>
> PHY Extended Status    <3000>
> PCI Status             <10>
> 
> On this type of system, the NIC recovers automatically and I don't need to
> reboot.
> 
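
To put numbers on the two hang dumps quoted above, a rough sketch follows. It
assumes the drivers print these fields in hex (values such as "ff" and "fd"
suggest so), a Tx ring of 256 descriptors, and a kernel built with HZ=1000;
none of those three is stated in the report, so adjust them to the real
settings. The function name is hypothetical.

```python
def hang_summary(tdh, tdt, time_stamp, jiffies, ring_size=256, hz=1000):
    """Summarise a Tx hang dump.

    ring_size and hz are assumptions, not values taken from the report.
    Returns (descriptors pending between head and tail,
             jiffies since the oldest buffer was queued,
             the same interval in seconds).
    """
    head, tail = int(tdh, 16), int(tdt, 16)
    # The ring wraps, so take the distance modulo the ring size.
    pending = (tail - head) % ring_size
    stalled = int(jiffies, 16) - int(time_stamp, 16)
    return pending, stalled, stalled / hz


if __name__ == "__main__":
    # igb 0000:82:00.0 (X9DRD-iF/LF)
    print(hang_summary("43", "50", "12e6bc0b6", "12e6bc8dc"))
    # e1000e 0000:04:00.0 eth2 (X9DRT)
    print(hang_summary("ff", "33", "138230862", "138231ac0"))
```

Under those assumptions the igb queue has 13 descriptors outstanding that have
been waiting about 2 seconds, and the e1000e ring has 52 outstanding after
about 4.7 seconds.
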
> So far I tried using pcie_aspm=off to see if that would help, but it
> appears that the 3.18 kernel turns off ASPM by default on these due to
> probing the BIOS.  Stability issues were not resolved by the changes.
> 
> On the latter system type I also turned off all offload settings.  It
> appears the stability increased slightly but it didn't fully resolve the
> problem.  I haven't adjusted offload settings on the first two server
> types yet.
> 
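
For repeating the offload test on the other two server types, a small sketch
that builds the ethtool command is below. The feature list is an assumption
(the report does not say which offloads were turned off), the function names
are made up, and it needs Python 3 for subprocess.run. Some drivers refuse to
change certain features, so the list may need trimming.

```python
import subprocess

# Assumed set of offloads to disable; not taken from the original report.
OFFLOADS = ("rx", "tx", "sg", "tso", "gso", "gro", "lro")


def offload_off_command(interface, features=OFFLOADS):
    """Build an 'ethtool -K <iface> <feature> off ...' argument list."""
    cmd = ["ethtool", "-K", interface]
    for feature in features:
        cmd += [feature, "off"]
    return cmd


def disable_offloads(interface, dry_run=True):
    """Print the command (dry run) or run it; returns the argument list."""
    cmd = offload_off_command(interface)
    if dry_run:
        print(" ".join(cmd))
    else:
        # Raises CalledProcessError if ethtool rejects a feature.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    disable_offloads("eth2")  # eth2 is the e1000e port named in the log
```

'ethtool -k eth2' (lower-case k) shows the current state, which is a useful
check before and after the change.
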
> I suspect this problem is related to the 3.18 kernel used by the virt
> SIG, as we had these running Xen on CentOS 5's kernel with no issues for
> years, and systems of these types used elsewhere in our facility are
> stable under CentOS 6's standard kernel.  This affects more than one
> server of each type, so I don't believe it is a hardware failure, or
> else it's a hardware design flaw.
> 
> Has anyone experienced similar issues with this configuration, and if
> so, does anyone have tips on how to resolve the issues?
> 


Kevin,

Please try the 4.9.11-22 kernel that I just released for CentOS-6 (along
with the newer linux-firmware packages and xfsprogs).

If you enable the xen-testing repository in your CentOS-Xen.repo file
(assuming it is pointing to xen-44 and not xen-46) then a 'yum upgrade'
should replace all the needed packages.

The actual path for the packages is here:

https://buildlogs.centos.org/centos/6/virt/x86_64/xen-44/

Hopefully this helps.

Thanks,
Johnny Hughes

