[CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18
Johnny Hughes
johnny at centos.org
Tue Feb 21 17:50:02 UTC 2017
On 02/21/2017 11:47 AM, Johnny Hughes wrote:
> On 01/23/2017 11:04 AM, Kevin Stange wrote:
>> I have three different types of CentOS 6 Xen 4.4 based hypervisors (by
>> hardware) that are experiencing stability issues which I haven't been
>> able to track down. All three types seem to be having issues with NIC
>> and/or PCIe. In most cases, the issues are unrecoverable and require a
>> hard boot to resolve. All have Intel NICs.
>>
>> Often the systems will remain stable for days or weeks, then suddenly
>> encounter one of these issues. I have yet to tie the error to any
>> specific action on the systems and can't reproduce it reliably.
>>
>> - Supermicro X8DT3, Dual Xeon E5620, 2x 82575EB NICs, 2x 82576 NICs
>>
>> Kernel messages upon failure:
>>
>> pcieport 0000:00:03.0: AER: Multiple Corrected error received: id=0018
>> pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, id=0018(Receiver ID)
>> pcieport 0000:00:03.0: device [8086:340a] error status/mask=00002000/00001001
>> pcieport 0000:00:03.0: [13] Advisory Non-Fatal
>> pcieport 0000:00:03.0: Error of this Agent(0018) is reported first
>> igb 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0400(Receiver ID)
>> igb 0000:04:00.0: device [8086:10a7] error status/mask=00002001/00002000
>> igb 0000:04:00.0: [ 0] Receiver Error (First)
>> igb 0000:04:00.1: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0401(Receiver ID)
>> igb 0000:04:00.1: device [8086:10a7] error status/mask=00002001/00002000
>> igb 0000:04:00.1: [ 0] Receiver Error (First)
>>
>> This spams to the console continuously until hard booting.
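
For anyone trying to read the "error status/mask" values in the AER messages quoted above, here is a rough Python sketch that splits such a field into named bits. The bit names are from the PCIe AER Correctable Error Status register as I remember it, and the table is not complete. The function name decode_aer and the idea that the kernel reports status bits that are not masked are my assumptions (they are consistent with the [13] and [ 0] lines in the log, but treat it as illustration only).

```python
# Partial bit table for the PCIe AER Correctable Error Status register.
# Assumption: written from memory of the spec; unknown bits are shown by number.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "Replay Num Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal",
}


def bit_names(value):
    """Return the names of all bits set in a 32-bit register value."""
    return [
        CORRECTABLE_BITS.get(bit, "bit %d" % bit)
        for bit in range(32)
        if (value >> bit) & 1
    ]


def decode_aer(field):
    """Decode a 'status/mask' string of two hex values (hypothetical helper)."""
    status_hex, mask_hex = field.split("/")
    status = int(status_hex, 16)
    mask = int(mask_hex, 16)
    return {
        "set": bit_names(status),
        "masked": bit_names(mask),
        # Assumption: only bits that are set and not masked get reported.
        "reported": bit_names(status & ~mask),
    }


if __name__ == "__main__":
    # The two status/mask pairs that appear in the quoted log.
    for field in ("00002000/00001001", "00002001/00002000"):
        print(field, decode_aer(field))
```

Both values are corrected-error status, so this only helps with reading the log; it does not say anything about the cause.
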
>>
>> - Supermicro X9DRD-iF/LF, Dual Xeon E5-2630, 2x I350, 2x 82575EB
>>
>> igb 0000:82:00.0: Detected Tx Unit Hang
>> Tx Queue <1>
>> TDH <43>
>> TDT <50>
>> next_to_use <50>
>> next_to_clean <43>
>> buffer_info[next_to_clean]
>> time_stamp <12e6bc0b6> next_to_watch <ffff880006aa7440>
>> jiffies <12e6bc8dc>
>> desc.status <1c8210>
>>
>> This spams to the console continuously until hard booting.
>>
>> - Supermicro X9DRT, Dual Xeon E5-2650, 2x I350, 2x 82571EB
>>
>> e1000e 0000:04:00.0 eth2: Detected Hardware Unit Hang:
>> TDH <ff>
>> TDT <33>
>> next_to_use <33>
>> next_to_clean <fd>
>> buffer_info[next_to_clean]:
>> time_stamp <138230862>
>> next_to_watch <ff>
>> jiffies <138231ac0>
>> next_to_watch.status <0>
>> MAC Status <80383>
>> PHY Status <792d>
>> PHY 1000BASE-T Status <3c00>
>> PHY Extended Status <3000>
>> PCI Status <10>
>>
>> On this type of system, the NIC recovers automatically and I don't need to
>> reboot.
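
The hang dumps quoted above can be compared more easily once the fields are numbers. Below is a small Python sketch that pulls the "name <value>" pairs out of such a dump and works out how long the oldest buffer has been waiting (jiffies minus time_stamp) and how many descriptors sit between TDH and TDT. It assumes every value is hexadecimal (values such as <ff> suggest that), the helper names are made up for this example, and ring_size is something you would have to look up on the host, since it is not in the dump.

```python
import re

# Matches "name <hexvalue>" pairs; a leading "> " quote prefix is skipped
# because a field name must start with a letter or underscore.
FIELD = re.compile(r"([A-Za-z_][\w\[\]. -]*?)[ \t]*<([0-9a-fA-F]+)>")


def parse_dump(text):
    """Return {field name: integer value}, assuming all values are hex."""
    return {name.strip(): int(value, 16) for name, value in FIELD.findall(text)}


def stall_jiffies(dump):
    """Jiffies elapsed since the stuck buffer was time-stamped."""
    return dump["jiffies"] - dump["time_stamp"]


def pending_descriptors(dump, ring_size):
    """Descriptors between head (TDH) and tail (TDT), allowing for wrap-around.

    ring_size is an assumption supplied by the caller; it is not in the dump.
    """
    return (dump["TDT"] - dump["TDH"]) % ring_size


if __name__ == "__main__":
    sample = """
    >> TDH <43>
    >> TDT <50>
    >> time_stamp <12e6bc0b6> next_to_watch <ffff880006aa7440>
    >> jiffies <12e6bc8dc>
    """
    dump = parse_dump(sample)
    print("stalled for", stall_jiffies(dump), "jiffies")
    print("pending:", pending_descriptors(dump, ring_size=256))
```

Turning jiffies into seconds needs the kernel's HZ value, which the dump does not give, so the sketch leaves the result in jiffies.
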
>>
>> So far I tried using pcie_aspm=off to see if that would help, but it
>> appears that the 3.18 kernel turns off ASPM by default on these due to
>> probing the BIOS. Stability issues were not resolved by the changes.
>>
>> On the latter system type I also turned off all offload settings. It
>> appears the stability increased slightly but it didn't fully resolve the
>> problem. I haven't adjusted offload settings on the first two server
>> types yet.
>>
>> I suspect this problem is related to the 3.18 kernel used by the virt
>> SIG, as we had these running Xen on CentOS 5's kernel with no issues for
>> years, and systems of these types used elsewhere in our facility are
>> stable under CentOS 6's standard kernel. This affects more than one
>> server of each type, so I don't believe it is a hardware failure, or
>> else it's a hardware design flaw.
>>
>> Has anyone experienced similar issues with this configuration, and if
>> so, does anyone have tips on how to resolve the issues?
>>
>
>
> Kevin,
>
> Please try the 4.9.11-22 kernel that I just released for CentOS-6 (along
> with the newer linux-firmware packages and xfsprogs).
>
> If you enable the xen-testing repository in your CentOS-Xen.repo file
> (assuming it is pointing to xen-44 and not xen-46) then a 'yum upgrade'
> should replace all the needed packages.
>
> The actual path is here for the packages:
>
> https://buildlogs.centos.org/centos/6/virt/x86_64/xen-44/
>
> Hopefully this helps.
>
I should have said 'just released for testing' :)
I have been using this for 4 or 5 days with no issues in production, but
it needs testing before final release :)