[CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

On 02/21/2017 11:47 AM, Johnny Hughes wrote:
> On 01/23/2017 11:04 AM, Kevin Stange wrote:
>> I have three different types of CentOS 6 Xen 4.4 based hypervisors (by
>> hardware) that are experiencing stability issues which I haven't been
>> able to track down.  All three types seem to be having issues with NIC
>> and/or PCIe.  In most cases, the issues are unrecoverable and require a
>> hard boot to resolve.  All have Intel NICs.
>>
>> Often the systems will remain stable for days or weeks, then suddenly
>> encounter one of these issues.  I have yet to tie the error to any
>> specific action on the systems and can't reproduce it reliably.
>>
>> - Supermicro X8DT3, Dual Xeon E5620, 2x 82575EB NICs, 2x 82576 NICs
>>
>> Kernel messages upon failure:
>>
>> pcieport 0000:00:03.0: AER: Multiple Corrected error received: id=0018
>> pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, id=0018(Receiver ID)
>> pcieport 0000:00:03.0:   device [8086:340a] error status/mask=00002000/00001001
>> pcieport 0000:00:03.0:    [13] Advisory Non-Fatal
>> pcieport 0000:00:03.0:   Error of this Agent(0018) is reported first
>> igb 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0400(Receiver ID)
>> igb 0000:04:00.0:   device [8086:10a7] error status/mask=00002001/00002000
>> igb 0000:04:00.0:    [ 0] Receiver Error         (First)
>> igb 0000:04:00.1: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0401(Receiver ID)
>> igb 0000:04:00.1:   device [8086:10a7] error status/mask=00002001/00002000
>> igb 0000:04:00.1:    [ 0] Receiver Error         (First)
>>
>> This spams the console continuously until the system is hard booted.
>>
>> - Supermicro X9DRD-iF/LF, Dual Xeon E5-2630, 2x I350, 2x 82575EB
>>
>> igb 0000:82:00.0: Detected Tx Unit Hang
>>  Tx Queue             <1>
>>  TDH                  <43>
>>  TDT                  <50>
>>  next_to_use          <50>
>>  next_to_clean        <43>
>> buffer_info[next_to_clean]
>>  time_stamp           <12e6bc0b6>
>>  next_to_watch        <ffff880006aa7440>
>>  jiffies              <12e6bc8dc>
>>  desc.status          <1c8210>
>>
>> This spams the console continuously until the system is hard booted.
>>
>> - Supermicro X9DRT, Dual Xeon E5-2650, 2x I350, 2x 82571EB
>>
>> e1000e 0000:04:00.0 eth2: Detected Hardware Unit Hang:
>>   TDH                  <ff>
>>   TDT                  <33>
>>   next_to_use          <33>
>>   next_to_clean        <fd>
>> buffer_info[next_to_clean]:
>>   time_stamp           <138230862>
>>   next_to_watch        <ff>
>>   jiffies              <138231ac0>
>>   next_to_watch.status <0>
>> MAC Status             <80383>
>> PHY Status             <792d>
>> PHY 1000BASE-T Status  <3c00>
>> PHY Extended Status    <3000>
>> PCI Status             <10>
>>
>> On this type of system, the NIC recovers automatically and I don't
>> need to reboot.
>>
>> So far I have tried pcie_aspm=off to see if that would help, but it
>> appears that the 3.18 kernel already turns off ASPM by default on these
>> systems after probing the BIOS.  The change did not resolve the
>> stability issues.
>>
>> On the latter system type I also turned off all offload settings.
>> Stability appears to have improved slightly, but this didn't fully
>> resolve the problem.  I haven't adjusted offload settings on the first
>> two server types yet.
>>
>> I suspect this problem is related to the 3.18 kernel used by the virt
>> SIG, as we had these running Xen on CentOS 5's kernel with no issues for
>> years, and systems of these types used elsewhere in our facility are
>> stable under CentOS 6's standard kernel.  This affects more than one
>> server of each type, so I don't believe it is a failure of individual
>> hardware; if hardware is to blame, it would have to be a design flaw.
>>
>> Has anyone experienced similar issues with this configuration, and if
>> so, does anyone have tips on how to resolve the issues?
>>
> 
> 
> Kevin,
> 
> Please try the 4.9.11-22 kernel that I just released for CentOS-6 (along
> with the newer linux-firmware packages and xfsprogs).
> 
> If you enable the xen-testing repository in your CentOS-Xen.repo file
> (assuming it is pointing to xen-44 and not xen-46) then a 'yum upgrade'
> should replace all the needed packages.
> 
> The actual path is here for the packages:
> 
> https://buildlogs.centos.org/centos/6/virt/x86_64/xen-44/
> 
> Hopefully this helps.
> 

I should have said .. 'just released for testing' :)

I have been using this for 4 or 5 days with no issues in production, but
it needs testing before final release :)
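
For anyone comparing logs before and after the kernel change, here is a
minimal Python sketch for reading the numbers in the quoted dumps above.
It is not part of any CentOS or Xen tooling; the function names are
hypothetical. It assumes the status/mask pair is the PCIe AER Correctable
Error Status/Mask register pair, that a set mask bit suppresses reporting,
and that time_stamp and jiffies are hexadecimal jiffies counts. Converting
jiffies to seconds needs the kernel's HZ value, which is not stated in this
thread, so the sketch stops at jiffies.

```python
# Bit positions in the PCIe AER Correctable Error Status register.
AER_CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "Replay Number Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
}


def decode_aer(status_mask):
    """Decode a 'status/mask' hex pair such as '00002001/00002000'.

    Returns the names of all set status bits and of the subset that is
    not masked (assumption: a set mask bit suppresses reporting).
    """
    status_text, mask_text = status_mask.split("/")
    status = int(status_text, 16)
    mask = int(mask_text, 16)

    def names(value):
        return [
            AER_CORRECTABLE_BITS.get(bit, "bit %d" % bit)
            for bit in range(32)
            if (value >> bit) & 1
        ]

    return {"set": names(status), "reported": names(status & ~mask)}


def stalled_jiffies(time_stamp, jiffies):
    """Jiffies elapsed between the stuck buffer's time_stamp and the dump."""
    return int(jiffies, 16) - int(time_stamp, 16)


if __name__ == "__main__":
    # Values copied from the igb AER lines and the two hang dumps above.
    print(decode_aer("00002001/00002000"))
    print(stalled_jiffies("12e6bc0b6", "12e6bc8dc"))  # igb Tx Unit Hang
    print(stalled_jiffies("138230862", "138231ac0"))  # e1000e Unit Hang
```

With the igb values, only Receiver Error comes out as reported, which
matches the single "[ 0] Receiver Error" line in the log.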
