[CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

Tue Jan 24 17:16:52 UTC 2017
Kevin Stange <kevin at steadfast.net>

On 01/24/2017 09:10 AM, Konrad Rzeszutek Wilk wrote:
> On Tue, Jan 24, 2017 at 09:29:39PM +0800, -=X.L.O.R.D=- wrote:
>> Kevin Stange,
>> It could be the kernel; otherwise, try updating the NIC driver or the
>> firmware of the NIC card. Hope that helps!
>>
>> Xlord
>> -----Original Message-----
>> From: CentOS-virt [mailto:centos-virt-bounces at centos.org] On Behalf Of Kevin
>> Stange
>> Sent: Tuesday, January 24, 2017 1:04 AM
>> To: centos-virt at centos.org
>> Subject: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 /
>> Linux 3.18
>>
<snip>
>>
>> Has anyone experienced similar issues with this configuration, and if so,
>> does anyone have tips on how to resolve the issues?
> 
> Honestly, I would email Intel and see if they can help. This looks like
> the NIC decides something is wrong, throws a PCIe error, and
> then resets itself.

This happens for several different NICs.  Is there a good contact at
Intel for this kind of thing, or should I just try to reach them through
their web site?

> It could also be an error in the Linux stack which would "eat" an
> interrupt when migrating interrupts (which was fixed
> upstream, see below). Are you running irqbalance? Could you try
> turning it off?

irqbalance is enabled on these servers.  I'll try disabling it.
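
To judge whether turning irqbalance off changes anything, it may help to
watch how a NIC's interrupts are spread across CPUs before and after.  The
following is only a minimal sketch in Python, assuming the usual
/proc/interrupts layout (a header row of CPU names, then one row per IRQ).
The interface name "eth0" and both helper function names are hypothetical
and not taken from these servers.

```python
import os


def irq_counts(text, name):
    """Return {irq: [per-CPU counts]} for /proc/interrupts rows mentioning name."""
    lines = text.splitlines()
    if not lines:
        return {}
    # The header row lists one column per CPU (CPU0, CPU1, ...).
    ncpu = len(lines[0].split())
    counts = {}
    for line in lines[1:]:
        if name not in line:
            continue
        irq, _, rest = line.partition(":")
        fields = rest.split()[:ncpu]
        counts[irq.strip()] = [int(f) for f in fields if f.isdigit()]
    return counts


def irq_delta(before, after):
    """Per-CPU increase in interrupt counts between two snapshots."""
    delta = {}
    for irq, now in after.items():
        old = before.get(irq, [0] * len(now))
        delta[irq] = [n - o for n, o in zip(now, old)]
    return delta


# "eth0" is a placeholder; substitute the interface under test.
if __name__ == "__main__" and os.path.exists("/proc/interrupts"):
    with open("/proc/interrupts") as f:
        print(irq_counts(f.read(), "eth0"))
```

Taking two snapshots a few seconds apart and passing them to irq_delta
shows which CPUs are actually servicing the NIC.  With irqbalance stopped,
the busy CPU for a given IRQ should stop moving around.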

> Did you have these issues with an earlier kernel?

The last kernel these boxes ran was 2.6.18-412.el5xen under CentOS 5, and
they were very stable.  However, the differences between 2.6.18 and 3.18
are immense, especially in features like ASPM and other power
management code.  We've run into ASPM issues on systems before when going
from CentOS 5 to the CentOS 6 kernel 2.6.32, but not on this particular
hardware, which is why my first thought was to look at ASPM.
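
As a quick way to see which ASPM policy a running kernel is using, here is
a small Python sketch.  It assumes the kernel exposes the policy at
/sys/module/pcie_aspm/parameters/policy, with the active entry shown in
square brackets; the function name is hypothetical.  If ASPM does turn out
to be the culprit, booting with pcie_aspm=off on the kernel command line is
one way to rule it out.

```python
import os
import re

# Assumed sysfs location of the ASPM policy on kernels built with PCIe ASPM.
POLICY_PATH = "/sys/module/pcie_aspm/parameters/policy"


def active_aspm_policy(text):
    """Return the bracketed (active) policy name, or None if none is marked."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


if __name__ == "__main__" and os.path.exists(POLICY_PATH):
    with open(POLICY_PATH) as f:
        print("Active ASPM policy:", active_aspm_policy(f.read()))
```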

They've all been upgraded to CentOS 6 and are running the virt SIG kernel
kernel-3.18.44-20.el6.x86_64.  I haven't run any previous versions of 3.18
or tried any other kernels.

It surprises me that we would have all these issues without there being a
more widespread problem, considering the hardware is fairly mainstream and
covers a lot of NIC chips.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
kevin at steadfast.net | www.steadfast.net