[CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

Thu Jan 26 20:08:12 UTC 2017
Kevin Stange <kevin at steadfast.net>

On 01/26/2017 09:35 AM, Johnny Hughes wrote:
> On 01/26/2017 09:32 AM, Johnny Hughes wrote:
>> On 01/25/2017 11:49 AM, Kevin Stange wrote:
>>> On 01/24/2017 11:16 AM, Kevin Stange wrote:
>>>> On 01/24/2017 09:10 AM, Konrad Rzeszutek Wilk wrote:
>>>>> On Tue, Jan 24, 2017 at 09:29:39PM +0800, -=X.L.O.R.D=- wrote:
>>>>>> Kevin Stange,
>>>>>> It could be the kernel, or you may need to update the NIC driver or
>>>>>> the firmware of the NIC card. Hope that helps!
>>>>>> Xlord
>>>>>> -----Original Message-----
>>>>>> From: CentOS-virt [mailto:centos-virt-bounces at centos.org] On Behalf Of Kevin
>>>>>> Stange
>>>>>> Sent: Tuesday, January 24, 2017 1:04 AM
>>>>>> To: centos-virt at centos.org
>>>>>> Subject: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 /
>>>>>> Linux 3.18
>>>> <snip>
>>>>>> Has anyone experienced similar issues with this configuration, and if so,
>>>>>> does anyone have tips on how to resolve the issues?
>>>>> Honestly, I would email Intel and see if they can help. This looks like
>>>>> the NIC decides something is wrong, throws a PCIe error, and
>>>>> then resets itself.
>>>> This happens for several different NICs.  Is there a good contact at
>>>> Intel for this kind of thing, or should I just try to reach them through
>>>> their web site?
>>>>> It could also be an error in the Linux stack which would "eat" an
>>>>> interrupt when migrating interrupts (which was fixed
>>>>> upstream, see below). Are you running irqbalance? Could you try
>>>>> turning it off?
>>>> irqbalance is enabled on these servers.  I'll try disabling it.
>>> I stopped irqbalance yesterday afternoon, but a hypervisor's NICs
>>> failed anyway early this morning, so I'm pretty sure this is not the
>>> right tree to bark up.
>> Here is a set of drivers/firmware from Intel for those NICs:
>> https://downloadcenter.intel.com/download/15817/Intel-Network-Adapter-Driver-for-PCI-E-Gigabit-Network-Connections-under-Linux-
>> I will see if I can get a CentOS-6 build of the latest version of that
>> from our older SRPM:
>> http://vault.centos.org/6.7/xen4/Source/SPackages/e1000e-2.5.4-
>> I am currently very busy with several c5, c6, c7 updates and the i686
>> altarch c7 tree .. but I have this on my list.  In the meantime, maybe
>> someone else could also see if those drivers help you (or you could try
>> to compile / install it).
>> Do you have another machine that you can use to see if you can duplicate
>> the issue NOT running the xen.gz hypervisor boot, but just the straight
>> kernel?

I can't actually reproduce this problem reliably.  It happens randomly
when the servers are up and running anywhere between a few hours and a
month or more, and I haven't been able to isolate any specific way to
cause it to happen.  As a result I can't really test different solutions
on different servers to see what helps.  I was hoping other people were
seeing it so that I could get some direction.  If I can reproduce it, it
won't take me very long to identify what the cause is.  Right now, if I
upgrade the drivers on the systems, I won't really know whether it's
fixed until several months have passed without another issue.

> Actually .. I think this is the driver for you:
> https://downloadcenter.intel.com/download/13663
> And this explains how to make it work:
> http://www.intel.com/content/www/us/en/support/network-and-i-o/ethernet-products/000005767.html

The different combinations of NICs overlap both the e1000e and igb
drivers, but the most egregious issues have been with the igb ones.
I'll try to give this a shot and report back if I still see issues with
a server after doing so, but it might be a week or two before I find out.
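
Since the NICs are split between e1000e and igb, it helps to record which
driver (and which driver version) each interface is bound to before and
after an upgrade. What follows is only a minimal sketch, assuming Python 3
and the standard Linux sysfs layout; the function name nic_drivers and its
sysfs_root parameter are made up for illustration and are not part of any
Intel or CentOS tooling.

```python
import os


def nic_drivers(sysfs_root="/sys"):
    """Map each device-backed network interface to (driver, version).

    Assumes the usual sysfs layout: <root>/class/net/<iface>/device/driver
    is a symlink to the bound driver, and <root>/module/<driver>/version
    holds the driver version when the module exports one.
    """
    result = {}
    net_dir = os.path.join(sysfs_root, "class", "net")
    for iface in sorted(os.listdir(net_dir)):
        driver_link = os.path.join(net_dir, iface, "device", "driver")
        if not os.path.islink(driver_link):
            # Virtual interfaces (lo, bridges, and so on) have no bound driver.
            continue
        driver = os.path.basename(os.readlink(driver_link))
        version_path = os.path.join(sysfs_root, "module", driver, "version")
        try:
            with open(version_path) as f:
                version = f.read().strip()
        except OSError:
            version = None  # the module does not export a version string
        result[iface] = (driver, version)
    return result


if __name__ == "__main__":
    for iface, (driver, version) in nic_drivers().items():
        print(f"{iface}: {driver} {version or '(version not exported)'}")
```

Running it once before installing the Intel driver and once after a reboot
should show whether the interfaces actually picked up the new module rather
than the in-kernel one.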

Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
kevin at steadfast.net | www.steadfast.net