[CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

Tue Feb 14 15:00:25 UTC 2017
-=X.L.O.R.D=- <xlord.sl at gmail.com>

Kevin Stange
Sounds interesting.

Xlord

-----Original Message-----
From: CentOS-virt [mailto:centos-virt-bounces at centos.org] On Behalf Of Kevin Stange
Sent: Tuesday, February 14, 2017 2:09 AM
To: Adi Pircalabu <adi at ddns.com.au>; Discussion about the virtualization on CentOS <centos-virt at centos.org>
Subject: Re: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

On 02/12/2017 05:07 PM, Adi Pircalabu wrote:
> On 11/02/17 06:29, Kevin Stange wrote:
>> On 01/30/2017 06:41 PM, Kevin Stange wrote:
>>> On 01/30/2017 06:12 PM, Adi Pircalabu wrote:
>>>> On 31/01/17 10:49, Kevin Stange wrote:
>>>>> You said 3.x kernels specifically. The kernel on Xen Made Easy now
>>>>> is a 4.4 kernel.  Any chance you have tested with that one?
>>>>
>>>> Not yet, however the future Xen nodes we'll deploy will run CentOS 
>>>> 7 and Xen with kernel 4.4.
>>>
>>> I'll keep you (and others here) posted on my own experiences with
>>> that 4.4 build over the next few weeks to report on any issues.  I'm
>>> hoping something happened between 3.18 and 4.4 that fixed underlying
>>> problems.
>>>
>>>>> Did you ever try without MTU=9000 (default 1500 instead)?
>>>>
>>>> Yes, also with all sorts of configuration combinations like LACP 
>>>> rate slow/fast, "options ixgbe LRO=0,0" and so on. No improvement.
>>>
>>> Alright, I'll assume that probably won't help then.  I tried it on 
>>> one box which hasn't had the issue again yet, but that doesn't 
>>> guarantee anything.
>>
>> I was able to discover something new, which might not conclusively 
>> prove anything, but it at least seems to rule out the pci=nomsi 
>> kernel option from being effective.
>>
>> I had one server booted with that option as well as MTU 1500.  It was 
>> stable for quite a long time, so I decided to try turning the MTU 
>> back to 9000 and within 12 hours, the interface on the expansion NIC 
>> with the jumbo MTU failed.
>>
>> The other NIC in the LACP bundle is onboard and didn't fail.  The 
>> other NIC on the dual-port expansion card also didn't fail.  This 
>> leads me to believe that ONE of the bugs I'm experiencing is related 
>> to 82575EB + jumbo frames.
>>
>> I still think I'm also having a PCI-e issue that is separate and 
>> additional on top of that, and which has not reared its head 
>> recently, making it difficult for me to gather any new data.
>>
>> One of the things I've done that seemed to help a lot with stability 
>> was balance the LACP so that one NIC from onboard and one NIC from 
>> expansion card is in each LAG.  Previously we just had the first LAG 
>> onboard and the second on the expansion card.  This way, at least, 
>> given the expansion NIC's propensity toward failing first, I don't 
>> have to crash the server and all running VMs to recover.
>>
>> I've seen absolutely no issues yet with the 4.4 kernel either, but I 
>> am not willing to call that a win because of the quiet from even the 
>> servers on which no tweaks have been applied yet.
> 
> Thanks for the heads-up Kevin, appreciated. One thing I need to
> clarify, though: what kernel was this machine running at the time?

Kernel running at the time was the Virt SIG's 3.18.44-20 kernel.

As a further note, within an additional 24 hours, the onboard Intel
82576 port that had been switched to jumbo frames also failed, and we had
to reboot the server.  The expansion and onboard ports without jumbo frames
did not fail.  Since the reboot, the server has been running the 4.4.47
kernel from Xen Made Easy with jumbo frames enabled and has not exhibited
issues since Friday.
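
For anyone wanting to track which ports have jumbo frames enabled while
waiting for a failure to recur, here is a minimal sketch that reads the MTU
and link state the kernel exposes under /sys/class/net.  The function names
and the 1500-byte threshold are my own assumptions for illustration, and
operstate only reflects what sysfs reports, so it may not catch the kind of
failure described above where a port stops passing traffic.

```python
from pathlib import Path

# Assumption: any MTU above the 1500-byte Ethernet default counts as "jumbo".
JUMBO_THRESHOLD = 1500


def read_nic_state(name, sysfs_root="/sys/class/net"):
    """Return the MTU and operstate sysfs reports for one interface."""
    base = Path(sysfs_root) / name
    mtu = int((base / "mtu").read_text().strip())
    operstate = (base / "operstate").read_text().strip()
    return {
        "name": name,
        "mtu": mtu,
        "operstate": operstate,
        "jumbo": mtu > JUMBO_THRESHOLD,
    }


def list_nics(sysfs_root="/sys/class/net"):
    """List interface names found under sysfs_root (empty if it is missing)."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if (p / "mtu").exists())


if __name__ == "__main__":
    # Print one line per interface so jumbo and non-jumbo ports can be compared.
    for nic in list_nics():
        state = read_nic_state(nic)
        flag = "jumbo" if state["jumbo"] else "standard"
        print(f"{state['name']}: mtu={state['mtu']} ({flag}) state={state['operstate']}")
```

Run from cron or a loop, the output could be logged alongside timestamps to
see whether only the jumbo-MTU ports change state before a failure.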

--
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
kevin at steadfast.net | www.steadfast.net