[CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

Sun Feb 12 23:07:07 UTC 2017
Adi Pircalabu <adi at ddns.com.au>

On 11/02/17 06:29, Kevin Stange wrote:
> On 01/30/2017 06:41 PM, Kevin Stange wrote:
>> On 01/30/2017 06:12 PM, Adi Pircalabu wrote:
>>> On 31/01/17 10:49, Kevin Stange wrote:
>>>> You said 3.x kernels specifically. The kernel on Xen Made Easy now is a
>>>> 4.4 kernel.  Any chance you have tested with that one?
>>>
>>> Not yet, however the future Xen nodes we'll deploy will run CentOS 7 and
>>> Xen with kernel 4.4.
>>
>> I'll keep you (and others here) posted on my own experiences with that
>> 4.4 build over the next few weeks to report on any issues.  I'm hoping
>> something happened between 3.18 and 4.4 that fixed underlying problems.
>>
>>>> Did you ever try without MTU=9000 (default 1500 instead)?
>>>
>>> Yes, also with all sorts of configuration combinations like LACP rate
>>> slow/fast, "options ixgbe LRO=0,0" and so on. No improvement.
>>
>> Alright, I'll assume that probably won't help then.  I tried it on one
>> box which hasn't had the issue again yet, but that doesn't guarantee
>> anything.
> 
> I was able to discover something new, which might not conclusively prove
> anything, but it at least seems to rule out the pci=nomsi kernel option
> as an effective fix.
> 
> I had one server booted with that option as well as MTU 1500.  It was
> stable for quite a long time, so I decided to try turning the MTU back
> to 9000 and within 12 hours, the interface on the expansion NIC with the
> jumbo MTU failed.
> 
> The other NIC in the LACP bundle is onboard and didn't fail.  The other
> NIC on the dual-port expansion card also didn't fail.  This leads me to
> believe that ONE of the bugs I'm experiencing is related to 82575EB +
> jumbo frames.
> 
> I still think I'm also having a PCI-e issue that is separate and
> additional on top of that, and which has not reared its head recently,
> making it difficult for me to gather any new data.
> 
> One of the things I've done that seemed to help a lot with stability was
> balance the LACP so that one NIC from onboard and one NIC from expansion
> card is in each LAG.  Previously we just had the first LAG onboard and
> the second on the expansion card.  This way, at least, given the
> expansion NIC's propensity toward failing first, I don't have to crash
> the server and all running VMs to recover.
> 
> I've seen absolutely no issues yet with the 4.4 kernel either, but I am
> not willing to call that a win because even the servers on which no
> tweaks have been applied yet have been quiet lately.

Thanks for the heads-up, Kevin, appreciated. One thing I need to clarify,
though: what kernel was this machine running at the time?

Adi Pircalabu