On 11/02/17 06:29, Kevin Stange wrote:
> On 01/30/2017 06:41 PM, Kevin Stange wrote:
>> On 01/30/2017 06:12 PM, Adi Pircalabu wrote:
>>> On 31/01/17 10:49, Kevin Stange wrote:
>>>> You said 3.x kernels specifically. The kernel on Xen Made Easy now is a
>>>> 4.4 kernel. Any chance you have tested with that one?
>>>
>>> Not yet, however the future Xen nodes we'll deploy will run CentOS 7 and
>>> Xen with kernel 4.4.
>>
>> I'll keep you (and others here) posted on my own experiences with that
>> 4.4 build over the next few weeks to report on any issues. I'm hoping
>> something happened between 3.18 and 4.4 that fixed underlying problems.
>>
>>>> Did you ever try without MTU=9000 (default 1500 instead)?
>>>
>>> Yes, also with all sorts of configuration combinations like LACP rate
>>> slow/fast, "options ixgbe LRO=0,0" and so on. No improvement.
>>
>> Alright, I'll assume that probably won't help then. I tried it on one
>> box which hasn't had the issue again yet, but that doesn't guarantee
>> anything.
>
> I was able to discover something new, which might not conclusively prove
> anything, but it at least seems to rule out the pci=nomsi kernel option
> from being effective.
>
> I had one server booted with that option as well as MTU 1500. It was
> stable for quite a long time, so I decided to try turning the MTU back
> to 9000 and within 12 hours, the interface on the expansion NIC with the
> jumbo MTU failed.
>
> The other NIC in the LACP bundle is onboard and didn't fail. The other
> NIC on the dual-port expansion card also didn't fail. This leads me to
> believe that ONE of the bugs I'm experiencing is related to 82575EB +
> jumbo frames.
>
> I still think I'm also having a PCI-e issue that is separate and
> additional on top of that, and which has not reared its head recently,
> making it difficult for me to gather any new data.
>
> One of the things I've done that seemed to help a lot with stability was
> balance the LACP so that one NIC from onboard and one NIC from expansion
> card is in each LAG. Previously we just had the first LAG onboard and
> the second on the expansion card. This way, at least, given the
> expansion NIC's propensity toward failing first, I don't have to crash
> the server and all running VMs to recover.
>
> I've seen absolutely no issues yet with the 4.4 kernel either, but I am
> not willing to call that a win because of the quiet from even the
> servers on which no tweaks have been applied yet.

Thanks for the heads-up Kevin, appreciated. One thing I need to clarify,
though: what kernel was this machine running at the time?

Adi Pircalabu