[CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

Fri Jan 27 18:21:16 UTC 2017
Kevin Stange <kevin at steadfast.net>

On 01/27/2017 06:08 AM, Karel Hendrych wrote:
> Have you tried to eliminate all power management features all over?

I've been trying to find and disable all power management features, but
that has done relatively little to solve the problems.  Stabbing in the
dark, I've tried different ACPI settings, including disabling ACPI
completely, disabling CPU frequency scaling, and setting pcie_aspm=off
on the kernel command line.  Are there other kernel options that might
be useful to try?
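
For anyone wanting to double-check that options like these actually
reached the running kernel, here is a rough sketch that parses a kernel
command line and reports the values of a few named options.  The
function names and the default list of options are my own choices for
illustration.  Note that under Xen, /proc/cmdline in dom0 shows only the
dom0 kernel's options, not the hypervisor's.

```python
def parse_cmdline(text):
    """Split a kernel command line into {name: value}.

    Bare flags with no '=' (for example 'quiet') map to None.
    """
    options = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        options[name] = value if sep else None
    return options


def report(cmdline_text, wanted=("acpi", "pcie_aspm")):
    """Return the value of each wanted option, or '<not set>' if absent."""
    options = parse_cmdline(cmdline_text)
    return {name: options.get(name, "<not set>") for name in wanted}


def read_running_cmdline(path="/proc/cmdline"):
    """Read the command line of the currently running Linux kernel."""
    with open(path) as f:
        return f.read()
```

On a live system, print(report(read_running_cmdline())) would show
whether pcie_aspm and acpi are set as intended.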

> Are the devices connected to the same network infrastructure?

There are two onboard NICs and two NICs on a dual-port card in each
server.  All devices connect to a Cisco switch pair in VSS, and the
links are paired in LACP.

> There has to be something common.

The NICs having issues are running a native VLAN, a tagged VLAN, iSCSI
and NFS traffic, as well as some basic management stuff over SSH, and
they are configured with an MTU of 9000 on the native VLAN.  It's a lot
of features, but I can't really turn them off and then actually have
enough load on the NICs to reproduce the issue.  Several of these
servers were installed and burned in for 3 months without ever
having an issue, but suddenly collapsed when I tried to bring 20 or so
real-world VMs up on them.

The other NICs in the system that are connected don't exhibit issues and
run only VM network interfaces.  They are also in LACP and running VLAN
tags, but normal 1500 MTU.
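
To confirm which interfaces are really running jumbo frames and which
are at the normal 1500, something like the following sketch could be
used.  It reads the per-interface mtu file that Linux exposes under
/sys/class/net; the function names and the 1500 threshold parameter are
assumptions of mine for illustration.

```python
import os


def read_mtus(sysfs_net="/sys/class/net"):
    """Return {interface: mtu} for every entry that exposes an mtu file."""
    mtus = {}
    for iface in sorted(os.listdir(sysfs_net)):
        mtu_path = os.path.join(sysfs_net, iface, "mtu")
        try:
            with open(mtu_path) as f:
                mtus[iface] = int(f.read().strip())
        except (OSError, ValueError):
            # Skip entries that are not interfaces (e.g. bonding_masters)
            # or whose mtu file cannot be read or parsed.
            continue
    return mtus


def jumbo_interfaces(mtus, threshold=1500):
    """List the interfaces whose MTU is above the given threshold."""
    return sorted(name for name, mtu in mtus.items() if mtu > threshold)
```

Calling jumbo_interfaces(read_mtus()) on a host lists the interfaces
above 1500, which makes it easy to see whether the bond, its slaves, and
the VLAN interfaces on top of it all agree.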

So far it seems to correlate with NICs on the expansion cards, but that
may be a coincidence, since these cards are also the ones carrying the
storage and management traffic.  I'm trying to move some of this load to
the onboard NICs to see if the issues migrate with it or stay with the
expansion cards.

If the issue exists on both NIC types, then it rules out the specific
NIC chipset as the culprit.  It could point to the driver, but upgrading
it to a newer version did not help and actually appeared to make
everything worse.  This issue might actually have more to do with the
PCIe bridge than the NICs, but these are still different motherboards
with different PCIe bridges (5520 vs C600) experiencing the same issues.

> I've been using Intel NICs with Xen/CentOS for ages with no issues.

I figured that must be so.  Everyone uses Intel NICs.  If this were a
common issue, it would probably be causing a lot of people a lot of trouble.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
kevin at steadfast.net | www.steadfast.net