[CentOS] Network bond - one port goes down from time to time

Tue Mar 29 11:57:12 UTC 2016
Marcelo Ricardo Leitner <marcelo.leitner at gmail.com>

On 29-03-2016 03:46, Götz Reinicke - IT Koordinator wrote:
> On 28.03.16 at 16:23, Marcelo Ricardo Leitner wrote:
>> On 28-03-2016 06:27, Götz Reinicke wrote:
>>> Hi,
>>>
>>> maybe someone has an idea:
>>>
>>> We have three Supermicro servers with two 10Gb ports each, connected
>>> to 1Gb ports on a Cisco switch stack. All are on auto speed.
>>>
>>> I configured an LACP bond on both sides on all servers, first with
>>> Citrix XenServer.
>>>
>>> On one server eth0 goes down from time to time … sometimes within
>>> minutes, some days it stays up for a few hours.
>>>
>>> Two servers are fine; the bond has been up for 24 days(!) now without
>>> any problem.
>>>
>>> Recently I installed CentOS 7.2 on the server in question and - bam -
>>> eth0 is going down from time to time …
>>>
>>> I checked patch cables, tried another switch port channel,
>>> reconfigured the ports, reinstalled the OS. Same behavior.
>>>
>>> And: we got a replacement server. Same behavior … :)
>>>
>>> Currently the Cisco tech guys don’t see a problem on the switch (which
>>> has been up for 3 years now with 10+ servers connected … no problem so
>>> far), and from the Citrix side I don’t get many more hints.
>>>
>>> In the logs I just have "NIC Link is Down" … "NIC Link is Up". It is
>>> always eth0.
>>>
>>> Question:
>>>
>>> Any ideas? One suggestion was to disable all power-saving features in
>>> the server BIOS. I have not done that yet.
>>>
>>> Is there any chance to set some sort of higher debug level for that
>>> nic/kernel/whatever to get some server os side feedback why the port
>>> goes down?
>>>
>>> Regards and thanks for any hint! Götz
>>
>> If you are seeing NIC Link is Down as in:
>> [710442.668059] e1000e: enp0s25 NIC Link is Down
>> then the NIC itself lost its link and the bond is just protecting you,
>> as you probably didn't have any downtime because of it. IOW, bonding is
>> not the issue.
>>
>> Which NIC do you have on those servers?
>
>
> The mainboard is a Supermicro X10DRI-T with an Intel X540 dual-port 10GBase-T.

Okay, it's probably using the ixgbe driver then.
You may want to test a newer kernel and see how that goes before doing
too much debugging.
You can install v4.5 using one of ELRepo's kernels at
http://elrepo.org/linux/kernel/el7/x86_64/RPMS/
http://elrepo.org/tiki/tiki-index.php
There are some changes between 7.2 and that kernel that are worth
testing.

Or... enable ixgbe debug, module param debug=16, and send the dmesg log,
especially the lines around the event.
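
To pull "the lines around the event" out of a long dmesg capture, something like the following Python sketch may help. It assumes the log was saved with the default bracketed uptime timestamps, as in the e1000e line quoted earlier, and that ixgbe logs the same "NIC Link is Up/Down" wording. The function names and the context sizes are made up for illustration.

```python
import re

# Matches kernel log lines such as:
#   [710442.668059] e1000e: enp0s25 NIC Link is Down
# Assumption: default dmesg format with a bracketed uptime timestamp.
LINK_EVENT = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+.*NIC Link is (?P<state>Up|Down)"
)


def find_link_events(lines):
    """Return (line_index, timestamp, state) for each link up/down message."""
    events = []
    for index, line in enumerate(lines):
        match = LINK_EVENT.match(line.rstrip("\n"))
        if match:
            events.append((index, float(match.group("ts")), match.group("state")))
    return events


def context(lines, index, before=5, after=5):
    """Return the log lines surrounding line `index`, newlines stripped."""
    start = max(0, index - before)
    return [line.rstrip("\n") for line in lines[start:index + after + 1]]
```

With the log saved to a file (for example via `dmesg > dmesg.txt`), read it with `open("dmesg.txt").readlines()`, pass the list to `find_link_events`, and print `context(lines, index)` for each "Down" event to get the excerpt worth posting.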