[CentOS] Network bond - one port goes down from time to time

Wed Mar 30 09:46:05 UTC 2016
Götz Reinicke - IT Koordinator <goetz.reinicke at filmakademie.de>

On 29.03.16 at 13:57, Marcelo Ricardo Leitner wrote:
> On 29-03-2016 at 03:46, Götz Reinicke - IT Koordinator wrote:
>> On 28.03.16 at 16:23, Marcelo Ricardo Leitner wrote:
>>> On 28-03-2016 at 06:27, Götz Reinicke wrote:
>>>> Hi,
>>>>
>>>> maybe someone has an idea:
>>>>
>>>> We have three Supermicro servers with two 10Gb ports each, connected
>>>> to 1Gb ports on a Cisco switch stack. All are set to auto speed.
>>>>
>>>> I configured an LACP bond on both sides for all servers, initially
>>>> with Citrix XenServer.
>>>>
>>>> On one server eth0 goes down from time to time … sometimes within
>>>> minutes, on other days it stays up for a few hours.
>>>>
>>>> Two servers are fine; the bond has been up for 24 days(!) now without
>>>> any problem.
>>>>
>>>> Recently I installed CentOS 7.2 on the server in question and - bam -
>>>> eth0 is going down from time to time …
>>>>
>>>> I checked the patch cables, tried another switch port channel,
>>>> reconfigured the ports, reinstalled the OS. Same behavior.
>>>>
>>>> And: We got a replacement server. Same behavior … :)
>>>>
>>>> Currently the Cisco tech guys don’t see a problem on the switch (which
>>>> has been up for 3 years now with 10+ servers connected … no problem so
>>>> far), and from the Citrix side I am not getting any more hints.
>>>>
>>>> In the logs I just have "NIC Link is Down" … "NIC Link is Up". It is
>>>> always eth0.
>>>>
>>>> Question:
>>>>
>>>> Any ideas? One suggestion was to disable all power saving features in
>>>> the server BIOS. I have not done that yet.
>>>>
>>>> Is there any way to set some sort of higher debug level for that
>>>> NIC/kernel/whatever to get some feedback on the server OS side about
>>>> why the port goes down?
>>>>
>>>> Regards and thanks for any hint! . Götz
>>>
>>> If you are seeing NIC Link is Down as in:
>>> [710442.668059] e1000e: enp0s25 NIC Link is Down
>>> then the NIC lost its link and the bond is just protecting you, as you
>>> probably didn't have any downtime because of it. IOW, bonding is not
>>> the issue.
>>>
>>> Which NIC do you have on those servers?
>>
>>
>> The mainboard is a Supermicro X10DRI-T with an Intel X540 dual-port
>> 10GBase-T.
> 
> Okay, it's probably using the ixgbe driver then.
> You may consider testing a newer kernel and seeing how that goes
> before doing too much debugging.
> You can install v4.5 using one of ELRepo's kernels at
> http://elrepo.org/linux/kernel/el7/x86_64/RPMS/
> http://elrepo.org/tiki/tiki-index.php
> There are some changes between 7.2 and that kernel that would be worth
> testing.
> 
> Or... enable ixgbe debug, module param debug=16, and send the dmesg log,
> especially the lines around the event.

Hm, could you give me a hint on how to enable that (at runtime) on
CentOS 7.2? I can't figure that out.

Would be nice. cheers . Götz
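
For the "send the dmesg log, especially the lines around the event" part,
a small filter can cut a long dmesg dump down to the link events and their
neighbours. The following is only a rough sketch: the function name, the
default of three context lines, and the assumption that every event line
contains "NIC Link is Down" or "NIC Link is Up" (as in the e1000e example
quoted above) are illustrative choices, not something taken from the
driver documentation.

```python
import re

# Assumption: link events look like the line quoted in this thread, e.g.
# "[710442.668059] e1000e: enp0s25 NIC Link is Down"
LINK_EVENT = re.compile(r"NIC Link is (Down|Up)", re.IGNORECASE)


def lines_around_link_events(log_text, context=3):
    """Return the log lines within `context` lines of any link up/down message.

    Overlapping windows are merged, and the original line order is kept.
    """
    lines = log_text.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if LINK_EVENT.search(line):
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            keep.update(range(start, end))
    return [lines[i] for i in sorted(keep)]
```

It could be fed the saved output of dmesg, for example by reading a file
into a string and printing each returned line. The context size is easy to
raise if the driver's debug output turns out to be chatty.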