[CentOS] Network bond - one port goes down from time to time

Wed Mar 30 13:49:47 UTC 2016
Marcelo Ricardo Leitner <marcelo.leitner at gmail.com>

On 30-03-2016 06:46, Götz Reinicke - IT Koordinator wrote:
> On 29.03.16 at 13:57, Marcelo Ricardo Leitner wrote:
>> On 29-03-2016 03:46, Götz Reinicke - IT Koordinator wrote:
>>> On 28.03.16 at 16:23, Marcelo Ricardo Leitner wrote:
>>>> On 28-03-2016 06:27, Götz Reinicke wrote:
>>>>> Hi,
>>>>>
>>>>> maybe someone has an idea:
>>>>>
>>>>> We have three Supermicro servers with two 10Gb ports each, connected
>>>>> to 1Gb ports on a Cisco switch stack. All are on auto speed.
>>>>>
>>>>> I configured an LACP bond on both sides on all servers, first with
>>>>> Citrix XenServer.
>>>>>
>>>>> On one server eth0 goes down from time to time … sometimes within
>>>>> minutes, some days it stays up for a few hours.
>>>>>
>>>>> Two servers are fine; the bond is up for 24 days(!) now without any
>>>>> problem.
>>>>>
>>>>> Recently I installed CentOS 7.2 on the server in question and - bam -
>>>>> eth0 is going down from time to time …
>>>>>
>>>>> I checked patch cables, tried another switch port channel,
>>>>> reconfigured the ports, reinstalled the OS. Same behavior.
>>>>>
>>>>> And: We got a replacement server. Same behavior …. :)
>>>>>
>>>>> Currently the Cisco tech guys don't see a problem on the switch (which
>>>>> has been up for 3 years now with 10+ servers connected … no problem so
>>>>> far), and from the Citrix side I don't get many more hints.
>>>>>
>>>>> In the logs I just have a Nic Link is Down … Nic Link is Up. It is
>>>>> always eth0.
>>>>>
>>>>> Question:
>>>>>
>>>>> Any idea? One suggestion was to disable all power-saving features in
>>>>> the server BIOS. I have not done that yet.
>>>>>
>>>>> Is there any chance to set some sort of higher debug level for that
>>>>> nic/kernel/whatever to get some server os side feedback why the port
>>>>> goes down?
>>>>>
>>>>> Regards and thanks for any hint! . Götz
>>>>
>>>> If you are seeing NIC Link is Down as in:
>>>> [710442.668059] e1000e: enp0s25 NIC Link is Down
>>>> then the NIC lost its link and the bond is just protecting you, as you
>>>> probably didn't have any downtime due to that. IOW bonding is not the
>>>> issue.
>>>>
>>>> Which NIC do you have on those servers?
>>>
>>>
>>> The mainboard is a Supermicro X10DRI-T with Intel X540 Dual port
>>> 10GBase-T.
>>
>> Okay, it's probably using the ixgbe driver then.
>> You may consider testing a newer kernel and seeing how that goes,
>> before doing too much debugging.
>> You can install v4.5 using one of ELrepo's kernels at
>> http://elrepo.org/linux/kernel/el7/x86_64/RPMS/
>> http://elrepo.org/tiki/tiki-index.php
>> There are some changes between 7.2 and that kernel that are worth
>> testing.
>>
>> Or... enable ixgbe debug, module param debug=16, and send the dmesg log,
>> especially the lines around the event.
>
> Hm, could you give me a hint on how to enable that (at runtime) on
> CentOS 7.2? I can't figure that out.
>
> Would be nice. cheers . Götz

Ah, at runtime you can just use ethtool:
# ethtool -s eth0 msglvl 0xffff
when done, revert with:
# ethtool -s eth0 msglvl 0x7
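
To pull just the link events out of the kernel log once the message level
is raised, a small filter along these lines may help. This is only a
sketch: the function name link_events is made up for illustration, and it
assumes the driver logs lines containing "NIC Link is Up" or "NIC Link is
Down" with the interface name directly in front, as in the e1000e example
quoted earlier in this thread. Other drivers may word it differently.

```shell
#!/bin/sh
# Hypothetical helper (not part of ethtool or the kernel tools):
# read kernel log text on stdin and print only the link up/down
# lines that belong to the interface given as the first argument.
link_events() {
    iface="$1"
    # Match "<iface> NIC Link is Up/Down" or "<iface>: NIC Link is Up/Down",
    # with the name at line start or preceded by whitespace so that
    # "eth0" does not also match "xeth0".
    grep -E "(^|[[:space:]])${iface}:? NIC Link is (Up|Down)"
}
```

Typical use would be something like "dmesg | link_events eth0" after the
port has flapped; the lines just before each match in the full dmesg
output are the ones worth looking at.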

   Marcelo