Hi,
maybe someone has an idea:
We have three Supermicro servers with two 10Gb ports each, connected to 1Gb ports on a Cisco switch stack. All ports are set to auto speed.
I configured an LACP bond on both sides for all servers, first with Citrix XenServer.
On one server eth0 goes down from time to time … sometimes within minutes, sometimes it stays up for some hours.
The other two servers are fine; their bond has been up for 24 days(!) now without any problem.
Recently I installed CentOS 7.2 on the server in question and - bam - eth0 still goes down from time to time …
I checked the patch cables, tried another switch port channel, reconfigured the ports, reinstalled the OS. Same behavior.
And: we got a replacement server. Same behavior … :)
Currently the Cisco tech guys don't see a problem on the switch (which has been up for 3 years now with 10+ servers connected … no problem so far), and from the Citrix side I don't get many more hints.
In the logs I just have "NIC Link is Down" … "NIC Link is Up". It is always eth0.
Question:
Any idea? One suggestion was to disable all power saving features in the server BIOS. I have not done that yet.
Is there any chance to set some sort of higher debug level for that NIC/kernel/whatever to get some feedback on the server OS side about why the port goes down?
Regards and thanks for any hint! . Götz
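Since the 10Gb ports are negotiating against 1Gb switch ports, it may be worth confirming what each port actually negotiated before digging deeper. A minimal sketch, assuming the interface is called eth0 as in the logs; link_summary is a hypothetical helper name, not an existing tool, and it only filters the usual ethtool output:

```shell
# Reduce "ethtool <iface>" output (read from stdin) to the link-related fields.
# link_summary is a hypothetical helper name used only for this sketch.
link_summary() {
    awk -F': ' '
        /^[ \t]*(Speed|Duplex|Auto-negotiation|Link detected):/ {
            key = $1
            sub(/^[ \t]+/, "", key)   # drop the leading indentation
            print key "=" $2
        }'
}

# Usage (on the affected server):
#   ethtool eth0 | link_summary
```

Comparing the result on the flapping server with one of the two healthy servers might show whether the negotiated speed or duplex differs.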
On 28.03.2016 at 11:27, Götz Reinicke goetz.reinicke@filmakademie.de wrote:
I configured an LACP bond on both sides for all servers, first with Citrix XenServer.
How exactly is your interface configured?
-- LF
On 28.03.16 at 12:12, Leon Fauster wrote:
How exactly is your interface configured?
# Interface type set to bond
TYPE=Bond
BOOTPROTO=static
BONDING_MASTER=yes
# mode=4 is 802.3ad (LACP)
BONDING_OPTS="mode=4"
DEFROUTE=yes
IPADDR="192.168.xxx.xxx"
NETMASK=255.255.255.0
GATEWAY="192.168.xxx.xxx"
IPV4_FAILURE_FATAL=no
IPV6INIT=no
NAME=bond0
DEVICE=bond0
ONBOOT=yes

TYPE="Ethernet"
MASTER=bond0
SLAVE=yes
NAME="enp4s0f0"
UUID="xxx"
DEVICE="enp4s0f0"
ONBOOT="yes"

TYPE="Ethernet"
MASTER=bond0
SLAVE=yes
NAME="enp4s0f0"
UUID="xxx"
DEVICE="enp4s0f1"
ONBOOT="yes"
/Götz
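With a bond configured like this, the bonding driver keeps its own per-slave state and counters in /proc/net/bonding/bond0, which may show which slave is flapping and how often. A rough sketch, assuming the bond is named bond0 as in the config above; bond_slave_status is a hypothetical helper name:

```shell
# Print "<slave> <MII status> <link failure count>" for every slave of a bond.
# $1: status file to read (defaults to /proc/net/bonding/bond0).
# bond_slave_status is a hypothetical helper name used only for this sketch.
bond_slave_status() {
    awk -F': ' '
        /^Slave Interface:/    { slave = $2 }
        # The bond itself also has an "MII Status" line, so only record
        # the value once a slave section has started.
        /^MII Status:/         { if (slave != "") mii = $2 }
        /^Link Failure Count:/ { if (slave != "") print slave, mii, $2 }
    ' "${1:-/proc/net/bonding/bond0}"
}

# Usage (on the affected server):
#   bond_slave_status
```

A link failure count that keeps growing on only one slave would fit the "it is always eth0" observation.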
On 3/28/2016 11:44 PM, Götz Reinicke - IT Koordinator wrote:
TYPE="Ethernet" MASTER=bond0 SLAVE=yes NAME="enp4s0f0" UUID="xxx" DEVICE="enp4s0f0" ONBOOT="yes"
TYPE="Ethernet" MASTER=bond0 SLAVE=yes NAME="enp4s0f0" UUID="xxx" DEVICE="enp4s0f1" ONBOOT="yes"
Should both of those 'Ethernet' devices have the same NAME?
On 29.03.16 at 11:12, John R Pierce wrote:
Should both of those 'Ethernet' devices have the same NAME?
Copy-and-paste error; they don't have the same name.
/Götz
On 28-03-2016 at 06:27, Götz Reinicke wrote:
In the logs I just have "NIC Link is Down" … "NIC Link is Up". It is always eth0.
If you are seeing "NIC Link is Down" as in:
[710442.668059] e1000e: enp0s25 NIC Link is Down
then the NIC lost its link, and the bond is just protecting you, as you probably didn't have any downtime because of it. In other words, bonding is not the issue.
Which NIC do you have on those servers?
Marcelo
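To see how often each interface loses its link, the kernel log can be tallied per interface. A small sketch, assuming the messages look like the e1000e line quoted above (the interface name directly before "NIC Link is Down", optionally followed by a colon); the exact ixgbe wording is an assumption and may differ, and count_link_down is a hypothetical helper name:

```shell
# Count "NIC Link is Down" messages per interface from kernel log text on stdin.
# count_link_down is a hypothetical helper name used only for this sketch.
count_link_down() {
    awk '/NIC Link is Down/ {
        for (i = 2; i <= NF; i++) {
            if ($i == "NIC") {
                name = $(i - 1)        # the field before "NIC" is the interface
                sub(/:$/, "", name)    # some drivers print "enp4s0f0:"
                down[name]++
                break
            }
        }
    }
    END { for (n in down) print n, down[n] }' | sort
}

# Usage (on the affected server):
#   dmesg | count_link_down
#   journalctl -k | count_link_down
```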
On 28.03.16 at 16:23, Marcelo Ricardo Leitner wrote:
Which NIC do you have on those servers?
The mainboard is a Supermicro X10DRI-T with an Intel X540 dual-port 10GBase-T NIC.
regards . Götz
On 29-03-2016 at 03:46, Götz Reinicke - IT Koordinator wrote:
The mainboard is a Supermicro X10DRI-T with an Intel X540 dual-port 10GBase-T NIC.
Okay, it's probably using the ixgbe driver then. You may want to test a newer kernel and see how that goes before doing too much debugging. You can install v4.5 using one of ELRepo's kernels at
http://elrepo.org/linux/kernel/el7/x86_64/RPMS/
http://elrepo.org/tiki/tiki-index.php
There are some changes between the 7.2 kernel and that one that would be good to test.
Or... enable ixgbe debugging with the module parameter debug=16 and send the dmesg log, especially the lines around the event.
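One way to set the module parameter is an options line under /etc/modprobe.d followed by a driver reload. This is only a sketch under assumptions: the file name ixgbe-debug.conf and the helper name write_ixgbe_debug_conf are hypothetical, and the interface name is taken from the config posted earlier in the thread:

```shell
# Write a modprobe options file that sets the ixgbe "debug" parameter.
# $1: debug level (16 is the value suggested in the thread)
# $2: target file; the default file name is a hypothetical choice
write_ixgbe_debug_conf() {
    level="${1:-16}"
    target="${2:-/etc/modprobe.d/ixgbe-debug.conf}"
    printf 'options ixgbe debug=%s\n' "$level" > "$target"
}

# Possible sequence, run as root from a local console, because reloading
# the driver takes both 10GBase-T ports down for a moment:
#   ethtool -i enp4s0f0              # confirm that the driver is ixgbe
#   modinfo -p ixgbe                 # list the parameters the module accepts
#   write_ixgbe_debug_conf 16
#   modprobe -r ixgbe && modprobe ixgbe
#   dmesg | tail -n 50
```

Removing the options file and reloading the driver again should return it to its default verbosity.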
On 29.03.16 at 13:57, Marcelo Ricardo Leitner wrote:
Or... enable ixgbe debugging with the module parameter debug=16 and send the dmesg log, especially the lines around the event.
Hm, could you give me a hint on how to enable that (at runtime) on CentOS 7.2? I can't figure it out.
Would be nice. cheers . Götz
On 30-03-2016 at 06:46, Götz Reinicke - IT Koordinator wrote:
Hm, could you give me a hint on how to enable that (at runtime) on CentOS 7.2? I can't figure it out.
Ah, during runtime you can just use ethtool:
# ethtool -s eth0 msglvl 0xffff
When done, revert with:
# ethtool -s eth0 msglvl 0x7
Marcelo
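The two ethtool commands can be wrapped so that the message level is always restored after a capture. A minimal sketch, assuming eth0 and the 0xffff / 0x7 values from the message above; with_debug_msglvl is a hypothetical helper name, and the ETHTOOL variable exists only so the wrapper can be dry-run without touching a real NIC:

```shell
# Command used to talk to the NIC; override ETHTOOL for a dry run.
ETHTOOL="${ETHTOOL:-ethtool}"

# Raise the message level on interface $1, run the remaining arguments as a
# command, then restore the level 0x7 even if that command failed.
# with_debug_msglvl is a hypothetical helper name used only for this sketch.
with_debug_msglvl() {
    iface="$1"
    shift
    "$ETHTOOL" -s "$iface" msglvl 0xffff || return 1
    "$@"
    status=$?
    "$ETHTOOL" -s "$iface" msglvl 0x7
    return "$status"
}

# Usage (as root): keep debugging on for an hour, then look at the log
#   with_debug_msglvl eth0 sleep 3600
#   dmesg | grep -i eth0
```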