I have a strange problem on one machine: eth0 gets killed when I add a virtual interface. It has something to do either with the NIC ordering or with the Xen network script having trouble with multiple NICs and virtual interfaces. I could use some help or comments on this.
Some history: I added a NIC (chip identifies as Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet) to a Dell R200 server running CentOS 5.3 with Xen 3.3.1 (gitco repo). eth0 and eth1 are the built-in NICs, so the new one is eth2 (or it should be). That works. Everything is fine until I add a virtual interface to eth0 and reboot. I can add eth0:1 at runtime just fine. But if I leave it in network-scripts and boot, the whole of eth0 is killed (it doesn't show up in ifconfig and doesn't work). A network restart brings it up as if nothing were wrong. I first thought it might have something to do with the fact that eth0 is actually a bridge on Xen > 3.2, so I tried the same config on another machine, and there it works. That machine doesn't run the exact same Xen version, is not 64-bit, and has only one NIC. So there are differences, but it seems to rule out the bridge as a cause.
I then checked the logs more thoroughly and found that CentOS changes the NIC initialization order at boot time. Without the third NIC it's eth0=NIC1 and eth1=NIC2 (as shown on the chassis). But with the third NIC installed, it's most often that one that goes first. Here's a typical excerpt from messages. tigon/tg3 is the driver for the internal NICs, which normally were on eth0 and eth1.
Apr 25 19:00:59 c4 kernel: eth0: RTL8168b/8111b at 0xffffc20000022000, 00:21:27:c9:d1:f5, XID 38000000 IRQ 16
Apr 25 19:00:59 c4 kernel: eth1: Tigon3 [partno(BCM95721) rev 4201 PHY (5750)] (PCI Express) 10/100/1000Base-T Ethernet 00:1e:c9:fe:fb:ab
Apr 25 19:00:59 c4 kernel: eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] WireSpeed[1] TSOcap[1]
Apr 25 19:00:59 c4 kernel: eth1: dma_rwctrl[76180000] dma_mask[64-bit]
Apr 25 19:00:59 c4 kernel: eth2: Tigon3 [partno(BCM95721) rev 4201 PHY (5750)] (PCI Express) 10/100/1000Base-T Ethernet 00:1e:c9:fe:fb:ac
Apr 25 19:00:59 c4 kernel: eth2: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] WireSpeed[1] TSOcap[1]
Apr 25 19:00:59 c4 kernel: eth2: dma_rwctrl[76180000] dma_mask[64-bit]
Apr 25 19:00:59 c4 kernel: tg3: eth0: Link is up at 1000 Mbps, full duplex.
Apr 25 19:00:59 c4 kernel: tg3: eth0: Flow control is on for TX and on for RX.
Apr 25 19:00:59 c4 kernel: r8169: eth2: link up
Apr 25 19:00:59 c4 kernel: r8169: eth2: link up
Apr 25 19:01:01 c4 ntpd[2461]: Listening on interface eth2, 192.168.2.4#123 Enabled
Apr 25 19:01:01 c4 ntpd[2461]: Listening on interface eth0, 192.168.1.24#123 Enabled
Apr 25 19:01:01 c4 ntpd[2461]: Listening on interface eth1, 192.168.2.3#123 Enabled
Apr 25 19:01:08 c4 uxmon: c4.net: started monitoring: lo eth2 eth0 eth1
Apr 25 19:01:18 c4 kernel: tg3: peth0: Link is up at 1000 Mbps, full duplex.
Apr 25 19:01:18 c4 kernel: tg3: peth0: Flow control is on for TX and on for RX.
Apr 25 19:01:18 c4 kernel: device peth0 entered promiscuous mode
Apr 25 19:01:18 c4 kernel: type=1700 audit(1240678878.244:3): dev=peth0 prom=256 old_prom=0 auid=4294967295 ses=4294967295
Apr 25 19:01:18 c4 kernel: eth0: topology change detected, propagating
Apr 25 19:01:18 c4 kernel: eth0: port 1(peth0) entering forwarding state
Repeated booting sometimes gives me a different order, e.g. the two tigon come first, but this is rare.
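For comparing several boots, it may help to pull the probe-time name and MAC address out of the kernel lines automatically. The following is only a rough sketch in Python: the regular expression and the function name probe_order are my own invention and are tuned to the log format shown above, so they may need adjusting for other drivers.

```python
import re

# Matches kernel probe lines such as "kernel: eth0: ... 00:21:27:c9:d1:f5".
# Lines like "kernel: tg3: eth0: Link is up" do not match, because the
# interface name does not directly follow "kernel: ".
PROBE_RE = re.compile(
    r"kernel: (eth\d+): .*?((?:[0-9a-f]{2}:){5}[0-9a-f]{2})",
    re.IGNORECASE,
)


def probe_order(lines):
    """Return (name, mac) pairs in the order the kernel probed the NICs."""
    found = []
    for line in lines:
        match = PROBE_RE.search(line)
        if match:
            found.append((match.group(1), match.group(2).lower()))
    return found


# Lines copied from the excerpt above.
sample = [
    "Apr 25 19:00:59 c4 kernel: eth0: RTL8168b/8111b at 0xffffc20000022000, 00:21:27:c9:d1:f5, XID 38000000 IRQ 16",
    "Apr 25 19:00:59 c4 kernel: eth1: Tigon3 [partno(BCM95721) rev 4201 PHY (5750)] (PCI Express) 10/100/1000Base-T Ethernet 00:1e:c9:fe:fb:ab",
    "Apr 25 19:00:59 c4 kernel: eth1: dma_rwctrl[76180000] dma_mask[64-bit]",
    "Apr 25 19:00:59 c4 kernel: eth2: Tigon3 [partno(BCM95721) rev 4201 PHY (5750)] (PCI Express) 10/100/1000Base-T Ethernet 00:1e:c9:fe:fb:ac",
    "Apr 25 19:00:59 c4 kernel: tg3: eth0: Link is up at 1000 Mbps, full duplex.",
]

for name, mac in probe_order(sample):
    print(name, mac)
```

In this excerpt the Realtek card is probed first as eth0, yet the later "r8169: eth2: link up" line shows it ended up as eth2, so the names were changed between probing and link-up.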
Well, it seems this wasn't a problem until I added a virtual interface to eth0. When the eth interfaces are brought up, the system seems to renumber them according to the HWADDR matches, so that eth0=NIC1 and so on. As soon as I add a virtual interface to eth0, this breaks and all of eth0 is killed. At least that's what I figure.
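To make the idea of renaming by HWADDR concrete, here is a toy model in Python. It is not the code the CentOS network scripts actually run; the function final_names is hypothetical, and which of the two Tigon MAC addresses belongs to NIC1 is an assumption on my part (the log only shows that a tg3 device ends up as eth0 and the r8169 device as eth2).

```python
def final_names(probed, configured):
    """Toy model of HWADDR-based renaming.

    probed:     kernel probe-time name -> MAC address
    configured: desired name -> HWADDR from the matching ifcfg file
    Returns MAC -> final name for every configured address that is present.
    """
    present = {mac.lower() for mac in probed.values()}
    result = {}
    for name, hwaddr in configured.items():
        mac = hwaddr.lower()
        if mac in present:
            result[mac] = name
    return result


# Probe-time names as seen in the kernel log.
probed = {
    "eth0": "00:21:27:c9:d1:f5",  # Realtek add-on card
    "eth1": "00:1e:c9:fe:fb:ab",  # Tigon3, assumed to be NIC1
    "eth2": "00:1e:c9:fe:fb:ac",  # Tigon3, assumed to be NIC2
}

# Desired names, one HWADDR per ifcfg file (assumed assignment).
configured = {
    "eth0": "00:1E:C9:FE:FB:AB",
    "eth1": "00:1E:C9:FE:FB:AC",
    "eth2": "00:21:27:C9:D1:F5",
}

for mac, name in sorted(final_names(probed, configured).items()):
    print(mac, "->", name)
```

With these inputs the Realtek card, probed as eth0, comes out as eth2, which agrees with the link-up lines in the log.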
So, the next obvious question is: How can I set a fixed order, so that NIC1 is always brought up first as eth0?
I'm not sure if this would fix it, though. I haven't done enough reboots yet, but it seems that at least once I got the "correct" initialization order and eth0 got killed anyway. So it might not be the order after all, but something in the Xen script that happens only when multiple NICs are present and a virtual interface is added.
Any thoughts so far?
Kai
On Sat, 2009-04-25 at 20:33 +0200, Kai Schaetzl wrote:
I have a strange problem on one machine: eth0 gets killed when I add a virtual interface. It has something to do either with the NIC ordering or with the Xen network script having trouble with multiple NICs and virtual interfaces. I could use some help or comments on this.
Some history: I added a NIC (chip identifies as Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet) to a Dell R200 server. CentOS 5.3 with Xen 3.3.1 (gitco repo).
---- see this: http://linux.dell.com/files/whitepapers/nic-enum-whitepaper-v3.pdf
This is a known issue with all PowerEdge servers. The whitepaper gives an explanation and a workaround for it.
JohnStanley
JohnS wrote:
see this: http://linux.dell.com/files/whitepapers/nic-enum-whitepaper-v3.pdf
This is a known issue with all PowerEdge servers. The whitepaper gives an explanation and a workaround for it.
I don't think there is anything unique to Dells about this. The kernel essentially randomizes device naming on everything. Dell just took the trouble to document it.
On Sat, 2009-04-25 at 14:52 -0500, Les Mikesell wrote:
I don't think there is anything unique to Dells about this. The kernel essentially randomizes device naming on everything. Dell just took the trouble to document it.
----
From what I understand, this was happening only with Dell hardware, and they submitted a patch to Red Hat. It is also the only hardware on which I have encountered the problem. There could be others.
What's more, Kai says he's running 5.3, but the fix should be in that kernel. What I do wonder is whether it was included when the CentOS kernel was built. Maybe the CentOS kernel builder could let us know?
2.6.19-rc3 and higher are supposed to have the fix?
It is, however, a strange thing when you encounter it. I pulled my hair out over it for a long time.
One last thing: he has this problem on an R200, and from memory those were not a problem. Could this be something new? He could check whether there is a newer BIOS revision.
JohnStanley
On Sat, 2009-04-25 at 14:52 -0500, Les Mikesell wrote:
I don't think there is anything unique to Dells about this. The kernel essentially randomizes device naming on everything. Dell just took the trouble to document it.
---
Also: https://bugzilla.redhat.com/show_bug.cgi?id=491432 seems to apply to Kai's case.
You *must* specify the HWADDR field in the ifcfg-* files in order to have persistent ethernet naming. That is how I did it on Dell hardware, and the bug report states the same.
JohnStanley
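A quick way to audit a machine for this is to check which ifcfg-* files carry no HWADDR line. The Python sketch below works on file contents passed in as strings so it can be tried anywhere; the helper names parse_ifcfg and missing_hwaddr are made up, the sample contents are illustrative rather than Kai's actual configuration, and the MAC in ifcfg-eth0 is an assumed pairing with the address from his log.

```python
def parse_ifcfg(text):
    """Parse the simple KEY=value lines of an ifcfg file into a dict."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip("\"'")
    return settings


def missing_hwaddr(files):
    """Return the names of ifcfg files that do not set HWADDR."""
    return sorted(
        name for name, text in files.items()
        if "HWADDR" not in parse_ifcfg(text)
    )


# Illustrative contents only; ifcfg-eth2 deliberately lacks HWADDR.
files = {
    "ifcfg-eth0": (
        "DEVICE=eth0\n"
        "HWADDR=00:1E:C9:FE:FB:AB\n"
        "BOOTPROTO=static\n"
        "IPADDR=192.168.1.24\n"
        "ONBOOT=yes\n"
    ),
    "ifcfg-eth2": (
        "DEVICE=eth2\n"
        "BOOTPROTO=static\n"
        "IPADDR=192.168.2.4\n"
        "ONBOOT=yes\n"
    ),
}

print(missing_hwaddr(files))
```

On a real system the dictionary would be filled by reading the files under /etc/sysconfig/network-scripts instead of using inline strings.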
JohnS wrote:
Also: https://bugzilla.redhat.com/show_bug.cgi?id=491432 seems to apply to Kai's case.
You *must* specify the HWADDR field in the ifcfg-* files in order to have persistent ethernet naming. That is how I did it on Dell hardware, and the bug report states the same.
I've had my ifcfg-* files renamed to ifcfg-*.bak files and ignored completely when moving drives, even among identical hardware. It's no fun when shipping to remote locations where the on-site people don't know much about linux.
At Sat, 25 Apr 2009 16:32:06 -0400 CentOS mailing list centos@centos.org wrote:
Also: https://bugzilla.redhat.com/show_bug.cgi?id=491432 seems to apply to Kai's case.
You *must* specify the HWADDR field in the ifcfg-* files in order to have persistent ethernet naming. That is how I did it on Dell hardware, and the bug report states the same.
On ALL Red Hat flavored distros (even with 2.4 kernels), I have *always* specified the HWADDR field in the ifcfg-* files. I *think* the Red Hat installers generally set this field during installation as well, at least as early as RH 7.<mumble> or RH 9, which would be when I first was dealing with machines with more than one NIC.
CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos
JohnS wrote on Sat, 25 Apr 2009 16:32:06 -0400:
You *must* specify the HWADDR field in the ifcfg-* files in order to have persistent ethernet naming.
And that is what I always do. I've never done it another way. You may have overlooked the part of my message where I state that it works without a problem, despite this juggling around, until I add a virtual interface to eth0. Tomorrow I'll try adding HWADDR to eth0:1 as well, but I think this will fail. I guess I will then have to turn off eth2 in the BIOS or remove it, and maybe eth1 as well, and run some more tests with just one adapter before adding the others back. I hope I can switch off eth2 in the BIOS somehow. I would hate to remove it, as it sits below the SAS adapter and the many SATA cables.
Thanks for the answers so far. At least that confirms that the simple juggling around of the main network interfaces is normal and to be expected.
Kai
Kai Schaetzl wrote on Sun, 26 Apr 2009 00:31:20 +0200:
Thanks for the answers so far. At least that confirms that the simple juggling around of the main network interfaces is normal and to be expected.
Simple test: I shut off xend and xendomains, and the problem is gone. So the problem is with the script that xend runs when creating the eth0/peth0 network bridge for the domUs. I'll move this to the xen-users list.
Kai