I have a strange problem on one machine where eth0 gets killed when I add a virtual interface. It seems to be related either to the NIC ordering or to the Xen network script having a problem with multiple NICs and virtual interfaces. I could use some help or comments on this.
Some history: I added a NIC (the chip identifies as Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet) to a Dell R200 server running CentOS 5.3 with Xen 3.3.1 (gitco repo). eth0 and eth1 are the built-in NICs, so the new card is eth2 (or it should be). That works. Everything is fine until I add a virtual interface to eth0 and reboot. I can add eth0:1 at runtime just fine. But if I leave its config file in network-scripts and reboot, eth0 is killed entirely (it doesn't show up in ifconfig and doesn't work). A network restart brings it up as if nothing were wrong. I first thought it might have something to do with the fact that eth0 is actually a bridge on Xen > 3.2, so I tried the same config on another machine, and there it works. That machine doesn't run the exact same Xen version, isn't 64-bit and has only one NIC. So there are differences, but it seems to rule out the bridge as the cause.
I then checked the logs more thoroughly and found that CentOS changes the NIC initialization order at boot time. Without the third NIC it's eth0=NIC1 and eth1=NIC2 (as labeled on the chassis). But with the third NIC installed, it's most often the new card that gets initialized first. Here's a typical excerpt from /var/log/messages. Tigon3/tg3 is the driver for the internal NICs, which were normally eth0 and eth1.
Apr 25 19:00:59 c4 kernel: eth0: RTL8168b/8111b at 0xffffc20000022000, 00:21:27:c9:d1:f5, XID 38000000 IRQ 16
Apr 25 19:00:59 c4 kernel: eth1: Tigon3 [partno(BCM95721) rev 4201 PHY (5750)] (PCI Express) 10/100/1000Base-T Ethernet 00:1e:c9:fe:fb:ab
Apr 25 19:00:59 c4 kernel: eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] WireSpeed[1] TSOcap[1]
Apr 25 19:00:59 c4 kernel: eth1: dma_rwctrl[76180000] dma_mask[64-bit]
Apr 25 19:00:59 c4 kernel: eth2: Tigon3 [partno(BCM95721) rev 4201 PHY (5750)] (PCI Express) 10/100/1000Base-T Ethernet 00:1e:c9:fe:fb:ac
Apr 25 19:00:59 c4 kernel: eth2: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] WireSpeed[1] TSOcap[1]
Apr 25 19:00:59 c4 kernel: eth2: dma_rwctrl[76180000] dma_mask[64-bit]
Apr 25 19:00:59 c4 kernel: tg3: eth0: Link is up at 1000 Mbps, full duplex.
Apr 25 19:00:59 c4 kernel: tg3: eth0: Flow control is on for TX and on for RX.
Apr 25 19:00:59 c4 kernel: r8169: eth2: link up
Apr 25 19:00:59 c4 kernel: r8169: eth2: link up
Apr 25 19:01:01 c4 ntpd[2461]: Listening on interface eth2, 192.168.2.4#123 Enabled
Apr 25 19:01:01 c4 ntpd[2461]: Listening on interface eth0, 192.168.1.24#123 Enabled
Apr 25 19:01:01 c4 ntpd[2461]: Listening on interface eth1, 192.168.2.3#123 Enabled
Apr 25 19:01:08 c4 uxmon: c4.net: started monitoring: lo eth2 eth0 eth1
Apr 25 19:01:18 c4 kernel: tg3: peth0: Link is up at 1000 Mbps, full duplex.
Apr 25 19:01:18 c4 kernel: tg3: peth0: Flow control is on for TX and on for RX.
Apr 25 19:01:18 c4 kernel: device peth0 entered promiscuous mode
Apr 25 19:01:18 c4 kernel: type=1700 audit(1240678878.244:3): dev=peth0 prom=256 old_prom=0 auid=4294967295 ses=4294967295
Apr 25 19:01:18 c4 kernel: eth0: topology change detected, propagating
Apr 25 19:01:18 c4 kernel: eth0: port 1(peth0) entering forwarding state
Repeated reboots sometimes give me a different order, e.g. the two Tigon3 NICs come first, but this is rare.
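To compare the probe order across reboots without reading the whole log each time, something like the following Python sketch could extract it. It is only an illustration: the function name probe_order and the regular expression are made up for this post and only cover the two probe-line formats seen in the excerpt above (RTL8168b and Tigon3); other drivers would need their own pattern.

```python
import re

# Matches the two kinds of kernel probe lines seen in the excerpt, e.g.
#   "... kernel: eth0: RTL8168b/8111b at 0x..., 00:21:27:c9:d1:f5, XID ..."
#   "... kernel: eth1: Tigon3 [...] 10/100/1000Base-T Ethernet 00:1e:c9:fe:fb:ab"
# Assumption: other drivers print different lines and are not covered here.
PROBE_RE = re.compile(
    r"kernel: (eth\d+): (?:RTL|Tigon3).*?((?:[0-9a-f]{2}:){5}[0-9a-f]{2})",
    re.IGNORECASE,
)


def probe_order(lines):
    """Return [(name, mac), ...] in the order the kernel announced the NICs."""
    order = []
    for line in lines:
        match = PROBE_RE.search(line)
        if match:
            order.append((match.group(1), match.group(2).lower()))
    return order


# Three lines taken from the excerpt; only the first two are probe lines.
sample = [
    "Apr 25 19:00:59 c4 kernel: eth0: RTL8168b/8111b at 0xffffc20000022000, "
    "00:21:27:c9:d1:f5, XID 38000000 IRQ 16",
    "Apr 25 19:00:59 c4 kernel: eth1: Tigon3 [partno(BCM95721) rev 4201 PHY (5750)] "
    "(PCI Express) 10/100/1000Base-T Ethernet 00:1e:c9:fe:fb:ab",
    "Apr 25 19:00:59 c4 kernel: tg3: eth0: Link is up at 1000 Mbps, full duplex.",
]

for name, mac in probe_order(sample):
    print(name, mac)
```

Run against the full log of each boot (for instance via open("/var/log/messages")), this would show at a glance whether the Realtek card or the Tigon3 NICs were probed first.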
Well, it seems this wasn't a problem until I added a virtual interface to eth0. When the eth interfaces are brought up, the system seems to renumber them according to the HWADDR entries in the ifcfg files, so that eth0=NIC1 and so on. As soon as I add a virtual interface to eth0, this breaks and eth0 is killed completely. At least that's what I figure.
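The renumbering I suspect can be sketched as follows. This is not the actual network-scripts code, just a hypothetical Python illustration of the idea: given the names the kernel assigned at probe time and the names requested via HWADDR, work out which interfaces have to be renamed. The MAC addresses are taken from the log excerpt; which of the two Tigon3 MACs is NIC1 is an assumption.

```python
def planned_renames(probed, wanted):
    """Return {old_name: new_name} for every NIC whose name must change.

    probed: {name: mac} as assigned by the kernel at probe time.
    wanted: {name: mac} as requested by HWADDR in the ifcfg-* files.
    """
    name_by_mac = {mac.lower(): name for name, mac in probed.items()}
    renames = {}
    for new_name, mac in wanted.items():
        old_name = name_by_mac.get(mac.lower())
        if old_name is not None and old_name != new_name:
            renames[old_name] = new_name
    return renames


# Probe order from the log excerpt: the Realtek card came up first.
probed = {
    "eth0": "00:21:27:c9:d1:f5",  # RTL8168b
    "eth1": "00:1e:c9:fe:fb:ab",  # Tigon3
    "eth2": "00:1e:c9:fe:fb:ac",  # Tigon3
}

# Desired naming (assumption: the ...:ab Tigon3 port is NIC1).
wanted = {
    "eth0": "00:1e:c9:fe:fb:ab",
    "eth1": "00:1e:c9:fe:fb:ac",
    "eth2": "00:21:27:c9:d1:f5",
}

print(planned_renames(probed, wanted))
```

With this probe order all three interfaces have to swap names, which matches the log: the r8169 card is first announced as eth0 but later reports "link up" as eth2, while tg3 reports its link on eth0.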
So, the next obvious question is: How can I set a fixed order, so that NIC1 is always brought up first as eth0?
I'm not sure this would fix it, though. I haven't done enough reboots yet to be certain, but it seems that at least once I got the "correct" initialization order and eth0 got killed anyway. So it might not be the order after all, but something in the Xen script that happens only when multiple NICs are present and a virtual interface is added.
Any thoughts so far?
Kai