[CentOS] eth0 killed when adding virtual interface and multiple NICs are present

Sat Apr 25 18:33:36 UTC 2009
Kai Schaetzl <maillists at conactive.com>

I have a strange problem on one machine where eth0 gets killed when I add
a virtual interface. It seems to have something to do with the NIC
ordering, or with the Xen network script having a problem with multiple
NICs and virtual interfaces. I could use some help or comments on this.

Some history:
I added a NIC (chip identifies as Realtek Semiconductor Co., Ltd.
RTL8111/8168B PCI Express Gigabit Ethernet) to a Dell R200 server running
CentOS 5.3 with Xen 3.3.1 (gitco repo). eth0 and eth1 are the built-in
NICs, so the new one is eth2 (or it should be).
This works. Everything is fine until I add a virtual interface to eth0 and
reboot. I can add eth0:1 at runtime just fine. But if I leave it in
network-scripts and boot, the whole of eth0 is killed (it doesn't show up
in ifconfig and doesn't work). A network restart brings it up as if
nothing were wrong.
I first thought it might have something to do with the fact that eth0 is
actually a bridge on Xen > 3.2, so I tried the same config on another
machine, and there it works. That machine doesn't have the exact same Xen
version, isn't 64-bit and has only one NIC. So there are differences, but
it seems to rule out the bridge as the cause.

I then checked the logs more thoroughly and found that CentOS changes the
NIC initialization order at boot time.
Without the third NIC it's eth0=NIC1 and eth1=NIC2 (as labelled on the
chassis). But with the third NIC installed, it's most often that one that
goes first. Here's a typical excerpt from /var/log/messages. tigon/tg3 is
the driver for the internal NICs, which were normally on eth0 and eth1.

Apr 25 19:00:59 c4 kernel: eth0: RTL8168b/8111b at 0xffffc20000022000, 
00:21:27:c9:d1:f5, XID 38000000 IRQ 16
Apr 25 19:00:59 c4 kernel: eth1: Tigon3 [partno(BCM95721) rev 4201 PHY
(5750)] (PCI Express) 10/100/1000Base-T Ethernet 00:1e:c9:fe:fb:ab
Apr 25 19:00:59 c4 kernel: eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] 
WireSpeed[1] TSOcap[1]
Apr 25 19:00:59 c4 kernel: eth1: dma_rwctrl[76180000] dma_mask[64-bit]
Apr 25 19:00:59 c4 kernel: eth2: Tigon3 [partno(BCM95721) rev 4201 PHY
(5750)] (PCI Express) 10/100/1000Base-T Ethernet 00:1e:c9:fe:fb:ac
Apr 25 19:00:59 c4 kernel: eth2: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] 
WireSpeed[1] TSOcap[1]
Apr 25 19:00:59 c4 kernel: eth2: dma_rwctrl[76180000] dma_mask[64-bit]
Apr 25 19:00:59 c4 kernel: tg3: eth0: Link is up at 1000 Mbps, full 
duplex.
Apr 25 19:00:59 c4 kernel: tg3: eth0: Flow control is on for TX and on for 
RX.
Apr 25 19:00:59 c4 kernel: r8169: eth2: link up
Apr 25 19:00:59 c4 kernel: r8169: eth2: link up
Apr 25 19:01:01 c4 ntpd[2461]: Listening on interface eth2, 
192.168.2.4#123 Enabled
Apr 25 19:01:01 c4 ntpd[2461]: Listening on interface eth0, 
192.168.1.24#123 Enabled
Apr 25 19:01:01 c4 ntpd[2461]: Listening on interface eth1, 
192.168.2.3#123 Enabled
Apr 25 19:01:08 c4 uxmon: c4.net: started monitoring: lo eth2 eth0 eth1
Apr 25 19:01:18 c4 kernel: tg3: peth0: Link is up at 1000 Mbps, full 
duplex.
Apr 25 19:01:18 c4 kernel: tg3: peth0: Flow control is on for TX and on 
for RX.
Apr 25 19:01:18 c4 kernel: device peth0 entered promiscuous mode
Apr 25 19:01:18 c4 kernel: type=1700 audit(1240678878.244:3): dev=peth0 
prom=256 old_prom=0 auid=4294967295 ses=4294967295
Apr 25 19:01:18 c4 kernel: eth0: topology change detected, propagating
Apr 25 19:01:18 c4 kernel: eth0: port 1(peth0) entering forwarding state

Repeated booting sometimes gives me a different order, e.g. the two Tigon
NICs come first, but this is rare.
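
To compare the detection order across boots without reading the whole log
by eye, something like the following could be used. It is only a rough
sketch: the function name detection_order and both regular expressions are
made up for this example and are matched against the line shapes in the
excerpt above, nothing more general.

```python
import re

# Kernel detection lines as seen in the excerpt, e.g.
#   "... kernel: eth0: RTL8168b/8111b at 0xffffc20000022000, ..."
#   "... kernel: eth1: Tigon3 [partno(BCM95721) rev 4201 PHY ..."
# (assumption: only these two chip families need to be recognised)
DETECT = re.compile(r"kernel: (eth\d+): (RTL\S+|Tigon3)\b")
MAC = re.compile(r"\b([0-9a-f]{2}(?::[0-9a-f]{2}){5})\b", re.I)


def detection_order(lines):
    """Return [(name, chip, mac)] in the order the kernel detected the NICs.

    The MAC is taken from the detection line itself or, if the line was
    wrapped (as in a mail), from the first following line that has one.
    """
    found = []
    pending = None
    for line in lines:
        m = DETECT.search(line)
        if m:
            pending = [m.group(1), m.group(2), None]
            found.append(pending)
        if pending is not None and pending[2] is None:
            mac = MAC.search(line)
            if mac:
                pending[2] = mac.group(1).lower()
    return [tuple(entry) for entry in found]


if __name__ == "__main__":
    with open("/var/log/messages") as log:
        for name, chip, mac in detection_order(log):
            print(name, chip, mac)
```

Run after each reboot, this lists which chip and MAC the kernel first
assigned to each ethN, before any later renaming takes place.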

Well, it seems this wasn't a problem until I added a virtual interface to
eth0. When the eth interfaces are brought up, the system seems to
re-enumerate the eth numbering according to the HWADDR matches, so that
eth0=NIC1 and so on. (In the excerpt above the Realtek is detected as
eth0, but by the time the links come up tg3 reports eth0 and r8169 reports
eth2.) As soon as I add a virtual interface to eth0 this breaks and all of
eth0 is killed. At least that's what I figure.
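
To make that guess checkable, here is a small sketch that compares the
names the kernel handed out at detection time with the DEVICE/HWADDR pairs
in the ifcfg files and lists which interfaces would have to be renamed. The
helper names parse_ifcfg and renames_needed are invented for this example,
and it only models the matching on HWADDR as I understand it, not what the
init scripts or the Xen network script really do.

```python
def parse_ifcfg(text):
    """Parse KEY=value lines of an ifcfg-* file into a dict."""
    cfg = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        cfg[key.strip()] = value.strip().strip("\"'")
    return cfg


def renames_needed(detected, configs):
    """List (kernel_name, configured_name) pairs that differ.

    detected: {kernel_name: mac} as assigned at driver load time.
    configs:  parsed ifcfg dicts; entries without HWADDR (for example an
              alias such as eth0:1) are skipped.
    """
    by_mac = {mac.lower(): name for name, mac in detected.items()}
    pairs = []
    for cfg in configs:
        hwaddr = cfg.get("HWADDR", "").lower()
        device = cfg.get("DEVICE")
        if device and hwaddr in by_mac and by_mac[hwaddr] != device:
            pairs.append((by_mac[hwaddr], device))
    return sorted(pairs)


if __name__ == "__main__":
    # MACs from the log excerpt; which Tigon port is NIC1 is an assumption.
    detected = {
        "eth0": "00:21:27:c9:d1:f5",
        "eth1": "00:1e:c9:fe:fb:ab",
        "eth2": "00:1e:c9:fe:fb:ac",
    }
    configs = [
        parse_ifcfg("DEVICE=eth0\nHWADDR=00:1E:C9:FE:FB:AB\nONBOOT=yes\n"),
        parse_ifcfg("DEVICE=eth1\nHWADDR=00:1E:C9:FE:FB:AC\nONBOOT=yes\n"),
        parse_ifcfg("DEVICE=eth2\nHWADDR=00:21:27:C9:D1:F5\nONBOOT=yes\n"),
        parse_ifcfg("DEVICE=eth0:1\nONBOOT=yes\n"),
    ]
    for old, new in renames_needed(detected, configs):
        print(old, "->", new)
```

With the boot shown above, every one of the three NICs would need a
rename, which is at least consistent with the tg3/r8169 link messages.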

So, the next obvious question is: How can I set a fixed order, so that 
NIC1 is always brought up first as eth0?

I'm not sure this would fix it, though. I haven't done enough reboots
yet, but it seems that at least once I got the "correct" initialization
order and eth0 got killed anyway. So it might not be the order after all,
but something in the Xen script that only happens when multiple NICs are
present and a virtual interface is added.

Any thoughts so far?

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com