[CentOS] PXE problem after CentOS reboot

Sat Jan 5 04:37:30 UTC 2008
Andrey Slepuhin <andrey.slepuhin at t-platforms.ru>

Dear folks,

We are installing a large diskless cluster using CentOS 5.1. The 
hardware is pretty new - Supermicro X7DWT boards with Harpertown CPUs. 
Unfortunately we have some PXE-related problems described by the 
following scenario:
1) Set up DHCP, TFTP and NFS on a server, prepare the PXE kernel and
initrd - fine.
2) Start up the node using PXE for the first time - fine.
3) Reboot the node - PXE boot fails on every subsequent attempt. We see
that the server gets DHCP requests and answers them, but the node never
responds to the offer, so the exchange never reaches a DHCP ack. The
typical DHCP log is:
Jan  5 09:14:34 shoffner dhcpd: DHCPDISCOVER from 00:30:48:7e:24:a6 via eth1
Jan  5 09:14:34 shoffner dhcpd: DHCPOFFER on 10.1.5.2 to 00:30:48:7e:24:a6 via eth1
Jan  5 09:14:36 shoffner dhcpd: DHCPDISCOVER from 00:30:48:7e:24:a6 via eth1
Jan  5 09:14:36 shoffner dhcpd: DHCPOFFER on 10.1.5.2 to 00:30:48:7e:24:a6 via eth1
Jan  5 09:14:40 shoffner dhcpd: DHCPDISCOVER from 00:30:48:7e:24:a6 via eth1
Jan  5 09:14:40 shoffner dhcpd: DHCPOFFER on 10.1.5.2 to 00:30:48:7e:24:a6 via eth1
Jan  5 09:14:48 shoffner dhcpd: DHCPDISCOVER from 00:30:48:7e:24:a6 via eth1
Jan  5 09:14:48 shoffner dhcpd: DHCPOFFER on 10.1.5.2 to 00:30:48:7e:24:a6 via eth1
4) Nothing like a DHCP server restart, node reset or node power off/on
helps.
5) The only thing that lets the system boot over PXE again is to run the
"bmc reset cold" command on the node using ipmitool - yes, we have an
IPMI card sharing the same Ethernet interface. After that we can boot
CentOS again.
6) Once Linux is loaded, if we restart the node using "bmc power cycle"
instead of reboot or shutdown, the node boots the next time without
problems.
7) There are no problems with the second GbE interface (without IPMI).
8) So our guess is that Linux, on reboot, leaves the Ethernet device in
some state that causes brain damage for the IPMI+PXE combination. We
tried playing with some e1000 driver options, and we also tried the
latest Intel driver - nothing helps.
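
To check for the pattern from step 3 across many nodes, something like the
following could scan the dhcpd log for clients that keep receiving DHCPOFFER
but never send DHCPREQUEST. This is only a rough sketch: the function names,
the min_offers threshold and the regular expression are our own assumptions,
based on the log format shown above, and each log entry is assumed to sit on
a single line.

```python
import re
from collections import defaultdict

# Assumed line format, taken from the dhcpd log excerpt in step 3, e.g.
#   "... dhcpd: DHCPOFFER on 10.1.5.2 to 00:30:48:7e:24:a6 via eth1"
MSG_RE = re.compile(
    r"dhcpd: (DHCP[A-Z]+) .*?\b((?:[0-9a-f]{2}:){5}[0-9a-f]{2})\b",
    re.IGNORECASE,
)


def count_messages(lines):
    """Return {mac: {message_type: count}} for every dhcpd line found."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        match = MSG_RE.search(line)
        if match:
            msg = match.group(1).upper()
            mac = match.group(2).lower()
            counts[mac][msg] += 1
    return counts


def stalled_clients(lines, min_offers=2):
    """List MACs that were offered a lease repeatedly but never requested it."""
    return sorted(
        mac
        for mac, per_type in count_messages(lines).items()
        if per_type["DHCPOFFER"] >= min_offers and per_type["DHCPREQUEST"] == 0
    )


SAMPLE = """\
Jan  5 09:14:34 shoffner dhcpd: DHCPDISCOVER from 00:30:48:7e:24:a6 via eth1
Jan  5 09:14:34 shoffner dhcpd: DHCPOFFER on 10.1.5.2 to 00:30:48:7e:24:a6 via eth1
Jan  5 09:14:36 shoffner dhcpd: DHCPDISCOVER from 00:30:48:7e:24:a6 via eth1
Jan  5 09:14:36 shoffner dhcpd: DHCPOFFER on 10.1.5.2 to 00:30:48:7e:24:a6 via eth1
"""

if __name__ == "__main__":
    print(stalled_clients(SAMPLE.splitlines()))
```

Run against the excerpt above, this lists only 00:30:48:7e:24:a6. A node that
boots normally would also show a DHCPREQUEST line and be left out.
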
Do you have any idea what is going wrong? Any help would be much
appreciated. A system summary follows:

[root@node-05-03 ~]# uname -a
Linux node-05-03 2.6.18-53.1.4.el5 #1 SMP Fri Nov 30 00:45:55 EST 2007 x86_64 x86_64 x86_64 GNU/Linux

[root@node-05-03 ~]# lspci
00:00.0 Host bridge: Intel Corporation Memory Controller Hub (rev 20)
00:01.0 PCI bridge: Intel Corporation PCI Express Port 1 (rev 20)
00:05.0 PCI bridge: Intel Corporation PCI Express Port 5 (rev 20)
00:07.0 PCI bridge: Intel Corporation PCI Express Port 7 (rev 20)
00:0f.0 System peripheral: Intel Corporation DMA/DCA Engine (rev 20)
00:10.0 Host bridge: Intel Corporation FSB Registers (rev 20)
00:10.1 Host bridge: Intel Corporation FSB Registers (rev 20)
00:10.2 Host bridge: Intel Corporation FSB Registers (rev 20)
00:10.3 Host bridge: Intel Corporation FSB Registers (rev 20)
00:10.4 Host bridge: Intel Corporation FSB Registers (rev 20)
00:11.0 Host bridge: Intel Corporation Unknown device 4031 (rev 20)
00:15.0 Host bridge: Intel Corporation FBD Registers (rev 20)
00:15.1 Host bridge: Intel Corporation FBD Registers (rev 20)
00:16.0 Host bridge: Intel Corporation FBD Registers (rev 20)
00:16.1 Host bridge: Intel Corporation FBD Registers (rev 20)
00:1d.0 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #1 (rev 09)
00:1d.1 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #2 (rev 09)
00:1d.2 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #3 (rev 09)
00:1d.7 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset EHCI USB2 Controller (rev 09)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d9)
00:1f.0 ISA bridge: Intel Corporation 631xESB/632xESB/3100 Chipset LPC Interface Controller (rev 09)
00:1f.2 SATA controller: Intel Corporation 631xESB/632xESB SATA AHCI Controller (rev 09)
00:1f.3 SMBus: Intel Corporation 631xESB/632xESB/3100 Chipset SMBus Controller (rev 09)
01:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR] (rev a0)
02:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Upstream Port (rev 01)
02:00.3 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express to PCI-X Bridge (rev 01)
03:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E1 (rev 01)
03:02.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E3 (rev 01)
05:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
05:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
08:01.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)

Thanks in advance,
Andrey