On Friday 22 May 2009, Peter Hopfgartner wrote: ...
Would it make sense to install the kernel from CentOS 5.2? Any contraindications?
As others have said, you should still have the 5.2 kernel around. Just change the grub.conf and reboot. It makes no sense to start swapping around hardware until you've tried to revert the kernel.
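For anyone unsure which entry to pick: in GRUB legacy, the "default=" value in grub.conf is the zero-based position of the "title" entry to boot. A minimal sketch for listing the entries with their index follows; the function names are made up for illustration, and the default path is the usual CentOS 5 location.

```shell
# List the boot entries of a GRUB legacy config with the zero-based
# index that "default=" refers to.
# Usage: list_grub_entries [path]   (defaults to /boot/grub/grub.conf)
list_grub_entries() {
    conf="${1:-/boot/grub/grub.conf}"
    # Every "title" line starts a new entry; number them from 0.
    awk '/^[ \t]*title[ \t]+/ {
        sub(/^[ \t]*title[ \t]+/, "")
        printf "%d: %s\n", n++, $0
    }' "$conf"
}

# Print the numeric "default=" value currently set in the same file.
show_grub_default() {
    conf="${1:-/boot/grub/grub.conf}"
    sed -n 's/^default=\([0-9][0-9]*\).*/\1/p' "$conf"
}
```

Run list_grub_entries, note the number printed next to the 5.2 kernel, set "default=" to that number in grub.conf, and reboot.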
That said, we've seen hangs and strange kernel messages on several different server platforms (HP DL140g3: NMI-related messages logged, HP DL160g5: hangs semi-randomly) with the new 5.3 kernels. All of these problems could be worked around by booting with the kernel option "nmi_watchdog=0".
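After adding the option to the boot entry and rebooting, it may be worth confirming that it actually reached the running kernel. A small sketch, assuming the option shows up verbatim in /proc/cmdline; the helper name has_kernel_arg is hypothetical.

```shell
# Succeed if the given argument is present on the kernel command line.
# Usage: has_kernel_arg ARG [cmdline-file]   (defaults to /proc/cmdline)
has_kernel_arg() {
    arg="$1"
    file="${2:-/proc/cmdline}"
    # Split the command line into one word per line and match the whole word.
    tr ' ' '\n' < "$file" | grep -qxF -e "$arg"
}

# Example:
#   has_kernel_arg nmi_watchdog=0 && echo "booted with nmi_watchdog=0"
```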
/Peter
I am experiencing the same issue with random reboots after a 5.3 upgrade. Sometimes it will go for days without rebooting then today it has rebooted 6 times at random times. I have modified grub.conf to go back to 2.6.18-92.1.22.el5xen on my dom0 and my only domU so we will see what happens (or hopefully doesn't happen) the next few days.
I have a 3.0 GHz P4 CPU with HT that does not support 64-bit, so it's running an i686 kernel. It does have a Broadcom NIC like the one an earlier post was suspicious of:
02:08.0 Ethernet controller: Intel Corporation 82562EZ 10/100 Ethernet Controller (rev 02)
02:09.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5703X Gigabit Ethernet (rev 02)
I have upgraded about 8 other servers with no random reboot problems, but they are all running on newer processors with a 64-bit kernel.
On Wed, 2009-05-27 at 18:03 -0500, Dave Jones wrote:
I am experiencing the same issue with random reboots after a 5.3 upgrade. Sometimes it will go for days without rebooting then today it has rebooted 6 times at random times. I have modified grub.conf to go back to 2.6.18-92.1.22.el5xen on my dom0 and my only domU so we will see what happens (or hopefully doesn't happen) the next few days.
I have a 3.0 P4 CPU with HT that does not support 64-bit so it's running an i686 kernel.
--- A P4 with HT...? This may not be your problem, but I have several P4 CPUs with HT enabled. Do you constantly get messages in /var/log/messages saying that your CPU temperature is above threshold and that CPU 0 or 1 will be throttled back? Also, just curious: are you using the "p4-clockmod" driver? Those were the messages I had in my logs after a reboot had happened.
JohnStanley
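Both questions above can be checked quickly from a shell. The following is only a sketch: the grep pattern is an assumption about how the throttling messages are worded, so adjust it to whatever your logs contain, and both function names are made up here.

```shell
# Count log lines that mention thermal throttling.
# Usage: count_thermal_events [logfile]   (defaults to /var/log/messages)
count_thermal_events() {
    log="${1:-/var/log/messages}"
    # Assumed wording of the kernel messages; case-insensitive match.
    grep -ciE 'temperature above threshold|clock throttled' "$log"
}

# Succeed if the p4-clockmod driver is loaded. Loaded modules are listed
# with underscores, so the name to look for is p4_clockmod.
# Usage: p4_clockmod_loaded [modules-file]   (defaults to /proc/modules)
p4_clockmod_loaded() {
    mods="${1:-/proc/modules}"
    grep -q '^p4_clockmod ' "$mods"
}
```

A count well above zero around the time of a reboot would point toward heat rather than the kernel upgrade.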
JohnS wrote:
A P4 with HT...? This may not be your problem, but I have several P4 CPUs with HT enabled. Do you constantly get messages in /var/log/messages saying that your CPU temperature is above threshold and that CPU 0 or 1 will be throttled back? Also, just curious: are you using the "p4-clockmod" driver? Those were the messages I had in my logs after a reboot had happened.
JohnStanley
I have heard of HP/Compaq ProLiant servers having similar problems (random reboots or system hangs). It seems to be a kernel bug from upstream; take a look at these Bugzilla tickets:
https://bugzilla.redhat.com/show_bug.cgi?id=494114
https://bugzilla.redhat.com/show_bug.cgi?id=470202
Both problems have been fixed as of test kernel 2.6.18-144, available at http://people.redhat.com/dzickus/el5/
Epilogue:
I've tried to disable TSO (ethtool -K eth0 tso off), as was suggested on the poweredge list. This did not help.
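To confirm that the setting took effect, the offload state can be read back with "ethtool -k" (lowercase k). A rough sketch that extracts the TSO line from that output; the helper name tso_state is hypothetical, and the two label spellings it accepts are the ones I know of from older and newer ethtool versions.

```shell
# Print "on" or "off" for TCP segmentation offload, reading `ethtool -k`
# output on stdin. Accepts both "tcp segmentation offload:" and
# "tcp-segmentation-offload:" as the label.
tso_state() {
    sed -n 's/^[[:space:]]*tcp[ -]segmentation[ -]offload:[[:space:]]*\([a-z]*\).*/\1/p' | head -n 1
}

# Example:
#   ethtool -k eth0 | tso_state
```

Note that a setting made with "ethtool -K" does not survive a reboot unless it is reapplied at boot time.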
I've configured the machine to start with the 5.2 kernel by changing the default in /boot/grub/grub.conf. It has been running for 6 1/2 days now. I would say that this helped, and it is what I would suggest to others experiencing the same problem right now.
Thus, the currently running kernel is 2.6.18-92.1.10.el5xen.
Regards and thanks for all replies,
Peter
I have booted the previous 5.2 kernel as well with my problem server and it has been stable for the past week too. There is definitely something going on with this 5.3 Xen kernel on that hardware. The CPU is a 3.0 GHz P4 with HT and no 64-bit support. Kernel 2.6.18-92.1.22.el5xen seems to be the last stable one on my server that I am running now.
[root@server1 ~]# uname -a
Linux server1.mydomain.com 2.6.18-92.1.22.el5xen #1 SMP Tue Dec 16 13:08:49 EST 2008 i686 i686 i386 GNU/Linux
[root@server1 ~]# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Pentium(R) 4 CPU 3.00GHz
stepping : 9
cpu MHz : 2992.572
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips : 7485.27

processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Pentium(R) 4 CPU 3.00GHz
stepping : 9
cpu MHz : 2992.572
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe up cid xtpr
bogomips : 7485.27

[root@server1 ~]# lspci
00:00.0 Host bridge: Intel Corporation 82865G/PE/P DRAM Controller/Host-Hub Interface (rev 02)
00:01.0 PCI bridge: Intel Corporation 82865G/PE/P PCI to AGP Controller (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #2 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB2 EHCI Controller (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev c2)
00:1f.0 ISA bridge: Intel Corporation 82801EB/ER (ICH5/ICH5R) LPC Interface Bridge (rev 02)
00:1f.1 IDE interface: Intel Corporation 82801EB/ER (ICH5/ICH5R) IDE Controller (rev 02)
00:1f.2 IDE interface: Intel Corporation 82801EB (ICH5) SATA Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801EB/ER (ICH5/ICH5R) SMBus Controller (rev 02)
00:1f.5 Multimedia audio controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) AC'97 Audio Controller (rev 02)
01:00.0 VGA compatible controller: nVidia Corporation NV36.2 [GeForce FX 5700] (rev a1)
02:08.0 Ethernet controller: Intel Corporation 82562EZ 10/100 Ethernet Controller (rev 02)
02:09.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5703X Gigabit Ethernet (rev 02)
02:0a.0 RAID bus controller: Silicon Image, Inc. Adaptec AAR-1210SA SATA HostRAID Controller (rev 02)

[root@server1 ~]# lsusb
Bus 001 Device 001: ID 0000:0000
Bus 003 Device 001: ID 0000:0000
Bus 002 Device 001: ID 0000:0000
Bus 002 Device 002: ID 051d:0002 American Power Conversion Uninterruptible Power Supply
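The cpuinfo output above already shows the two properties discussed in this thread: the "ht" flag is present and the "lm" (long mode, i.e. 64-bit) flag is absent. A minimal sketch for checking a flag on another box, so the hardware can be compared; the function name has_cpu_flag is made up for illustration.

```shell
# Succeed if the first "flags" line of a cpuinfo file lists the given flag.
# Usage: has_cpu_flag FLAG [cpuinfo-file]   (defaults to /proc/cpuinfo)
has_cpu_flag() {
    flag="$1"
    file="${2:-/proc/cpuinfo}"
    # Take the first flags line, split it into words, match the whole word.
    grep -m 1 '^flags' "$file" | cut -d: -f2 | tr ' ' '\n' | grep -qxF -e "$flag"
}

# Examples:
#   has_cpu_flag lm || echo "no 64-bit support"
#   has_cpu_flag ht && echo "CPU advertises Hyper-Threading"
```

Keep in mind that the "ht" flag only says the CPU advertises the feature, not whether it is enabled in the BIOS.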
on 6-3-2009 11:30 AM Dave Jones spake the following:
I have booted the previous 5.2 kernel as well with my problem server and it has been stable for the past week too. There is definitely something going on with this 5.3 Xen kernel on that hardware. The CPU is a 3.0 GHz P4 with HT and no 64-bit support. Kernel 2.6.18-92.1.22.el5xen seems to be the last stable one on my server that I am running now.
Did you ever try the newer kernel with hyperthreading off? It isn't really a second processor, and doesn't add much to a server. I have an older server with 2 HT Xeons, and it actually runs better with HT off. But that is 64-bit, and might be different.