Hi,
Is there someone on this mailing list who could, or would be willing to, help me figure out this issue? We do not know where to look to solve this.
--- Description ---
This is a brand new server, which has been tested for days with FreeBSD in our office, and for a few days with Windows at our hardware distributor's site. Now the customer wants CentOS, which we installed, but after a few days we got a kernel panic. Last night at 2:08 it gave the same kernel panic.
Please tell me what information I should give you and, most importantly, how to get it from the system, because we do not have experience with CentOS (only FreeBSD).
I would be very surprised if this is hardware related. We have used the same hardware for several years, and run FreeBSD on it very successfully. It is a SuperMicro PDSMI+ motherboard with a 3ware RAID controller (8006-2LP). CPU is Xeon 3040 1.8 GHz EM64 2MB 1066FSB (65W). Memory is DDR2 Transcend 2048MB ECC Unbuffered 800.
Error message on console is in "Additional Information".
I am hoping that I just need to switch off some setting in CentOS to fix this, but I cannot find much useful information about this issue on Google.
--- Additional Information ---
CentOS release 5 (Final)
Kernel 2.6.18-53.1.21.el5 on an i686

ws174 login: CPU 1: Machine Check Exception: 0000000000000005
CPU 0: Machine Check Exception: 0000000000000004
Bank 3: f62000020002010a at 0000000032c93500
Bank 5: f20000300c000e0f
Kernel panic - not syncing: CPU context corrupt
Bank 3: f62000020002010a
--- Attachments ---
19-06-2008 16-03-31.png (Screenshot of console)
With kind regards,
Alwin Roosen
On Fri, 2008-06-20 at 14:40 +0200, Alwin Roosen wrote:
Hi,
Is there someone on this mailing list who could, or would be willing to, help me figure out this issue? We do not know where to look to solve this.
...
I would be very surprised if this is hardware related.
A Google search for
"Machine Check Exception" "Kernel panic - not syncing: CPU context corrupt"
turns up 50 results (including your CentOS BZ request referring you to this list), many of which point to hardware problems - CPU, MB (bad caps), chipset, are all listed as possible problems. I'd go back to the hardware vendor if still under warranty.
Phil
2008/6/20 Alwin Roosen alwin.roosen@webline.be:
Hi,
Is there someone on this mailing list who could, or would be willing to, help me figure out this issue? We do not know where to look to solve this.
If your installation is standard CentOS with no third-party software or custom configuration, I would first run the vendor hardware checks several times, since they are usually not good at catching intermittent or hard-to-find problems. Also run an extensive memtest if possible.
regards
Walid
On 6/20/08, Alwin Roosen alwin.roosen@webline.be wrote: <snip>
CentOS release 5 (Final)
Kernel 2.6.18-53.1.21.el5 on an i686

ws174 login: CPU 1: Machine Check Exception: 0000000000000005
CPU 0: Machine Check Exception: 0000000000000004
Bank 3: f62000020002010a at 0000000032c93500
Bank 5: f20000300c000e0f
Kernel panic - not syncing: CPU context corrupt
Bank 3: f62000020002010a

Phil or someone else: Do the three (3) "Bank" lines above indicate RAM problems? If not, what do they refer to? Alwin wrote that this is brand new HW, so he suspects that it is OK, but it doesn't seem to be OK?
Lanny
Lanny Marcus wrote:
On 6/20/08, Alwin Roosen alwin.roosen@webline.be wrote:
<snip>
CentOS release 5 (Final)
Kernel 2.6.18-53.1.21.el5 on an i686

ws174 login: CPU 1: Machine Check Exception: 0000000000000005
CPU 0: Machine Check Exception: 0000000000000004
Bank 3: f62000020002010a at 0000000032c93500
Bank 5: f20000300c000e0f
Kernel panic - not syncing: CPU context corrupt
Bank 3: f62000020002010a

Phil or someone else: Do the three (3) "Bank" lines above indicate RAM problems? If not, what do they refer to? Alwin wrote that this is brand new HW, so he suspects that it is OK, but it doesn't seem to be OK?
Lanny
_______________________________________________
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos
I have the same issue, unresolved. However, I am using old desktop hardware (a Compaq Presario, and an HP something or other). Maybe it is memory, or CPU, or some kind of incompatibility with something. I was just making a list of the hardware that should be purchased to run a low-end SME server using CentOS.
Rack-mountable case, with power supply and fans included.
Motherboard, mid-range processor.
2 GB RAM
1 TB USB drive
Two 500 GB or four 300 GB internal hard drives (HW RAID would be nice)
CD/DVD R/W drive
and so on...
But I don't want to get into the situation above, where I purchase NEW hardware, and CentOS doesn't like it, and furthermore the resolution is elusive.
What is the best HW environment for CentOS? Brand, MFG, chipset rev, and so on....
Michael wrote:
But I don't want to get into the situation above, where I purchase NEW hardware, and CentOS doesn't like it, and furthermore the resolution is elusive.
What is the best HW environment for CentOS? Brand, MFG, chipset rev, and so on....
The easiest way is to buy from a vendor that can test on your OS of choice; there are lots of vendors out there that can do it.
Two such companies I have bought from that do this include
http://www.siliconmechanics.com/ (HQ in Seattle, WA area)
http://www.asaservers.com/ (HQ in San Francisco, CA area)
Both specialize in Supermicro/Tyan-based systems (as do most other "whitebox" vendors).
nate
On 6/20/08, nate centos@linuxpowered.net wrote: <snip>
The easiest way is to buy from a vendor that can test on your OS of choice; there are lots of vendors out there that can do it.
Two such companies I have bought from that do this include
http://www.siliconmechanics.com/ (HQ in Seattle, WA area)
http://www.asaservers.com/ (HQ in San Francisco, CA area)
Both specialize in Supermicro/Tyan-based systems (as do most other "whitebox" vendors).
That, IMHO, is the best way to go. Another way, if the HW is available, is to test it with a Live CD for CentOS, before purchasing, to see if CentOS will run properly on the HW.
on 6-20-2008 8:23 AM Lanny Marcus spake the following:
On 6/20/08, Alwin Roosen alwin.roosen@webline.be wrote:
<snip>
CentOS release 5 (Final)
Kernel 2.6.18-53.1.21.el5 on an i686

ws174 login: CPU 1: Machine Check Exception: 0000000000000005
CPU 0: Machine Check Exception: 0000000000000004
Bank 3: f62000020002010a at 0000000032c93500
Bank 5: f20000300c000e0f
Kernel panic - not syncing: CPU context corrupt
Bank 3: f62000020002010a

Phil or someone else: Do the three (3) "Bank" lines above indicate RAM problems? If not, what do they refer to? Alwin wrote that this is brand new HW, so he suspects that it is OK, but it doesn't seem to be OK?
Lanny
As most of us have found out at some time, brand new does not always equal OK. I have had plenty of hardware that was dead on arrival or dead within days. Start with the obvious: re-seat all removable parts such as memory and cards, and also any option cards for second processors if they are included. Shipping or moving equipment can loosen things.
Also look at the memory to see if it is on the recommended list for the motherboard.
On 6/20/08, Scott Silva ssilva@sgvwater.com wrote: <snip>
As most of us have found out at some time, brand new does not always equal OK. I have had plenty of hardware that was dead on arrival or dead within days. Start with the obvious: re-seat all removable parts such as memory and cards, and also any option cards for second processors if they are included. Shipping or moving equipment can loosen things.
Also look at the memory to see if it is on the recommended list for the motherboard.
The HW is using Memory Banking? Three (3) Banks have problems? How many Banks are there?
On 6/20/08, Alwin Roosen alwin.roosen@webline.be wrote:
Hi,
CentOS release 5 (Final)
Kernel 2.6.18-53.1.21.el5 on an i686

ws174 login: CPU 1: Machine Check Exception: 0000000000000005
CPU 0: Machine Check Exception: 0000000000000004
Bank 3: f62000020002010a at 0000000032c93500
Bank 5: f20000300c000e0f
Kernel panic - not syncing: CPU context corrupt
Bank 3: f62000020002010a
Alwin -->
I would be very, very "surprised" *IF* this wasn't hardware related.
Dave Jones wrote a nice little program to help decode this:
$ parsemce -b 3 -s f62000020002010a -e 5 -a 0000000032c93500
Status: (5) Machine Check in progress. Restart IP valid.
parsebank(3): f62000020002010a @ 32c93500
    External tag parity error
    CPU state corrupt. Restart not possible
    Address in addr register valid
    Error enabled in control register
    Error not corrected.
    Error overflow
    Memory hierarchy error
    Request: Generic error
    Transaction type : Generic
    Memory/IO : I/O
and:
$ parsemce -b 5 -s f20000300c000e0f -e 4 -a 0
Status: (4) Machine Check in progress. Restart IP invalid.
parsebank(5): f20000300c000e0f @ 0
    External tag parity error
    CPU state corrupt. Restart not possible
    Error enabled in control register
    Error not corrected.
    Error overflow
    Bus and interconnect error
    Participation: Generic
    Timeout: Request did not timeout
    Request: Generic error
    Transaction type : Invalid
    Memory/IO : Other
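For anyone without parsemce to hand, the flag lines in the two decodes above can be reproduced with a few lines of Python. This is only a rough sketch, assuming the bit positions Intel documents for the IA32_MCG_STATUS and IA32_MCi_STATUS registers; the helper names (set_flags, mcg_flags, mci_flags) are made up for illustration and are not part of parsemce.

```python
# Flag bits as laid out in Intel's machine-check architecture.
# The helper names below are hypothetical; they are not part of parsemce.
MCG_STATUS_BITS = {
    2: "MCIP",   # machine check in progress
    1: "EIPV",   # error IP valid
    0: "RIPV",   # restart IP valid
}

MCI_STATUS_BITS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten (overflow)
    61: "UC",     # error was not corrected
    60: "EN",     # error reporting enabled in the control register
    59: "MISCV",  # MISC register valid
    58: "ADDRV",  # ADDR register valid
    57: "PCC",    # processor context corrupt, restart not possible
}


def set_flags(value, table):
    """Return the names of the flags set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]


def mcg_flags(hex_text):
    """Decode the per-CPU 'Machine Check Exception' value from the console."""
    return set_flags(int(hex_text, 16), MCG_STATUS_BITS)


def mci_flags(hex_text):
    """Decode the per-bank status value from a 'Bank N:' console line."""
    return set_flags(int(hex_text, 16), MCI_STATUS_BITS)


if __name__ == "__main__":
    print("CPU 1:", mcg_flags("0000000000000005"))
    print("CPU 0:", mcg_flags("0000000000000004"))
    print("Bank 3:", mci_flags("f62000020002010a"))
    print("Bank 5:", mci_flags("f20000300c000e0f"))
```

Both banks come back with UC and PCC set, which corresponds to the "Error not corrected" and "CPU state corrupt" lines in the parsemce output; only bank 3 has ADDRV set, which is why only that bank shows an address.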
Dag's Repo has the new memtest86+ 2.01 RPM. I'd pull it and let it run overnight. While memtest86+ is good, I've recently had cases where it didn't find (obvious) memory errors.
I've also seen things like SATA disk drives cause MCEs.
This one looks like you're taking memory parity errors somewhere in the path to the CPU. In your BIOS, check your event log for any "interesting" entries, too.
Hope this helps ...
-rak-
Richard Karhuse wrote:
Dag's Repo has the new memtest86+ 2.01 RPM. I'd pull it and let it run overnight. While memtest86+ is good, I've recently had cases where it didn't find (obvious) memory errors.
My favorite test is cerberus (ctcs). Quite a few OEMs out there use it to burn in their systems. For me it can typically find a problem within a few hours, whereas I've let memtest run for a week without it finding anything useful.
The results of cerberus sometimes won't help you pinpoint the problem (often the result is just a machine crash), but at least you know there is an issue and can start swapping hardware until it's fixed (or just replace the whole system).
http://sourceforge.net/projects/va-ctcs/
nate
On 6/20/08, Alwin Roosen alwin.roosen@webline.be wrote: <snip>
This is a brand new server, which has been tested for days with FreeBSD in our office, and for a few days with Windows at our hardware distributor's site. Now the customer wants CentOS, which we installed, but after a few days we got a kernel panic. Last night at 2:08 it gave the same kernel panic.
The fact that it worked OK, the first few days, with FreeBSD and Windows, may have been a Burn In test and now something in the HW has failed or is failing. Or, possibly CentOS is utilizing the HW much more robustly than the other 2 OS did?
I would suggest that you get a Knoppix Live CD, or, preferably, a CentOS Live CD, and let it roll.
And, you get a Kernel Panic _after a few days_ on CentOS. That might indicate a Memory problem? Or, a Cooling problem?
ws174 login: CPU 1: Machine Check Exception: 0000000000000005
CPU 0: Machine Check Exception: 0000000000000004
Bank 3: f62000020002010a at 0000000032c93500
Bank 5: f20000300c000e0f
Kernel panic - not syncing: CPU context corrupt
Bank 3: f62000020002010a
Two banks of Memory (3 and 5) have problems?
If the RAM tests OK, suggest you swap the motherboard
On Fri, 2008-06-20 at 15:57 -0500, Lanny Marcus wrote:
On 6/20/08, Alwin Roosen alwin.roosen@webline.be wrote:
<snip>
<snip>
ws174 login: CPU 1: Machine Check Exception: 0000000000000005
CPU 0: Machine Check Exception: 0000000000000004
Bank 3: f62000020002010a at 0000000032c93500
Bank 5: f20000300c000e0f
Kernel panic - not syncing: CPU context corrupt
Bank 3: f62000020002010a
Two banks of Memory (3 and 5) have problems?
If the RAM tests OK, suggest you swap the motherboard
IIRC, you have memory interleaved? I've had problems with that, in the past, on ... an acer? Anyway, if so, try turning it off in the BIOS setup.
Also, make sure you have the latest BIOS for the mainboard.
<snip sig stuff>
HTH
William L. Maltby wrote:
On Fri, 2008-06-20 at 15:57 -0500, Lanny Marcus wrote:
On 6/20/08, Alwin Roosen alwin.roosen@webline.be wrote:
<snip>
<snip>
ws174 login: CPU 1: Machine Check Exception: 0000000000000005
CPU 0: Machine Check Exception: 0000000000000004
Bank 3: f62000020002010a at 0000000032c93500
Bank 5: f20000300c000e0f
Kernel panic - not syncing: CPU context corrupt
Bank 3: f62000020002010a
Two banks of Memory (3 and 5) have problems?
If the RAM tests OK, suggest you swap the motherboard
IIRC, you have memory interleaved? I've had problems with that, in the past, on ... an acer? Anyway, if so, try turning it off in the BIOS setup.
Also, make sure you have the latest BIOS for the mainboard.
I'm pretty sure those 'banks' mentioned in that error relate to the on-CPU cache, and not to motherboard main RAM.
Any ECC error reported in a machine check is likely cache ECC, not main memory ECC.
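One way to check this is to look at the low 16 bits of each bank's status value, which hold the architectural error code. The following is a simplified sketch, assuming Intel's documented compound error code layout (bit 11 set for bus and interconnect errors, bit 8 for memory hierarchy errors, bit 4 for TLB errors, and the low two bits giving the level); it ignores the other code classes, and the function name classify_mca_code is made up for illustration.

```python
# Level field (lowest two bits of the compound error code).
LEVELS = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}


def classify_mca_code(status):
    """Roughly classify the MCA error code held in bits 15:0 of a bank status.

    Simplified: only the TLB, memory hierarchy and bus/interconnect classes
    are recognised; anything else is reported as 'other'.
    """
    code = status & 0xFFFF
    level = LEVELS[code & 0x3]
    if code & 0x0800:      # 0000 1PPT RRRR IILL
        return ("bus and interconnect error", level)
    if code & 0x0100:      # 0000 0001 RRRR TTLL
        return ("memory hierarchy (cache) error", level)
    if code & 0x0010:      # 0000 0000 0001 TTLL
        return ("TLB error", level)
    return ("other or simple error code", None)


if __name__ == "__main__":
    print("Bank 3:", classify_mca_code(0xf62000020002010a))
    print("Bank 5:", classify_mca_code(0xf20000300c000e0f))
```

Under those assumptions bank 3 decodes as a memory hierarchy error at level L2 and bank 5 as a generic bus and interconnect error, so the "Bank" numbers are machine-check register banks inside the CPU rather than banks of DIMMs.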
On 6/20/08, Alwin Roosen alwin.roosen@webline.be wrote:
This is a brand new server, which has been tested for days with FreeBSD in our office, and for a few days with Windows at our hardware distributor's site. Now the customer wants CentOS, which we installed, but after a few days we got a kernel panic. Last night at 2:08 it gave the same kernel panic.
Have you checked to verify that the fans are spinning?
Since it is a new system, I think you should take it back to your HW distributor and have them run cerberus (ctcs) on it, as nate suggested.
If it takes a few days for it to get the Kernel Panic, I doubt that it is related to the OS.
Let your HW distributor do the work of troubleshooting and replacing whatever component(s) are faulty. They can get a CentOS Live CD and run that on it.