On Wednesday 07 July 2010, m.roth@5-cent.us wrote:
Alexander Farber wrote:
every few hours I get the following message in /var/log/messages:
Jul 5 20:23:28 hXXX kernel: Machine check events logged
...
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC 111a60c5584d4 [at 2500 Mhz 1 days 9:25:51 uptime (unreliable)]
MISC c008000001000000 ADDR 1148f5940
  Northbridge NB Array Error
       bit35 = err cpu3
       bit42 = L3 subcache in error bit 0
       bit43 = L3 subcache in error bit 1
       bit46 = corrected ecc error
       bit59 = misc error valid
  memory/cache error 'generic read mem transaction, generic transaction, level generic'
STATUS 9c1f4cf8001c011b MCGSTATUS 0
No DIMM found for 1148f5940 in SMBIOS
...
First, this is *very* bad
That's a bit harsh. Depending on which error actually triggers this MCE, it may be nothing more than an annoyance (even though, yes, it is a hardware problem). Also, the OP did mention that the server runs without any obvious problems.
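For anyone who wants to check the dump by hand, here is a rough Python sketch that tests individual bits of the STATUS value quoted above. The flag positions 57 to 63 follow the generic x86 MCi_STATUS layout; the northbridge labels are simply copied from the dump; the names ARCH_FLAGS, DUMP_BITS and decode are made up for this example. Treat it as an illustration, not as a replacement for mcelog.

```python
# STATUS value copied from the mcelog dump in this thread.
STATUS = 0x9C1F4CF8001C011B

# Generic MCi_STATUS flag bits (x86 machine-check architecture).
ARCH_FLAGS = {
    63: "VAL (register holds a valid error)",
    62: "OVER (an earlier error was overwritten)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

# Northbridge bits, with labels taken verbatim from the dump.
DUMP_BITS = {
    35: "err cpu3",
    42: "L3 subcache in error bit 0",
    43: "L3 subcache in error bit 1",
    46: "corrected ecc error",
    59: "misc error valid",
}


def decode(status, table):
    """Return {bit: (label, is_set)} for every bit listed in table."""
    return {
        bit: (label, bool((status >> bit) & 1))
        for bit, label in table.items()
    }


if __name__ == "__main__":
    for name, table in (("generic", ARCH_FLAGS), ("northbridge", DUMP_BITS)):
        print(name)
        for bit, (label, is_set) in sorted(decode(STATUS, table).items()):
            print(f"  bit{bit:2d} = {int(is_set)}  {label}")
```

For this particular value the UC and PCC bits come out clear and the "corrected ecc error" bit comes out set, which fits the reading that the hardware corrected the error itself.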
- I'm not good enough on this to tell you if it's the CPU, or the motherboard, but it's one of the two, *not* just memory.
What do you base that on? I've seen a lot of different MCE errors resolved by finding and replacing flaky DIMMs.
Second, if you're paying for hosting, and it's *their* server, you need to get on the phone with them *now*, and tell them that they need to fix it, yesterday would be preferable. They *should* have seen the logs.
Dunno if you have a physical machine hosted there, or a VM;
I'm quite sure you can't get that kind of MCE dump inside a VM.
/Peter
if the latter, they can move it without you seeing any downtime at all. If the former, they can just hot swap the drives into another server.
But call them *NOW*. You're paying for the service.
mark