[CentOS] kernel: Machine check events logged

Wed Jul 7 14:21:18 UTC 2010
Peter Kjellstrom <cap at nsc.liu.se>

On Wednesday 07 July 2010, m.roth at 5-cent.us wrote:
> Alexander Farber wrote:
> > every few hours I get the following message in /var/log/message:
> > Jul  5 20:23:28 hXXX kernel: Machine check events logged
...
> > MCE 0
> > HARDWARE ERROR. This is *NOT* a software problem!
> > Please contact your hardware vendor
> > CPU 0 4 northbridge TSC 111a60c5584d4 [at 2500 Mhz 1 days 9:25:51
> > uptime (unreliable)]
> > MISC c008000001000000 ADDR 1148f5940
> >   Northbridge NB Array Error
> >        bit35 = err cpu3
> >        bit42 = L3 subcache in error bit 0
> >        bit43 = L3 subcache in error bit 1
> >        bit46 = corrected ecc error
> >        bit59 = misc error valid
> >   memory/cache error 'generic read mem transaction, generic
> > transaction, level generic'
> > STATUS 9c1f4cf8001c011b MCGSTATUS 0
> > No DIMM found for 1148f5940 in SMBIOS
...
> First, this is *very* bad

That's a bit harsh. Depending on what the actual error is that triggers this
MCE, it may actually be just an annoyance (even though, yes, it is a hardware
problem). Also, the OP did mention that the server runs without any obvious
problems.
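
For what it's worth, the STATUS word in the quoted dump can be checked by hand.
Below is a minimal Python sketch, not a replacement for mcelog. The names for
bits 35, 42, 43, 46 and 59 are copied from the dump itself; the positions of
the valid (63), uncorrected (61) and processor-context-corrupt (57) flags are
my assumption about the usual machine check status register layout and should
be verified against the CPU vendor's documentation.

```python
# STATUS value copied from the quoted mcelog output.
STATUS = 0x9C1F4CF8001C011B

# Bit names exactly as mcelog printed them in the dump.
MCELOG_BITS = {
    35: "err cpu3",
    42: "L3 subcache in error bit 0",
    43: "L3 subcache in error bit 1",
    46: "corrected ecc error",
    59: "misc error valid",
}

# ASSUMPTION: common machine check status flag positions, not taken
# from the dump. Check the vendor manual before relying on them.
ASSUMED_FLAGS = {
    63: "valid",
    61: "uncorrected error",
    57: "processor context corrupt",
}


def bit_is_set(value: int, bit: int) -> bool:
    """Return True if the given bit position is set in value."""
    return (value >> bit) & 1 == 1


def describe(value: int, names: dict) -> dict:
    """Return only the named bits that are actually set in value."""
    return {bit: name for bit, name in names.items() if bit_is_set(value, bit)}


if __name__ == "__main__":
    for bit, name in sorted(describe(STATUS, MCELOG_BITS).items()):
        print(f"bit{bit} = {name}")
    for bit, name in sorted(ASSUMED_FLAGS.items()):
        state = "set" if bit_is_set(STATUS, bit) else "clear"
        print(f"bit{bit} ({name}, assumed position): {state}")
```

If the assumed flag positions hold, the "uncorrected error" bit is clear while
the "corrected ecc error" bit is set, which fits the reading that this event
was corrected by the hardware rather than being fatal.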

> - I'm not good enough on this to tell you if 
> it's the CPU, or the motherboard, but it's one of the two, *not* just
> memory.

What do you base that on? I've seen a lot of different MCE errors being
resolved by finding and replacing flaky DIMMs.

> Second, if you're paying for hosting, and it's *their* server, you 
> need to get on the phone with them *now*, and tell them that they need to
> fix it, yesterday would be preferable. They *should* have seen the logs.
>
> Dunno if you have a physical machine hosted there, or a VM'

I'm quite sure you can't get that kind of MCE-dump inside a VM.

/Peter

> if the latter, 
> they can move it without you seeing any downtime at all. If the former,
> they can just hot swap the drives into another server.
>
> But call them *NOW*. You're paying for the service.
>
>         mark