[CentOS] kernel: Machine check events logged

Wed Jul 7 15:44:20 UTC 2010
Peter Kjellstrom <cap at nsc.liu.se>

On Wednesday 07 July 2010, m.roth at 5-cent.us wrote:
> Peter Kjellstrom wrote:
> > On Wednesday 07 July 2010, m.roth at 5-cent.us wrote:
> >> Alexander Farber wrote:
...
> >> > MISC c008000001000000 ADDR 1148f5940
> >> >   Northbridge NB Array Error
> >> >        bit35 = err cpu3
> >> >        bit42 = L3 subcache in error bit 0
> >> >        bit43 = L3 subcache in error bit 1
> >> >        bit46 = corrected ecc error
> >> >        bit59 = misc error valid
> >> >   memory/cache error 'generic read mem transaction, generic
> >> > transaction, level generic'
> >> > STATUS 9c1f4cf8001c011b MCGSTATUS 0
> >> > No DIMM found for 1148f5940 in SMBIOS
...
> >> - I'm not good enough on this to tell you if
> >> it's the CPU, or the motherboard, but it's one of the two, *not* just
> >> memory.
> >
> > What do you base that on? I've seen a lot of different MCE-errors being
> > resolved by finding and replacing flaky dimms.
>
> Because it says NB Array error, and errors in the L3 subcache. I've seen
> enough memory errors, and not seen an NB array & subcache error.

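That does sound like a reasonable guess. However, you presented it as absolute
truth. The MCE could just as easily be read the other way: it was reported by
the NB (northbridge) bank, i.e. not by the IC/DC/BU banks (instruction cache,
data cache, bus unit), which would point at actual RAM.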

Given that real-world figures show bad RAM to be a lot more likely than a bad
CPU, I'd start by looking at the DIMMs (or at the very least not exclude
them...).
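
As a sanity check, the "bitNN" lines in the quoted mcelog output can be
compared against the raw STATUS value. The following is only a minimal sketch,
assuming mcelog numbers bits from the least significant bit (bit 0); the names
STATUS, MCELOG_LABELS and set_bits are made up for illustration and are not
part of mcelog.

```python
# Raw value copied from the "STATUS" line of the quoted log.
STATUS = 0x9c1f4cf8001c011b

# Labels copied verbatim from the quoted mcelog output (not derived here).
MCELOG_LABELS = {
    35: "err cpu3",
    42: "L3 subcache in error bit 0",
    43: "L3 subcache in error bit 1",
    46: "corrected ecc error",
    59: "misc error valid",
}


def set_bits(value):
    """Return the positions of all 1-bits, counting from bit 0 (the LSB)."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


bits = set_bits(STATUS)
for bit, label in sorted(MCELOG_LABELS.items()):
    state = "set" if bit in bits else "clear"
    print(f"bit{bit} = {label}: {state}")
```

All five bits come out as set, so the decoded lines agree with the raw value.
That only confirms what was logged; it does not say which part is faulty.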

> I do just note that there's "No DIMM found for ... in SMBIOS", but I
> assume that's just a bank that's not filled.

Or the SMBIOS data is borked; it wouldn't be the first time...

/Peter