[CentOS] kernel: Machine check events logged

Wed Jul 7 14:26:11 UTC 2010
m.roth at 5-cent.us <m.roth at 5-cent.us>

Peter Kjellstrom wrote:
> On Wednesday 07 July 2010, m.roth at 5-cent.us wrote:
>> Alexander Farber wrote:
>> > every few hours I get the following message in /var/log/messages:
>> > Jul  5 20:23:28 hXXX kernel: Machine check events logged
> ...
>> > MCE 0
>> > HARDWARE ERROR. This is *NOT* a software problem!
>> > Please contact your hardware vendor
>> > CPU 0 4 northbridge TSC 111a60c5584d4 [at 2500 Mhz 1 days 9:25:51
>> > uptime (unreliable)]
>> > MISC c008000001000000 ADDR 1148f5940
>> >   Northbridge NB Array Error
>> >        bit35 = err cpu3
>> >        bit42 = L3 subcache in error bit 0
>> >        bit43 = L3 subcache in error bit 1
>> >        bit46 = corrected ecc error
>> >        bit59 = misc error valid
>> >   memory/cache error 'generic read mem transaction, generic
>> > transaction, level generic'
>> > STATUS 9c1f4cf8001c011b MCGSTATUS 0
>> > No DIMM found for 1148f5940 in SMBIOS
> ...
<snip>
>> - I'm not good enough on this to tell you if
>> it's the CPU, or the motherboard, but it's one of the two, *not* just
>> memory.
>
> What do you base that on? I've seen a lot of different MCE-errors being
> resolved by finding and replacing flaky dimms.

Because it says "Northbridge NB Array Error" and flags the L3 subcache.
I've seen plenty of memory errors, and none of them were reported as an NB
array and subcache error.
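
For what it's worth, the flagged bits can be cross-checked against the raw
STATUS word from the quoted log. This is only a rough sketch: the bit labels
are copied from the mcelog output above, not from a CPU datasheet, and the
helper names (bit_is_set, set_bits) are made up for illustration.

```python
# STATUS value copied from the quoted mcelog output.
STATUS = 0x9c1f4cf8001c011b

# Labels exactly as mcelog printed them in the quoted log (assumption:
# they are used here as-is, not verified against vendor documentation).
REPORTED_BITS = {
    35: "err cpu3",
    42: "L3 subcache in error bit 0",
    43: "L3 subcache in error bit 1",
    46: "corrected ecc error",
    59: "misc error valid",
}


def bit_is_set(value, bit):
    """Return True if the given bit position is 1 in value."""
    return (value >> bit) & 1 == 1


def set_bits(value, width=64):
    """Return the positions of all bits that are 1, lowest first."""
    return [b for b in range(width) if bit_is_set(value, b)]


# Report whether each bit named by mcelog is really set in STATUS.
for bit, label in sorted(REPORTED_BITS.items()):
    state = "set" if bit_is_set(STATUS, bit) else "clear"
    print("bit%d = %s: %s" % (bit, label, state))
```

This only confirms that mcelog's decode matches the STATUS value; it says
nothing about which part is actually failing.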

I do note the "No DIMM found for ... in SMBIOS" line, but I assume that's
just a bank that isn't populated.

         mark