[CentOS] Interpretation of a hardware error

Fri Apr 13 09:42:13 UTC 2012
Peter Kjellström <cap at nsc.liu.se>

On Thursday 12 April 2012 13.36.03 m.roth at 5-cent.us wrote:
> Hey, folks,
> 
> I've just started seeing
> Apr 12 13:09:59 <server> kernel: [Hardware Error]:
> MC4_STATUS[Over|CE|MiscV|-|AddrV|-|Poison|CECC]: 0xdd0accf2001d011b
> Apr 12 13:09:59 <server> kernel: [Hardware Error]: Northbridge Error (node
> 1, core 1): ECC error in L3 cache tag.

The error message certainly points to the CPU. The fact that the error 
happened in the cache tag, not the cache data, further implicates the CPU.

The message is quite specific and I'd say rather trustworthy...
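
For reference, the flag list in the first log line can be cross-checked 
against the raw status value. What follows is only a minimal sketch in 
Python, not the kernel's decoder. It assumes the architectural x86 
MCi_STATUS bit positions (VAL=63, OVER=62, UC=61, EN=60, MISCV=59, 
ADDRV=58, PCC=57) and, as a further assumption based on the Linux kernel's 
AMD MCE decoding, that bits 46 and 45 mean CECC and UECC. The name 
decode_mci_status is a hypothetical helper, not an existing tool.

```python
# Sketch: decode the flag bits of an x86 MCi_STATUS register value.
# Bit positions 63..57 are the architectural ones; CECC/UECC (46/45)
# are an assumption taken from how the Linux AMD MCE decoder prints them.
STATUS_BITS = {
    "VAL": 63,    # register contents are valid
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error (clear means corrected, "CE")
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MCi_MISC holds valid data
    "ADDRV": 58,  # MCi_ADDR holds a valid address
    "PCC": 57,    # processor context corrupt
    "CECC": 46,   # assumed: error corrected by ECC
    "UECC": 45,   # assumed: error not correctable by ECC
}


def decode_mci_status(status):
    """Return a dict mapping each flag name to True/False."""
    return {name: bool((status >> bit) & 1) for name, bit in STATUS_BITS.items()}


# Status value from the log line quoted above.
status = 0xdd0accf2001d011b
flags = decode_mci_status(status)

# List only the flags that are set.
print("|".join(name for name, is_set in flags.items() if is_set))
```

With the value from the log, UC and PCC come out clear and OVER, MISCV, 
ADDRV and CECC come out set, which agrees with the 
"Over|CE|MiscV|-|AddrV|...|CECC" string the kernel printed.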

But there's also the possibility that the message is wrong (either something 
else went wrong or nothing really went wrong). In my experience hardware fault 
error messages are quite unreliable, and at the end of the day DIMMs are 
orders of magnitude more likely to fail than CPUs...

/Peter