[CentOS] New kernel causes hardware error?

Tue Jun 22 07:40:36 UTC 2010
Peter Kjellstrom <cap at nsc.liu.se>

On Tuesday 22 June 2010, John R Pierce wrote:
> On 06/22/10 12:21 AM, Peter Kjellstrom wrote:
> > On Tuesday 22 June 2010, Eric Deis wrote:
> >> I have recently upgraded to 2.6.18-194.3.1.el5 and within several days
> >> the machine crashed with the following error (repeating in mcelog):
> >
> > I'm guessing the old kernel just didn't notice.
> >
> > The MCEs below indicate bad hardware. Since the DIMMs are a lot easier to
> > debug, I'd suggest you start there (but it could be the system board too).
> > Try running with half your DIMMs, then the other half.
>
> And on Nehalem (Xeon 5500, 5600), the memory controller is in the CPUs,
> so they are suspect too.

In theory, yes. But while we've replaced many DIMMs and some system boards, I
don't think we've replaced a single (Nehalem-type) CPU (this was observed over
~10000 CPU-months).

/Peter