[CentOS] New kernel causes hardware error?

Tue Jun 22 07:27:02 UTC 2010
John R Pierce <pierce at hogranch.com>

On 06/22/10 12:21 AM, Peter Kjellstrom wrote:
> On Tuesday 22 June 2010, Eric Deis wrote:
>    
>> I have recently upgraded to 2.6.18-194.3.1.el5 and within several days
>> the machine crashed with the following error (repeating in mcelog):
>>      
> I'm guessing the old kernel just didn't notice.
>
> The below MCEs indicate bad hardware. Since the DIMMs are a lot easier to
> debug I'd suggest you start there (but it could be the systemboard too). Try
> running with half your DIMMs, then the other half.
>    

and on Nehalem (Xeon 5500, 5600), the memory controller is in the CPUs, 
so they are suspect too.

first, however, i'd see if there's a BIOS flash upgrade for the 
mainboard.  these sometimes have microcode fixes for various specific 
Intel CPUs, and also may have updated memory timing parameters.
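
for what it's worth, the "half the DIMMs, then the other half" test Peter 
suggested is just a bisection, and can be sketched roughly as below.  this 
is only an illustration under some assumptions: exactly one DIMM is bad, 
the errors reproduce reliably, and the board will actually boot with each 
reduced set (check the population rules for your board).  the slot names 
and the errors_with() callback are hypothetical; in practice that callback 
is you pulling DIMMs, running the box for a while, and checking mcelog.

```python
from typing import Callable, List, Optional, Sequence


def find_bad_dimm(
    slots: Sequence[str],
    errors_with: Callable[[List[str]], bool],
) -> Optional[str]:
    """Narrow down a single faulty DIMM by repeatedly halving the installed set.

    errors_with(installed) is a hypothetical callback: run the machine with
    only those DIMMs installed and return True if machine checks showed up.
    Returns None when no subset reproduces the errors, which points away
    from the DIMMs (CPU memory controller, board, or slot).
    """
    candidates = list(slots)
    if not candidates:
        return None

    while len(candidates) > 1:
        mid = len(candidates) // 2
        first, second = candidates[:mid], candidates[mid:]
        if errors_with(first):
            candidates = first
        elif errors_with(second):
            candidates = second
        else:
            # Neither half fails on its own: the single-bad-DIMM assumption
            # does not hold, so stop rather than guess.
            return None

    # Confirm the last remaining DIMM really does fail by itself.
    return candidates[0] if errors_with(candidates) else None


if __name__ == "__main__":
    # Simulated run: slot names are made up, and "DIMM_B2" plays the bad module.
    slots = ["DIMM_A1", "DIMM_A2", "DIMM_A3", "DIMM_B1", "DIMM_B2", "DIMM_B3"]
    print(find_bad_dimm(slots, lambda installed: "DIMM_B2" in installed))
```

if it comes back with nothing, that is when the CPUs and the board move up 
the suspect list.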