On 06/22/10 12:21 AM, Peter Kjellstrom wrote:
On Tuesday 22 June 2010, Eric Deis wrote:
I have recently upgraded to 2.6.18-194.3.1.el5 and within several days the machine crashed with the following error (repeating in mcelog):
I'm guessing the old kernel just didn't notice.
The MCEs below indicate bad hardware. Since the DIMMs are a lot easier to debug, I'd suggest you start there (but it could be the systemboard too). Try running with half your DIMMs, then the other half.
And on Nehalem (Xeon 5500, 5600), the memory controller is in the CPUs, so they are suspect too.
First, however, I'd see if there's a BIOS flash upgrade for the mainboard. These sometimes include microcode fixes for specific Intel CPUs, and may also have updated memory timing parameters.
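
For what it's worth, the "half the DIMMs, then the other half" suggestion is just a bisection, and it can be sketched roughly as below. This is only an illustration: `find_bad_dimm`, the slot names and the `errors_with` callback are hypothetical. `errors_with` stands in for the manual step of powering down, leaving only the listed DIMMs installed, running the machine for a while and checking mcelog. The sketch assumes exactly one faulty DIMM, errors that reproduce reliably, and a board that accepts the reduced population; if the fault is actually in a CPU's memory controller or on the board, the result will be misleading.

```python
from typing import Callable, List, Optional, Sequence


def find_bad_dimm(
    slots: Sequence[str],
    errors_with: Callable[[List[str]], bool],
) -> Optional[str]:
    """Narrow down a single faulty DIMM by repeatedly halving the installed set.

    errors_with(subset) is a hypothetical callback: it should return True if
    MCEs show up while only the DIMMs in `subset` are installed.
    """
    candidates = list(slots)
    # If nothing fails with everything installed, there is nothing to bisect.
    if not candidates or not errors_with(candidates):
        return None
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first_half = candidates[:mid]
        # Errors with only the first half installed: the fault is in that half.
        if errors_with(first_half):
            candidates = first_half
        else:
            # Otherwise (under the single-fault assumption) it is in the rest.
            candidates = candidates[mid:]
    return candidates[0]
```

With six DIMMs this takes two or three reboots rather than six, which matters if each test run has to last a few days before the errors show up.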