On Tuesday 22 June 2010, John R Pierce wrote:
On 06/22/10 12:21 AM, Peter Kjellstrom wrote:
On Tuesday 22 June 2010, Eric Deis wrote:
I have recently upgraded to 2.6.18-194.3.1.el5 and within several days the machine crashed with the following error (repeating in mcelog):
I'm guessing the old kernel just didn't notice.
The MCEs below indicate bad hardware. Since the DIMMs are a lot easier to debug, I'd suggest you start there (but it could be the system board too). Try running with half your DIMMs, then the other half.
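To make the "half, then the other half" idea concrete, here is a rough sketch of it as a bisection. It assumes exactly one bad DIMM and that the fault follows the DIMM rather than the slot, board or CPU. The slot names and the errors_seen() callback are made up for illustration; in real life that "callback" is you installing only the listed DIMMs, running the box for a few days and checking mcelog.

```python
from typing import Callable, List, Sequence


def find_bad_dimm(
    dimms: Sequence[str],
    errors_seen: Callable[[List[str]], bool],
) -> str:
    """Narrow a set of DIMMs down to the single faulty one by halving.

    errors_seen(installed) is a hypothetical check: it should return True
    if MCEs show up while only the DIMMs in `installed` are fitted.
    Assumes exactly one faulty DIMM among `dimms`.
    """
    candidates = list(dimms)
    if not candidates:
        raise ValueError("need at least one DIMM to test")

    while len(candidates) > 1:
        middle = len(candidates) // 2
        first_half = candidates[:middle]
        second_half = candidates[middle:]
        # Run with the first half only; if the errors come back, the bad
        # DIMM is in that half, otherwise it must be in the other half.
        candidates = first_half if errors_seen(first_half) else second_half

    return candidates[0]


if __name__ == "__main__":
    # Simulated run: pretend the DIMM labelled "B2" is the faulty one.
    labels = ["A1", "A2", "A3", "B1", "B2", "B3"]
    print(find_bad_dimm(labels, lambda installed: "B2" in installed))
```

With six DIMMs this takes three test runs instead of up to six. If no half ever shows errors, or both do, the single-bad-DIMM assumption is wrong and the board or CPU becomes the suspect.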
And on Nehalem (Xeon 5500, 5600), the memory controller is in the CPUs, so they are suspect too.
In theory, yes. But while we've replaced many DIMMs and some system boards, I don't think we've replaced a single (Nehalem-type) CPU (this observed over ~10000 CPU-months).
/Peter