[CentOS] New kernel causes hardware error?

Tue Jun 22 22:08:16 UTC 2010
Eric Deis <nospamthankyou at anemone.cx>

Thanks Guys!

Your advice helped me fix the problem.

Yes, it was the motherboard that was the issue. I updated the firmware, 
and it must have included some microcode fixes to support my CPU (John 
mentioned that the memory controller is in the CPU for Xeon 5500).

Now, after rebooting into 2.6.18-194.3.1.el5, mcelog reports no errors.

I will do some further testing, but I think I'm in the clear.


Thank you so much! I spent hours googling for a solution to this and 
couldn't find the error reported anywhere else. Glad to have some 
people I can turn to for advice.


All the best,
eric



Tsuyoshi Nagata wrote:
> Hi! Eric
> (2010/06/22 13:11), Eric Deis wrote:
>   
>> Transaction: Address/Command error
>>     
>
> It's a motherboard (memory controller) problem.
> It's *not* a DIMM problem (memtest can't detect this error).
> Your data transfers (reads/writes) sometimes hit bit errors.
> This is the Nehalem CPU's error-detection feature (MCE).
>
> Try a new motherboard. If your MB still always shows this error
> with the latest kernel, it's time to buy a certified vendor's
> hardware.
>
> Supermicro's MB is not certified hardware, but
> it just indicates a hardware problem.
>
> Tsuyoshi.
>