[CentOS] Cant find out MCE reason (CPU 35 BANK 8)

Tue Mar 22 13:42:51 UTC 2011

Vladimir Budnev wrote:
> 2011/3/21 <m.roth at 5-cent.us>
>> Vladimir Budnev wrote:
>> > Hello community.
>> >
>> > We are running CentOS 4.8 on a SuperMicro SYS-6026T-3RF with 2x Intel
>> > Xeon E5630 and 8x Kingston KVR1333D3D4R9S/4G.
>> >
>> > For some time we have been getting lots of MCEs in mcelog and we can't
>> > find out the reason.
>>
>> The only thing that shows there (when it shows, since sometimes it
>> doesn't seem to) is a hardware error. You *WILL* be replacing hardware,
>> sometime soon, like yesterday.
<snip>
>> Bad news: you have *two* DIMMs failing, one associated with the physical
>> CPU that has core 53, and another associated with the physical CPU that
>> has cores 32-35.
<snip>
> Last night we did some research to find out which RAM modules are faulty.
>
> Note that we have 8 modules, 4G each.
<snip>
> Finally we placed the last 2 modules... and no errors. Note that at that
> step we had exactly the same module placement as before the experiment.
>
> It sounds strange, but at first glance it looks like something was wrong
> with the module placement. But we can't work out why the problem didn't
> show during the first days, even a month, of the server running. No one
> touched the server hardware, so I have no idea what that was.
>
> Now we are just waiting to see whether the errors come back.

I'm sure there will. Reseating the memory may have done something, but
there will, I'll wager.
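
If they do come back, it may help to know which logical CPUs sit on which
physical package, since each bank of DIMM slots hangs off one socket. Below
is a rough sketch, not a tested tool: it assumes /proc/cpuinfo on your kernel
has "processor" and "physical id" fields, and the function names are made up
for illustration. The CPU number mcelog prints may not line up with these
"processor" numbers on every kernel, so treat the result as a starting point.

```python
from collections import defaultdict


def parse_cpuinfo(text):
    """Split /proc/cpuinfo text into one dict per logical processor."""
    cpus = []
    # Each logical processor is a block of "key : value" lines,
    # and blocks are separated by a blank line.
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" in fields:
            cpus.append(fields)
    return cpus


def cpus_by_package(cpus):
    """Map each 'physical id' (socket) to its sorted logical CPU numbers."""
    packages = defaultdict(list)
    for cpu in cpus:
        # Assumption: a missing 'physical id' is reported as package -1.
        pkg = int(cpu.get("physical id", -1))
        packages[pkg].append(int(cpu["processor"]))
    return {pkg: sorted(nums) for pkg, nums in packages.items()}
```

To use it, pass in the contents of /proc/cpuinfo, e.g.
cpus_by_package(parse_cpuinfo(open("/proc/cpuinfo").read())), and compare the
package numbers against the CPU numbers in the MCE records.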

Here's a question out of left field: who was the manufacturer of the 4G
DIMMs? Not Supermicro, but the DIMMs themselves?

        mark