[CentOS] Cant find out MCE reason (CPU 35 BANK 8)

Tue Mar 22 14:15:29 UTC 2011
m.roth at 5-cent.us <m.roth at 5-cent.us>

Vladimir Budnev wrote:
> 2011/3/22 <m.roth at 5-cent.us>
>> Vladimir Budnev wrote:
>> > 2011/3/21 <m.roth at 5-cent.us>
>> >> Vladimir Budnev wrote:
>> >> >
>> >> > We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF with 2xIntel
>> >> > Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G
>> >> >
>> >> > For some time we have lots of MCE in mcelog and we cant find out
>> >> > the reason.
>> >>
>> >> The only thing that shows there (when it shows, since sometimes it
>> >> doesn't seem to) is a hardware error. You *WILL* be replacing
>> >> hardware, sometime soon, like yesterday.
>> <snip>
>> >> Bad news: you have *two* DIMMs failing, one associated with the
>> >> physical CPU that has core 53, and another associated with the
>> >> physical CPU that has cores 32-35.
>> <snip, memory reseating>
>> > Now we are just waiting will there be errors again.
>>
>> I'm sure there will. Reseating the memory may have done something, but
>> there will, I'll wager.
>
> mark, you are absolutely right :) Approximately 1h ago errors appeared.
> They appeared only once since reboot, but they are back. Hi there :(
>
> The good news is that the CPU numbers changed, so now we have cpu 1,2,3 and
> 18,19,20,21. We definitely moved the "broken" modules to other slots.
> Anyway, a bad DIMM is really good news for us compared with, e.g., a bad
> motherboard.
<snip>
> Is it possible to determine which physical DIMMs correspond to the CPUs
> reported in the MCE messages? We have two rows of slots (6 slots per row),
> one for cpu1 and the second for cpu2. The used slots are marked as
> cpu1-a1,cpu1-a2,cpu1-a3,cpu1-b1 and cpu2-a1,cpu2-a2,cpu2-a3,cpu2-b1.
>
> I remember that you advised dividing the cpu number by the physical core
> count. We have 2 quad-core processors, so 8 cpus. 1/8=0. Is that the cpu-a1
> slot, or does it depend on the situation? I hope we will find those
> bastards ourselves, but a hint would be great.
>
> And one more thing I can't understand ... if there are, say, 8 "cpu numbers"
> per memory module (in our situation), why do we see only 4 numbers and
> not 8, e.g. 0,1,2,3,4,5,6,7?

I'm now confused about a lot: originally, you mentioned 53 - 57, was it?
That doesn't add up, since you say you have 2 quad core processors, for a
total of 8 cpus, and each of those processors has 6 banks, which would
mean each processor should only see six (directly). Where I'm confused is
how you could have cores 32-35, or 53-whatsit, when you only have 8 cores
in two processors.
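
One way to untangle the numbering is to ask the kernel which logical CPUs
sit in which socket. The following is only a minimal sketch, assuming an
x86 Linux /proc/cpuinfo that carries "processor" and "physical id" fields;
the function name cpus_by_package is made up for illustration, and the CPU
number that mcelog prints may not be the same index as "processor", so
treat the result as a hint rather than a definitive mapping.

```python
def cpus_by_package(cpuinfo_text):
    """Group logical CPU numbers by socket ("physical id").

    cpuinfo_text is the full text of /proc/cpuinfo. Assumes each CPU
    stanza lists "processor" before its "physical id" line.
    """
    packages = {}
    cpu = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between stanzas
        key = key.strip()
        value = value.strip()
        if key == "processor":
            cpu = int(value)
        elif key == "physical id" and cpu is not None:
            packages.setdefault(int(value), []).append(cpu)
    return packages
```

Feeding it open("/proc/cpuinfo").read() and printing the result shows how
many logical CPUs the kernel actually sees per socket, which is the number
to compare against whatever mcelog reports.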
>
>> Here's a question out of left field: who was the manufacturer of the 4G
>> DIMMs? Not Supermicro, but the DIMMs themselves?
>>
> This is Kingston KVR1333D3D4R9S/4G, if I understood the question

Oh, ok. I was wondering if they were Hynix - I've seen a good number of
bad 4G and 8G DIMMs from them recently, and that's across three different
OEMs and DIMM models.

         mark