[CentOS] Cant find out MCE reason (CPU 35 BANK 8)

Tue Mar 22 14:43:28 UTC 2011
m.roth at 5-cent.us <m.roth at 5-cent.us>

Vladimir Budnev wrote:
> 2011/3/22 <m.roth at 5-cent.us>
>> Vladimir Budnev wrote:
>> > 2011/3/22 <m.roth at 5-cent.us>
>> >> Vladimir Budnev wrote:
>> >> > 2011/3/21 <m.roth at 5-cent.us>
>> >> >> Vladimir Budnev wrote:
>> >> >> >
>> >> >> > We are running CentOS 4.8 on SuperMicro SYS-6026T-3RF with
>> >> >> > 2xIntel Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G
>> >> >> >
>> >> >> > For some time we have had lots of MCEs in mcelog and we can't
>> >> >> > find out the reason.
>> >> >>
>> >> >> The only thing that shows there (when it shows, since sometimes it
>> >> >> doesn't seem to) is a hardware error. You *WILL* be replacing
>> >> >> hardware, sometime soon, like yesterday.
>> >> <snip>
>> > We have 2 quad core proc, so 8 cpu. 1/8=0 Is it the cpu-a1 slot, or
>> > does it depend on the situation? I hope we will find those bastards
>> > ourselves, but a hint would be great.
>> >
>> > And one more thing I can't understand ... if there are, say, 8 "cpu
>> > numbers" per memory module (in our situation), why do we see only 4
>> > numbers and not 8, e.g. 0,1,2,3,4,5,6,7?
>>
>> I'm now confused about a lot: originally, you mentioned 53 - 57, was it?
>> That doesn't add up, since you say you have 2 quad-core processors, for
>> a total of 8 cpus, and each of those processors has 6 banks, which would
>> mean each processor should only see six (directly). Where I'm confused
>> is how you could have cores 32-35, or 53-whatever, when you only have 8
>> cores in two processors.
>
>  2 CPUs, each with 8 cores and HT support. So 16 at max, I think. Is
> that right?

Huh? Above, you say "2 quad core proc" - that's 8 cores over two processor
chips. HT support doesn't figure into it; if you use dmidecode or lshw, I
believe it will show you 8 cores, not 16.

>  I've really lost the thread with those cpu-to-memory-bank mappings...

Each processor will directly see the DIMMs associated with it, so the
banks associated with each processor are what directly affect its
cores. So, if you see something like
Mar 20 05:01:35 <system name> kernel:  Northbridge Error, node 0, core: 5
(these processors are 8-core), it means that one of the DIMMs in bank 0,
0-3, is bad.
You should see
       __
      |_0|  0 1 2 3
                 __
                |_1|  0 1 2 3

or whatever on the m/b, so one of the top ones there is affected. Is that
any clearer?
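The node-to-DIMM reasoning above can be sketched in code. The regex and the helper names below are my own assumptions for illustration, based on the example kernel line quoted above, not part of mcelog itself:

```python
import re

# Matches kernel lines like:
#   "... kernel:  Northbridge Error, node 0, core: 5"
LINE_RE = re.compile(r"Northbridge Error, node (\d+), core: (\d+)")

def mce_node_core(line):
    """Return (node, core) from a Northbridge Error line, or None."""
    m = LINE_RE.search(line)
    if not m:
        return None
    return int(m.group(1)), int(m.group(2))

def suspect_dimm_slots(node, dimms_per_node=4):
    """Node N's memory controller owns its own DIMM slots, so an error
    reported against node N points at one of that node's DIMMs
    (slots 0..dimms_per_node-1). Four DIMMs per node is an assumption
    here; check your board manual for the real slot layout."""
    return [(node, slot) for slot in range(dimms_per_node)]

line = "Mar 20 05:01:35 host kernel:  Northbridge Error, node 0, core: 5"
node, core = mce_node_core(line)
print(node, core)               # node and core from the log line
print(suspect_dimm_slots(node)) # candidate DIMM slots on that node
```

So for the example line, node 0 is implicated, and the faulty DIMM is one of the four slots belonging to node 0 (the top group in the diagram).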

       mark