[CentOS] Cant find out MCE reason (CPU 35 BANK 8)

Tue Mar 22 14:59:29 UTC 2011
Vladimir Budnev <vladimir.budnev at gmail.com>

2011/3/22 <m.roth at 5-cent.us>

> Vladimir Budnev wrote:
> > 2011/3/22 <m.roth at 5-cent.us>
> >> Vladimir Budnev wrote:
> >> > 2011/3/22 <m.roth at 5-cent.us>
> >> >> Vladimir Budnev wrote:
> >> >> > 2011/3/21 <m.roth at 5-cent.us>
> >> >> >> Vladimir Budnev wrote:
> >> >> >> >
> >> >> >> > We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF with
> >> >> >> > 2xIntel Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G
> >> >> >> >
> >> >> >> > For some time we have lots of MCE in mcelog and we cant find out
> >> >> >> > the reason.
> >> >> >>
> >> >> >> The only thing that shows there (when it shows, since sometimes it
> >> >> >> doesn't seem to) is a hardware error. You *WILL* be replacing
> >> >> >> hardware, sometime soon, like yesterday.
> >> >> <snip>
> >> > We have 2 quad core proc, so 8 cpu. 1/8=0 Is it cpu-a1 slot or
> >> > depends on situation? I hope we will find those bastards ourselves
> >> > but hint would be great.
> >> >
> >> > And one more thing i cant understand ... if there is, say, 8 "cpu
> >> > numbers" per each memory module (in our situation), why we see only 4
> >> > numbers and not 8 e.g. 0,1,2,3,4,5,6,7 ?
> >>
> >> I'm now confused about a lot: originally, you mentioned 53 - 57, was it?
> >> That doesn't add up, since you say you have 2 quad core processors, for
> >> a total of 8 cpus, and each of those processors have 6 banks, which
> >> would mean each processor should only see six (directly). Where I'm
> >> confused is how you could have cores 32-35, or 53-whatsit, when you
> >> only have 8 cores in two processors.
> >
> >  2 cpu each 8 cores and HT support. So 16 at max i think. for such way is
> > it  ok?
>
> Huh? Above, you say "2 quad core proc" - that's 8 cores over two processor
> chips. HT support doesn't figure into it; if you use dmidecode or lshw, I
> believe it will show you 8 cores, not 16.
>
That was a typo, sorry. There are 2 CPUs with 4 cores each, so 8 cores in total.
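
To settle the sockets / cores / Hyper-Threading question, the three numbers can be counted separately. The following is a minimal sketch, assuming the kernel exposes the `processor`, `physical id` and `core id` fields in /proc/cpuinfo (older kernels may not report all of them). The function name `count_cpus` is made up for illustration.

```python
def count_cpus(cpuinfo_text):
    """Return (sockets, physical cores, logical CPUs) from /proc/cpuinfo text.

    Assumes each per-CPU block lists "physical id" before "core id".
    """
    sockets = set()
    cores = set()
    logical = 0
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1                      # one entry per logical CPU
        elif key == "physical id":
            physical_id = value               # socket this logical CPU sits in
            sockets.add(value)
        elif key == "core id":
            cores.add((physical_id, value))   # core ids repeat across sockets
    return len(sockets), len(cores), logical
```

On the machine itself it could be called as `count_cpus(open("/proc/cpuinfo").read())`. If the logical count is twice the core count, Hyper-Threading is enabled.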


> >  I really lost the idea line with those cpu to memory bank mappings...
>
> Each processor will directly see the DIMMs associated with it, so that the
> banks associated with each processor will be what directly affects the
> cores. So, if you see something like
> Mar 20 05:01:35 <system name> kernel:  Northbridge Error, node 0, core: 5
> (these processors are 8-core), it means that one of the DIMMs in bank 0,
> 0-3, is bad.
> You should see
>       __
>      |_0|  0 1 2 3
>                 __
>                |_1|  0 1 2 3
>
> or whatever on the m/b, so one of the top ones there is affected. Is that
> any clearer?

First of all, big thanks for helping, Mark.

In your example everything is clear. But I am lost with what we have.
Previously we received messages like the one I posted in the first mail:
CPU 51 BANK 8 TSC 8511e3ca77dc
MISC 274d587f00006141 ADDR 807044840
STATUS cc0055000001009f MCGSTATU
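
The STATUS word in a record like this can be unpacked by hand. Below is a rough sketch, assuming the value follows the architectural IA32_MCi_STATUS layout from the Intel SDM (flag bits 63 down to 57, model-specific code in bits 31:16, MCA error code in bits 15:0) and that memory-controller errors use the compound code pattern 0000 0000 1MMM CCCC. The function names are invented for illustration, and the text labels are paraphrased from the SDM, not taken from mcelog.

```python
# Assumed bit positions of the architectural flags in IA32_MCi_STATUS.
STATUS_FLAGS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
                "MISCV": 59, "ADDRV": 58, "PCC": 57}

# Assumed meaning of the MMM field in a memory-controller error code.
MEMORY_OPS = {0: "generic undefined request", 1: "memory read error",
              2: "memory write error", 3: "address/command error",
              4: "memory scrubbing error"}


def decode_mci_status(status):
    """Split a 64-bit STATUS value into flags and error codes."""
    decoded = {name: bool((status >> bit) & 1)
               for name, bit in STATUS_FLAGS.items()}
    decoded["mca_error_code"] = status & 0xFFFF
    decoded["model_specific_code"] = (status >> 16) & 0xFFFF
    return decoded


def describe_memory_controller_code(code):
    """Return (operation, channel) if code matches 0000 0000 1MMM CCCC, else None."""
    if (code & 0xFF80) != 0x0080:
        return None
    operation = MEMORY_OPS.get((code >> 4) & 0x7, "reserved")
    channel = code & 0xF
    # Channel value 1111 means the channel is not specified.
    return operation, (None if channel == 0xF else channel)


if __name__ == "__main__":
    fields = decode_mci_status(0xcc0055000001009f)
    print(fields)
    print(describe_memory_controller_code(fields["mca_error_code"]))
```

Under those assumptions, the value above has UC clear (a corrected error), OVER set (earlier errors were overwritten before being read), and an error code of 0x009f, which fits the memory-controller pattern with no channel specified.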

And the CPU numbers were always the same. I really don't know why mcelog
shows such numbers, but that's what we have. It was always BANK 8, with
32,33,34,45 and 50,51,52,53 in the CPU field.

You convinced us that it is a DIMM problem, and we decided to do a little
research, which I described earlier in the thread. During that we swapped DIMM
modules between slots, so now we have BANK 8 and cpu 1,2,3 and 18,29,20,21.
It really seems that those numbers are somehow connected with the RAM modules.

But... as I said, we have the following slots:
   CPU1    cpu1-a1 cpu1-a2 cpu1-a3 cpu1-b1 cpu1-b2 cpu1-b3
   CPU2    cpu2-a1 cpu2-a2 cpu2-a3 cpu2-b1 cpu2-b2 cpu2-b3

We have the modules placed this way (V marks a populated slot):
+------+---------+---------+---------+---------+---------+---------+
|      |    V    |    V    |    V    |    V    |  free   |  free   |
+------+---------+---------+---------+---------+---------+---------+
| CPU1 | cpu1-a1 | cpu1-a2 | cpu1-a3 | cpu1-b1 | cpu1-b2 | cpu1-b3 |
+------+---------+---------+---------+---------+---------+---------+

+------+---------+---------+---------+---------+---------+---------+
|      |    V    |    V    |    V    |    V    |  free   |  free   |
+------+---------+---------+---------+---------+---------+---------+
| CPU2 | cpu2-a1 | cpu2-a2 | cpu2-a3 | cpu2-b1 | cpu2-b2 | cpu2-b3 |
+------+---------+---------+---------+---------+---------+---------+

There is definitely something going on with the memory banks, because swapping
modules changed the MCE messages, but what exactly... or have I interpreted it
all wrong?