2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/21 m.roth@5-cent.us
> Vladimir Budnev wrote:
>> We are running CentOS 4.8 on a SuperMicro SYS-6026T-3RF with
>> 2x Intel Xeon E5630 and 8x Kingston KVR1333D3D4R9S/4G.
>>
>> For some time we have had lots of MCEs in mcelog and we can't find
>> out the reason.
>
> The only thing that shows there (when it shows, since sometimes it
> doesn't seem to) is a hardware error. You *WILL* be replacing
> hardware, sometime soon, like yesterday.
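(As an aside, a minimal way to look at what mcelog has collected,
assuming the stock CentOS 4.x setup where the mcelog cron job appends
decoded records to /var/log/mcelog; the path and file names below are
the common defaults, not verified on this exact box:)

   # Decoded machine-check records collected so far:
   less /var/log/mcelog

   # A raw "CPU n BANK m ..." record saved to a file can be re-decoded:
   mcelog --ascii < /tmp/mce-record.txt   # /tmp/mce-record.txt is illustrative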
<snip>
We have 2 quad-core procs, so 8 CPUs. 1/8=0. Is that the cpu-a1 slot, or
does it depend on the situation? I hope we will find those bastards
ourselves, but a hint would be great.
And one more thing I can't understand... if there are, say, 8 "CPU
numbers" per memory module (in our situation), why do we see only 4
numbers and not 8, e.g. 0,1,2,3,4,5,6,7?
I'm now confused about a lot: originally, you mentioned 53-57, was it?
That doesn't add up, since you say you have 2 quad-core processors, for a
total of 8 CPUs, and each of those processors has 6 banks, which would
mean each processor should only see six (directly). Where I'm confused is
how you could have cores 32-35, or 53-whatsit, when you only have 8 cores
in two processors.
2 CPUs, each with 8 cores and HT support. So 16 at max, I think. Is it
right counted that way?
Huh? Above, you say "2 quad core proc" - that's 8 cores over two processor chips. HT support doesn't figure into it; if you use dmidecode or lshw, I believe it will show you 8 cores, not 16.
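A quick way to check that, assuming the stock tools are installed
(labels vary with the dmidecode version and what the BIOS reports):

   grep -c ^processor /proc/cpuinfo   # logical CPUs the kernel sees; HT inflates this
   dmidecode -t processor             # per-socket details straight from the SMBIOS tables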
Was a typo, sorry. 2 CPUs, and each one has 4 cores, so 8 cores in total.
I've really lost the thread with those CPU-to-memory-bank mappings...
Each processor will directly see the DIMMs associated with it, so the
banks associated with each processor will be what directly affects the
cores. So, if you see something like
   Mar 20 05:01:35 <system name> kernel: Northbridge Error, node 0, core: 5
(these processors are 8-core), it means that one of the DIMMs in bank 0,
0-3, is bad. You should see
    __
   |_0|  0 1 2 3
    __
   |_1|  0 1 2 3
or whatever on the m/b, so one of the top ones there is affected. Is that
any clearer?
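A sketch of the arithmetic implied there, assuming cores are numbered
linearly across the two packages (an assumption; the kernel doesn't
guarantee this ordering):

   # With N cores per package, integer division picks the node whose
   # bank of DIMMs is suspect. Using Mark's 8-core example:
   cpu=5; cores_per_node=8
   echo "suspect bank $((cpu / cores_per_node))"   # prints 0, i.e. DIMMs 0-3 of bank 0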
First of all, big thanks for helping, Mark.
In your example everything is ok, but I am lost with what we have.
Previously we received messages like I posted in the first mail:
   CPU 51 BANK 8 TSC 8511e3ca77dc MISC 274d587f00006141 ADDR 807044840
   STATUS cc0055000001009f MCGSTATU
And there were always the same CPU numbers. I really don't know why
mcelog shows such numbers, but that's what we have. Always BANK 8, and
the numbers in the CPU field were 32,33,34,35 and 50,51,52,53.
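Just to show the arithmetic that confuses me, assuming the CPU field
were a plain 0-7 core index (which it clearly isn't here):

   # On a 2-socket, 4-core box a plain core index would give socket=cpu/4,
   # but values like 32-35 and 50-53 are far out of range for 8 cores.
   for cpu in 1 2 3 32 33 50; do echo "CPU $cpu -> socket $((cpu / 4))"; done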
You encouraged us that it is a DIMM problem, and we decided to do a
little research, which I described up the thread. During that we replaced
DIMM modules between slots, so now we have BANK 8 and CPU 1,2,3 and
18,19,20,21. It really seems that somehow those numbers are connected
with the RAM modules.
But... as I said, we have the following slots:
   CPU1: cpu1-a1 cpu1-a2 cpu1-a3 cpu1-b1 cpu1-b2 cpu1-b3
   CPU2: cpu2-a1 cpu2-a2 cpu2-a3 cpu2-b1 cpu2-b2 cpu2-b3
We have modules placed in this way ("V" = populated):

+------+---------+---------+---------+---------+---------+---------+
|      |    V    |    V    |    V    |    V    |  free   |  free   |
+------+---------+---------+---------+---------+---------+---------+
| CPU1 | cpu1-a1 | cpu1-a2 | cpu1-a3 | cpu1-b1 | cpu1-b2 | cpu1-b3 |
+------+---------+---------+---------+---------+---------+---------+

+------+---------+---------+---------+---------+---------+---------+
|      |    V    |    V    |    V    |    V    |  free   |  free   |
+------+---------+---------+---------+---------+---------+---------+
| CPU2 | cpu2-a1 | cpu2-a2 | cpu2-a3 | cpu2-b1 | cpu2-b2 | cpu2-b3 |
+------+---------+---------+---------+---------+---------+---------+
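One way to cross-check which physical slot the BIOS maps to which DIMM,
assuming the board's SMBIOS tables are filled in (they are not always
complete):

   # Shows each DIMM socket label ("Locator"), its bank, and whether a
   # module is present ("Size: No Module Installed" for empty slots).
   dmidecode -t memory | egrep -i 'locator|size'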
Definitely there is something with the memory banks, because replacing
modules changed the MCE messages, but what exactly... or have I
interpreted it all wrong?