2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/21 m.roth@5-cent.us
> Vladimir Budnev wrote:
>> We are running CentOS 4.8 on a SuperMicro SYS-6026T-3RF with
>> 2x Intel Xeon E5630 and 8x Kingston KVR1333D3D4R9S/4G.
>>
>> For some time we have had lots of MCEs in mcelog and we can't find
>> out the reason.
>
> The only thing that shows there (when it shows, since sometimes it
> doesn't seem to) is a hardware error. You *WILL* be replacing
> hardware, sometime soon, like yesterday.
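(As an aside, a minimal way to look at what mcelog has collected,
assuming the stock CentOS 4.x setup where the mcelog cron job appends
decoded records to /var/log/mcelog; the path and file names below are
the common defaults, not verified on this exact box:)

   # Decoded machine-check records collected so far:
   less /var/log/mcelog

   # A raw "CPU n BANK m ..." record saved to a file can be re-decoded:
   mcelog --ascii < /tmp/mce-record.txt   # /tmp/mce-record.txt is illustrative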
<snip>
We have 2 quad-core procs, so 8 CPUs. 1/8=0. Is that the cpu-a1 slot, or
does it depend on the situation? I hope we will find those bastards
ourselves, but a hint would be great.
And one more thing I can't understand... if there are, say, 8 "CPU
numbers" per memory module (in our situation), why do we see only 4
numbers and not 8, e.g. 0,1,2,3,4,5,6,7?
I'm now confused about a lot: originally, you mentioned 53-57, was it?
That doesn't add up, since you say you have 2 quad-core processors, for a
total of 8 CPUs, and each of those processors has 6 banks, which would
mean each processor should only see six (directly). Where I'm confused is
how you could have cores 32-35, or 53-whatsit, when you only have 8 cores
in two processors.
2 CPUs, each with 8 cores and HT support. So 16 at max, I think. Is it
right counted that way?
Huh? Above, you say "2 quad core proc" - that's 8 cores over two processor chips. HT support doesn't figure into it; if you use dmidecode or lshw, I believe it will show you 8 cores, not 16.
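A quick way to check that, assuming the stock tools are installed
(labels vary with the dmidecode version and what the BIOS reports):

   grep -c ^processor /proc/cpuinfo   # logical CPUs the kernel sees; HT inflates this
   dmidecode -t processor             # per-socket details straight from the SMBIOS tables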
Was a typo, sorry. 2 CPUs, and each one has 4 cores, so 8 cores in total.
I've really lost the thread with those CPU-to-memory-bank mappings...
Each processor will directly see the DIMMs associated with it, so the
banks associated with each processor will be what directly affects the
cores. So, if you see something like
   Mar 20 05:01:35 <system name> kernel: Northbridge Error, node 0, core: 5
(these processors are 8-core), it means that one of the DIMMs in bank 0,
0-3, is bad. You should see
    __
   |_0|  0 1 2 3
    __
   |_1|  0 1 2 3
or whatever on the m/b, so one of the top ones there is affected. Is that
any clearer?
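A sketch of the arithmetic implied there, assuming cores are numbered
linearly across the two packages (an assumption; the kernel doesn't
guarantee this ordering):

   # With N cores per package, integer division picks the node whose
   # bank of DIMMs is suspect. Using Mark's 8-core example:
   cpu=5; cores_per_node=8
   echo "suspect bank $((cpu / cores_per_node))"   # prints 0, i.e. DIMMs 0-3 of bank 0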
First of all, big thanks for helping, Mark.
In your example everything is ok, but I am lost with what we have.
Previously we received messages like I posted in the first mail:
   CPU 51 BANK 8 TSC 8511e3ca77dc MISC 274d587f00006141 ADDR 807044840
   STATUS cc0055000001009f MCGSTATU
And there were always the same CPU numbers. I really don't know why
mcelog shows such numbers, but that's what we have. Always BANK 8, and
the numbers in the CPU field were 32,33,34,35 and 50,51,52,53.
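Just to show the arithmetic that confuses me, assuming the CPU field
were a plain 0-7 core index (which it clearly isn't here):

   # On a 2-socket, 4-core box a plain core index would give socket=cpu/4,
   # but values like 32-35 and 50-53 are far out of range for 8 cores.
   for cpu in 1 2 3 32 33 50; do echo "CPU $cpu -> socket $((cpu / 4))"; done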
You encouraged us that it is a DIMM problem, and we decided to do a
little research, which I described up the thread. During that we replaced
DIMM modules between slots, so now we have BANK 8 and CPU 1,2,3 and
18,19,20,21. It really seems that somehow those numbers are connected
with the RAM modules.
But... as I said, we have the following slots:
   CPU1: cpu1-a1 cpu1-a2 cpu1-a3 cpu1-b1 cpu1-b2 cpu1-b3
   CPU2: cpu2-a1 cpu2-a2 cpu2-a3 cpu2-b1 cpu2-b2 cpu2-b3
We have modules placed in this way ("V" = populated):

+------+---------+---------+---------+---------+---------+---------+
|      |    V    |    V    |    V    |    V    |  free   |  free   |
+------+---------+---------+---------+---------+---------+---------+
| CPU1 | cpu1-a1 | cpu1-a2 | cpu1-a3 | cpu1-b1 | cpu1-b2 | cpu1-b3 |
+------+---------+---------+---------+---------+---------+---------+

+------+---------+---------+---------+---------+---------+---------+
|      |    V    |    V    |    V    |    V    |  free   |  free   |
+------+---------+---------+---------+---------+---------+---------+
| CPU2 | cpu2-a1 | cpu2-a2 | cpu2-a3 | cpu2-b1 | cpu2-b2 | cpu2-b3 |
+------+---------+---------+---------+---------+---------+---------+
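One way to cross-check which physical slot the BIOS maps to which DIMM,
assuming the board's SMBIOS tables are filled in (they are not always
complete):

   # Shows each DIMM socket label ("Locator"), its bank, and whether a
   # module is present ("Size: No Module Installed" for empty slots).
   dmidecode -t memory | egrep -i 'locator|size'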
Definitely there is something with the memory banks, because replacing
modules changed the MCE messages, but what exactly... or have I
interpreted it all wrong?