Vladimir Budnev wrote:
2011/3/22 m.roth@5-cent.us
Vladimir Budnev wrote:
2011/3/21 m.roth@5-cent.us
Vladimir Budnev wrote:
We are running CentOS 4.8 on a SuperMicro SYS-6026T-3RF with 2x Intel Xeon E5630 and 8x Kingston KVR1333D3D4R9S/4G.
For some time we have been seeing lots of MCEs in mcelog, and we can't find the reason.
The only thing that shows there (when it shows, since sometimes it doesn't seem to) is a hardware error. You *WILL* be replacing hardware, sometime soon, like yesterday.
<snip>
Bad news: you have *two* DIMMs failing, one associated with the physical CPU that has core 53, and another associated with the physical CPU that has cores 32-35.
<snip, memory reseating>
Now we are just waiting to see whether the errors come back.
I'm sure there will. Reseating the memory may have done something, but there will, I'll wager.
Mark, you are absolutely right :) Approximately 1h ago the errors appeared. They have appeared only once since the reboot, but they are back. Hi there :(
The good news is that the CPU numbers changed, so now we have CPUs 1, 2, 3 and 18, 19, 20, 21. We definitely moved the "broken" modules to other slots. Anyway, a bad DIMM is really good news for us compared with, e.g., a bad motherboard.
<snip>
Is it possible to determine which physical DIMMs correspond to the CPUs reported in the MCE messages? We have two rows of slots (6 slots in each row), one for cpu1 and the second for cpu2. The used slots are marked cpu1-a1, cpu1-a2, cpu1-a3, cpu1-b1 and cpu2-a1, cpu2-a2, cpu2-a3, cpu2-b1.
I remember that you advised dividing the CPU number by the physical core count. We have 2 quad-core processors, so 8 CPUs. 1/8=0. Is that the cpu-a1 slot, or does it depend on the situation? I hope we will find those bastards ourselves, but a hint would be great.
And one more thing I can't understand: if there are, say, 8 "CPU numbers" per memory module (in our situation), why do we see only 4 numbers and not 8, e.g. 0,1,2,3,4,5,6,7?
I'm now confused about a lot: originally, you mentioned 53 - 57, was it? That doesn't add up, since you say you have 2 quad core processors, for a total of 8 cpus, and each of those processors has 6 banks, which would mean each processor should only see six (directly). Where I'm confused is how you could have cores 32-35, or 53-whatsit, when you only have 8 cores in two processors.
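One way to check how many logical CPUs the kernel actually sees, and which physical package each one sits on, is to read the "processor" and "physical id" fields out of /proc/cpuinfo. What follows is only a rough sketch in Python: the function name cpu_to_package and the SAMPLE text are made up for illustration, and the sample is not output from this machine. It also assumes the CPU numbers in the MCE messages match the kernel's logical CPU numbers, which the 53 and 32-35 above suggest may not be the case here. If Hyper-Threading is enabled, expect more logical CPUs than physical cores.

```python
def cpu_to_package(cpuinfo_text):
    """Map each logical CPU number to its 'physical id' (socket).

    Expects text in the /proc/cpuinfo "key : value" layout.
    """
    mapping = {}
    current_cpu = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between CPU entries
        key = key.strip()
        value = value.strip()
        if key == "processor":
            current_cpu = int(value)
        elif key == "physical id" and current_cpu is not None:
            mapping[current_cpu] = int(value)
    return mapping


# Hypothetical two-CPU excerpt, for illustration only.
SAMPLE = (
    "processor\t: 0\n"
    "physical id\t: 0\n"
    "core id\t\t: 0\n"
    "\n"
    "processor\t: 1\n"
    "physical id\t: 1\n"
    "core id\t\t: 0\n"
)

if __name__ == "__main__":
    for cpu, pkg in sorted(cpu_to_package(SAMPLE).items()):
        print("logical CPU %d -> physical package %d" % (cpu, pkg))
```

On the real box you would pass in open("/proc/cpuinfo").read() instead of SAMPLE. This only narrows things down to a socket, i.e. to the cpu1-* or cpu2-* row of slots; it does not say which DIMM within the row is at fault.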
Here's a question out of left field: who was the manufacturer of the 4G DIMMs? Not Supermicro, but the DIMMs themselves?
They are Kingston KVR1333D3D4R9S/4G, if I understood the question.
Oh, ok. I was wondering if they were Hynix - I've seen a good number of bad 4G and 8G DIMMs from them recently, and that across three different OEMs and model DIMMs.
mark