[CentOS] Can't find out MCE reason (CPU 35 BANK 8)

Tue Mar 22 14:24:01 UTC 2011
Vladimir Budnev <vladimir.budnev at gmail.com>

2011/3/22 <m.roth at 5-cent.us>

> Vladimir Budnev wrote:
> > 2011/3/22 <m.roth at 5-cent.us>
> >> Vladimir Budnev wrote:
> >> > 2011/3/21 <m.roth at 5-cent.us>
> >> >> Vladimir Budnev wrote:
> >> >> >
> >> >> > We are running CentOS 4.8 on a SuperMicro SYS-6026T-3RF with 2x Intel
> >> >> > Xeon E5630 and 8x Kingston KVR1333D3D4R9S/4G.
> >> >> >
> >> >> > For some time we have had lots of MCEs in mcelog and we can't find
> >> >> > out the reason.
> >> >>
> >> >> The only thing that shows there (when it shows, since sometimes it
> >> >> doesn't seem to) is a hardware error. You *WILL* be replacing
> >> >> hardware, sometime soon, like yesterday.
> >> <snip>
> >> >> Bad news: you have *two* DIMMs failing, one associated with the
> >> >> physical CPU that has core 53, and another associated with the
> >> >> physical CPU that has cores 32-35.
> >> <snip, memory reseating>
> >> > Now we are just waiting to see whether the errors come back.
> >>
> >> I'm sure there will. Reseating the memory may have done something, but
> >> there will, I'll wager.
> >
> > Mark, you are absolutely right :) Approximately 1h ago the errors appeared.
> > They have appeared only once since the reboot, but they are back. Hi there :(
> >
> > The good thing is that the CPU numbers changed, so now we have cpu 1,2,3 and
> > 18,19,20,21. We definitely moved the "broken" modules to other slots.
> > Anyway, a bad DIMM is really good news for us compared with, e.g., a bad
> > motherboard.
> <snip>
> > Is it possible to determine which physical DIMMs correspond to the CPUs
> > reported in the MCE messages? We have two rows of slots (6 slots per row),
> > one for cpu1 and the second for cpu2. The used slots are marked
> > cpu1-a1,cpu1-a2,cpu1-a3,cpu1-b1 and cpu2-a1,cpu2-a2,cpu2-a3,cpu2-b1.
> >
> > I remember that you advised dividing the cpu number by the physical core
> > count. We have 2 quad-core processors, so 8 cpus. 1/8=0. Is that the cpu-a1
> > slot, or does it depend on the situation? I hope we will find those bastards
> > ourselves, but a hint would be great.
> >
> > And one more thing I can't understand... if there are, say, 8 "cpu
> > numbers" per memory module (in our situation), why do we see only 4
> > numbers and not 8, e.g. 0,1,2,3,4,5,6,7?
>
> I'm now confused about a lot: originally, you mentioned 53 - 57, was it?
> That doesn't add up, since you say you have 2 quad core processors, for a
> total of 8 cpus, and each of those processors has 6 banks, which would
> mean each processor should only see six (directly). Where I'm confused is
> how you could have cores 32-35, or 53-whatsit, when you only have 8 cores
> in two processors.
>

2 CPUs, each with 8 cores counting HT. So 16 at most, I think. Does it add up
that way?
I have really lost the thread with these cpu-to-memory-bank mappings...
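
One way to sanity-check the numbering, as a rough sketch only: it assumes
/proc/cpuinfo on this kernel lists "processor" and "physical id" fields. The
function name socket_map and the sample text are made up for illustration and
are not output from this machine.

```python
def socket_map(cpuinfo_text):
    """Map logical CPU number -> physical id (socket) from /proc/cpuinfo text."""
    mapping = {}
    current = None
    for line in cpuinfo_text.splitlines():
        if ":" not in line:
            continue  # blank separator lines between processor entries
        key, value = line.split(":", 1)
        key, value = key.strip(), value.strip()
        if key == "processor":
            current = int(value)
        elif key == "physical id" and current is not None:
            mapping[current] = int(value)
    return mapping


# Hypothetical sample, NOT taken from the server discussed in this thread.
SAMPLE = """\
processor   : 0
physical id : 0

processor   : 1
physical id : 1
"""

for cpu, socket in sorted(socket_map(SAMPLE).items()):
    print("logical cpu %d -> socket %d" % (cpu, socket))
```

On the real box the text would come from open("/proc/cpuinfo").read(). If a
number from the MCE messages does not show up as a key in the map, it is
probably not a Linux logical cpu number, and dividing it by the core count
would not point at a socket directly.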

>
> >> Here's a question out of left field: who was the manufacturer of the 4G
> >> DIMMs? Not Supermicro, but the DIMMs themselves?
> >>
> > They are Kingston KVR1333D3D4R9S/4G, if I understood the question.
>
> Oh, ok. I was wondering if they were Hynix - I've seen a good number of
> bad 4G and 8G DIMMs from them recently, and that across three different
> OEMs and model DIMMs.
>
>         mark
>
> _______________________________________________
> CentOS mailing list
> CentOS at centos.org
> http://lists.centos.org/mailman/listinfo/centos
>