[CentOS] Cant find out MCE reason (CPU 35 BANK 8)

Tue Mar 22 14:01:41 UTC 2011
Vladimir Budnev <vladimir.budnev at gmail.com>

2011/3/22 <m.roth at 5-cent.us>

> Vladimir Budnev wrote:
> > 2011/3/21 <m.roth at 5-cent.us>
> >> Vladimir Budnev wrote:
> >> > Hello community.
> >> >
> >> > We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF with 2xIntel
> >> > Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G
> >> >
> >> > For some time we have lots of MCE in mcelog and we cant find out the
> >> > reason.
> >>
> >> The only thing that shows there (when it shows, since sometimes it
> >> doesn't seem to) is a hardware error. You *WILL* be replacing hardware,
> >> sometime soon, like yesterday.
> <snip>
> >> Bad news: you have *two* DIMMs failing, one associated with the physical
> >> CPU that has core 53, and another associated with the physical CPU that
> >> has cores 32-35.
> <snip>
> > Last night we did some testing to find out which RAM modules are faulty.
> >
> > Note that we have 8 modules of 4G each.
> <snip>
> > Finally we put back the last 2 modules... and no errors. Note that at
> > that step we had exactly the same module placement as before the
> > experiment.
> >
> > It sounds strange, but at first glance it looks like something was wrong
> > with the module placement. But we can't understand why the problem didn't
> > show up during the first days, even months, of the server running. No one
> > touched the server hardware, so I have no idea what that was.
> >
> > Now we are just waiting to see whether the errors come back.
>
> I'm sure there will. Reseating the memory may have done something, but
> there will, I'll wager.
>

Mark, you are absolutely right :) The errors reappeared approximately 1h ago.
They have appeared only once since the reboot, but they are back. Hi there :(

The good news is that the CPU numbers changed: we now see CPUs 1, 2, 3 and
18, 19, 20, 21. So we definitely moved the "broken" modules to other slots.
Anyway, a bad DIMM is much better news for us than, e.g., a bad motherboard.

We are going to continue the party tonight or tomorrow morning and determine
which two modules are broken.

Is it possible to determine which physical DIMMs correspond to the CPUs
reported in the MCE messages? We have two rows of slots (6 slots per row), one
for cpu1 and one for cpu2. The used slots are marked
cpu1-a1, cpu1-a2, cpu1-a3, cpu1-b1 and cpu2-a1, cpu2-a2, cpu2-a3, cpu2-b1.

I remember that you advised dividing the CPU number by the physical core
count. We have 2 quad-core processors, so 8 CPUs. 1/8=0. Is that the cpu-a1
slot, or does it depend on the situation? I hope we will find those bastards
ourselves, but a hint would be great.
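
To make the "divide the CPU number" idea concrete, here is a minimal sketch,
not a definitive method. It assumes that the CPU number in the MCE record is
the kernel's logical CPU number (that may not hold on every kernel) and that
/proc/cpuinfo has the usual x86 "processor" and "physical id" fields. The
function names are made up for illustration, and it needs a newer Python than
the one CentOS 4 ships. It only tells you the socket, not the DIMM slot.

```python
def parse_cpuinfo(text):
    """Map each logical CPU number to its physical package (socket) id.

    `text` is the content of /proc/cpuinfo. CPUs without a
    "physical id" line are simply left out of the result.
    """
    mapping = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between CPU entries
        key = key.strip()
        value = value.strip()
        if key == "processor":
            current = int(value)
        elif key == "physical id" and current is not None:
            mapping[current] = int(value)
    return mapping


def naive_socket(cpu, logical_cpus_per_socket):
    """The plain integer division from the thread.

    Assumption: logical CPUs are numbered contiguously per socket,
    which is not guaranteed, so prefer parse_cpuinfo() when possible.
    """
    return cpu // logical_cpus_per_socket


def load_socket_table(path="/proc/cpuinfo"):
    """Read the file at `path` and return the cpu -> socket mapping."""
    with open(path) as handle:
        return parse_cpuinfo(handle.read())
```

On the server itself, load_socket_table() returns the mapping, and looking up
a CPU number from an MCE record in it gives the socket to check, under the
assumptions above.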

And one more thing I can't understand: if there are, say, 8 "CPU numbers"
per memory module (in our situation), why do we see only 4 numbers and not
8, e.g. 0,1,2,3,4,5,6,7?


> Here's a question out of left field: who was the manufacturer of the 4G
> DIMMs? Not Supermicro, but the DIMMs themselves?
>

These are Kingston KVR1333D3D4R9S/4G, if I understood the question.