[CentOS] Cant find out MCE reason (CPU 35 BANK 8)

Mon Mar 21 15:12:39 UTC 2011
m.roth at 5-cent.us <m.roth at 5-cent.us>

Vladimir Budnev wrote:
> Hello community.
>
> We are running CentOS 4.8 on a SuperMicro SYS-6026T-3RF with 2x Intel Xeon
> E5630 and 8x Kingston KVR1333D3D4R9S/4G.
>
> For some time we have had lots of MCEs in mcelog, and we can't find out the
> reason.

The only thing that ever shows up in mcelog (when anything shows up at all,
since sometimes nothing seems to) is a hardware error. You *WILL* be
replacing hardware soon, like yesterday.

"Normal" is not: *ANYTHING* here is Bad News. First, you've got DIMMs
failing. CPU 53, assuming this system doesn't have 53+ physical CPUs,
is a core number on a multi-core system, so to get the physical CPU you
divide by x, the number of cores per chip. If, say, it's 6 physical chips
with 12 cores each, that would make it DIMM 8 associated with whichever
physical CPU holds core 53.
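
As a rough sketch of that division (a rule of thumb only: it assumes the
kernel numbers cores contiguously, chip by chip, which is not guaranteed,
and socket_of is just an illustrative name, not a real tool):

```shell
# Sketch: map a logical core number to a zero-based physical chip index.
# ASSUMPTION: cores are numbered contiguously per chip; real numbering may differ.
socket_of() {
    # $1 = core number as reported in mcelog, $2 = cores per physical chip
    echo $(( $1 / $2 ))
}

# Hypothetical 12-cores-per-chip layout from the example above.
socket_of 53 12   # prints 4, i.e. the fifth chip counting from zero
```

With the same hypothetical layout, cores 32 through 35 all land on chip
index 2, so they would share one physical CPU.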
<snip>
> One more interesting thing is the following output:
> [root@zuno]# cat /var/log/mcelog |grep CPU|sort|awk '{print $2}'|uniq
> 32
> 33
> 34
> 35
> 50
> 51
> 52
> 53
>
> Those numbers are always the same.

Bad news: you have *two* DIMMs failing, one associated with the physical
CPU that has core 53, and another associated with the physical CPU that
has cores 32-35.
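
If it helps, something like the following would tally how many records each
CPU/bank pair has, rather than only listing the CPU numbers. It is a sketch
that assumes mcelog lines of the form "CPU <n> BANK <m> ...", as in the
subject line. mce_summary is a made-up helper name, and the field positions
may need adjusting for your mcelog version.

```shell
# Sketch: count MCE records per CPU/bank pair.
# ASSUMPTION: records contain lines shaped like "CPU <n> BANK <m> ...".
mce_summary() {
    # $1 = path to the log; defaults to /var/log/mcelog
    awk '
        # Only consider lines whose first and third fields match the assumed layout.
        $1 == "CPU" && $3 == "BANK" { count[$2 " " $4]++ }
        END {
            for (key in count) {
                split(key, part, " ")
                printf "CPU %s BANK %s: %d\n", part[1], part[2], count[key]
            }
        }
    ' "${1:-/var/log/mcelog}" | sort -k2,2n -k4,4n   # order by CPU, then bank
}
```

Call it as `mce_summary /var/log/mcelog`. Each output line gives one
CPU/bank pair and the number of records logged for it.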

Talk to your OEM support to help identify which banks need replacing,
and/or find a motherboard diagram.

          mark, who has to deal *again* with one machine with the same
problem....

_______________________________________________
CentOS mailing list
CentOS at centos.org
http://lists.centos.org/mailman/listinfo/centos