Hello community.
We are running CentOS 4.8 on a SuperMicro SYS-6026T-3RF with 2x Intel Xeon E5630 and 8x Kingston KVR1333D3D4R9S/4G.
For some time we have been seeing lots of MCEs in mcelog and we can't find the cause.
A typical MCE message looks like this:
CPU 51 BANK 8 TSC 8511e3ca77dc
MISC 274d587f00006141 ADDR 807044840
STATUS cc0055000001009f MCGSTATUS 0
Decoded with mcelog --ascii --cpu p4 (because there is no Xeon 56xx in the list of supported CPUs):
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 53 BANK 8 TSC 1982d8f72b1f
MISC e1742eac00006242 ADDR 7ffd78a80
MCG status:
MCi status:
Error overflow
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
Transaction: Memory read error
STATUS cc0002000001009f MCGSTATUS 0
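To double-check what mcelog prints, here is a minimal sketch of how the STATUS value can be split into bit fields by hand. It assumes the architectural IA32_MCi_STATUS layout from the Intel SDM (flags in bits 63..57, MCA error code in bits 15:0, memory controller codes of the form 0000 0000 1MMM CCCC). The macro and function names are made up for this sketch, and whether the corrected error count in bits 52:38 is actually implemented on these CPUs is an assumption we have not confirmed.

```c
#include <stdint.h>
#include <stdio.h>

/* Flag bits of a bank STATUS value (names are local to this sketch). */
#define ST_VAL   (1ULL << 63) /* record is valid */
#define ST_OVER  (1ULL << 62) /* error overflow: an earlier error was overwritten */
#define ST_UC    (1ULL << 61) /* uncorrected error */
#define ST_MISCV (1ULL << 59) /* MISC register valid */
#define ST_ADDRV (1ULL << 58) /* ADDR register valid */

/* Bits 15:0: MCA error code. */
static unsigned mca_code(uint64_t st) { return (unsigned)(st & 0xffff); }

/* Bits 52:38: corrected error count (assumed to be implemented here). */
static unsigned corrected_count(uint64_t st)
{
    return (unsigned)((st >> 38) & 0x7fff);
}

/* Memory controller errors have the code pattern 0000 0000 1MMM CCCC. */
static int is_memctrl_error(uint64_t st)
{
    return (mca_code(st) & 0xff80) == 0x0080;
}

/* MMM: transaction type, 1 means memory read. */
static unsigned memctrl_txn(uint64_t st) { return (mca_code(st) >> 4) & 0x7; }

/* CCCC: channel number, 15 means channel not specified. */
static unsigned memctrl_channel(uint64_t st) { return mca_code(st) & 0xf; }

/* Format a one-line summary of the fields above into buf. */
static int describe_status(uint64_t st, char *buf, size_t len)
{
    return snprintf(buf, len,
        "val=%d over=%d uc=%d miscv=%d addrv=%d count=%u memctrl=%d txn=%u chan=%u",
        !!(st & ST_VAL), !!(st & ST_OVER), !!(st & ST_UC),
        !!(st & ST_MISCV), !!(st & ST_ADDRV),
        corrected_count(st), is_memctrl_error(st),
        memctrl_txn(st), memctrl_channel(st));
}
```

Calling describe_status(0xcc0055000001009fULL, buf, sizeof buf) on the raw record above gives a corrected (UC clear) memory read error with overflow set and channel unspecified, which agrees with the mcelog decode. If the count field is valid on this hardware, it reads 340 for cc0055000001009f and 8 for cc0002000001009f.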
The main question: is it possible to find out exactly which piece of hardware causes these messages?
At first we thought that, according to
/* A machine check record */
struct mce {
__u64 status; /* bank status register */
__u64 misc; /* misc register (always 0 right now) */
__u64 addr; /* address or 0 */
__u64 mcgstatus; /* global MC status register */
__u64 rip; /* Program counter or 0 for silent error */
__u64 tsc; /* cpu time stamp counter */
__u64 res1; /* for future extension */
__u64 res2; /* dito. */
__u8 cs; /* code segment */
__u8 bank; /* machine check bank */
__u8 cpu; /* cpu that raised the error */
__u8 finished; /* entry is valid */
__u32 pad;
};
cpu is the CPU that raised the exception. But we have two quad-core CPUs with HT, so there should be at most 16 logical CPUs, and in the logs we see 53 etc.
So now we are not sure what the cpu value is :) Does anyone know what the CPU number means exactly?
One more interesting thing is the following output:
[root@zuno]# cat /var/log/mcelog |grep CPU|sort|awk '{print $2}'|uniq
32
33
34
35
50
51
52
53
Those numbers are always the same.
OK. Suppose we have a problem with RAM. Since I don't really know what those CPU numbers mean, we suppose that CPU + BANK together can point to the faulty hardware. Is that possible?
According to our "broken RAM theory", we suppose that the numbers 32, 33, 34, 35 and 50, 51, 52, 53 indicate some symmetric problem with the RAM, the slots, or something else. Is that correct?
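To make the pattern in those numbers easier to see, here is a small sketch that writes each value in binary and computes which bits are set in all of them. The helper names are made up for this sketch, and it says nothing about what the numbering scheme really is; it only shows the bit patterns of the values from the uniq output.

```c
#include <stdio.h>

/* CPU values seen in our mcelog (from the uniq output). */
static const unsigned seen_cpus[] = { 32, 33, 34, 35, 50, 51, 52, 53 };
enum { N_SEEN = sizeof seen_cpus / sizeof seen_cpus[0] };

/* Write the low 6 bits of n as '0'/'1' characters, most significant first. */
static void to_bits6(unsigned n, char out[7])
{
    for (int i = 0; i < 6; i++)
        out[i] = ((n >> (5 - i)) & 1u) ? '1' : '0';
    out[6] = '\0';
}

/* AND of all values: the bits that are set in every one of them. */
static unsigned common_set_bits(const unsigned *v, int n)
{
    unsigned acc = ~0u;
    for (int i = 0; i < n; i++)
        acc &= v[i];
    return acc;
}

/* Print one line per value, then the mask of bits common to all. */
static void print_seen_cpus(void)
{
    char bits[7];
    for (int i = 0; i < N_SEEN; i++) {
        to_bits6(seen_cpus[i], bits);
        printf("%2u = %s\n", seen_cpus[i], bits);
    }
    to_bits6(common_set_bits(seen_cpus, N_SEEN), bits);
    printf("common = %s\n", bits);
}
```

All eight values have bit 5 set (32 = 100000, 53 = 110101), and the two groups differ in bit 4, so the values look more like a packed bit field than like a plain 0..15 index.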
Thanks in advance.