[CentOS] Cant find out MCE reason (CPU 35 BANK 8)

Tue Mar 22 11:48:58 UTC 2011

On Tue, Mar 22, 2011 at 7:33 AM, Vladimir Budnev
<vladimir.budnev at gmail.com> wrote:
>
>
> 2011/3/21 <m.roth at 5-cent.us>
>>
>> Vladimir Budnev wrote:
>> > Hello community.
>> >
>> > We are running CentOS 4.8 on a SuperMicro SYS-6026T-3RF with 2x Intel Xeon
>> > E5630 and 8x Kingston KVR1333D3D4R9S/4G.
>> >
>> > For some time we have been seeing lots of MCEs in mcelog and we can't find
>> > out the reason.
>>
>> The only thing that shows there (when it shows, since sometimes it doesn't
>> seem to) is a hardware error. You *WILL* be replacing hardware, sometime
>> soon, like yesterday.
>>
>> "Normal" is not: *ANYTHING* here is Bad News. First, you've got DIMMs
>> failing. CPU 53, assuming this system doesn't have 53+ physical CPUs,
>> means that you have x-core CPUs, so you need to divide by x. So if
>> it's a 12-core system with 6 physical chips, that would make it DIMM 8
>> associated with that physical CPU.
>> <snip>
>> > One more interesting thing is the following output:
>> > [root at zuno]# cat /var/log/mcelog |grep CPU|sort|awk '{print $2}'|uniq
>> > 32
>> > 33
>> > 34
>> > 35
>> > 50
>> > 51
>> > 52
>> > 53
>> >
>> > Those numbers are always the same.
>>
>> Bad news: you have *two* DIMMs failing, one associated with the physical
>> CPU that has core 53, and another associated with the physical CPU that
>> has cores 32-35.
>>
>> Talk to your OEM support to help identify which banks need replacing,
>> and/or find a motherboard diagram.
>>
>>          mark, who has to deal *again* with one machine with the same
>> problem....
>
> Thanks for the answer!
>
> Last night we did some testing to find out which RAM modules were faulty.
>
> Note that we have 8 modules, 4G each.
>
> First we removed the modules in slots a3,b1 for each CPU, and there was no
> change in HW behaviour. Errors appeared after boot.
>
> Then we removed a1,a2 (yes, I know that "for high performance" we should
> place modules starting from a1, but that was our mistake and in any case the
> server started) and ... there were no errors for 1h. Usually we observe
> errors coming roughly every 5 mins.
>
> Then we put 2 modules back. At that step we had slots a1,a3,b1 occupied
> for each CPU. No errors.
>
> Finally we put the last 2 modules back ... and no errors. Note
> that at that step we had exactly the same module placement as before the
> experiment.
>
> It sounds strange, but at first glance it looks like something was wrong with
> the module placement. But we can't work out why the problem didn't show up
> during the first days, or even months, of the server running. No one touched
> the server HW, so I have no idea what that was.
>
> Now we are just waiting to see whether the errors come back.

You know......

I once had a *whole rack* of blade servers, running CentOS, where
someone decided to "save money" by buying the memory separately and
replacing it in-house. Slews of memory errors started up pretty soon,
and I wound up having to reseat all of it, run some memory testing
tools against them, juggle the good memory with the bad memory to get
working systems, replace DIMMs, etc., etc. We kept seeing failures
over the next few months as part of the falling part of a bathtub
curve.

I was furious that we'd "saved" perhaps 2 thousand bucks on RAM,
overall, and completely burned a month of my time and made our clients
*VERY* unhappy and come out looking like fools for not having this
very expensive piece of kit working from day one.

In the process, though, some of the systems were repaired
"permanently" by simply reseating the RAM. I did handle them
carefully, cleaning the filters, removing any dust (of which there was
very little, they were new) and checking all the cabling. I also
cleaned up the airflow a bit by doing some recabling and relabeling,
normal practice when I have a rack down and a chance to make sure
things go where they should.

And I *carefully* cleaned up the blood where I cut my hand on the heat
sink on the one system. Maybe it was the blood sacrifice that appeased
the gods on that server?
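
For the "wait and see whether the errors come back" stage, it may help to
count events per CPU and bank rather than only listing the CPU numbers, as
the grep/awk one-liner quoted above does. What follows is only a rough
sketch: the function name mce_tally is made up for illustration, and it
assumes the record lines in the log start with "CPU <n> BANK <m>", as the
subject line suggests. Other mcelog versions may format records differently,
so check a few lines of your own log first.

```shell
# Count MCE records per (CPU, bank) pair in an mcelog file.
# ASSUMPTION: record lines begin with "CPU <n> BANK <m> ...".
# mce_tally is a hypothetical helper name, not part of mcelog.
mce_tally() {
    # $1: path to the log file; defaults to /var/log/mcelog
    awk '$1 == "CPU" && $3 == "BANK" { count[$2 " " $4]++ }
         END { for (k in count) print k, count[k] }' "${1:-/var/log/mcelog}" |
        sort -k1,1n -k2,2n    # order by CPU number, then bank number
}
```

Running "mce_tally /var/log/mcelog" before and after a reseat prints one line
per CPU/bank pair with its count, so a return of the same pairs (or a count
that keeps growing) is easy to spot.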