[CentOS] Machine check events

Thu Nov 28 12:37:26 UTC 2013
Glenn Eychaner <geychaner at mac.com>

m.roth writes:

> Is the system still under warranty? How 'bout the memory, if you've
> replaced it? You *should* replace it. It's not going to get better....

This is brand-new Kingston 1600MHz ECC memory in a workstation/server
running at high altitude in a relatively open environment; I am loath to
replace it based on a single correctable parity error every few days,
especially since both active computers are (thus far) seeing about the same
error frequency. It will take many more days or even weeks to determine
that for certain; I haven't seen an error in the last three days on either
active computer. Memtest was also run on these computers overnight (18+
hours) between build and deployment without apparent issue.
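
As a rough sanity check on why a quiet three days proves little: if the
errors are assumed to arrive independently at a constant rate (a Poisson
process, which is my assumption and not something measured here), the
chance of seeing none in a given window is exp(-rate * days). The rate of
one error per workstation per three days is the figure from my earlier
message quoted below; the function name is just illustrative.

```python
import math


def prob_no_errors(rate_per_day: float, days: float) -> float:
    """Probability of zero events in `days`, assuming a Poisson process."""
    return math.exp(-rate_per_day * days)


# Assumed rate: about one corrected error per workstation per three days.
rate = 1.0 / 3.0

# One workstation staying quiet for three days.
p_one = prob_no_errors(rate, 3)

# Both active workstations staying quiet for the same three days
# (two independent sources, so the rates add).
p_both = prob_no_errors(2 * rate, 3)

print(f"one machine quiet for 3 days:   {p_one:.2f}")
print(f"both machines quiet for 3 days: {p_both:.2f}")
```

Under that assumption a single machine stays quiet for three days a bit
more than a third of the time, and both do so roughly one time in seven,
so a gap like this says very little about the true rate.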

[The computers were built in the States and then shipped 10,000 miles to
the observatory location.]

And the turnaround time from the observatory to the U.S. on servicing is no
small matter. I have five of these computers (two active, one "hot" spare,
one "cold" spare, one test system); if in the long run one proves to be a
problem, I will deal with it at that time. If the memory is a bad batch,
I'll need more proof.


On Nov 27, 2013, at 3:56 PM, Glenn Eychaner <geychaner at mac.com> wrote:

> And all that work was done to get this, output of a corrected memory parity
> error. I get about one of these per workstation per 3 days, more or less; is
> this a surprising number? (The workstation under the heaviest load gets
> more, while the idle spare gets none at all; no surprise there!)
> MCE 6
> CPU 1 BANK 0 
> TIME 1385426237 Mon Nov 25 21:37:17 2013
> MCG status:
> MCi status:
> Corrected error
> Error enabled
> MCA: Internal parity error
> STATUS 90000040000f0005 MCGSTATUS 0
> CPUID Vendor Intel Family 6 Model 60
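
For anyone wanting to check the STATUS word by hand, here is a minimal
sketch that splits it into the architectural fields of the Intel
IA32_MCi_STATUS register (bit positions as I understand them from the
Intel Software Developer's Manual). The function name and dictionary keys
are my own, and the corrected-error count in bits 52:38 is only meaningful
if the processor reports that capability, which I am assuming here.

```python
def decode_mci_status(status: int) -> dict:
    """Split a 64-bit IA32_MCi_STATUS value into its architectural fields."""
    return {
        "valid": bool((status >> 63) & 1),            # VAL: register holds an error
        "overflow": bool((status >> 62) & 1),         # OVER: an earlier error was lost
        "uncorrected": bool((status >> 61) & 1),      # UC: 0 means corrected
        "enabled": bool((status >> 60) & 1),          # EN: error reporting enabled
        "misc_valid": bool((status >> 59) & 1),       # MISCV
        "addr_valid": bool((status >> 58) & 1),       # ADDRV
        "context_corrupt": bool((status >> 57) & 1),  # PCC
        # Bits 52:38; assumed valid (depends on processor capability).
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


fields = decode_mci_status(0x90000040000F0005)
for name, value in fields.items():
    print(f"{name:20s} {value}")
```

For the value above this gives valid and enabled set, uncorrected clear, a
corrected count of 1, and an MCA error code of 0x0005, which lines up with
the "Corrected error", "Error enabled" and "Internal parity error" lines
that mcelog printed.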

Glenn Eychaner (geychaner at lco.cl)
Telescope Systems Programmer, Las Campanas Observatory