[CentOS] Machine check events

Wed Nov 27 18:56:28 UTC 2013
Glenn Eychaner <geychaner at mac.com>

And all that work was done to get this: the decoded output for a corrected
internal parity error (below). I get about one of these per workstation every
3 days, more or less; is that a surprising number? (The workstation under the
heaviest load gets more, while the idle spare gets none at all; no surprise there!)

TIME 1385426237 Mon Nov 25 21:37:17 2013
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 90000040000f0005 MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 60
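
For anyone who wants to check the decoding by hand, here is a minimal sketch
that splits the STATUS word above into its fields. It assumes the architectural
IA32_MCi_STATUS layout described in Intel's Software Developer's Manual; the
function name and dictionary keys are my own, and the corrected-error count
field is only meaningful on CPUs that report it.

```python
def decode_mci_status(status):
    """Split a 64-bit IA32_MCi_STATUS value into its architectural fields.

    Bit positions are assumed from the Intel SDM layout; the key names
    are illustrative and not taken from mcelog.
    """
    def bit(n):
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),            # VAL: the register holds a logged error
        "overflow": bit(62),         # OVER: an earlier error was overwritten
        "uncorrected": bit(61),      # UC: clear means the error was corrected
        "enabled": bit(60),          # EN: error reporting was enabled
        "misc_valid": bit(59),       # MISCV: the MISC register has extra data
        "addr_valid": bit(58),       # ADDRV: the ADDR register has an address
        "context_corrupt": bit(57),  # PCC: processor context may be corrupt
        "corrected_count": (status >> 38) & 0x7FFF,   # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_error_code": status & 0xFFFF,            # bits 15:0
    }


if __name__ == "__main__":
    fields = decode_mci_status(0x90000040000F0005)
    for name, value in fields.items():
        print(f"{name:20s} {value}")
```

With the value from this report, the valid and enabled bits are set, the
uncorrected bit is clear, and the low 16 bits are 0x0005, which is the simple
MCA code for an internal parity error. That agrees with what mcelog printed.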


On Nov 27, 2013, at 3:32 PM, Glenn Eychaner <geychaner at mac.com> wrote:

> On further, further, further toying, I now have mcelog running on my 32-bit
> CentOS 6 systems! I admit to doing it the "dumb" way: I grabbed the source
> from the git repository, compiled and installed it, and THEN discovered
> that the init.d file supplied with the source was not CentOS compatible, so
> I grabbed the x86-64 RPM, extracted the startup files, and copied them into
> place. The RPM was small enough to make this easy.
> What I SHOULD have done is to grab the source RPM, replace the source with
> the latest source, build and install the source RPM, and then repackage the
> RPMs again for future consumption.  Maybe I will try that at a future date, but
> I don't really have time today.
> -G.
> On Nov 26, 2013, at 11:11 AM, Glenn Eychaner <geychaner at mac.com> wrote:
>> On further, further investigation, the mcelog install guide at
>> http://www.mcelog.org/installation.html suggests that I could "roll my own"
>> for 32-bit CentOS 6:
>> "For bad page offlining you will need a 2.6.33+ kernel or a 2.6.32 kernel with
>> the soft offlining capability backported (like RHEL6 or SLES11-SP1)"
>> "The kernel has to have CONFIG_X86_MCE enabled. For 32bit kernels you
>> need at least a 2.6.30 kernel."
>> The current kernel I am running is 2.6.32-358.23.2, but I can't tell whether it
>> has CONFIG_X86_MCE enabled. How can I find this out?
>> JD writes:
>>> yum info mcelog
>>> ...
>>> Description : mcelog is a daemon that collects and decodes Machine Check
>>>           : Exception data on x86-64 machines.
>>> So not for 32-bit...
>> On Nov 26, 2013, at 9:25 AM, Glenn Eychaner <geychaner at mac.com> wrote:
>>> Further investigation seems to indicate that these events should be handled
>>> by "mcelog" or "mced". However, there is no /var/log/mcelog, nor do I have a
>>> "mcelog" or "mced" binary, nor does yum seem to contain anything related
>>> (based on "yum whatprovides '*/mcelog'" and similar queries).
>>> Thus, I still don't know what to do with these errors.  Ignore them? I am
>>> running 32-bit CentOS 6.4 (legacy software reasons).
>>> On Nov 25, 2013, at 11:05 AM, Glenn Eychaner <geychaner at mac.com> wrote:
>>>> On my new Haswell-based machines, I am occasionally seeing entries like the
>>>> following in /var/log/messages:
>>>> 	kernel: [Hardware Error]: Machine check events logged
>>>> (I would not have even noticed them, except that they get flagged by logwatch.)
>>>> These messages always occur alone, and don't seem to have a corresponding
>>>> entry in any other log file in /var/log. How can I get more info about these
>>>> messages?
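
On the CONFIG_X86_MCE question quoted above: one way to find out is to look the
option up in the kernel's build configuration. The sketch below assumes the
config for the running kernel is installed as /boot/config-<release>, which I
believe is where CentOS puts it, but treat that path as an assumption. The
helper names are hypothetical.

```python
import platform
from pathlib import Path


def kernel_option(config_text, option):
    """Return the value of `option` ('y', 'm', or a literal), or None.

    Lines such as '# CONFIG_FOO is not set' are comments and never match,
    so a disabled or absent option comes back as None.
    """
    prefix = option + "="
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            return line[len(prefix):]
    return None


def running_kernel_config():
    """Assumed location of the running kernel's build config."""
    return Path("/boot") / ("config-" + platform.release())


if __name__ == "__main__":
    path = running_kernel_config()
    if path.exists():
        value = kernel_option(path.read_text(), "CONFIG_X86_MCE")
        print("CONFIG_X86_MCE:", value if value is not None else "not set")
    else:
        print("No kernel config found at", path)
```

Matching on the option name followed by "=" keeps CONFIG_X86_MCE from being
confused with longer names such as CONFIG_X86_MCE_INTEL.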

Glenn Eychaner (geychaner at lco.cl)
Telescope Systems Programmer, Las Campanas Observatory