And all that work was done to get this: the decoded output for a corrected internal parity error. I get roughly one of these per workstation every three days; is that a surprising number? (The workstation under the heaviest load gets more, while the idle spare gets none at all; no surprise there!)
    MCE 6
    CPU 1 BANK 0
    TIME 1385426237 Mon Nov 25 21:37:17 2013
    MCG status:
    MCi status:
    Corrected error
    Error enabled
    MCA: Internal parity error
    STATUS 90000040000f0005 MCGSTATUS 0
    MCGCAP c09 APICID 2 SOCKETID 0
    CPUID Vendor Intel Family 6 Model 60
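For anyone curious how the text lines relate to the raw STATUS word in the record above, here is a minimal Python sketch that unpacks it. The bit positions are my reading of the IA32_MCi_STATUS layout in the Intel Software Developer's Manual, the function name `decode_mci_status` is made up for illustration, and the table holds only a handful of the simple MCA error codes, so treat it as a sketch and not as a replacement for mcelog.

```python
# Sketch only: field positions assumed from the IA32_MCi_STATUS layout
# in the Intel SDM; decode_mci_status is a hypothetical helper name.

# A few of the "simple" MCA error codes (low 16 bits of STATUS).
SIMPLE_MCA_CODES = {
    0x0000: "No error",
    0x0001: "Unclassified",
    0x0002: "Microcode ROM parity error",
    0x0003: "External error",
    0x0004: "FRC error",
    0x0005: "Internal parity error",
}


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into its main fields."""
    mca_code = status & 0xFFFF
    return {
        "valid": bool((status >> 63) & 1),        # VAL
        "overflow": bool((status >> 62) & 1),     # OVER
        "uncorrected": bool((status >> 61) & 1),  # UC
        "enabled": bool((status >> 60) & 1),      # EN
        "misc_valid": bool((status >> 59) & 1),   # MISCV
        "addr_valid": bool((status >> 58) & 1),   # ADDRV
        "context_corrupt": bool((status >> 57) & 1),  # PCC
        # Bits 52:38, meaningful when MCG_CAP advertises CMCI support.
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": mca_code,
        "mca_error_text": SIMPLE_MCA_CODES.get(mca_code, "not in this table"),
    }


if __name__ == "__main__":
    fields = decode_mci_status(0x90000040000F0005)
    for name in sorted(fields):
        print("%-20s %s" % (name, fields[name]))
```

With the STATUS value from the record above, the valid and enabled bits are set and the uncorrected bit is clear, which matches the "Corrected error" and "Error enabled" lines, and the low 16 bits are 0x0005, matching "Internal parity error".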
Anyway, -G.
On Nov 27, 2013, at 3:32 PM, Glenn Eychaner geychaner@mac.com wrote:
On further, further, further toying, I now have mcelog running on my 32-bit CentOS 6 systems! I admit to doing it the "dumb" way: I grabbed the source from the git repository, compiled and installed it, and THEN discovered that the init.d file supplied with the source was not CentOS compatible, so I grabbed the x86-64 RPM, extracted the startup files, and copied them into place. The RPM was small enough to make this easy.
What I SHOULD have done is to grab the source RPM, replace the source with the latest source, build and install the source RPM, and then repackage the RPMs again for future consumption. Maybe I will try that at a future date, but I don't really have time today.
-G.
On Nov 26, 2013, at 11:11 AM, Glenn Eychaner geychaner@mac.com wrote:
On further, further investigation, it looks like, according to the mcelog install guide at http://www.mcelog.org/installation.html, I could "roll my own" for 32-bit CentOS 6:
"For bad page offlining you will need a 2.6.33+ kernel or a 2.6.32 kernel with the soft offlining capability backported (like RHEL6 or SLES11-SP1)"
"The kernel has to have CONFIG_X86_MCE enabled. For 32bit kernels you need at least a 2.6.30 kernel."
The current kernel I am running is 2.6.32-358.23.2, but I can't tell whether it has CONFIG_X86_MCE enabled. How can I find this out?
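One way to check, sketched in Python: distribution kernels on RHEL/CentOS normally ship their build configuration as /boot/config-<release>, where <release> is what "uname -r" prints. That path convention is an assumption about the installed kernel package, and the helper name `config_option` is invented for this example.

```python
# Sketch only: assumes the kernel package installed its build config
# as /boot/config-<release>; config_option is a hypothetical helper.
import platform
import re


def config_option(config_text, name):
    """Return the value of a kernel config option ('y', 'm', ...),
    or None if the option is absent or commented out as 'not set'."""
    pattern = re.compile(r"^%s=(.*)$" % re.escape(name), re.MULTILINE)
    match = pattern.search(config_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # platform.release() gives the same string as `uname -r` on Linux.
    path = "/boot/config-" + platform.release()
    try:
        with open(path) as handle:
            value = config_option(handle.read(), "CONFIG_X86_MCE")
        print("CONFIG_X86_MCE = %s" % value)
    except (IOError, OSError):
        print("no readable kernel config at %s" % path)
```

A result of "y" means machine check support is built into the running kernel; None means the option is not set in that config file.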
JD writes:
    yum info mcelog
    ...
    Description : mcelog is a daemon that collects and decodes Machine Check
                : Exception data on x86-64 machines.
So not for 32-bit...
On Nov 26, 2013, at 9:25 AM, Glenn Eychaner geychaner@mac.com wrote:
Further investigation seems to indicate that these events should be handled by "mcelog" or "mced". However, there is no /var/log/mcelog, nor do I have a "mcelog" or "mced" binary, nor does yum seem to contain anything related (based on "yum whatprovides '*/mcelog'" and similar queries).
Thus, I still don't know what to do with these errors. Ignore them? I am running 32-bit CentOS 6.4 (legacy software reasons).
On Nov 25, 2013, at 11:05 AM, Glenn Eychaner geychaner@mac.com wrote:
On my new Haswell-based machines, I am occasionally seeing entries like the following in /var/log/messages:
    kernel: [Hardware Error]: Machine check events logged
(I would not have even noticed them, except that they get flagged by logwatch.) These messages always occur alone, and don't seem to have a corresponding entry in any other log file in /var/log. How can I get more info about these messages?
-- Glenn Eychaner (geychaner@lco.cl) Telescope Systems Programmer, Las Campanas Observatory