On my new Haswell-based machines, I am occasionally seeing entries like the following in /var/log/messages:

kernel: [Hardware Error]: Machine check events logged

(I would not have even noticed them, except that they get flagged by logwatch.) These messages always occur alone, and don't seem to have a corresponding entry in any other log file in /var/log. How can I get more info about these messages?
Thanks,
-G.
--
Glenn Eychaner (geychaner@lco.cl)
Telescope Systems Programmer, Las Campanas Observatory
Further investigation seems to indicate that these events should be handled by "mcelog" or "mced". However, there is no /var/log/mcelog, nor do I have a "mcelog" or "mced" binary, nor does yum seem to contain anything related (based on "yum whatprovides '*/mcelog'" and similar queries).
Thus, I still don't know what to do with these errors. Ignore them? I am running 32-bit CentOS 6.4 (legacy software reasons).
-G.
On Nov 25, 2013, at 11:05 AM, Glenn Eychaner geychaner@mac.com wrote:
On my new Haswell-based machines, I am occasionally seeing entries like the following in /var/log/messages:

kernel: [Hardware Error]: Machine check events logged

(I would not have even noticed them, except that they get flagged by logwatch.) These messages always occur alone, and don't seem to have a corresponding entry in any other log file in /var/log. How can I get more info about these messages?
-- Glenn Eychaner (geychaner@lco.cl) Telescope Systems Programmer, Las Campanas Observatory
On Tue, Nov 26, 2013 at 09:25:55AM -0300, Glenn Eychaner wrote:
Further investigation seems to indicate that these events should be handled by "mcelog" or "mced". However, there is no /var/log/mcelog, nor do I have a "mcelog" or "mced" binary, nor does yum seem to contain anything related (based on "yum whatprovides '*/mcelog'" and similar queries).
Thus, I still don't know what to do with these errors. Ignore them? I am running 32-bit CentOS 6.4 (legacy software reasons).
You should have this package available:
% rpm -qi mcelog
Name        : mcelog                       Relocations: (not relocatable)
Version     : 1.0pre3_20120814_2           Vendor: CentOS
Release     : 0.6.el6                      Build Date: Thu Feb 21 20:52:19 2013
Install Date: Sat Mar 9 06:48:53 2013      Build Host: c6b8.bsys.dev.centos.org
Group       : System Environment/Base      Source RPM: mcelog-1.0pre3_20120814_2-0.6.el6.src.rpm
Size        : 116942                       License: GPLv2
Signature   : RSA/SHA1, Sat Feb 23 12:38:34 2013, Key ID 0946fca2c105b9de
Packager    : CentOS BuildSystem http://bugs.centos.org
URL         : http://git.kernel.org/?p=utils/cpu/mce/mcelog.git
Summary     : Tool to translate x86-64 CPU Machine Check Exception data.
Description :
mcelog is a daemon that collects and decodes Machine Check Exception data
on x86-64 machines.
% rpm -ql mcelog
/etc/cron.hourly/mcelog.cron
/etc/mcelog/mcelog.conf
/etc/rc.d/init.d/mcelogd
/etc/sysconfig/mcelogd
/usr/sbin/mcelog
/usr/share/doc/mcelog-1.0pre3_20120814_2
/usr/share/doc/mcelog-1.0pre3_20120814_2/CHANGES
/usr/share/doc/mcelog-1.0pre3_20120814_2/README
/usr/share/man/man8/mcelog.8.gz
From: Glenn Eychaner geychaner@mac.com
Further investigation seems to indicate that these events should be handled by "mcelog" or "mced". However, there is no /var/log/mcelog, nor do I have a "mcelog" or "mced" binary, nor does yum seem to contain anything related (based on "yum whatprovides '*/mcelog'" and similar queries).
Thus, I still don't know what to do with these errors. Ignore them? I am running 32-bit CentOS 6.4 (legacy software reasons).
yum info mcelog
...
Description : mcelog is a daemon that collects and decodes Machine Check
            : Exception data on x86-64 machines.
So not for 32-bit...
JD
On further, further investigation, it looks like I could "roll my own" for 32-bit CentOS 6, according to the mcelog install guide at http://www.mcelog.org/installation.html:

"For bad page offlining you will need a 2.6.33+ kernel or a 2.6.32 kernel with the soft offlining capability backported (like RHEL6 or SLES11-SP1)"

"The kernel has to have CONFIG_X86_MCE enabled. For 32bit kernels you need at least a 2.6.30 kernel."
The current kernel I am running is 2.6.32-358.23.2, but I can't tell whether it has CONFIG_X86_MCE enabled. How can I find this out?
Thanks, -G.
On 11/26/2013 03:11 PM, Glenn Eychaner wrote: [snip]
The current kernel I am running is 2.6.32-358.23.2, but I can't tell whether it has CONFIG_X86_MCE enabled. How can I find this out?
$ grep CONFIG_X86_MCE /boot/config-2.6.32-358.23.2.el6.x86_64
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
CONFIG_X86_MCE_INJECT=m
Regards, Patrick
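Patrick's example reads the x86_64 config file, whereas the question was about a 32-bit install, where the file name will end differently. A minimal sketch of the same check keyed on whatever kernel is actually running, assuming the distribution ships its build config as /boot/config-<kernel release> (the function names here are made up for illustration):

```python
import platform
from pathlib import Path


def parse_kernel_config(text, prefix="CONFIG_X86_MCE"):
    """Return {option: value} for every set option whose name starts with prefix."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Unset options appear as comments ("# CONFIG_FOO is not set"); skip them.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.startswith(prefix):
            options[name] = value
    return options


def running_kernel_config_path():
    """Assumed location: /boot/config-<release>, where release matches `uname -r`."""
    return Path("/boot") / ("config-" + platform.release())


if __name__ == "__main__":
    path = running_kernel_config_path()
    if path.exists():
        for name, value in sorted(parse_kernel_config(path.read_text()).items()):
            print(f"{name}={value}")
    else:
        print(f"no kernel config found at {path}")
```

A value of "y" means the option is built into the kernel and "m" means it is built as a module; an option that is missing from the result is not enabled in that build.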
On further, further, further toying, I now have mcelog running on my 32-bit CentOS 6 systems! I admit to doing it the "dumb" way: I grabbed the source from the git repository, compiled and installed it, and THEN discovered that the init.d file supplied with the source was not CentOS compatible, so I grabbed the x86-64 RPM, extracted the startup files, and copied them into place. The RPM was small enough to make this easy.
What I SHOULD have done is to grab the source RPM, replace the source with the latest source, build and install the source RPM, and then repackage the RPMs again for future consumption. Maybe I will try that at a future date, but I don't really have time today.
-G.
And all that work was done to get this: the output of a corrected memory parity error. I get about one of these per workstation per 3 days, more or less; is this a surprising number? (The workstation under the heaviest load gets more, while the idle spare gets none at all; no surprise there!)
MCE 6
CPU 1 BANK 0
TIME 1385426237 Mon Nov 25 21:37:17 2013
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
Anyway, -G.
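For anyone reading the record above, the STATUS word can be unpacked by hand. A minimal sketch, assuming the architectural IA32_MCi_STATUS layout as I understand it from the Intel SDM (bit 63 valid, bit 61 uncorrected, bit 60 enabled, bits 52:38 corrected-error count, bits 15:0 MCA error code). The function name and the small code table are illustrative and are not part of mcelog:

```python
# Simple (non-compound) MCA error codes, as I recall them from the Intel SDM.
# Treat this table as an assumption; only 0x0005 is confirmed by the mcelog output.
SIMPLE_MCA_CODES = {
    0x0000: "No error",
    0x0001: "Unclassified",
    0x0002: "Microcode ROM parity error",
    0x0003: "External error",
    0x0004: "FRC error",
    0x0005: "Internal parity error",
}


def decode_mci_status(status):
    """Split a 64-bit MCi STATUS value into its main architectural fields."""
    def bit(n):
        return bool((status >> n) & 1)

    mca_code = status & 0xFFFF  # bits 15:0
    return {
        "valid": bit(63),            # VAL: the register holds a logged error
        "overflow": bit(62),         # OVER: an earlier error was overwritten
        "uncorrected": bit(61),      # UC: False means the hardware corrected it
        "enabled": bit(60),          # EN: reporting was enabled for this error
        "misc_valid": bit(59),       # MISCV
        "addr_valid": bit(58),       # ADDRV
        "context_corrupt": bit(57),  # PCC
        # bits 52:38, meaningful when the CPU reports corrected-error counts
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_error_code": mca_code,
        "mca_error_name": SIMPLE_MCA_CODES.get(mca_code, "compound or unknown code"),
    }


if __name__ == "__main__":
    for field, value in decode_mci_status(0x90000040000F0005).items():
        print(f"{field:20s} {value}")
```

Applied to 90000040000f0005, this reports valid, enabled, not uncorrected, a corrected count of 1, and MCA code 0x0005, which agrees with the "Corrected error", "Error enabled", and "Internal parity error" lines mcelog printed. As far as I can tell, that code class describes parity inside the processor rather than a DIMM ECC event, but that reading is an assumption worth checking against the SDM.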
On Nov 27, 2013, at 3:32 PM, Glenn Eychaner geychaner@mac.com wrote:
On further, further, further toying, I now have mcelog running on my 32-bit CentOS 6 systems! I admit to doing it the "dumb" way: I grabbed the source from the git repository, compiled and installed it, and THEN discovered that the init.d file supplied with the source was not CentOS compatible, so I grabbed the x86-64 RPM, extracted the startup files, and copied them into place. The RPM was small enough to make this easy.
What I SHOULD have done is to grab the source RPM, replace the source with the latest source, build and install the source RPM, and then repackage the RPMs again for future consumption. Maybe I will try that at a future date, but I don't really have time today.
-G.
Glenn Eychaner wrote:
And all that work was done to get this: the output of a corrected memory parity error. I get about one of these per workstation per 3 days, more or less; is this a surprising number? (The workstation under the heaviest load gets more, while the idle spare gets none at all; no surprise there!)
MCE 6
CPU 1 BANK 0
TIME 1385426237 Mon Nov 25 21:37:17 2013
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
Is the system still under warranty? How 'bout the memory, if you've replaced it? You *should* replace it. It's not going to get better....
mark
m.roth writes:
Is the system still under warranty? How 'bout the memory, if you've replaced it? You *should* replace it. It's not going to get better....
This is brand-new Kingston 1600MHz ECC memory on a workstation/server running at high altitude in a relatively open environment; I am loath to replace it based on a single correctable parity error every few days. Especially since both active computers are (thus far) seeing about the same error frequency (though it will take many more days or even weeks to determine that for certain; I haven't seen one in the last three days on either active computer), and memtest was run on these computers overnight (18+ hours) between build and deployment without apparent issue.
[The computers were built in the States and then shipped 10,000 miles to the observatory location.]
And the turnaround time from the observatory to the U.S. on servicing is no small matter. I have five of these computers (two active, one "hot" spare, one "cold" spare, one test system); if in the long run one proves to be a problem, I will deal with it at that time. If the memory is a bad batch, I'll need more proof.
-G.
On Nov 27, 2013, at 3:56 PM, Glenn Eychaner geychaner@mac.com wrote:
And all that work was done to get this: the output of a corrected memory parity error. I get about one of these per workstation per 3 days, more or less; is this a surprising number? (The workstation under the heaviest load gets more, while the idle spare gets none at all; no surprise there!)
MCE 6
CPU 1 BANK 0
TIME 1385426237 Mon Nov 25 21:37:17 2013
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
-- Glenn Eychaner (geychaner@lco.cl) Telescope Systems Programmer, Las Campanas Observatory
Quoting Glenn Eychaner geychaner@mac.com:
This is brand-new Kingston 1600MHz ECC memory on a workstation/server running at high altitude [snip]
Cosmic rays? Do you have a Poisson distribution for those machine check events? :)
Devin
He's not running the Poisson distro, he's using CentOS! 8-)
On Fri, Nov 29, 2013 at 11:57 AM, Devin Reade gdr@gno.org wrote:
Quoting Glenn Eychaner geychaner@mac.com:
This is brand-new Kingston 1600MHz ECC memory on a workstation/server running at high altitude [snip]
Cosmic rays? Do you have a Poisson distribution for those machine check events? :)
Devin
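Taking the Poisson question half seriously: if the events were independent and arrived at a steady average rate, the counts would follow a Poisson distribution. A rough sketch using the one-per-workstation-per-three-days figure from earlier in the thread (the rate is Glenn's estimate; independence and a constant rate are assumptions, not something the thread establishes):

```python
import math


def poisson_pmf(k, rate_per_day, days):
    """P(exactly k events in `days`) for a Poisson process with the given daily rate."""
    mean = rate_per_day * days
    return math.exp(-mean) * mean ** k / math.factorial(k)


rate = 1.0 / 3.0                       # about one event per workstation per 3 days
p_quiet_one = poisson_pmf(0, rate, 3)  # one machine logs nothing for 3 days
p_quiet_two = p_quiet_one ** 2         # two independent machines both log nothing

print(f"one machine, 3 quiet days:  {p_quiet_one:.3f}")
print(f"two machines, 3 quiet days: {p_quiet_two:.3f}")
```

Under those assumptions a single machine stays quiet for three days about 37% of the time, and two machines both stay quiet about 14% of the time, so a silent three-day stretch on both active computers is not by itself evidence that anything changed.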
CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos