[CentOS] mce error

Ted Miller tedlists at sbcglobal.net
Fri Nov 16 01:46:29 UTC 2012


>> On 11/13/2012 09:21 AM, Johnny Hughes wrote:
>>> On 11/13/2012 07:49 AM, Banyan He wrote:
>>>> Just check the config to build the edac_mce module if you don't build
>>>> it in.
>>>>
>>>> CONFIG_EDAC_MCE=y
>>>>
>>>> Make sure you have this in the /boot/config-xxxx.
>>> If he is running a standard CentOS kernel then he should have
>>> CONFIG_EDAC_MCE=y.
>>>
>>>>
>>>> On 2012-11-13 8:12 PM, Ted Miller wrote:
>>>>> During booting of Centos6 I see an error message that goes something
>>>>> like:
>>>>>
>>>>> Starting mcelog daemon [FAILED]
>>>>> AMD Processor family 15: Please load edac_mce_amd module.
>>>>> CPU is unsupported
>>>>>
>>>>> The only helpful information I have found is in the "preview" of
>>>>> https://access.redhat.com/knowledge/solutions/158503. I don't have a
>>>>> RedHat account, so don't know if they have a real solution.
>>>>>
>>>>> I know that mce has to do with logging certain microprocessor errors.
>>>>>
>>>>> 1. How important is this
>>>>> 2. Is there anything I should do, except wait for a bug fix sometime?
>>>>>
>>>>> Ted Miller
>>>>> Elkhart, IN
>>> What does this command say:
>>>
>>> uname -r

> On 2012-11-14 10:58 AM, Ted Miller wrote:

>> Install is 100% stock, off Minimal Install disk, then added groups for
>> Desktop. Up to date.
>>
>> [tmiller at office04]$uname -r
>> 2.6.32-279.14.1.el6.x86_64
>>
>> Then I tried the command the web page has (I see my error during bootup)
>>
>> [root at office04 Documents]# /etc/init.d/mcelogd start
>> [root at office04 Documents]# /etc/init.d/mcelogd status
>> Checking for mcelog
>> mcelog is stopped
>>
>> [tmiller at office04]$ls /dev/mc*
>> /dev/mcelog
>>
>> so the device does exist
>>
>> [root at office04 Documents]# locate edac_mci_amd
>>
>> returned nothing, but I don't know if it should or not.
>>
>> I was reading the MAN page, and noticed "See mcelog --help for a list of
>> valid CPUs." so I tried it, and it lists:
>> Valid CPUs: generic p6old core2 k8 p4 dunnington xeon74xx xeon7400
>> xeon5500 xeon5200 xeon5000 xeon5100 xeon3100 xeon3200 core_i7 core_i5
>> core_i3 nehalem westmere xeon71xx xeon7100 tulsa intel xeon75xx
>> xeon7500 xeon7200 xeon7100 sandybridge sandybridge-ep
>> All the CPUs I recognize in there are Intel, though I don't know all the
>> nicknames.
>>
>> cat /proc/cpuinfo
>>
>> on my system shows (only first of two cores copied)
>>
>> processor : 0
>> vendor_id : AuthenticAMD
>> cpu family : 15
>> model : 35
>> model name : Dual Core AMD Opteron(tm) Processor 180
>> stepping : 2
>> cpu MHz : 1000.000
>> cache size : 1024 KB
>> physical id : 0
>> siblings : 2
>> core id : 0
>> cpu cores : 2
>> apicid : 0
>> initial apicid : 0
>> fpu : yes
>> fpu_exception : yes
>> cpuid level : 1
>> wp : yes
>> flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat
>> pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext
>> 3dnow rep_good pni lahf_lm cmp_legacy
>> bogomips : 2009.40
>> TLB size : 1024 4K pages
>> clflush size : 64
>> cache_alignment : 64
>> address sizes : 40 bits physical, 48 bits virtual
>> power management: ts fid vid ttp
>>
>> Not the latest and greatest, and old enough I expected it to be supported
>> by now.
>>
>> Any clues in all this?
>> Ted Miller
>> .
On 11/14/2012 01:22 AM, Banyan He wrote:
> 1. ls /lib/modules/2.6.32-279.el6.i686/kernel/drivers/edac | grep mce

It exists

> If you can find the module there, go to step 2
> 2. modprobe edac_mce_amd

That works

> 3. lsmod | grep mce # verify if it loads

That verifies, even after a reboot.  (Didn't try it before step 2, so don't 
know if it was already loaded.)

> If that is not your case, it is the problem with mcelog itself. I'm not
> 100% confident in this conclusion but the code seems wrong here.
>
> if (!strcmp(vendor,"AuthenticAMD")) {
>    if (family == 15)
>       cputype = CPU_K8;
>    if (family >= 15)
>      SYSERRprintf("AMD Processor family %d: Please load edac_mce_amd module.\n", family);
>    return 0;
>
> Your CPU family is 15. Whatever you do, you will reach here since the check
> is called just after the main is launched.

I'm not much at C programming, but the way I read that, I will hit the 
"return 0" statement no matter what the family number, even if it is less 
than 15.  Any CPU that matches the
    !strcmp(vendor,"AuthenticAMD")
expression is going to get to the
    return 0
line eventually.  The two intermediate if statements only determine if a 
value is set for 'cputype' and if the warning statement gets printed before 
you arrive at the
    return 0
line.  You are going to get there whether your family number is 1 or 100.

I found source code online (had a comment about being edited two months 
ago) for the is_cpu_supported routine.  Looking at the whole thing, I see 
what appear (to my inexperienced eye) to be two program flow errors.

1. The issue you pointed out, where the third 'if' statement looks like it 
should be '>', not '>='.

2. It looks like there should be braces around the two statements following 
the third 'if' statement.  Then it would look like:

    if (!strcmp(vendor,"AuthenticAMD")) {
      if (family == 15)
        cputype = CPU_K8;
      if (family > 15) {
        SYSERRprintf("AMD Processor family %d: Please load edac_mce_amd module.\n", family);
        return 0;
      }

That construction would allow Family=15 to be supported.  The mcelog --help 
output lists k8 as a supported CPU (but I wonder if it has ever been tested?).

Without these changes, my eye says that all AMD CPUs are rejected (return 
0) and never reach the accepted path (return 1).
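
To make the two readings concrete, here is a minimal sketch in C.  It is 
not the mcelog source: the function names amd_check_as_quoted and 
amd_check_proposed and the enum cputype_sketch are made up for 
illustration, fprintf stands in for SYSERRprintf, and the final "return 1" 
for anything that falls through is an assumption about what the rest of 
is_cpu_supported does.

```c
#include <stdio.h>
#include <string.h>

/* Stand-in for mcelog's CPU type values; only the K8 entry matters here. */
enum cputype_sketch { CPU_GENERIC, CPU_K8 };

/* AMD branch as quoted: without braces, "return 0" is unconditional. */
int amd_check_as_quoted(const char *vendor, int family,
                        enum cputype_sketch *cputype)
{
    if (!strcmp(vendor, "AuthenticAMD")) {
        if (family == 15)
            *cputype = CPU_K8;
        if (family >= 15)
            fprintf(stderr, "AMD Processor family %d: "
                    "Please load edac_mce_amd module.\n", family);
        return 0;   /* reached for every AMD family, 1 or 100 */
    }
    return 1;       /* assumption: everything else counts as supported */
}

/* Proposed variant: '>' instead of '>=', braces around warning + return. */
int amd_check_proposed(const char *vendor, int family,
                       enum cputype_sketch *cputype)
{
    if (!strcmp(vendor, "AuthenticAMD")) {
        if (family == 15)
            *cputype = CPU_K8;
        if (family > 15) {
            fprintf(stderr, "AMD Processor family %d: "
                    "Please load edac_mce_amd module.\n", family);
            return 0;   /* only families above 15 are rejected */
        }
    }
    return 1;       /* assumption: falling through means "supported" */
}
```

In this sketch the as-quoted version returns 0 for family 15, while the 
proposed version sets the K8 type and returns 1.  Note that under the same 
assumption the proposed version also lets families below 15 fall through.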

> if (!cpu_forced && !is_cpu_supported()) {
> fprintf(stderr, "CPU is unsupported\n");
> exit(1);
> }
>
> The routine is_cpu_supported reads the data from /proc/cpuinfo for the
> family number. You got stuck here then. You can change the code from ">=15"
> to "> 15".
>
> ------------
> Banyan He
> Blog: http://www.rootong.com
> Email: banyan at rootong.com
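
As a rough illustration of the /proc/cpuinfo lookup described above, the 
following C sketch pulls the "cpu family" value out of cpuinfo-style text.  
The helper name parse_cpu_family is hypothetical and this is not how mcelog 
itself is written; it takes the text as a string so it can be tried without 
reading the real file.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the first "cpu family" value found in
 * cpuinfo-style text, or -1 if no such line exists. */
int parse_cpu_family(const char *text)
{
    const char *line = text;

    while (line && *line) {
        int family;

        /* A space in the format matches any run of blanks or tabs,
         * so both "cpu family : 15" and "cpu family\t: 15" match. */
        if (sscanf(line, "cpu family : %d", &family) == 1)
            return family;

        /* Advance to the start of the next line, if there is one. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}
```

Fed the /proc/cpuinfo excerpt from earlier in this thread, it would return 
15, which is the value the family checks then act on.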

I don't have source code downloaded, nor have I done much 
building/compiling, but I would be willing to try to solve this.  Maybe I 
can contribute a little bit back to the project this way.

Does anyone else read the code the way I do, or am I missing something 
completely?
Ted Miller
Indiana, USA

