[CentOS] Kernel:[Hardware Error]:

Sat Aug 12 23:24:26 UTC 2017

On Sat, Aug 12, 2017 at 05:51:33PM -0400, Steven Tardy wrote:
> 
> > On Aug 12, 2017, at 3:50 PM, Fred Smith <fredex at fcshome.stoneham.ma.us> wrote:
> > 
> > I had a series of kernel hardware error reports today while I was away 
> > from my computer:
> > 
> > Message from syslogd at fcshome at Aug 12 10:12:24 ...
> > kernel:[Hardware Error]: MC2 Error: VB Data ECC or parity error.
> > 
> > Message from syslogd at fcshome at Aug 12 10:12:24 ...
> > kernel:[Hardware Error]: Error Status: Corrected error, no action required.
> > 
> > Message from syslogd at fcshome at Aug 12 10:12:24 ...
> > kernel:[Hardware Error]: CPU:2 (15:2:0) MC2_STATUS[-|CE|MiscV|-|-|-|-|CECC]: 0x98444000010c0176
> > 
> > Message from syslogd at fcshome at Aug 12 10:12:24 ...
> > kernel:[Hardware Error]: cache level: L2, tx: DATA, mem-tx: EV
> > 
> > I never saw anything like that before.
> > 
> > cpu is:
> > 
> >    $ cat /proc/cpuinfo
> >    processor    : 0
> >    vendor_id    : AuthenticAMD
> >    cpu family    : 21
> >    model        : 2
> >    model name    : AMD FX(tm)-6300 Six-Core Processor
> >    stepping    : 0
> >    microcode    : 0x600084f
> >    cpu MHz        : 1400.000
> >    cache size    : 2048 KB
> >    physical id    : 0
> >    siblings    : 6
> >    core id        : 0
> >    cpu cores    : 3
> >    apicid        : 16
> >    initial apicid    : 0
> >    fpu        : yes
> >    fpu_exception    : yes
> >    cpuid level    : 13
> >    wp        : yes
> >    flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl nonstop_tsc extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb arat cpb hw_pstate npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold bmi1
> >    bogomips    : 7023.90
> >    TLB size    : 1536 4K pages
> >    clflush size    : 64
> >    cache_alignment    : 64
> >    address sizes    : 48 bits physical, 48 bits virtual
> >    power management: ts ttp tm 100mhzsteps hwpstate cpb eff_freq_ro
> > 
> > 
> > Six-core AMD; above is one of the cores.
> > 
> > Any clues on how to figure out the errors and/or mitigate them?
> > 
> > Thanks!
> > 
> > Fred
> 
> MC == machine check (as in machine check exception, MCE).
> The important part of an MC report is the "status" code.
> One can use Intel's "Intel 64 and IA-32 Architectures Software Developer's Manual" to decode this (4000-page .pdf).
> I'm unsure, but it looks like AMD uses similar MC codes.
> Luckily Linux does some heavy lifting and decodes it to "cache hierarchy error, L2 data eviction".
> The next most important part is the "corrected" bit.
> 
> Now what does that really mean?
> *shrug*, could be firmware/drivers/overheating/poor-CPU-seating/DIMM-seating/faulty-motherboard/faulty-CPU/faulty-DIMM.
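
For reference, the status word in the log above can also be decoded by hand.
What follows is only a minimal sketch, assuming the standard x86 machine-check
status layout (VAL/OVER/UC/EN/MISCV/ADDRV/PCC in bits 63..57), the AMD-specific
CECC/UECC bits (46 and 45), and the "memory hierarchy" error-code format
0000 0001 RRRR TTLL in the low 16 bits. The function name decode_mc_status and
the dictionary keys are made up for this example; they are not a kernel or
library API.

```python
# Sketch: decode an x86 machine-check status word (MCi_STATUS).
# decode_mc_status and the result keys are hypothetical names for illustration.

FLAG_BITS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was NOT corrected
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # the MISC register holds extra information
    "ADDRV": 58,  # the ADDR register holds a valid address
    "PCC": 57,    # processor context may be corrupt
    "CECC": 46,   # AMD-specific: corrected by ECC
    "UECC": 45,   # AMD-specific: ECC error that could not be corrected
}

# Field names for a memory hierarchy error code: 0000 0001 RRRR TTLL
LEVEL = ["reserved", "L1", "L2", "L3/generic"]                          # LL
TX_TYPE = ["INSN", "DATA", "GEN", "reserved"]                           # TT
MEM_TX = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]   # RRRR


def decode_mc_status(status):
    """Return the flag bits and, for memory hierarchy errors, the cache fields."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    code = status & 0xFFFF  # low 16 bits carry the error code
    info = {
        "flags": flags,
        "error_code": code,
        "corrected": flags["VAL"] and not flags["UC"],
    }
    if (code & 0xFF00) == 0x0100:  # memory hierarchy error
        rrrr = (code >> 4) & 0xF
        info["level"] = LEVEL[code & 0x3]
        info["tx"] = TX_TYPE[(code >> 2) & 0x3]
        info["mem_tx"] = MEM_TX[rrrr] if rrrr < len(MEM_TX) else "reserved"
    return info


# Status value taken from the MC2_STATUS line in the log above.
info = decode_mc_status(0x98444000010C0176)
print(info["corrected"], info.get("level"), info.get("tx"), info.get("mem_tx"))
```

The fields this produces for 0x98444000010c0176 line up with what the kernel
already printed (corrected, CECC set, cache level L2, tx DATA, mem-tx EV), so
the sketch mainly shows where each part of the log message comes from.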

Well, overheating is possible... We don't live in the cleanest possible
house, AND we have cats. So, in general, I open up this box twice a year
and vacuum out the house dirt and cat fuzzies. I'm probably overdue for
this task.

This is the first one of these I've had. I hope it's the last, but a
little PM (preventive maintenance) is in order either way.

Thanks for the reply.

Fred
> 
> Hope that doesn't confuse too much. (:

-- 
---- Fred Smith -- fredex at fcshome.stoneham.ma.us -----------------------------
                    The Lord detests the way of the wicked 
                  but he loves those who pursue righteousness.
----------------------------- Proverbs 15:9 (niv) -----------------------------