On Aug 12, 2017, at 3:50 PM, Fred Smith fredex@fcshome.stoneham.ma.us wrote:
I had a series of kernel hardware error reports today while I was away from my computer:
Message from syslogd@fcshome at Aug 12 10:12:24 ... kernel:[Hardware Error]: MC2 Error: VB Data ECC or parity error.
Message from syslogd@fcshome at Aug 12 10:12:24 ... kernel:[Hardware Error]: Error Status: Corrected error, no action required.
Message from syslogd@fcshome at Aug 12 10:12:24 ... kernel:[Hardware Error]: CPU:2 (15:2:0) MC2_STATUS[-|CE|MiscV|-|-|-|-|CECC]: 0x98444000010c0176
Message from syslogd@fcshome at Aug 12 10:12:24 ... kernel:[Hardware Error]: cache level: L2, tx: DATA, mem-tx: EV
never saw anything like that before.
cpu is:
$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 21
model : 2
model name : AMD FX(tm)-6300 Six-Core Processor
stepping : 0
microcode : 0x600084f
cpu MHz : 1400.000
cache size : 2048 KB
physical id : 0
siblings : 6
core id : 0
cpu cores : 3
apicid : 16
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl nonstop_tsc extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb arat cpb hw_pstate npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold bmi1
bogomips : 7023.90
TLB size : 1536 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm 100mhzsteps hwpstate cpb eff_freq_ro
six core AMD, above is one of the cores.
Any clues to figure out the errors, and/or mitigate?
thanks!
Fred
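MC == machine check (as in MCE, machine check exception). The important part of a machine-check report is the "status" code, here the MC2_STATUS value. One can use the Intel doc "Intel 64 and IA-32 Architectures Software Developer's Manual" to decode this (4000 page .pdf). I'm unsure, but it looks like AMD uses similar MC codes. Luckily Linux does some heavy lifting and decodes it to "cache level: L2, tx: DATA, mem-tx: EV", i.e. a cache hierarchy error on an L2 data eviction. The next most important part is the "corrected" bit: the log says "Corrected error, no action required", so the hardware fixed this one itself.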
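For the curious, here is a rough Python sketch of the decoding the kernel did for you. It assumes the usual x86 MCi_STATUS bit layout (valid = bit 63, overflow = 62, uncorrected = 61, MiscV = 59, AddrV = 58, PCC = 57), the AMD-specific CECC flag at bit 46, and the "memory hierarchy" error code format 0000 0001 RRRR TTLL in the low 16 bits. The function name and the label tables are my own, picked to match the strings in the log; treat them as illustrative, not as the kernel's actual code.

```python
# Sketch only: bit positions assume the x86 MCi_STATUS layout,
# with CECC at bit 46 as an AMD-specific assumption.
STATUS = 0x98444000010C0176  # MC2_STATUS value from the log above

# Label tables are illustrative, chosen to match the log output.
CACHE_LEVEL = ["RESV", "L1", "L2", "L3/GEN"]            # LL field
TX_TYPE = ["INSN", "DATA", "GEN", "RESV"]               # TT field
MEM_TX = ["GEN", "RD", "WR", "DRD", "DWR", "IRD",
          "PRF", "EV", "SNP"]                           # RRRR field


def decode_status(status: int) -> dict:
    """Pull the flag bits and the memory-hierarchy error code out of MCi_STATUS."""
    result = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "context_corrupt": bool((status >> 57) & 1),
        "cecc": bool((status >> 46) & 1),
    }
    code = status & 0xFFFF  # low 16 bits hold the MCA error code
    # Memory hierarchy errors have the form 0000 0001 RRRR TTLL.
    if (code & 0xFF00) == 0x0100:
        rrrr = (code >> 4) & 0xF
        result["cache_level"] = CACHE_LEVEL[code & 0x3]
        result["tx"] = TX_TYPE[(code >> 2) & 0x3]
        result["mem_tx"] = MEM_TX[rrrr] if rrrr < len(MEM_TX) else "RESV"
    return result


info = decode_status(STATUS)
for key, value in info.items():
    print(f"{key}: {value}")
```

Run against the value from the log, this gives the same L2 / DATA / EV fields the kernel printed, with the uncorrected bit clear and CECC set, which is why the kernel labels it a corrected error.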
Now what does that really mean? *shrug*. It could be firmware, drivers, overheating, poor CPU seating, poor DIMM seating, a faulty motherboard, a faulty CPU, or a faulty DIMM.
Hope that doesn't confuse too much. (: