I had a series of kernel hardware error reports today while I was away from my computer:
Message from syslogd@fcshome at Aug 12 10:12:24 ... kernel:[Hardware Error]: MC2 Error: VB Data ECC or parity error.
Message from syslogd@fcshome at Aug 12 10:12:24 ... kernel:[Hardware Error]: Error Status: Corrected error, no action required.
Message from syslogd@fcshome at Aug 12 10:12:24 ... kernel:[Hardware Error]: CPU:2 (15:2:0) MC2_STATUS[-|CE|MiscV|-|-|-|-|CECC]: 0x98444000010c0176
Message from syslogd@fcshome at Aug 12 10:12:24 ... kernel:[Hardware Error]: cache level: L2, tx: DATA, mem-tx: EV
I've never seen anything like that before.
The CPU is:
$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 21
model : 2
model name : AMD FX(tm)-6300 Six-Core Processor
stepping : 0
microcode : 0x600084f
cpu MHz : 1400.000
cache size : 2048 KB
physical id : 0
siblings : 6
core id : 0
cpu cores : 3
apicid : 16
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl nonstop_tsc extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb arat cpb hw_pstate npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold bmi1
bogomips : 7023.90
TLB size : 1536 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm 100mhzsteps hwpstate cpb eff_freq_ro
It's a six-core AMD; the above is one of the cores.
Any clues on how to figure out these errors and/or mitigate them?
Thanks!
Fred
On Aug 12, 2017, at 3:50 PM, Fred Smith fredex@fcshome.stoneham.ma.us wrote:
[original message quoted in full; snipped]
MC == Machine check exception. The important part of an MC report is the "status" code. One can use the Intel "64 and IA-32 Architectures Software Developer's Manual" (a 4000-page PDF) to decode it. I'm unsure, but it looks like AMD uses similar MC codes. Luckily, Linux does some of the heavy lifting and decodes it to "cache hierarchy error, L2 data eviction". The next most important part is the "corrected" bit.
Now what does that really mean? *shrug*, could be firmware/drivers/overheating/poor-CPU-seating/DIMM-seating/faulty-motherboard/faulty-CPU/faulty-DIMM.
Hope that doesn't confuse too much. (:
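For anyone who wants to see how the status value lines up with what the kernel printed, here is a minimal Python sketch, not an authoritative decoder. It assumes the usual x86 MCi_STATUS layout (VAL/OVER/UC/EN/MISCV/ADDRV/PCC in bits 63-57), assumes CECC is bit 46 on this AMD family, and assumes the low 16 bits use the "0000 0001 RRRR TTLL" memory-hierarchy error code format. The function and table names are made up for illustration, and the field labels are shorthand rather than the kernel's exact strings.

```python
# Sketch only: decode a few fields of an x86 MCi_STATUS value.
# Bit positions below are assumptions based on the common MCA layout;
# CECC at bit 46 is assumed to be AMD-specific.
FLAG_BITS = {
    "VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
    "MISCV": 59, "ADDRV": 58, "PCC": 57,
    "CECC": 46,
}

# Lookup tables for the "0000 0001 RRRR TTLL" memory-hierarchy code.
MEM_TX = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EV", "SNOOP"]
TX_TYPE = ["INSN", "DATA", "GEN", "RESV"]
CACHE_LEVEL = ["L0", "L1", "L2", "LG"]


def decode_mc_status(status):
    """Return a dict with the flag bits and, if applicable, the cache fields."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    code = status & 0xFFFF  # low 16 bits: the error code proper
    decoded = {"flags": flags, "error_code": code}
    # Only the memory-hierarchy format (high byte 0x01) is handled here.
    if (code >> 8) == 0x01:
        rrrr = (code >> 4) & 0xF
        decoded["mem_tx"] = MEM_TX[rrrr] if rrrr < len(MEM_TX) else "RESV"
        decoded["tx"] = TX_TYPE[(code >> 2) & 0x3]
        decoded["cache_level"] = CACHE_LEVEL[code & 0x3]
    return decoded


info = decode_mc_status(0x98444000010c0176)
print(info["cache_level"], info["tx"], info["mem_tx"])
print("uncorrected" if info["flags"]["UC"] else "corrected")
```

With the value from the report, the UC bit is clear and the MISCV and CECC bits are set, and the low bits come out as L2 / DATA / EV, which agrees with the kernel's own "cache level: L2, tx: DATA, mem-tx: EV" line.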
On Sat, Aug 12, 2017 at 05:51:33PM -0400, Steven Tardy wrote:
Now what does that really mean? *shrug*, could be firmware/drivers/overheating/poor-CPU-seating/DIMM-seating/faulty-motherboard/faulty-CPU/faulty-DIMM.
Well, overheating is possible... we don't live in the cleanest possible house, AND we have cats. So, in general, I open up this box twice a year and vacuum out the house dirt and cat fuzzies. I'm probably overdue for this task.
This is the first one of these I've had. Hope it's the last, but a little PM is in order either way.
Thanks for the reply.
Fred
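Since the open question is whether this stays a one-off, one option is to tally the MC status lines per CPU and bank whenever the logs are reviewed. The following is a rough Python sketch under assumptions: the regular expression is written only against the line format shown in this thread, and the names `STATUS_RE` and `tally_mce` are invented for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... kernel:[Hardware Error]: CPU:2 (15:2:0) MC2_STATUS[-|CE|...]: 0x9844...
# The pattern is an assumption derived from the messages quoted in this thread.
STATUS_RE = re.compile(
    r"\[Hardware Error\]: CPU:(\d+) \([^)]*\) "
    r"MC(\d+)_STATUS\[([^\]]*)\]: (0x[0-9a-fA-F]+)"
)


def tally_mce(lines):
    """Count status lines keyed by (cpu, bank, 'CE' or 'UE')."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if not match:
            continue  # not an MC status line
        cpu = int(match.group(1))
        bank = int(match.group(2))
        flags = match.group(3).split("|")
        kind = "UE" if "UE" in flags else "CE"
        counts[(cpu, bank, kind)] += 1
    return counts


sample = [
    "Message from syslogd@fcshome at Aug 12 10:12:24 ... kernel:[Hardware Error]: "
    "MC2 Error: VB Data ECC or parity error.",
    "Message from syslogd@fcshome at Aug 12 10:12:24 ... kernel:[Hardware Error]: "
    "CPU:2 (15:2:0) MC2_STATUS[-|CE|MiscV|-|-|-|-|CECC]: 0x98444000010c0176",
]
for key, count in tally_mce(sample).items():
    print(key, count)
```

The same function could be fed the lines of a saved kernel log. A count that keeps climbing on the same CPU and bank would point at hardware rather than a one-time event, while a single corrected error would not.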
_______________________________________________
CentOS mailing list
CentOS@centos.org
https://lists.centos.org/mailman/listinfo/centos
On 08/12/2017 07:24 PM, Fred Smith wrote:
Well. overheating is possible... we don't live in the cleanest possible house, AND we have cats. so, in general I open up this box twice a year and vacuum out the house dirt and cat fuzzies. I'm probably overdue for this task.
Cleaning is a good thing to do, but not with a vacuum... the vacuum could loosen components, even make them disappear. Much better would be to use a blower or bellows of some kind.
Also, cowboys scoff, but I always wear a grounded wrist strap when handling electronics.
On 08/13/2017 05:18 AM, ken wrote:
Also, cowboys scoff, but I always wear a grounded wrist strap when handling electronics.
It's a good idea, especially in low-humidity climates. Also noteworthy: the air moving through a hose can cause a vacuum's hose or attachment to build up a static charge, which is another reason it can be a bad idea to use a vacuum in a computer.
That's why I periodically clean my motherboards with water, followed by distilled water and, ideally, Everclear to reduce drying time. (Not rubbing alcohol etc.; it nearly always has excessive impurities that leave a solid residue, and methanol can be damaging as well as being fairly toxic.) Note that modern electronics are defluxed with water or water-based solutions when manufactured.
Of course, I then dry it for 24+ hours on edge in a warm, safe area, usually on cardboard (you don't want metal, honest, and cardboard is neutral to static even when dry), by my baseboard heaters in winter (when static is at its worst) or just on edge in warmer weather (it's dry here in Colorado). Obviously you don't want to get water near drives. Also, IF you clean the power supply this way, give it at least 48 hours; any moisture left in the power supply can easily damage your system, whereas a damp motherboard will simply not function and is unlikely to be damaged unless it is noticeably damp (or left powered on, or with the CMOS battery installed, for an extended period, though even that is not likely). Of course, you also need to remove the CMOS battery first.
Also realize that tight spaces under components can take a while to dry, especially without the alcohol. Some wisdom and skill is required, but I've never had an issue and have done this at least half a dozen times to several of my machines and many others. I'm an electronics tech, and this is the best way. Also, as house dust is largely dead skin cells etc., it can be greasy, in which case warm water and a little mild detergent (which must be thoroughly rinsed) will help a lot (a SOFT natural-fiber brush can be used when wet, or a stiffer one with care). On the other hand, I wouldn't recommend this without some experience with electronics, and appropriate caution with more expensive hardware.
I usually leave the CPU in to avoid the very high risk of bent pins, though this requires added drying time in many cases. Compressed air is not your computer's friend because of static, and blowing high-velocity dust around a computer is an excellent way to cause problems, though people do it all the time.
I DO remove the heatsink, remove the fan from the heatsink, and thoroughly clean the heatsink with hot soapy water (clean the fan with a damp paper towel etc. to avoid damaging the motor/lubrication and washing dust into it!). Cleaning the heatsink this way is the best, and as safe as any removal and reinstall of the heatsink (always, always clean off the old grease and replace it to avoid air bubbles and hot spots). The best/easiest way I've found to remove heatsink grease is with rubbing alcohol and Q-tips. The alcohol doesn't dissolve the grease, but the alcohol and water keep it from sticking back onto the metal once removed. And do ground yourself, especially in winter, and avoid going near carpet or wearing synthetic fibers (cotton etc. is good).
13. Aug 2017 10:55 by gordon.messmer@gmail.com:
[quoted message snipped; see above]
On Sun, Aug 13, 2017 at 08:18:24AM -0400, ken wrote:
Cleaning is a good thing to do, but not with a vacuum... the vacuum could loosen components, even make them disappear. Much better would be to use a blower or bellows of some kind.
thanks for the reminder.
I don't actually use a vacuum, I was just being, er, loose with my terminology. I use a can of compressed "air" where possible, remove fans on heatsinks and blow or wipe/brush out the clogs, remove the inlet filters and wash 'em. I get amazing amounts of cat fur.
On Sat, Aug 12, 2017 at 1:50 PM, Fred Smith fredex@fcshome.stoneham.ma.us wrote:
I had a series of kernel hardware error reports today while I was away from my computer:
Message from syslogd@fcshome at Aug 12 10:12:24 ... kernel:[Hardware Error]: MC2 Error: VB Data ECC or parity error.
Message from syslogd@fcshome at Aug 12 10:12:24 ... kernel:[Hardware Error]: Error Status: Corrected error, no action required.
Cosmic ray corrupted data in RAM, and ECC detected and corrected it? Whatever it was, working as intended.