[CentOS] Intel SE7210TP1-E giving memory errors

Wed Dec 7 09:07:55 UTC 2011
John R Pierce <pierce at hogranch.com>

On 12/07/11 12:55 AM, John Hodrien wrote:
> In my limited experience, if you can disable ECC in your BIOS, memtest is
> just as good at spotting errors on ECC as non-ECC.  With ECC enabled,
> you'll need seriously messed up ECC before it'll be detected.

Except with ECC disabled, the extra 8 ECC bits per 64-bit memory word 
aren't touched at all, so memtest never exercises them.
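
For anyone wondering where the 8 bits per 64-bit word come from, here is a 
rough sketch of the arithmetic.  It assumes the memory uses a Hamming-style 
SEC-DED code (single error correct, double error detect), which is typical 
for 72-bit ECC DIMMs but is my assumption, not something stated in this 
thread.  The function name is made up for illustration.

```python
def secded_check_bits(data_bits):
    """Check bits needed for a Hamming SEC-DED code over data_bits of data."""
    r = 0
    # Single-error correction needs 2**r >= data_bits + r + 1, so that every
    # bit position (data or check) plus "no error" gets a distinct syndrome.
    while 2 ** r < data_bits + r + 1:
        r += 1
    # One extra overall-parity bit adds double-error detection.
    return r + 1


check = secded_check_bits(64)
print(check, 64 + check)   # check bits, total stored bits per word
```

With 64 data bits this gives 8 check bits, i.e. a 72-bit stored word.  With 
ECC disabled in the BIOS those 8 extra bits are simply never read or written.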

I'd leave ECC on and skip running memtest entirely: just run real OS 
workloads and let the ECC do the memory test on the fly, as it's meant to.

Does Linux have an ECC scrubber process?  'Real' Unix servers (Solaris, 
AIX, etc.) generally have a background process, sometimes as part of the 
idle process, that does a read/write of every memory location when the 
machine is otherwise idle.  This catches and fixes soft ECC errors in 
otherwise idle memory, and each fix gets logged.  Solaris (on Sun SPARC 
hardware at least) keeps track of which locations have had bad memory, 
and will stop using a memory page entirely (with a logged alert) if 
there are too many soft ECC errors in the same area.
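
To make the page-retirement idea concrete, here is a toy sketch of the 
bookkeeping.  This is not Solaris code and not any real kernel interface; 
the class name, the 4 KB page size and the threshold of 3 errors are all 
assumptions picked for illustration.

```python
from collections import Counter

PAGE_SIZE = 4096        # assumed page size in bytes
RETIRE_THRESHOLD = 3    # assumed soft-error limit per page (made-up value)


class SoftErrorTracker:
    """Toy model: count corrected ECC errors per page, retire noisy pages."""

    def __init__(self, threshold=RETIRE_THRESHOLD):
        self.threshold = threshold
        self.errors = Counter()   # page number -> corrected-error count
        self.retired = set()      # pages no longer handed out

    def record_corrected_error(self, address):
        """Log one soft error; return True if this error retired the page."""
        page = address // PAGE_SIZE
        if page in self.retired:
            return False
        self.errors[page] += 1
        print(f"corrected ECC error at {address:#x} (page {page})")
        if self.errors[page] >= self.threshold:
            self.retired.add(page)
            print(f"ALERT: retiring page {page}")
            return True
        return False


# Three errors land in page 1, one in page 2; only page 1 is retired.
tracker = SoftErrorTracker()
for addr in (0x1000, 0x1008, 0x2000, 0x1FF0):
    tracker.record_corrected_error(addr)
print(sorted(tracker.retired))
```

A real scrubber would feed this from the memory controller's corrected-error 
reports while it walks idle memory; the sketch only covers the counting and 
the retire decision.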

-- 
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast