[CentOS] ECC RAM Error

Thu Oct 11 14:48:06 UTC 2007
Peter Arremann <loony at loonybin.org>

On Thursday 11 October 2007, Centos wrote:
> The ECC errors only happens when I am transferring data from other
> storage to this one that we get error.
> it only happens when it is writing data to it.

ECC errors can happen anywhere. The data can be corrupted while it is 
transmitted to the storage device, or it can degrade while stored. And of 
course, the transmission back from the storage gives you another chance to 
screw it up.

Problem is, in almost all cases, you won't see those errors until you read the 
data. The memory controller will then perform the ECC check and see that the 
data that was returned is bad. What happens then depends on what type of 
memory and memory controller you have. 

Simple (old) x86 setups will correct single-bit errors and report double-bit 
errors as uncorrectable. If you happen to have 3 bits that changed in the 
same data word, ECC can actually screw you up worse - it can see it as a 
single-bit error and correct the wrong way. That way you get corrupt data, 
and all that gets logged is a corrected (soft) error. 
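
As a rough illustration of the three-bit case, here is a toy Python sketch of 
a single-error-correct, double-error-detect code: a Hamming(7,4) code plus one 
overall parity bit. It is an assumption-laden model, not what any particular 
memory controller implements; real DIMMs typically protect 64 data bits with 
8 check bits, and the names encode, decode and flip are made up for this 
example. In this tiny code every three-bit error is miscorrected, while wider 
codes may flag some of them as uncorrectable instead.

```python
DATA_POS = (3, 5, 6, 7)    # codeword positions that carry the 4 data bits
PARITY_POS = (1, 2, 4)     # Hamming parity positions; position 0 is overall parity


def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of 0/1)."""
    code = [0] * 8
    for i, pos in enumerate(DATA_POS):
        code[pos] = (nibble >> i) & 1
    for p in PARITY_POS:
        # Each parity bit covers the positions whose index has bit p set.
        code[p] = sum(code[pos] for pos in range(1, 8) if pos & p) % 2
    code[0] = sum(code[1:]) % 2   # make the total number of 1 bits even
    return code


def decode(code):
    """Return (status, data) the way a simple SECDED decoder would."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2
    fixed = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flipped bits: the decoder assumes exactly one
        # and flips the bit the syndrome points at (0 = overall parity bit).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flipped bits with a non-zero syndrome.
        status = "uncorrectable"
    data = sum(fixed[pos] << i for i, pos in enumerate(DATA_POS))
    return status, data


def flip(code, *positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for pos in positions:
        out[pos] ^= 1
    return out


if __name__ == "__main__":
    data = 0b1011
    word = encode(data)
    for bad_bits in ((), (5,), (2, 6), (3, 5, 6)):
        status, result = decode(flip(word, *bad_bits))
        print(f"{len(bad_bits)} flipped: {status}, data={result:04b} "
              f"(original {data:04b})")
```

With three flipped bits the decoder reports "corrected" but hands back data 
that differs from what was stored, which is the silent corruption described 
above.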

Newer, more complex x86 configs and most proprietary Unix boxes protect 
against that by using fancier ECC algorithms, memory RAID and things like 
that. 

Anyway - ECC errors to me mean that I need to trigger a failover and get off 
the box ASAP. There is no ECC algorithm and hardware setup out there that 
does the right thing every single time. If you don't have a failover, see if 
you can take the system down now, remove the offending DIMM/bank and run with 
the remaining RAM until you get replacements. 
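
To work out which DIMM/bank is the offending one, one option on Linux is to 
read the per-row error counters that the kernel EDAC subsystem exposes in 
sysfs. The sketch below assumes an EDAC driver for your memory controller is 
loaded and that it uses the mc*/csrow* layout with ce_count and ue_count 
files; the function names are made up for this example, and mapping a csrow 
to a physical slot is board specific.

```python
from pathlib import Path

# Assumed location of the EDAC memory controller tree in sysfs.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_count(path):
    """Read one integer counter file; return None if missing or unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def edac_error_counts(root=EDAC_ROOT):
    """Return {(controller, csrow): (corrected, uncorrected)} for each csrow."""
    counts = {}
    for csrow in sorted(Path(root).glob("mc*/csrow*")):
        ce = read_count(csrow / "ce_count")
        ue = read_count(csrow / "ue_count")
        if ce is None and ue is None:
            continue   # not a csrow directory with counters
        counts[(csrow.parent.name, csrow.name)] = (ce or 0, ue or 0)
    return counts


def failing_rows(counts):
    """Return the (controller, csrow) keys that have logged any error."""
    return sorted(key for key, (ce, ue) in counts.items() if ce or ue)


if __name__ == "__main__":
    for (mc, row), (ce, ue) in sorted(edac_error_counts().items()):
        flag = "  <-- errors logged" if ce or ue else ""
        print(f"{mc}/{row}: corrected={ce} uncorrected={ue}{flag}")
```

A row with a climbing corrected count while the others stay at zero is the 
first candidate to pull.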

Peter.