[CentOS] ECC RAM Error
loony at loonybin.org
Thu Oct 11 14:48:06 UTC 2007
On Thursday 11 October 2007, Centos wrote:
> The ECC errors only happen when I am transferring data from other
> storage to this one.
> It only happens when data is being written to it.
ECC errors can happen anywhere. It can be that the data is corrupted while it
is transmitted to the storage device. Or the data can degrade while stored.
And of course, on the transmission from the storage you have another chance
to screw it up.
Problem is, in almost all cases, you won't see those errors until you read the
data. The memory controller will then perform the ECC checksum and see that
the data that was returned is bad. What happens then depends on what type of
memory and memory controller you have.
Simple (old) x86 setups will correct single bit errors and report double bit
errors as uncorrectable. If you happen to have 3 bits that changed in the
same dataword, ECC can actually screw you up worse - it may see it as a
single bit error and correct the wrong way. That way you get corrupt data,
and all the system reports is a corrected (soft) error.
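To make the 3-bit case concrete, here is a minimal sketch in Python of a toy
single-error-correct, double-error-detect code: an extended Hamming (8,4)
code. It is an assumption for illustration only. Real memory controllers use
wider codes (typically 64 data bits plus 8 check bits), and those can catch
some 3-bit patterns that this toy code cannot. The function names are made up
for the example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit codeword (list of bits).

    Index 0 holds the overall parity bit, indexes 1, 2 and 4 hold the
    Hamming check bits, and indexes 3, 5, 6 and 7 hold the data bits.
    """
    data = [(nibble >> i) & 1 for i in range(4)]
    word = [0] * 8
    word[3], word[5], word[6], word[7] = data
    for p in (1, 2, 4):
        # Each check bit makes the parity of the positions it covers even.
        word[p] = sum(word[i] for i in range(1, 8) if i & p) % 2
    word[0] = sum(word[1:]) % 2
    return word


def flip(word, positions):
    """Return a copy of word with the given bit positions inverted."""
    damaged = list(word)
    for pos in positions:
        damaged[pos] ^= 1
    return damaged


def decode(word):
    """Return (status, nibble) the way a simple SECDED decoder would."""
    syndrome = 0
    for i in range(1, 8):
        if word[i]:
            syndrome ^= i
    overall_odd = sum(word) % 2 == 1
    fixed = list(word)
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Odd overall parity is treated as exactly one flipped bit, and the
        # syndrome names the position to flip back (0 = the parity bit).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Even overall parity with a non-zero syndrome: two bits flipped.
        status = "uncorrectable"
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble


if __name__ == "__main__":
    original = 0b1011
    stored = encode(original)
    for bad_bits in ([5], [3, 5], [3, 5, 7]):
        status, value = decode(flip(stored, bad_bits))
        print(len(bad_bits), "bit(s) flipped:", status,
              "data intact" if value == original else "DATA CORRUPTED")
```

With three flipped bits the overall parity is odd, so the decoder assumes a
single bit error, flips a fourth bit and reports the word as corrected even
though the data it hands back is wrong.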
Newer, more complex x86 configs and most proprietary unix boxes protect
against that by using fancier ECC algorithms, memory raid and things like
Anyway - ECC errors to me mean that I need to trigger a failover and get off
the box asap. There is no ECC algorithm and hardware setup out there that
does the right thing every single time. If you don't have a failover, see if
you can take the system down now, remove the offending DIMM/bank and run with
the remaining RAM until you get replacements.
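To find the offending DIMM/bank, the per-row error counters help. The sketch
below assumes a Linux kernel with an EDAC driver loaded for your chipset,
which exposes counters under /sys/devices/system/edac/mc. Whether your box
has that depends on the hardware and kernel; it is not something the original
report states. The function name is made up for the example.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {name: (corrected, uncorrected)} for each controller and csrow.

    Assumes the EDAC sysfs layout (mcN/ce_count, mcN/ue_count and
    mcN/csrowM/ce_count, mcN/csrowM/ue_count). Returns an empty dict if the
    directory is missing, e.g. because no EDAC driver is loaded.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        for node in [mc, *sorted(mc.glob("csrow[0-9]*"))]:
            ce_file = node / "ce_count"
            ue_file = node / "ue_count"
            if ce_file.is_file() and ue_file.is_file():
                name = str(node.relative_to(base))
                counts[name] = (int(ce_file.read_text()),
                                int(ue_file.read_text()))
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

A csrow whose corrected count keeps climbing is the likely candidate for
removal; mapping csrows to physical slots depends on the board.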