[CentOS] ECC RAM Error
centos at unixplanet.biz
Thu Oct 11 14:57:53 UTC 2007
The interesting thing is that I only see these ECC errors when I am writing
data to this box,
and no errors show up when I am reading data from it. If it were
corrupted memory or a bad controller,
those errors should show up when I am reading as well.
Am I missing something here?
Peter Arremann wrote:
> On Thursday 11 October 2007, Centos wrote:
>> We only get the ECC errors when I am transferring data from other
>> storage to this one.
>> They only happen when it is writing data to it.
> ECC errors can happen anywhere. It can be that the data is corrupted while it
> is transmitted to the storage device. Or the data can degrade while stored.
> And of course, on the transmission from the storage you have another chance
> to screw it up.
> Problem is, in almost all cases, you won't see those errors until you read the
> data. The memory controller will then perform the ECC checksum and see that
> the data that was returned is bad. What happens then depends on what type of
> memory and memory controller you have.
> Simple (old) x86 setups will correct single bit errors and report double bit
> errors as uncorrectable. If you happen to have 3 bits that changed in the
> same data word, ECC can actually screw you up worse - it can see it as a
> single bit error and correct the wrong way. That way you get corrupt data and
> what looks like a corrected soft error.
> Newer, more complex x86 configs and most proprietary Unix boxes protect
> against that by using fancier ECC algorithms, memory RAID and things like
> that.
> Anyway - ECC errors to me mean that I need to trigger a failover and get off
> the box ASAP. There is no ECC algorithm and hardware setup out there that
> does the right thing every single time. If you don't have a failover, see if
> you can take the system down now, remove the offending DIMM/bank and run with
> the remaining RAM until you get replacements.
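
For illustration, here is a toy sketch of the correct-one, detect-two behaviour
described in the quoted message. It is only a sketch under assumptions: it uses
a tiny Hamming(8,4) code in Python rather than the wider code a real memory
controller uses, and the names encode, decode and flip are made up for this
example. In this small code every 3-bit error is miscorrected; a wider code
may catch some of them.

```python
# Toy SECDED (single error correct, double error detect) code.
# Assumption: extended Hamming(8,4); real ECC DIMMs use a wider code.
DATA_POSITIONS = [3, 5, 6, 7]   # positions 1, 2, 4 hold parity, 0 holds overall parity


def encode(nibble):
    """Encode 4 data bits into an 8-bit codeword (list of 0/1 values)."""
    code = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (nibble >> i) & 1
    for p in (1, 2, 4):
        parity = 0
        for pos in range(1, 8):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity
    code[0] = sum(code[1:]) % 2  # overall parity makes the total even
    return code


def flip(code, positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for pos in positions:
        out[pos] ^= 1
    return out


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    odd = sum(code) % 2
    if syndrome == 0 and not odd:
        status = "ok"
    elif odd:
        # Odd number of flipped bits: the decoder assumes exactly one.
        code[syndrome] ^= 1  # syndrome 0 means the overall parity bit itself
        status = "corrected"
    else:
        # Even number of flips with a non-zero syndrome: detected, not fixable.
        status = "uncorrectable"
    nibble = sum(code[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    for bad_bits in ([], [5], [3, 5], [3, 5, 7]):
        print(bad_bits, decode(flip(word, bad_bits)))
```

The last case is the one to watch: three flipped bits come back with status
"corrected" while the returned data no longer matches what was stored.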
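
To work out which DIMM or bank to pull, the kernel's EDAC counters are one
place to look. What follows is a minimal sketch, assuming an EDAC driver for
the chipset is loaded and exposes the usual sysfs layout under
/sys/devices/system/edac/mc (mcN/ce_count and mcN/ue_count, plus per-row
csrowN directories). The helper names read_int and edac_counts are
hypothetical, and on systems without the csrowN directories only the
per-controller totals will show up.

```python
from pathlib import Path


def read_int(path):
    """Read an integer counter from a sysfs file, or None if unavailable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def edac_counts(root="/sys/devices/system/edac/mc"):
    """Map each controller and chip-select row to (corrected, uncorrected)."""
    results = {}
    base = Path(root)
    if not base.is_dir():
        # No EDAC driver loaded, or a different layout than assumed here.
        return results
    for mc in sorted(base.glob("mc[0-9]*")):
        results[mc.name] = (read_int(mc / "ce_count"), read_int(mc / "ue_count"))
        for row in sorted(mc.glob("csrow[0-9]*")):
            key = f"{mc.name}/{row.name}"
            results[key] = (read_int(row / "ce_count"), read_int(row / "ue_count"))
    return results


if __name__ == "__main__":
    counts = edac_counts()
    if not counts:
        print("no EDAC counters found")
    for name, (ce, ue) in counts.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

A row whose corrected count keeps climbing during the transfers is the first
candidate for removal; mapping a csrow to a physical slot depends on the
board, so check the motherboard manual.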