On Thursday 11 October 2007, Centos wrote:
> The ECC errors only happens when I am transferring data from other
> storage to this one that we get error.
> it only happens when it is writing data to it.

ECC errors can happen anywhere. The data can be corrupted while it is being transmitted to the storage device, or it can degrade while it is stored. And of course, the transmission back from storage gives you another chance to screw it up.

The problem is that, in almost all cases, you won't see those errors until you read the data. The memory controller then performs the ECC check and sees that the data that was returned is bad.

What happens then depends on what type of memory and memory controller you have. Simple (old) x86 setups will correct single-bit errors and report double-bit errors as uncorrectable. If you happen to have 3 bits changed in the same data word, ECC can actually screw you up worse - it may see it as a single-bit error and "correct" a bit that was never wrong. That way you get corrupt data, and all that gets reported is a corrected (soft) error.

Newer, more complex x86 configs and most proprietary Unix boxes protect against that by using fancier ECC algorithms, memory RAID and things like that.

Anyway - ECC errors to me mean that I need to trigger a failover and get off the box ASAP. There is no ECC algorithm and hardware setup out there that does the right thing every single time.

If you don't have a failover, see if you can take the system down now, remove the offending DIMM/bank and run with the remaining RAM until you get replacements.

Peter.
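
To make the 1-bit / 2-bit / 3-bit behaviour described above concrete, here is a minimal sketch of a toy single-error-correct, double-error-detect code. It is an assumption for illustration only: an (8,4) extended Hamming code over a 4-bit value, far smaller than anything a real memory controller uses, and the names `encode`, `decode` and `classify` are made up for this example. It flips every possible combination of 1, 2 and 3 bits and tallies what the decoder reports.

```python
from itertools import combinations

# Toy layout (assumed for illustration): position 0 is the overall parity
# bit, positions 1..7 form a classic Hamming(7,4) word.
DATA_POS = (3, 5, 6, 7)   # where the 4 data bits live
PARITY_POS = (1, 2, 4)    # Hamming parity bits sit at powers of two

def encode(nibble):
    """Encode a 4-bit value into an 8-bit codeword (list of 0/1)."""
    word = [0] * 8
    for i, pos in enumerate(DATA_POS):
        word[pos] = (nibble >> i) & 1
    for p in PARITY_POS:
        # Parity bit p covers every position whose index has bit p set.
        word[p] = sum(word[k] for k in range(1, 8) if k & p and k != p) % 2
    word[0] = sum(word[1:]) % 2   # makes the total number of 1s even
    return word

def decode(word):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for k in range(1, 8):
        if word[k]:
            syndrome ^= k          # XOR of the indices of all set bits
    overall_odd = sum(word) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Looks like a single-bit error: flip the bit the syndrome names
        # (position 0, the overall parity bit, when the syndrome is 0).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome but even overall parity: double-bit error.
        status = "uncorrectable"
    nibble = sum(word[pos] << i for i, pos in enumerate(DATA_POS))
    return status, nibble

def classify(n_flips, nibble=0b1011):
    """Flip every combination of n_flips bits and tally decoder outcomes.

    Keys are (status, data_still_correct)."""
    good = encode(nibble)
    tally = {}
    for positions in combinations(range(8), n_flips):
        bad = list(good)
        for p in positions:
            bad[p] ^= 1
        status, value = decode(bad)
        key = (status, value == nibble)
        tally[key] = tally.get(key, 0) + 1
    return tally

for n in (1, 2, 3):
    print(n, "flipped bit(s):", classify(n))
```

In this toy code every single-bit flip comes back as "corrected" with the right data, every double-bit flip is flagged "uncorrectable", and every one of the 56 triple-bit flips comes back as "corrected" with the wrong data. Wider codes used in real hardware may catch some of those triple-bit patterns, so treat the exact counts as a property of this sketch, not of any particular system.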