[CentOS] ECC RAM Error

Centos centos at unixplanet.biz
Mon Oct 15 12:16:10 UTC 2007


Thanks, everyone, for the help and responses.

I just noticed that these errors might be soft errors, because they 
only happen when I overload the
storage by copying large files simultaneously on the same port and 
SCSI controller, so I was thinking
the cause could be the speed of the ECC parity calculation, or a RAM shortage.

The hardware is supposed to take care of ECC errors, and the device should
panic or hang when it sees these errors, but it just keeps going.
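
To make the "corrected" versus "only detected" distinction concrete, here is
a toy sketch in Python of a SECDED code (single error correct, double error
detect) over 4 data bits. It is only an illustration and says nothing about
how this particular controller implements ECC; real ECC memory typically
protects 64 data bits with 8 check bits. The names encode, decode and flip
are made up for the example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED codeword (list of 0/1)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    # Positions 1..7 form a Hamming(7,4) code; position 0 is overall parity.
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the total number of 1 bits even
    return code


def decode(code):
    """Return (status, data); data is None when the error is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid word
    parity_bad = sum(code) % 2 == 1
    if syndrome == 0 and not parity_bad:
        status = "ok"
    elif parity_bad:
        # Odd number of flips: assume one, and the syndrome points at it
        # (syndrome 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome but even parity: two bits flipped, cannot repair.
        return "uncorrectable", None
    data = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, data


def flip(code, *positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for p in positions:
        out[p] ^= 1
    return out


word = encode(0b1011)
print(decode(word))              # clean word
print(decode(flip(word, 5)))     # one flipped bit: repaired silently
print(decode(flip(word, 2, 6)))  # two flipped bits: detected, not repaired
```

The "Two or more Bit Error" message in the log corresponds to the last case.
Whether the device then halts or logs the error and carries on is a firmware
policy decision that the sketch does not model.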

What do you think?


John R Pierce wrote:
> Peter Arremann wrote:
>> On Thursday 11 October 2007, John R Pierce wrote:
>>  
>>> Peter Arremann wrote:
>>>    
>>>> On Thursday 11 October 2007, Centos wrote:
>>>>      
>>>>> The ECC errors only happens when I am transferring data from other
>>>>> storage to this one that we get error.
>>>>> it only happens when it is writing data to it.
>>>>>         
>>> What do you mean by "transferring data from other storage to this one"
>>> ?    These are main memory (RAM) ECC errors and have nothing to do with
>>> disk storage, networking, or anything else.
>>>
>>> and, 'writing data to it', it's not clear what the 'it' is referring to.
>>>     
>> Storage - anything that can store data. ECC is a generic term and 
>> covers all kinds of checksumming algorithms. I was talking pretty 
>> generic - don't care if ram, caches, nand-flash, ficon or anything else.
>> I was trying to get across that it's hard to pinpoint where your bits 
>> flipped - in the storage device, on the transmission there or back.   
> if they flipped in a disk storage device, you would have gotten a disk 
> storage related error, typically reported as a "CRC" error even tho 
> modern disks haven't actually used CRC since the 80s. 
> systems don't tend to transmit and store the ECC across different 
> device domains.    
> if they had been read off the disk in a 'flipped' state, then they 
> would have been written to RAM just as they were read, and the RAM 
> would have created its own ECC on the 'wrong' data quite happily.
>
> anyways, the error in question...
>
> > : EXCEPTION: ECC Error Interrupt (Two or more Bit Error)
> > 0C18:00020001 0C68:00000000 Lcause:74630001 Lerr:1C855F82
>
> is almost certainly a RAM ECC error, so I was asking the original 
> poster just what he meant by 'transferring data from other storage to 
> this one' as it sounded like he was thinking in terms of overall 
> server operation rather than the specific component level.
>
>
> if he was talking about copying files from one disk to another, then 
> that data has to be read from the source disk and written into ram, 
> then read back from ram and written to the other device.  in fact, it's 
> possibly been copied a few times in ram in the process too, so talking 
> about transferring between storage really isn't very helpful here.



