Hello
Does anyone have experience with ECC RAM errors? We are seeing ECC fault errors, but I am not sure whether they are caused by the RAM itself or by a bad connection and noise. Please let me know if you have a good document about ECC errors. In particular, I want to know whether data is retransmitted when an error happens.
02:00:31, Thursday, 10/11/2007 : EXCEPTION: ECC Error Interrupt (Two or more Bit Error) 0C18:00020001 0C68:00000000 Lcause:74630001 Lerr:1C855F82
Thanks
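For anyone who wants to tally these exceptions (for example, to check whether they cluster around heavy write load), here is a minimal Python sketch that splits the log line above into fields. The function name parse_ecc_line and the field names are made up for illustration. It assumes the date is month/day/year, which matches Thursday, October 11, 2007, and it does not try to interpret the register values, since their meaning is not documented in this thread.

```python
import re
from datetime import datetime

# The log line exactly as posted in the thread.
SAMPLE = ("02:00:31, Thursday, 10/11/2007 : EXCEPTION: ECC Error Interrupt "
          "(Two or more Bit Error) 0C18:00020001 0C68:00000000 "
          "Lcause:74630001 Lerr:1C855F82")

# Assumed layout: time, weekday, date, then "EXCEPTION:", a message that ends
# with a closing parenthesis, then name:hex register pairs.
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}), \w+, (?P<date>\d{2}/\d{2}/\d{4}) : "
    r"EXCEPTION: (?P<message>.*?\))\s+(?P<regs>.*)$"
)
REG_RE = re.compile(r"(\w+):([0-9A-Fa-f]{8})")


def parse_ecc_line(line):
    """Return a dict with timestamp, message and raw register values, or None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    # The weekday is skipped on purpose so parsing does not depend on locale.
    when = datetime.strptime(
        match.group("time") + " " + match.group("date"), "%H:%M:%S %m/%d/%Y"
    )
    registers = {name: int(value, 16)
                 for name, value in REG_RE.findall(match.group("regs"))}
    return {"when": when, "message": match.group("message"),
            "registers": registers}


if __name__ == "__main__":
    parsed = parse_ecc_line(SAMPLE)
    print(parsed["when"].isoformat(), parsed["message"])
    for name, value in parsed["registers"].items():
        print(f"  {name} = 0x{value:08X}")
```

Feeding every matching line of the device log through this function and grouping by hour would show whether the errors really only appear during large copies.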
On Thu, Oct 11, 2007 at 09:57:12AM -0300, Centos wrote:
Does anyone have experience with ECC RAM errors? We are seeing ECC fault errors, but I am not sure whether they are caused by the RAM itself or by a bad connection and noise. Please let me know if you have a good document about ECC errors. In particular, I want to know whether data is retransmitted when an error happens.
02:00:31, Thursday, 10/11/2007 : EXCEPTION: ECC Error Interrupt (Two or more Bit Error) 0C18:00020001 0C68:00000000 Lcause:74630001 Lerr:1C855F82
Change the memory; see if the errors persist.
I was wondering whether it is safe to keep using the device until we receive the RAM. That device is our main storage.
Is data retransmitted when ECC errors happen? I don't want data corruption.
On Thu, 11 Oct 2007, Centos wrote:
I was wondering whether it is safe to keep using the device until we receive the RAM. That device is our main storage.
Is data retransmitted when ECC errors happen? I don't want data corruption.
You are not talking about data transmission, but storage.
If two or more bit errors occur then ECC is not able to correct them and you are likely to get data corruption.
Regards Lance
CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos
The ECC errors only happen when I am transferring data from other storage to this one. They only happen when data is being written to it.
On Thu, 11 Oct 2007, Centos wrote:
The ECC errors only happen when I am transferring data from other storage to this one. They only happen when data is being written to it.
Well that is when it is detected ...
As I said ECC RAM errors are concerned with an error in storage - not an error in transmission.
Regards Lance
Do you think replacing the RAM will solve our problem? How can I make sure it is the RAM?
On Thu, 11 Oct 2007, Centos wrote:
Do you think replacing the RAM will solve our problem?
Assuming it is the RAM that has gone faulty and not some other issue, then it should.
How can I make sure it is the RAM?
memtest86?
Regards Lance
Thank you Lance,
We will change the memory to see whether that resolves the problem. That storage box only has a basic Linux kernel, which unfortunately does not carry memtest86.
On Thu, 11 Oct 2007, Centos wrote:
Thank you Lance,
We will change the memory to see whether that resolves the problem. That storage box only has a basic Linux kernel, which unfortunately does not carry memtest86.
memtest86 is usually a package that you boot into ...
Regards Lance
Do you think replacing the RAM will solve our problem? How can I make sure it is the RAM?
This is almost certainly a hardware problem. It could be the RAM, a particular motherboard DIMM slot, or maybe the RAM is just not seated quite right in the memory slot. I have seen all three of these problems.
Try running the standalone memory tester. First run:
    # yum install memtest86+
This will add a boot option of booting into memtest86+ instead of into CentOS. See if you can reproduce the error with memtest86+. That may save some time.
When you have reproduced the error, try just reseating all the DIMMs. Pop them out and push them back in firmly. Try blowing out any dust that may be in the memory slots.
Assuming it still fails, pull out the DIMMs one at a time (unless you need to do it in pairs), and keep running your test until it doesn't fail. When it stops failing, try the suspicious DIMM all by itself in a different slot and see if it fails.
By doing this kind of divide and conquer, you will be able to determine whether it is the DIMM or the motherboard.
Dan
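The divide-and-conquer steps above can be written down as decision logic. The following Python sketch is only an illustration: isolate_fault, the slot and DIMM names, and the simulated test bench are all hypothetical, and a real run means physically moving DIMMs and rebooting into the memory tester each time. It assumes a single faulty DIMM or slot and DIMMs that can be removed one at a time, so the "in pairs" case is not handled.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical representation: slot name -> name of the DIMM installed there.
Layout = Dict[str, str]


def isolate_fault(layout: Layout,
                  test_passes: Callable[[Layout], bool]
                  ) -> Tuple[str, Optional[str]]:
    """Return ("none" | "dimm" | "slot" | "unknown", name of the suspect)."""
    if test_passes(layout):
        return ("none", None)
    for slot, dimm in layout.items():
        # Pull one DIMM and rerun the test with the rest.
        remaining = {s: d for s, d in layout.items() if s != slot}
        if remaining and test_passes(remaining):
            # The failure went away, so this DIMM or its slot is suspect.
            # Retest the suspect DIMM alone in a different slot.
            other_slot = next(s for s in layout if s != slot)
            if test_passes({other_slot: dimm}):
                return ("slot", slot)   # DIMM is fine elsewhere: blame the slot
            return ("dimm", dimm)       # DIMM fails elsewhere too: blame the DIMM
    return ("unknown", None)


def make_fake_bench(bad_dimms=(), bad_slots=()):
    """Simulated memory test: fails if any bad DIMM or bad slot is in use."""
    def test_passes(layout: Layout) -> bool:
        return not any(slot in bad_slots or dimm in bad_dimms
                       for slot, dimm in layout.items())
    return test_passes


if __name__ == "__main__":
    layout = {"A1": "dimm0", "A2": "dimm1", "B1": "dimm2"}
    print(isolate_fault(layout, make_fake_bench(bad_dimms={"dimm1"})))
    print(isolate_fault(layout, make_fake_bench(bad_slots={"A2"})))
```

A result of "unknown" means that no single removal made the test pass, which points at more than one bad part or at something other than the DIMMs and slots.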
On Thursday 11 October 2007, Centos wrote:
The ECC errors only happen when I am transferring data from other storage to this one. They only happen when data is being written to it.
ECC errors can happen anywhere. It can be that the data is corrupted while it is transmitted to the storage device. Or the data can degrade while stored. And of course, on the transmission from the storage you have another chance to screw it up.
Problem is, in almost all cases, you won't see those errors until you read the data. The memory controller will then perform the ECC checksum and see that the data that was returned is bad. What happens then depends on what type of memory and memory controller you have.
Simple (old) x86 setups will correct single bit errors and report double bit errors as uncorrectable. If you happen to have 3 bits that changed in the same dataword, ECC will actually screw you up worse - it will see it as a single bit error and correct the wrong way. That way you get corrupt data and a soft error.
Newer, more complex x86 configs and most proprietary unix boxes protect against that by using fancier ECC algorithms, memory raid and things like that.
Anyway - ECC errors to me mean that I need to trigger a failover and get off the box asap. There is no ECC algorithm and hardware setup out there that does the right thing every single time. If you don't have a failover, see if you can take the system down now, remove the offending dimm/bank and run with the remaining ram until you get replacements.
Peter.
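The single-, double- and triple-bit behavior described above can be reproduced with a scaled-down code. The Python sketch below uses an extended Hamming (8,4) code, which is single-error-correcting and double-error-detecting like the simple setups mentioned in the message. It is an illustration under that assumption, not the code any particular memory controller uses; real ECC DIMMs typically protect 64 data bits with 8 check bits. The names encode, decode and flip are chosen for this example.

```python
from typing import List, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    bits = [0] * 8  # index 0: overall parity; indexes 1..7: Hamming positions
    bits[3], bits[5], bits[6], bits[7] = [(nibble >> i) & 1 for i in range(4)]
    bits[1] = bits[3] ^ bits[5] ^ bits[7]   # covers positions 1, 3, 5, 7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]   # covers positions 2, 3, 6, 7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]   # covers positions 4, 5, 6, 7
    bits[0] = sum(bits[1:]) % 2             # makes the total parity even
    return bits


def decode(bits: List[int]) -> Tuple[int, str]:
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos
    odd_parity = sum(bits) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # An odd number of flips is treated as exactly one flip.
        # Syndrome 0 here means the overall parity bit itself flipped.
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: at least two bits flipped.
        status = "uncorrectable"
    data = bits[3] | bits[5] << 1 | bits[6] << 2 | bits[7] << 3
    return data, status


def flip(bits: List[int], *positions: int) -> List[int]:
    """Return a copy of the codeword with the given positions inverted."""
    out = list(bits)
    for pos in positions:
        out[pos] ^= 1
    return out


if __name__ == "__main__":
    word = encode(0b1011)
    print("no error:   ", decode(word))
    print("1 bit flip: ", decode(flip(word, 6)))
    print("2 bit flips:", decode(flip(word, 3, 5)))
    print("3 bit flips:", decode(flip(word, 1, 2, 4)))
```

In the three-flip case the decoder reports a successful correction while returning data that differs from what was stored. That is the silent miscorrection described above, and it is why a "Two or more Bit Error" report should be taken seriously.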
The interesting thing is that I only see these ECC errors when I am writing data to this box; no error shows up when I am reading data from it. If it were corrupted memory or a bad controller, those errors should show up when I am reading as well.
Am I missing something here?
Peter Arremann wrote:
On Thursday 11 October 2007, Centos wrote:
The ECC errors only happen when I am transferring data from other storage to this one. They only happen when data is being written to it.
What do you mean by "transferring data from other storage to this one"? These are main memory (RAM) ECC errors and have nothing to do with disk storage, networking, or anything else.
And 'writing data to it': it's not clear what the 'it' refers to.
ECC errors can happen anywhere. It can be that the data is corrupted while it is transmitted to the storage device. Or the data can degrade while stored. And of course, on the transmission from the storage you have another chance to screw it up.
ECC errors in RAM will be detected when data is read back from RAM and the associated ECC is incorrect. the ECC codes are generated when the RAM is written to.
So I dunno what you're talking about with 'transmitted to storage devices', etc. Disk drives have their OWN ECC; this is quite different and separate and has no relationship to the ECC in main memory.
On Thursday 11 October 2007, John R Pierce wrote:
Peter Arremann wrote:
On Thursday 11 October 2007, Centos wrote:
The ECC errors only happen when I am transferring data from other storage to this one. They only happen when data is being written to it.
What do you mean by "transferring data from other storage to this one"? These are main memory (RAM) ECC errors and have nothing to do with disk storage, networking, or anything else.
And 'writing data to it': it's not clear what the 'it' refers to.
Storage - anything that can store data. ECC is a generic term and covers all kinds of checksumming algorithms. I was talking pretty generic - don't care if ram, caches, nand-flash, ficon or anything else.
I was trying to get across that it's hard to pinpoint where your bits flipped - in the storage device, on the transmission there or back.
So I dunno what you're talking about with 'transmitted to storage devices', etc. Disk drives have their OWN ECC; this is quite different and separate and has no relationship to the ECC in main memory.
Right - and again I was using generic terms because at the time that I posted, the question was if the RAM is bad or any other component... :)
Peter.
Peter Arremann wrote:
Storage - anything that can store data. ECC is a generic term and covers all kinds of checksumming algorithms. I was talking pretty generic - don't care if ram, caches, nand-flash, ficon or anything else.
I was trying to get across that it's hard to pinpoint where your bits flipped - in the storage device, on the transmission there or back.
if they flipped in a disk storage device, you would have gotten a disk storage related error, typically reported as a "CRC" error even tho modern disks haven't actually used CRC since the 80s.
systems don't tend to transmit and store the ECC across different device domains.
if they had been read off the disk in a 'flipped' state, then they would have been written to RAM just as they were read, and the RAM would have created its own ECC on the 'wrong' data quite happily.
anyways, the error in question...
: EXCEPTION: ECC Error Interrupt (Two or more Bit Error) 0C18:00020001 0C68:00000000 Lcause:74630001 Lerr:1C855F82
is almost certainly a RAM ECC error, so I was asking the original poster just what he meant by 'transferring data from other storage to this one' as it sounded like he was thinking in terms of overall server operation rather than the specific component level.
if he was talking about copying files from one disk to another, then that data has to be read from the source disk and written into RAM, then read back from RAM and written to the other device. in fact, it's possibly been copied a few times in RAM in the process too, so talking about transferring between storage really isn't very helpful here.
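The point that check bits are generated on write and verified on read, so data that was already wrong before it reached RAM gets a perfectly valid code, can be shown with a toy model. This Python sketch uses a single even-parity bit per word as a stand-in for real ECC, which uses several check bits and can also correct. The class ToyEccMemory and its methods are invented for this example.

```python
class EccError(Exception):
    """Raised when stored data no longer matches its check bit."""


class ToyEccMemory:
    """Word store with one even-parity check bit per word (stand-in for ECC)."""

    def __init__(self):
        self._cells = {}  # address -> (data, check_bit)

    @staticmethod
    def _parity(value: int) -> int:
        return bin(value).count("1") % 2

    def write(self, addr: int, data: int) -> None:
        # The check bit is computed from whatever data arrives, right or wrong.
        self._cells[addr] = (data, self._parity(data))

    def read(self, addr: int) -> int:
        data, check = self._cells[addr]
        if self._parity(data) != check:
            raise EccError(f"check failed at address {addr}")
        return data

    def flip_stored_bit(self, addr: int, bit: int) -> None:
        """Simulate a fault inside the memory itself, after the write."""
        data, check = self._cells[addr]
        self._cells[addr] = (data ^ (1 << bit), check)


if __name__ == "__main__":
    mem = ToyEccMemory()
    original = 0b10110010

    # Case 1: the data was corrupted upstream, before it was written to RAM.
    corrupted_upstream = original ^ 0b100
    mem.write(0, corrupted_upstream)
    print("upstream corruption read back without error:",
          mem.read(0) == corrupted_upstream)

    # Case 2: the data was fine on write, but a bit flipped while stored.
    mem.write(1, original)
    mem.flip_stored_bit(1, 5)
    try:
        mem.read(1)
    except EccError as err:
        print("fault in RAM detected on read:", err)
```

Only the second case raises an error, which is why a RAM ECC exception points at the memory path itself and not at the disk or network the data came from.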
Thanks every one for help and response.
I just noticed that these errors might be soft errors, because they only happen when I overload the storage by copying large files simultaneously on the same port and SCSI controller. So I was thinking the cause might be the speed of the ECC parity calculation, or a RAM shortage.
The hardware is supposed to take care of ECC errors, and the device should panic or hang when it sees these errors, but it just keeps going.
What do you think?
on 10/15/2007 5:16 AM Centos spake the following:
Thanks every one for help and response.
I just noticed that these errors might be soft errors, because they only happen when I overload the storage by copying large files simultaneously on the same port and SCSI controller. So I was thinking the cause might be the speed of the ECC parity calculation, or a RAM shortage.
The hardware is supposed to take care of ECC errors, and the device should panic or hang when it sees these errors, but it just keeps going.
What do you think?
I have had systems so overloaded that I couldn't log in on an ssh session, but when the load cleared, there weren't any ECC errors. I still think you have a hardware problem, and just because it takes a high load now doesn't mean that it is OK. A faulty timing capacitor on the motherboard can cause all sorts of corruption in memory, and it will probably deteriorate over time. You need to methodically test the memory by running memory tests, and then moving ram and testing again. Or replace the hardware if it is mission critical.