Hi List,
I've been getting the following EDAC memory errors:
EDAC MC0: CE page 0xeb0dd, offset 0x0, grain 4096, syndrome 0x45, row 3, channel 0, label "": i82875p CE
From this I can see that the errors have been corrected. Checking
cat /sys/devices/system/edac/mc/mc0/csrow3/ch0_ce_count
gives me a count of 4, so I now know that csrow3, ch0 is the problem.
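For anyone who wants to track these messages over time, here is a minimal Python sketch that pulls the fields out of a CE line in the format quoted above. It assumes the `page` field is a page frame number and that pages are 4 KiB (as on x86), so the physical address is page * 4096 + offset. The names `EDAC_CE`, `PAGE_SIZE` and `parse_ce` are made up for this example.

```python
import re

# Pattern for the EDAC "CE" (corrected error) log format quoted in this thread.
EDAC_CE = re.compile(
    r"EDAC MC(?P<mc>\d+): CE page 0x(?P<page>[0-9a-fA-F]+), "
    r"offset 0x(?P<offset>[0-9a-fA-F]+), grain (?P<grain>\d+), "
    r"syndrome 0x(?P<syndrome>[0-9a-fA-F]+), row (?P<row>\d+), "
    r"channel (?P<channel>\d+)"
)

PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on x86


def parse_ce(line):
    """Return the fields of one corrected-error line, or None if no match."""
    m = EDAC_CE.search(line)
    if m is None:
        return None
    page = int(m.group("page"), 16)
    offset = int(m.group("offset"), 16)
    return {
        "mc": int(m.group("mc")),
        "row": int(m.group("row")),
        "channel": int(m.group("channel")),
        "grain": int(m.group("grain")),
        "syndrome": int(m.group("syndrome"), 16),
        # Assumed mapping: page frame number times page size, plus offset.
        "phys_addr": page * PAGE_SIZE + offset,
    }


if __name__ == "__main__":
    sample = ('EDAC MC0: CE page 0xeb0dd, offset 0x0, grain 4096, '
              'syndrome 0x45, row 3, channel 0, label "": i82875p CE')
    info = parse_ce(sample)
    print(info)
    print(hex(info["phys_addr"]))
```

Feeding it lines from the kernel log before and after a DIMM swap makes it easy to see whether the row/channel pair or the address changes.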
My question is: how does this map to the on-board labels DIMM 1A, DIMM 1B, DIMM 2A and DIMM 2B?
Am I correct in assuming csrow 3 is DIMM 2B?
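As a rough aid for the mapping question, the sketch below walks the same sysfs tree as the `cat` command above and prints every per-channel CE counter, together with the `chN_dimm_label` attribute where one exists. Note that a csrow is a chip-select row, so it does not have to correspond one-to-one to a slot label, and the empty `label ""` in the log line suggests no labels have been set on this box. The function names and the `root` parameter are assumptions for illustration only.

```python
import glob
import os

EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_text(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def ce_counts(root=EDAC_ROOT):
    """Collect (mc, csrow, channel, ce_count, label) for every channel found."""
    rows = []
    pattern = os.path.join(root, "mc*", "csrow*", "ch*_ce_count")
    for path in sorted(glob.glob(pattern)):
        csrow_dir = os.path.dirname(path)
        channel = os.path.basename(path)[: -len("_ce_count")]  # e.g. "ch0"
        count = read_text(path)
        label = read_text(os.path.join(csrow_dir, channel + "_dimm_label"))
        rows.append((
            os.path.basename(os.path.dirname(csrow_dir)),  # e.g. "mc0"
            os.path.basename(csrow_dir),                   # e.g. "csrow3"
            channel,
            int(count) if count and count.isdigit() else None,
            label or "",
        ))
    return rows


if __name__ == "__main__":
    for mc, csrow, channel, count, label in ce_counts():
        print(f"{mc}/{csrow}/{channel}: ce_count={count} label={label!r}")
```

Running it after each swap shows at a glance which csrow/channel counters are still climbing.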
Also, I have just discovered that both OS drives, sda and sdb, show a huge number of errors in their SMART records. Can this be related to the memory errors? I am really surprised to see two drives show an almost identical number of errors at the same time, yet no apparent data errors. The drives are ATA ST380013AS, 74.53 GB.
TIA for your insightful comments
Rob Kampen wrote:
Swapped the memory between DIMM 2A and DIMM 2B and still get the fault in row 3, channel 0, so it did not move with the RAM?? Next reboot I'll try swapping 1A and 1B.
Just for safety I swapped /dev/sda for a new, slightly larger drive, did the sfdisk foo, and added it to the md raid sets. This brand-new drive immediately shows a high raw read error rate and hardware ECC recovered counts in the tens of millions. I think this is not a drive issue but related to the ECC memory errors?? Anyone with experience?
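When comparing drives like this, keep in mind that SMART raw values are vendor-specific and are not necessarily plain error counts; the normalized VALUE against THRESH is usually the more telling comparison. The sketch below parses the attribute table printed by `smartctl -A` and lists attributes at or below their threshold. The `SAMPLE` text is hypothetical data in that table layout, not output from the drives in this thread, and the function names are invented for the example.

```python
# Hypothetical sample in the column layout of `smartctl -A` (not real data
# from the drives discussed in this thread).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   058   049   006    Pre-fail  Always       -       123456789
  5 Reallocated_Sector_Ct   0x0033   030   030   036    Pre-fail  Always   FAILING_NOW 2900
195 Hardware_ECC_Recovered  0x001a   058   049   000    Old_age   Always       -       123456789
"""


def parse_smart_attributes(text):
    """Parse attribute rows (lines starting with a numeric ID) into dicts."""
    attrs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # header, blank line, or something else
        try:
            attrs.append({
                "id": int(fields[0]),
                "name": fields[1],
                "value": int(fields[3]),
                "worst": int(fields[4]),
                "thresh": int(fields[5]),
                "raw": " ".join(fields[9:]),  # raw value may contain spaces
            })
        except ValueError:
            continue  # non-numeric VALUE/WORST/THRESH column
    return attrs


def below_threshold(attrs):
    """Return attributes whose normalized value is at or below a non-zero threshold."""
    return [a for a in attrs if a["thresh"] > 0 and a["value"] <= a["thresh"]]


if __name__ == "__main__":
    attrs = parse_smart_attributes(SAMPLE)
    for a in attrs:
        print(a["id"], a["name"], a["value"], a["thresh"], a["raw"])
    print("at or below threshold:", [a["name"] for a in below_threshold(attrs)])
```

To use it on a real drive, the captured text of `smartctl -A /dev/sda` can be passed to `parse_smart_attributes` in place of `SAMPLE`.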
TIA for your insightful comments
CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos
On 12/05/11 12:17 AM, Rob Kampen wrote:
Swapped the memory between DIMM 2A and DIMM 2B - still get fault in row 3, channel 0 - thus did not move with the RAM?? Next reboot I'll try swapping 1A and 1B
Often an indication that the problem is board/socket related rather than the memory DIMM, unless it's the other pair you didn't swap.
Also I have just discovered that both the OS drives sda and sdb have huge number of errors shown on the SMART records
- can this relate to the memory errors??
- I am just really surprised to have two drives show almost identical
number of errors at the same time, yet no apparent data errors - Drives are ATA ST380013AS 74.53 GB
Just for safety I swapped /dev/sda with a new slightly larger drive did the sfdisk foo and added it to the md raid drives. This brand new drive immediately shows high raw read er
Hmmm. No, they shouldn't be remotely related, unless it's something else, like a power supply with noisy or out-of-spec voltage(s).
80GB 3.5" SATA drives? Aren't those kind of old? Like, ancient? I looked up that PN; that's a Barracuda 7200.7 from circa 2003-2005. http://www.seagate.com/support/disc/manuals/sata/cuda7200_sata_pm.pdf
Those are past their shelf date.
Rob Kampen wrote:
Yes, Christmas 2004, and never a problem until one of the md raid sets dropped out today. However, I put a brand-new, never-used 120G drive in and it too shows these errors. Something doesn't seem right. Getting too tired to think straight, so I'll leave it limping along until tomorrow. Thanks for the thoughts.
As John suggests, I think it sounds like a tired PSU. Try swapping it if you have a spare.
On 12/05/2011 10:24 AM, Rob Kampen wrote:
However I put a brand new - never used 120G drive in and it too shows these errors - something doesn't seem right Getting too tired to think straight so I'll leave it limping along until tomorrow Thanks for thoughts
Download Hiren's BootCD and use the bundled memory test if your way is complicated.
On 12/05/2011 12:00 PM, John R Pierce wrote:
On 12/05/11 2:57 AM, Ljubomir Ljubojevic wrote:
Download Hiren's BootCD and use bundled Memory Test if your way is complicated.
that won't do much to detect soft ECC errors, will it?
They (there are 4-5 apps) check various patterns in memory (write, then read), and you can run them for a longer period of time.
As for ECC errors, I cannot say; I have never used ECC memory or become familiar with it.
On Mon, 5 Dec 2011, Ljubomir Ljubojevic wrote:
They (there are 4-5 apps) check various patterns in memory (write, then read), and you can run them for a longer period of time.
As for ECC errors, I cannot say; I have never used ECC memory or become familiar with it.
In my limited experience, if you can disable ECC in your BIOS, memtest is just as good at spotting errors on ECC as non-ECC. With ECC enabled, you'll need seriously messed up ECC before it'll be detected.
jh
On 12/07/11 12:55 AM, John Hodrien wrote:
In my limited experience, if you can disable ECC in your BIOS, memtest is just as good at spotting errors on ECC as non-ECC. With ECC enabled, you'll need seriously messed up ECC before it'll be detected.
Except with ECC disabled, the extra 8 ECC bits per 64-bit memory word aren't touched at all.
I'd leave ECC on and skip running memtest entirely; just run real OS workloads and let the ECC do the memory test on the fly, as it's meant to.
Does Linux have an ECC scrubber process? 'Real' Unix servers (Solaris, AIX, etc.) generally have a background process, sometimes part of the idle process, that does a read/write of every memory location when the machine is otherwise idle. This catches and fixes soft ECC errors in otherwise idle memory, which in turn get logged. Solaris (on Sun SPARC hardware at least) keeps track of which locations have had bad memory, and will stop using a memory page entirely (with a logged alert) if there are too many soft ECC errors in the same area.
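On the scrubber question: the Linux EDAC sysfs interface documents a per-controller `sdram_scrub_rate` attribute for memory scrubbing, but only where the driver implements it. Whether the i82875p driver does is not established in this thread. A minimal sketch for checking what a given box reports follows; the function name and `root` parameter are made up for the example, and a missing, unreadable or negative value is treated here as "not available".

```python
import glob
import os


def scrub_rates(root="/sys/devices/system/edac/mc"):
    """Map each memory controller (e.g. "mc0") to its reported scrub rate.

    The value is None when the attribute cannot be read or parsed, or when
    it is negative (assumed here to mean scrubbing is not available).
    """
    rates = {}
    pattern = os.path.join(root, "mc*", "sdram_scrub_rate")
    for path in sorted(glob.glob(pattern)):
        mc = os.path.basename(os.path.dirname(path))
        try:
            with open(path) as f:
                rate = int(f.read().strip())
        except (OSError, ValueError):
            rate = None
        if rate is not None and rate < 0:
            rate = None
        rates[mc] = rate
    return rates


if __name__ == "__main__":
    found = scrub_rates()
    if not found:
        print("no sdram_scrub_rate attribute found")
    for mc, rate in found.items():
        print(f"{mc}: sdram_scrub_rate={rate}")
```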