Intel SE7210TP1-E giving memory errors


Rob Kampen

5 Dec 2011

7:13 a.m.

Hi List,

I've been getting the following EDAC memory errors:

EDAC MC0: CE page 0xeb0dd, offset 0x0, grain 4096, syndrome 0x45, row 3, channel 0, label "": i82875p CE

From this I can see that these errors have been corrected. Checking

cat /sys/devices/system/edac/mc/mc0/csrow3/ch0_ce_count

gives me a count of 4, so I now know that csrow3, ch0 is the problem.

My question is: how does this map to the on-board labels DIMM 1A, DIMM 1B, DIMM 2A, DIMM 2B?

Am I correct in assuming csrow 3 is DIMM 2B?

Also, I have just discovered that both OS drives, sda and sdb, show a huge number of errors in their SMART records. Can this relate to the memory errors? I am really surprised to see two drives show an almost identical number of errors at the same time, yet no apparent data errors. The drives are ATA ST380013AS, 74.53 GB.

TIA for your insightful comments
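For readers following along, the EDAC log line and the sysfs counter quoted in the post above can be picked apart with a short script. This is a minimal sketch, not part of the original message: the function names `parse_ce_line` and `ce_counts` are made up for illustration, and the 4 KiB page size is an assumption that holds on typical x86 systems. It does not answer the csrow-to-socket question, since that mapping depends on the board layout and on whether the DIMMs are single or dual rank.

```python
import re
from pathlib import Path

# Pattern for the corrected-error (CE) line format quoted in the post.
EDAC_CE = re.compile(
    r"EDAC MC(?P<mc>\d+): CE page 0x(?P<page>[0-9a-f]+), "
    r"offset 0x(?P<offset>[0-9a-f]+), grain (?P<grain>\d+), "
    r"syndrome 0x(?P<syndrome>[0-9a-f]+), row (?P<row>\d+), "
    r"channel (?P<channel>\d+)"
)

PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on typical x86 systems


def parse_ce_line(line):
    """Return the fields of an EDAC CE log line as a dict, or None."""
    m = EDAC_CE.search(line)
    if m is None:
        return None
    page = int(m["page"], 16)
    offset = int(m["offset"], 16)
    return {
        "mc": int(m["mc"]),
        "row": int(m["row"]),
        "channel": int(m["channel"]),
        "syndrome": int(m["syndrome"], 16),
        "grain": int(m["grain"]),
        "phys_addr": page * PAGE_SIZE + offset,
    }


def ce_counts(root="/sys/devices/system/edac/mc"):
    """Map 'mcN/csrowM/chK' to its corrected-error count (non-zero only)."""
    counts = {}
    for path in sorted(Path(root).glob("mc*/csrow*/ch*_ce_count")):
        value = int(path.read_text().strip())
        if value:
            key = "/".join(path.parts[-3:]).replace("_ce_count", "")
            counts[key] = value
    return counts


if __name__ == "__main__":
    line = ('EDAC MC0: CE page 0xeb0dd, offset 0x0, grain 4096, '
            'syndrome 0x45, row 3, channel 0, label "": i82875p CE')
    info = parse_ce_line(line)
    print(f"row {info['row']}, channel {info['channel']}, "
          f"address {info['phys_addr']:#x} "
          f"({info['phys_addr'] / 2**20:.0f} MiB)")
    print(ce_counts())  # empty dict where the EDAC sysfs tree is absent
```

Under the 4 KiB page assumption, page 0xeb0dd corresponds to physical address 0xeb0dd000, roughly 3761 MiB into the address space. Running `ce_counts()` after each DIMM swap gives a quick way to see whether the non-zero counter stays on csrow3/ch0 or moves.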


Rob Kampen

5 Dec 2011

8:17 a.m.

Rob Kampen wrote:

...

Hi List, I've been getting the following EDAC memory errors EDAC MC0: CE page 0xeb0dd, offset 0x0, grain 4096, syndrome 0x45, row 3, channel 0, label "": i82875p CE and from this seeing that these errors have been corrected. Checking cat /sys/devices/system/edac/mc/mc0/csrow3/ch0_ce_count gives me a count of 4 thus I now know that csrow3 - ch0 is the problem

My question is, how does this map to the on board labels DIMM 1A DIMM 1B DIMM 2A DIMM 2B

Am I correct in assuming csrow 3 is DIMM 2B?

I swapped the memory between DIMM 2A and DIMM 2B and still get the fault in row 3, channel 0, so it did not move with the RAM. Next reboot I'll try swapping 1A and 1B.

...

Also I have just discovered that both the OS drives sda and sdb have huge number of errors shown on the SMART records

can this relate to the memory errors??

I am just really surprised to have two drives show almost identical

number of errors at the same time, yet no apparent data errors - Drives are ATA ST380013AS 74.53 GB

Just for safety I swapped /dev/sda for a new, slightly larger drive, did the sfdisk foo, and added it to the md RAID sets. This brand-new drive immediately shows a high raw read error rate and hardware ECC recovered counts in the tens of millions. I think this is not a drive issue but something related to the ECC memory errors? Anyone with experience?
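One way to sanity-check SMART numbers like these is to compare each attribute's normalized VALUE against its THRESH rather than reading the raw column, because the smartmontools documentation describes raw values as vendor-specific. The sketch below is not from the thread: it parses the attribute table printed by `smartctl -A`, the function names are hypothetical, and the `SAMPLE` text is made-up illustrative data, not output from the drives discussed here. Whether large raw counts are normal for these particular Seagate drives is an assumption the reader would need to verify.

```python
# Illustrative sample only; NOT output from the drives in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   060   052   006    Pre-fail  Always       -       12345678
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
195 Hardware_ECC_Recovered  0x001a   060   052   000    Old_age   Always       -       12345678
"""


def parse_smart_attributes(text):
    """Parse the attribute table printed by `smartctl -A`.

    Returns {name: (value, worst, thresh, raw)} for every row that starts
    with a numeric attribute ID; header and other lines are skipped.
    """
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name = fields[1]
        value, worst, thresh = (int(f) for f in fields[3:6])
        raw = " ".join(fields[9:])
        attrs[name] = (value, worst, thresh, raw)
    return attrs


def failing(attrs):
    """Names of attributes whose normalized value is at or below threshold."""
    return [name for name, (value, _, thresh, _) in attrs.items()
            if thresh > 0 and value <= thresh]


if __name__ == "__main__":
    attrs = parse_smart_attributes(SAMPLE)
    for name, (value, worst, thresh, raw) in attrs.items():
        print(f"{name}: value={value} worst={worst} thresh={thresh} raw={raw}")
    print("failing:", failing(attrs))
```

In this made-up sample the raw counts are in the tens of millions while every normalized value sits well above its threshold, so `failing` returns an empty list. Feeding it the real `smartctl -A` output from the old and new drives would show whether the same holds on this machine.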

...

TIA for your insightful comments

CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos

John R Pierce

8:34 a.m.

On 12/05/11 12:17 AM, Rob Kampen wrote:

...

Swapped the memory between DIMM 2A and DIMM 2B - still get fault in row 3, channel 0 - thus did not move with the RAM?? Next reboot I'll try swapping 1A and 1B

Often an indication that the problem is board/socket related rather than the memory DIMM, unless it's the other pair you didn't swap.

...

Also I have just discovered that both the OS drives sda and sdb have huge number of errors shown on the SMART records

can this relate to the memory errors??

I am just really surprised to have two drives show almost identical

number of errors at the same time, yet no apparent data errors - Drives are ATA ST380013AS 74.53 GB

Just for safety I swapped /dev/sda with a new slightly larger drive did the sfdisk foo and added it to the md raid drives. This brand new drive immediately shows high raw read er

Hmmm, no, they shouldn't be remotely related, unless it's something else, like a power supply with noisy or out-of-spec voltage(s).

80GB 3.5" SATA drives? Aren't those kind of old? Like, ancient? I looked up that PN; that's a Barracuda 7200.7 from circa 2003-2005. http://www.seagate.com/support/disc/manuals/sata/cuda7200_sata_pm.pdf

Those are past their shelf date.

-- john r pierce N 37, W 122 santa cruz ca mid-left coast

Rob Kampen

9:24 a.m.

John R Pierce wrote:

...

On 12/05/11 12:17 AM, Rob Kampen wrote:

...
Swapped the memory between DIMM 2A and DIMM 2B - still get fault in row 3, channel 0 - thus did not move with the RAM?? Next reboot I'll try swapping 1A and 1B

often an indication the problem is board/socket related rather than memory DIMM. unless its the other pair you didn't swap.

...
Also I have just discovered that both the OS drives sda and sdb have huge number of errors shown on the SMART records

can this relate to the memory errors??

I am just really surprised to have two drives show almost identical

number of errors at the same time, yet no apparent data errors - Drives are ATA ST380013AS 74.53 GB

Just for safety I swapped /dev/sda with a new slightly larger drive did the sfdisk foo and added it to the md raid drives. This brand new drive immediately shows high raw read er

hmmm. no, they shouldn't be remotely related. unless its something else, like a power supply with noisy or out of spec voltage(s).

80GB 3.5" SATA drives? aren't those kind of old? like, ancient ? looked up that PN, thats a Baracuda 7200.7 from circa 2003-2005. http://www.seagate.com/support/disc/manuals/sata/cuda7200_sata_pm.pdf

those are past their shelf date.

Yes, Christmas 2004. Never a problem until one of the md RAID sets dropped out today. However, I put a brand-new, never-used 120G drive in and it too shows these errors, so something doesn't seem right. Getting too tired to think straight, so I'll leave it limping along until tomorrow.

Thanks for the thoughts.

Nicolas Thierry-Mieg

10:51 a.m.

Rob Kampen wrote:

...

John R Pierce wrote:

...
On 12/05/11 12:17 AM, Rob Kampen wrote:

...
Swapped the memory between DIMM 2A and DIMM 2B - still get fault in row 3, channel 0 - thus did not move with the RAM?? Next reboot I'll try swapping 1A and 1B

often an indication the problem is board/socket related rather than memory DIMM. unless its the other pair you didn't swap.

...
Also I have just discovered that both the OS drives sda and sdb have huge number of errors shown on the SMART records

can this relate to the memory errors??

I am just really surprised to have two drives show almost identical

number of errors at the same time, yet no apparent data errors - Drives are ATA ST380013AS 74.53 GB

Just for safety I swapped /dev/sda with a new slightly larger drive did the sfdisk foo and added it to the md raid drives. This brand new drive immediately shows high raw read er

hmmm. no, they shouldn't be remotely related. unless its something else, like a power supply with noisy or out of spec voltage(s).

As John suggests, I think it sounds like a tired PSU. Try swapping it if you have a spare.

Ljubomir Ljubojevic

10:57 a.m.

On 12/05/2011 10:24 AM, Rob Kampen wrote:

...

However I put a brand new - never used 120G drive in and it too shows these errors - something doesn't seem right Getting too tired to think straight so I'll leave it limping along until tomorrow Thanks for thoughts

Download Hiren's BootCD and use the bundled memory tests if your way is too complicated.

-- Ljubomir Ljubojevic (Love is in the Air) PL Computers Serbia, Europe Google is the Mother, Google is the Father, and traceroute is your trusty Spiderman... StarOS, Mikrotik and CentOS/RHEL/Linux consultant

John R Pierce

11 a.m.

On 12/05/11 2:57 AM, Ljubomir Ljubojevic wrote:

...

Download Hiren's BootCD and use bundled Memory Test if your way is complicated.

that won't do much to detect soft ECC errors, will it?

-- john r pierce N 37, W 122 santa cruz ca mid-left coast

Ljubomir Ljubojevic

12:28 p.m.

On 12/05/2011 12:00 PM, John R Pierce wrote:

...

On 12/05/11 2:57 AM, Ljubomir Ljubojevic wrote:

...
Download Hiren's BootCD and use bundled Memory Test if your way is complicated.

that won't do much to detect soft ECC errors, will it?

They (there are 4-5 apps) check various patterns in memory (write, then read), and you can run them for a longer period of time.

As for ECC errors, I cannot say; I have never used ECC memory or become familiar with it.

John Hodrien

7 Dec 2011

8:55 a.m.

On Mon, 5 Dec 2011, Ljubomir Ljubojevic wrote:

...

On 12/05/2011 12:00 PM, John R Pierce wrote:

...
On 12/05/11 2:57 AM, Ljubomir Ljubojevic wrote:

...
Download Hiren's BootCD and use bundled Memory Test if your way is complicated.

that won't do much to detect soft ECC errors, will it?

They are (there are 4-5 apps) checking various patterns in memory (write then read), and you can run it for a longer period of time.

As for ECC errors I can not say, I never ever used ECC memory or got familiar with it.

In my limited experience, if you can disable ECC in your BIOS, memtest is just as good at spotting errors on ECC as non-ECC. With ECC enabled, you'll need seriously messed up ECC before it'll be detected.

John R Pierce

9:07 a.m.

On 12/07/11 12:55 AM, John Hodrien wrote:

...

In my limited experience, if you can disable ECC in your BIOS, memtest is just as good at spotting errors on ECC as non-ECC. With ECC enabled, you'll need seriously messed up ECC before it'll be detected.

Except that with ECC disabled, the extra 8 ECC bits per 64-bit memory word aren't touched at all.
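As an aside on where the figure of 8 ECC bits per 64-bit word comes from: it matches what a single-error-correcting, double-error-detecting (SECDED) Hamming code needs. The sketch below is only that arithmetic. The thread does not say which code the i82875p actually uses, so SECDED Hamming is an assumption here, and the function names are made up for illustration.

```python
def hamming_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1.

    r check bits must be able to name every single-bit error position
    (data_bits + r of them) plus the "no error" case.
    """
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits):
    """Hamming check bits plus one overall parity bit (double-error detection)."""
    return hamming_check_bits(data_bits) + 1


if __name__ == "__main__":
    for width in (8, 32, 64):
        print(width, "data bits ->", secded_check_bits(width), "check bits")
```

For 64 data bits the loop stops at r = 7 (since 128 >= 64 + 7 + 1), and the extra parity bit brings the total to 8, which is why ECC DIMMs are 72 bits wide.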

I'd leave ECC on and skip running memtest entirely; just run real OS workloads and let the ECC do the memory test on the fly, as it's meant to.

Does Linux have an ECC scrubber process? 'Real' Unix servers (Solaris, AIX, etc.) generally have a background process, sometimes part of the idle process, that does a read/write of every memory location when the machine is otherwise idle. This catches and fixes soft ECC errors in otherwise idle memory, which in turn get logged. Solaris (on Sun SPARC hardware at least) keeps track of which locations have had bad memory, and will stop using a memory page entirely (with a logged alert) if there are too many soft ECC errors in the same area.

-- john r pierce N 37, W 122 santa cruz ca mid-left coast
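On the scrubber question, the Linux EDAC sysfs interface documents an `sdram_scrub_rate` attribute per memory controller, which reports the scrubbing bandwidth in bytes per second where the hardware and driver support it. The sketch below simply reads that attribute. It is not from the thread, the function name `scrub_rates` is hypothetical, and whether the i82875p driver on this board implements scrubbing is not known; reads of an unsupported attribute can fail, which the code records as `None`.

```python
from pathlib import Path


def scrub_rates(root="/sys/devices/system/edac/mc"):
    """Return {controller: scrub rate in bytes/sec, or None if unreadable}."""
    rates = {}
    for path in sorted(Path(root).glob("mc*/sdram_scrub_rate")):
        try:
            rates[path.parent.name] = int(path.read_text().strip())
        except (OSError, ValueError):
            # The attribute exists but scrubbing is unsupported or unreadable.
            rates[path.parent.name] = None
    return rates


if __name__ == "__main__":
    rates = scrub_rates()
    if not rates:
        print("no sdram_scrub_rate attribute found")
    for mc, rate in rates.items():
        print(mc, "scrub rate:", rate)
```

An empty result means no controller exposes the attribute at all, in which case any background scrubbing would have to come from the BIOS or chipset rather than from the kernel.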


discuss@lists.centos.org
