data corruption on AMD AM2 systems with 4GB of RAM or more

Feizhou

29 May 2007

3:46 a.m.

Anybody know if the fix for the incorrect kernel behaviour that leads to data corruption will be available for RHEL4/5 kernels?

Dan Halbert

29 May

3:23 p.m.

New subject: data corruption on AMD AM2 systems with 4GB of RAM or more

Feizhou wrote:

...

Anybody know if the fix for the incorrect kernel behaviour that leads to data corruption will be available for RHEL4/5 kernels?

I assume you are talking about the "GART pages must be uncacheable" bug, which affects AMD processors.

The kernel bug is http://bugzilla.kernel.org/show_bug.cgi?id=7768, which has now been closed after a patch was submitted.

I searched for the commit id in the Red Hat bugzilla, and found it for RHEL5: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=238709. That bug says the fix has been backported and is in RHEL5 kernel 2.6.18-18.el5. But I don't see anything about backporting to RHEL4.

Dan
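The version check implied above can be sketched in a few lines. This is only an illustration: it assumes the backport first appears in 2.6.18-18.el5, as the Red Hat bug cited above reports, and it only understands el5-style release strings. The helper names are hypothetical.

```python
import re

# First RHEL5 kernel reported to carry the backport (taken from the
# Red Hat bug cited in the thread; treat it as an assumption).
FIXED_EL5 = (2, 6, 18, 18)

# Matches el5 release strings such as "2.6.18-8.el5" or "2.6.18-8.1.1.el5".
_EL5_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+)(?:\.[\d.]+)?\.el5")


def parse_el5_release(release):
    """Return (major, minor, patch, build) for an el5 kernel release, else None."""
    m = _EL5_RE.match(release)
    if m is None:
        return None
    return tuple(int(g) for g in m.groups())


def el5_has_gart_fix(release):
    """True or False for el5 kernels; None when the string is not an el5 release."""
    parsed = parse_el5_release(release)
    if parsed is None:
        return None
    return parsed >= FIXED_EL5
```

On a running system the input string would be what `uname -r` prints, which Python exposes as `platform.release()`. RHEL4-style releases such as 2.6.9-42.EL deliberately return None, since nothing in this message establishes a fixed build for them.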

Feizhou

30 May

9:25 a.m.

Dan Halbert wrote:

...

Feizhou wrote:

...
Anybody know if the fix for the incorrect kernel behaviour that leads to data corruption will be available for RHEL4/5 kernels?

I assume you are talking about the "GART pages must be uncacheable" bug, which affects AMD processors.

The kernel bug is http://bugzilla.kernel.org/show_bug.cgi?id=7768, which has now been closed after a patch was submitted.

I searched for the commit id in the Red Hat bugzilla, and found it for RHEL5: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=238709. That bug says the fix has been backported and is in RHEL5 kernel 2.6.18-18.el5. But I don't see anything about backporting to RHEL4.

I see. So forget RAM >= 4GB on CentOS 4 on AMD AM2. Hmmph.

Akemi Yagi

2:30 p.m.

On 5/30/07, Feizhou feizhou@graffiti.net wrote:

...

Dan Halbert wrote:

...
Feizhou wrote:

...

I searched for the commit id in the Red Hat bugzilla, and found it for RHEL5: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=238709. That bug says the fix has been backported and is in RHEL5 kernel 2.6.18-18.el5. But I don't see anything about backporting to RHEL4.

I see. So forget RAM >= 4GB on CentOS 4 on AMD AM2. Hmmph.

Well, not really. There seems to be a test kernel (2.6.9-42.EL) with this bug fix at:

http://people.redhat.com/coldwell/kernel/bugs/223238/

Therefore, upstream may be working on backporting for RHEL4 (hopefully). Note also that the official patched version for CentOS 5 will not be available for a while either. The 2.6.18-18.el5 kernel might be for RHEL 5.1.

Because the patch is available now, another option is to rebuild the kernel with it applied. This is certainly not for everyone, but if the fix is needed right now, it is the only option.

Akemi

Akemi Yagi

4 Jun

6:19 a.m.

On 5/30/07, Akemi Yagi amyagi@gmail.com wrote:

...

On 5/30/07, Feizhou feizhou@graffiti.net wrote:

...
Dan Halbert wrote:

...
Feizhou wrote:

...

I searched for the commit id in the Red Hat bugzilla, and found it for RHEL5: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=238709. That bug says the fix has been backported and is in RHEL5 kernel 2.6.18-18.el5. But I don't see anything about backporting to RHEL4.

I see. So forget RAM >= 4GB on CentOS 4 on AMD AM2. Hmmph.

Well, not really. There seems to be a test kernel (2.6.9-42.EL) with this bug fix at:

http://people.redhat.com/coldwell/kernel/bugs/223238/

Therefore, upstream may be working on backporting for RHEL4 (hopefully). Note also that the official patched version for CentOS 5 will not be available for a while either. The 2.6.18-18.el5 kernel might be for RHEL 5.1.

Because the patch is available now, another option is to rebuild the kernel with it applied. This is certainly not for everyone, but if the fix is needed right now, it is the only option.

Akemi

The patch file for this bug did not apply to the CentOS source as is. I have recreated it for CentOS 5.0 x86_64 (see below) and was able to rebuild kernels. If you ever decide to do the same, here is the modified patch, pci-gart.c:

--- 2.6-git.orig/arch/x86_64/kernel/pci-gart.c
+++ 2.6-git/arch/x86_64/kernel/pci-gart.c
@@ -523,6 +523,10 @@
 	gatt = (void *)__get_free_pages(GFP_KERNEL, get_order(gatt_size));
 	if (!gatt)
 		panic("Cannot allocate GATT table");
+	if (change_page_attr_addr((unsigned long)gatt, gatt_size >> PAGE_SHIFT, PAGE_KERNEL_NOCACHE))
+		panic("Could not set GART PTEs to uncacheable pages");
+	global_flush_tlb();
+
 	memset(gatt, 0, gatt_size);
 	agp_gatt_table = gatt;

================================

Then edit the kernel-2.6.spec file as follows:

Add this at line ~935 or so:

Patch40000: pci_new.patch

Add this at line ~1908 or so:

%patch40000 -p1

Good luck, Akemi

Akemi Yagi

6:27 a.m.

On 6/3/07, Akemi Yagi amyagi@gmail.com wrote:

...

Add this at line ~935 or so:

Patch40000: pci_new.patch

Sorry, this must be

Patch40000: pci-gart.c

Akemi Yagi

10:25 a.m.

On 6/3/07, Akemi Yagi amyagi@gmail.com wrote:

...

On 6/3/07, Akemi Yagi amyagi@gmail.com wrote:

...
Add this at line ~935 or so:

Patch40000: pci_new.patch

Sorry, this must be

Patch40000: pci-gart.c

Argh! If only I could get it right :( You should use the name of the patch file there. For example:

Patch40000: pci-gart.patch

Akemi
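Putting the corrections in this sub-thread together, the spec edit amounts to one `PatchN:` declaration plus one `%patchN` application line. The sketch below is a hypothetical helper, not part of the CentOS tooling. It assumes the new lines can go directly after the last existing `PatchN:` and `%patchN` entries, whereas the thread only gives approximate line numbers (~935 and ~1908), and it uses the final file name pci-gart.patch.

```python
import re

DECL_RE = re.compile(r"^Patch\d+:")     # e.g. "Patch1: foo.patch"
APPLY_RE = re.compile(r"^%patch\d+\b")  # e.g. "%patch1 -p1"


def add_patch_to_spec(spec_text, number, filename, strip=1):
    """Return spec_text with a new patch declared and applied.

    Assumption: appending after the last existing declaration and the
    last existing application line is acceptable for this spec file.
    """
    lines = spec_text.splitlines()
    decl_idx = [i for i, line in enumerate(lines) if DECL_RE.match(line)]
    apply_idx = [i for i, line in enumerate(lines) if APPLY_RE.match(line)]
    if not decl_idx or not apply_idx:
        raise ValueError("no existing PatchN: / %patchN lines to anchor on")

    inserts = [
        (decl_idx[-1] + 1, f"Patch{number}: {filename}"),
        (apply_idx[-1] + 1, f"%patch{number} -p{strip}"),
    ]
    # Insert from the bottom up so the earlier index stays valid.
    for pos, text in sorted(inserts, reverse=True):
        lines.insert(pos, text)
    return "\n".join(lines) + "\n"
```

Calling `add_patch_to_spec(text, 40000, "pci-gart.patch")` on the contents of kernel-2.6.spec would produce the two lines described above. The patch file itself still has to be copied into the SOURCES directory of the rpm build tree before rebuilding.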

Bart Schaefer

30 May

11:51 p.m.

On 5/29/07, Dan Halbert halbert@bbn.com wrote:

...

Feizhou wrote:

...
Anybody know if the fix for the incorrect kernel behaviour that leads to data corruption will be available for RHEL4/5 kernels?

I assume you are talking about the "GART pages must be uncacheable" bug, which affects AMD processors.

The kernel bug is http://bugzilla.kernel.org/show_bug.cgi?id=7768, which has now been closed after a patch was submitted.

Could this be the same bug as

http://bugs.centos.org/view.php?id=1774

?? We are experiencing the exact symptoms described in centos/1774 on a dual-CPU Opteron system with 16GB RAM.

Dan Halbert

31 May

12:07 a.m.

Bart Schaefer wrote:

...

Could this be the same bug as

http://bugs.centos.org/view.php?id=1774

?? We are experiencing the exact symptoms described in centos/1774 on a dual-CPU Opteron system with 16GB RAM.

I don't believe this is the same problem. The symptoms in 1774 are kernel panics with CPU ECC errors, if I'm reading it correctly. The 7768 kernel bug is corrupted 4k blocks in the pagecache, not kernel panics. Is this replicated on several machines? You could try booting with iommu=soft, which is the 7768 workaround, but I think that's a long shot. You might try "noapic", but hardware ECC errors would seem to point to a timing problem or bad hardware.
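Whether a machine was actually booted with the iommu=soft workaround can be read from the kernel command line. A minimal sketch follows, with hypothetical helper names; it assumes that several iommu options may be combined with commas in one parameter.

```python
def iommu_options(cmdline):
    """Collect the values of every iommu= parameter on a kernel command line."""
    opts = []
    for token in cmdline.split():
        if token.startswith("iommu="):
            # Assumption: options may be comma-separated, e.g. "iommu=a,b".
            opts.extend(o for o in token[len("iommu="):].split(",") if o)
    return opts


def uses_soft_iommu(cmdline):
    """True if the command line asks for the software IOMMU workaround."""
    return "soft" in iommu_options(cmdline)
```

On a live Linux system the input would be the contents of /proc/cmdline. The check only reports what was requested at boot, not whether the kernel honoured it.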

Bart Schaefer

3:19 a.m.

On 5/30/07, Dan Halbert halbert@bbn.com wrote:

...

Bart Schaefer wrote:

...
Could this be the same bug as

http://bugs.centos.org/view.php?id=1774

I don't believe this is the same problem. The symptoms in 1774 are kernel panics with CPU ECC errors, if I'm reading it correctly. The 7768 kernel bug is corrupted 4k blocks in the pagecache, not kernel panics. Is this replicated on several machines?

No; we have 4 identical machines (hardware-wise) and this only occurs on one of them. However, the failing one is the only one running CentOS 5; the other 3 are using CentOS 3.8.

Feizhou

3:26 a.m.

Bart Schaefer wrote:

...

On 5/30/07, Dan Halbert halbert@bbn.com wrote:

...
Bart Schaefer wrote:

...
Could this be the same bug as

http://bugs.centos.org/view.php?id=1774

I don't believe this is the same problem. The symptoms in 1774 are kernel panics with CPU ECC errors, if I'm reading it correctly. The 7768 kernel bug is corrupted 4k blocks in the pagecache, not kernel panics. Is this replicated on several machines?

No; we have 4 identical machines (hardware-wise) and this only occurs on one of them. However, the failing one is the only one running CentOS 5; the other 3 are using CentOS 3.8.

Have you tried the latest mainline kernel on the CentOS 5 box?

Bart Schaefer

3:30 a.m.

On 5/30/07, Feizhou feizhou@graffiti.net wrote:

...

Have you tried the latest mainline kernel on the CentOS 5 box?

What do you mean by "mainline"?

We have not tried building a kernel, nor installing any drivers or other kernel-related software other than that available from the CentOS project repos. However, the machine is fully yum-updated.

Feizhou

3:46 a.m.

Bart Schaefer wrote:

...

On 5/30/07, Feizhou feizhou@graffiti.net wrote:

...
Have you tried the latest mainline kernel on the CentOS 5 box?

What do you mean by "mainline"?

The www.kernel.org kernel. The latest, 2.6.21, has the fix for the faulty kernel GART IOMMU code.

...

We have not tried building a kernel, nor installing any drivers or other kernel-related software other than that available from the CentOS project repos. However, the machine is fully yum-updated.

So you also do not have the IOMMU enabled in your boxes?

Bart Schaefer

10:46 a.m.

On 5/30/07, Feizhou feizhou@graffiti.net wrote:

...

Bart Schaefer wrote:

...
We have not tried building a kernel, nor installing any drivers or other kernel-related software other than that available from the CentOS project repos. However, the machine is fully yum-updated.

So you also do not have the IOMMU enabled in your boxes?

% dmesg | grep -i iommu
Please enable the IOMMU option in the BIOS setup
PCI-DMA: using GART IOMMU.
PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture
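The same check can be scripted. The sketch below scans kernel log text for the messages quoted above and treats "PCI-DMA: using GART IOMMU" as the sign that the affected code path is active. The function names are hypothetical, and the message wording is taken only from this output, so other kernel versions may phrase it differently.

```python
import re

# Kernel log lines as quoted in the message above.
SAMPLE_DMESG = """\
Please enable the IOMMU option in the BIOS setup
PCI-DMA: using GART IOMMU.
PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture
"""


def gart_iommu_in_use(dmesg_text):
    """True if the log reports the GART IOMMU as the PCI-DMA backend."""
    return any("PCI-DMA: using GART IOMMU" in line
               for line in dmesg_text.splitlines())


def reserved_iommu_mb(dmesg_text):
    """Size in MB of the reserved IOMMU area, or None if not reported."""
    m = re.search(r"Reserving (\d+)MB of IOMMU area", dmesg_text)
    return int(m.group(1)) if m else None
```

For the sample text, `gart_iommu_in_use` returns True and `reserved_iommu_mb` returns 64. On a real machine the input would be the captured output of `dmesg`.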

Feizhou

1 Jun

2:33 a.m.

Bart Schaefer wrote:

...

On 5/30/07, Feizhou feizhou@graffiti.net wrote:

...
Bart Schaefer wrote:

...
We have not tried building a kernel, nor installing any drivers or other kernel-related software other than that available from the CentOS project repos. However, the machine is fully yum-updated.

So you also do not have the IOMMU enabled in your boxes?

% dmesg | grep -i iommu
Please enable the IOMMU option in the BIOS setup
PCI-DMA: using GART IOMMU.
PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture

Try building a 2.6.21 kernel and see if your problems go away.

Florin Andrei

30 May

11:08 p.m.

Feizhou wrote:

...

Anybody know if the fix for the incorrect kernel behaviour that leads to data corruption will be available for RHEL4/5 kernels?

Is the bug still triggered if the system is running the 32 bit OS instead of 64 bit?

-- Florin Andrei http://florin.myip.org/

Dan Halbert

31 May

12:10 a.m.

Florin Andrei wrote:

...

Is the bug still triggered if the system is running the 32 bit OS instead of 64 bit?

Kernel bug 7768 is an x86_64 bug and has only been reported as such. I did not see it on 32-bit CentOS. I saw bad floating-point answers under load (!), which were fixed by booting with noapic on our motherboards. (I've already written about that on this list, a few months ago.)

discuss@lists.centos.org
