Anybody know if the fix for the incorrect kernel behaviour that leads to data corruption will be available for RHEL4/5 kernels?
Feizhou wrote:
Anybody know if the fix for the incorrect kernel behaviour that leads to data corruption will be available for RHEL4/5 kernels?
I assume you are talking about the "GART pages must be uncacheable" bug, which affects AMD processors.
The kernel bug is http://bugzilla.kernel.org/show_bug.cgi?id=7768, which has now been closed after a patch was submitted.
I searched for the commit id in the Redhat bugzilla, and found it for RHEL5: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=238709. That bug says the fix has been backported and is in RHEL5 kernel 2.6.18-18.el5. But I don't see anything about backporting to RHEL4.
Dan
Dan Halbert wrote:
Feizhou wrote:
Anybody know if the fix for the incorrect kernel behaviour that leads to data corruption will be available for RHEL4/5 kernels?
I assume you are talking about the "GART pages must be uncacheable" bug, which affects AMD processors.
The kernel bug is http://bugzilla.kernel.org/show_bug.cgi?id=7768, which has now been closed after a patch was submitted.
I searched for the commit id in the Redhat bugzilla, and found it for RHEL5: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=238709. That bug says the fix has been backported and is in RHEL5 kernel 2.6.18-18.el5. But I don't see anything about backporting to RHEL4.
I see. So forget RAM >= 4GB on Centos 4 on AMD AM2. Hmmph.
On 5/30/07, Feizhou feizhou@graffiti.net wrote:
Dan Halbert wrote:
Feizhou wrote: I searched for the commit id in the Redhat bugzilla, and found it for RHEL5: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=238709. That bug says the fix has been backported and is in RHEL5 kernel 2.6.18-18.el5. But I don't see anything about backporting to RHEL4.
I see. So forget RAM >= 4GB on Centos 4 on AMD AM2. Hmmph.
Well, not really. There seems to be a test kernel (2.6.9-42.EL) with this bug fix at:
http://people.redhat.com/coldwell/kernel/bugs/223238/
Therefore, upstream may be working on backporting for RHEL4 (hopefully). Note also that the official patched version for CentOS 5 will not be available for a while either. The 2.6.18-18.el5 kernel might be for RHEL 5.1.
Because the patch is available now, another option is to rebuild the kernel with it applied. That is certainly not for everyone, but if the fix is needed right now, this is the only option.
Akemi
On 5/30/07, Akemi Yagi amyagi@gmail.com wrote:
On 5/30/07, Feizhou feizhou@graffiti.net wrote:
Dan Halbert wrote:
Feizhou wrote: I searched for the commit id in the Redhat bugzilla, and found it for RHEL5: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=238709. That bug says the fix has been backported and is in RHEL5 kernel 2.6.18-18.el5. But I don't see anything about backporting to RHEL4.
I see. So forget RAM >= 4GB on Centos 4 on AMD AM2. Hmmph.
Well, not really. There seems to be a test kernel (2.6.9-42.EL) with this bug fix at:
http://people.redhat.com/coldwell/kernel/bugs/223238/
Therefore, upstream may be working on backporting for RHEL4 (hopefully). Note also that the official patched version for CentOS 5 will not be available for a while either. The 2.6.18-18.el5 kernel might be for RHEL 5.1.
Because the patch is available now, another option is to rebuild the kernel with it applied. That is certainly not for everyone, but if the fix is needed right now, this is the only option.
Akemi
The patch file for this bug did not work on the CentOS source file as is. I have recreated it for CentOS 5.0 x86_64 (see below) and was able to rebuild kernels. If you ever decide to do the same, here's the modified patch, pci-gart.c:
--- 2.6-git.orig/arch/x86_64/kernel/pci-gart.c
+++ 2.6-git/arch/x86_64/kernel/pci-gart.c
@@ -523,6 +523,10 @@
 	gatt = (void *)__get_free_pages(GFP_KERNEL, get_order(gatt_size));
 	if (!gatt)
 		panic("Cannot allocate GATT table");
+	if (change_page_attr_addr((unsigned long)gatt, gatt_size >> PAGE_SHIFT, PAGE_KERNEL_NOCACHE))
+		panic("Could not set GART PTEs to uncacheable pages");
+	global_flush_tlb();
+
 	memset(gatt, 0, gatt_size);
 	agp_gatt_table = gatt;
================================
Then edit the kernel-2.6.spec file as follows:
Add this at line ~935 or so:
Patch40000: pci_new.patch
Add this at line ~1908 or so:
%patch40000 -p1
Good luck, Akemi
On 6/3/07, Akemi Yagi amyagi@gmail.com wrote:
Add this at line ~935 or so:
Patch40000: pci_new.patch
Sorry, this must be
Patch40000: pci-gart.c
On 6/3/07, Akemi Yagi amyagi@gmail.com wrote:
On 6/3/07, Akemi Yagi amyagi@gmail.com wrote:
Add this at line ~935 or so:
Patch40000: pci_new.patch
Sorry, this must be
Patch40000: pci-gart.c
Argh! If only I could get it right :( You should use the name of the patch file there. For example:
Patch40000: pci-gart.patch
Akemi
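For anyone who would rather script the two spec file edits than make them by hand, here is a minimal Python sketch. The function name `add_patch_to_spec` is made up for illustration, and it assumes the new lines can simply go after the last existing `PatchN:` declaration and the last existing `%patchN` line; the real kernel-2.6.spec may wrap some patches in conditionals, so check the result before building.

```python
import re


def add_patch_to_spec(spec_text, number, filename, strip=1):
    """Return spec_text with a PatchN: declaration and a %patchN line added.

    Assumption (not from the thread): the new lines are placed right after
    the last existing PatchN: declaration and the last existing %patchN line.
    """
    lines = spec_text.splitlines()
    decl = [i for i, line in enumerate(lines) if re.match(r"Patch\d+:", line)]
    applied = [i for i, line in enumerate(lines) if re.match(r"%patch\d+\b", line)]
    if not decl or not applied:
        raise ValueError("no existing Patch/%patch lines found")

    inserts = [
        (decl[-1] + 1, f"Patch{number}: {filename}"),
        (applied[-1] + 1, f"%patch{number} -p{strip}"),
    ]
    # Insert from the bottom up so the earlier index is still valid.
    for index, text in sorted(inserts, reverse=True):
        lines.insert(index, text)
    return "\n".join(lines) + "\n"


# Tiny stand-in for a spec file, only to show the effect.
sample_spec = (
    "Name: kernel\n"
    "Patch1: a.patch\n"
    "Patch2: b.patch\n"
    "%prep\n"
    "%patch1 -p1\n"
    "%patch2 -p1\n"
    "%build\n"
)
print(add_patch_to_spec(sample_spec, 40000, "pci-gart.patch"))
```

The patch number 40000 and the file name pci-gart.patch are the ones from the messages above; the sample spec is only a stand-in.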
On 5/29/07, Dan Halbert halbert@bbn.com wrote:
Feizhou wrote:
Anybody know if the fix for the incorrect kernel behaviour that leads to data corruption will be available for RHEL4/5 kernels?
I assume you are talking about the "GART pages must be uncacheable" bug, which affects AMD processors.
The kernel bug is http://bugzilla.kernel.org/show_bug.cgi?id=7768, which has now been closed after a patch was submitted.
Could this be the same bug as
http://bugs.centos.org/view.php?id=1774
?? We are experiencing the exact symptoms described in centos/1774 on a dual-CPU Opteron system with 16GB RAM.
Bart Schaefer wrote:
Could this be the same bug as
http://bugs.centos.org/view.php?id=1774
?? We are experiencing the exact symptoms described in centos/1774 on a dual-CPU Opteron system with 16GB RAM.
I don't believe this is the same problem. The symptoms in 1774 are kernel panics with CPU ECC errors, if I'm reading it correctly. The 7768 kernel bug is corrupted 4k blocks in the pagecache, not kernel panics. Is this replicated on several machines? You could try booting with iommu=soft, which is the 7768 workaround, but I think that's a long shot. You might try "noapic", but hardware ECC errors would seem to point to a timing problem or bad hardware.
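To confirm after a reboot that a workaround such as iommu=soft or noapic really made it onto the kernel command line, a small Python check along these lines could be used. The helper name `has_boot_option` and the sample command line are hypothetical; on a running system the text would come from /proc/cmdline.

```python
def has_boot_option(cmdline, option):
    """Return True if `option` appears as a whole word on the kernel command line."""
    return option in cmdline.split()


# Made-up command line for illustration; on a live box read /proc/cmdline instead.
sample_cmdline = "ro root=/dev/VolGroup00/LogVol00 rhgb quiet iommu=soft"
print(has_boot_option(sample_cmdline, "iommu=soft"))
print(has_boot_option(sample_cmdline, "noapic"))
```

Splitting on whitespace rather than doing a substring search keeps a check for one option from matching inside a longer one.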
On 5/30/07, Dan Halbert halbert@bbn.com wrote:
Bart Schaefer wrote:
Could this be the same bug as
I don't believe this is the same problem. The symptoms in 1774 are kernel panics with CPU ECC errors, if I'm reading it correctly. The 7768 kernel bug is corrupted 4k blocks in the pagecache, not kernel panics. Is this replicated on several machines?
No; we have 4 identical machines (hardware-wise) and this only occurs on one of them. However, the failing one is the only one running CentOS 5, the other 3 are using CentOS 3.8.
Bart Schaefer wrote:
On 5/30/07, Dan Halbert halbert@bbn.com wrote:
Bart Schaefer wrote:
Could this be the same bug as
I don't believe this is the same problem. The symptoms in 1774 are kernel panics with CPU ECC errors, if I'm reading it correctly. The 7768 kernel bug is corrupted 4k blocks in the pagecache, not kernel panics. Is this replicated on several machines?
No; we have 4 identical machines (hardware-wise) and this only occurs on one of them. However, the failing one is the only one running CentOS 5, the other 3 are using CentOS 3.8.
Have you tried the latest mainline kernel on the Centos 5 box?
On 5/30/07, Feizhou feizhou@graffiti.net wrote:
Have you tried the latest mainline kernel on the Centos 5 box?
What do you mean by "mainline"?
We have not tried building a kernel, nor installing any drivers or other kernel-related software other than that available from the CentOS project repos. However, the machine is fully yum-updated.
Bart Schaefer wrote:
On 5/30/07, Feizhou feizhou@graffiti.net wrote:
Have you tried the latest mainline kernel on the Centos 5 box?
What do you mean by "mainline"?
The www.kernel.org kernel. The latest 2.6.21 has the fix for the faulty kernel GART IOMMU code.
We have not tried building a kernel, nor installing any drivers or other kernel-related software other than that available from the CentOS project repos. However, the machine is fully yum-updated.
So you also do not have the IOMMU enabled in your boxes?
On 5/30/07, Feizhou feizhou@graffiti.net wrote:
Bart Schaefer wrote:
We have not tried building a kernel, nor installing any drivers or other kernel-related software other than that available from the CentOS project repos. However, the machine is fully yum-updated.
So you also do not have the IOMMU enabled in your boxes?
% dmesg | grep -i iommu
Please enable the IOMMU option in the BIOS setup
PCI-DMA: using GART IOMMU.
PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture
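The same check can be scripted when several machines need comparing. This is only a sketch: the function names are invented for illustration, the sample text is the dmesg output posted above, and matching on the "using GART IOMMU" string is an assumption based on that output rather than on any documented interface.

```python
def iommu_lines(dmesg_text):
    """Return the dmesg lines that mention the IOMMU (case-insensitive)."""
    return [line for line in dmesg_text.splitlines() if "iommu" in line.lower()]


def uses_gart_iommu(dmesg_text):
    """Return True if the kernel reported that it is using the GART IOMMU."""
    return any("using gart iommu" in line.lower() for line in dmesg_text.splitlines())


# The three lines from the dmesg output quoted in this thread.
sample_dmesg = (
    "Please enable the IOMMU option in the BIOS setup\n"
    "PCI-DMA: using GART IOMMU.\n"
    "PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture\n"
)
print(uses_gart_iommu(sample_dmesg))
for line in iommu_lines(sample_dmesg):
    print(line)
```

On a real system the text would come from running dmesg, for example through the standard subprocess module.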
Bart Schaefer wrote:
On 5/30/07, Feizhou feizhou@graffiti.net wrote:
Bart Schaefer wrote:
We have not tried building a kernel, nor installing any drivers or other kernel-related software other than that available from the CentOS project repos. However, the machine is fully yum-updated.
So you also do not have the IOMMU enabled in your boxes?
% dmesg | grep -i iommu
Please enable the IOMMU option in the BIOS setup
PCI-DMA: using GART IOMMU.
PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture
Try building a 2.6.21 kernel and see if your problems go away.
Feizhou wrote:
Anybody know if the fix for the incorrect kernel behaviour that leads to data corruption will be available for RHEL4/5 kernels?
Is the bug still triggered if the system is running the 32 bit OS instead of 64 bit?
Florin Andrei wrote:
Is the bug still triggered if the system is running the 32 bit OS instead of 64 bit?
Kernel bug 7768 is an x86_64 bug and has only been reported as such. I did not see it on 32 bit Centos. I saw bad floating-point answers under load (!), which were fixed by booting with noapic on our motherboards. (I've already written about that on this list, a few months ago.)