A couple of months ago I reported some problems with a batch of Tyan K8SSA (S3870) based machines. We are continuing to have an odd problem with these boxes, and if anyone has seen something similar elsewhere, I'd appreciate hearing about it.
These boxes are running CentOS 4.4 x86_64 with kernel 2.6.9-42.0.3.ELsmp. They have dual Opteron 265s (dual core) and 4x2GB DIMMs. The DIMMs used to be mixed sizes, but Tyan recommended making them all the same, and the vendor made the substitutions. We have also clocked the memory down from 400 MHz to 266 MHz, again on the advice of Tyan.
The symptom is that some large (700MB to >1GB) files opened for read and then closed show corruption in the pagecache. One or more 4k blocks in a file will be completely trashed. It's as if a random page of other data is substituted. A reboot or a flush of the pagecache fixes the problem, so it's only in the pagecache, not on disk. We are doing regular MD5 checksums of the files, which shows up the problem, in addition to having our application crash from time to time.
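For anyone wanting to reproduce this kind of check, here is a minimal sketch of a per-block comparison, which narrows a whole-file MD5 mismatch down to the individual 4k blocks that changed. It is not the script we actually run; the function names `block_digests` and `changed_blocks` are made up for illustration, and the 4096-byte block size is an assumption based on the 4k pages described above.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed page size, matching the 4k blocks described above


def block_digests(path, block_size=BLOCK_SIZE):
    """Return a list with one MD5 hex digest per block of the file."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            digests.append(hashlib.md5(block).hexdigest())
    return digests


def changed_blocks(baseline, current):
    """Return the indices of blocks whose digests differ between two passes."""
    length = max(len(baseline), len(current))
    bad = []
    for i in range(length):
        a = baseline[i] if i < len(baseline) else None
        b = current[i] if i < len(current) else None
        if a != b:
            bad.append(i)
    return bad
```

The idea would be to record `block_digests()` for a file while its contents are known to be good, then repeat the pass later and feed both lists to `changed_blocks()`. Multiplying a returned index by the block size gives the byte offset of the suspect block.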
We have some older Tyan motherboards that don't show this problem. At this point it seems it is either a hardware problem or a kernel motherboard-support problem, but it's pretty baffling.
Thanks, Dan
What you should be doing is swapping the CPUs and RAM modules from board to board. With the help of someone who can do some statistical analysis for you, you can quickly pinpoint whether the problem resides in the motherboards, in some CPUs or RAM modules, or in combinations thereof.
Presumably, since the servers are so prone to error at the moment, they will not be doing anything important, allowing you to easily swap stuff around. If you can include in this trial some identical servers which seem to be working fine, this will greatly speed up the process of apportioning blame.
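As a rough illustration of the bookkeeping such a trial needs, here is a small sketch that tallies failures per component across swap trials. The record layout (`board`, `cpu`, `ram`, `failed`) and the function names are hypothetical, and a simple tally like this is only a starting point, not a substitute for proper statistical analysis.

```python
from collections import defaultdict


def failure_counts(trials):
    """Map (kind, component_id) -> (failures, total trials it took part in)."""
    counts = defaultdict(lambda: [0, 0])
    for trial in trials:
        for kind in ("board", "cpu", "ram"):
            entry = counts[(kind, trial[kind])]
            entry[1] += 1
            if trial["failed"]:
                entry[0] += 1
    return {key: tuple(value) for key, value in counts.items()}


def suspects(trials):
    """Components that failed in every trial they took part in."""
    return sorted(
        key for key, (failures, total) in failure_counts(trials).items()
        if failures == total
    )
```

Each trial is one combination of parts plus whether the corruption showed up. A component that fails in every combination it appears in is the first candidate, while failures that only follow particular pairings point at an interaction between parts.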
Dan Halbert spake the following on 2/28/2007 8:21 PM:
Have you tried a newer kernel to see if it changes the problem?
Scott Silva wrote:
These boxes are running Centos 4.4 x86_64 with kernel 2.6.9-42.0.3.ELsmp.
Have you tried a newer kernel to see if it changes the problem?
We have not tried the 0.8 kernel extensively, but there is nothing in the release notes that seems related. Our next trial will be with FC6. Unfortunately it can take days for this problem to show up.
A previous responder suggested swapping CPU's or RAM around. The RAM has been shuffled and in some cases replaced (the vendor replaced all the 1GB DIMM's with 2GB). These problems are occurring across 25 machines, so it does not seem to be isolated bad components.
Dan
-----Original Message----- From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Dan Halbert Sent: Thursday, March 01, 2007 1:12 PM To: CentOS mailing list Subject: Re: [CentOS] Re: pagecache corruption on Tyan S3870
Double-check your BIOS settings against your memory type; there may be timing issues there.
-Ross
Dan Halbert spake the following on 3/1/2007 10:11 AM:
Are you using the 105 bios? It looks to be the latest.
Scott Silva wrote:
Are you using the 105 bios? It looks to be the latest.
We are, though thanks for the suggestion. We've also fiddled with various BIOS settings, including IOMMU. I was hoping that someone had seen something similar with pagecache corruption. I've done a lot of searching on bugzilla.redhat.com. If it were a straight memory timing problem, it would seem to me we would see more random crashes and not this weird 4k-page corruption, which seems like issues with page tables or something like that.
A previous responder suggested swapping CPU's or RAM around. The RAM has been shuffled and in some cases replaced (the vendor replaced all the 1GB DIMM's with 2GB). These problems are occurring across 25 machines, so it does not seem to be isolated bad components.
Umm. I didn't say to swap components around, but to _systematically_ swap them around. BIG difference. To make this work, you also need to ensure the mainboards are exact same revision, have same BIOS, and bios is set to factory defaults. If this is done properly, you can not only pinpoint faulty hardware but faulty interaction between components.
Much easier to troubleshoot hardware than monkeying about with kernels and associated modules.
To follow up on issues we are having with the Tyan S3870 (K8SSA) Opteron motherboards:
We actually saw another problem with these boxes, but only with i386 Linux (CentOS, FC6, etc.). A certain compute-intensive application that also read about 10MB of data files would get wrong answers when several instances were run in parallel. (Interestingly, a yum update I ran on the box also got occasional strange errors.)
This was an easier error to check for, since I could reproduce the error in a few minutes.
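A generic stand-in for this sort of test, for anyone without a similar application to hand: start several identical workers at once on the same input and check that they all return the same answer. This is only a sketch, not the application mentioned above. The worker here just computes an MD5 of a file, and `WORKER`, `parallel_digests` and `all_agree` are names invented for the example.

```python
import subprocess
import sys

# One-line worker program: print the MD5 of the file named in argv[1].
WORKER = (
    "import hashlib, sys; "
    "print(hashlib.md5(open(sys.argv[1], 'rb').read()).hexdigest())"
)


def parallel_digests(path, instances=4):
    """Start several worker processes at once and collect their outputs."""
    procs = [
        subprocess.Popen([sys.executable, "-c", WORKER, path],
                         stdout=subprocess.PIPE)
        for _ in range(instances)
    ]
    return [p.communicate()[0].decode().strip() for p in procs]


def all_agree(results):
    """True if every instance produced the same answer."""
    return len(set(results)) == 1
```

On a healthy machine `all_agree(parallel_digests(path))` should hold on every run. Whether a workload this light would trigger the fault described here is an open question; a heavier deterministic computation may be needed.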
After systematically trying many different memory swaps and BIOS settings (including memory timings), I discovered that booting with "noapic" fixed the problem above. We haven't yet completed an x86_64 pagecache corruption test with "noapic", but I am pretty suspicious that these problems are related. Running with maxcpus=1 also fixes the problem, which confirms it's an smp-related problem.
I'll report back one more time if noapic fixes our pagecache problems.
Tyan updated the BIOS for this board a few versions back to fix a booting problem with x86_64 Redhat. Some people worked around that problem with noapic. I wonder if the BIOS still has some problems...
Thanks for all your suggestions, Dan
Dan Halbert wrote:
A couple of months ago I reported some problems with a batch of Tyan K8SSA (S3870) based machines. ... The symptom is that some large (700MB to >1GB) files opened for read and then closed show corruption in the pagecache. One or more 4k blocks in a file will be completely trashed... A reboot or a flush of the pagecache fixes the problem, so it's only in the pagecache, not on disk.
One more followup on this, for posterity. (I don't like unanswered questions in mailing-list archives.) It turns out this problem seems to be the same one reported in this kernel bug: http://bugzilla.kernel.org/show_bug.cgi?id=7768. It has also been discussed on LKML.
The bug was reported on AMD Nvidia boards; we have AMD ServerWorks, but the problem appears to be the same. AMD is working on this. The current workaround is to boot with "iommu=soft".
Dan
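When rolling a boot option like this out to a number of boxes, it helps to verify that each one actually booted with it. Below is a minimal sketch that parses a kernel command line; it assumes the running kernel exposes it at /proc/cmdline, and the helper names are invented for the example. The same parser covers the "noapic" and "maxcpus=1" options tried earlier in the thread.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {option: value, or None if bare}."""
    options = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        options[name] = value if sep else None
    return options


def has_soft_iommu(cmdline):
    """True if the command line contains iommu=soft.

    The value is split on commas in case several iommu settings
    were given in one option.
    """
    value = parse_cmdline(cmdline).get("iommu")
    return value is not None and "soft" in value.split(",")


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (path is an assumption)."""
    with open(path) as f:
        return f.read()
```

On a live system, `has_soft_iommu(read_cmdline())` reports whether the workaround is in effect for the current boot.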