[CentOS] pagecache corruption on Tyan S3870

Thu Mar 1 04:21:42 UTC 2007
Dan Halbert <halbert at bbn.com>

A couple of months ago I reported some problems with a batch of Tyan 
K8SSA (S3870) based machines. We are continuing to have an odd problem 
with these boxes, and if anyone has seen something similar elsewhere, 
I'd appreciate hearing about it.

These boxes are running CentOS 4.4 x86_64 with kernel 
2.6.9-42.0.3.ELsmp. They are dual Opteron 265s (dual core) with 4x2GB 
DIMMs. The DIMMs used to be mixed sizes, but Tyan recommended making 
them all the same, and the vendor made the substitutions. Also on the 
advice of Tyan, we have clocked the memory down from 400 MHz to 266 MHz.

The symptom is that some large files (700MB to over 1GB), opened for 
read and then closed, show corruption in the pagecache. One or more 4k 
blocks in a file will be completely trashed, as if a random page of 
other data had been substituted. A reboot or a flush of the pagecache 
fixes the problem, so the corruption is only in the pagecache, not on 
disk. We run regular MD5 checksums of the files, which is how we catch 
the problem, and our application also crashes from time to time.
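
A check of this kind could be sketched as follows in Python. This is an 
illustration only, not the script actually used on these machines; the 
function names block_digests and changed_blocks and the 4096-byte block 
size are assumptions chosen to match the 4k blocks described above. It 
computes a whole-file MD5 plus one MD5 per block, so that a later read 
can be compared against a known-good run to see which blocks changed.

```python
import hashlib

# Assumed block size: 4096 bytes, matching the 4k blocks described above.
BLOCK_SIZE = 4096


def block_digests(path, block_size=BLOCK_SIZE):
    """Return (whole-file MD5 hex digest, list of per-block MD5 hex digests)."""
    whole = hashlib.md5()
    blocks = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            whole.update(chunk)
            blocks.append(hashlib.md5(chunk).hexdigest())
    return whole.hexdigest(), blocks


def changed_blocks(reference, current):
    """Return the indices of blocks whose digests differ between two runs.

    A block present in only one of the two lists counts as changed.
    """
    n = max(len(reference), len(current))
    return [
        i for i in range(n)
        if i >= len(reference) or i >= len(current) or reference[i] != current[i]
    ]
```

Saving the per-block list from a known-good read and comparing it with 
a later read of the same file gives the offsets of the bad blocks 
(index times 4096), which may help show whether the trashed pages follow 
any pattern.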

We have some older Tyan motherboards that don't show this problem. At 
this point it seems to be either a hardware problem or a problem with 
the kernel's support for this motherboard, but it's pretty baffling.