[CentOS] pagecache corruption on Tyan S3870
halbert at bbn.com
Thu Mar 1 04:21:42 UTC 2007
A couple of months ago I reported some problems with a batch of Tyan
K8SSA (S3870) based machines. We are continuing to have an odd problem
with these boxes, and if anyone has seen something similar elsewhere,
I'd appreciate hearing about it.
These boxes are running CentOS 4.4 x86_64 with kernel
2.6.9-42.0.3.ELsmp. They are dual Opteron 265s (dual core) with 4x2GB
DIMMs. The DIMMs used to be mixed sizes, but Tyan recommended making
them all the same, and the vendor made the substitutions. We have also
clocked the memory down from 400 MHz to 266 MHz, again on the advice of Tyan.
The symptom is that some large (700MB to >1GB) files opened for read and
then closed show corruption in the pagecache. One or more 4k blocks in a
file will be completely trashed. It's as if a random page of other data
is substituted. A reboot or a flush of the pagecache fixes the problem,
so the corruption is only in the pagecache, not on disk. Regular MD5
checksums of the files reveal the problem, as do occasional crashes of
our application.
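
One way to narrow a whole-file MD5 mismatch down to the individual 4k
blocks that differ is sketched below in Python. This is an illustration
only, not the script used on these machines: the function names
block_digests and differing_blocks are hypothetical, and the 4096-byte
block size is an assumption based on the 4k blocks described above. The
idea is to record per-block digests while the file is known to be good,
then compare them against a later read.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed page size, matching the 4k blocks described above


def block_digests(path, block_size=BLOCK_SIZE):
    """Return a list of MD5 hex digests, one per block of the file."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            digests.append(hashlib.md5(block).hexdigest())
    return digests


def differing_blocks(reference, current):
    """Return the indices of blocks whose digests differ between two reads."""
    bad = [i for i, (a, b) in enumerate(zip(reference, current)) if a != b]
    # A length mismatch means the file size changed; flag the extra blocks too.
    shorter, longer = sorted((len(reference), len(current)))
    bad.extend(range(shorter, longer))
    return bad
```

Multiplying a reported index by the block size gives the byte offset of
the suspect page, which can then be dumped and compared against the
on-disk copy after the cache has been flushed.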
We have some older Tyan motherboards that don't show this problem. At
this point it seems to be either a hardware problem or a problem with
the kernel's support for this motherboard, but it's pretty baffling.