We have been seeing failures with CentOS 4.4 i386 (not x86_64) running
compute-intensive programs on Tyan K8SRE (S2891) motherboards with
Opteron 265s. This motherboard is used in the Tyan barebones box GT24
(B2881). We have these boards populated with 8GB of RAM, consisting of
mixed 2GB and 1GB sticks.
The symptom is that CPU-bound programs (may or may not be related to
floating point) fail randomly and intermittently, with wrong answers or
segfaults. Running several in parallel seems to make the failures more
likely. We have not seen any kernel crashes. It is not hard to reproduce
the problem with some internal programs we have; it takes only a few
minutes.
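
For readers without access to those internal programs, the kind of check
described above can be sketched roughly as follows. This is a hypothetical
stand-in, not the programs mentioned in this report: the names checksum,
run_parallel, and find_mismatches, and the WORKERS/ROUNDS/PASSES values, are
all assumptions made for illustration. The idea is to run the same
deterministic CPU-bound job in several processes at once and flag any copy
whose result differs from a reference value.

```python
import hashlib
import math
import multiprocessing as mp

# Hypothetical parameters; raise ROUNDS and PASSES for a longer soak test.
WORKERS = 4
ROUNDS = 200_000
PASSES = 3
SEED = 12345


def checksum(seed, rounds):
    """Deterministic mix of floating-point and integer work, returned as a digest."""
    x = seed + 0.5
    n = seed
    for _ in range(rounds):
        x = math.sqrt(x * x + 1.0) % 1000.0          # floating-point path
        n = (n * 1103515245 + 12345) & 0x7FFFFFFF    # integer path
    return hashlib.sha256(repr((x, n)).encode()).hexdigest()


def run_parallel(workers, seed, rounds):
    """Run the identical job in `workers` processes and collect every digest."""
    with mp.Pool(workers) as pool:
        return pool.starmap(checksum, [(seed, rounds)] * workers)


def find_mismatches(results, expected):
    """Return the indexes of results that differ from the reference digest."""
    return [i for i, r in enumerate(results) if r != expected]


if __name__ == "__main__":
    # Reference value computed once, serially, on the same machine.
    expected = checksum(SEED, ROUNDS)
    for p in range(PASSES):
        bad = find_mismatches(run_parallel(WORKERS, SEED, ROUNDS), expected)
        print("pass", p, "mismatching workers:", bad)
```

On a healthy machine every pass should report an empty list. A non-empty
list would only show that identical computations disagreed; it would not by
itself say whether the CPU, memory, chipset, or kernel is at fault.
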
This is with CentOS 4.4 i386, fully updated as of yesterday; whether or
not we use the hugemem kernel makes no difference. We have seen this on many
boxes, so it's not bad memory. We do NOT see this problem if we run
CentOS 4.4 x86_64 on the same boxes, using the same 32-bit test
executables. We also don't see this problem on some slightly older boxes
with Tyan K8SD motherboards running CentOS 4.4 i386 (also Opteron 265s,
with 8GB of 1GB DIMMs).
We have been looking at BIOS settings, but haven't seen anything that
stands out. memtest86 does not show errors.
Thanks for any suggestions of what this issue might be,
Dan