[CentOS] Tyan K8SRE troubles with CentOS 4.4 i386

Wed Dec 13 20:58:30 UTC 2006
Dan Halbert <halbert at bbn.com>

We have been seeing failures with CentOS 4.4 i386 (not x86_64) running 
compute-intensive programs on Tyan K8SRE (S2891) motherboards with 
Opteron 265s. This motherboard is used in the Tyan barebones box GT24 
(B2881). We have these boards populated with 8GB of RAM, consisting of 
mixed 2GB and 1GB sticks.

The symptom is that CPU-bound programs (floating point may or may not be 
a factor) fail randomly and intermittently, with wrong answers or 
segfaults. Running several in parallel seems to make the failures more 
likely. We have not seen any kernel crashes. It is not hard to reproduce 
the problem with some internal programs we have; it takes only a few 
minutes.
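
A minimal sketch of this kind of test, for illustration only. It is not 
one of the internal programs mentioned above; the function names, the 
constants, and the choice of a current Python 3 (rather than the 
interpreter shipped with CentOS 4.4) are all assumptions. Each process 
runs the same deterministic CPU-bound loop, so on healthy hardware every 
copy should return an identical result:

```python
from collections import Counter
from multiprocessing import Pool


def cpu_workload(seed, rounds=200_000):
    """Deterministic CPU-bound loop mixing integer and floating-point math."""
    state = seed
    total = 0.0
    for _ in range(rounds):
        # 64-bit linear congruential step (integer-only work)
        state = (state * 6364136223846793005 + 1442695040888963407) % 2**64
        # Fold the top 53 bits into a running float sum (floating-point work)
        total += (state >> 11) / 2.0**53
    return state, repr(total)


def find_mismatches(results):
    """Return indices of results that differ from the most common one."""
    expected = Counter(results).most_common(1)[0][0]
    return [i for i, r in enumerate(results) if r != expected]


def run_copies(copies=8, seed=12345, rounds=200_000):
    """Run identical workloads in parallel; return indices of copies that disagree."""
    with Pool(processes=copies) as pool:
        results = pool.starmap(cpu_workload, [(seed, rounds)] * copies)
    return find_mismatches(results)
```

Calling run_copies() repeatedly from a script (under an 
`if __name__ == "__main__":` guard) and watching for a non-empty list 
would flag a copy that computed a different answer. A segfault would 
instead show up as a worker process dying.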

This is with CentOS 4.4 i386, completely up to date as of yesterday. 
Using the hugemem kernel or not makes no difference. We have seen this 
on many boxes, so it's not bad memory. We do NOT see this problem if we 
run CentOS 4.4 x86_64 on the same boxes, using the same 32-bit test 
executables. We also don't see this problem on some slightly older boxes 
with Tyan K8SD motherboards running CentOS 4.4 i386 (also Opteron 265s, 
with 8GB of 1GB DIMMs).

We have been looking at BIOS settings, but haven't seen anything that 
stands out. memtest86 does not show errors.

Thanks for any suggestions of what this issue might be,
Dan