On Mon, Dec 21, 2009 at 08:24, Gordon McLellan <gordonthree@gmail.com> wrote:
> Thank you all for the suggestions. I will grab a test suite or two and do some burn-in testing over the upcoming weekends. These machines are new, built from scratch. I've been building systems for over fifteen years and haven't had anywhere near this amount of trouble, which is really aggravating!
> I realize garbage in equals garbage out and some of the chosen components are pretty low-end, but I did spend close to six months researching the components, and couldn't find substantial evidence to dissuade me from any of the choices. The only parts not new are the 250 GB Seagates, which are basically left-over parts from an old server that was upgraded. They're all known-good, as that server gave me no trouble through its service life.
I know someone mentioned this earlier in the thread, but before you spend a lot of time looking at power supplies, drives, etc., you might want to consider installing any motherboard BIOS updates that the vendor has released. It's quick, cheap, and easy, and the symptoms fit.
I had basically identical symptoms on a cluster of storage systems I built about a year ago. It was terrible--machines kept crashing under load, at random times, with no explanation. Similar to your situation, we custom-built our own machines from identical boards, CPUs, etc.
The problem turned out to be a combination of the CPU and the motherboard. Our procs were the newest CPU stepping in that particular product line (AMD Opteron 4xxx, I think), and the board's original BIOS wasn't 100% compatible with the new stepping. After we'd updated the BIOS, the problems disappeared and the system was basically rock-solid.
I was pretty surprised by the whole thing: I was skeptical about the BIOS update, because I imagined that an incompatible CPU wouldn't even boot. But the bug was more subtle than that, and I learned something new.
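If it helps, here is a rough sketch for recording which CPU stepping and BIOS version each box reports, so you can compare machines before and after flashing. It assumes the machines run Linux and expose /proc/cpuinfo and /sys/class/dmi/id/bios_version; on another OS the same information lives elsewhere. The function names (parse_cpu_ids, read_text) are just made up for illustration.

```python
from pathlib import Path


def parse_cpu_ids(cpuinfo_text):
    """Return the set of (model name, family, model, stepping) tuples
    found in text laid out like Linux's /proc/cpuinfo on x86."""
    ids = set()
    # Each logical CPU is described by a block of "key : value" lines,
    # and blocks are separated by a blank line.
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "stepping" in fields:
            ids.add((
                fields.get("model name", "?"),
                fields.get("cpu family", "?"),
                fields.get("model", "?"),
                fields["stepping"],
            ))
    return ids


def read_text(path):
    """Read a small text file, returning None if it is missing or unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    # Assumed Linux-only locations; both reads fall back gracefully.
    cpuinfo = read_text("/proc/cpuinfo") or ""
    for name, family, model, stepping in sorted(parse_cpu_ids(cpuinfo)):
        print(f"CPU: {name} (family {family}, model {model}, stepping {stepping})")
    print("BIOS version:", read_text("/sys/class/dmi/id/bios_version"))
    print("BIOS date:", read_text("/sys/class/dmi/id/bios_date"))
```

Run it on each machine and compare the stepping against the CPU support list in the board vendor's BIOS release notes. That comparison is what would have pointed us at the problem sooner.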
Whatever happens, good luck, and I hope you find the problem quickly.
-Ryan