[CentOS] storage servers crashing, hair being pulled out!

Ryan Lynch ryan.b.lynch at gmail.com
Mon Dec 21 15:35:08 UTC 2009


On Mon, Dec 21, 2009 at 08:24, Gordon McLellan <gordonthree at gmail.com> wrote:
> Thank you all for the suggestions.  I will grab a test suite or two
> and do some burn in testing over the upcoming weekends.  These
> machines are new, built from scratch.  I've been building systems for
> over fifteen years and haven't had anywhere near this amount of
> trouble which is really aggravating!
>
> I realize garbage in equals garbage out and some of the chosen
> components are pretty low-end, but I did spend close to six months
> researching the components, and couldn't find substantial evidence to
> dissuade me from any of the choices.  The only parts not new are the
> 250G Seagates, which are basically left-over parts from an old server
> that was upgraded.  They're all known-good as that server gave me no
> trouble through its service life.

I know someone mentioned this earlier in the thread, but before you
spend a lot of time looking at power supplies, drives, etc., you
might want to consider installing any motherboard BIOS updates that
the vendor has released. It's quick, cheap, and easy, and the symptoms
fit.

I had basically identical symptoms on a cluster of storage systems I
built, about a year ago. It was terrible--machines kept crashing with
no explanation, under load, at random times. Similar to your
situation, we custom-built our own machines from identical boards,
CPUs, etc.

The problem turned out to be a combination of the CPU and the
motherboard. Our procs were the newest CPU stepping in that particular
product line (AMD Opteron 4xxx, I think), and the board's original
BIOS wasn't 100% compatible with the new stepping. After we'd updated
the BIOS, the problems disappeared and the system was basically
rock-solid.

I was pretty surprised by the whole thing: I was skeptical about the
BIOS update, because I imagined that an incompatible CPU wouldn't even
boot. But the bug was more subtle than that, and I learned something
new.
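
If it helps, here is a rough sketch for checking which BIOS revision
and CPU stepping a box is actually running, so you can compare them
against the vendor's release notes. It assumes an x86 Linux system
that exposes DMI data under /sys/class/dmi/id and a "stepping" field
in /proc/cpuinfo, and a Python 3 interpreter. The function names
(read_dmi, parse_cpuinfo) are made up for illustration, and this is
not what we ran at the time.

```python
#!/usr/bin/env python3
"""Print the BIOS revision and CPU stepping(s) of a Linux machine."""
from pathlib import Path

# Assumption: the kernel exposes DMI data here (the usual sysfs location).
DMI_DIR = Path("/sys/class/dmi/id")
DMI_FIELDS = ("board_vendor", "board_name",
              "bios_vendor", "bios_version", "bios_date")


def read_dmi(dmi_dir=DMI_DIR):
    """Return the DMI fields that are readable; missing ones are skipped."""
    info = {}
    for field in DMI_FIELDS:
        try:
            info[field] = (dmi_dir / field).read_text().strip()
        except OSError:
            pass  # field absent or not readable on this machine
    return info


def parse_cpuinfo(text):
    """Return sorted unique (model name, family, model, stepping) tuples."""
    seen = set()
    # /proc/cpuinfo separates logical CPUs with a blank line.
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "stepping" in fields:
            seen.add((fields.get("model name", "?"),
                      fields.get("cpu family", "?"),
                      fields.get("model", "?"),
                      fields["stepping"]))
    return sorted(seen)


if __name__ == "__main__":
    for key, value in read_dmi().items():
        print(f"{key}: {value}")
    try:
        cpus = parse_cpuinfo(Path("/proc/cpuinfo").read_text())
    except OSError:
        cpus = []
    for name, family, model, stepping in cpus:
        print(f"cpu: {name} (family {family}, model {model}, "
              f"stepping {stepping})")
```

Run it on each machine and check the reported bios_version and
stepping against the CPU support list for your board. If the stepping
is newer than anything the installed BIOS claims to support, that
points toward the update.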

Whatever happens, good luck, and I hope you find the problem quickly.

-Ryan

