[CentOS] storage servers crashing, hair being pulled out!

Sun Dec 20 03:55:27 UTC 2009

I have a trio of servers that like to reboot during heavy disk /
network IO.  They don't appear to panic: I have kernel.panic = 0 in
sysctl.conf, so a panic should leave the box hung rather than
rebooting it.  The syslog just shows normal messages, like samba
complaining about browse master, and then syslogd starting up.
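
To spell out the kernel.panic reasoning, here is a minimal Python
sketch of how that setting reads.  It is an illustration only, not
something taken from these servers: the function names parse_sysctl
and describe_panic_setting are hypothetical, and the sample text
stands in for a real sysctl.conf.

```python
def parse_sysctl(text):
    """Parse sysctl.conf-style text into a dict of key -> value strings."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments ('#' or ';' at line start).
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key = value line
        settings[key.strip()] = value.strip()
    return settings


def describe_panic_setting(settings):
    """Describe what kernel.panic implies if the kernel panics."""
    raw = settings.get("kernel.panic")
    if raw is None:
        return "kernel.panic is not set in this file"
    timeout = int(raw)
    if timeout == 0:
        return "a panic halts the machine (no automatic reboot)"
    if timeout > 0:
        return "a panic reboots the machine after %d seconds" % timeout
    # Negative values are deliberately left out of this sketch.
    return "negative value: not covered by this sketch"


# Hypothetical sample, standing in for /etc/sysctl.conf.
sample = "# crash handling\nkernel.panic = 0\n"
print(describe_panic_setting(parse_sysctl(sample)))
```

With the value at 0, a box that reboots by itself most likely did not
go through a normal kernel panic, which is why I suspect the hardware.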

The machines seem to crash when I'm not near the console, usually when
I'm pulling data off them to another machine running backups.  They've
also crashed while copying data to other servers (via iSCSI), and
while on the receiving end of data via NFS.

Two of the servers are linked using drbd and heartbeat, the third is
stand alone.

CentOS 5.4 x86-64 is the flavor of Linux on all of them, pretty much
vanilla except for the drbd/iscsi stuff.

I want to go after the motherboard manufacturer, since I'm more
willing to suspect three mobos in a bad lot than three CPUs,
especially since one CPU is completely different from the other two.

The other variable is that the two machines running drbd have Promise
raid cards in them.  I also have the same raid card in my personal
server at home.  That server also has a knack for crashing during
heavy disk IO to the raid volume.  The entire OS doesn't crash, just
the raid volume, and the only way to bring it back is a reboot.

I'm really at a loss on what to do next... Any suggestions?

Gordon

The hardware config of the drbd servers:

Tyan i3210 ICH9 mobo
Intel C2D 7500 cpu
4GB A-Data ram
Promise ex8650 raid
Supermicro 742TQ-865 chassis (865W PSU)
8x 1TB Western Digital Green Power drives

The third machine:

Tyan i3210 ICH9 mobo
Intel C2Q 9400 cpu
8GB Mushkin ram
dmraid 5
Antec something or other chassis
550W PC Power and Cooling PSU
7x 250GB Seagate 7200's

Gordon McLellan