[CentOS] Hardware vs Kernel RAID (was Re: External SATA enclosures: SiI3124 and CentOS 5?)

Tue Jun 2 03:33:53 UTC 2009
nate <centos at linuxpowered.net>

Michael A. Peters wrote:

> I'd be very interested in hearing opinions on this subject.


I mainly like hardware RAID (good hardware RAID, not hybrid
software/hardware RAID) because of the simplicity: the system
can easily boot from it, in many cases drives are hot swappable,
and you don't have to touch the software or driver; you just
yank the disk and put in a new one.
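
For comparison, here's roughly what the same disk swap looks
like with Linux kernel (md) software RAID. This is just a
sketch, and the md device and partition names are made up:

  # mark the failed member faulty and pull it from the array
  mdadm --manage /dev/md0 --fail /dev/sdb1
  mdadm --manage /dev/md0 --remove /dev/sdb1
  # after physically swapping and partitioning the new disk
  mdadm --manage /dev/md0 --add /dev/sdb1
  # watch the rebuild
  cat /proc/mdstat

None of it is hard, but it's several manual steps (and a chance
to fat-finger a device name) where good hardware RAID needs none.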

In the roughly 600 server-class systems I've been exposed to
over the years I have seen only one or two bad RAID cards. One
of them I specifically remember was caught during a burn-in
test, so it never went live; I think the other went bad after
several years of service. While problems certainly can happen,
the RAID card doesn't seem to be an issue provided you're using
a good one. The one that was "DOA" was a 3Ware 8006-2, and the
other was in an HP, I believe a DL360 G1.

The craziest thing I've experienced on a RAID array was on
some cheap shit LSI Logic storage systems where a single
disk failure somehow crippled its storage controllers (both
of them), knocking the entire array offline for an extended
period of time. I think the drive spat out a bunch of errors
on the fibre bus, causing the controllers to flip out. The
system eventually recovered on its own. I have been told
similar stories about other LSI Logic systems (several big
companies OEM them), though I'm sure the problem isn't
limited to them; it's an architectural problem rather than
an implementation issue.

The only time in my experience where we actually lost data
(that I'm aware of) due to a storage/RAID/controller issue
was back in 2004 with an EMC CLARiiON CX600, where a
misconfiguration by the storage admin caused a catastrophic
failure of the backup controller when the primary controller
crashed. We spent a good 60 hours of downtime the following
week rebuilding corrupt portions of the database as we came
across them, and more than a year later we were still
occasionally finding corruption from that incident.
Fortunately the data on the volumes that suffered corruption
was quite old and rarely accessed. Ideally the array should
have made the configuration error obvious, or better yet
prevented the error from occurring in the first place. Those
old-style enterprise arrays were overly complicated (and yes,
that CX600 ran embedded Windows NT as its OS!)

For servers, I like 3Ware for SATA and HP for SAS. Though
these days the only thing that sits on internal storage
is the operating system. All important data is on
enterprise-grade storage systems, which for me means
3PAR (not to be confused with 3Ware). They get upwards of
double the usable capacity of any other system I've seen
while still being dead easy to use, and they're the fastest
arrays in the world (priced pretty well too). The drives have
point-to-point switched connections rather than sitting on a
shared bus. Our array can recover from a failed 750GB SATA
drive in (worst case) roughly 3.5 hours with no performance
impact to the system. Our previous array would take more
than 24 hours to rebuild a 400GB SATA drive, with a major
performance hit to the array. I could go on all day about
why their arrays are so great!
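
Back of the envelope, assuming a roughly constant rebuild rate:

  750GB / 3.5 hours ~= 214GB/hour ~= 60MB/sec
  400GB / 24 hours  ~=  17GB/hour ~=  5MB/sec

so call it a 13x difference in effective rebuild throughput.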

My current company has mostly Dell servers, and so far I
don't have many good things to say about their controllers
or drives. The drives themselves are "OK", though Dell
doesn't do a good enough job on QA with them; we had to
manually flash the firmware on dozens of drives because of
performance problems, and the only way to flash disk
firmware is to boot to DOS, unlike flashing the BIOS or
controller firmware. I believe the Dell SAS/SATA controllers
are LSI Logic, and I have seen several kernel panics that
seem to point to the storage subsystem on the Dell systems.
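
(For what it's worth, you can at least read a drive's current
firmware revision from a running system with smartmontools;
the device name here is just an example, and drives hidden
behind a RAID controller need the appropriate -d option:

  smartctl -i /dev/sda | grep -i firmware

Actually flashing the firmware still means booting to DOS.)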

HP is coming out with its G6 servers tomorrow, and the new
SmartArray controllers sound pretty nice, though I have
had a couple of incidents with older HP arrays where a
failing drive caused massive performance problems on the
array and we weren't able to force-fail the drive remotely;
we had to send someone on site to yank it out. No data
loss, though. Funny that the controller detected the drive
was failing but didn't give us the ability to take it
offline. Support said it was fixed in a newer firmware
version, which of course required downtime to install.

nate