Michael A. Peters wrote:
> I'd be very interested in hearing opinions on this subject.

I mainly like hardware RAID (good hardware RAID, not hybrid software/hardware RAID) because of the simplicity: the system can easily boot from it, in many cases the drives are hot swappable, and you don't have to touch the software or driver, you just yank the disk and put in a new one.

In the roughly 600 server-class systems I've been exposed to over the years I have seen only one or two bad RAID cards. One of them I specifically remember was caught during a burn-in test, so it never went live; I think the other went bad after several years of service. While problems certainly can happen, the RAID card doesn't seem to be an issue provided you're using a good one. The one I recall being "DOA" was a 3Ware 8006-2, and the other was an HP, I believe in a DL360 G1.

The craziest thing I've experienced on a RAID array was on some cheap shit LSI Logic storage systems where a single disk failure somehow crippled both of its storage controllers, knocking the entire array offline for an extended period of time. I think the drive spat out a bunch of errors on the fibre bus, causing the controllers to flip out. The system eventually recovered on its own. I have been told similar stories about other LSI Logic systems (several big companies OEM them), though I'm sure the problem isn't limited to them; it's an architectural problem rather than an implementation issue.

The only time in my experience where we actually lost data (that I'm aware of) due to a storage/RAID/controller issue was back in 2004 with an EMC CLARiiON CX600, where a misconfiguration by the storage admin caused a catastrophic failure of the backup controller when the primary controller crashed. We spent a good 60 hours of downtime the following week rebuilding corrupt portions of the database as we came across them, and more than a year later we were still occasionally finding corruption from that incident. Fortunately the data on the volumes that suffered corruption was quite old and rarely accessed. Ideally the array should have made the configuration error obvious, or better yet prevented it from occurring in the first place. Those old-style enterprise arrays were overly complicated (and yes, that CX600 ran embedded Windows NT as its OS!).

For servers, I like 3Ware for SATA and HP for SAS, though these days the only thing that sits on internal storage is the operating system. All important data is on enterprise-grade storage systems, which for me means 3PAR (not to be confused with 3Ware). They get upwards of double the usable capacity of anything else out there while still being dead easy to use, they're some of the fastest arrays in the world (priced pretty well too), and the drives have point-to-point switched connections instead of sitting on a shared bus. Our array can recover from a failed 750GB SATA drive in roughly 3.5 hours worst case with no performance impact to the system; our previous array would take more than 24 hours to rebuild a 400GB SATA drive, with a major performance hit to the array (rough throughput numbers below). I could go on all day about why their arrays are so great!

My current company has mostly Dell servers, and so far I don't have many good things to say about their controllers or drives. The drives themselves are "OK", though Dell doesn't do a good enough job on QA with them; we had to manually flash dozens of drive firmwares because of performance problems, and the only way to flash the disk firmware is to boot to DOS, unlike flashing the BIOS or controller firmware.
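To put some rough numbers on those rebuild times (back-of-envelope only; the 750GB/3.5 hour and 400GB/24 hour figures are the ones quoted above, the rest is just arithmetic on raw capacity, ignoring sparing overhead and the fact that only allocated space may need rebuilding):

    # Effective rebuild throughput implied by the figures above.
    def rebuild_rate_mb_per_s(capacity_gb, hours):
        """Raw capacity divided by rebuild time, in MB/s."""
        return capacity_gb * 1000.0 / (hours * 3600)

    print("750GB in  3.5h: %.1f MB/s" % rebuild_rate_mb_per_s(750, 3.5))  # ~59.5
    print("400GB in 24.0h: %.1f MB/s" % rebuild_rate_mb_per_s(400, 24))   # ~4.6

That works out to roughly 60 MB/s versus under 5 MB/s, call it a 13x difference in effective rebuild rate, before you even count the performance hit during the rebuild.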
As for the Dell controllers themselves, I believe they are LSI Logic, and I have seen several kernel panics on the Dell systems that seem to point to the storage controller.

HP is coming out with their G6 servers tomorrow, and the new Smart Array controllers sound pretty nice, though I have had a couple of incidents with older HP arrays where a failing drive caused massive performance problems on the array and we weren't able to force-fail the drive remotely; we had to send someone on site to yank it out. No data loss, though. Funny that the controller detected the drive was failing but didn't give us the ability to take it offline. Support said it was fixed in a newer version of the firmware, which of course required downtime to install.

nate