Rudi Ahlers wrote:
But the one piece of the puzzle that I don't understand is whether a self-built Linux NAS device, or even Openfiler / FreeNAS, will give us that kind of uptime.
You say that downtime is not an option, so I can say with absolute confidence that there really is nothing you can build for the budget you're looking at that will provide 100% uptime.
Either set expectations for the budget you have or get a bigger budget to satisfy the requirements.
There are really only a few storage systems in the world whose vendors will put money down on a 100% uptime SLA, and they are all multi-million-dollar systems. Even then, they will just pay you for any downtime caused by the storage; that doesn't mean there won't ever be downtime. At least one vendor, Hitachi, claims it has yet to pay out on that guarantee (at least as of late last year, when I last talked to them).
Depending on your space and performance requirements, you can get a system that's built for 99.999% uptime for about $90-120k in the U.S.
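To put 99.999% in perspective, here is a quick calculation of the downtime budget an availability target leaves you per year. It's a minimal sketch that assumes a 365.25-day year and counts any outage as downtime. Real SLA terms (maintenance windows, partial outages) vary by vendor.

```python
# Downtime budget implied by an availability target.
# Assumes a 365.25-day year; plain arithmetic, not any vendor's SLA formula.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Return the maximum yearly downtime (minutes) allowed by a target such as 99.999."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999, 100.0):
        print(f"{target}% -> {downtime_minutes_per_year(target):.2f} minutes/year")
```

Five nines works out to roughly 5.26 minutes of downtime per year. At 99.9% the budget is about 8.8 hours.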
Even my own new storage system, which as configured lists for about $990k, does not guarantee 100% uptime; the vendor's goal is 99.999%. So far we've had 100% uptime over the past year. We've had two soft failures on the system. First, a Fibre Channel HBA's firmware crashed and dumped, and the system automatically restarted the HBA chip. Second, a system-level software component segfaulted (the system runs on Debian), and the system automatically restarted it. There was no noticeable impact in either case, as everything is connected to at least two active-active controllers.
Providing high-availability storage is not a simple task. Take, for example, something as simple as drive firmware upgrades: our storage system had to undergo drive firmware upgrades this past weekend due to a bug in its Seagate SATA drives which, under very rare conditions, could cause data corruption. The array handled the firmware upgrades itself, upgrading one drive at a time. It took about 16 hours for 200 disks, with zero impact to the system.
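A rolling upgrade like this one comes down to one rule: never have more than one drive out of service at a time. The sketch below is a toy model built on that assumption. `Drive`, `rolling_upgrade`, and `fake_flash` are hypothetical names, not any array's API. A real array also has to resync writes that arrived while each drive was offline, and that step is left out here. The sketch also works out the average per-disk time implied by 16 hours for 200 disks.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Drive:
    name: str
    firmware: str
    online: bool = True


def rolling_upgrade(drives: List[Drive], target_fw: str,
                    flash: Callable[[Drive, str], None]) -> int:
    """Upgrade drives strictly one at a time and return how many were flashed.

    Hypothetical sketch: a real array would also resync any writes that
    landed on the redundancy group while a drive was offline.
    """
    upgraded = 0
    for drive in drives:
        if drive.firmware == target_fw:
            continue  # already on the target firmware, skip it
        # Guard: never take a drive offline while another one is still out.
        assert all(d.online for d in drives), "another drive is still offline"
        drive.online = False
        flash(drive, target_fw)
        drive.online = True
        upgraded += 1
    return upgraded


def fake_flash(drive: Drive, fw: str) -> None:
    """Stand-in for the vendor's flashing step (hypothetical)."""
    drive.firmware = fw


# Average time per disk for the upgrade described above: 16 hours / 200 disks.
AVG_MINUTES_PER_DISK = 16 * 60 / 200  # 4.8 minutes
```

Serializing the upgrades trades wall-clock time (about 4.8 minutes per disk, on average, in this case) for never losing more redundancy than a single drive failure would.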
If you're building a system yourself, in my experience it's highly unlikely that you will ever be alerted to such a problem in the drive firmware, let alone have to go through the process of upgrading the drives. Fortunately, critical drive firmware updates are somewhat rare. I think they will become more common as more systems move to SATA drives, which for the most part are lower quality and less tested.
One guy I met with a couple of years ago had an all-SATA system from another vendor using Western Digital drives. There was a NASTY firmware bug in that system as well, and it continually impacted production. At random times the drives would just flat out stall, and you had to physically remove them from the array and re-insert them to power-cycle them and bring them back up. The array vendor had no way of flashing drives automatically at the time, so he was faced with flashing each and every drive individually in one or more other systems. Eventually the vendor fixed its software to allow automatic firmware updates. That's just another example of the complexities involved with high-availability storage, and that's only at the block storage level.
On some of our Dell servers we had to manually boot DOS from a floppy to flash the firmware on some Seagate SCSI drives. The firmware they shipped with killed performance: the newer firmware was 500% faster for our app.
Then you need to take into account things like MPIO (multipath I/O) and active-active versus active-passive storage controllers. If you get into file-based storage, another layer of availability is bolted on top of that as well, which can further complicate things.
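Here is a toy model of how a multipath layer might pick a path for each I/O. It is meant to make the active-active versus active-passive distinction concrete. It is an illustration only and assumes two hypothetical controller paths named `ctrlA` and `ctrlB`. It is not how dm-multipath or any vendor's MPIO driver is actually implemented.

```python
from itertools import count
from typing import List


class MultipathDevice:
    """Toy model of multipath path selection (not dm-multipath or any vendor MPIO).

    active_active=True  -> spread I/O round-robin over every healthy path.
    active_active=False -> send all I/O down the first healthy path; the
                           others sit idle as standbys until a failure.
    """

    def __init__(self, paths: List[str], active_active: bool) -> None:
        self.paths = list(paths)
        self.healthy = {p: True for p in self.paths}
        self.active_active = active_active
        self._rr = count()  # round-robin counter

    def fail(self, path: str) -> None:
        """Mark a path (e.g. a controller port) as failed."""
        self.healthy[path] = False

    def pick_path(self) -> str:
        """Choose the path for the next I/O."""
        live = [p for p in self.paths if self.healthy[p]]
        if not live:
            raise OSError("all paths down")
        if self.active_active:
            return live[next(self._rr) % len(live)]
        return live[0]
```

In the active-passive case the standby path carries no traffic until the active one fails. In the active-active case, losing a path just shrinks the pool that I/O is spread across.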
Our last NAS vendor is well known in the ultra-high-performance arena, but even with an active-active NAS cluster they could not do a major software upgrade without hard cluster downtime, and failover took upwards of 60 seconds.
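For a sense of scale, a 60-second failover eats a large part of a five-nines budget. This sketch counts how many full failovers fit in one year's downtime allowance. It assumes each failover is a complete outage for its whole duration, that nothing else goes wrong that year, and that the year has 365.25 days.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumes a 365.25-day year


def failovers_within_budget(availability_pct: float, failover_seconds: float) -> int:
    """How many full-outage failovers of the given length fit in a year's downtime budget."""
    budget_seconds = SECONDS_PER_YEAR * (1 - availability_pct / 100)
    return math.floor(budget_seconds / failover_seconds)
```

At 99.999%, only 5 such 60-second failovers fit in a year before the budget is gone.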
Ideally I would like to have a highly redundant storage device which can be used by numerous users and can also host virtual machines. So I/O speed will be the biggest concern, with reliability the second biggest concern.
You say I/O is the biggest concern, yet below you plan to use SATA disks?! That doesn't make sense, unless you plan to have a large number of SATA disks. SATA disks have about 1/2 the I/O capacity of 10k RPM disks and 1/3 the I/O capacity of 15k RPM disks.
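Put another way, matching the random I/O of faster disks with SATA means buying more spindles. This minimal sketch uses only the rough ratios above (SATA at 1/2 of 10k RPM and 1/3 of 15k RPM) in relative units. It says nothing about absolute IOPS, which vary by drive model and workload. The `RELATIVE_IO` table and its keys are assumptions made for illustration.

```python
import math

# Relative random-I/O capacity per disk, normalised so one SATA disk = 1.
# Ratios follow the rule of thumb above; real numbers vary by model and workload.
RELATIVE_IO = {"sata": 1.0, "10k": 2.0, "15k": 3.0}


def sata_disks_to_match(disk_type: str, count: int) -> int:
    """Number of SATA spindles needed to match `count` disks of `disk_type` on random I/O."""
    return math.ceil(count * RELATIVE_IO[disk_type] / RELATIVE_IO["sata"])
```

For example, matching 14 15k RPM disks would take about 42 SATA disks by this rule of thumb.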
The other question is, how well will my own Linux/UNIX-based NAS perform? Surely the companies that build their own NAS devices spend a lot of time fine-tuning the OS to deliver the best performance, and probably spend a lot of time researching and testing different hardware devices and configurations to see what works best?
You sound like you want something that is fast, very highly available, cheap, has lots of space, and is easy to manage. Such a system doesn't really exist (depending on your view of how cheap is cheap). It doesn't exist because it's really complicated to get right.
You're setting yourself up for major disappointment or a massive headache down the road. Pick a subset of your requirements, find a solution that fits it, and set expectations/SLAs to match that solution, whatever it may be.
If I were in your position, I would opt for something that is as simple to manage as possible, limit the services you provide through it, get decent hardware, and set up some sort of replication to a second identical system (I myself would avoid things like DRBD).
At the very least, opt for a good SCSI RAID controller and a shelf of external disks; don't use the internal drive bays on a system. On the low end, HP is a good fit, Infortrend has some pretty good stuff, and so does LSI.
If you want to go a bit higher end, get a Fibre Channel storage system (from the same vendors) with redundant controllers and connect to it via FC.
Make sure the drive models and firmware are certified with the controller/storage system you're getting.
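If you manage this yourself, the check comes down to comparing each drive's model and firmware revision against the vendor's certification list. The sketch below assumes a hypothetical inventory of (slot, model, firmware) tuples and a hypothetical certification matrix. The real list, and the way you read drive firmware revisions, depend on your controller vendor's tools.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical certification matrix: drive model -> firmware revisions the
# controller/storage vendor has certified. The real list comes from the vendor.
Certified = Dict[str, Set[str]]


def uncertified(inventory: List[Tuple[str, str, str]], matrix: Certified) -> List[str]:
    """Return slots whose (model, firmware) pair is not on the certified list.

    Each inventory entry is (slot, model, firmware).
    """
    problems = []
    for slot, model, firmware in inventory:
        # An unknown model counts as uncertified, as does an unlisted firmware.
        if firmware not in matrix.get(model, set()):
            problems.append(slot)
    return problems
```

Treating unknown models as uncertified is a deliberately conservative choice. A drive that isn't on the list at all is exactly the case you want flagged.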
nate