[CentOS] Software RAID1 with CentOS-6.2

Thu Mar 1 20:05:16 UTC 2012
Luke S. Crawford <lsc at prgmr.com>

> A friend of mine has had a couple of strange problems with the RE (RAID) 
> series of Caviars, which utilize the same mechanics as the non-RE 
> Blacks.  For software RAID, I would recommend that you stick with the 
> non-RE versions because of differences in the firmware.

I would recommend the opposite.  If you use a RAID at all, the 'raid 
edition' or 'enterprise' or whatever drives give you significantly
better uptime.  (Now, is that worth the additional cost?  That depends
on your application.   They are rather more expensive, and for some
applications, a few hours of downtime every few years isn't a big deal,
in which case, by all means use the WD Black drives; you save a bunch 
of money, and they are about as fast.  I use consumer drives in my own
desktops, because nobody but me cares if the thing is working, and I only
care when I'm physically nearby and can fix the damn thing if it's broken.)

I have tried /very/ hard to get some sort of consumer-grade drive working
in a RAID, hardware or software.  I've tried on the order of 50 drives, 
many of them WD Black, and many of those with the WDTLER.exe hack set.  
I still have a few of these in production, but whenever I get half a 
chance, I swap out the consumer-grade drives for enterprise or raid-edition 
drives, even at today's ridiculous drive prices.   I RMA the bad
consumer drives and have been giving them away to friends and people
that have done me favors.  (If anyone wants to arrange a 'many consumer 
drives for few enterprise drives' swap, let me know.) 

The WD Blacks mostly work, and they perform well.  The problem is that 
they don't fail cleanly:  a dying drive tends to limp along rather than
drop out of the array.

The other problem I notice is drive-to-drive variation.  It's very rare 
to see a system where one RAID edition drive is significantly slower than 
the others of the same model in the same system.  It's fairly common to 
see this with consumer drives.
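
One way to spot that kind of outlier is to compare each drive's average
I/O latency against its peers.  The following is only a rough sketch: the
function name, the 3x threshold and the sample numbers are all made up for
illustration, and it assumes you already collect per-drive latency (for
example the await column from iostat) some other way.

```python
from statistics import median

def slow_drives(await_ms, factor=3.0):
    """Return the drives whose latency is far above that of their peers.

    await_ms maps a drive name to its average I/O latency in milliseconds.
    factor is an arbitrary threshold (an assumption, tune it for your fleet).
    """
    if len(await_ms) < 2:
        return []  # nothing to compare a single drive against
    flagged = []
    for name, latency in await_ms.items():
        # Compare against the median of the *other* drives so that one
        # very slow drive cannot drag its own baseline upward.
        peers = [v for k, v in await_ms.items() if k != name]
        baseline = median(peers)
        if baseline > 0 and latency > factor * baseline:
            flagged.append(name)
    return sorted(flagged)

# Hypothetical numbers for a 4 disk array with one drive going bad.
sample = {"sda": 8.1, "sdb": 7.6, "sdc": 95.0, "sdd": 8.4}
print(slow_drives(sample))
```

With the sample numbers above only sdc is flagged; two healthy drives 
with slightly different latency are left alone.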

I mean, everything should be RAID.  Soft/hard/whatever:  spinning
rust without redundancy is just a bad idea, unless you are using it as
'really slow RAM' kind of scratch space, and even then, redundancy is
a good idea unless you have auto provisioning down so well that rebuilding
a box is no additional work.  My auto provisioning is not that good yet,
so I RAID even data I don't care about, because I don't want to have
to bother rebuilding the system when the drive fails.  So when a drive 
has problems, I want it to fail and let the RAID system handle it.  

The problem with consumer grade drives (and I've seen this with WD Black
consumer drives, both with WDTLER.EXE and without) is that often, before
they fail, they get really, really slow, sometimes to the point of 
completely or nearly completely hanging the RAID.

To be clear, similar things occasionally happen with 'enterprise' or 'raid 
edition' drives too, but it's very rare and usually not as bad.  I can 
count the number of times this has happened to me ever without taking off
my shoes.  Just yesterday I had a WD RE4 500GB drive (a fairly nice 
'enterprise' drive;  most of my current drives are similar) fail in 
such a way that my 4 disk RAID10 dropped from its normal 100+MB/sec 
sequential throughput down below 30MB/sec, and latency went through the 
roof.   The box was nigh unusable until I failed that drive out.   But 
it's very rare that with 'enterprise' drives the whole thing completely 
freezes up, and this way, at least I could log in, figure out what was
up and fail the bad drive.   
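
For Linux software RAID, failing a drive out by hand is normally done
with mdadm.  As a minimal sketch, assuming an md array at /dev/md0 with a
bad member /dev/sdc1 (both device names are hypothetical, as are the
function names), the two steps look like this.  It defaults to a dry run
that only prints the commands.

```python
import subprocess

def fail_out_commands(array, member):
    """Build the mdadm calls that mark a member faulty, then remove it."""
    return [
        ["mdadm", "--manage", array, "--fail", member],
        ["mdadm", "--manage", array, "--remove", member],
    ]

def fail_out(array, member, dry_run=True):
    """Print the commands, or run them as root when dry_run is False."""
    for argv in fail_out_commands(array, member):
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True raises CalledProcessError if mdadm reports failure
            subprocess.run(argv, check=True)

# Hypothetical device names; substitute the ones from /proc/mdstat.
fail_out("/dev/md0", "/dev/sdc1")
```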

With consumer grade drives (including the WD Blacks with WDTLER) it's
pretty common for the whole RAID to completely freeze.

It isn't a problem with software RAID alone;  I've seen it with 3ware and
LSI RAID cards, too.  (In one test with a drive that was known to be bad
but wasn't reporting it, the 3ware didn't freeze up quite as hard.  It
kept 'retrying' the bad disk, then you had a second or two of access to
the RAID, then it went back to 'retrying' and the RAID was frozen for a
few seconds, etc...)

I mean, my total fleet is under a thousand drives, and I don't have
a good inventory system tracking errors.  But even though I only have a
few consumer-grade drives left in production, it's still fairly common
for a bad consumer drive to hang up an old server and set off my pager
in the middle of the night, 'cause I/O has hung on one of the servers.
When a raid 
edition/enterprise drive fails, I get an email, but I can deal with
that in the morning.  The RAID continues chugging along as long as
I have enough good drives left.
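
The 'email in the morning' case is the easy one to check for on Linux
software RAID, because a cleanly failed member shows up as an underscore
in /proc/mdstat.  A minimal sketch of such a check follows; it is not
what I actually run, the function name is made up, and the sample text is
an illustrative mdstat excerpt rather than real output.

```python
import re

ARRAY_RE = re.compile(r"^(md\S+) : ")
# Matches the member summary, e.g. "[4/3] [UU_U]"
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")

def degraded_arrays(mdstat_text):
    """Map each degraded md array name to its member status string."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        m = ARRAY_RE.match(line)
        if m:
            current = m.group(1)  # remember which array the next lines describe
            continue
        s = STATUS_RE.search(line)
        if s and current and "_" in s.group(3):
            degraded[current] = s.group(3)  # "_" marks a missing member
    return degraded

# Illustrative sample, not real output.
SAMPLE = """Personalities : [raid1] [raid10]
md0 : active raid10 sdd1[3] sdc1[2](F) sdb1[1] sda1[0]
      976770048 blocks 64K chunks 2 near-copies [4/3] [UU_U]

md1 : active raid1 sdb2[1] sda2[0]
      104320 blocks [2/2] [UU]

unused devices: <none>
"""
print(degraded_arrays(SAMPLE))
```

Note that this only catches drives that fail properly.  A drive that 
just gets slow, or hangs the array, never shows up here.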


-- 
Luke S. Crawford
http://prgmr.com/xen/         -   Hosting for the technically adept
http://nostarch.com/xen.htm   -   We don't assume you are stupid.