[CentOS] 40TB File System Recommendations

Fri Apr 15 12:51:50 UTC 2011
Peter Kjellström <cap at nsc.liu.se>

On Thursday, April 14, 2011 05:26:41 PM Ross Walker wrote:
> 2011/4/14 Peter Kjellström <cap at nsc.liu.se>:
> > While I do concede the obvious point regarding rebuild time (raid6 takes
> > from long to very long to rebuild) I'd like to point out:
> > 
> >  * If you do the math for a 12 drive raid10 vs raid6 then (using actual
> > data from ~500 1T drives on HP cciss controllers during two years)
> > raid10 is ~3x more likely to cause hard data loss than raid6.
> > 
> >  * mtbf is not everything there's also the thing called unrecoverable
> > read errors. If you hit one while rebuilding your raid10 you're toast
> > while in the raid6 case you'll use your 2nd parity and continue the
> > rebuild.
> You mean if the other side of the mirror fails while rebuilding it.

No, a drive (unrecoverably) failing to read a sector is not the same thing 
as a drive failure. Drive failure frequency expressed as mtbf is around 1M 
hours (though if you include predictive failures we see more like 250K 
hours). The unrecoverable read error rate was until quite recently on the 
order of one error per 1x to 10x of the drive size read (a drive I looked up 
just now was spec'ed a lot higher, at ~1000x drive size). If we assume a 
raid10 rebuild time of 12h and an unrecoverable read error once every 10x of 
drive size read, then the effective mean time between read errors is 120h 
(roughly two to eight thousand times worse than the drive mtbf). Admittedly 
these numbers are hard to get and equally hard to trust (or double check).
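
A quick sketch of that arithmetic (drive size, rebuild time and URE interval 
are the assumed numbers from the example, not measurements):

```python
# Effective mean time between unrecoverable read errors (UREs) during a
# raid10 rebuild, using the example's assumed numbers.
drive_size_tb = 1.0       # 1 TB drives (assumption)
ure_interval_tb = 10.0    # one URE per 10x drive size read (assumption)
rebuild_hours = 12.0      # raid10 rebuild time (assumption)

# A rebuild reads the surviving mirror end-to-end: drive_size_tb of data
# in rebuild_hours, so the read rate is drive_size_tb / rebuild_hours.
read_rate_tb_per_h = drive_size_tb / rebuild_hours
hours_per_ure = ure_interval_tb / read_rate_tb_per_h
print(hours_per_ure)                 # 120.0 hours between UREs

# Compare with a drive mtbf of ~1M hours (~250K incl. predictive fails):
print(1_000_000 / hours_per_ure)     # ~8300x worse than mtbf
print(250_000 / hours_per_ure)       # ~2100x worse
```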

What it all comes down to is that raid10 (assuming double, not triple, 
copies) stores your data with one extra copy, and in a single drive failure 
scenario you have zero redundancy left on that part of the array. That is, 
you depend on each and every bit of that (meaning the degraded part) data 
being read correctly. This means you very much want both:

 1) Very fast rebuilds (=> you need hot-spare)
 2) An unrecoverable read error interval much larger than your drive size

or as you suggest below:

 3) Triple copy
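
To put point 2 in perspective, a hypothetical sketch (the URE rates used 
here are typical datasheet-style figures, not numbers from this thread):

```python
import math

# Probability that a degraded raid10 rebuild completes, i.e. every bit
# of the surviving copy is read without an unrecoverable read error.
# Illustrative assumption: UREs are independent, one per ure_bits read.
def p_rebuild_ok(drive_bytes, ure_bits):
    bits_read = drive_bytes * 8
    # P(no URE over bits_read) ~= exp(-bits_read / ure_bits)
    return math.exp(-bits_read / ure_bits)

one_tb = 10**12
print(p_rebuild_ok(one_tb, 1e14))  # consumer-class 1-in-1e14: ~0.92
print(p_rebuild_ok(one_tb, 1e15))  # enterprise-class 1-in-1e15: ~0.99
```

So even a single 1 TB rebuild against a consumer-class URE spec carries a 
non-trivial chance of hitting an unreadable sector.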

> Yes this is true, of course if this happens with RAID6 it will rebuild
> from parity IF there is a second hotspare available,

This is wrong: hot-spares are not that necessary when using raid6. This has 
to do with the fact that rebuild times (the time from when you start being 
vulnerable to when the rebuild completes) are already long. An added 12h for 
a tech to swap in the spare only marginally increases your risk.

> cause remember
> the first failure wasn't cleared before the second failure occurred.
> Now your RAID6 is in severe degraded state, one more failure before
> either of these disks is rebuilt will mean toast for the array.

All of this was taken into account in my original example above. In the end 
(with my data) raid10 was around 3x more likely to cause ultimate data loss 
than raid6.

> Now
> the performance of the array is practically unusable and the load on
> the disks is high as it does a full recalculation rebuild, and if they
> are large it will be high for a very long time, now if any other disk
> in the very large RAID6 array is near failure, or has a bad sector,
> this taxing load could very well push it over the edge

In my example a 12 drive raid6 rebuild takes 6-7 days, which works out to 
< 5 MB/s sequential read per drive. This added load is not very noticeable 
in our environment (taking into account normal patrol reads and user data 
traffic).
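
For reference, that per-drive rate works out as follows (1 TB drives and a 
~6.5 day rebuild are assumptions taken from the example):

```python
# Per-drive sequential read rate during a 12-drive raid6 rebuild that
# takes 6-7 days, with 1 TB drives (the example's assumptions). Each
# surviving drive is read once end-to-end during the rebuild.
drive_bytes = 10**12                  # 1 TB
rebuild_seconds = 6.5 * 24 * 3600     # ~6.5 days
mb_per_s = drive_bytes / rebuild_seconds / 1e6
print(round(mb_per_s, 2))             # ~1.78 MB/s, well under 5 MB/s
```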

Either way, the general problem of "[rebuild stress] pushing drives over the 
edge" is a larger threat to raid10 than to raid6 (a second hit being fatal 
in the raid10 case, while raid6 can still fall back on its second parity).

> and the risk of
> such an event occurring increases with the size of the array and the
> size of the disk surface.
> I think this is where the mdraid raid10 shines because it can have 3
> copies (or more) of the data instead of just two,

I think we've now moved into what most people would call unreasonable. Let's 
see what we have for a 12 drive box (quite common 2U size):

 raid6: 12x on raid6, no hot-spare (see argument above) => 10 data drives
 raid10: 11x triple store on raid10, one spare => 3.66 data drives

or (if your raid's not odd-drive capable):

 raid10: 9x triple store on raid10, one to three spares => 3 data drives

(ok, yes you could get 4 data drives out of it if you skipped hot-spare)

That is almost a 2.7x-3.3x difference! My users sure care if their X $ buys 
only 1/3 the space (or, if you prefer, 3x the cost for the same space).
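
The capacity arithmetic in that comparison can be sketched as (layouts and 
spare counts are the ones from the example above):

```python
# Usable data drives for the 12-slot (2U) configurations compared above.
def raid6_data(drives, parity=2, spares=0):
    # raid6: two drives' worth of parity, optional hot-spares
    return drives - spares - parity

def raid10_data(drives, copies, spares):
    # raid10: each byte stored `copies` times across the non-spare drives
    return (drives - spares) / copies

print(raid6_data(12))                        # 10 data drives, no spare
print(raid10_data(12, copies=3, spares=1))   # 11/3 ~= 3.66
print(raid10_data(12, copies=3, spares=3))   # 9/3 = 3 (even-drive raid10)

print(raid6_data(12) / raid10_data(12, copies=3, spares=1))  # ~2.7x
print(raid6_data(12) / raid10_data(12, copies=3, spares=3))  # ~3.3x
```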

On top of this, most raid implementations of raid10 lack triple copy.

Also note that a raid10 that allows an odd number of drives is more 
vulnerable to 2nd drive failures, making the advantage of raid6 even larger 
than 3x (vs. double-copy, odd-drive-handling raid10).


> of course a three
> times (or more) the cost. It also allows for uneven number of disks as
> it just saves copies on different spindles rather then "mirrors". This
> I think provides the best protection against failure and the best
> performance, but at the worst cost, but with 2TB and 4TB disks coming
> out