[CentOS] suggestions for large filesystem server setup (n * 100 TB)

On 02/28/2014 06:30 AM, Mauricio Tavares wrote:
> On Fri, Feb 28, 2014 at 8:55 AM, Phelps, Matt <mphelps at cfa.harvard.edu> wrote:
>> I'd highly recommend getting a NetApp storage device for something that big.
>>
>> It's more expensive up front, but the amount of heartache/time saved in the
>> long run is WELL worth it.
>>
>        My vote would be for a ZFS-based storage solution, be it
> homegrown or appliance (like Nexenta). Remember, as far as ZFS (and
> similar filesystems whose acronyms are more than 3 letters) is
> concerned, a petabyte is still small fry.

Ditto on ZFS! I've been experimenting with it for about 5-6 months and 
it really is the way to go for any filesystem greater than about 10 GB 
IMHO. We're in the process of transitioning several of our big data 
pools to ZFS because it's so obviously better.

Just remember that ZFS isn't casual! You have to take the time to
understand what it is and how it works, because if you make the wrong
mistake, it's curtains for your data. ZFS has a few maddening
limitations** that you have to plan for. But it is far and away the
leader among large-scale copy-on-write file systems, and once you know
how to plan for it, its capabilities are jaw-dropping. Here are a few
off the top of my head:

1) Check for and fix filesystem errors without ever taking it offline.
2) Replace failed HDDs in a raidz pool without ever taking it offline.
3) Works best with inexpensive JBOD drives - it's actually recommended
not to put ZFS on top of expensive HW RAID devices.
4) Native, built-in compression: compressible data can take roughly half
the disk space, for free.
5) Extend (grow) your zfs pool without ever taking it offline.
6) Create a snapshot in seconds that you can keep or expire at any time.
(Snapshots are read-only and take no disk space initially.)
7) Send a snapshot (an entire filesystem) to another server. Binary-perfect
copies in a single command, much faster than rsync when you have a large
data set.
8) Make a clone - a writable copy of a snapshot - in seconds. You can
take snapshots of a clone, too. A clone initially uses no disk space;
as you use it, it only consumes space for the changes between its
current state and the snapshot it's derived from.
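
For anyone who wants to see what several of the items above look like on
the command line, here is a rough Python sketch that only builds the
commands and prints them (it never runs them). The pool, dataset,
snapshot, disk and host names are all made up for illustration, and
plain "compression=on" is just one possible setting.

```python
import shlex

def zfs_feature_commands(pool, dataset, snap, clone, old_disk, new_disk):
    """Build argv lists for a few of the features listed above (dry run only)."""
    fs = f"{pool}/{dataset}"
    snapshot = f"{fs}@{snap}"
    return [
        ["zpool", "scrub", pool],                         # 1) online check and repair
        ["zpool", "replace", pool, old_disk, new_disk],   # 2) swap out a failed drive
        ["zfs", "set", "compression=on", fs],             # 4) turn on compression
        ["zfs", "snapshot", snapshot],                    # 6) create a snapshot
        ["zfs", "clone", snapshot, f"{pool}/{clone}"],    # 8) writable clone of it
    ]

def send_pipeline(snapshot, remote_host, remote_fs):
    """Build the shell pipeline for item 7: stream a snapshot to another server."""
    return (f"zfs send {shlex.quote(snapshot)} | "
            f"ssh {shlex.quote(remote_host)} zfs receive {shlex.quote(remote_fs)}")

if __name__ == "__main__":
    # All names below are hypothetical examples.
    for argv in zfs_feature_commands("tank", "data", "monday",
                                     "data-test", "sdc", "sdf"):
        print(shlex.join(argv))
    print(send_pipeline("tank/data@monday", "backuphost", "backup/data"))
```

Printing instead of executing is deliberate: you can read the commands
over before pasting any of them into a root shell.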

** Limitations? ZFS? Say it isn't so! But here they are:

1) You can't add redundancy after creating a vdev in a zfs pool. So if
you make a ZFS vdev and don't make it raidz at the start, you can't add
more drives later to get raidz. You also can't "add" redundancy to an
existing raidz vdev. Once you've made it raidz1, you can't add a
drive to get raidz2. I've found a workaround, where you create a "fake"
drive from a sparse file, include the fake drive(s) in your raidz vdev
upon creation, and immediately take them offline. But you have to do
this on initial creation!
http://jeanbruenn.info/2011/01/18/setting-up-zfs-with-3-out-of-4-discs-raidz/
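
A rough sketch of the sparse-file trick, in Python. It creates the
placeholder file for real but only builds the zpool commands without
running them. The pool name, disk names, raidz level and the 256 MiB demo
size are all assumptions for illustration; as far as I understand it, in
real use the fake file should be as large as your real drives.

```python
import os
import tempfile

def make_sparse_file(path, size_bytes):
    """Create a file with the given apparent size; on most filesystems
    truncate() allocates no data blocks, so the file is sparse."""
    with open(path, "wb") as f:
        f.truncate(size_bytes)
    return os.path.getsize(path)

def degraded_raidz_commands(pool, level, real_disks, fake_path):
    """Build (not run) the commands: create the vdev with the fake member,
    then take the fake member offline so nothing gets written to it."""
    return [
        ["zpool", "create", pool, level, *real_disks, fake_path],
        ["zpool", "offline", pool, fake_path],
    ]

# Demo with hypothetical names; file vdevs need an absolute path.
demo_dir = tempfile.mkdtemp()
fake = os.path.join(demo_dir, "fake0.img")
size = make_sparse_file(fake, 256 * 1024**2)
print(f"{fake}: apparent size {size} bytes")
for argv in degraded_raidz_commands("tank", "raidz2", ["sda", "sdb", "sdc"], fake):
    print(" ".join(argv))
```

The pool runs degraded until a real drive replaces the fake one, so you
are short on redundancy for as long as you leave it that way.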

2) Zpools are made up of vdevs, which you can think of as block
devices built from 1 or more HDs. You can add vdevs to a pool without
issue, but you can't remove them. EVER. Combine this fact with #1 and
you had better plan carefully before you extend a file system. See the
"Hating your data" section in this excellent ZFS walkthrough:
http://arstechnica.com/information-technology/2014/02/ars-walkthrough-using-the-zfs-next-gen-filesystem-on-linux/

3) Like any COW file system, ZFS tends to fragment. This cuts into
performance, especially when you have less than about 20-30% free space.
This isn't as bad as it sounds: enabling compression can roughly double
your usable space on compressible data.
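
If you want to keep an eye on that free-space margin, a minimal Python
sketch could look like the one below. The 30% and 20% thresholds just
mirror the rough range above, and the function name and status strings
are my own invention, not anything ZFS provides.

```python
import shutil

def free_space_status(total_bytes, free_bytes,
                      warn_below=0.30, critical_below=0.20):
    """Classify a filesystem by its fraction of free space."""
    free_fraction = free_bytes / total_bytes
    if free_fraction < critical_below:
        return "critical"
    if free_fraction < warn_below:
        return "warning"
    return "ok"

# Demo: check whatever is mounted at "/" (point this at your pool's
# mountpoint instead; the path here is only an example).
usage = shutil.disk_usage("/")
print(free_space_status(usage.total, usage.free))
```

Run from cron, something like this can nag you before the pool gets full
enough for fragmentation to hurt.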

Bug) ZFS on Linux has been quite stable in my testing, but as of this
writing it has a memory leak. If you don't work around it, ZFS servers
will eventually lock up. The workaround is fairly simple; google for
"zfs /bin/echo 3 > /proc/sys/vm/drop_caches;"
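
Assuming the workaround amounts to periodically writing 3 to
/proc/sys/vm/drop_caches (which is what the search string suggests), a
Python equivalent might look like this. It needs root to write the real
file, and how often to schedule it is up to you; the function name and
its parameters are mine.

```python
import os

DROP_CACHES = "/proc/sys/vm/drop_caches"

def drop_caches(path=DROP_CACHES, level="3"):
    """Flush dirty pages to disk, then ask the kernel to drop clean
    page cache, dentries and inodes (level 3 means all of them)."""
    os.sync()  # dirty pages can't be dropped, so write them out first
    with open(path, "w") as f:
        f.write(level + "\n")

# As root, e.g. from a cron job:
#     drop_caches()
```

The path is a parameter only so the function can be tried against a
scratch file without root.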