[CentOS] ZFS on Linux in production?

Fri Oct 25 02:01:28 UTC 2013
Lists <lists at benjamindsmith.com>

On 10/24/2013 05:29 PM, Warren Young wrote:
> On 10/24/2013 17:12, Lists wrote:
>> 2) The ability to make the partition  bigger by adding drives with very
>> minimal/no downtime.
> Be careful: you may have been reading some ZFS hype that turns out not
> as rosy in reality.
>
> Ideally, ZFS would work like a Drobo with an infinite number of drive
> bays.  Need to add 1 TB of disk space or so?  Just whack another 1 TB
> disk into the pool, no problem, right?
>
> Doesn't work like that.
>
> You can add another disk to an existing pool, but it doesn't instantly
> make the pool bigger.  You can make it a hot spare, but you can't tell
> ZFS to expand the pool over the new drive.
>
> "But," you say, "didn't I read that...."   Yes, you did.  ZFS *can* do
> what you want, just not in the way you were probably expecting.
>
> The least complicated *safe* way to add 1 TB to a pool is add *two* 1 TB
> disks to the system, create a ZFS mirror out of them, and add *that*
> vdev to the pool.  That gets you 1 TB of redundant space, which is what
> you actually wanted.  Just realize, you now have two separate vdevs
> here, both providing storage space to a single pool.
>
> You could instead turn that new single disk into a non-redundant
> separate vdev and add that to the pool, but then that one disk can take
> down the entire pool if it dies.

We have redundancy at the server/host level, so even if we have a 
fileserver go completely offline,
our application retains availability. We have an API in our application 
stack that negotiates with the (typically 2 or 3) file stores.
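
(For my own notes: if I follow you, adding a second mirrored vdev to an 
existing pool would look roughly like this, with made-up device names:)

# existing pool built from the first pair of disks
$ sudo zpool create tank mirror /dev/sdb /dev/sdc
# later, grow the pool by adding a second mirror as a new top-level vdev
$ sudo zpool add tank mirror /dev/sdd /dev/sde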

> Another problem is that you have now created a system where ZFS has to
> guess which vdev to put a given block of data on.  Your 2-disk mirror of
> newer disks probably runs faster than the old 3+ disk raidz vdev, but
> ZFS isn't going to figure that out on its own.  There are ways to
> "encourage" ZFS to use one vdev over another.  There's even a special
> case mode where you can tell it about an SSD you've added to act purely
> as an intermediary cache, between the spinning disks and the RAM caches.
Performance isn't so much an issue; we'd partition our cluster and 
throw a few more boxes into place if it became a bottleneck.
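
(Though if we ever do care, the SSD cache mode you mention looks like 
it boils down to something along these lines, with a made-up device 
name:)

# add an SSD as a read cache (L2ARC) device for the pool
$ sudo zpool add tank cache /dev/sdf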

> The more expensive way to go -- which is simpler in the end -- is to
> replace each individual disk in the existing pool with a larger one,
> letting ZFS resilver each new disk, one at a time.  Once all disks have
> been replaced, *then* you can grow that whole vdev, and thus the pool.
I'm not sure of the vernacular, but let's say you have 4 drives in a 
RAID 1 configuration: one set of 1 TB drives and another set of 2 TB 
drives.

A1 <-> A2 = 2x 1TB drives, 1 TB redundant storage.
B1 <-> B2 = 2x 2TB drives, 2 TB redundant storage.

We have 3 TB of available storage. Are you suggesting we add a couple of 
4 TB drives:

A1 <-> A2 = 2x 1TB drives, 1 TB redundant storage.
B1 <-> B2 = 2x 2TB drives, 2 TB redundant storage.
C1 <-> C2 = 2x 4TB drives, 4 TB redundant storage.

Then wait until ZFS moves A1/A2 over to C1/C2 before removing A1/A2? If 
so, that's the capability I'm looking for.
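
(If I understand the replace-in-place approach you describe, for the A 
mirror it would go something like this, device names invented:)

# swap each 1 TB disk in the A mirror for a 4 TB disk, one at a time
$ sudo zpool replace tank /dev/sdb /dev/sdf
# wait for the resilver to complete, then replace the second disk
$ sudo zpool replace tank /dev/sdc /dev/sdg
# once both have resilvered, let the vdev expand into the new capacity
$ sudo zpool set autoexpand=on tank
$ sudo zpool online -e tank /dev/sdf /dev/sdg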

> But, XFS and ext4 can do that, too.  ZFS only wins when you want to add
> space by adding vdevs.

The only way I'm aware of ext4 doing this is with resize2fs, which 
extends an existing filesystem to fill a larger block device. The only 
way to do that across multiple disks is to use a virtual block device 
layer like LVM/LVM2, which (as I've stated before) I'm hesitant to do.
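
(For comparison, the LVM route I keep shying away from would be roughly 
this, with invented volume names:)

# fold the new disk into the volume group, then grow the LV and the ext4 fs
$ sudo pvcreate /dev/sdf
$ sudo vgextend datavg /dev/sdf
$ sudo lvextend -L +1T /dev/datavg/datalv
$ sudo resize2fs /dev/datavg/datalv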

>> 3) The ability to remove an older, (smaller) drive or drives in order to
>> replace with larger capacity drives without downtime or having to copy
>> over all the files manually.
> Some RAID controllers will let you do this.  XFS and ext4 have specific
> support for growing an existing filesystem to fill a larger volume.

LVM2 will let you remove a drive without taking the volume offline. Can 
XFS do this without some block device virtualization like LVM2? (I 
didn't think so.)
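
(The LVM2 removal path, as I understand it, is roughly the following, 
again with invented names:)

# migrate extents off the old disk, then drop it from the volume group
$ sudo pvmove /dev/sdb
$ sudo vgreduce datavg /dev/sdb
$ sudo pvremove /dev/sdb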

>> 6) Reasonable failure mode. Things *do* go south sometimes. Simple is
>> better, especially when it's simpler for the (typically highly stressed)
>> administrator.
> I find it simpler to use ZFS to replace a failed disk than any RAID BIOS
> or RAID management tool I've ever used.  ZFS's command line utilities
> are quite simply slick.  It's an under-hyped feature of the filesystem,
> if anything.
>
> A lot of thought clearly went into the command language, so that once
> you learn a few basics, you can usually guess the right command in any
> given situation.  That sort of good design doesn't happen by itself.
>
> All other disk management tools I've used seem to have just accreted
> features until they're a pile of crazy.  The creators of ZFS came along
> late enough in the game that they were able to look at everything and
> say, "No no no, *this* is how you do it."

I sooo hear your music here! What really sucks about filesystem 
management is that the time when you really need to get it right is 
exactly when everything seems the most complex.
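
(For the record, the failed-disk workflow you describe appears to boil 
down to something like this, hypothetical names again:)

# see which disk is faulted
$ sudo zpool status tank
# drop in the replacement
$ sudo zpool replace tank /dev/sdc /dev/sdh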

>> I think ZFS and BTRFS are the only candidates that claim to do all the
>> above. Btrfs seems to have been "stable in a year or so" for as long as
>> I could keep a straight face around the word "Gigabyte", so it's a
>> non-starter at this point.
> I don't think btrfs's problem is stability as much as lack of features.
>    It only just got parity redundancy ("RAID-5/6") features recently, for
> example.

For example, btrfs didn't have any sort of fsck for a long time, even 
though it was touted as at least "release candidate" quality. One was 
finally released a while back, but it had some severe limitations. This 
has made me wary.

> It's arguably been *stable* since it appeared in release kernels about
> four years ago.
>
> One big thing may push you to btrfs: With ZFS on Linux, you have to
> patch your local kernels, and you can't then sell those machines as-is
> outside the company.  Are you willing to keep those kernels patched
> manually, whenever a new fix comes down from upstream?  Do your servers
> spend their whole life in house?

Are you sure about that? There are DKMS RPMs on the website: 
http://zfsonlinux.org/epel.html

The install instructions:

$ sudo yum localinstall --nogpgcheck http://archive.zfsonlinux.org/epel/zfs-release-1-3.el6.noarch.rpm
$ sudo yum install zfs
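
(Assuming the DKMS route works as advertised, the module should get 
rebuilt against new kernels automatically; something like this ought to 
confirm it built and loads:)

# check that the zfs module was built for the running kernel, then load it
$ dkms status | grep zfs
$ sudo modprobe zfs
$ lsmod | grep zfs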

>> Not as sure about ZFS' stability on Linux (those who run direct Unix
>> derivatives seem to rave about it) and failure modes.
> It wouldn't surprise me if ZFS on Linux is less mature than on Solaris
> and FreeBSD, purely due to the age of the effort.
>
> Here, we've been able to use FreeBSD on the big ZFS storage box, and
> share it out to the Linux and Windows boxes over NFS and Samba.

Much as I'm a Linux Lover, we may end up doing the same and putting up 
with the differences between *BSD and CentOS.