[CentOS] ZFS on Linux in production?

Fri Oct 25 00:29:20 UTC 2013
Warren Young <warren at etr-usa.com>

On 10/24/2013 17:12, Lists wrote:
>
> 2) The ability to make the partition  bigger by adding drives with very
> minimal/no downtime.

Be careful: you may have been reading some ZFS hype that turns out not
to be as rosy in reality.

Ideally, ZFS would work like a Drobo with an infinite number of drive 
bays.  Need to add 1 TB of disk space or so?  Just whack another 1 TB 
disk into the pool, no problem, right?

Doesn't work like that.

You can hand another disk to an existing pool, but that doesn't by 
itself make the pool bigger.  You can make it a hot spare, but you 
can't tell ZFS to expand an existing vdev across the new drive.
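
For what it's worth, adding a disk as a hot spare is a one-liner (pool
and device names invented for the example):

    # zpool add tank spare sdd

The spare just sits idle until a disk in an existing vdev fails; it
adds no capacity of its own.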

"But," you say, "didn't I read that...."   Yes, you did.  ZFS *can* do 
what you want, just not in the way you were probably expecting.

The least complicated *safe* way to add 1 TB to a pool is to add *two* 1 TB 
disks to the system, create a ZFS mirror out of them, and add *that* 
vdev to the pool.  That gets you 1 TB of redundant space, which is what 
you actually wanted.  Just realize, you now have two separate vdevs 
here, both providing storage space to a single pool.
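
Concretely, that's something like this, again with made-up device names:

    # zpool add tank mirror sdd sde

That adds a second top-level vdev -- a two-way mirror -- alongside 
whatever vdev the pool started with.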

You could instead turn that new single disk into a non-redundant 
separate vdev and add that to the pool, but then that one disk can take 
down the entire pool if it dies.
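
That riskier version is the same command without the "mirror" keyword, 
and zpool will typically make you add -f to get past its 
mismatched-replication warning:

    # zpool add -f tank sdd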

Another problem is that you have now created a system where ZFS has to 
guess which vdev to put a given block of data on.  Your 2-disk mirror of 
newer disks probably runs faster than the old 3+ disk raidz vdev, but 
ZFS isn't going to figure that out on its own.  There are ways to 
"encourage" ZFS to use one vdev over another.  There's even a special 
case mode where you can tell it about an SSD you've added to act purely 
as an intermediary cache, between the spinning disks and the RAM caches.
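
Adding that cache device is another one-liner (device name invented):

    # zpool add tank cache sdf

Reads that miss the RAM cache can then be served from the SSD instead 
of the spinning disks.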

The more expensive way to go -- which is simpler in the end -- is to 
replace each individual disk in the existing pool with a larger one, 
letting ZFS resilver each new disk, one at a time.  Once all disks have 
been replaced, *then* you can grow that whole vdev, and thus the pool.
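
Roughly, with invented device names:

    # zpool set autoexpand=on tank
    # zpool replace tank sdb sdi     # wait for the resilver to finish
    # zpool replace tank sdc sdj     # ...and so on, one disk at a time
    # zpool status tank              # confirm everything is ONLINE

With autoexpand=on the pool grows by itself once the last small disk is 
gone; otherwise you can poke each device with "zpool online -e".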

But, XFS and ext4 can do that, too.  ZFS only wins when you want to add 
space by adding vdevs.

> 3) The ability to remove an older, (smaller) drive or drives in order to
> replace with larger capacity drives without downtime or having to copy
> over all the files manually.

Some RAID controllers will let you do this.  XFS and ext4 have specific 
support for growing an existing filesystem to fill a larger volume.
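
Once the underlying volume has been enlarged, the grow step on those 
filesystems is a one-liner (paths are just examples):

    # xfs_growfs /srv/data          # XFS: grow the mounted filesystem
    # resize2fs /dev/vg0/data       # ext4: grow to fill the volume

Both can run online.  Neither shrinks while mounted, and XFS can't 
shrink at all.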

> 6) Reasonable failure mode. Things *do* go south sometimes. Simple is
> better, especially when it's simpler for the (typically highly stressed)
> administrator.

I find it simpler to use ZFS to replace a failed disk than any RAID BIOS 
or RAID management tool I've ever used.  ZFS's command line utilities 
are quite simply slick.  It's an under-hyped feature of the filesystem, 
if anything.
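
To give you a taste, swapping out a dead disk is basically two commands 
(names invented):

    # zpool status tank             # shows which device is FAULTED
    # zpool replace tank sdc sdh    # resilver onto the new disk

Compare that to fighting through a vendor's RAID BIOS menus.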

A lot of thought clearly went into the command language, so that once 
you learn a few basics, you can usually guess the right command in any 
given situation.  That sort of good design doesn't happen by itself.

All other disk management tools I've used seem to have just accreted 
features until they're a pile of crazy.  The creators of ZFS came along 
late enough in the game that they were able to look at everything and 
say, "No no no, *this* is how you do it."

> I think ZFS and BTRFS are the only candidates that claim to do all the
> above. Btrfs seems to have been "stable in a year or so" for as long as
> I could keep a straight face around the word "Gigabyte", so it's a
> non-starter at this point.

I don't think btrfs's problem is stability so much as a lack of 
features.  It only recently got parity redundancy ("RAID-5/6"), for 
example.

It's arguably been *stable* since it appeared in release kernels about 
four years ago.

One big thing may push you to btrfs: with ZFS on Linux, you have to 
build the CDDL-licensed kernel modules against your own kernels, and 
you can't then sell those machines as-is outside the company.  Are you 
willing to keep rebuilding those modules whenever a new kernel comes 
down from upstream?  Do your servers spend their whole life in house?

> Not as sure about ZFS' stability on Linux (those who run direct Unix
> derivatives seem to rave about it) and failure modes.

It wouldn't surprise me if ZFS on Linux is less mature than on Solaris 
and FreeBSD, purely due to the age of the effort.

Here, we've been able to use FreeBSD on the big ZFS storage box, and 
share it out to the Linux and Windows boxes over NFS and Samba.
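
On the FreeBSD side, the NFS half of that can be as simple as setting a 
property on the dataset (dataset name made up):

    # zfs set sharenfs=on tank/exports

Samba still wants its own share definition in smb.conf pointing at the 
same path.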