[CentOS] ZFS on Linux in production?

Warren Young warren at etr-usa.com
Fri Oct 25 06:18:12 UTC 2013


On Oct 24, 2013, at 8:01 PM, Lists <lists at benjamindsmith.com> wrote:

> Not sure enough of the vernacular

Yes, ZFS is complicated enough to have a specialized vocabulary.

I used two of these terms in my previous post:

- vdev, which is a virtual device, something like a software RAID.  It is one or more disks, configured together, typically with some form of redundancy.

- pool, which is one or more vdevs; its capacity is the sum of its vdevs' capacities.
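
To make those two terms concrete, here is roughly how a pool made of two mirror vdevs gets built.  (The pool name "tank" and the sdb..sde device names are placeholders, not a recommendation.)

    zpool create tank mirror sdb sdc    # a pool with one 2-disk mirror vdev
    zpool add tank mirror sdd sde       # a second mirror vdev; pool capacity grows
    zpool status tank                   # shows the pool, its vdevs, and their disks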

> but let's say you have 4 drives in a 
> RAID 1 configuration, 1 set of 1 TB drives and another set of 2 TB drives.
> 
> A1 <-> A2 = 2x 1TB drives, 1 TB redundant storage.
> B1 <-> B2 = 2x 2TB drives, 2 TB redundant storage.
> 
> We have 3 TB of available storage.

Well, maybe.

You would have 3 TB *if* you configured these disks as two separate vdevs.

If you tossed all four disks into a single vdev, you could have only 2 TB because the smallest disk in a vdev limits the total capacity.

(This is yet another way ZFS isn't like a Drobo[*], despite the fact that a lot of people hype it as if it were the same thing.)
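
In zpool terms, the difference between the two layouts looks roughly like this, using your A1/A2/B1/B2 labels as stand-ins for real device paths, and guessing raidz2 for the single-vdev case:

    # Two separate mirror vdevs: the 1 TB pair plus the 2 TB pair, ~3 TB usable.
    zpool create tank mirror A1 A2 mirror B1 B2

    # All four disks in one raidz2 vdev: each disk counts for only as much as
    # the smallest one (1 TB here), so you end up with ~2 TB usable.
    zpool create tank raidz2 A1 A2 B1 B2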

> Are you suggesting we add a couple of 
> 4 TB drives:
> 
> A1 <-> A2 = 2x 1TB drives, 1 TB redundant storage.
> B1 <-> B2 = 2x 2TB drives, 2 TB redundant storage.
> C1 <-> C2 = 2x 4TB drives, 4 TB redundant storage.
> 
> Then wait until ZFS moves A1/A2 over to C1/C2 before removing A1/A2? If 
> so, that's capability I'm looking for.

No.  ZFS doesn't let you remove a vdev from a pool once it's been added, without destroying the pool.

The supported method is to add disks C1 and C2 to the *A* vdev, then tell ZFS that C1 replaces A1, and C2 replaces A2.  The filesystem will then proceed to migrate the blocks in that vdev from the A disks to the C disks. (I don't remember if ZFS can actually do both in parallel.)

Hours later, when that replacement operation completes, you can kick disks A1 and A2 out of the vdev, then physically remove them from the machine at your leisure.  Finally, you tell ZFS to expand the vdev.

(There's an auto-expand flag you can set, so that last step can happen automatically.)

If you're not seeing the distinction, it is that there never were 3 vdevs at any point during this upgrade.  The two C disks are in the A vdev, which never went away.
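
In commands, the upgrade looks roughly like this, with the placeholder pool name "tank" and the A1/A2/C1/C2 labels standing in for real device paths:

    zpool set autoexpand=on tank   # optional: let the vdev grow on its own afterward
    zpool replace tank A1 C1       # resilver A1's data onto C1
    zpool replace tank A2 C2       # same for the other half of the mirror
    zpool status tank              # watch the resilver; the A disks drop out when it finishes
    zpool online -e tank C1        # only needed if you skipped autoexpand
    zpool online -e tank C2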

>> But, XFS and ext4 can do that, too.  ZFS only wins when you want to add
>> space by adding vdevs.
> 
> The only way I'm aware of ext4 doing this is with resize2fs, which is 
> extending a partition on a block device. The only way to do that with 
> multiple disks is to use a virtual block device like LVM/LVM2 which (as 
> I've stated before) I'm hesitant to do.

Yes, implicit in my comments was that you were using XFS or ext4 with some sort of RAID (Linux md RAID or hardware) and Linux's LVM2.   

You can use XFS and ext4 without RAID and LVM, but if you're going to compare to ZFS, you can't fairly ignore these features just because it makes ZFS look better.
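
For reference, the growth path I had in mind on that stack looks something like this, with placeholder names (a new md array md1, volume group vg0, logical volume data):

    pvcreate /dev/md1               # turn the new md RAID1 array into an LVM physical volume
    vgextend vg0 /dev/md1           # add it to the volume group
    lvextend -L +2T /dev/vg0/data   # grow the logical volume into the new space
    resize2fs /dev/vg0/data         # ext4: grow the filesystem online
    # (for XFS you'd use xfs_growfs on the mount point instead)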

> btrfs didn't have any sort of fsck

Neither does ZFS.

btrfs doesn't need an fsck for pretty much the same reason ZFS doesn't.  Both filesystems effectively keep themselves fsck'd all the time, and you can do an online scrub if you're ever feeling paranoid.

ZFS is nicer in this regard, in that it lets you schedule the scrub in terms of the gap between scrubs.  You can obviously schedule one for btrfs with cron, but a fixed cron schedule doesn't take the scrub time itself into account.  If you tell ZFS to scrub every day, there will be 24 hours of gap between scrubs, counted from the end of one to the start of the next.

We use 1 week at the office, and each scrub takes about a day, so the scrub date rotates around the calendar by about a day per week.
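
Running and checking a scrub by hand is just this, with "tank" again a placeholder:

    zpool scrub tank     # kick off a scrub of the whole pool
    zpool status tank    # shows scrub progress, and the completion date afterward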

ZFS also has better checksumming than btrfs: up to 256 bits, vs the 32-bit CRC in btrfs.  (1-in-4-billion odds of a corrupt block going undetected is still pretty good, though.)
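
If you want the cryptographic 256-bit checksum rather than the default Fletcher algorithm, it's a per-dataset property you can change at any time; it only affects blocks written from then on.  ("tank" is a placeholder again.)

    zfs set checksum=sha256 tank
    zfs get checksum tank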

> There was one released a while 
> back that had some severe limitations. This has made me wary.

All of the ZFSes out there are crippled relative to what's shipping in Solaris now, because Oracle has stopped releasing code.  There are nontrivial features in zpool v29+ that simply aren't in the free forks of older versions of the Sun code.

Some of the still-active forks are of even older versions.  I'm aware of one popular ZFS implementation still based on zpool *v8*.
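
You can see which pool versions a given build understands, and which version a pool is actually at, with ("tank" a placeholder):

    zpool upgrade -v         # lists every pool version this ZFS build supports
    zpool get version tank   # the version this particular pool was created or upgraded to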

If all you're doing is looking at feature sets, you can find reasons to reject every single option.

> There are dkms RPMs on the website. 
> http://zfsonlinux.org/epel.html

It is *possible* that keeping the CDDL ZFS code in a separate module manages to avoid tainting the GPL kernel code, in the same way that some people talk themselves into allowing proprietary GPU drivers with DRM support into their kernels.

You're playing with fire here.  Bring good gloves.



[*] or other hybrid RAID system; I don't mean to suggest that only Drobo can do this

