We are a CentOS shop, and have the lucky, fortunate problem of having ever-increasing amounts of data to manage. EXT3/4 becomes tough to manage when you start climbing, especially when you have to upgrade, so we're contemplating switching to ZFS.
As of last spring, it appears that ZFS On Linux http://zfsonlinux.org/ calls itself production ready despite a version number of 0.6.2, and being acknowledged as unstable on 32 bit systems.
However, given the need to do backups, zfs send sounds like a godsend over rsync which is running into scaling problems of its own. (EG: Nightly backups are being threatened by the possibility of taking over 24 hours per backup)
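(For reference, the incremental send/receive workflow we have in mind would look roughly like this -- pool, dataset, and host names here are hypothetical:

zfs snapshot tank/data@2013-10-24
zfs send tank/data@2013-10-24 | ssh backuphost zfs receive backup/data                   # initial full copy
zfs snapshot tank/data@2013-10-25
zfs send -i @2013-10-24 tank/data@2013-10-25 | ssh backuphost zfs receive backup/data    # nightly incremental

Only the blocks that changed between the two snapshots cross the wire, which is why it should scale so much better than a full rsync tree walk.)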
Was wondering if anybody here could weigh in with real-life experience? Performance/scalability?
-Ben
PS: I joined their mailing list recently, will be watching there as well. We will, of course, be testing for a while before "making the switch".
On 10/24/2013 1:41 PM, Lists wrote:
Was wondering if anybody here could weigh in with real-life experience? Performance/scalability?
I've only used ZFS on Solaris and FreeBSD. some general observations...
1) you need a LOT of ram for decent performance on large zpools. 1GB ram above your basic system/application requirements per terabyte of zpool is not unreasonable.
2) don't go overboard with snapshots. a few 100 are probably OK, but 1000s (*) will really drag down the performance of operations that enumerate file systems.
3) NEVER let a zpool fill up above about 70% full, or the performance really goes downhill.
4) I prefer using striped mirrors (aka raid10) over raidz/z2, but my applications are primarily database.
(*) ran into a guy who had 100s of zfs 'file systems' (mount points), per user home directories, and was doing nightly snapshots going back several years, and his zfs commands were taking a long long time to do anything, and he couldn't figure out why. I think he had over 10,000 filesystems * snapshots.
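on point 4, a striped-mirror pool is about as simple as it gets to set up; device names below are placeholders:

zpool create tank mirror disk0 disk1 mirror disk2 disk3    # two 2-way mirrors, striped
zpool status tank

each mirror is its own vdev, and writes get spread across the mirrors.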
On 10/24/2013 01:59 PM, John R Pierce wrote:
- you need a LOT of ram for decent performance on large zpools. 1GB ram
above your basic system/application requirements per terabyte of zpool is not unreasonable.
That seems quite reasonable to me. Our existing equipment has far more than enough RAM to make this a comfortable experience.
- don't go overboard with snapshots. a few 100 are probably OK, but
1000s (*) will really drag down the performance of operations that enumerate file systems.
Our intended use for snapshots is to enable consistent backup points, something we're simulating now with rsync and its hard-link option. We haven't figured out the best way to do this, but in our backup clusters we have rarely more than 100 save points at any one time.
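(What we do today is roughly the classic --link-dest rotation -- paths are hypothetical:

rsync -a --delete --link-dest=/backups/2013-10-23 server:/data/ /backups/2013-10-24/

Unchanged files become hard links to the previous night's copy, so only changed files take space, but rsync still has to walk and stat every file, which is where the 24-hour problem comes from.)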
- NEVER let a zpool fill up above about 70% full, or the performance
really goes downhill.
Thanks for the tip!
(*) ran into a guy who had 100s of zfs 'file systems' (mount points), per user home directories, and was doing nightly snapshots going back several years, and his zfs commands were taking a long long time to do anything, and he couldn't figure out why. I think he had over 10,000 filesystems * snapshots.
Wow. Couldn't he have the same results by putting all the home directories on a single ZFS partition?
On 10/24/2013 2:59 PM, Lists wrote:
(*) ran into a guy who had 100s of zfs 'file systems' (mount points), per user home directories, and was doing nightly snapshots going back several years, and his zfs commands were taking a long long time to do anything, and he couldn't figure out why. I think he had over 10,000 filesystems * snapshots.
Wow. Couldn't he have the same results by putting all the home directories on a single ZFS partition?
I believe he wanted quotas per user. ZFS quotas were only implemented at the file system level, at least as of whatever version he was running (I don't know if that's changed, as I never mess with quotas).
On 25.10.2013 at 00:47, John R Pierce pierce@hogranch.com wrote:
On 10/24/2013 2:59 PM, Lists wrote:
(*) ran into a guy who had 100s of zfs 'file systems' (mount points), per user home directories, and was doing nightly snapshots going back several years, and his zfs commands were taking a long long time to do anything, and he couldn't figure out why. I think he had over 10,000 filesystems * snapshots.
Wow. Couldn't he have the same results by putting all the home directories on a single ZFS partition?
I believe he wanted quotas per user. ZFS quotas were only implemented at the file system level, at least as of whatever version he was running (I don't know if thats changed, as I never mess with quotas).
User and group quotas have been possible for some time.
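For example, per-user quotas are just dataset properties now (user and dataset names below are made up):

zfs set userquota@alice=20G tank/home    # limit one user's space on a shared filesystem
zfs get userused@alice tank/home         # space currently charged to that user

So one big home filesystem with per-user quotas is possible without creating a filesystem per user.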
ZFS is cool. But there are a lot of issues, and a lot of knobs that need to be tuned where it is difficult to even find out whether they need tuning at all.
Especially if you run into performance problems.
Once you have some experience with it, I recommend reading this blog: http://nex7.blogspot.ch
and of course, the FreeNAS forum, where you can read about stuff like that:
https://bugs.freenas.org/issues/1531
On the surface, ZFS is great. But god help you if you run into problems.
On 10/24/2013 14:59, John R Pierce wrote:
On 10/24/2013 1:41 PM, Lists wrote:
- you need a LOT of ram for decent performance on large zpools. 1GB ram
above your basic system/application requirements per terabyte of zpool is not unreasonable.
To be fair, you want to treat XFS the same way.
And it, too is "unstable" on 32-bit systems with anything but smallish filesystems, due to lack of RAM.
On 10/24/2013 5:31 PM, Warren Young wrote:
To be fair, you want to treat XFS the same way.
And it, too is "unstable" on 32-bit systems with anything but smallish filesystems, due to lack of RAM.
I thought it had stack requirements that 32 bit couldn't meet, and it would simply crash, so it is not built into 32bit versions of EL6.
On Thu, Oct 24, 2013 at 01:59:15PM -0700, John R Pierce wrote:
On 10/24/2013 1:41 PM, Lists wrote:
Was wondering if anybody here could weigh in with real-life experience? Performance/scalability?
I've only used ZFS on Solaris and FreeBSD. some general observations...
- you need a LOT of ram for decent performance on large zpools. 1GB ram
above your basic system/application requirements per terabyte of zpool is not unreasonable.
- don't go overboard with snapshots. a few 100 are probably OK, but
1000s (*) will really drag down the performance of operations that enumerate file systems.
- NEVER let a zpool fill up above about 70% full, or the performance
really goes downhill.
Have run into this one (again -- with Nexenta) as well. It can be pretty dramatic. We tend to set quotas to ensure we don't exceed 75% or so max, but....
...at least on the Solaris side, there's a tunable you can set that keeps the metaslab (which gets fragmented and inefficient when pool utilization is high) entirely in memory. This completely resolves our throughput issue, but does require that you have sufficient memory to load the thing...
echo "metaslab_debug/W 1" | mdb -kw
There may be a ZOL equivalent.
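On the "don't fill it up" front, a blunt but simple way to cap utilization is a quota on the top-level dataset; a hypothetical 100T pool might get:

zfs set quota=75T tank
zfs get quota,used,available tank

Nothing under tank can then push the pool past roughly 75%.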
- I prefer using striped mirrors (aka raid10) over raidz/z2, but my
applications are primarily database.
(*) ran into a guy who had 100s of zfs 'file systems' (mount points), per user home directories, and was doing nightly snapshots going back several years, and his zfs commands were taking a long long time to do anything, and he couldn't figure out why. I think he had over 10,000 filesystems * snapshots.
Ray
On Sat, Oct 26, 2013 at 4:36 PM, Ray Van Dolson rayvd@bludgeon.org wrote:
On Thu, Oct 24, 2013 at 01:59:15PM -0700, John R Pierce wrote:
On 10/24/2013 1:41 PM, Lists wrote:
Was wondering if anybody here could weigh in with real-life experience? Performance/scalability?
I've only used ZFS on Solaris and FreeBSD. some general
observations...
- you need a LOT of ram for decent performance on large zpools. 1GB ram
above your basic system/application requirements per terabyte of zpool is not unreasonable.
- don't go overboard with snapshots. a few 100 are probably OK, but
1000s (*) will really drag down the performance of operations that enumerate file systems.
- NEVER let a zpool fill up above about 70% full, or the performance
really goes downhill.
Have run into this one (again -- with Nexenta) as well. It can be pretty dramatic. We tend to set quotas to ensure we don't exceed 75% or so max, but....
We may be getting a bit off topic here, but on that subject: we have noticed a significant degradation in performance on systems running at 75-80% of their pool capacity. I understand that the nature of COW will increase fragmentation. On large storage, though, 70% out of 100TB means that you always have to maintain 30TB free, which is not a small number in terms of cost per TB.
- George Kontostanos ---
On 24.Okt.2013, at 22:59, John R Pierce wrote:
On 10/24/2013 1:41 PM, Lists wrote:
Was wondering if anybody here could weigh in with real-life experience? Performance/scalability?
I've only used ZFS on Solaris and FreeBSD. some general observations...
...
- NEVER let a zpool fill up above about 70% full, or the performance
really goes downhill.
Why is that? It sounds cost-intensive, if not ridiculous. Disk space not to be used, forbidden land... Is the remaining 30% used by some ZFS internals?
On Mon, Nov 4, 2013 at 12:15 PM, Markus Falb wnefal@gmail.com wrote:
- NEVER let a zpool fill up above about 70% full, or the performance
really goes downhill.
Why is it? It sounds cost intensive, if not ridiculous. Disk space not to used, forbidden land... Is the remaining 30% used by some ZFS internals?
Probably just simple physics. If ZFS is smart enough to allocate space 'near' other parts of the related files/directories/inodes it will have to do worse when there aren't any good choices and it has to fragment things into the only remaining spaces and make the disk heads seek all over the place. Might not be a big problem on SSD's though.
On 11/4/2013 10:43 AM, Les Mikesell wrote:
On Mon, Nov 4, 2013 at 12:15 PM, Markus Falb
wnefal@gmail.com wrote:
- NEVER let a zpool fill up above about 70% full, or the performance
really goes downhill.
Why is it? It sounds cost intensive, if not ridiculous. Disk space not to used, forbidden land... Is the remaining 30% used by some ZFS internals?
Probably just simple physics. If ZFS is smart enough to allocate space 'near' other parts of the related files/directories/inodes it will have to do worse when there aren't any good choices and it has to fragment things into the only remaining spaces and make the disk heads seek all over the place. Might not be a big problem on SSD's though.
even on 0 seek time SSDs, fragmenting files means more extents to track and process in order to read that file.
On 11/04/2013 08:01 PM, John R Pierce wrote:
On 11/4/2013 10:43 AM, Les Mikesell wrote:
On Mon, Nov 4, 2013 at 12:15 PM, Markus Falb
wnefal@gmail.com wrote:
- NEVER let a zpool fill up above about 70% full, or the performance
really goes downhill.
Why is it? It sounds cost intensive, if not ridiculous. Disk space not to used, forbidden land... Is the remaining 30% used by some ZFS internals?
Probably just simple physics. If ZFS is smart enough to allocate space 'near' other parts of the related files/directories/inodes it will have to do worse when there aren't any good choices and it has to fragment things into the only remaining spaces and make the disk heads seek all over the place. Might not be a big problem on SSD's though.
even on 0 seek time SSDs, fragmenting files means more extents to track and process in order to read that file.
but why would this be much worse with ZFS than eg ext4?
On Thu, Oct 24, 2013 at 4:41 PM, Lists lists@benjamindsmith.com wrote:
We are a CentOS shop, and have the lucky, fortunate problem of having ever-increasing amounts of data to manage. EXT3/4 becomes tough to manage when you start climbing, especially when you have to upgrade, so we're contemplating switching to ZFS.
You didn't mention XFS. Just curious if you considered it or not.
As of last spring, it appears that ZFS On Linux http://zfsonlinux.org/ calls itself production ready despite a version number of 0.6.2, and being acknowledged as unstable on 32 bit systems.
However, given the need to do backups, zfs send sounds like a godsend over rsync which is running into scaling problems of its own. (EG: Nightly backups are being threatened by the possibility of taking over 24 hours per backup)
Was wondering if anybody here could weigh in with real-life experience? Performance/scalability?
-Ben
PS: I joined their mailing list recently, will be watching there as well. We will, of course, be testing for a while before "making the switch".
On 2013-10-24, SilverTip257 silvertip257@gmail.com wrote:
On Thu, Oct 24, 2013 at 4:41 PM, Lists lists@benjamindsmith.com wrote:
We are a CentOS shop, and have the lucky, fortunate problem of having ever-increasing amounts of data to manage. EXT3/4 becomes tough to manage when you start climbing, especially when you have to upgrade, so we're contemplating switching to ZFS.
You didn't mention XFS. Just curious if you considered it or not.
XFS is better than ext3/4 for many applications, but it's still not as powerful as ZFS, which basically combines RAID, filesystem, and LVM into one. It sounds like the OP is really looking to take advantage of the extra features of ZFS.
Was wondering if anybody here could weigh in with real-life experience?
I don't have my own, but I have heard of other shops which have had lots of success with ZFS on OpenSolaris and their variants. I know of some places which are starting to put ZFS on linux into testing or preproduction, but nothing really extensive yet.
--keith
Greetings,
On Fri, Oct 25, 2013 at 3:57 AM, Keith Keller kkeller@wombat.san-francisco.ca.us wrote:
I don't have my own, but I have heard of other shops which have had lots of success with ZFS on OpenSolaris and their variants.
And I know of a shop which could not recover a huge ZFS pool on FreeBSD and had to opt for something like Isilon, due to the unavailability of controller drivers for FreeBSD.
On 10/24/2013 02:47 PM, SilverTip257 wrote:
You didn't mention XFS. Just curious if you considered it or not.
Most definitely. There are a few features that I'm looking for:
1) MOST IMPORTANT: STABLE!
2) The ability to make the partition bigger by adding drives with very minimal/no downtime.
3) The ability to remove an older, (smaller) drive or drives in order to replace with larger capacity drives without downtime or having to copy over all the files manually.
4) The ability to create snapshots with no downtime.
5) The ability to synchronize snapshots quickly and without having to scan every single file. (backups)
6) Reasonable failure mode. Things *do* go south sometimes. Simple is better, especially when it's simpler for the (typically highly stressed) administrator.
7) Big. Basically all filesystems in question can handle our size requirements. We might hit a 100 TB partition in the next 5 years.
I think ZFS and BTRFS are the only candidates that claim to do all the above. Btrfs seems to have been "stable in a year or so" for as long as I could keep a straight face around the word "Gigabyte", so it's a non-starter at this point.
LVM2/Ext4 can do much of the above. However, horror stories abound, particularly around very large volumes. Also, LVM2 can be terrible in failure situations.
XFS does snapshots, but don't you have to freeze the volume first? Xfsrestore looks interesting for backups, though I don't know if there's a consistent "freeze point". (what about ongoing writes?) Not sure about removing HDDs in a volume with XFS.
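(My understanding -- and part of what makes me nervous -- is that an XFS "snapshot" is really an LVM snapshot of the underlying volume wrapped in a freeze/thaw; names below are hypothetical:

xfs_freeze -f /data                                 # quiesce the filesystem
lvcreate -s -n data_snap -L 20G /dev/vg0/data       # COW snapshot of the backing LV
xfs_freeze -u /data                                 # resume writes
mount -o nouuid,ro /dev/vg0/data_snap /mnt/snap     # mount the frozen image for backup

Ongoing writes just block for the moment the freeze lasts and then land on the live volume as usual.)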
Not as sure about ZFS' stability on Linux (those who run direct Unix derivatives seem to rave about it) and failure modes.
We tested ZFS on CentOS 6.4 a few months ago using a decent Supermicro server with 16GB RAM and 11 drives on RAIDZ3. Same specs as a mid-range storage server that we build mainly using FreeBSD.
Performance was not bad, but eventually we ran into a situation where we could not import a pool anymore after a kernel/modules update.
I would not recommend it for production...
On Fri, Oct 25, 2013 at 2:12 AM, Lists lists@benjamindsmith.com wrote:
On 10/24/2013 02:47 PM, SilverTip257 wrote:
You didn't mention XFS. Just curious if you considered it or not.
Most definitely. There are a few features that I'm looking for:
- MOST IMPORTANT: STABLE!
- The ability to make the partition bigger by adding drives with very minimal/no downtime.
- The ability to remove an older, (smaller) drive or drives in order to replace with larger capacity drives without downtime or having to copy over all the files manually.
- The ability to create snapshots with no downtime.
- The ability to synchronize snapshots quickly and without having to scan every single file. (backups)
- Reasonable failure mode. Things *do* go south sometimes. Simple is better, especially when it's simpler for the (typically highly stressed) administrator.
- Big. Basically all filesystems in question can handle our size requirements. We might hit a 100 TB partition in the next 5 years.
I think ZFS and BTRFS are the only candidates that claim to do all the above. Btrfs seems to have been "stable in a year or so" for as long as I could keep a straight face around the word "Gigabyte", so it's a non-starter at this point.
LVM2/Ext4 can do much of the above. However, horror stories abound, particularly around very large volumes. Also, LVM2 can be terrible in failure situations.
XFS does snapshots, but don't you have to freeze the volume first? Xfsrestore looks interesting for backups, though I don't know if there's a consistent "freeze point". (what about ongoing writes?) Not sure about removing HDDs in a volume with XFS.
Not as sure about ZFS' stability on Linux (those who run direct Unix derivatives seem to rave about it) and failure modes.
On 10/24/2013 4:12 PM, Lists wrote:
On 10/24/2013 02:47 PM, SilverTip257 wrote:
You didn't mention XFS. Just curious if you considered it or not.
Most definitely. There are a few features that I'm looking for:
- MOST IMPORTANT: STABLE!
XFS is quite stable in CentOS 6.4 64-bit; there was a flaky kernel issue circa 6.2.
- The ability to make the partition bigger by adding drives with very
minimal/no downtime.
XFS+LVM+mdraid does this, but it requires several manual steps...
I'd take the new drives, add them to a new md mirror, then add that md device to the volume group, then lvextend the logical volume, and finally xfs_growfs the file system. Yes, that's a bunch more steps than the zpool/zfs commands, but in fact zfs is doing much the same thing internally.
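the whole dance looks something like this (device and volume names made up):

mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc /dev/sdd   # new mirror from the new drives
vgextend vg0 /dev/md1                                                  # add it to the volume group
lvextend -L +1T /dev/vg0/data                                          # grow the logical volume
xfs_growfs /srv/data                                                   # grow the mounted filesystem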
I believe lvm also lets you replace pv's in the vg with new larger ones. I haven't had to do this yet.
On 10/24/2013 17:12, Lists wrote:
- The ability to make the partition bigger by adding drives with very
minimal/no downtime.
Be careful: you may have been reading some ZFS hype that turns out not as rosy in reality.
Ideally, ZFS would work like a Drobo with an infinite number of drive bays. Need to add 1 TB of disk space or so? Just whack another 1 TB disk into the pool, no problem, right?
Doesn't work like that.
You can add another disk to an existing pool, but it doesn't instantly make the pool bigger. You can make it a hot spare, but you can't tell ZFS to expand the pool over the new drive.
"But," you say, "didn't I read that...." Yes, you did. ZFS *can* do what you want, just not in the way you were probably expecting.
The least complicated *safe* way to add 1 TB to a pool is add *two* 1 TB disks to the system, create a ZFS mirror out of them, and add *that* vdev to the pool. That gets you 1 TB of redundant space, which is what you actually wanted. Just realize, you now have two separate vdevs here, both providing storage space to a single pool.
You could instead turn that new single disk into a non-redundant separate vdev and add that to the pool, but then that one disk can take down the entire pool if it dies.
Another problem is that you have now created a system where ZFS has to guess which vdev to put a given block of data on. Your 2-disk mirror of newer disks probably runs faster than the old 3+ disk raidz vdev, but ZFS isn't going to figure that out on its own. There are ways to "encourage" ZFS to use one vdev over another. There's even a special case mode where you can tell it about an SSD you've added to act purely as an intermediary cache, between the spinning disks and the RAM caches.
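To make that concrete (device names are placeholders):

zpool add tank mirror disk4 disk5    # a second data vdev, striped into the pool
zpool add tank cache ssd0            # an SSD acting as an L2ARC read cache

Note that once a data vdev has been added this way it can't be removed again; cache and log devices can be.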
The more expensive way to go -- which is simpler in the end -- is to replace each individual disk in the existing pool with a larger one, letting ZFS resilver each new disk, one at a time. Once all disks have been replaced, *then* you can grow that whole vdev, and thus the pool.
But, XFS and ext4 can do that, too. ZFS only wins when you want to add space by adding vdevs.
- The ability to remove an older, (smaller) drive or drives in order to
replace with larger capacity drives without downtime or having to copy over all the files manually.
Some RAID controllers will let you do this. XFS and ext4 have specific support for growing an existing filesystem to fill a larger volume.
- Reasonable failure mode. Things *do* go south sometimes. Simple is
better, especially when it's simpler for the (typically highly stressed) administrator.
I find it simpler to use ZFS to replace a failed disk than any RAID BIOS or RAID management tool I've ever used. ZFS's command line utilities are quite simply slick. It's an under-hyped feature of the filesystem, if anything.
A lot of thought clearly went into the command language, so that once you learn a few basics, you can usually guess the right command in any given situation. That sort of good design doesn't happen by itself.
All other disk management tools I've used seem to have just accreted features until they're a pile of crazy. The creators of ZFS came along late enough in the game that they were able to look at everything and say, "No no no, *this* is how you do it."
I think ZFS and BTRFS are the only candidates that claim to do all the above. Btrfs seems to have been "stable in a year or so" for as long as I could keep a straight face around the word "Gigabyte", so it's a non-starter at this point.
I don't think btrfs's problem is stability as much as lack of features. It only just got parity redundancy ("RAID-5/6") features recently, for example.
It's arguably been *stable* since it appeared in release kernels about four years ago.
One big thing may push you to btrfs: With ZFS on Linux, you have to patch your local kernels, and you can't then sell those machines as-is outside the company. Are you willing to keep those kernels patched manually, whenever a new fix comes down from upstream? Do your servers spend their whole life in house?
Not as sure about ZFS' stability on Linux (those who run direct Unix derivatives seem to rave about it) and failure modes.
It wouldn't surprise me if ZFS on Linux is less mature than on Solaris and FreeBSD, purely due to the age of the effort.
Here, we've been able to use FreeBSD on the big ZFS storage box, and share it out to the Linux and Windows boxes over NFS and Samba.
On 10/24/2013 5:29 PM, Warren Young wrote:
The least complicated *safe* way to add 1 TB to a pool is add *two* 1 TB disks to the system, create a ZFS mirror out of them, and add *that* vdev to the pool. That gets you 1 TB of redundant space, which is what you actually wanted. Just realize, you now have two separate vdevs here, both providing storage space to a single pool.
yeah, I guess I should have made that clearer, that's exactly what you do.
and, it doesn't restripe old files til they get rewritten. new stuff will be striped across all the vdevs, old stuff stays where it is.
On 10/24/2013 05:29 PM, Warren Young wrote:
On 10/24/2013 17:12, Lists wrote:
- The ability to make the partition bigger by adding drives with very
minimal/no downtime.
Be careful: you may have been reading some ZFS hype that turns out not as rosy in reality. Ideally, ZFS would work like a Drobo with an infinite number of drive bays. Need to add 1 TB of disk space or so? Just whack another 1 TB disk into the pool, no problem, right?
Doesn't work like that.
You can add another disk to an existing pool, but it doesn't instantly make the pool bigger. You can make it a hot spare, but you can't tell ZFS to expand the pool over the new drive.
"But," you say, "didn't I read that...." Yes, you did. ZFS *can* do what you want, just not in the way you were probably expecting.
The least complicated *safe* way to add 1 TB to a pool is add *two* 1 TB disks to the system, create a ZFS mirror out of them, and add *that* vdev to the pool. That gets you 1 TB of redundant space, which is what you actually wanted. Just realize, you now have two separate vdevs here, both providing storage space to a single pool.
You could instead turn that new single disk into a non-redundant separate vdev and add that to the pool, but then that one disk can take down the entire pool if it dies.
We have redundancy at the server/host level, so even if we have a fileserver go completely offline, our application retains availability. We have an API in our application stack that negotiates with the (typically 2 or 3) file stores.
Another problem is that you have now created a system where ZFS has to guess which vdev to put a given block of data on. Your 2-disk mirror of newer disks probably runs faster than the old 3+ disk raidz vdev, but ZFS isn't going to figure that out on its own. There are ways to "encourage" ZFS to use one vdev over another. There's even a special case mode where you can tell it about an SSD you've added to act purely as an intermediary cache, between the spinning disks and the RAM caches.
Performance isn't so much an issue - we'd partition our cluster and throw a few more boxes into place if it became a bottleneck.
The more expensive way to go -- which is simpler in the end -- is to replace each individual disk in the existing pool with a larger one, letting ZFS resilver each new disk, one at a time. Once all disks have been replaced, *then* you can grow that whole vdev, and thus the pool.
Not sure enough of the vernacular, but let's say you have 4 drives in a RAID 1 configuration, one set of 1 TB drives and another set of 2 TB drives.
A1 <-> A2 = 2x 1TB drives, 1 TB redundant storage.
B1 <-> B2 = 2x 2TB drives, 2 TB redundant storage.
We have 3 TB of available storage. Are you suggesting we add a couple of 4 TB drives:
A1 <-> A2 = 2x 1TB drives, 1 TB redundant storage.
B1 <-> B2 = 2x 2TB drives, 2 TB redundant storage.
C1 <-> C2 = 2x 4TB drives, 4 TB redundant storage.
Then wait until ZFS moves A1/A2 over to C1/C2 before removing A1/A2? If so, that's the capability I'm looking for.
But, XFS and ext4 can do that, too. ZFS only wins when you want to add space by adding vdevs.
The only way I'm aware of ext4 doing this is with resize2fs, which is extending a partition on a block device. The only way to do that with multiple disks is to use a virtual block device like LVM/LVM2, which (as I've stated before) I'm hesitant to do.
- The ability to remove an older, (smaller) drive or drives in order to
replace with larger capacity drives without downtime or having to copy over all the files manually.
Some RAID controllers will let you do this. XFS and ext4 have specific support for growing an existing filesystem to fill a larger volume.
LVM2 will let you remove a drive without taking it offline. Can XFS do this without some block device virtualization like LVM2? (I didn't think so)
- Reasonable failure mode. Things *do* go south sometimes. Simple is
better, especially when it's simpler for the (typically highly stressed) administrator.
I find it simpler to use ZFS to replace a failed disk than any RAID BIOS or RAID management tool I've ever used. ZFS's command line utilities are quite simply slick. It's an under-hyped feature of the filesystem, if anything.
A lot of thought clearly went into the command language, so that once you learn a few basics, you can usually guess the right command in any given situation. That sort of good design doesn't happen by itself.
All other disk management tools I've used seem to have just accreted features until they're a pile of crazy. The creators of ZFS came along late enough in the game that they were able to look at everything and say, "No no no, *this* is how you do it."
I sooo hear your music here! What really sucks about filesystem management is that at the time when you really need to get it right is when everything seems to be the most complex.
I think ZFS and BTRFS are the only candidates that claim to do all the above. Btrfs seems to have been "stable in a year or so" for as long as I could keep a straight face around the word "Gigabyte", so it's a non-starter at this point.
I don't think btrfs's problem is stability as much as lack of features. It only just got parity redundancy ("RAID-5/6") features recently, for example.
For example, btrfs didn't have any sort of fsck, even though it was touted as at least "release candidate". There was one released a while back that had some severe limitations. This has made me wary.
It's arguably been *stable* since it appeared in release kernels about four years ago.
One big thing may push you to btrfs: With ZFS on Linux, you have to patch your local kernels, and you can't then sell those machines as-is outside the company. Are you willing to keep those kernels patched manually, whenever a new fix comes down from upstream? Do your servers spend their whole life in house?
Are you sure about that? There are DKMS RPMs on the website. http://zfsonlinux.org/epel.html
The install instructions:
$ sudo yum localinstall --nogpgcheck http://archive.zfsonlinux.org/epel/zfs-release-1-3.el6.noarch.rpm
$ sudo yum install zfs
Not as sure about ZFS' stability on Linux (those who run direct Unix derivatives seem to rave about it) and failure modes.
It wouldn't surprise me if ZFS on Linux is less mature than on Solaris and FreeBSD, purely due to the age of the effort.
Here, we've been able to use FreeBSD on the big ZFS storage box, and share it out to the Linux and Windows boxes over NFS and Samba.
Much as I'm a Linux Lover, we may end up doing the same and putting up with the differences between *BSD and CentOS.
On Oct 24, 2013, at 8:01 PM, Lists lists@benjamindsmith.com wrote:
Not sure enough of the vernacular
Yes, ZFS is complicated enough to have a specialized vocabulary.
I used two of these terms in my previous post:
- vdev, which is a virtual device, something like a software RAID. It is one or more disks, configured together, typically with some form of redundancy.
- pool, which is one or more vdevs, which has a capacity equal to all of its vdevs added together.
but let's say you have 4 drives in a RAID 1 configuration, one set of 1 TB drives and another set of 2 TB drives.
A1 <-> A2 = 2x 1TB drives, 1 TB redundant storage.
B1 <-> B2 = 2x 2TB drives, 2 TB redundant storage.
We have 3 TB of available storage.
Well, maybe.
You would have 3 TB *if* you configured these disks as two separate vdevs.
If you tossed all four disks into a single vdev, you could have only 2 TB because the smallest disk in a vdev limits the total capacity.
(This is yet another way ZFS isn't like a Drobo[*], despite the fact that a lot of people hype it as if it were the same thing.)
Are you suggesting we add a couple of 4 TB drives:
A1 <-> A2 = 2x 1TB drives, 1 TB redundant storage.
B1 <-> B2 = 2x 2TB drives, 2 TB redundant storage.
C1 <-> C2 = 2x 4TB drives, 4 TB redundant storage.
Then wait until ZFS moves A1/A2 over to C1/C2 before removing A1/A2? If so, that's capability I'm looking for.
No. ZFS doesn't let you remove a vdev from a pool once it's been added, without destroying the pool.
The supported method is to add disks C1 and C2 to the *A* vdev, then tell ZFS that C1 replaces A1, and C2 replaces A2. The filesystem will then proceed to migrate the blocks in that vdev from the A disks to the C disks. (I don't remember if ZFS can actually do both in parallel.)
Hours later, when that replacement operation completes, you can kick disks A1 and A2 out of the vdev, then physically remove them from the machine at your leisure. Finally, you tell ZFS to expand the vdev.
(There's an auto-expand flag you can set, so that last step can happen automatically.)
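A sketch of that sequence, using the disk labels from your example (real device names will differ):

zpool set autoexpand=on tank    # let the vdev grow once all of its members are bigger
zpool replace tank A1 C1        # resilver A1's data onto C1
zpool replace tank A2 C2        # then A2 onto C2
zpool status tank               # watch the resilver progress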
If you're not seeing the distinction, it is that there never were 3 vdevs at any point during this upgrade. The two C disks are in the A vdev, which never went away.
But, XFS and ext4 can do that, too. ZFS only wins when you want to add space by adding vdevs.
The only way I'm aware of ext4 doing this is with resize2fs, which is extending a partition on a block device. The only way to do that with multiple disks is to use a virtual block device like LVM/LVM2 which (as I've stated before) I'm hesitant to do.
Yes, implicit in my comments was that you were using XFS or ext4 with some sort of RAID (Linux md RAID or hardware) and Linux's LVM2.
You can use XFS and ext4 without RAID and LVM, but if you're going to compare to ZFS, you can't fairly ignore these features just because it makes ZFS look better.
btrfs didn't have any sort of fsck
Neither does ZFS.
btrfs doesn't need an fsck for pretty much the same reason ZFS doesn't. Both filesystems effectively keep themselves fsck'd all the time, and you can do an online scrub if you're ever feeling paranoid.
ZFS is nicer in this regard, in that it lets you schedule the scrub operation. You can obviously schedule one for btrfs, but that doesn't take into account scrub time. If you tell ZFS to scrub every day, there will be 24 hours of gap between scrubs.
We use 1 week at the office, and each scrub takes about a day, so the scrub date rotates around the calendar by about a day per week.
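(Kicking one off and checking on it is just the following, with "tank" standing in for whatever the pool is called:

zpool scrub tank
zpool status tank    # shows scrub progress and the date of the last completed scrub

No daemon or cron entry is required for a one-off scrub.)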
ZFS also has better checksumming than btrfs: up to 256 bits, vs 32 in btrfs. (A 1-in-4-billion chance of undetected corruption per block is still pretty good, though.)
There was one released a while back that had some severe limitations. This has made me wary.
All of the ZFSes out there are crippled relative to what's shipping in Solaris now, because Oracle has stopped releasing code. There are nontrivial features in zpool v29+ which simply aren't in the free forks of older versions of the Sun code.
Some of the still-active forks are of even older versions. I'm aware of one popular ZFS implementation still based on zpool *v8*.
If all you're doing is looking at feature sets, you can find reasons to reject every single option.
There are dkml RPMs on the website. http://zfsonlinux.org/epel.html
It is *possible* that keeping the CDDL ZFS code in a separate module manages to avoid tainting the GPL kernel code, in the same way that some people talk themselves into allowing proprietary GPU drivers with DRM support into their kernels.
You're playing with fire here. Bring good gloves.
[*] or other hybrid RAID system; I don't mean to suggest that only Drobo can do this
On 10/24/2013 11:18 PM, Warren Young wrote:
All of the ZFSes out there are crippled relative to what's shipping in Solaris now, because Oracle has stopped releasing code. There are nontrivial features in zpool v29+ which simply aren't in the free forks of older versions of the Sun code.
OpenZFS is doing pretty well on the BSD/etc side of things. Some of the original developers of ZFS, who long since bailed on Oracle, are contributing code that's not in the Oracle branch; they forked in 2010, at the last release from Sun, when OpenSolaris was discontinued. The current version of OpenZFS no longer relies on 'version numbers'; instead it has 'feature flags' for all post-v28 features. The version in my BSD 9.1-stable system has feature flags for...
async_destroy   (read-only compatible)   Destroy filesystems asynchronously.
empty_bpobj     (read-only compatible)   Snapshots use less space.
lz4_compress                             LZ4 compression algorithm support.
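you can see what a given build supports, and what a pool has enabled, with something like (pool name is hypothetical):

zpool upgrade -v                     # features this ZFS implementation knows about
zpool get all tank | grep feature@   # features enabled/active on the pool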
On 10/25/2013 00:44, John R Pierce wrote:
current version of OpenZFS no longer relies on 'version numbers', instead it has 'feature flags' for all post v28 features.
This must be the zpool v5000 thing I saw while researching my previous post. Apparently ZFSonLinux is doing the same thing, or is perhaps also based on OpenZFS.
On 10/25/2013 1:26 PM, Warren Young wrote:
On 10/25/2013 00:44, John R Pierce wrote:
current version of OpenZFS no longer relies on 'version numbers', instead it has 'feature flags' for all post v28 features.
This must be the zpool v5000 thing I saw while researching my previous post. Apparently ZFSonLinux is doing the same thing, or is perhaps also based on OpenZFS.
indeed, it is OpenZFS
On 10/24/2013 11:18 PM, Warren Young wrote:
- vdev, which is a virtual device, something like a software RAID. It is one or more disks, configured together, typically with some form of redundancy.
- pool, which is one or more vdevs, which has a capacity equal to all of its vdevs added together.
Thanks for the clarification of terms.
You would have 3 TB *if* you configured these disks as two separate vdevs.
If you tossed all four disks into a single vdev, you could have only 2 TB because the smallest disk in a vdev limits the total capacity.
(This is yet another way ZFS isn't like a Drobo[*], despite the fact that a lot of people hype it as if it were the same thing.)
Two separate vdevs is pretty much what I was after. Drobo: another interesting option :)
Are you suggesting we add a couple of 4 TB drives:
A1 <-> A2 = 2x 1TB drives, 1 TB redundant storage. B1 <-> B2 = 2x 2TB drives, 2 TB redundant storage. C1 <-> C2 = 2x 4TB drives, 4 TB redundant storage.
Then wait until ZFS moves A1/A2 over to C1/C2 before removing A1/A2? If so, that's capability I'm looking for.
No. ZFS doesn't let you remove a vdev from a pool once it's been added, without destroying the pool.
The supported method is to add disks C1 and C2 to the *A* vdev, then tell ZFS that C1 replaces A1, and C2 replaces A2. The filesystem will then proceed to migrate the blocks in that vdev from the A disks to the C disks. (I don't remember if ZFS can actually do both in parallel.)
Hours later, when that replacement operation completes, you can kick disks A1 and A2 out of the vdev, then physically remove them from the machine at your leisure. Finally, you tell ZFS to expand the vdev.
(There's an auto-expand flag you can set, so that last step can happen automatically.)
If you're not seeing the distinction, it is that there never were 3 vdevs at any point during this upgrade. The two C disks are in the A vdev, which never went away.
I see the distinction about vdevs vs. block devices. Still, the process you outline is *exactly* the capability that I'm looking for, despite the distinction in semantics.
Yes, implicit in my comments was that you were using XFS or ext4 with some sort of RAID (Linux md RAID or hardware) and Linux's LVM2.
You can use XFS and ext4 without RAID and LVM, but if you're going to compare to ZFS, you can't fairly ignore these features just because it makes ZFS look better.
I've had good results with Linux' software RAID+Ext[2-4]. For example, I *love* that you can mount a RAID partitioned drive directly in a worst-case scenario. LVM2 complicates administration terribly. The widely touted, simplified administration of ZFS is quite attractive to me.
I'm just trying to find the best tool for the job. That may well end up being Drobo!
On 10/25/2013 10:33 AM, Lists wrote:
LVM2 complicates administration terribly.
huh? it hugely simplifies it for me, when I have lots of drives. I just wish mdraid and lvm were better integrated. to see how it should have been done, see IBM AIX's version of lvm. you grow a jfs file system, it automatically grows the underlying LV (logical volume), online, live. mirroring in AIX is done via lvm.
On Fri, Oct 25, 2013 at 1:40 PM, John R Pierce pierce@hogranch.com wrote:
On 10/25/2013 10:33 AM, Lists wrote:
LVM2 complicates administration terribly.
huh? it hugely simplifies it for me, when I have lots of drives. I just wish mdraid and lvm were better integrated. to see how it should have been done, see IBM AIX's version of lvm. you grow a jfs file system, it automatically grows the underlying LV (logical volume), online, live. mirroring in AIX is done via lvm.
Funny you mentioned AIX's JFS. I really like what they did there.
On 10/26/2013 06:40 AM, John R Pierce wrote:
to see how it should have been done, see IBM AIX's version of lvm. you grow a jfs file system, it automatically grows the underlying LV (logical volume), online, live.
lvm can do this with the --resizefs flag for lvextend, one command to grow both the logical volume and the fs, and it can be done live provided the fs supports it.
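For example (volume group and LV names are hypothetical):

lvextend --resizefs -L +500G /dev/vg0/data

grows the logical volume and then runs the filesystem's own grow tool in one step.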
Peter
On 10/25/2013 11:33, Lists wrote:
I'm just trying to find the best tool for the job.
Try everything. Seriously.
You won't know what you like, and what works *for you* until you have some experience. Buy a Drobo for the home, replace one of your old file servers with a FreeBSD ZFS box, enable LVM on the next Linux workstation.
That may well end up being Drobo!
Drobos are no panacea, either.
Years ago, my Drobo FS would disappear from the network occasionally, and have to be rebooted. (This seems to be fixed now.)
My boss's first-generation Drobo killed itself in a power outage. It was directly attached to his Windows box, and on restart, chkdsk couldn't find a filesystem at all. A data recovery program was able to pull files back off the disk, though, so it's not like the unit was actually dead. It just managed to corrupt the NTFS data structures thoroughly, despite the fact that it's supposed to be a redundant filesystem. It implies Drobo isn't using a battery-backed RAM cache, for their low-end units at least.
Every Drobo I've ever used[*] has been much slower than a comparably-priced "dumb" RAID.
The first Drobos would benchmark at about 20 MByte/sec when populated by disks capable of 100 MByte/sec raw. The two subsequent Drobo generations were touted as faster, but I don't think I ever hit even 30 MByte/sec.
Data migration after replacing a disk is also uncomfortably slow. The fastest I've ever seen a disk replacement take is about a day. As disks have gotten bigger, my existing Drobos haven't gotten faster, so now migration might take a week! It's for this single reason that I now refuse to use single-disk redundancy with Drobos. The window without protection is just too big now.
A lot of this is doubtless down to the small embedded processor in these things. ZFS on a "real" computer is simply in a different class.
[*] I haven't yet used a Thunderbolt or "B" series professional version. It is possible they're running at native disk speeds. But then, they're even more expensive.
On re-reading, I realized I didn't complete some of my thoughts:
On 10/25/2013 00:18, Warren Young wrote:
ZFS is nicer in this regard, in that it lets you schedule the scrub operation. You can obviously schedule one for btrfs,
...with cron...
but that doesn't take into account scrub time.
This is important because a ZFS scrub takes absolute lowest priority. (Presumably true for btrfs, too.) Any time the filesystem has to service an I/O request, the scrub stops, then resumes when the I/O request has completed, unless another has arrived in the meantime.
This means that you cannot know how long a scrub will take unless you can exactly predict your future disk I/O. Scheduling a scrub with cron could land you in a situation where the previous scrub is still running due to unusually high I/O when another scrub request comes in.
I initially set our ZFS file server up so that it would start scrubbing at close of business on Friday, but due to the way ZFS scrub scheduling works, the most recent scrub started late Wednesday and ran into Thursday. This isn't a problem. The scrub doesn't run in parallel to normal I/O, I don't even notice that the array is scrubbing itself unless I go over and watchen das blinkenlights astaunished.
On Thu, Oct 24, 2013 at 01:41:17PM -0700, Lists wrote:
We are a CentOS shop, and have the lucky, fortunate problem of having ever-increasing amounts of data to manage. EXT3/4 becomes tough to manage when you start climbing, especially when you have to upgrade, so we're contemplating switching to ZFS.
As of last spring, it appears that ZFS On Linux http://zfsonlinux.org/ calls itself production ready despite a version number of 0.6.2, and being acknowledged as unstable on 32 bit systems.
However, given the need to do backups, zfs send sounds like a godsend over rsync which is running into scaling problems of its own. (EG: Nightly backups are being threatened by the possibility of taking over 24 hours per backup)
Was wondering if anybody here could weigh in with real-life experience? Performance/scalability?
-Ben
PS: I joined their mailing list recently, will be watching there as well. We will, of course, be testing for a while before "making the switch".
Joining the discussion late, and don't really have anything to contribute on the ZFSonLinux side of things...
At $DAYJOB we have been running ZFS via Nexenta (previously via Solaris 10) for many years. We have about 5PB of this and the primary use case is for backups and handling of imagery.
For the most part, we really, really like ZFS. My feeling is that ZFS itself (at least in the *Solaris form) is rock solid and stable. Other pieces of the stack -- namely SMB/CIFS and some of the management tools provided by the various vendors -- are a bit more questionable. We spend a bit more time fighting weirdnesses with things higher up the stack than we do on, say, our NetApp environment. To be expected.
I'm waiting for Red Hat or someone else to come out and support ZFS -- perhaps unlikely due to legality questions, but if I could marry the power of ZFS with the software stack in Linux (Samba!!), I'd be mighty happy. Yes -- we could run Samba on our Nexenta boxes, but it isn't "supported".
Echo'ing what many others say:
- ZFS is memory hungry. All of our PRD boxes have 144GB of memory in them, and some have SSDs for ZIL or L2ARC depending on the workload.
- Powerful redundancy is possible. Our environment is built on top of Dell MD1200 JBODs, all dual-pathed up to dual LSI SAS switches. Our vdevs (RAID groups) are sized to match the number of JBODs, with the individual disks spread across each JBOD. We use triple-parity RAID (RAIDZ3) and as such can lose three entire JBODs without suffering any data loss. We actually had one JBOD go flaky on us and were able to hot yank it out and put in a new one with zero downtime (and much shorter resilver/rebuild times than you'd get with regular RAID).
- We make heavy use of snapshots and clones. We probably have 200-300 on some systems, and we use them to do release management for collections of imagery. Very powerful, and we haven't run into performance issues yet.
  * Snapshots let us take "diffs" between versions quite easily. We then stream these diffs to an identical ZFS system at a DR site and merge in the changes. Our network pipe isn't big enough yet to do this quickly, so we typically just plug in another SAS JBOD with a zpool on it, stream the diffs there as a flat file, sneakernet the JBOD to the DR site, plug it in, import the zpool and slurp in the differences. Pretty cool.
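The flat-file transfer is nothing exotic, just an incremental send redirected to a file (dataset and snapshot names below are made up):

zfs snapshot pool/imagery@rel42
zfs send -i @rel41 pool/imagery@rel42 > /transferpool/rel41-to-rel42.zstream

and at the DR site, after importing the transfer pool:

zfs receive pool/imagery < /transferpool/rel41-to-rel42.zstream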
As I mentioned, we have run into a few weird quirks. Mainly around stability of the management GUI (or lack of basic features like "useful" SNMP based monitoring), performance with CIFS and oddnesses like high system load in certain edge cases. Some general rough edges I suppose that we've been OK dealing with. The Nexenta guys are super smart, but of course they're a smaller shop and don't have the resources behind them that CentOS does with Red Hat.
My guess is that this would be exacerbated to some extent on the Linux platform at this point. I personally wouldn't want to use ZFS on Linux for our customer data serving workloads, but might consider it for something purely internal.
Ray