[ Wish there was a generic, active Linux "storage" mailing list out there -- something other than the kernel lists I mean ]
To frame the discussion, we use VMware ESX (vSphere) quite a bit with NFS datastores -- often with NetApp, but lately, more often with Solaris 10 + ZFS + SSDs for the ZIL (intent log, i.e. write cache).
The ZIL lets us use synchronous writes (safer) without the normal delay. To get the same level of performance with Linux, we'd need to use async mode for our NFS shares -- and we'd lose some reliability.
However, given the latest rumblings and ruminations about Oracle potentially no longer selling entitlements for Solaris 10 on non-Sun hardware -- and then turning around and no longer allowing you to run Solaris 10 "freely", we're left with either OpenSolaris or looking at Linux again (we run Solaris 10 on Silicon Mechanics hardware).
My question is, what are the various options for getting NFS in "sync" mode to run fast on Linux?
Obviously we can buy a really nice disk controller with lots of cache, but I'm thinking more at the filesystem, volume manager or block driver layer. Is there a way to shunt write requests to a quicker medium like an SLC-based SSD (or NVRAM)?
I don't see a way to do this with LVM, ext3/ext4 or even xfs... maybe btrfs will have some options along this line down the road, but that's tomorrow and not today.
So is a beefy disk controller our best option? Even using our 3Ware 9650s with BBUs, so we can enable write-back caching, doesn't seem to give us write performance over NFS as good as ZFS + ZIL-on-SSD does...
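(For concreteness, the sort of export we're talking about on the Linux side is sketched below -- the path and network are made up; the 'sync' option is what forces the server to put data on stable storage before acknowledging each write, and that's exactly what hurts without a fast log device:)

    # /etc/exports -- illustrative entry for an ESX NFS datastore
    # sync           = don't reply until data is on stable storage (what we want, but slow)
    # no_root_squash = ESX mounts as root, so don't squash it
    /export/vmware  192.168.10.0/24(rw,sync,no_subtree_check,no_root_squash)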
Ray
On 4/22/2010 3:20 PM, Ray Van Dolson wrote:
> However, given the latest rumblings and ruminations about Oracle potentially no longer selling entitlements for Solaris 10 on non-Sun hardware -- and then turning around and no longer allowing you to run Solaris 10 "freely", we're left with either OpenSolaris or looking at Linux again (we run Solaris 10 on Silicon Mechanics hardware).
Is there some problem with OpenSolaris or NexentaStor?
On Thu, Apr 22, 2010 at 03:37:41PM -0500, Les Mikesell wrote:
> Is there some problem with OpenSolaris or NexentaStor?
Maybe not, but I'm trying to see what options there are on the Linux side.
The "delayed allocation" features in ext4 (and xfs, reiser4) sound interesting. Might give a little performance boost for synchronous write workloads....
[ We like Nexenta and OpenSolaris just fine, but really like the stability guarantee Solaris gives us -- much like RHEL. Would rather not have to worry (as much) about needing to reboot storage boxes and even though I have confidence in OpenSolaris, it's still more of a moving / changing target. Not to say we won't ultimately go in that direction though. ]
Thanks, Ray
Ray Van Dolson wrote:
> The "delayed allocation" features in ext4 (and xfs, reiser4) sound interesting. Might give a little performance boost for synchronous write workloads....
Doesn't delayed allocation defeat the purpose of a synchronous write?
I think what you want is a proper storage array with mirrored write cache.
nate
On Thu, Apr 22, 2010 at 02:06:47PM -0700, nate wrote:
> Doesn't delayed allocation defeat the purpose of a synchronous write?
I don't know for sure. From reading, it sounds like, as far as data integrity is concerned, it would fall somewhere between complete write-through synchronous writes and asynchronous writes.
> I think what you want is a proper storage array with mirrored write cache.
Which is what we have with ZFS + SSD-based ZIL for far less money than a NetApp.
This[1] sounds interesting...
Ray
Ray Van Dolson wrote:
>> I think what you want is a proper storage array with mirrored write cache.
> Which is what we have with ZFS + SSD-based ZIL for far less money than a NetApp.
Not unless you have a pair of them configured as an active/standby HA cluster, sharing dual-ported disk storage, and somehow (magic?) mirroring the cache pool so that if the active storage controller/server fails, the standby can take over without losing a single write.
John R Pierce wrote:
> Not unless you have a pair of them configured as an active/standby HA cluster, sharing dual-ported disk storage, and somehow (magic?) mirroring the cache pool so that if the active storage controller/server fails, the standby can take over without losing a single write.
OT too, but I really thought this was a good post/thread on ZFS:
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg18898.html
"ZFS is designed for high *reliability*" [..] "You want something completely different. You expect it to deliver *availability*.
And availability is something ZFS doesn't promise. It simply can't deliver this."
--
nate
On Thu, Apr 22, 2010 at 03:57:01PM -0700, nate wrote:
> OT too, but I really thought this was a good post/thread on ZFS:
> http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg18898.html
> "ZFS is designed for high *reliability*" [..] "You want something completely different. You expect it to deliver *availability*.
> And availability is something ZFS doesn't promise. It simply can't deliver this."
Yep... and something you of course know going in.
Don't want to get off on a tangent on that -- I'm still interested in what sort of solutions exist in the Linux world that can approximate what an SSD-based ZIL does for ZFS.
Kent Overstreet (from lkml) mentioned that his bcache patch is intended to do something very similar.
So I guess that's my answer -- it's not here yet, so it sounds like a controller is the only way to achieve this currently.
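(To sketch what that might eventually look like -- purely hypothetical, since bcache is still an out-of-tree patch; the tool name, sysfs paths and device names below are assumptions, not something you can run today:)

    # SSD becomes the cache device, the MD RAID array the backing device
    make-bcache -C /dev/sdc
    make-bcache -B /dev/md0
    echo /dev/sdc > /sys/fs/bcache/register
    echo /dev/md0 > /sys/fs/bcache/register
    # attach the backing device to the cache set (UUID printed by make-bcache)
    echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach
    # write-back mode is what would approximate a ZIL-style write absorb
    echo writeback > /sys/block/bcache0/bcache/cache_mode
    # then mkfs on /dev/bcache0 and export that over NFS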
Ray
On Apr 22, 2010, at 8:08 PM, Ray Van Dolson <rayvd@bludgeon.org> wrote:
> Yep... and something you of course know going in.
> Don't want to get off on a tangent on that -- I'm still interested in what sort of solutions exist in the Linux world that can approximate what an SSD-based ZIL does for ZFS.
> Kent Overstreet (from lkml) mentioned that his bcache patch is intended to do something very similar.
> So I guess that's my answer -- it's not here yet, so it sounds like a controller is the only way to achieve this currently.
How about locating the XFS journal on SSDs and using a HW RAID controller with a big NVRAM cache?
That should be a lot faster than ZFS with an SSD ZIL.
NFS should always be 'sync'; if performance isn't good, then your storage isn't good.
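Roughly like this, say -- devices, sizes and the mount point below are just placeholders, and keep in mind the XFS log journals metadata only, so the controller's NVRAM is what actually absorbs the synchronous data writes:

    # main filesystem on the RAID volume, log on an SSD partition
    mkfs.xfs -l logdev=/dev/sdc1,size=128m /dev/sdb1
    # the external log device has to be named again at mount time
    mount -t xfs -o logdev=/dev/sdc1,logbsize=256k /dev/sdb1 /export/vmware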
-Ross
Ross Walker wrote:
> NFS should always be 'sync'; if performance isn't good, then your storage isn't good.
Why demand sync on remote storage when you typically don't have it locally? Programs that need transactional integrity should know when to fsync(), and for anything else there's not much difference whether you crash before or after a write() was issued, in terms of it not completing.
On Apr 24, 2010, at 12:43 PM, Les Mikesell <lesmikesell@gmail.com> wrote:
> Why demand sync on remote storage when you typically don't have it locally? Programs that need transactional integrity should know when to fsync(), and for anything else there's not much difference whether you crash before or after a write() was issued, in terms of it not completing.
Yes, but 'async' ignores those fsyncs and returns immediately.
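A quick way to see it from a client (path made up, numbers will obviously vary): write a file and have dd call fsync() at the end. Against a 'sync' export the fsync waits for stable storage; against 'async' it comes back almost immediately, because the server has already claimed the data is stable:

    # on an NFS client, against the export in question
    # 'conv=fsync' makes dd call fsync() before reporting elapsed time
    dd if=/dev/zero of=/mnt/nfs/testfile bs=4k count=25000 conv=fsync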
-Ross
Ross Walker wrote:
> Yes, but 'async' ignores those fsyncs and returns immediately.
That sounds like a bug in the nfs client code if fsync() doesn't block until all of the data is committed to disk.
On Apr 24, 2010, at 4:34 PM, Les Mikesell <lesmikesell@gmail.com> wrote:
> That sounds like a bug in the nfs client code if fsync() doesn't block until all of the data is committed to disk.
It's not the client side I'm talking about, but the server side. We were talking NFS servers and exporting sync (obey fsyncs) vs async (ignore fsyncs).
The client always mounts async; that's not the problem.
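In other words, it's this knob in /etc/exports on the server (entries are illustrative), and it applies no matter what options the client mounts with:

    /srv/vmstore   10.0.0.0/24(rw,sync,no_subtree_check)    # reply only once data is on stable storage
    #/srv/vmstore  10.0.0.0/24(rw,async,no_subtree_check)   # reply immediately; data can be lost if the server crashes
    # re-export after editing
    exportfs -ra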
-Ross
Ross Walker wrote:
> It's not the client side I'm talking about, but the server side. We were talking NFS servers and exporting sync (obey fsyncs) vs async (ignore fsyncs).
> The client always mounts async; that's not the problem.
That's different. I thought the NFS spec was always sync on the server side, and the client says when async is OK. And there's a special-case response to handle the case where the server rebooted between the async writes and the subsequent fsync().
On Apr 24, 2010, at 4:53 PM, Les Mikesell <lesmikesell@gmail.com> wrote:
> That's different. I thought the NFS spec was always sync on the server side, and the client says when async is OK. And there's a special-case response to handle the case where the server rebooted between the async writes and the subsequent fsync().
All the NFS info you wanted, but were afraid to ask:
-Ross
On Thu, Apr 22, 2010 at 03:50:11PM -0700, John R Pierce wrote:
> Not unless you have a pair of them configured as an active/standby HA cluster, sharing dual-ported disk storage, and somehow (magic?) mirroring the cache pool so that if the active storage controller/server fails, the standby can take over without losing a single write.
This is definitely tangential to what I was originally asking. :)
I'm not suggesting this perfectly replaces (or even comes close to) a clustered NetApp setup. But it can provide similar NFS write performance, and I can buy three of them and replicate data for DR needs for far less than the price of a NetApp SnapMirror setup.
Ray
Ray Van Dolson wrote:
>> I think what you want is a proper storage array with mirrored write cache.
When ext3 came into widespread use, a popular method to "cache" frequent fsyncs was to run it in full data-journaling mode, with an external journal on a separate disk. This turned all random writes into a sequential write, limited to a very small piece of disk, with a periodic journal flush to the real file system. It worked amazingly well for busy mail queues -- throughput went up 10x and more. People were also reporting improvements in NFS scenarios. I don't know how relevant this is today in the era of SSDs, but it should be worth testing.
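For anyone who wants to try it, the setup is roughly the following -- device names and the mount point are placeholders, and the filesystem has to be unmounted while you switch journals:

    # create a dedicated external journal device (SSD or NVRAM partition);
    # its block size must match the filesystem's
    mke2fs -O journal_dev -b 4096 /dev/sdc1
    # drop the internal journal and point the filesystem at the external one
    tune2fs -O ^has_journal /dev/sdb1
    tune2fs -j -J device=/dev/sdc1 /dev/sdb1
    # mount with full data journaling so data, not just metadata, hits the journal first
    mount -t ext3 -o data=journal /dev/sdb1 /export/mailq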
Jure Pečar wrote:
> When ext3 came into widespread use, a popular method to "cache" frequent fsyncs was to run it in full data-journaling mode, with an external journal on a separate disk. This turned all random writes into a sequential write, limited to a very small piece of disk, with a periodic journal flush to the real file system. It worked amazingly well for busy mail queues -- throughput went up 10x and more. People were also reporting improvements in NFS scenarios. I don't know how relevant this is today in the era of SSDs, but it should be worth testing.
Separate disk only? Don't forget NVRAM sticks or BBU RAM drives.
On Fri, Apr 23, 2010 at 10:20:01AM +0200, Jure Pečar wrote:
> When ext3 came into widespread use, a popular method to "cache" frequent fsyncs was to run it in full data-journaling mode, with an external journal on a separate disk. This turned all random writes into a sequential write, limited to a very small piece of disk, with a periodic journal flush to the real file system. It worked amazingly well for busy mail queues -- throughput went up 10x and more. People were also reporting improvements in NFS scenarios. I don't know how relevant this is today in the era of SSDs, but it should be worth testing.
Interesting. As long as the requirements of O_SYNC are met once the data is written to the journal (I imagine they would be), then I could definitely see this speeding up NFS...
On the other hand, if no write confirmation is sent until the data actually flushes out of the journal and onto disk, then the wins probably aren't as significant.
Sounds like it'd be worth trying though, thanks.
Ray
On 4/23/2010 11:17 AM, Ray Van Dolson wrote:
> Interesting. As long as the requirements of O_SYNC are met once the data is written to the journal (I imagine they would be), then I could definitely see this speeding up NFS...
> On the other hand, if no write confirmation is sent until the data actually flushes out of the journal and onto disk, then the wins probably aren't as significant.
Do any Linux filesystems actually get this right now? In the past, the filesystem cache was somewhat divorced from file writes, so fsync() (and probably any write with O_SYNC) would wait until the entire filesystem cache was flushed to disk, not just the related file's buffers.