I'm considering setting up a Lustre cluster as shared storage for our XEN virtual machines, mainly for high availability on our XEN VPS servers.
Does anyone use Lustre in a production environment? What are your opinions/experiences with it?
One of the main reasons I'm looking at Lustre is that we have a few older machines of different specs that I want to use for storage, i.e. I don't want to be tied to proprietary or matching hardware. Many of these machines are still in good shape and of decent spec (3.2GHz PIV + HT & 4GB RAM, Core2Duo + 4GB RAM, single Xeon + 4GB RAM, dual-core Atoms, etc.). Will this work?
On 8/3/10, Rudi Ahlers Rudi@softdux.com wrote:
I'm considering setting up a Lustre cluster as shared storage for our XEN virtual machines, mainly for high availability on our XEN VPS servers.
Does anyone use Lustre in a production environment? What are your opinions/experiences with it?
I haven't used Lustre, but I was also researching it for the same purpose, as shared storage for VMs. I dropped it from consideration in the end after some discussion on the Lustre mailing list pointed out that it's intended more for high performance than high availability, so it might not be that suitable as an HA solution.
Have you considered trying Gluster instead?
Emmanuel Noobadmin writes:
On 8/3/10, Rudi Ahlers Rudi@softdux.com wrote:
I'm considering setting up a Lustre cluster as shared storage for our XEN virtual machines, mainly for high availability on our XEN VPS servers.
Does anyone use Lustre in a production environment? What are your opinions/experiences with it?
I haven't used Lustre, but I was also researching it for the same purpose, as shared storage for VMs. I dropped it from consideration in the end after some discussion on the Lustre mailing list pointed out that it's intended more for high performance than high availability, so it might not be that suitable as an HA solution.
Have you considered trying Gluster instead?
What do Gluster or Lustre offer that the builtin Red Hat Cluster Suite does not?
On 8/3/10, Lars Hecking lhecking@users.sourceforge.net wrote:
What do Gluster or Lustre offer that the builtin Red Hat Cluster Suite does not?
Being a noob admin, I'm not sure and still haven't decided fully on which way to go, largely because it seems the technologies of choice are both still maturing (gluster + non-Solaris ZFS).
I don't know about Rudi's requirements, but in my case the objective is an easily expandable, highly available storage setup with minimal switchover time that's decoupled from the VM host machines.
From what I understand, I cannot do the equivalent of network RAID 1 with a normal DRBD/Heartbeat style cluster. Gluster with replicate appears to do exactly that: I can have 2 or more storage servers with real-time duplicates of the same data, so that if any one fails the cluster doesn't run into problems. By using gluster distribute over pairs of servers, it seems that I can also easily add more storage by adding more replicate pairs.
On Tue, Aug 3, 2010 at 5:13 PM, Emmanuel Noobadmin centos.admin@gmail.com wrote:
On 8/3/10, Lars Hecking lhecking@users.sourceforge.net wrote:
What do Gluster or Lustre offer that the builtin Red Hat Cluster Suite  does not?
Being a noob admin, I'm not sure and still haven't decided fully on which way to go, largely because it seems the technologies of choice are both still maturing (gluster + non-Solaris ZFS).
I don't know about Rudi's requirements, but in my case the objective is an easily expandable, highly available storage setup with minimal switchover time that's decoupled from the VM host machines.
This is exactly why I'm considering it :)
From what I understand, I cannot do the equivalent of network RAID 1 with a normal DRBD/Heartbeat style cluster. Gluster with replicate appears to do exactly that: I can have 2 or more storage servers with real-time duplicates of the same data, so that if any one fails the cluster doesn't run into problems. By using gluster distribute over pairs of servers, it seems that I can also easily add more storage by adding more replicate pairs.
I'm thinking more along the lines of network RAID 10, if that's possible?
On Tue, 3 Aug 2010 at 6:11pm, Rudi Ahlers wrote
On Tue, Aug 3, 2010 at 5:13 PM, Emmanuel Noobadmin centos.admin@gmail.com wrote:
From what I understand, I cannot do the equivalent of network RAID 1 with a normal DRBD/Heartbeat style cluster. Gluster with replicate appears to do exactly that: I can have 2 or more storage servers with real-time duplicates of the same data, so that if any one fails the cluster doesn't run into problems. By using gluster distribute over pairs of servers, it seems that I can also easily add more storage by adding more replicate pairs.
I'm thinking more along the lines of network RAID 10, if that's possible?
Yes, you can do that with Gluster. That's the standard config produced by gluster-volgen if you feed it more than 2 volumes.
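For example (going from memory, so check the exact binary name and flags against your gluster release; the hosts and paths here are made up), feeding four bricks to the volume generator gives you replicated pairs with distribute on top:

glusterfs-volgen --name vmstore --raid 1 \
  server1:/export/brick server2:/export/brick \
  server3:/export/brick server4:/export/brick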
On 8/4/10, Rudi Ahlers Rudi@softdux.com wrote:
I'm thinking more along the lines of network RAID 10, if that's possible?
Yes, that's one of the things about Gluster that makes it rather attractive in theory to me. We can stack various translators in different ways, in this case distribute + replicate, for effectively network RAID 10.
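Roughly, the client-side volfile ends up looking something like this (3.0-era translator syntax; the server and volume names are made up):

# bricks exported by four storage servers
volume client1
  type protocol/client
  option transport-type tcp
  option remote-host server1
  option remote-subvolume brick
end-volume
# client2..client4 are defined the same way, pointing at server2..server4

# two mirrored pairs (the "RAID 1" part)
volume mirror0
  type cluster/replicate
  subvolumes client1 client2
end-volume
volume mirror1
  type cluster/replicate
  subvolumes client3 client4
end-volume

# spread files across the mirrored pairs (the "RAID 0" part)
volume vmstore
  type cluster/distribute
  subvolumes mirror0 mirror1
end-volume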
Emmanuel Noobadmin wrote, On 08/03/2010 11:13 AM:
From what I understand, I cannot do the equivalent of network RAID 1 with a normal DRBD/Heartbeat style cluster. Gluster with replicate appears to do exactly that: I can have 2 or more storage servers with real-time duplicates of the same data, so that if any one fails the cluster doesn't run into problems. By using gluster distribute over pairs of servers, it seems that I can also easily add more storage by adding more replicate pairs.
To have more than one active server with DRBD (or any other disk shared between active machines) you need to be using a file system which supports shared disk access:
http://www.drbd.org/docs/about/
http://www.drbd.org/users-guide-emb/s-dual-primary-mode.html
http://www.drbd.org/users-guide-emb/ch-gfs.html
http://www.drbd.org/users-guide-emb/ch-ocfs2.html
And perhaps use Gluster (RAID 0 over the net) with DRBD (RAID 1 over the net) as its disk space to get HA into Gluster? http://www.drbd.org/users-guide-emb/ch-xen.html
Note that it has been a while since I ran DRBD on a set of systems, and I only ran it active-passive with ext3, so the resources above are all I can point to.
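For reference, the dual-primary mode from the pages above boils down to a resource definition roughly like this (the hostnames, devices and addresses are made up, and you'd still need GFS2/OCFS2 on top of it):

resource r0 {
  protocol C;                # synchronous: a write completes only once both nodes have it on disk
  net {
    allow-two-primaries;     # the dual-primary switch
  }
  startup {
    become-primary-on both;
  }
  on node1 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   192.168.10.1:7789;
    meta-disk internal;
  }
  on node2 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   192.168.10.2:7789;
    meta-disk internal;
  }
}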
On 8/4/10, Todd Denniston Todd.Denniston@tsb.cranrdte.navy.mil wrote:
To have more than one active server with DRBD (or any other disk shared between active machines) you need to be using a file system which supports shared disk access:
http://www.drbd.org/docs/about/
http://www.drbd.org/users-guide-emb/s-dual-primary-mode.html
http://www.drbd.org/users-guide-emb/ch-gfs.html
http://www.drbd.org/users-guide-emb/ch-ocfs2.html
And perhaps use Gluster (RAID 0 over the net) with DRBD (RAID 1 over the net) as its disk space to get HA into Gluster? http://www.drbd.org/users-guide-emb/ch-xen.html
Thanks for pointing it out, I didn't realize DRBD could do that. I think I might have gotten it mixed up earlier with a thread that discussed using rsync.
That said, wouldn't using gluster alone be easier to configure and cheaper for almost equivalent redundancy?
Easier, because instead of running gluster RAID 0 on top of DRBD RAID 1, we can take out the DRBD layer and just use gluster to achieve the equivalent with distribute on replicate.
More importantly, there is the issue of cost: DRBD needs a pair of servers per node for active-active, whereas gluster allows me to get RAID "0.67" redundancy by "round robin" replicate.
i.e. if every storage node has 2 mdraid 1 block devices, md0 and md1, I can mirror Server1 md0 to Server2 md1, Server2 md0 to Server3 md1 and so forth. That's theoretically capable of surviving up to 50% node failure as long as no two adjacent nodes fail together, and it comes at a cost of N+1 as compared to DRBD's Nx2.
Please correct me if I'm missing some other crucial consideration.
Emmanuel Noobadmin wrote, On 08/04/2010 11:33 AM:
Easier, because instead of running gluster RAID 0 on top of DRBD RAID 1, we can take out the DRBD layer and just use gluster to achieve the equivalent with distribute on replicate.
More importantly, there is the issue of cost: DRBD needs a pair of servers per node for active-active, whereas gluster allows me to get RAID "0.67" redundancy by "round robin" replicate.
I missed this.
i.e. if every storage node has 2 mdraid 1 block devices, md0 and md1, I can mirror Server1 md0 to Server2 md1, Server2 md0 to Server3 md1 and so forth. That's theoretically capable of surviving up to 50% node failure as long as no two adjacent nodes fail together, and it comes at a cost of N+1 as compared to DRBD's Nx2.
DRBD cost would still be N+1, not Nx2, if set up similarly, I think.
If Gluster is doing the mirror of "Server1 md0 to Server2..." by itself, then yes, adding DRBD to it would be a bit of overkill, as DRBD would be set up to do something similar.
On Tue, Aug 3, 2010 at 4:45 PM, Lars Hecking lhecking@users.sourceforge.net wrote:
Emmanuel Noobadmin writes:
On 8/3/10, Rudi Ahlers Rudi@softdux.com wrote:
I'm considering setting up a Lustre cluster as shared storage for our XEN virtual machines, mainly for high availability on our XEN VPS servers.
Does anyone use Lustre in a production environment? What are your opinions/experiences with it?
I haven't used Lustre, but I was also researching it for the same purpose, as shared storage for VMs. I dropped it from consideration in the end after some discussion on the Lustre mailing list pointed out that it's intended more for high performance than high availability, so it might not be that suitable as an HA solution.
Have you considered trying Gluster instead?
What do Gluster or Lustre offer that the builtin Red Hat Cluster Suite  does not?
less documentation?
On Tue, 3 Aug 2010 at 3:45pm, Lars Hecking wrote
Emmanuel Noobadmin writes:
I haven't used Lustre, but I was also researching it for the same purpose, as shared storage for VMs. I dropped it from consideration in the end after some discussion on the Lustre mailing list pointed out that it's intended more for high performance than high availability, so it might not be that suitable as an HA solution.
Have you considered trying Gluster instead?
What do Gluster or Lustre offer that the builtin Red Hat Cluster Suite does not?
One does not need shared storage for Gluster. Each storage brick has its own storage, and Gluster handles replication/distribution across the nodes. Also, according to RH's site, RHCS is limited to 16 nodes. Gluster has no such limit.
Greetings,
On 8/3/10, Joshua Baker-LePain jlb17@duke.edu wrote:
On Tue, 3 Aug 2010 at 3:45pm, Lars Hecking wrote
nodes. Also, according to RH's site, RHCS is limited to 16 nodes.
Huh! Last time I checked, it was well over 1024 nodes or something like that, considering MRG and RHEV.
Correct me if I am wrong.
Regards,
Rajagopal
On Tue, 2010-08-03 at 13:12 -0400, Joshua Baker-LePain wrote:
Also, according to RH's site, RHCS is limited to 16 nodes. Gluster has no such limit.
--- https://www.redhat.com/archives/linux-cluster/2010-May/msg00003.html
Thus you can have more than 16 nodes, just not supported. A single node is supported for snapshots only, as I understand it now.
John
On Tue, Aug 3, 2010 at 4:42 PM, Emmanuel Noobadmin centos.admin@gmail.com wrote:
On 8/3/10, Rudi Ahlers Rudi@softdux.com wrote:
I'm considering setting up a Lustre cluster as shared storage for our XEN virtual machines, mainly for high availability on our XEN VPS servers.
Does anyone use Lustre in a production environment? What are your opinions/experiences with it?
I haven't used Lustre, but I was also researching it for the same purpose, as shared storage for VMs. I dropped it from consideration in the end after some discussion on the Lustre mailing list pointed out that it's intended more for high performance than high availability, so it might not be that suitable as an HA solution.
Have you considered trying Gluster instead?
Well, I'm really after the speed it offers.
With Lustre, from what I understand, I could use say 3 or 5 or 50 servers to spread the load across them and thus get higher IO. We mainly host shared hosting clients, who often have hundreds or thousands of files in one account, so if their files were "scattered" across multiple servers then access to those files would be quicker.
In terms of high availability, I'm thinking that if I use more servers and thus have less load on each server, then the rate of failure would also be less. I see they have a high availability option, but would also be interested to know what was said about it. Would you care to point me to the specific conversation about this?
On 8/4/10, Rudi Ahlers Rudi@softdux.com wrote:
With Lustre, from what I understand, I could use say 3 or 5 or 50 servers to spread the load across them and thus get higher IO. We mainly host shared hosting clients, who often have hundreds or thousands of files in one account, so if their files were "scattered" across multiple servers then access to those files would be quicker.
One of the problems with Lustre's style of distributed storage, which Gluster points out, is that the bottleneck is the metadata server which tells clients where to find the actual data. Gluster supposedly scales with every client machine added because it doesn't use a metadata server; file locations are determined using some kind of computed hash.
In terms of high availability, I'm thinking that if I use more servers and thus have less load on each server, then the rate of failure would also be less. I see they have a high availability option, but would
The drives would be constantly spinning anyway, so the increase in failure rate probably won't be significant as a result of that. Better to assume things will fail and have a system that's designed to handle that kind of situation with minimum disruptions :)
also be interested to know what was said about it. Would you care to point me to the specific conversation about this?
I don't have a link because it's in my inbox but you might be able to find it "Question on Lustre redundancy/failure features" in Lustre mailing list archive around 28 Jun 2010.
The general gist of it is that Lustre is basically network RAID 0: it relies entirely on the underlying device (e.g. a RAID 1 storage node) for redundancy. If a storage device fails, access to the data is blocked until the device is replaced/rebuilt.
For HPC parallel workloads it probably makes sense: since each work unit is independent, you can just wait for the node to be replaced before processing that data. But in our situation that's a bad idea; just imagine if one data block each from 50 VMs happens to be on that failed node :D
On Tue, Aug 3, 2010 at 9:16 PM, Emmanuel Noobadmin centos.admin@gmail.com wrote:
One of the problems with Lustre's style of distributed storage, which Gluster points out, is that the bottleneck is the metadata server which tells clients where to find the actual data. Gluster supposedly scales with every client machine added because it doesn't use a metadata server; file locations are determined using some kind of computed hash.
But who uses gluster in a production environment then? I have seen fewer posts (both on forums and mailing lists) about Gluster than Lustre.
On Tue, 3 Aug 2010 at 10:04pm, Rudi Ahlers wrote
On Tue, Aug 3, 2010 at 9:16 PM, Emmanuel Noobadmin centos.admin@gmail.com wrote:
One of the problems with Lustre's style of distributed storage, which Gluster points out, is that the bottleneck is the metadata server which tells clients where to find the actual data. Gluster supposedly scales with every client machine added because it doesn't use a metadata server; file locations are determined using some kind of computed hash.
But who uses gluster in a production environment then? I have seen fewer posts (both on forums and mailing lists) about Gluster than Lustre.
I just finished testing a Gluster setup using some of my compute nodes. Based on those results, I'll be ordering 8 storage bricks (25 drives each) to start my storage cluster. I'll be using Gluster to a) replicate frequently used data (e.g. biologic databases) across the whole storage cluster and b) provide a global scratch space. The clients will be the 570 (and growing) nodes of my HPC cluster, and Gluster will be helping to take some of the load off my overloaded NetApp.
They also have a page on their website listing self-reported users http://www.gluster.org/gluster-users/.
On Tue, Aug 3, 2010 at 10:18 PM, Joshua Baker-LePain jlb17@duke.edu wrote:
On Tue, 3 Aug 2010 at 10:04pm, Rudi Ahlers wrote
On Tue, Aug 3, 2010 at 9:16 PM, Emmanuel Noobadmin centos.admin@gmail.com wrote:
One of the problems with Lustre's style of distributed storage, which Gluster points out, is that the bottleneck is the metadata server which tells clients where to find the actual data. Gluster supposedly scales with every client machine added because it doesn't use a metadata server; file locations are determined using some kind of computed hash.
But who uses gluster in a production environment then? I have seen fewer posts (both on forums and mailing lists) about Gluster than Lustre.
I just finished testing a Gluster setup using some of my compute nodes. Based on those results, I'll be ordering 8 storage bricks (25 drives each) to start my storage cluster. I'll be using Gluster to a) replicate frequently used data (e.g. biologic databases) across the whole storage cluster and b) provide a global scratch space. The clients will be the 570 (and growing) nodes of my HPC cluster, and Gluster will be helping to take some of the load off my overloaded NetApp.
They also have a page on their website listing self-reported users http://www.gluster.org/gluster-users/.
-- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF
Thanx for the feedback. This is what I hoped to get from someone running lustre :)
But I guess I'll look at gluster instead.
On Tue, 3 Aug 2010 at 10:26pm, Rudi Ahlers wrote
Thanx for the feedback. This is what I hoped to get from someone running lustre :)
But I guess I'll look at gluster instead.
You may want to head over to the beowulf mailing list -- you've got a better chance of finding Lustre users there.
On Wed, 2010-08-04 at 03:16 +0800, Emmanuel Noobadmin wrote:
I don't have a link because it's in my inbox but you might be able to find it "Question on Lustre redundancy/failure features" in Lustre mailing list archive around 28 Jun 2010.
--- Would this be it? http://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg06952.html
This is something like the virtual storage that Covalent and IBM have. Very costly to implement the right way, and interesting as well.
John
On 8/4/10, JohnS jses27@gmail.com wrote:
Would this be it? http://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg06952.html
Yes, that is my thread :)
This is something like the virtual storage that Covalent and IBM have. Very costly to implement the right way, and interesting as well.
That's what I ended up concluding as well, based on the replies. Effectively, I would need to double up each storage node with a failover node if I really want to guard against machine failure.
It seems a lot cheaper to use gluster where the failover machine can also be an active node. So with a criss-cross arrangement suggested by one of the gluster experts, I could get machine redundancy with only half the physical servers. e.g. S1, S2, S3 with 2 RAID 1 block devices each. S1-A stores data1 and S1-B replicates S2-A, then S2-A stores data2 and S2-B replicates S3-A etc.
Not as fully redundant as 1 for 1 failover but I could achieve that by replicating on another cheap server with N+1 RAID 5 for every N machine.
So gluster seems a lot more flexible and cost effective to me, especially without the need for a dedicated metadata server.
Last but most importantly, it seems easier to recover from since it works on top of the underlying fs, so I figured I can always pull the drives from a dead machine and read the files directly off the disk if really necessary.
My only concern now is the usual split-brain issue and whether ZFS on Linux is mature enough to be used as the underlying fs on CentOS 5.
Greetings,
On 8/4/10, Emmanuel Noobadmin centos.admin@gmail.com wrote:
On 8/4/10, JohnS jses27@gmail.com wrote:
My only concern now is the usual split-brain issue and whether ZFS on Linux is mature enough to be used as the underlying fs on CentOS 5.
Dunno if it's relevant, but are we talking about in-band/power or storage fencing issues for STONITH here?
Because HA, to the best of my knowledge, requires some form of fencing...
and split-brain is the result of a failure of the HA mechanism.
<bit confused ?? !! ???>
Please correct me if I am wrong in any of my understanding..
I am in a learning phase.
But then is ZFS a Cluster filesystem at all like GFS2/OCFS? Haven't studied that angle as yet.
Regards,
Rajagopal
On 08/03/10 11:31 PM, Rajagopal Swaminathan wrote:
But then is ZFS a Cluster filesystem at all like GFS2/OCFS? Haven't studied that angle as yet.
It's not. And, AFAIK, the Linux implementation of ZFS is not very well supported; I certainly wouldn't commit to a project relying on it without a LOT of testing. ZFS is very stable on Solaris, and I understand it's working quite well on FreeBSD.
Supposedly the new btrfs for Linux will be the choice. I'll remain skeptical of that until it's proven itself in varied production environments by others.
Greetings,
On 8/4/10, John R Pierce pierce@hogranch.com wrote:
On 08/03/10 11:31 PM, Rajagopal Swaminathan wrote:
It just triggered an idea: why not leave the storage blocks as CLVM bricks and delegate BMR/restore and such to a lower-level mechanism such as CLVM replication/snapshots? Somebody on this list mentioned sometime back that CLVM supports some of these features as of 5.4 or so... not sure, gotta check.
Regards,
Rajagopal
John R Pierce wrote:
On 08/03/10 11:31 PM, Rajagopal Swaminathan wrote:
But then is ZFS a Cluster filesystem at all like GFS2/OCFS? Haven't studied that angle as yet.
It's not. And, AFAIK, the Linux implementation of ZFS is not very well supported; I certainly wouldn't commit to a project relying on it without a LOT of testing. ZFS is very stable on Solaris, and I understand it's working quite well on FreeBSD.
Supposedly the new btrfs for Linux will be the choice. I'll remain skeptical of that until it's proven itself in varied production environments by others.
I thought there was a small limit on the number of hard links in btrfs - that might make it unsuitable for general purpose use.
On 8/4/10, Rajagopal Swaminathan raju.rajsand@gmail.com wrote:
Dunno if it's relevant, but are we talking about in-band/power or storage fencing issues for STONITH here?
Because HA, to the best of my knowledge, requires some form of fencing...
Typically yes, but it doesn't necessarily require STONITH, since there is also the quorum approach: e.g. with 7 nodes in the cluster, any node which cannot contact at least 3 other nodes considers itself orphaned and, on reconnect, syncs from the majority. So no STONITH, only temporary fencing until inconsistencies are sync'd transparently.
Unfortunately Gluster does not have a quorum mechanism as is. Otherwise along with the self-healing characteristics, it would be ideal for HA storage.
As it is, from my understanding, gluster will block access to ambiguous files until manual intervention deletes all but one desired copy. Might not really be an issue though unless split rates are high. With redundant network switches/paths, this might never be an issue since it should never happen that two isolated nodes are alive but write-able by guest systems to cause two updated but different copies of the same file.
and split-brain is the result of a failure of the HA mechanism.
<bit confused ?? !! ???>
Please correct me if I am wrong in any of my understanding..
I am in a learning phase.
I'm also in the learning process so don't trust my words on this either :)
But then is ZFS a Cluster filesystem at all like GFS2/OCFS? Haven't studied that angle as yet.
ZFS is a local file system as far as I understand it. It comes from Solaris, but there are two efforts to port it to Linux, one in userspace via FUSE and the other in the kernel. It seems like the FUSE approach is more mature and at the moment slightly more desirable from my POV because there's no messing around with kernel patches/recompiles.
The main thing for me is that ZFS comes with block checksum ("ECC") functionality, so it would catch "soft" hardware errors such as a flaky data cable silently corrupting data without any immediately observable effect.
It also has RAID functionality, but I've seen various reports of failed zpools that couldn't be easily recovered. So my most likely configuration is glusterfs on top of ZFS (for the checksumming) on top of mdraid 1 (for redundancy and ease of recovery).
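On each storage node, that stack would look roughly like this (device and pool names are made up, and this assumes the zfs-fuse userspace tools):

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1   # mdraid 1 for redundancy
zpool create tank /dev/md0    # single-device pool; ZFS checksums detect silent corruption
zfs create tank/brick         # the filesystem glusterfs would export as its brick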
Emmanuel Noobadmin wrote:
ZFS is a local file system as far as I understand it. It comes from Solaris, but there are two efforts to port it to Linux, one in userspace via FUSE and the other in the kernel. It seems like the FUSE approach is more mature and at the moment slightly more desirable from my POV because there's no messing around with kernel patches/recompiles.
I thought the GPL on the kernel code would not permit the inclusion of less restrictively licensed code like the CDDL-covered ZFS. For a network share, why not use the OpenSolaris or NexentaStor versions, since you wouldn't be using much else from the system anyway?
The main thing for me is that ZFS comes with block checksum ("ECC") functionality, so it would catch "soft" hardware errors such as a flaky data cable silently corrupting data without any immediately observable effect.
It also has RAID functionality, but I've seen various reports of failed zpools that couldn't be easily recovered. So my most likely configuration is glusterfs on top of ZFS (for the checksumming) on top of mdraid 1 (for redundancy and ease of recovery).
Snapshots and block-level de-dup are other features of zfs - but I think you'll lose that if you wrap anything else over it. Maybe you could overcommit an iscsi export expecting the de-dup to make up the size difference and use that as a block level component of something else.
On 8/4/10, Les Mikesell lesmikesell@gmail.com wrote:
I thought the GPL on the kernel code would not permit the inclusion of less restrictively licensed code like the CDDL-covered ZFS. For a network share, why not use
That's why the FUSE effort is further along: being in userspace, it bypasses the license limits in the sense that it isn't a derivative work of the kernel, or something along those lines.
the OpenSolaris or NexentaStor versions since you wouldn't be using much else from the system anyway.
If I really have to, but I was hoping I wouldn't need to learn another relatively similar OS and get myself confused and do something catastrophic while in console one day. Especially since I'm way behind schedule on picking up another programming language for projects my boss wants me to evaluate.
Snapshots and block-level de-dup are other features of zfs - but I think you'll lose that if you wrap anything else over it. Maybe you could overcommit an iscsi export expecting the de-dup to make up the size difference and use that as a block level component of something else.
Honestly, I have no idea what all that is about until I go read up on it later, although I understand vaguely from past reading that a snapshot is like a backup copy.
However, in my ideal configuration, when a VM host server dies, I just want to be able to start a new VM instance on a surviving machine using the correct VM image/disk file on the network storage and resume full functionality.
Since the bulk of the actual changes are to "files" inside the virtual disk file, having snapshot capabilities on the underlying fs doesn't seem that useful. ZFS checksums ensuring that all sectors/inodes of that image file are error-free seem more critical. Please do point out if I am mistaken though!
On 8/4/2010 10:10 AM, Emmanuel Noobadmin wrote:
derivative work or something along those lines.
the OpenSolaris or NexentaStor versions since you wouldn't be using much else from the system anyway.
If I really have to, but I was hoping I wouldn't need to learn another relatively similar OS and get myself confused and do something catastrophic while in console one day. Especially since I'm way behind schedule on picking up another programming language for projects my boss wants me to evaluate.
That's sort of the point of nexentastor which gives you a web interface to manage the filesystems and sharing since you don't need anything else. But the free community edition only goes to 12 TB. That might be enough per-host if you are going to layer something else on top, though.
Snapshots and block-level de-dup are other features of zfs - but I think you'll lose that if you wrap anything else over it. Maybe you could overcommit an iscsi export expecting the de-dup to make up the size difference and use that as a block level component of something else.
Honestly, I have no idea what all that is about until I go read up on it later, although I understand vaguely from past reading that a snapshot is like a backup copy.
It is good for 2 things - you can snapshot for local 'back-in-time' copies without using extra space, and you can do incremental dump/restores from local to remote snapshots.
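For example (pool/filesystem names are made up):

zfs snapshot tank/vmstore@monday             # instant local back-in-time copy
zfs send tank/vmstore@monday | ssh backuphost zfs receive backup/vmstore
# later, ship only the changes between two snapshots:
zfs send -i monday tank/vmstore@tuesday | ssh backuphost zfs receive backup/vmstore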
However, in my ideal configuration, when a VM host server dies, I just want to be able to start a new VM instance on a surviving machine using the correct VM image/disk file on the network storage and resume full functionality.
The VM host side is simple enough if its disk image is intact. But, if you want to survive a disk server failure you need to have that replicated which seems like your main problem.
Since the bulk of the actual changes are to "files" inside the virtual disk file, having snapshot capabilities on the underlying fs doesn't seem that useful. ZFS checksums ensuring that all sectors/inodes of that image file are error-free seem more critical. Please do point out if I am mistaken though!
If you can tolerate a 'slightly behind' backup copy, you could probably build it on top of zfs snapshot send/receive replication. Nexenta has some sort of high-availability synchronous replication in their commercial product but I don't know the license terms. The part I wonder about in all of these schemes is how long it takes to recover when the mirroring is broken. Even with local md mirrors I find it takes most of a day even with < 1TB drives with other operations becoming impractically slow.
On 8/4/10, Les Mikesell lesmikesell@gmail.com wrote:
That's sort of the point of nexentastor which gives you a web interface to manage the filesystems and sharing since you don't need anything else. But the free community edition only goes to 12 TB. That might be enough per-host if you are going to layer something else on top, though.
12TB should be good enough for most use cases. I'm not planning on going up to petabytes since it seems to me that at some point the network will become the bottleneck. Again, I need to remember to look into NexentaStor.
It is good for 2 things - you can snapshot for local 'back-in-time' copies without using extra space, and you can do incremental dump/restores from local to remote snapshots.
That sounds good... and bad at the same time because I add yet another factor/feature to consider :D
The VM host side is simple enough if its disk image is intact. But, if you want to survive a disk server failure you need to have that replicated which seems like your main problem.
Which is where Gluster comes in with replicate across servers.
If you can tolerate a 'slightly behind' backup copy, you could probably build it on top of zfs snapshot send/receive replication. Nexenta has some sort of high-availability synchronous replication in their commercial product but I don't know the license terms.
That's the thing, I don't think I can tolerate a slightly-behind copy on the system. A transaction, once done, must remain done. A situation where a node fails right after a transaction was completed and output to the user, then recovers to a slightly-behind state where the same transaction is not done or not recorded, is not acceptable for many types of transactions.
The part I wonder about in all of these schemes is how long it takes to recover when the mirroring is broken. Even with local md mirrors I find it takes most of a day even with < 1TB drives with other operations becoming impractically slow.
In most cases, I'd expect the drives to fail before the server. So with the proposed configuration I have, for each set of data, a pair of servers and 2 pairs of mirrored drives. If a server goes down, Gluster handles self-healing, and if I'm not wrong it's smart about it, so it won't be duplicating every single inode. On the drive side, even if one server is heavily impacted by the resync process, the system as a whole likely won't notice it as much since the other server is still at full speed.
I don't know if there's a way to shut down a degraded md array and add a new disk without resyncing/rebuilding. If that's possible, we have a device which can clone a 1TB disk in about 4 hrs, thus reducing the delay to restore full redundancy.
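The closest thing I've come across is md's write-intent bitmap which, if I understand it correctly, only helps when re-adding a disk that was previously a member, not a brand-new replacement (device names below are just for illustration):

mdadm --grow /dev/md0 --bitmap=internal      # track which blocks change while a member is missing
mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
# ... after the same disk comes back ...
mdadm /dev/md0 --re-add /dev/sdb1            # only the blocks dirtied in the meantime get resynced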
Emmanuel Noobadmin wrote, On 08/05/2010 12:40 AM:
That's the thing, I don't think I can tolerate a slightly-behind copy on the system. A transaction, once done, must remain done. A situation where a node fails right after a transaction was completed and output to the user, then recovers to a slightly-behind state where the same transaction is not done or not recorded, is not acceptable for many types of transactions.
You speak of transactions in a way that makes me think you are dealing with databases. If this is the case, then I suggest you take a few searches over to the drbd archives** and look for database issues. IIRC, in some cases you are better off (for speed and admin understanding/sanity) letting the database's built-in replication handle the server-to-server transactional sync rather than trusting a file system or even DRBD to do it, because the DB engine can/will make sure the backup DB server ALSO has the data before reporting the transaction done. Not saying that having the DB on top of gluster or DRBD too would be bad, just suggesting that you may want to have the DB backed by something that fully understands the transactions.
** http://lists.linbit.com/pipermail/drbd-user/
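As a rough illustration of letting the database do it, MySQL's replication boils down to something like this (host names, account and log position are made up; details vary by version):

# master my.cnf
[mysqld]
server-id = 1
log-bin   = mysql-bin

# slave my.cnf
[mysqld]
server-id = 2

-- on the master:
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%' IDENTIFIED BY 'secret';
-- on the slave, using the file/position reported by SHOW MASTER STATUS:
CHANGE MASTER TO MASTER_HOST='db1', MASTER_USER='repl', MASTER_PASSWORD='secret',
  MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=106;
START SLAVE;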
On Thu, 2010-08-05 at 11:04 -0400, Todd Denniston wrote:
You speak of transactions in a way that makes me think you are dealing with databases. If this is the case, then I suggest you take a few searches over to the drbd archives** and look for database issues. IIRC, in some cases you are better off (for speed and admin understanding/sanity) letting the database's built-in replication handle the server-to-server transactional sync rather than trusting a file system or even DRBD to do it, because the DB engine can/will make sure the backup DB server ALSO has the data before reporting the transaction done. Not saying that having the DB on top of gluster or DRBD too would be bad, just suggesting that you may want to have the DB backed by something that fully understands the transactions.
--- Nice analogy. Have you ever done this? Have you done this with separate read/write DBs? How about streaming to a file (constant backup)? The OP is talking about virtual machine images....
John
JohnS wrote, On 08/05/2010 11:24 AM:
On Thu, 2010-08-05 at 11:04 -0400, Todd Denniston wrote:
You speak of transactions in a way that makes me think you are dealing with databases. If this is the case, then I suggest you take a few searches over to the drbd archives** and look for database issues. IIRC ...
...
Not saying that having the DB on top of gluster or DRBD too would be bad, just suggesting that you may want to have the DB backed by something that fully understands the transactions.
Nice analogy. Have you ever done this? Have you done this with separate read/write DBs? How about streaming to a file (constant backup)? The OP is talking about virtual machine images....
The reason I suggested the googling (a few searches) in the drbd list is that I have _read_ the discussions on the list, and recalled that some found it more appropriate for the DB to do the work. I, on the other hand, have fortunately only been an observer of the discussions, not a participant, nor a user of the ideas. i.e. I only have the metadata that there have been some good (well reasoned and polite) discussions of database replication on that list, which I believe would apply equally to a DB on DRBD and to a DB on a replicating file system.
On 8/5/10, Todd Denniston Todd.Denniston@tsb.cranrdte.navy.mil wrote:
You speak of transactions in a way that makes me think you are dealing with databases.
That's part of the application suite. Although we do suggest to clients that they have different servers for each particular use, some of them have budget constraints and fortunately also have low loads, both due largely to the fact that they are usually small operations. So we end up having to set up servers which double/triple up as web/email/application. We are trying to keep things separate by using VMs for each purpose, so that we can eventually migrate them with minimum complications to individual machines when their business grows to that level.
If this is the case, then I suggest you take a few searches over to the drbd archives** and look for database issues. IIRC, in some cases you are better off (for speed and admin understanding/sanity) letting the database's built-in replication handle the server-to-server transactional sync rather than trusting a file system or even DRBD to do it, because the DB engine can/will make sure the backup DB server ALSO has the data before reporting the transaction done. Not saying that having the DB on top of gluster or DRBD too would be bad, just suggesting that you may want to have the DB backed by something that fully understands the transactions.
Definitely running the DBMS' own transaction logging and replication feature is part of the plan.
I know DRBD has the option to report a write as done only when both copies are written, so that's not an issue. However, I thought the 'slightly behind' comment was about using ZFS snapshot send/receive replication? I don't really know the details on that, so I took your word for it that it involves some kind of perceivable delay, likely similar to the several-second timing of delayed allocation. That was what I was responding to as being not acceptable; sorry if I caused any confusion.
On 8/4/2010 11:40 PM, Emmanuel Noobadmin wrote:
It is good for 2 things - you can snapshot for local 'back-in-time' copies without using extra space, and you can do incremental dump/restores from local to remote snapshots.
That sounds good... and bad at the same time because I add yet another factor/feature to consider :D
But even if you have live replicated data you might want historical snapshots and/or backup copies to protect against software/operator failure modes that might lose all of the replicated copies at once.
The VM host side is simple enough if its disk image is intact. But, if you want to survive a disk server failure you need to have that replicated which seems like your main problem.
Which is where Gluster comes in with replicate across servers.
If you can tolerate a 'slightly behind' backup copy, you could probably build it on top of zfs snapshot send/receive replication. Nexenta has some sort of high-availability synchronous replication in their commercial product but I don't know the license terms.
That's the thing, I don't think I can tolerate a slightly-behind copy on the system. A transaction, once done, must remain done. A situation where a node fails right after a transaction was completed and output to the user, then recovers to a slightly-behind state where the same transaction is not done or not recorded, is not acceptable for many types of transactions.
What you want is difficult to accomplish even in a local file system. I think it would be unreasonably expensive (both in speed and cost) to put your entire data store on something that provides both replication and transactional guarantees. I'd like to be convinced otherwise, though... Is it a requirement that you can recover your transactional state after a complete power loss or is it enough to have reached the buffers of a replica system?
The part I wonder about in all of these schemes is how long it takes to recover when the mirroring is broken. Even with local md mirrors I find it takes most of a day even with < 1TB drives with other operations becoming impractically slow.
In most cases, I'd expect the drives to fail before the server. So with the proposed configuration I have, for each set of data, a pair of servers and 2 pairs of mirrored drives. If a server goes down, Gluster handles self-healing, and if I'm not wrong it's smart about it, so it won't be duplicating every single inode. On the drive side, even if one server is heavily impacted by the resync process, the system as a whole likely won't notice it as much since the other server is still at full speed.
I don't see how you can have transactional replication if the servers don't have to stay in sync, or how you can avoid being slowed down by the head motion of a good drive being replicated to a new mirror. There's just some physics involved that don't make sense.
I don't know if there's a way to shut down a degraded md array and add a new disk without resyncing/rebuilding. If that's possible, we have a device which can clone a 1TB disk in about 4 hrs, thus reducing the delay to restore full redundancy.
As far as I know, Linux md devices have to rebuild completely. A RAID 1 will run at full speed with only one member, so you can put off the rebuild for as long as you are willing to go without redundancy, and the rebuild doesn't use much CPU; but during the rebuild the good drive's head has to make a complete pass across the drive and will keep getting pulled back there when running applications need it to be elsewhere.
On 8/6/10, Les Mikesell lesmikesell@gmail.com wrote:
But even if you have live replicated data you might want historical snapshots and/or backup copies to protect against software/operator failure modes that might lose all of the replicated copies at once.
That we already do: daily backups of databases, configurations and, where applicable, website data, kept for two months before dropping to fortnightly archives which are then offloaded and kept for years.
What you want is difficult to accomplish even in a local file system. I think it would be unreasonably expensive (both in speed and cost) to put your entire data store on something that provides both replication and transactional guarantees. I'd like to be convinced otherwise, though... Is it a requirement that you can recover your transactional state after a complete power loss or is it enough to have reached the buffers of a replica system?
For the local side, I can rely on ACID compliant database engines such as InnoDB on MySQL to maintain transactional integrity. What I don't want is this: the transaction is committed on the primary disk and output is sent to the user for something supposedly unique, such as a serial number, and then, before the replication service (in this case the delayed replication of zfs send/receive) kicks in, the primary server dies.
For DRBD and gluster, if I'm not mistaken, unless I deliberately set otherwise, a write must have at least reached the replica buffers before it's considered committed. So this scenario is unlikely to arise, and thus I don't see it as a problem with using them as a machine replication service, as compared to the unknown delay of using zfs send/receive replication.
While I'm using DB as an example, the same issue applies to the VM disk image. The upper layer cannot be told a write is done until it's been at least sent out to the replica system. The way I see it, under DRBD or gluster replicate I would only have a consistency issue if the replica dies after receiving the write, followed by the primary dying after receiving the ack AND reporting the result to the user AND both drives in its mirror dying. I know it's not possible to guarantee 100%, but I can live with that kind of probability as compared to a several-second delay where several transactions/changes could have taken place before a replica receives an update.
In most cases, I'd expect the drives to fail before the server. So with the proposed configuration I have, for each set of data, a pair of servers and 2 pairs of mirrored drives. If a server goes down, Gluster handles self-healing, and if I'm not wrong it's smart about it, so it won't be duplicating every single inode. On the drive side, even if one server is heavily impacted by the resync process, the system as a whole likely won't notice it as much since the other server is still at full speed.
I don't see how you can have transactional replication if the servers don't have to stay in sync, or how you can avoid being slowed down by the head motion of a good drive being replicated to a new mirror. There's just some physics involved that don't make sense.
Sorry for the confusion, I don't mean there's no slowdown, nor do I expect the underlying fs to be responsible for transactional replication. That's the job of the DBMS; I just need the fs replication not to fail in such a way that it could cause transactional integrity issues, as noted in my reply above.
Also, I expect the impact of a rebuild to be smaller, as gluster can be configured (temporarily or permanently) to prefer a particular volume (node) to read from, so responsiveness should still be good (just that the theoretical bandwidth is halved), and it reduces the head motion on the rebuilding node since fewer reads are demanded from it.
As far as I know, Linux md devices have to rebuild completely. A RAID 1
Darn, I was hoping there was an equivalent of the "assemble but do not rebuild" option which I had on fakeraid controllers several years back. But I suppose if we clone the drive externally and throw it back into service, it still helps reduce the degradation window, since it is an identical copy even if md doesn't know it.
On 8/5/2010 12:12 PM, Emmanuel Noobadmin wrote:
What you want is difficult to accomplish even in a local file system. I think it would be unreasonably expensive (both in speed and cost) to put your entire data store on something that provides both replication and transactional guarantees. I'd like to be convinced otherwise, though... Is it a requirement that you can recover your transactional state after a complete power loss or is it enough to have reached the buffers of a replica system?
For the local side, I can rely on ACID compliant database engines such as InnoDB on MySQL to maintain transactional integrity.
If you are going to do that, why not also rely on the database engine's replication which is aware of the transactions? Databases rely on filesystem write ordering and fsync() actually working - things that aren't always reliable locally, much less when clustered.
For DRBD and gluster, if I'm not mistaken, unless I deliberately set otherwise, a write must have at least reached the replica buffers before it's considered committed. So this scenario is unlikely to arise, and thus I don't see it as a problem with using them as a machine replication service, as compared to the unknown delay of using zfs send/receive replication.
But there are lots of ways things can go wrong, and clustering just adds to them. What happens when your replica host dies? Or the network to it, or the disk where you expect the copy to land? And if you don't wait for a sync to disk, what happens if these things break after the remote accepted the buffer copy.
While I'm using DB as an example, the same issue applies to the VM disk image.
The DB will offer a more optimized alternative. A VM image won't. But can you afford to wait for transactional guarantees on all that data that mostly doesn't matter?
The upper layer cannot be told a write is done until it's been at least sent out to the replica system. The way I see it, under DRBD or gluster replicate I would only have a consistency issue if the replica dies after receiving the write, followed by the primary dying after receiving the ack AND reporting the result to the user AND both drives in its mirror dying. I know it's not possible to guarantee 100%, but I can live with that kind of probability as compared to a several-second delay where several transactions/changes could have taken place before a replica receives an update.
So how long do you wait if it is the replica that breaks? And how do you recover/sync later?
I don't see how you can have transactional replication if the servers don't have to stay in sync, or how you can avoid being slowed down by the head motion of a good drive being replicated to a new mirror. There's just some physics involved that don't make sense.
Sorry for the confusion, I don't mean there's no slowdown, nor do I expect the underlying fs to be responsible for transactional replication. That's the job of the DBMS; I just need the fs replication not to fail in such a way that it could cause transactional integrity issues, as noted in my reply above.
That's a lot to ask. I'd like to be convinced it is possible.
On 8/6/10, Les Mikesell lesmikesell@gmail.com wrote:
If you are going to do that, why not also rely on the database engine's replication which is aware of the transactions? Databases rely on filesystem write ordering and fsync() actually working - things that aren't always reliable locally, much less when clustered.
Mostly because I don't need to set this up only for databases. I can't just say "ok, the DBMS can ensure transactional integrity as well as provide remote replication" and ignore the other uses the system has to support.
There's also the secondary consideration that I need to be able to add more storage nodes easily, so it seems to make more sense to use a single technology that can support both requirements.
Of course in the end, budget/tech constraints might mean that I have to cut back somewhere eventually but it doesn't hurt to plan for things and then know what I'm cutting out.
But there are lots of ways things can go wrong, and clustering just adds to them. What happens when your replica host dies? Or the network to it, or the disk where you expect the copy to land? And if you don't wait for a sync to disk, what happens if these things break after the remote accepted the buffer copy.
All the nodes will have a RAID 1 setup, and I also plan on using at least 2 switches to provide network redundancy.
In general, for the planned setup with minimal replication delay, the only real disaster is if all 4 drives die at the same time. Otherwise, I believe only a small window exists where a very specific sequence of failures would cause problems, and even then likely only for one or two transactions due to the time window. However, with a slower replication method like zfs send/receive, which is a command-line thing, the time window grows significantly, and even if the damage is repairable it would take far more time to fix simply because many more transactions could be lost.
The DB will offer a more optimized alternative. A VM image won't.
I'm not quite sure what the connection is here. The database runs within the VM and is stored in the virtual disk. I'm not using VMs as a substitute for database replication but to segregate functionality. In a way, it would also allow me to pursue different redundancy arrangements if the original configuration is not ideal for one of the functions.
But can you afford to wait for transactional guarantees on all that data that mostly doesn't matter?
Possibly, but of course it depends on the results of actual testing once a final configuration is decided. Data integrity, redundancy and availability (during working hours anyway) are more important than absolute performance, since server loads are not usually that high. By the time a customer's load can place significant demands on the hardware, they should also have the budget for more orthodox/proven/expensive solutions :D
So how long do you wait if it is the replica that breaks? And how do you recover/sync later?
I'm not sure what "wait" you are referring to. Is that the wait before the chosen option decides to flag the node as down, or the wait before replacing the replica machine, or the wait until the system is fully redundant again with a sync'd replica?
As for the actual recovery/sync: if a drive fails in a storage node, it would be a straightforward case of replacing the drive and rebuilding the node's RAID array, wouldn't it? If the storage node itself fails, such as a mainboard problem, I'll replace/repair the node and put it back online, leaving gluster to self-heal/resync. Gluster keeps versioning data, so it would only sync changed files, which should be pretty fast.
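From what I've read on the gluster list, you currently kick off that self-heal by walking the mount from a client once the node is back, e.g. (with a made-up mount point):

find /mnt/glusterfs -print0 | xargs -0 stat > /dev/null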
I could also stop both the servers at night, externally clone the drives, edit the necessary conf files on the new replica and so avoid mdraid trying to resync everything.
Sorry for the confusion, I don't mean there's no slowdown, nor do I expect the underlying fs to be responsible for transactional replication. That's the job of the DBMS; I just need the fs replication not to fail in such a way that it could cause transactional integrity issues, as noted in my reply above.
That's a lot to ask. I'd like to be convinced it is possible.
It's not possible, if I'm not wrong; we can always think of a situation or sequence of events that would break things. I'm just trying to pick the one that minimizes the time that window of opportunity exists, hence zfs send/receive would not be a good option for live replication.
On 8/5/2010 3:52 PM, Emmanuel Noobadmin wrote:
The DB will offer a more optimized alternative. A VM image won't.
I'm not quite sure what the connection is here. The database runs within the VM and is stored in the virtual disk. I'm not using VMs as a substitute for database replication but to segregate functionality. In a way, it would also allow me to pursue different redundancy arrangements if the original configuration is not ideal for one of the functions.
Just overall price/performance. If you can separate the parts that need transactional sync-to-replicated-disk from the things that don't, you can throw more resources at the difficult parts of the problem.
So how long do you wait if it is the replica that breaks? And how do you recover/sync later?
I'm not sure what "wait" you are referring to. Is that the wait before the chosen option decides to flag the node as down, or the wait before replacing the replica machine, or the wait until the system is fully redundant again with a sync'd replica?
Both - since these are new and likely scenarios you are introducing.
As for the actual recovery/sync: if a drive fails in a storage node, it would be a straightforward case of replacing the drive and rebuilding the node's RAID array, wouldn't it?
Yes, but that's slow and will affect the speed at which normal writes can happen.
If the storage node itself fails, such as a mainboard problem, I'll replace/repair the node and put it back online, leaving gluster to self-heal/resync. Gluster keeps versioning data, so it would only sync changed files, which should be pretty fast.
That part sounds encouraging at least.