[CentOS] who uses Lustre in production with virtual machines?

Thu Aug 5 04:40:13 UTC 2010

On 8/4/10, Les Mikesell <lesmikesell at gmail.com> wrote:
> That's sort of the point of nexentastor which gives you a web interface
> to manage the filesystems and sharing since you don't need anything
> else.  But the free community edition only goes to 12 TB.  That might be
> enough per-host if you are going to layer something else on top, though.

12TB should be good enough for most use cases. I'm not planning on
going up to petabytes, since it seems to me that at some point the
network will become the bottleneck. Again, I need to remember to look
into nexentastor.

> It is good for 2 things - you can snapshot for local 'back-in-time'
> copies without using extra space, and you can do incremental
> dump/restores from local to remote snapshots.

That sounds good... and bad at the same time because I add yet another
factor/feature to consider :D
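
To make the incremental dump/restore idea concrete for myself, here is a
rough sketch of the command lines one replication step might involve. It
only builds the strings and runs nothing. The dataset, snapshot and host
names (tank/vm, monday, tuesday, backuphost, backup/vm) are made up for
illustration, and I have not tested this against nexentastor.

```python
import shlex


def incremental_send_commands(dataset, prev_snap, new_snap,
                              remote_host, remote_dataset):
    """Build (but do not run) the commands for one incremental step.

    All names passed in are hypothetical examples, not a real setup.
    """
    new_full = shlex.quote(f"{dataset}@{new_snap}")
    prev_full = shlex.quote(f"{dataset}@{prev_snap}")

    # 1. Take a new local snapshot (the 'back-in-time' copy).
    take_snapshot = f"zfs snapshot {new_full}"

    # 2. Send only the changes between the previous and the new snapshot,
    #    and receive them into the dataset on the remote host.
    send = f"zfs send -i {prev_full} {new_full}"
    receive = (f"ssh {shlex.quote(remote_host)} "
               f"zfs receive {shlex.quote(remote_dataset)}")

    return [take_snapshot, f"{send} | {receive}"]


for line in incremental_send_commands("tank/vm", "monday", "tuesday",
                                      "backuphost", "backup/vm"):
    print(line)
```

Anything written after the last snapshot exists only on the local side
until the next run, so the remote copy is always somewhat behind.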

> The VM host side is simple enough if its disk image is intact.  But, if
> you want to survive a disk server failure you need to have that
> replicated which seems like your main problem.

Which is where Gluster comes in, with replication across servers.

> If you can tolerate a 'slightly behind' backup copy, you could probably
> build it on top of zfs snapshot send/receive replication.   Nexenta has
> some sort of high-availability synchronous replication in their
> commercial product but I don't know the license terms.

That's the thing, I don't think I can tolerate a slightly behind copy
on the system. A transaction, once done, must remain done. A
situation where a node fails right after a transaction was completed
and reported to the user, then recovers to a slightly behind state
where the same transaction is not done or not recorded, is not
acceptable for many types of transactions.
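
A toy model of what I mean, purely for illustration. It is not how
Gluster or Nexenta actually replicate; the Replica class and the
function names are invented here. The point is only when the "ack" goes
back to the user relative to when the second copy is written.

```python
class Replica:
    """A stand-in for one storage server; it just keeps a list of writes."""

    def __init__(self):
        self.log = []


def commit_sync(primary, secondary, txn):
    # Synchronous: both copies are written before the user is told "done".
    primary.log.append(txn)
    secondary.log.append(txn)
    return "ack"


def commit_async(primary, secondary, txn, pending):
    # Asynchronous: the user is told "done" after the primary write only.
    # The secondary is not touched here; it catches up later from 'pending'.
    primary.log.append(txn)
    pending.append(txn)
    return "ack"


def lost_on_failover(primary, secondary):
    # Transactions that were acknowledged on the primary but are missing
    # on the secondary if the primary dies right now.
    return [t for t in primary.log if t not in secondary.log]


p, s = Replica(), Replica()
commit_sync(p, s, "payment-1")
print(lost_on_failover(p, s))      # nothing acknowledged is missing

p2, s2, queue = Replica(), Replica(), []
commit_async(p2, s2, "payment-1", queue)
print(lost_on_failover(p2, s2))    # the acknowledged transaction is missing
```

The second case is exactly the "done, then not done" situation I can't
accept.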

> The part I wonder about in all of these schemes is how long it takes
> to recover when the mirroring is broken.  Even with local md mirrors I
> find it takes most of a day even with < 1TB drives with other
> operations becoming impractically slow.

In most cases, I'd expect the drives to fail before the server does.
So with the proposed configuration I have, for each set of data, a
pair of servers and 2 pairs of mirrored drives. If a server goes
down, Gluster handles self-healing and, if I'm not wrong, it's smart
about it, so it won't be duplicating every single inode. On the drive
side, even if one server is heavily impacted by the resync process,
the system as a whole likely won't notice it as much, since the other
server is still at full speed.

I don't know if there's a way to shut down a degraded md array and add
a new disk without a resync/rebuild. If that's possible, we have a
device which can clone a 1TB disk in about 4 hrs, which would reduce
the time needed to restore full redundancy.
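
For a sense of scale, a quick back-of-the-envelope calculation of what
"1TB in about 4 hrs" means as a sustained rate. This assumes decimal
units (1TB = 10^12 bytes) and a constant copy speed, both of which are
my simplifications.

```python
def clone_throughput_mb_s(size_tb, hours):
    """Average rate in MB/s needed to copy size_tb terabytes in 'hours'."""
    total_bytes = size_tb * 10**12   # assuming decimal terabytes
    seconds = hours * 3600
    return total_bytes / seconds / 10**6


def hours_to_copy(size_tb, rate_mb_s):
    """Hours needed to copy size_tb terabytes at a steady rate_mb_s."""
    total_mb = size_tb * 10**6
    return total_mb / rate_mb_s / 3600


print(round(clone_throughput_mb_s(1, 4), 1))   # 69.4
print(round(hours_to_copy(1, 69.4), 1))        # 4.0
```

So the cloning device works out to roughly 69 MB/s sustained, which is
the figure any md resync would have to beat to be the faster option.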