[CentOS] who uses Lustre in production with virtual machines?

Thu Aug 5 16:10:20 UTC 2010

On 8/4/2010 11:40 PM, Emmanuel Noobadmin wrote:
>
>> It is good for 2 things - you can snapshot for local 'back-in-time'
>> copies without using extra space, and you can do incremental
>> dump/restores from local to remote snapshots.
>
> That sounds good... and bad at the same time because I add yet another
> factor/feature to consider :D

But even if you have live replicated data you might want historical 
snapshots and/or backup copies to protect against software/operator 
failure modes that might lose all of the replicated copies at once.

>> The VM host side is simple enough if its disk image is intact.  But, if
>> you want to survive a disk server failure you need to have that
>> replicated which seems like your main problem.
>
> Which is where Gluster comes in with replicate across servers.
>
>
>> If you can tolerate a 'slightly behind' backup copy, you could probably
>> build it on top of zfs snapshot send/receive replication.   Nexenta has
>> some sort of high-availability synchronous replication in their
>> commercial product but I don't know the license terms.
>
> That's the thing, I don't think I can tolerate a slightly behind copy
> on the system. The transaction once done, must remain done. A
> situation where a node fails right after a transaction was done and
> output to user, then recovered to a slightly behind state where the
> same transaction is then not done or not recorded, is not acceptable
> for many types of transaction.

What you want is difficult to accomplish even in a local file system.  I 
think it would be unreasonably expensive (both in speed and cost) to put 
your entire data store on something that provides both replication and 
transactional guarantees.   I'd like to be convinced otherwise, 
though...   Is it a requirement that you can recover your transactional 
state after a complete power loss or is it enough to have reached the 
buffers of a replica system?
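
To make the local half of that question concrete, here is a rough sketch
of what "done must remain done" costs on a single POSIX machine. The
function name and the journal-file layout are made up for illustration;
this is not how any particular database or Gluster does it. The point is
only that every commit has to wait for the disk before you can tell the
user it succeeded:

```python
import os


def durable_append(path, record):
    """Append `record` (bytes) and return only after fsync has completed.

    Hypothetical helper for illustration; assumes a POSIX system where a
    directory can be opened and fsync'ed.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        view = memoryview(record)
        while len(view) > 0:
            # os.write may write fewer bytes than asked, so loop until done
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to flush this file's data to the storage device.
        # This is the slow step: the caller waits on the disk.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Flush the directory entry as well, so a newly created file name is
    # not lost if power fails right after the first append.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A synchronously replicated store has to do the equivalent on every
replica, plus a network round trip, before acknowledging the write. If
reaching the replica's buffers is good enough, the remote fsync can be
skipped, which is the trade-off I'm asking about.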

>> The part I wonder about in all of these schemes is how long it takes to recover
>> when the mirroring is broken.  Even with local md mirrors I find it
>> takes most of a day even with < 1TB drives, with other operations
>> becoming impractically slow.
>
> In most cases, I expect the drives would fail before the
> server. So with the proposed configuration, I have for each set of
> data, a pair of servers and 2 pairs of mirrored drives. If a server goes
> down, Gluster handles self healing and if I'm not wrong, it's smart
> about it so won't be duplicating every single inode. On the drive
> side, even if one server is heavily impacted by the resync process,
> the system as a whole likely won't notice it as much since the other
> server is still at full speed.

I don't see how you can have transactional replication if the servers
don't have to stay in sync, or how you can avoid being slowed down by
the head motion of a good drive being replicated to a new mirror.
The physics of it just doesn't work out.

> I don't know if there's a way to shutdown a degraded md array and add
> a new disk without resyncing/building. If that's possible, we have a
> device which can clone a 1TB disk in about 4 hrs thus reducing the
> delay to restore full redundancy.

As far as I know, Linux md devices have to rebuild completely.  A raid1
will run at full speed with only one member, so you can put off the
rebuild for as long as you are willing to go without redundancy.  The
rebuild doesn't use much CPU, but while it runs the good drive's head
has to make a complete pass across the drive, and it keeps getting
pulled back to that pass whenever running applications need it to be
elsewhere.
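
For a rough sense of scale, the numbers in this thread can be turned
into effective copy rates. The 4 hour figure is the one quoted above for
the cloning device; the 20 hours is only my stand-in for "most of a
day", and 1TB is taken as 10^12 bytes. The function names are just for
this sketch:

```python
def effective_mb_per_s(size_bytes, hours):
    """Average throughput in MB/s needed to copy size_bytes in `hours`."""
    return size_bytes / (hours * 3600) / 1e6


def hours_to_copy(size_bytes, mb_per_s):
    """Hours needed to copy size_bytes at a steady mb_per_s."""
    return size_bytes / (mb_per_s * 1e6) / 3600


ONE_TB = 1e12  # assumption: decimal terabyte

# Offline clone of an idle disk: 1TB in about 4 hours (figure from the thread)
clone_rate = effective_mb_per_s(ONE_TB, 4)

# Rebuild on a busy array: assumed 20 hours for "most of a day"
busy_rate = effective_mb_per_s(ONE_TB, 20)

print(f"offline clone: {clone_rate:.1f} MB/s")
print(f"busy rebuild:  {busy_rate:.1f} MB/s")
print(f"slowdown:      {clone_rate / busy_rate:.1f}x")
```

That works out to roughly 69 MB/s for the offline clone against roughly
14 MB/s for the busy rebuild under those assumptions. The difference is
the seeking: the sequential pass keeps getting interrupted.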

-- 
   Les Mikesell
    lesmikesell at gmail.com