[CentOS] who uses Lustre in production with virtual machines?

Thu Aug 5 17:12:47 UTC 2010
Emmanuel Noobadmin <centos.admin at gmail.com>

On 8/6/10, Les Mikesell <lesmikesell at gmail.com> wrote:
> But even if you have live replicated data you might want historical
> snapshots and/or backup copies to protect against software/operator
> failure modes that might lose all of the replicated copies at once.

That we already do: daily backups of databases, configurations and,
where applicable, website data. These are kept for two months before
dropping to fortnightly archives, which are then offloaded and kept
for years.
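
Roughly along these lines, with paths, host names and retention
numbers made up for illustration:

    # nightly: dump, compress, prune anything older than ~2 months
    mysqldump --single-transaction --all-databases | gzip \
        > /backup/db-$(date +%F).sql.gz
    find /backup -name 'db-*.sql.gz' -mtime +60 -delete
    # fortnightly: ship the newest dump to the archive box for long-term keeping
    rsync -a "$(ls -t /backup/db-*.sql.gz | head -1)" archive:/archives/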

> What you want is difficult to accomplish even in a local file system.  I
> think it would be unreasonably expensive (both in speed and cost) to put
> your entire data store on something that provides both replication and
> transactional guarantees.   I'd like to be convinced otherwise,
> though...   Is it a requirement that you can recover your transactional
> state after a complete power loss or is it enough to have reached the
> buffers of a replica system?

On the local side, I can rely on ACID-compliant database engines such
as InnoDB on MySQL to maintain transactional integrity. What I don't
want is the case where a transaction is committed on the primary disk
and output is sent to the user for something supposedly unique, such
as a serial number, and then the primary server dies before the
replication service (in this case, the delayed replication of zfs
send/receive) kicks in.
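
To be clear about where that window comes from, the zfs replication I
have in mind is the usual snapshot-and-send loop, something like the
sketch below (pool, dataset and snapshot names are made up):

    # on the primary, run every N minutes
    zfs snapshot tank/db@now
    zfs send -i tank/db@prev tank/db@now | ssh replica zfs receive -F tank/db
    # anything committed after @now was taken exists only on the primary
    # until the next cycle; that is the gap I want to avoid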

For DRBD and gluster, if I'm not mistaken, unless I deliberately set
it otherwise, a write must have at least reached the replica's buffers
before it is considered committed. So this scenario is unlikely to
arise, and I don't see it as a problem with using them as the machine
replication service, compared with the unknown delay of a zfs
send/receive replicate.
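
For reference, this is the DRBD behaviour I mean: protocol B acks
once the write has reached the peer's buffer, protocol C only once it
has hit the peer's disk. A minimal resource definition might look like
this (host names, devices and addresses are made up):

    resource r0 {
      protocol C;              # ack only after the peer has written to disk
      on node-a {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   10.0.0.1:7788;
        meta-disk internal;
      }
      on node-b {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   10.0.0.2:7788;
        meta-disk internal;
      }
    }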

While I'm using a DB as an example, the same issue applies to the VM
disk image: the upper layer cannot be told a write is done until it
has at least been sent out to the replica system. The way I see it,
under DRBD or gluster replicate I would only have a consistency issue
if the replica dies after receiving the write, followed by the primary
dying after receiving the ack AND reporting the result to the user,
AND both drives in its mirror dying. I know it's not possible to
guarantee 100%, but I can live with that kind of probability, compared
with a delay of several seconds in which several transactions/changes
could have taken place before a replica receives an update.

>> In most cases, I'll expect the drives would fail first than the
>> server. So with the propose configuration, I have for each set of
>> data, a pair of server and 2 pairs of mirror drives. If server goes
>> down, Gluster handles self healing and if I'm not wrong, it's smart
>> about it so won't be duplicating every single inode. On the drive
>> side, even if one server is heavily impacted by the resync process,
>> the system as a whole likely won't notice it as much since the other
>> server is still at full speed.
>
> I don't see how you can have transactional replication if the servers
> don't have to stay in sync, or how you can avoid being slowed down by
> the head motion of a good drive being replicated to a new mirror.
> There's just some physics involved that don't make sense.

Sorry for the confusion, I don't mean there would be no slowdown, nor
do I expect the underlying fs to be responsible for transactional
replication. That's the job of the DBMS; I just need the fs
replication not to fail in such a way that it could cause the
transactional integrity issue noted in my reply above.

Also, I expect the impact of a rebuild to be smaller because gluster
can be configured (temporarily or permanently) to prefer reads from a
particular volume (node). Responsiveness should still be good (just
that the theoretical read bandwidth is halved), and head motion on the
rebuilding node is reduced since fewer reads are demanded from it.
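
What I had in mind is the replicate translator's read-subvolume option
in the client volfile, roughly like this (volume and brick names are
made up, and I'd have to double-check the exact option name against
the gluster version we end up running):

    volume mirror-0
      type cluster/replicate
      # prefer reads from the healthy node while the other one resyncs
      option read-subvolume brick-a
      subvolumes brick-a brick-b
    end-volume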

> As far as I know, linux md devices have to rebuild completely.  A raid1

Darn, I was hoping there was an equivalent of the "assemble but do
not rebuild" option which I had on fakeraid controllers several years
back. But I suppose if we clone the drive externally and throw it back
into service, it still helps reduce the degradation window, since it
is an identical copy even if md doesn't know it.