[CentOS] who uses Lustre in production with virtual machines?

Emmanuel Noobadmin centos.admin at gmail.com
Thu Aug 5 20:52:07 UTC 2010


On 8/6/10, Les Mikesell <lesmikesell at gmail.com> wrote:

> If you are going to do that, why not also rely on the database engine's
> replication which is aware of the transactions?   Databases rely on
> filesystem write ordering and fsync() actually working - things that
> aren't always reliable locally, much less when clustered.

Mostly because I don't need to set this up only for databases. I
can't just say "ok, the DBMS can ensure transactional integrity as
well as provide remote replication" and ignore the other uses the
system has to support.

There is also the secondary consideration that I need to be able to
add more storage nodes easily, so it seems to make more sense to use a
single technology that can support both requirements.

Of course, in the end, budget/tech constraints might mean that I have
to cut back somewhere eventually, but it doesn't hurt to plan for
things and then know what I'm cutting out.

> But there are lots of ways things can go wrong, and clustering just adds
> to them.  What happens when your replica host dies?  Or the network to
> it, or the disk where you expect the copy to land?  And if you don't
> wait for a sync to disk, what happens if these things break after the
> remote accepted the buffer copy.

All the nodes will have a RAID 1 setup, and I also plan on using at
least two switches to provide network redundancy.
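For reference, the RAID 1 pair on each node would be built with mdadm along these lines; a minimal sketch, assuming /dev/sda1 and /dev/sdb1 are the member partitions (the device names are illustrative, not from my actual config):

```shell
# Create a two-disk RAID 1 array (device names are examples)
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

# Watch the initial sync progress
cat /proc/mdstat

# Persist the array definition so it assembles at boot (CentOS path)
mdadm --detail --scan >> /etc/mdadm.conf
```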

In general, for the planned setup with minimal replication delay, the
only real disaster is if all 4 drives die at the same time. Otherwise,
I believe only a small window exists in which a very specific sequence
of failures would cause problems, and even then likely only for one or
two transactions because the window is so short. However, with a
slower replication method like zfs send/receive, which is a
command-line operation, the window enlarges significantly; even if the
damage is repairable, it would take far more time to fix simply
because many more transactions could be lost.
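To illustrate why the window is larger: zfs send/receive replicates point-in-time snapshots, so anything written after the last snapshot is lost if the primary dies, and the loss window equals the snapshot interval. A rough sketch (the pool/dataset/host names are made up for illustration):

```shell
# Take a point-in-time snapshot on the primary (names are illustrative)
zfs snapshot tank/data@2010-08-06-0300

# Ship it to the replica host; only data up to the snapshot transfers
zfs send tank/data@2010-08-06-0300 | ssh replica zfs receive tank/data

# Later runs send just the delta between two snapshots
zfs send -i tank/data@2010-08-06-0300 tank/data@2010-08-06-0400 \
    | ssh replica zfs receive tank/data
```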

> The DB will offer a more optimized alternative. A VM image won't.

I'm not quite sure what the connection is here. The database runs
within the VM and is stored in the virtual disk. I'm not using VMs to
substitute for database replication but to segregate functionality.
In a way, it would also allow me to pursue different redundancy
arrangements if the original configuration turns out not to be ideal
for one of the functions.

> But can you afford to wait for transactional guarantees on all that data
> that mostly doesn't matter?

Possibly, but of course it depends on the results of actual testing
once a final configuration is decided. Data integrity, redundancy and
availability (during working hours anyway) are more important than
absolute performance, since server load is not usually that high. By
the time a customer's load can place significant demands on the
hardware, they should also have the budget for more
orthodox/proven/expensive solutions :D

> So how long do you wait if it is the replica that breaks?  And how do
> you recover/sync later?

I'm not sure which "wait" you are referring to. Is it the wait before
the chosen option decides to flag the node as down, the wait before
replacing the replica machine, or the wait until the system is fully
redundant again with a synced replica?

As for the actual recovery/sync: if a drive fails in a storage node,
it would be a straightforward case of replacing the drive and
rebuilding the node's RAID array, wouldn't it? If the storage node
itself fails, such as with a mainboard problem, I'll replace/repair
the node and put it back online, leaving Gluster to self-heal/resync.
Gluster keeps versioning data, so it would only sync the files that
changed, which should be pretty fast.
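If I understand the Gluster docs correctly, with current versions the self-heal is triggered when files are looked up through a client mount, so after the repaired node is back online a full walk of the mount forces the resync (the mount point below is illustrative):

```shell
# Walk the client mount; stat-ing each file makes Gluster compare the
# replicas and heal any files that changed while the node was down
find /mnt/gluster -noleaf -print0 | xargs -0 stat > /dev/null
```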

I could also stop both servers at night, externally clone the drives,
edit the necessary conf files on the new replica, and so avoid mdraid
trying to resync everything.


>> Sorry for the confusion, I don't mean no slow down or expect the
>> underlying fs to be responsible for transactional replication. That's
>> the job of the DBMS, I just need the fs replication not to fail in
>> such a way that it could cause transactional integrity issue as noted
>> in my reply above.
>
> That's a lot to ask.  I'd like to be convinced it is possible.

It's probably not possible, if I'm not wrong; we can always think of a
situation or sequence of events that would break things. I'm just
trying to pick the option that minimizes the time that window of
opportunity exists, which is why zfs send/receive would not be a good
option for live replication.
