[CentOS] Virtualization as cheap redundancy option?

Fri Jun 25 16:28:30 UTC 2010
Emmanuel Noobadmin <centos.admin at gmail.com>

> - Rsyncing the VMs while they are running leaves them in an
> inconsistent state.  This state may or may not be worse than a simple
> crash situation.  One way I have been getting around this is by
> creating a snapshot of the VM before performing the rsync, and when
> bringing up the copy after a crash, revert to the snapshot.  That will
> at least give you consistent filesystem and memory state, but could
> cause issues with network connections.  I usually reboot the VM
> cleanly after reverting to the snapshot.

The problem with reverting to a snapshot is that the data goes back to
whatever it was at the point of the snapshot. The client can accept
waiting 3~4 hrs for their servers to be fixed every now and then. It's
a mess of several servers, some almost 10 years old, that we inherited
from their past vendors.

Which is why they would readily accept even 1hr of downtime while a VM
image is transferred. But they will not accept having to redo work.
Even if they were willing, it isn't possible because of the
server-generated number sequences: numbers already in use by their
clients would likely not match the new numbers issued after a restore
to an older snapshot.
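
To make the numbering problem concrete, here is a minimal Python sketch.
The InvoiceCounter class is purely hypothetical and stands in for
whatever the real servers use to hand out sequence numbers. It only
shows that numbers issued after the snapshot get issued a second time
once the VM is reverted.

```python
import copy


class InvoiceCounter:
    """Hypothetical stand-in for a server that hands out sequential numbers."""

    def __init__(self):
        self.last_issued = 0

    def next_number(self):
        self.last_issued += 1
        return self.last_issued


server = InvoiceCounter()
before_snapshot = [server.next_number() for _ in range(3)]  # issued before the snapshot

snapshot = copy.deepcopy(server)  # state captured at snapshot time

# Numbers issued after the snapshot; assume clients are already using them.
after_snapshot = [server.next_number() for _ in range(2)]

server = copy.deepcopy(snapshot)  # crash, then revert to the snapshot
reissued = [server.next_number() for _ in range(2)]

# Any number in both lists has been handed out twice.
collisions = sorted(set(after_snapshot) & set(reissued))
print("numbers handed out twice:", collisions)
```

Every number issued between the snapshot and the crash ends up in
collisions, which is exactly the work the client cannot redo.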

> Rsync will not transfer the entire file when transferring over the
> network.  It scans the whole thing and only sends changes.  If you
> have --progress enabled it will appear to go through the whole file,
> but you will see the "speedup" go much higher than a regular transfer.
>  However, sometimes this process can take more time than doing a full
> copy on a local network.  Rsync is meant to conserve bandwidth, not
> necessarily time.  Also, I suggest that you use a gigabit network if you
> have the option.  If not you could directly link the network ports on
> 2 servers and copy straight from 1 to the other.

They already have gigabit switches, so that's not a problem if rsync
works incrementally on images as well.
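
For what "incrementally" means here, below is a rough Python sketch of
block-wise change detection. It is a simplification and not rsync's
actual algorithm: rsync pairs a rolling weak checksum with a strong one
so it can also match data that has shifted position, whereas this sketch
only compares blocks at the same offset. The block size, the hash
function, and the function names are all assumptions made for
illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative block size, not rsync's actual choice


def block_digests(data, block_size=BLOCK_SIZE):
    """Return one digest per fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old, new, block_size=BLOCK_SIZE):
    """Indexes of blocks in `new` that differ from the same block in `old`."""
    old_digests = block_digests(old, block_size)
    new_digests = block_digests(new, block_size)
    return [
        i for i, digest in enumerate(new_digests)
        if i >= len(old_digests) or digest != old_digests[i]
    ]


# Toy "disk image": 256 zeroed blocks, then two blocks modified in place.
old_image = bytes(256 * BLOCK_SIZE)
new_image = bytearray(old_image)
new_image[10 * BLOCK_SIZE + 5] = 0xFF
new_image[200 * BLOCK_SIZE] = 0x01
new_image = bytes(new_image)

dirty = changed_blocks(old_image, new_image)
print("blocks to send:", len(dirty), "of", len(block_digests(new_image)))
```

Note that both copies still have to be read in full to compute the
digests, which matches the point above that this saves bandwidth rather
than time.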

At the same time, I do have reservations about such a hack, so I'm also
exploring the other possibility of implementing a 2-machine Lustre
cluster and running all images from that storage cluster instead. That
would take an extra machine but is still more viable than the 2x option
and much faster to get back up.

> If you are looking at VMware Server for this, here are some tips:
> - For best performance, search around for "vmware tmpfs".  It will
> dramatically increase the performance of the VMs at the expense of
> some memory.

Thanks for the tip.

> - VMware Server seems like it's EOL, even though VMware hasn't
> specifically said so yet
> - There is a bug in VMware with CentOS that causes guests to slowly
> use more CPU until the whole machine is bogged down.  This can be
> fixed by restarting or suspending/resuming each VM

That explains a puzzling, seemingly random freeze-up we get with a
particular test system. I guess the random part was because we do
suspend/restart the machine every now and then, so it didn't always bog
down to the point we'd notice.

Thanks again for the responses :)