- Rsyncing the VMs while they are running leaves them in an inconsistent state. That state may or may not be worse than a simple crash. One way I have been getting around this is to create a snapshot of the VM before performing the rsync and, when bringing up the copy after a crash, revert to the snapshot. That at least gives you a consistent filesystem and memory state, but it can cause issues with network connections. I usually reboot the VM cleanly after reverting to the snapshot.
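If we went that route, the whole sequence looks scriptable with vmrun. A rough sketch only, with made-up paths, and assuming the vmrun build in use supports the snapshot commands (Workstation's does; I'd have to verify VMware Server's):

    # take a point-in-time snapshot before syncing
    vmrun -T ws snapshot /vmstore/app1/app1.vmx pre-rsync

    # copy the VM directory (snapshot files included) to the standby box
    rsync -av --inplace --progress /vmstore/app1/ standby:/vmstore/app1/

    # on the standby, after the primary dies: revert, start, then
    # reboot the guest cleanly to clear stale network state
    vmrun -T ws revertToSnapshot /vmstore/app1/app1.vmx pre-rsync
    vmrun -T ws start /vmstore/app1/app1.vmx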
The problem with snapshots is that the data reverts to whatever it was at the point the snapshot was taken. The client can accept waiting 3-4 hours for their servers to be fixed every now and then; it's a mess of several servers, some almost 10 years old, that we inherited from their past vendors.
That's why they would readily accept even an hour of downtime while a VM image is transferred, but they will not accept having to redo work. Even if they were willing, it isn't possible because of the server-generated number sequences: numbers already handed out to their clients would not likely match the ones regenerated after a restore to an older snapshot.
Rsync will not transfer the entire file when transferring over the network. It scans the whole thing and only sends the changes. If you have --progress enabled it will appear to go through the whole file, but you will see the "speedup" figure go much higher than on a regular transfer. However, this process can sometimes take more time than a full copy over a local network; rsync is meant to conserve bandwidth, not necessarily time. Also, I suggest that you use a gigabit network if you have the option. If not, you could directly link the network ports of the two servers and copy straight from one to the other.
They already have gigabit switches, so that's not a problem as long as rsync works incrementally on the images as well.
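Something like the following is what I have in mind for the image sync. A sketch only; the paths are made up, and --inplace wants a reasonably recent rsync on both ends:

    # initial full copy (slow, but only done once); --sparse keeps
    # holes in the .vmdk files from being expanded on the far side
    rsync -av --sparse --progress /vmstore/ standby:/vmstore/

    # later runs: send only the changed blocks and rewrite the images
    # in place (older rsync refuses --sparse together with --inplace,
    # so it is dropped here); --stats prints the speedup figure
    rsync -av --inplace --partial --stats --progress /vmstore/ standby:/vmstore/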
At the same time, I have reservations about such a hack, so I'm also exploring the other possibility of implementing a two-machine Lustre cluster and running all the images from that storage cluster instead. That would take an extra machine, but it's still more viable than the 2x option, and much faster to get back up.
If you are looking at VMware Server for this, here are some tips:
- For best performance, search around for "vmware tmpfs". It will dramatically increase the performance of the VMs at the expense of some memory.
Thanks for the tip.
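For the archives, the trick seems to boil down to backing VMware's working area with RAM. From what I've read so far (untested here, sizes and paths are just examples):

    # mount a RAM-backed filesystem for VMware's temp/working files
    mkdir -p /tmp/vmware
    mount -t tmpfs -o size=2g tmpfs /tmp/vmware

    # then point each VM at it by adding to its .vmx file:
    #   tmpDirectory = "/tmp/vmware"
    #   mainMem.useNamedFile = "FALSE"
    # so the guest memory file lives in RAM instead of hitting disk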
- VMware Server seems like it's EOL, even though VMware hasn't specifically said so yet
- There is a bug in VMware with CentOS that causes guests to slowly use more CPU until the whole machine is bogged down. This can be fixed by restarting or suspending/resuming each VM
That explains a puzzling, seemingly random freeze-up we get on a particular test system. I guess the randomness was because we do suspend/restart the machine every now and then, so it didn't always bog down to the point where we'd notice.
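If it is that bug, the workaround could at least be run from cron until a real fix shows up. Roughly, with made-up paths and the same vmrun caveat as above:

    # cycle each guest to reset the creeping CPU usage
    for vmx in /vmstore/*/*.vmx; do
        vmrun -T ws suspend "$vmx"
        vmrun -T ws start "$vmx"
    done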
Thanks again for the responses :)