On Fri, Jun 25, 2010 at 12:28 PM, Emmanuel Noobadmin
<centos.admin at gmail.com> wrote:
>> - Rsyncing the VMs while they are running leaves them in an
>> inconsistent state. This state may or may not be worse than a simple
>> crash situation. One way I have been getting around this is by
>> creating a snapshot of the VM before performing the rsync and, when
>> bringing up the copy after a crash, reverting to the snapshot. That will
>> at least give you a consistent filesystem and memory state, but could
>> cause issues with network connections. I usually reboot the VM
>> cleanly after reverting to the snapshot.
>
> The problem with doing snapshots is that the data reverts to whatever it
> was at the point of the snapshot. The client can accept waiting
> 3~4 hrs for their servers to be fixed every now and then. It's a mess
> of several servers, some almost 10 yrs old, that we inherited from
> their past vendors.
>
> Which is why they would readily accept even 1 hr of downtime for a VM
> image to be transferred. But they will not accept the need to redo
> work. Even if they were willing, it isn't possible because of the
> server-generated number sequences that would already be in use by their
> clients but would not likely match the new numbers after a restore to
> an older snapshot.

You cannot run rsync on a continuous basis, so I think you have your
answer there. Even running it once an hour isn't going to work, as the
copied machine will be inconsistent (very bad disk corruption).

It sounds like you need to get some new servers anyway, so DRBD is
probably the way you need to go. Either that or a dedicated SAN or
SAN-like device.

>> Rsync will not transfer the entire file when transferring over the
>> network. It scans the whole thing and only sends changes. If you
>> have --progress enabled it will appear to go through the whole file,
>> but you will see the "speedup" go much higher than a regular transfer.
>> However, sometimes this process can take more time than doing a full
>> copy on a local network. Rsync is meant to conserve bandwidth, not
>> necessarily time. Also, I suggest that you use a GB network if you
>> have the option. If not, you could directly link the network ports on
>> the two servers and copy straight from one to the other.
>
> They already have GB switches, so that's not a problem if rsync works
> incrementally on images as well.
>
> At the same time, I have reservations about such a hack, so I'm also
> exploring the other possibility of implementing a 2-machine Lustre
> cluster and running all images from that storage cluster instead. That
> would take an extra machine but is still more viable than the 2x option
> and much faster to get back up.
>
>> If you are looking at VMware Server for this, here are some tips:
>> - For best performance, search around for "vmware tmpfs". It will
>> dramatically increase the performance of the VMs at the expense of
>> some memory.
>
> Thanks for the tip.
>
>> - VMware Server seems like it's EOL, even though VMware hasn't
>> specifically said so yet.
>> - There is a bug in VMware with CentOS that causes guests to slowly
>> use more CPU until the whole machine is bogged down. This can be
>> fixed by restarting or suspending/resuming each VM.
>
> That explains a puzzling, seemingly random freeze-up we get with a
> particular test system. I guess the random part was because we do
> suspend/restart the machine every now and then, so it didn't always bog
> down to the point where we'd notice.
>
> Thanks again for the responses :)

The creeping CPU problem happens slowly over the course of a week or
so, so if you're seeing acute freeze-ups, then that's probably not it.
However, if all the machines have been running for a while, try
suspending/resuming all of them, then see if the problem goes away.