You cannot run rsync on a continuous basis, so I think you have your answer there. Even running it once an hour isn't going to work, as the copied machine will be inconsistent (very bad disk corruption). It
That's what I figured too; otherwise it would have been an easy solution.
sounds like you need to get some new servers anyway, so DRBD is probably the way you need to go. Either that or a dedicated SAN or SAN-like device.
DRBD, as I understand it, is effectively RAID 1 over the network, which is the 2x cost and budget problem I had. Would the 2-machine Lustre cluster I'm considering implementing work as well as a SAN device?
The creeping CPU problem happens slowly over the course of a week or so, so if you're seeing acute freeze-ups, then that's probably not it. However, if all machines have been running for a while, try to suspend/resume all of them, then see if the problem goes away.
That is pretty much what we see. Sometimes we leave the machine on for several days when nothing changes in what we are doing on it, and sometimes the VM freezes after a few days. It had appeared random because any one of us could have restarted the VM during the week, so the lock-ups probably happened when none of us did.
However, the CentOS machine itself still responds when we VNC/SSH in, so we have to kill and restart the VM service before we can use the VM again. Sometimes that doesn't work and we have to reboot the machine.