On 6/25/2010 7:33 AM, Brian Mathis wrote:
On Fri, Jun 25, 2010 at 9:04 AM, Emmanuel Noobadmin centos.admin@gmail.com wrote:
I'm wondering if virtualization could be used as a cheap redundancy solution for situations which can tolerate a certain amount of downtime.
The current recommendation is to run some kind of replication setup such as DRBD. The problem here is cost if there is more than one server (or servers running different OSes) to be backed up. I'd basically need to tell my client they need to buy, say, 2X machines instead of just X. Not really attractive :D
So I'm wondering if it would be a good idea, or a stupid one, to run X CentOS machines with VMware, each running a single instance of CentOS and, in at least one case, Windows for MSSQL.
Sure. I run 4 machines with VMware Server 2 in production: three hosting the live VMs and a 4th with live 'near-mirror' VMs of all the others.
So if any of the machines physically fails for whatever reason not related to disk, I'll just transfer the disks to one of the surviving servers or a cold standby and have things running again within, say, the 30-60 minutes needed to check the filesystem, then mount and copy the image.
I don't like this so much. It means you physically have to move something, possibly have to fsck the drives and deal with potential corruption of the VM images.
I thought I could also rsync the images so that Server 1 backs up Server 2's image file and Server 2 backs up Server 3's, etc., in a round-robin fashion to make this even faster. But reading up indicates that rsync would attempt to mirror the whole 60 GB or 80 GB image on any change. Bad idea.
You have multiple choices here. I do three things:
1) I have 'near-image' machines running live all the time that rsync all the production-relevant portions of the live machines once a day, with scripts that can put them live in a few seconds or minutes at need.
2) I have snapshots of the VM images themselves that I take once a week by shutting down the VMs, taking an LVM static snapshot, restarting the VMs, rsyncing the snapshot to another machine, and then removing the snapshot. Since rsync only transfers the *changed* part of the image files this only takes a few hours for some hundreds of gigabytes of VM images and only has a few minutes of actual downtime.
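The weekly cycle above, as an outline (the volume-group/LV names, mount points, and standby host are all hypothetical, and the guest stop/start step depends on how you drive VMware Server — this is a sketch of the sequence, not a script to run verbatim):

```shell
# 1. Shut the guests down cleanly (via the management UI or your init scripts)

# 2. Snapshot the LV holding the VM images -- VG/LV names here are made up
lvcreate --snapshot --size 10G --name vmsnap /dev/vg0/vmstore

# 3. Restart the guests; the actual downtime ends here, after a few minutes

# 4. Copy the frozen snapshot to the standby host; rsync's delta transfer
#    means only the changed parts of the big image files go over the wire
mount -o ro /dev/vg0/vmsnap /mnt/vmsnap
rsync -a --inplace /mnt/vmsnap/ standby:/backup/vmstore/
umount /mnt/vmsnap

# 5. Drop the snapshot so it doesn't fill up with copy-on-write changes
lvremove -f /dev/vg0/vmsnap
```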
Since VMware Server 2 has an unfixed 'cpu load' leak requiring you to stop/restart the machines about once every week or two anyway, it kills two birds with one stone.
3) I also have inside-the-VM full rsync-over-ssh with hardlinking onsite/offsite backups of all the live virtual machines, taken daily, with a 7 x daily, 4 x weekly, 3 x monthly, 2 x quarterly, 2 x semi-annual retention.
So this is not real-time HA, but in most situations they can tolerate an hour's downtime. The cost of the "redundancy" also stays constant no matter how many servers are added to the operation.
Any comments on this, or is it just plain stupid because there are better options that are equally cost-effective?
This is one of the advantages of using VMs, and I'm sure most people are using it for this reason in one way or another. However, there are a few things you need to worry about:
- When the host crashes, the guests will also crash, so you'll be in a recovery situation just like for a physical crash. This is manageable and something you'd have to deal with either way.
I'm not so hot on the 'move the physical disk' idea. 'Move the data' seems better to me.
- Rsyncing the VMs while they are running leaves them in an inconsistent state. This state may or may not be worse than a simple crash situation. One way I have been getting around this is by creating a snapshot of the VM before performing the rsync, and when bringing up the copy after a crash, reverting to the snapshot. That will at least give you consistent filesystem and memory state, but could cause issues with network connections. I usually reboot the VM cleanly after reverting to the snapshot.
Note - *take the snapshot while the vm's are 'stopped' or 'paused'* :)
Rsync will not transfer the entire file when transferring over the network. It scans the whole thing and only sends changes. If you have --progress enabled, it will appear to go through the whole file, but you will see the "speedup" go much higher than for a regular transfer. However, sometimes this process can take more time than doing a full copy on a local network. Rsync is meant to conserve bandwidth, not necessarily time. Also, I suggest you use a gigabit network if you have the option. If not, you could directly link the network ports on the 2 servers and copy straight from one to the other.
Yep.
If you are looking at VMware Server for this, here are some tips:
- For best performance, search around for "vmware tmpfs". It will dramatically increase the performance of the VMs at the expense of some memory.
+1
We are talking an order of magnitude difference in performance. This is probably the single most important performance tuning tip for VMware Server.
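For anyone searching: the recipe that usually circulates for this trick is to put VMware's per-VM memory backing files on tmpfs so they stop thrashing the disk. Roughly (the mount point and size here are examples — the tmpfs must be large enough for the combined RAM of all running guests, and paths vary by site):

```
# /etc/fstab: a tmpfs sized for the total RAM of the running guests
tmpfs  /tmp/vmware  tmpfs  size=4096M  0 0

# /etc/vmware/config: tell VMware Server to keep its temp/memory files there
tmpDirectory = "/tmp/vmware"
```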
- VMware Server seems like it's EOL, even though VMware hasn't specifically said so yet.
Yah. They have been 'not calling it dead' for a while now. It is clear though from the lack of even important security patches that they intend to put as little into it as possible before it officially reaches EOL in June 2011.
- There is a bug in VMware with CentOS that causes guests to slowly use more CPU until the whole machine is bogged down. This can be fixed by restarting or suspending/resuming each VM.
Note that suspend can cause havoc with IP interfaces if you bring up addresses that are not part of the automatic list during normal operation.
There is *also* a serious bug with its glibc handling on CentOS 5.4 or later. You will need to install an older copy of glibc directly into the vmware libraries for the host machine and tweak the launch scripts for stable operation. Google for: centos vmware glibc
- At this point I'd look at ESXi for the free VMware option.
Or KVM if you are willing to leave VMware since that is where RH is going.