[CentOS-virt] Migration Problem
brett at worth.id.au
Sat Mar 29 09:24:20 UTC 2008
I've been experimenting with Xen on CentOS 5 and RHAS 5 for a while now with mixed success.
I thought I'd describe my latest challenge. I'll describe this from memory since all
the equipment is at work and not contactable from here.
I think I've described this config to the list before but here it is again:
I have 2 x HP DL585 servers, each with 4 dual-core Opterons (non-vmx) and 16GB RAM,
configured as Xen servers. These run CentOS 5.1 with the latest updates applied. Both
systems attach to an iSCSI target, which is an HP DL385 running ietd and serving SAN
Everything runs fine if I do no migration. I was having a "soft lockup" problem which has
been solved by installing the latest test kernel from Red Hat. They've adjusted a timer as
described in this report (comment 5): https://bugzilla.redhat.com/show_bug.cgi?id=250994
Anyway, things are pretty stable now, but if I do a series of migrations, live or not,
from one server to the other, the process eventually fails. This can take up to an hour
with migrations happening every 5 minutes, or it can happen on the first try. The message
in the xend.log file says that it is unable to find the device number for the virtualised
storage, i.e. sda. In my configuration dom0 connects to the LUNs used by the VMs, so the
domUs are not doing iSCSI themselves. I'm passing the
"/dev/disks/by-path/iscsixxx:sda1" info in via the Xen config file.
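A repeated-migration test like the one described above could be scripted roughly as follows. This is only a minimal sketch under assumptions, not the procedure actually used: the helper names (migrate_command, ping_pong) and any domain or host names passed in are hypothetical, it assumes passwordless ssh to each dom0, and it assumes that a non-zero exit status from "xm migrate" signals failure, which may not catch a domU that hangs after the command returns.

```python
import itertools
import subprocess
import time


def migrate_command(domain, source, target, live=True):
    """Build the command asking dom0 `source` to migrate `domain` to `target`."""
    cmd = ["ssh", source, "xm", "migrate"]
    if live:
        cmd.append("--live")
    cmd += [domain, target]
    return cmd


def ping_pong(domain, host_a, host_b, rounds, interval=300, run=subprocess.call):
    """Migrate back and forth; return how many migrations completed."""
    directions = itertools.cycle([(host_a, host_b), (host_b, host_a)])
    for done, (source, target) in zip(range(rounds), directions):
        if run(migrate_command(domain, source, target)) != 0:
            # Stop at the first failure so xend.log can be inspected.
            return done
        if done + 1 < rounds:
            time.sleep(interval)  # 300 s matches the 5 minute spacing
    return rounds
```

Stopping at the first failure, instead of carrying on, keeps the relevant xend.log entries close to the end of the log on both hosts.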
If I mount and unmount the same LUN's filesystem on dom0 over and over again it works
every time, so there's no fundamental problem with the iSCSI connection itself. Each
server is using 3 GigE interfaces: one for normal LAN access, one on a dedicated network
for the iSCSI, and a crossover cable between the third GigE interfaces on the servers for the
I have xentop running on both dom0s and can tell that it's failed when the VM appears on
the target but the memory used by the VM doesn't start incrementing. The end result is
that the VM is hung and has to be destroyed and re-created. The error happens immediately
after the failed migration is initiated.
I've tried reading the first few blocks of the LUN on the target immediately before
initiating the migration. The read succeeds but makes no difference.
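The pre-migration read check could look something like the sketch below. The function name and the default block count and block size are assumptions made for illustration; the device path would be whatever by-path entry the Xen config file refers to. Note that a plain buffered read like this may be served from the page cache on repeat runs, so it does not necessarily prove the iSCSI path was exercised each time.

```python
import os


def read_first_blocks(device_path, blocks=8, block_size=512):
    """Read the first `blocks` blocks of a device; return the byte count read."""
    fd = os.open(device_path, os.O_RDONLY)
    try:
        data = os.read(fd, blocks * block_size)
    finally:
        os.close(fd)
    return len(data)
```

If the device node is missing or unreadable on the target dom0, os.open raises OSError, so the check fails loudly before the migration is attempted.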
I realize this is pretty sketchy information but I was wondering if others were seeing
similar problems. The ability to do reliable migration is basically the prime motivation
for us wanting to do virtualization at all. Otherwise we'd just run the required services
on the main machine.
Anyway, any help would be appreciated.