[CentOS-virt] Migration Problem

Sat Mar 29 09:24:20 UTC 2008
Brett Worth <brett at worth.id.au>

Hi all!

I've been experimenting with Xen on CentOS 5 and RHAS 5 for a while now with mixed success.
I thought I'd describe my latest challenge.  I'll describe this from memory since all
the equipment is at work and not reachable from here.

I think I've described this config to the list before but here it is again:

I have 2 x HP DL585 servers, each with 4 dual-core Opterons (non-vmx) and 16GB RAM,
configured as Xen servers.  These run CentOS 5.1 with the latest updates applied.  Both
systems attach to an iSCSI target, which is an HP DL385 running ietd and serving SAN-based
storage.

Everything runs fine if I do no migration.  I was having a "soft lockup" problem which has
been solved by installing the latest test kernel from Red Hat.  They've adjusted a timer as
described in this report (comment 5): https://bugzilla.redhat.com/show_bug.cgi?id=250994

Anyway, things are pretty stable now, but if I do a series of migrations, live or not,
from one server to the other, eventually the process fails.  This can take
up to an hour with the migrations happening every 5 minutes.  It can also happen on the
first try.  The message in the xend.log file says that it is unable to find the device
number for the virtualised storage, i.e. sda.  In my configuration dom0 connects to the
LUNs used by the VMs, so the domUs are not doing iSCSI themselves.  I'm passing the
"/dev/disks/by-path/iscsixxx:sda1" info in via the xen config file.

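A minimal sketch of the kind of back-and-forth migration loop described above, written
for a current Python 3.  It assumes the script can reach each dom0 over ssh and that
"xm migrate" has to be run on whichever host currently holds the domain.  The function
names, and the host and domain names in the usage note, are hypothetical and not taken
from the actual setup.

```python
import subprocess
import time


def build_migrate_cmd(domain, source, target, live=True):
    """Return the argv that asks `source` (over ssh) to migrate `domain` to `target`."""
    cmd = ["ssh", source, "xm", "migrate"]
    if live:
        cmd.append("--live")
    return cmd + [domain, target]


def ping_pong(domain, host_a, host_b, rounds, interval_s=300, live=True,
              run=subprocess.call, sleep=time.sleep):
    """Migrate `domain` back and forth between two hosts.

    Stops at the first migration command that exits non-zero and returns
    the number of migrations that succeeded before that.
    """
    source, target = host_a, host_b
    succeeded = 0
    for _ in range(rounds):
        if run(build_migrate_cmd(domain, source, target, live)) != 0:
            break
        succeeded += 1
        # The domain now lives on the old target, so swap roles.
        source, target = target, source
        sleep(interval_s)
    return succeeded
```

Something like ping_pong("vm1", "xen1-mig", "xen2-mig", rounds=12) would then attempt a
migration every 5 minutes for an hour and report how many succeeded before the first
failure.  Note that a hung migration may not return a non-zero exit status at all, so
this only catches failures that xm itself reports.
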
If I mount and unmount the same LUN's filesystem on dom0 over and over again it works
every time, so there's no fundamental problem with the iSCSI connection itself.  Each
server is using 3 GigE interfaces: one for normal LAN access, one on a dedicated network
for iSCSI, and a third connected by crossover cable to the other server as the
migration channel.

I have xentop running on both dom0s and can tell that it's failed when the VM appears on
the target but the memory used by the VM doesn't start incrementing.  The end result is
that the VM is hung and has to be destroyed and re-created.  The error happens immediately
after the failed migration is initiated.
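
The "memory doesn't start incrementing" symptom could in principle be checked by a script
rather than by eye.  The sketch below is only an illustration: it assumes the memory
readings for the incoming domain have already been collected at regular intervals by
some other means, and the function name, parameters and window size are hypothetical.

```python
def memory_stalled(samples_kib, expected_kib, window=3):
    """Return True if the incoming domain's memory has stopped growing early.

    `samples_kib` is a list of memory readings, oldest first.  The domain is
    treated as stalled when the last `window` readings show no growth and the
    latest reading is still below `expected_kib`.
    """
    if len(samples_kib) < window:
        return False  # not enough data to judge yet
    recent = samples_kib[-window:]
    no_growth = max(recent) <= recent[0]
    return no_growth and recent[-1] < expected_kib
```

For a failed migration as described here the readings would look something like
[0, 0, 0], whereas a healthy one keeps climbing until it reaches the domain's
configured memory.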

I've tried reading the first few blocks of the LUN on the target immediately
before initiating the migration; the read succeeds but makes no difference.
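
A pre-migration check along these lines might look like the following sketch, written
for a current Python 3.  It resolves a by-path style symlink to the underlying block
device, reports its major and minor numbers, and reads a few blocks from the start.
The function names and the default of 8 blocks of 512 bytes are assumptions made for
illustration, and the device path has to be supplied by the caller.

```python
import os
import stat


def resolve_block_device(path):
    """Follow symlinks and return (real_path, major, minor) for a block device.

    Raises OSError if the path does not exist and ValueError if it does not
    end at a block device.
    """
    real = os.path.realpath(path)
    st = os.stat(real)
    if not stat.S_ISBLK(st.st_mode):
        raise ValueError("%s is not a block device" % real)
    return real, os.major(st.st_rdev), os.minor(st.st_rdev)


def read_first_blocks(path, count=8, block_size=512):
    """Read up to `count` blocks from the start of `path`.

    Returns the number of bytes actually read.
    """
    with open(path, "rb") as dev:
        return len(dev.read(count * block_size))
```

Running both functions on the target dom0 just before a migration would show whether the
device node and its device number are visible there at that moment, which is the thing
xend.log complains about.  As noted above, a successful read did not prevent the failure
in my case, so this is a diagnostic aid rather than a fix.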

I realize this is pretty sketchy information but I was wondering if others were seeing 
similar problems.  The ability to do reliable migration is basically the prime motivation 
for us wanting to do virtualization at all.  Otherwise we'd just run the required services 
on the main machine.

Anyway, any help would be appreciated.

Brett