Hi all! I've been experimenting with Xen on CentOS 5 and RHAS 5 for a while now with mixed success, so I thought I'd describe my latest challenge. I'll be describing this from memory since all the equipment is at work and not contactable from here.

I think I've described this config to the list before, but here it is again: I have 2 x HP DL585 servers, each with 4 dual-core Opterons (non-vmx) and 16GB RAM, configured as Xen servers. These run CentOS 5.1 with the latest updates applied. Both systems attach to an iSCSI target, which is an HP DL385 running ietd and serving SAN-based storage.

Everything runs fine if I do no migration. I was having a "soft lockup" problem which has been solved by installing the latest test kernel from Red Hat. They've adjusted a timer as described in this report (comment 5):

https://bugzilla.redhat.com/show_bug.cgi?id=250994

Anyway, things are pretty stable now, but... If I do a series of migrations, live or not, from one server to the other, eventually the process will fail. This can take up to an hour with migrations happening every 5 minutes, or it can happen on the first try. The message in the xend.log file says that it is unable to find the device number for the virtualised storage, i.e. sda.

In my configuration, dom0 connects to the LUNs used by the VMs, so the domUs are not doing iSCSI themselves. I'm passing the "/dev/disk/by-path/iscsixxx:sda1" info in via the Xen config file. If I mount and unmount the same LUN's filesystem on dom0 over and over again it works every time, so there's no fundamental problem with the iSCSI connection itself.

Each server is using 3 GigE interfaces: one for normal LAN access, a dedicated network for the iSCSI, and a crossover cable between the third GigE interfaces on the two servers for the migration channel.

I have xentop running on both dom0s and can tell the migration has failed when the VM appears on the target but the memory used by the VM never starts incrementing. The end result is that the VM is hung and has to be destroyed and re-created. The error happens immediately after the failed migration is initiated. I've tried reading the first few blocks of the LUN on the target immediately before initiating the migration; the read succeeds but makes no difference.

I realize this is pretty sketchy information, but I was wondering if others were seeing similar problems. The ability to do reliable migration is basically the prime motivation for us wanting to do virtualization at all; otherwise we'd just run the required services on the main machine.

Anyway, any help would be appreciated.

Brett
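
PS. Since I'm going from memory, here's roughly what the relevant lines of the guest config and the migration step look like. The path, names and addresses below are placeholders, not the exact values from our config:

    # /etc/xen/vm01 -- relevant lines (placeholder values)
    name   = "vm01"
    memory = 2048
    # dom0 owns the iSCSI session; the by-path device is handed to the
    # guest as a phy: backend, so the domU never talks iSCSI itself
    disk   = [ "phy:/dev/disk/by-path/ip-192.168.2.10:3260-iscsi-iqn.example:lun1-part1,sda1,w" ]
    vif    = [ "bridge=xenbr0" ]

    # migration is kicked off from the source dom0 over the crossover link
    # (the non-live case is the same command without --live)
    #   xm migrate --live vm01 192.168.3.2

The failure shows up right after that xm migrate is issued: xend on the target complains it can't find the device number for sda, and the VM sits there in xentop without its memory ever incrementing.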