Hi folks,

we (the company I am working for) are running several dozen virtualisation servers using CentOS 6 + Xen4CentOS as the virtualisation infrastructure.

With the latest versions of all packages installed [1], we see failures in live-migration of stock-CentOS6 HVM guests, leaving a "Domain-Unnamed" on the source host, while the migrated guest runs fine on the target host.

Domain-0                0  2048    64     r-----   6791.5
Domain-Unnamed          1  4099     4     --ps--     94.8

The failure is not consistently reproducible: some guests (of the same type) live-migrate just fine, until eventually some seemingly random guest fails, leaving a "Domain-Unnamed" zombie.

That Domain-Unnamed causes several problems:
- The memory allocated to Domain-Unnamed remains blocked, effectively creating a 'memory-leak' on the host
- The DomU that left Domain-Unnamed behind cannot be restarted on the host, as xm thinks it is already running

I have tried various things to get rid of Domain-Unnamed, all without success:
- multiple xm destroy
- restart xend
- delete everything regarding Domain-Unnamed in xenstore with xenstore-rm. The removal is successful, but the domain remains. Restarting xend after the deletion restores Domain-Unnamed in xenstore

So far, the only way to get rid of Domain-Unnamed is a reboot of the virt-host. As these hosts are all quad-socket Opteron 6272 machines with 256 GB RAM running dozens of guests, this is highly impractical.

I have seen this behaviour using Xen 4.2.5. The previous 4.2.4 versions did not show this problem; however, we did not use live-migration extensively prior to that. Before switching to Xen4CentOS, we used to build our own Xen 4.2.2 based on a git repo published by Karanbir Singh. We had several issues with that version, but never observed a "Domain-Unnamed".

Any idea how to resolve this issue would be highly appreciated, as working live-migration is crucial to us.

Regards,
Thomas Weyergraf

Some notes on our config:

1. We still use xm/xend for various reasons
----

2. Our grub-config for the virtx-hosts is as follows:
----
default=0
timeout=5
#splashimage=(hd0,0)/grub/splash.xpm.gz
#hiddenmenu
title CentOS (xen-4.2.5-37.el6.gz vmlinuz-3.10.56-11.el6.centos.alt.x86_64)
        root (hd0,0)
        kernel /xen-4.2.5-37.el6.gz iommu=1 console=vga,com1 com1=115200,8n1 vga=text-80x25 dom0_mem=2048M,max:2048M
        module /vmlinuz-3.10.56-11.el6.centos.alt.x86_64 ro xencons=hvc0 console=hvc0 root=/dev/fravirtx68/root rd_NO_LUKS LANG=en_US.UTF-8 KEYBOARDTYPE=pc KEYTABLE=de-latin1-nodeadkeys rd_NO_MD SYSFONT=latarcyrheb-sun16 rd_LVM_LV=fravirtx68/root rd_NO_DM
        module /initramfs-3.10.56-11.el6.centos.alt.x86_64.img
title CentOS (vmlinuz-3.10.56-11.el6.centos.alt.x86_64)
        root (hd0,0)
        kernel /vmlinuz-3.10.56-11.el6.centos.alt.x86_64 ro root=/dev/fravirtx68/root rd_NO_LUKS LANG=en_US.UTF-8 KEYBOARDTYPE=pc KEYTABLE=de-latin1-nodeadkeys rd_NO_MD SYSFONT=latarcyrheb-sun16 rd_LVM_LV=fravirtx68/root rd_NO_DM
        module /initramfs-3.10.56-11.el6.centos.alt.x86_64.img

3. A typical guest-config looks like:
----
name = "fraappmgmt05t.test.fra.net-m.internal"
uuid = "3778a443-9194-4c46-adff-211d7fcc24da"
memory = "4096"
vcpus = 4
kernel = "hvmloader"
builder = 'hvm'
disk = [
    'phy:/dev/disk/by-path/ip-192.168.240.7:3260-iscsi-iqn.1992-08.com.netapp:navfiler21-lun-56,xvda,w',
    'phy:/dev/disk/by-path/ip-192.168.240.7:3260-iscsi-iqn.1992-08.com.netapp:navfiler21-lun-57,xvdb,w',
]
vif = [ 'mac=00:16:3e:fa:15:4a,bridge=xenbr11' ]
device_model = 'qemu-dm'
serial='pty'
xen_platform_pci=1
on_poweroff = "destroy"
on_crash = "restart"

4. The xend.log excerpt of the migration process from the source host:
----
[2014-11-06 22:53:01 13499] DEBUG (XendDomainInfo:1795) Storing domain details: {'console/port': '7', 'cpu/3/availability': 'online', 'description': '', 'console/limit': '1048576', 'cpu/2/availability': 'online', 'vm': '/vm/f5139575-984b-4c28-b470-efc042ba2703', 'domid': '1', 'store/port': '6', 'console/type': 'ioemu', 'cpu/0/availability': 'online', 'memory/target': '4194304', 'control/platform-feature-multiprocessor-suspend': '1', 'store/ring-ref': '1044476', 'cpu/1/availability': 'online', 'control/platform-feature-xs_reset_watches': '1', 'image/suspend-cancel': '1', 'name': 'migrating-fraapppeccon06.fra.net-m.internal'}
[2014-11-06 22:53:01 13499] INFO (XendCheckpoint:423) xc_save: failed to get the suspend evtchn port
[2014-11-06 22:53:01 13499] INFO (XendCheckpoint:423)
[2014-11-06 22:53:34 13499] DEBUG (XendCheckpoint:394) suspend
[2014-11-06 22:53:34 13499] DEBUG (XendCheckpoint:127) In saveInputHandler suspend
[2014-11-06 22:53:34 13499] DEBUG (XendCheckpoint:129) Suspending 1 ...
[2014-11-06 22:53:34 13499] DEBUG (XendDomainInfo:524) XendDomainInfo.shutdown(suspend)
[2014-11-06 22:53:34 13499] DEBUG (XendDomainInfo:1882) XendDomainInfo.handleShutdownWatch
[2014-11-06 22:53:34 13499] DEBUG (XendDomainInfo:1882) XendDomainInfo.handleShutdownWatch
[2014-11-06 22:53:34 13499] INFO (XendDomainInfo:2079) Domain has shutdown: name=migrating-fraapppeccon06.fra.net-m.internal id=1 reason=suspend.
[2014-11-06 22:53:34 13499] INFO (XendCheckpoint:135) Domain 1 suspended.
[2014-11-06 22:53:35 13499] INFO (image:542) signalDeviceModel:restore dm state to running
[2014-11-06 22:53:35 13499] DEBUG (XendCheckpoint:144) Written done
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:3077) XendDomainInfo.destroy: domid=1
[2014-11-06 22:53:35 13499] ERROR (XendDomainInfo:3091) XendDomainInfo.destroy: domain destruction failed.
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/xen/xend/XendDomainInfo.py", line 3086, in destroy
    xc.domain_destroy(self.domid)
Error: (16, 'Device or resource busy')
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:2402) Destroying device model
[2014-11-06 22:53:35 13499] INFO (image:619) migrating-fraapppeccon06.fra.net-m.internal device model terminated
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:2409) Releasing devices
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:2415) Removing vif/0
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:1276) XendDomainInfo.destroyDevice: deviceClass = vif, device = vif/0
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:2415) Removing console/0
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:1276) XendDomainInfo.destroyDevice: deviceClass = console, device = console/0
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:2415) Removing vbd/51712
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:1276) XendDomainInfo.destroyDevice: deviceClass = vbd, device = vbd/51712
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:2415) Removing vbd/51728
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:1276) XendDomainInfo.destroyDevice: deviceClass = vbd, device = vbd/51728
[2014-11-06 22:53:36 13499] DEBUG (XendCheckpoint:124) [xc_save]: /usr/lib/xen/bin/xc_save 26 2 0 0 5
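
In case it helps anyone watching for the same symptom: the leftover domain can be spotted by parsing the `xm list` output rather than eyeballing it. The following is only a rough sketch, not something we run in production. It assumes Python 3.7+ on a box that can call `xm`, and the six-column `xm list` layout (name, id, memory, vcpus, state, time) as in the two rows quoted in this mail. The function names `parse_xm_list`, `find_leftovers` and `leftovers_on_this_host` are made up for illustration.

```python
import subprocess


def parse_xm_list(text):
    """Parse `xm list` style output into a list of dicts.

    Assumes six whitespace-separated columns per domain; the header
    line (starting with "Name") and any other rows are skipped.
    """
    domains = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6 or fields[0] == "Name":
            continue
        name, domid, mem, vcpus, state, time_s = fields
        domains.append({
            "name": name,
            "id": int(domid),
            "mem": int(mem),        # MiB, as printed by xm list
            "vcpus": int(vcpus),
            "state": state,
            "time": float(time_s),
        })
    return domains


def find_leftovers(domains):
    """Return the domains whose name is the 'Domain-Unnamed' placeholder."""
    return [d for d in domains if d["name"] == "Domain-Unnamed"]


def leftovers_on_this_host():
    """Run `xm list` locally and return the leftover domains (needs xm/xend)."""
    out = subprocess.run(["xm", "list"], capture_output=True,
                         text=True, check=True).stdout
    return find_leftovers(parse_xm_list(out))
```

Summing the `mem` values of the returned entries gives the amount of memory blocked on that host. This only detects the leftover; it does nothing to remove it.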