[CentOS-virt] Problem with Xen4CentOS

Sun Nov 16 00:39:54 UTC 2014
Thomas Weyergraf <T.Weyergraf at virtfinity.de>

Hi folks,

We (the company I work for) run several dozen virtualisation servers 
using CentOS 6 + Xen4CentOS as the virtualisation infrastructure.

With the latest versions of all packages installed [1], we see failures 
in live-migration of stock-CentOS6 HVM guests, leaving a 
"Domain-unnamed" on the source host, while the migrated guest runs fine 
on the target host.

Domain-0                                     0  2048    64 r----- 6791.5
Domain-Unnamed                               1  4099     4 --ps--     94.8

The failure is not consistently reproducible. Some guests (of the same 
type) live-migrate just fine, until eventually some seemingly random 
guest fails, leaving a "Domain-unnamed" zombie.

That Domain-unnamed causes several problems:
- The memory allocated to Domain-unnamed remains blocked, which is 
effectively a memory leak on the host
- The DomU that left Domain-unnamed behind cannot be restarted on the 
host, as xm thinks it is already running
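
To spot such leftovers and see how much memory they hold, the `xm list` 
output can be scanned with a small script. The following is only a 
sketch: the function names `find_zombies` and `blocked_memory_mib` are 
made up for illustration, the sample rows are the two lines quoted 
above, matching on the name "Domain-Unnamed" is a heuristic taken from 
this report, and the memory column is assumed to be in MiB.

```python
import re

# Sample `xm list` rows copied from this report (the header row is not shown there).
SAMPLE = """\
Domain-0                                     0  2048    64 r----- 6791.5
Domain-Unnamed                               1  4099     4 --ps--     94.8
"""

# Columns: name, domain id, memory, vcpus, six state flags, cpu time.
ROW = re.compile(r"^(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+([a-z-]{6})\s+([\d.]+)\s*$")


def find_zombies(xm_list_output):
    """Return (name, domid, mem) tuples for rows named Domain-Unnamed.

    Assumption: a leftover domain shows up under exactly this name,
    as in the listing quoted in this report.
    """
    zombies = []
    for line in xm_list_output.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header row, blank line, or unexpected format
        name, domid, mem, _vcpus, _state, _time = m.groups()
        if name == "Domain-Unnamed" and domid != "0":
            zombies.append((name, int(domid), int(mem)))
    return zombies


def blocked_memory_mib(zombies):
    """Sum the memory column of all leftover domains (assumed to be MiB)."""
    return sum(mem for _name, _domid, mem in zombies)


if __name__ == "__main__":
    found = find_zombies(SAMPLE)
    print(found, blocked_memory_mib(found))
```

On a real host the text would come from running `xm list` instead of 
the embedded sample.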

I have tried various things to get rid of Domain-unnamed, all without 
success:
- multiple xm destroy
- restart xend
- delete everything regarding Domain-unnamed in xenstore with 
xenstore-rm. The removal is successful, but the domain remains. 
Restarting xend after the deletion restores Domain-unnamed in xenstore

So far, the only way to get rid of Domain-unnamed is to reboot the 
virtualisation host. As these hosts are all quad-socket Opteron 6272 
machines with 256 GB RAM running dozens of guests, this is highly 
impractical.

I have seen this behaviour using Xen 4.2.5. The previous 4.2.4 versions 
did not show this problem; however, we did not use live-migration 
extensively prior to that. Before switching to Xen4CentOS, we used to 
build our own Xen 4.2.2 based on a git repo published by Karanbir 
Singh. We had several issues with that version, but never observed a 
"Domain-unnamed".

Any idea how to resolve this issue would be highly appreciated, as 
working live-migration is crucial to us.

Regards,
Thomas Weyergraf

Some notes on our config:

1. We still use xm/xend for various reasons
----
2. Our grub-config for the virtx-hosts is as follows:
----
default=0
timeout=5
#splashimage=(hd0,0)/grub/splash.xpm.gz
#hiddenmenu
title CentOS (xen-4.2.5-37.el6.gz vmlinuz-3.10.56-11.el6.centos.alt.x86_64)
         root (hd0,0)
         kernel /xen-4.2.5-37.el6.gz iommu=1 console=vga,com1 com1=115200,8n1 vga=text-80x25 dom0_mem=2048M,max:2048M
         module /vmlinuz-3.10.56-11.el6.centos.alt.x86_64 ro xencons=hvc0 console=hvc0 root=/dev/fravirtx68/root rd_NO_LUKS LANG=en_US.UTF-8 KEYBOARDTYPE=pc KEYTABLE=de-latin1-nodeadkeys rd_NO_MD SYSFONT=latarcyrheb-sun16 rd_LVM_LV=fravirtx68/root rd_NO_DM
         module /initramfs-3.10.56-11.el6.centos.alt.x86_64.img
title CentOS (vmlinuz-3.10.56-11.el6.centos.alt.x86_64)
         root (hd0,0)
         kernel /vmlinuz-3.10.56-11.el6.centos.alt.x86_64 ro root=/dev/fravirtx68/root rd_NO_LUKS LANG=en_US.UTF-8 KEYBOARDTYPE=pc KEYTABLE=de-latin1-nodeadkeys rd_NO_MD SYSFONT=latarcyrheb-sun16 rd_LVM_LV=fravirtx68/root rd_NO_DM
         module /initramfs-3.10.56-11.el6.centos.alt.x86_64.img

3. A typical guest-config looks like:
----
name = "fraappmgmt05t.test.fra.net-m.internal"
uuid = "3778a443-9194-4c46-adff-211d7fcc24da"
memory = "4096"
vcpus = 4
kernel = "hvmloader"
builder = 'hvm'
disk = [ 
'phy:/dev/disk/by-path/ip-192.168.240.7:3260-iscsi-iqn.1992-08.com.netapp:navfiler21-lun-56,xvda,w', 
'phy:/dev/disk/by-path/ip-192.168.240.7:3260-iscsi-iqn.1992-08.com.netapp:navfiler21-lun-57,xvdb,w', 
]
vif = [ 'mac=00:16:3e:fa:15:4a,bridge=xenbr11' ]
device_model = 'qemu-dm'
serial='pty'
xen_platform_pci=1
on_poweroff = "destroy"
on_crash = "restart"

4. The xend.log excerpt of the migration process from the source host:
----
[2014-11-06 22:53:01 13499] DEBUG (XendDomainInfo:1795) Storing domain details: {'console/port': '7', 'cpu/3/availability': 'online', 'description': '', 'console/limit': '1048576', 'cpu/2/availability': 'online', 'vm': '/vm/f5139575-984b-4c28-b470-efc042ba2703', 'domid': '1', 'store/port': '6', 'console/type': 'ioemu', 'cpu/0/availability': 'online', 'memory/target': '4194304', 'control/platform-feature-multiprocessor-suspend': '1', 'store/ring-ref': '1044476', 'cpu/1/availability': 'online', 'control/platform-feature-xs_reset_watches': '1', 'image/suspend-cancel': '1', 'name': 'migrating-fraapppeccon06.fra.net-m.internal'}
[2014-11-06 22:53:01 13499] INFO (XendCheckpoint:423) xc_save: failed to get the suspend evtchn port
[2014-11-06 22:53:01 13499] INFO (XendCheckpoint:423)
[2014-11-06 22:53:34 13499] DEBUG (XendCheckpoint:394) suspend
[2014-11-06 22:53:34 13499] DEBUG (XendCheckpoint:127) In saveInputHandler suspend
[2014-11-06 22:53:34 13499] DEBUG (XendCheckpoint:129) Suspending 1 ...
[2014-11-06 22:53:34 13499] DEBUG (XendDomainInfo:524) XendDomainInfo.shutdown(suspend)
[2014-11-06 22:53:34 13499] DEBUG (XendDomainInfo:1882) XendDomainInfo.handleShutdownWatch
[2014-11-06 22:53:34 13499] DEBUG (XendDomainInfo:1882) XendDomainInfo.handleShutdownWatch
[2014-11-06 22:53:34 13499] INFO (XendDomainInfo:2079) Domain has shutdown: name=migrating-fraapppeccon06.fra.net-m.internal id=1 reason=suspend.
[2014-11-06 22:53:34 13499] INFO (XendCheckpoint:135) Domain 1 suspended.
[2014-11-06 22:53:35 13499] INFO (image:542) signalDeviceModel:restore dm state to running
[2014-11-06 22:53:35 13499] DEBUG (XendCheckpoint:144) Written done
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:3077) XendDomainInfo.destroy: domid=1
[2014-11-06 22:53:35 13499] ERROR (XendDomainInfo:3091) XendDomainInfo.destroy: domain destruction failed.
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/xen/xend/XendDomainInfo.py", line 3086, in destroy
    xc.domain_destroy(self.domid)
Error: (16, 'Device or resource busy')
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:2402) Destroying device model
[2014-11-06 22:53:35 13499] INFO (image:619) migrating-fraapppeccon06.fra.net-m.internal device model terminated
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:2409) Releasing devices
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:2415) Removing vif/0
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:1276) XendDomainInfo.destroyDevice: deviceClass = vif, device = vif/0
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:2415) Removing console/0
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:1276) XendDomainInfo.destroyDevice: deviceClass = console, device = console/0
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:2415) Removing vbd/51712
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:1276) XendDomainInfo.destroyDevice: deviceClass = vbd, device = vbd/51712
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:2415) Removing vbd/51728
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:1276) XendDomainInfo.destroyDevice: deviceClass = vbd, device = vbd/51728
[2014-11-06 22:53:36 13499] DEBUG (XendCheckpoint:124) [xc_save]: /usr/lib/xen/bin/xc_save 26 2 0 0 5
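
The key event in this excerpt is the failed `xc.domain_destroy` call 
with error 16 (EBUSY, 'Device or resource busy'). To find such events 
across many hosts, xend.log can be scanned for them. This is a rough 
sketch only: the function name `failed_destroys` is hypothetical, the 
regular expressions are derived solely from the lines shown above (with 
the mail line-wrapping undone), and other xend versions may log 
differently.

```python
import re

# Lines taken from the excerpt in this report, rejoined where the mail wrapped them.
SAMPLE_LOG = """\
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:3077) XendDomainInfo.destroy: domid=1
[2014-11-06 22:53:35 13499] ERROR (XendDomainInfo:3091) XendDomainInfo.destroy: domain destruction failed.
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/xen/xend/XendDomainInfo.py", line 3086, in destroy
    xc.domain_destroy(self.domid)
Error: (16, 'Device or resource busy')
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:2402) Destroying device model
"""

DESTROY = re.compile(
    r"^\[(?P<ts>[^\]]+?) \d+\] DEBUG \(XendDomainInfo:\d+\) "
    r"XendDomainInfo\.destroy: domid=(?P<domid>\d+)"
)
FAILED = re.compile(
    r"ERROR \(XendDomainInfo:\d+\) XendDomainInfo\.destroy: "
    r"domain destruction failed"
)
ERRNO = re.compile(r"^Error: \((?P<errno>\d+), '(?P<msg>[^']*)'\)")


def failed_destroys(log_text):
    """Collect failed domain destructions with their domid and error code."""
    results = []
    domid = ts = None
    pending = None  # failure record still waiting for its "Error: (...)" line
    for line in log_text.splitlines():
        m = DESTROY.match(line)
        if m:
            # Remember which domain the following lines refer to.
            domid, ts = int(m.group("domid")), m.group("ts")
            pending = None
            continue
        if FAILED.search(line) and domid is not None:
            pending = {"timestamp": ts, "domid": domid,
                       "errno": None, "message": None}
            results.append(pending)
            continue
        m = ERRNO.match(line)
        if m and pending is not None:
            pending["errno"] = int(m.group("errno"))
            pending["message"] = m.group("msg")
            pending = None
    return results


if __name__ == "__main__":
    for event in failed_destroys(SAMPLE_LOG):
        print(event)
```

For a real run, the text would be read from the xend log file on the 
source host rather than from the embedded sample.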