Hi folks,
we (the company I work for) run several dozen virtualisation servers
using CentOS 6 + Xen4CentOS as the virtualisation infrastructure.
With the latest versions of all packages installed [1], we see failures
in live-migration of stock-CentOS6 HVM guests, leaving a
"Domain-unnamed" on the source host, while the migrated guest runs fine
on the target host.
Domain-0 0 2048 64 r----- 6791.5
Domain-Unnamed 1 4099 4 --ps-- 94.8
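In case it helps anyone checking many hosts: a minimal Python sketch that picks such leftovers out of `xm list` output. It assumes the usual column order (name, ID, memory in MiB, VCPUs, state, time). The helper names are made up for illustration; this is not taken from our production tooling.

```python
from typing import List, NamedTuple


class Domain(NamedTuple):
    name: str
    domid: int
    mem_mib: int
    vcpus: int
    state: str
    time_s: float


def parse_xm_list(text: str) -> List[Domain]:
    """Parse `xm list` output; a header line, if present, is skipped."""
    domains = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and anything whose second column is not a
        # numeric domain ID (for example the column header).
        if len(fields) != 6 or not fields[1].isdigit():
            continue
        name, domid, mem, vcpus, state, time_s = fields
        domains.append(
            Domain(name, int(domid), int(mem), int(vcpus), state, float(time_s))
        )
    return domains


def find_zombies(domains: List[Domain]) -> List[Domain]:
    """Return domains whose name starts with 'Domain-Unnamed' (any case)."""
    return [d for d in domains if d.name.lower().startswith("domain-unnamed")]


# The two lines quoted above, used as sample input.
SAMPLE = """\
Domain-0 0 2048 64 r----- 6791.5
Domain-Unnamed 1 4099 4 --ps-- 94.8
"""

if __name__ == "__main__":
    for z in find_zombies(parse_xm_list(SAMPLE)):
        print(f"zombie domid={z.domid} holds {z.mem_mib} MiB, state={z.state}")
```

Summing `mem_mib` over the result gives the amount of memory a host has lost to such leftovers.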
The failure is not consistently reproducible: some guests (of the same
type) live-migrate just fine, until eventually some seemingly random
guest fails, leaving a "Domain-unnamed" zombie.
That Domain-unnamed causes several problems:
- The memory allocated to Domain-unnamed remains blocked, which
effectively creates a memory leak on the host
- The DomU that left Domain-unnamed behind cannot be restarted on the
host, as xm thinks it is already running
I have tried various things to get rid of Domain-unnamed, all without
success:
- multiple xm destroy
- restart xend
- delete everything regarding Domain-unnamed in xenstore with
xenstore-rm. The removal is successful, but the domain remains.
Restarting xend after the deletion restores Domain-unnamed in xenstore
So far, the only way to get rid of Domain-unnamed is a virt-host reboot.
As these hosts are all quad-socket Opteron 6272 machines with 256 GB of
RAM running dozens of guests, this is highly impractical.
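The attempts listed above correspond roughly to the command sequence below, written as a Python sketch that only builds and prints the commands (nothing is executed). The xenstore paths `/local/domain/<domid>` and `/vm/<uuid>` are assumptions based on the usual xenstore layout, `service xend restart` assumes the stock CentOS 6 init script, and the function names are hypothetical. The domid and UUID in the example are taken from the xend.log excerpt in note 4.

```python
import shlex
from typing import List


def cleanup_commands(domid: int, vm_uuid: str) -> List[List[str]]:
    """Build (but do not run) the cleanup attempts, in the order tried."""
    return [
        ["xm", "destroy", str(domid)],              # repeated several times
        ["service", "xend", "restart"],             # restart xend
        ["xenstore-rm", f"/local/domain/{domid}"],  # assumed xenstore path
        ["xenstore-rm", f"/vm/{vm_uuid}"],          # assumed xenstore path
        ["service", "xend", "restart"],             # entries come back here
    ]


def format_command(cmd: List[str]) -> str:
    """Render one command as a shell-safe string for review or logging."""
    return " ".join(shlex.quote(arg) for arg in cmd)


if __name__ == "__main__":
    for cmd in cleanup_commands(1, "f5139575-984b-4c28-b470-efc042ba2703"):
        print(format_command(cmd))
```

None of these steps removed the domain for us, so the sketch documents what was tried rather than a fix.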
I have seen this behaviour using Xen 4.2.5. The previous 4.2.4 versions
did not show this problem, although we did not use live-migration
extensively prior to that. Before switching to Xen4CentOS, we used to
build our own Xen 4.2.2 based on a git repo published by Karanbir
Singh. We had several issues with that version, but never observed a
"Domain-unnamed".
Any idea how to resolve this issue would be highly appreciated, as
working live-migration is crucial to us.
Regards,
Thomas Weyergraf
Some notes on our config:
1. We still use xm/xend for various reasons
----
2. Our grub-config for the virtx-hosts is as follows:
----
default=0
timeout=5
#splashimage=(hd0,0)/grub/splash.xpm.gz
#hiddenmenu
title CentOS (xen-4.2.5-37.el6.gz vmlinuz-3.10.56-11.el6.centos.alt.x86_64)
root (hd0,0)
kernel /xen-4.2.5-37.el6.gz iommu=1 console=vga,com1 com1=115200,8n1 vga=text-80x25 dom0_mem=2048M,max:2048M
module /vmlinuz-3.10.56-11.el6.centos.alt.x86_64 ro xencons=hvc0 console=hvc0 root=/dev/fravirtx68/root rd_NO_LUKS LANG=en_US.UTF-8 KEYBOARDTYPE=pc KEYTABLE=de-latin1-nodeadkeys rd_NO_MD SYSFONT=latarcyrheb-sun16 rd_LVM_LV=fravirtx68/root rd_NO_DM
module /initramfs-3.10.56-11.el6.centos.alt.x86_64.img
title CentOS (vmlinuz-3.10.56-11.el6.centos.alt.x86_64)
root (hd0,0)
kernel /vmlinuz-3.10.56-11.el6.centos.alt.x86_64 ro root=/dev/fravirtx68/root rd_NO_LUKS LANG=en_US.UTF-8 KEYBOARDTYPE=pc KEYTABLE=de-latin1-nodeadkeys rd_NO_MD SYSFONT=latarcyrheb-sun16 rd_LVM_LV=fravirtx68/root rd_NO_DM
module /initramfs-3.10.56-11.el6.centos.alt.x86_64.img
3. A typical guest-config looks like:
----
name = "fraappmgmt05t.test.fra.net-m.internal"
uuid = "3778a443-9194-4c46-adff-211d7fcc24da"
memory = "4096"
vcpus = 4
kernel = "hvmloader"
builder = 'hvm'
disk = [
'phy:/dev/disk/by-path/ip-192.168.240.7:3260-iscsi-iqn.1992-08.com.netapp:navfiler21-lun-56,xvda,w',
'phy:/dev/disk/by-path/ip-192.168.240.7:3260-iscsi-iqn.1992-08.com.netapp:navfiler21-lun-57,xvdb,w',
]
vif = [ 'mac=00:16:3e:fa:15:4a,bridge=xenbr11' ]
device_model = 'qemu-dm'
serial='pty'
xen_platform_pci=1
on_poweroff = "destroy"
on_crash = "restart"
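Since xm guest configs use Python syntax, the relevant fields can be pulled out with a few lines of Python, for example to match a guest's name and uuid against what is left behind in xenstore. This is a rough sketch with hypothetical helper names; the config text is an abridged copy of the one above, and `exec` should only ever be used on trusted config files.

```python
from typing import Any, Dict

# Abridged copy of the guest config shown above.
GUEST_CONFIG = '''
name = "fraappmgmt05t.test.fra.net-m.internal"
uuid = "3778a443-9194-4c46-adff-211d7fcc24da"
memory = "4096"
vcpus = 4
builder = 'hvm'
vif = [ 'mac=00:16:3e:fa:15:4a,bridge=xenbr11' ]
on_poweroff = "destroy"
on_crash = "restart"
'''


def load_guest_config(text: str) -> Dict[str, Any]:
    """Evaluate an xm-style config and return its top-level assignments."""
    namespace: Dict[str, Any] = {}
    # Builtins are emptied so that plain assignments work but little else.
    exec(text, {"__builtins__": {}}, namespace)
    return namespace


def summarize(cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce a loaded config to the fields needed to identify a guest."""
    return {
        "name": cfg["name"],
        "uuid": cfg["uuid"],
        "memory_mib": int(cfg["memory"]),
        "vcpus": int(cfg["vcpus"]),
        "is_hvm": cfg.get("builder") == "hvm",
    }


if __name__ == "__main__":
    print(summarize(load_guest_config(GUEST_CONFIG)))
```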
4. The xend.log excerpt of the migration process from the source host:
----
[2014-11-06 22:53:01 13499] DEBUG (XendDomainInfo:1795) Storing domain
details: {'console/port': '7', 'cpu/3/availability': 'online',
'description': '', 'console/limit': '1048576', 'cpu/2/availability':
'online', 'vm': '/vm/f5139575-984b-4c28-b470-efc042ba2703', 'domid':
'1', 'store/port': '6', 'console/type': 'ioemu', 'cpu/0/availability':
'online', 'memory/target': '4194304',
'control/platform-feature-multiprocessor-suspend': '1',
'store/ring-ref': '1044476', 'cpu/1/availability': 'online',
'control/platform-feature-xs_reset_watches': '1',
'image/suspend-cancel': '1', 'name':
'migrating-fraapppeccon06.fra.net-m.internal'}
[2014-11-06 22:53:01 13499] INFO (XendCheckpoint:423) xc_save: failed to
get the suspend evtchn port
[2014-11-06 22:53:01 13499] INFO (XendCheckpoint:423)
[2014-11-06 22:53:34 13499] DEBUG (XendCheckpoint:394) suspend
[2014-11-06 22:53:34 13499] DEBUG (XendCheckpoint:127) In
saveInputHandler suspend
[2014-11-06 22:53:34 13499] DEBUG (XendCheckpoint:129) Suspending 1 ...
[2014-11-06 22:53:34 13499] DEBUG (XendDomainInfo:524)
XendDomainInfo.shutdown(suspend)
[2014-11-06 22:53:34 13499] DEBUG (XendDomainInfo:1882)
XendDomainInfo.handleShutdownWatch
[2014-11-06 22:53:34 13499] DEBUG (XendDomainInfo:1882)
XendDomainInfo.handleShutdownWatch
[2014-11-06 22:53:34 13499] INFO (XendDomainInfo:2079) Domain has
shutdown: name=migrating-fraapppeccon06.fra.net-m.internal id=1
reason=suspend.
[2014-11-06 22:53:34 13499] INFO (XendCheckpoint:135) Domain 1 suspended.
[2014-11-06 22:53:35 13499] INFO (image:542) signalDeviceModel:restore
dm state to running
[2014-11-06 22:53:35 13499] DEBUG (XendCheckpoint:144) Written done
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:3077)
XendDomainInfo.destroy: domid=1
[2014-11-06 22:53:35 13499] ERROR (XendDomainInfo:3091)
XendDomainInfo.destroy: domain destruction failed.
Traceback (most recent call last):
File "/usr/lib64/python2.6/site-packages/xen/xend/XendDomainInfo.py",
line 3086, in destroy
xc.domain_destroy(self.domid)
Error: (16, 'Device or resource busy')
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:2402) Destroying
device model
[2014-11-06 22:53:35 13499] INFO (image:619)
migrating-fraapppeccon06.fra.net-m.internal device model terminated
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:2409) Releasing devices
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:2415) Removing vif/0
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:1276)
XendDomainInfo.destroyDevice: deviceClass = vif, device = vif/0
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:2415) Removing console/0
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:1276)
XendDomainInfo.destroyDevice: deviceClass = console, device = console/0
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:2415) Removing vbd/51712
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:1276)
XendDomainInfo.destroyDevice: deviceClass = vbd, device = vbd/51712
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:2415) Removing vbd/51728
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:1276)
XendDomainInfo.destroyDevice: deviceClass = vbd, device = vbd/51728
[2014-11-06 22:53:36 13499] DEBUG (XendCheckpoint:124) [xc_save]:
/usr/lib/xen/bin/xc_save 26 2 0 0 5
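
To find such failed destroys in xend.log across hosts, something like the following Python sketch might help. It first re-joins wrapped lines (as in the excerpt above), then looks for ERROR entries containing "domain destruction failed" and extracts the error tuple from the traceback. The regular expressions are assumptions derived from this excerpt only, and all function names are hypothetical.

```python
import re
from typing import Dict, Iterable, List, Optional

# An entry starts with "[YYYY-MM-DD HH:MM:SS pid] LEVEL (source:line)".
ENTRY = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<pid>\d+)\] "
    r"(?P<level>[A-Z]+) \((?P<src>[^)]+)\)\s*(?P<msg>.*)$"
)
# Matches the error tuple at the end of the traceback.
ERRNO = re.compile(r"Error: \((\d+), '([^']*)'\)")


def unwrap(lines: Iterable[str]) -> List[str]:
    """Join continuation lines onto the log entry that started them."""
    entries: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("[") and ENTRY.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line.strip()
    return entries


def failed_destroys(lines: Iterable[str]) -> List[Dict[str, Optional[object]]]:
    """Return timestamp, errno and reason for each failed domain destroy."""
    results = []
    for entry in unwrap(lines):
        m = ENTRY.match(entry)
        if not m or m.group("level") != "ERROR":
            continue
        if "domain destruction failed" not in m.group("msg"):
            continue
        err = ERRNO.search(m.group("msg"))
        results.append({
            "timestamp": m.group("ts"),
            "errno": int(err.group(1)) if err else None,
            "reason": err.group(2) if err else None,
        })
    return results


# Lines copied from the excerpt above, including the mail line wrapping.
SAMPLE = """\
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:3077)
XendDomainInfo.destroy: domid=1
[2014-11-06 22:53:35 13499] ERROR (XendDomainInfo:3091)
XendDomainInfo.destroy: domain destruction failed.
Traceback (most recent call last):
File "/usr/lib64/python2.6/site-packages/xen/xend/XendDomainInfo.py",
line 3086, in destroy
xc.domain_destroy(self.domid)
Error: (16, 'Device or resource busy')
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:2402) Destroying
device model
"""

if __name__ == "__main__":
    for hit in failed_destroys(SAMPLE.splitlines()):
        print(hit)
```

On a real host the input would be the lines of the xend.log file rather than `SAMPLE`.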