On Sun, Nov 16, 2014 at 12:39 AM, Thomas Weyergraf <T.Weyergraf@virtfinity.de> wrote:
Hi folks,
we (the company I work for) run several dozen virtualisation servers using CentOS 6 + Xen4CentOS as the virtualisation infrastructure.
With the latest versions of all packages installed [1], we see failures during live migration of stock CentOS 6 HVM guests, leaving a "Domain-Unnamed" zombie on the source host while the migrated guest runs fine on the target host.
Name                                ID   Mem VCPUs      State   Time(s)
Domain-0                             0  2048    64     r-----   6791.5
Domain-Unnamed                       1  4099     4     --ps--     94.8
The failure is not consistently reproducible: some guests (of the same type) live-migrate just fine, until eventually some seemingly random guest fails and leaves a "Domain-Unnamed" zombie behind.
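For anyone trying to reproduce this, a back-and-forth migration loop of roughly the following shape is the kind of test that exercises live migration until a failure shows up (the guest name and host names below are placeholders, not our real ones):

    # Migrate a test guest back and forth between two hosts until a
    # live migration fails and leaves a Domain-Unnamed behind.
    GUEST=someguest.example.internal      # placeholder guest name
    HOST_A=virtx-a.example.internal       # placeholder host
    HOST_B=virtx-b.example.internal       # placeholder host

    while true; do
        ssh "$HOST_A" xm migrate --live "$GUEST" "$HOST_B" || break
        ssh "$HOST_B" xm migrate --live "$GUEST" "$HOST_A" || break
    done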
That Domain-Unnamed causes several problems:
- The memory allocated to Domain-Unnamed remains blocked, effectively a memory leak on the host.
- The DomU that produced Domain-Unnamed cannot be restarted on the host, as xm thinks it is already running.
I have tried various things to get rid of Domain-Unnamed (roughly the commands sketched after this list), all without success:
- multiple xm destroy
- restarting xend
- deleting everything related to Domain-Unnamed in xenstore with xenstore-rm. The removal is successful, but the domain remains; restarting xend after the deletion restores the Domain-Unnamed entries in xenstore.
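For reference, those cleanup attempts were essentially the following (the domain ID and VM UUID are taken from the xm list output and xend.log below; the exact xenstore paths may differ on other hosts):

    # Repeated attempts to destroy the zombie domain (domid 1 here)
    xm destroy 1

    # Restart the xend management daemon
    service xend restart

    # Remove the leftover xenstore entries for the zombie domain.
    # The removal succeeds, but the domain stays around, and a
    # restart of xend recreates the entries in xenstore.
    xenstore-rm /local/domain/1
    xenstore-rm /vm/f5139575-984b-4c28-b470-efc042ba2703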
So far, the only way to get rid of Domain-Unnamed is to reboot the virtualisation host. As these hosts are all quad-socket Opteron 6272 machines with 256 GB of RAM running dozens of guests, this is highly impractical.
I have seen this behaviour with Xen 4.2.5. The previous 4.2.4 packages did not show the problem; however, we did not use live migration extensively before that. Before switching to Xen4CentOS, we built our own Xen 4.2.2 from a git repo published by Karanbir Singh. We had several issues with that version, but never observed a "Domain-Unnamed".
Any ideas on how to resolve this issue would be highly appreciated, as working live migration is crucial to us.
Regards, Thomas Weyergraf
Some notes on our config:
- We still use xm/xend for various reasons
- Our grub-config for the virtx-hosts is as follows:
default=0
timeout=5
#splashimage=(hd0,0)/grub/splash.xpm.gz
#hiddenmenu
title CentOS (xen-4.2.5-37.el6.gz vmlinuz-3.10.56-11.el6.centos.alt.x86_64)
        root (hd0,0)
        kernel /xen-4.2.5-37.el6.gz iommu=1 console=vga,com1 com1=115200,8n1 vga=text-80x25 dom0_mem=2048M,max:2048M
        module /vmlinuz-3.10.56-11.el6.centos.alt.x86_64 ro xencons=hvc0 console=hvc0 root=/dev/fravirtx68/root rd_NO_LUKS LANG=en_US.UTF-8 KEYBOARDTYPE=pc KEYTABLE=de-latin1-nodeadkeys rd_NO_MD SYSFONT=latarcyrheb-sun16 rd_LVM_LV=fravirtx68/root rd_NO_DM
        module /initramfs-3.10.56-11.el6.centos.alt.x86_64.img
title CentOS (vmlinuz-3.10.56-11.el6.centos.alt.x86_64)
        root (hd0,0)
        kernel /vmlinuz-3.10.56-11.el6.centos.alt.x86_64 ro root=/dev/fravirtx68/root rd_NO_LUKS LANG=en_US.UTF-8 KEYBOARDTYPE=pc KEYTABLE=de-latin1-nodeadkeys rd_NO_MD SYSFONT=latarcyrheb-sun16 rd_LVM_LV=fravirtx68/root rd_NO_DM
        module /initramfs-3.10.56-11.el6.centos.alt.x86_64.img
- A typical guest-config looks like:
name = "fraappmgmt05t.test.fra.net-m.internal" uuid = "3778a443-9194-4c46-adff-211d7fcc24da" memory = "4096" vcpus = 4 kernel = "hvmloader" builder = 'hvm' disk = [ 'phy:/dev/disk/by-path/ip-192.168.240.7:3260-iscsi-iqn.1992-08.com.netapp:navfiler21-lun-56,xvda,w', 'phy:/dev/disk/by-path/ip-192.168.240.7:3260-iscsi-iqn.1992-08.com.netapp:navfiler21-lun-57,xvdb,w', ] vif = [ 'mac=00:16:3e:fa:15:4a,bridge=xenbr11' ] device_model = 'qemu-dm' serial='pty' xen_platform_pci=1 on_poweroff = "destroy" on_crash = "restart"
- The xend.log excerpt of the migration process from the source host:
[2014-11-06 22:53:01 13499] DEBUG (XendDomainInfo:1795) Storing domain details: {'console/port': '7', 'cpu/3/availability': 'online', 'description': '', 'console/limit': '1048576', 'cpu/2/availability': 'online', 'vm': '/vm/f5139575-984b-4c28-b470-efc042ba2703', 'domid': '1', 'store/port': '6', 'console/type': 'ioemu', 'cpu/0/availability': 'online', 'memory/target': '4194304', 'control/platform-feature-multiprocessor-suspend': '1', 'store/ring-ref': '1044476', 'cpu/1/availability': 'online', 'control/platform-feature-xs_reset_watches': '1', 'image/suspend-cancel': '1', 'name': 'migrating-fraapppeccon06.fra.net-m.internal'}
[2014-11-06 22:53:01 13499] INFO (XendCheckpoint:423) xc_save: failed to get the suspend evtchn port
[2014-11-06 22:53:01 13499] INFO (XendCheckpoint:423)
[2014-11-06 22:53:34 13499] DEBUG (XendCheckpoint:394) suspend
[2014-11-06 22:53:34 13499] DEBUG (XendCheckpoint:127) In saveInputHandler suspend
[2014-11-06 22:53:34 13499] DEBUG (XendCheckpoint:129) Suspending 1 ...
[2014-11-06 22:53:34 13499] DEBUG (XendDomainInfo:524) XendDomainInfo.shutdown(suspend)
[2014-11-06 22:53:34 13499] DEBUG (XendDomainInfo:1882) XendDomainInfo.handleShutdownWatch
[2014-11-06 22:53:34 13499] DEBUG (XendDomainInfo:1882) XendDomainInfo.handleShutdownWatch
[2014-11-06 22:53:34 13499] INFO (XendDomainInfo:2079) Domain has shutdown: name=migrating-fraapppeccon06.fra.net-m.internal id=1 reason=suspend.
[2014-11-06 22:53:34 13499] INFO (XendCheckpoint:135) Domain 1 suspended.
[2014-11-06 22:53:35 13499] INFO (image:542) signalDeviceModel:restore dm state to running
[2014-11-06 22:53:35 13499] DEBUG (XendCheckpoint:144) Written done
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:3077) XendDomainInfo.destroy: domid=1
[2014-11-06 22:53:35 13499] ERROR (XendDomainInfo:3091) XendDomainInfo.destroy: domain destruction failed.
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/xen/xend/XendDomainInfo.py", line 3086, in destroy
    xc.domain_destroy(self.domid)
Error: (16, 'Device or resource busy')
Actually, looking at this again -- something is definitely weird here. That hypercall shouldn't be able to return anything except error 11, "EAGAIN", or error 3, "ESRCH". Error 16 "EBUSY" isn't anywhere in the codepath for domain_destroy, and several places will call BUG_ON() if the error returned is *not* EAGAIN.
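If you want to cross-check that against the tree your packages were actually built from, a quick grep over the destroy path in the hypervisor and in libxc covers it (the paths below assume a Xen 4.2 source layout):

    # Error values the domain destruction path can produce in the hypervisor
    grep -n 'EAGAIN\|ESRCH\|EBUSY' xen/common/domain.c xen/common/domctl.c

    # The libxc wrapper the Python toolstack calls into
    grep -n 'xc_domain_destroy' tools/libxc/xc_domain.c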
Are you sure you're running a matched set of hypervisor and tools?
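One way to check is to compare what the running hypervisor reports with what is installed (package names below assume the Xen4CentOS RPM layout):

    # Version of the hypervisor that is actually running, as seen
    # through the toolstack
    xm info | egrep 'xen_(major|minor|extra|changeset)'

    # Versions of the installed hypervisor and userspace packages;
    # these should all come from the same Xen4CentOS build
    rpm -qa 'xen*' | sort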
-George