[CentOS-virt] Problem with Xen4CentOS

Tue Nov 18 16:39:47 UTC 2014
Thomas Weyergraf <T.Weyergraf at virtfinity.de>

On 11/18/2014 04:59 PM, George Dunlap wrote:
> On Sun, Nov 16, 2014 at 12:39 AM, Thomas Weyergraf
> <T.Weyergraf at virtfinity.de> wrote:
>> Hi folks,
>>
>> we (the company I am working for) are running several dozen
>> virtualisation servers using CentOS 6 + Xen4CentOS as the virtualisation
>> infrastructure.
>>
>> With the latest versions of all packages installed [1], we see failures in
>> live-migration of stock-CentOS6 HVM guests, leaving a "Domain-Unnamed" on
>> the source host, while the migrated guest runs fine on the target host.
>>
>> Domain-0                                     0  2048    64 r----- 6791.5
>> Domain-Unnamed                               1  4099     4 --ps--     94.8
>>
>> The failure is not consistently reproducible: some guests (of the same type)
>> live-migrate just fine, until eventually some seemingly random guest fails,
>> leaving a "Domain-Unnamed" zombie.
> Thanks for this report.
Good to know that reports are appreciated. I have other issues, which I
will report soon as well.
>
> It looks like for some reason xend has asked Xen to shut down the
> domain, but Xen is saying, "Sorry, can't do that yet."  That's why
> restarting xend and removing things from xenstore don't work: xend is
> just saying what it sees, and what it sees is a zombie domain that
> refuses to die. :-)
Right, that is what I figured out as well. Tearing down the migrated
DomU on the source host works fine up to the actual destruction of the
domain, which then fails.
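
To spot such leftovers on many hosts, the domain listing can be filtered
automatically. The following is only a minimal sketch, not part of any Xen
tooling: the function name is made up for illustration, the column layout
(Name, ID, Mem, VCPUs, State, Time(s)) and the header row in the sample are
assumptions, and the two data rows are the ones quoted above.

```python
# Sketch: find leftover domains in `xm list` / `xl list` style output.
# Assumed column layout: Name ID Mem VCPUs State Time(s)

def find_leftover_domains(listing, names=("Domain-Unnamed",)):
    """Return (name, domid, state) for every row whose name is in `names`."""
    leftovers = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines, the header row and rows with an unexpected shape.
        if len(fields) != 6 or not fields[1].isdigit():
            continue
        name, domid, _mem, _vcpus, state, _time = fields
        if name in names:
            leftovers.append((name, int(domid), state))
    return leftovers


# Header row is assumed; the data rows are taken from the report.
sample = """\
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  2048    64     r-----    6791.5
Domain-Unnamed                               1  4099     4     --ps--      94.8
"""

for name, domid, state in find_leftover_domains(sample):
    print(name, domid, state)  # prints: Domain-Unnamed 1 --ps--
```

The `names` parameter is there because other toolstacks may label such a
domain differently; that is an assumption to check against the actual
listing on the affected host.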
>
> Do you have a serial port connected to any of your servers?
> * If so, could you:
>   - Send the output just after you notice a domain in this state
>   - Type "Ctrl-A" three times on the console to switch to Xen, and then
> type 'q'  (And send the resulting output)
I knew you were (rightfully) going to ask for that. However, I have
seen this problem only in our production environment, where such changes
are next to impossible for policy reasons. I am currently trying to
get hold of some spare production servers to configure them accordingly
and re-create the problem. If that works out, I will happily
provide the dump. However, I cannot guarantee that I will get the required
resources anytime soon. It may take weeks to actually get spare machines.
> * If not, could you:
>   - send the output of "xl dmesg"
>   - Run "xl debug-keys q" and again take the output of "xl dmesg"?
I actually did that, but the result was not saved. IIRC, you could basically
see all the bits of the DomU still in place in the xenstore part of the dump.
I will try to capture that dump ASAP.
The host has already been rebooted, so capturing the dump for the
reported case is no longer possible.
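
So that the output is not lost again, the two "xl dmesg" snapshots around
"xl debug-keys q" could be written to files right away. This is a minimal
sketch under stated assumptions: it needs a Python 3 interpreter (stock
CentOS 6 ships Python 2.6), it has to run as root in Dom0, and the function
names and file naming scheme are made up for illustration.

```python
import subprocess
import time
from pathlib import Path


def run_command(cmd):
    """Run a command and return its standard output as text."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


def capture_debug_dump(outdir, runner=run_command):
    """Save `xl dmesg` before and after `xl debug-keys q` into outdir."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    before = outdir / f"xl-dmesg-before-{stamp}.txt"
    after = outdir / f"xl-dmesg-after-{stamp}.txt"

    # Snapshot of the hypervisor log before the dump is requested.
    before.write_text(runner(["xl", "dmesg"]))
    # Ask Xen to dump domain and VCPU information into its log.
    runner(["xl", "debug-keys", "q"])
    # Snapshot that now includes the requested dump.
    after.write_text(runner(["xl", "dmesg"]))
    return before, after
```

It would be called as, for example, capture_debug_dump("/root/xen-debug"),
where the directory is a hypothetical choice. The `runner` parameter only
exists so the function can be exercised without a Xen host.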
>
> Can you also do "ps ax | grep qemu" to check to see if the qemu
> instance associated with this domain has actually been destroyed, or
> if it's still around?
I checked: the qemu-dm process (btw, it had been started with the correct
parameters) was already gone.
>
> Also, have you tried running "xl destroy" on the domain and seeing
> what happens?  xl is stateless, so it can often do things alongside
> xend.  This is not a good idea in general as they can frequently end
> up stepping on each other's toes; but in this case I think it
> shouldn't be a problem.
Yes, I did, but to no avail. I even shut down xend for this attempt to
make sure I do not trigger any code paths in xl & friends that might
take extra steps for the "xend is running" case.
>
> Thanks,
>
>   -George
Thanks for your time and consideration. If you happen to have any hints
on things to try or look into, I'd be a happy consumer ;)

Regards,
Thomas