[CentOS-virt] Xen CentOS 7.3 server + CentOS 7.3 VM fails to boot after CR updates (applied to VM)!

Mon Sep 11 18:13:04 UTC 2017
Johnny Hughes <johnny at centos.org>

On 09/07/2017 03:17 PM, Kevin Stange wrote:
> On 09/06/2017 05:21 PM, Kevin Stange wrote:
>> On 09/06/2017 08:40 AM, Johnny Hughes wrote:
>>> On 09/05/2017 02:26 PM, Kevin Stange wrote:
>>>> On 09/04/2017 05:27 PM, Johnny Hughes wrote:
>>>>> On 09/04/2017 03:59 PM, Kevin Stange wrote:
>>>>>> On 09/02/2017 08:11 AM, Johnny Hughes wrote:
>>>>>>> On 09/01/2017 02:41 PM, Kevin Stange wrote:
>>>>>>>> On 08/31/2017 07:50 AM, PJ Welsh wrote:
>>>>>>>>> A recently created and fully functional CentOS 7.3 VM fails to boot
>>>>>>>>> after applying CR updates:
>>>>>>>> <snip>
>>>>>>>>> Server OS is CentOS 7.3 using Xen (no CR updates):
>>>>>>>>> rpm -qa xen\*
>>>>>>>>> xen-hypervisor-4.6.3-15.el7.x86_64
>>>>>>>>> xen-4.6.3-15.el7.x86_64
>>>>>>>>> xen-licenses-4.6.3-15.el7.x86_64
>>>>>>>>> xen-libs-4.6.3-15.el7.x86_64
>>>>>>>>> xen-runtime-4.6.3-15.el7.x86_64
>>>>>>>>>
>>>>>>>>> uname -a
>>>>>>>>> Linux tsxen2.xx.com 4.9.39-29.el7.x86_64 #1 SMP
>>>>>>>>> Fri Jul 21 15:09:00 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
>>>>>>>>>
>>>>>>>>> Sadly, the other issue is that the grub menu will not display for me to
>>>>>>>>> select another kernel to see if it is just a kernel issue.
>>>>>>>>>
>>>>>>>>> The dracut prompt does not show any /dev/disk folder either.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I'm seeing this as well.  My host is 4.9.44-29 and Xen 4.4.4-26 from
>>>>>>>> testing repo, my guest is 3.10.0-693.1.1.  Guest boots fine with
>>>>>>>> 514.26.2.  The kernel messages that appear to kick off the failure for
>>>>>>>> me start with a page allocation failure.  It eventually reaches dracut
>>>>>>>> failures due to systemd/udev not setting up properly, but I think the
>>>>>>>> root is this:
>>>>>>>>
>> <snip>
>>>>>>>
>>>>>>> Do any of you guys have access to RHEL to try the RHEL 7.4 Kernel?
>>>>>>
>>>>>> I think I may.  I haven't tried yet, but I'll see if I can get my hands
>>>>>> on one and test it when I'm back at the office tomorrow.
>>>>>>
>>>>>> RH closed my bug as "WONTFIX" so far, saying Red Hat Quality Engineering
>>>>>> Management declined the request.  I started to look at the Red Hat
>>>>>> source browser to see the list of patches from 693 to 514, but getting
>>>>>> the full list seems impossible because the change log only goes back to
>>>>>> 644 and there doesn't seem to be a way to obtain full builds of
>>>>>> unreleased kernels.  Unless I'm mistaken.
>>>>>>
>>>>>> I will also do some digging via RH support if I can.
>>>>>>
>>>>> I would think that RH would want AWS support for RHEL 7.4 and I thought
>>>>> AWS was run on Xen // Note:  I could be wrong about that.
>>>>>
>>>>> In any event, at the very least, we can make a kernel that boots PV for
>>>>> 7.4 at some point.
>>>>
>>>> AWS does run on Xen, but the modifications they make to Xen are not
>>>> known to me nor which version of Xen they use.  They may also run the
>>>> domains as HVM, which seems to mitigate the issue here.
>>>>
>>>> I just verified this kernel issue exists on a RHEL 7.3 system image
>>>> under the same conditions, when it's updated to RHEL 7.4 and kernel
>>>> 3.10.0-693.2.1.el7.x86_64.
>>>>
>>>
>>> One other option is to run the DomUs as PVHVM:
>>> https://wiki.xen.org/wiki/Xen_Linux_PV_on_HVM_drivers
>>>
>>> That should be much better performance than HVM and may be a workable
>>> solution for people who don't want to modify their VM kernel.
>>>
>>> Here is more info on PVHVM:
>>> https://wiki.xen.org/wiki/PV_on_HVM
>>>
>>> ================
>>> Also heard from someone to try this Config file change to the base
>>> kernel and rebuild:
>>>
>>> CONFIG_RANDOMIZE_BASE=n
>>
>> This suggestion was mirrored in the RH bugzilla as well.  It worked, but
>> the same issue does not exist in newer kernels which have the option on.
>> I've posted updated findings in the CentOS bug, including a patch
>> that I found which seems to fix the issue:
>>
>> https://bugs.centos.org/view.php?id=13763#c30014
> 
> With many thanks to hughesjr and toracat, I was able to find a patch
> that seems to resolve this issue and get it into CentOS Plus
> 3.10.0-693.2.1.  I've asked Red Hat to apply it to some future kernel
> update, but that is only a dream for now.
> 
> In the meantime, if anyone who has been experiencing the issue with PV
> domains can try out the CentOS Plus kernel here and provide feedback,
> I'd appreciate it!
> 
> https://buildlogs.centos.org/c7-plus/kernel-plus/20170907163005/3.10.0-693.2.1.el7.centos.plus.x86_64/
> 

I can also verify that PV-on-HVM mode works with the regular
(non-plus) kernel, if that is an option for you.  Many public clouds
(like AWS) require HVM or PV-on-HVM.

See these for more information on PV-on-HVM:

https://wiki.xen.org/wiki/PV_on_HVM

https://wiki.xen.org/wiki/Xen_Linux_PV_on_HVM_drivers
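
As a rough sketch only (not taken from anyone's setup in this thread),
an xl domain config for a PV-on-HVM guest might look something like the
fragment below.  The guest name, memory and vcpu sizes, disk path, and
bridge name are all placeholders, and it assumes the guest kernel ships
the Xen PV frontend drivers, so adjust it to match your own DomU:

```
# Hypothetical /etc/xen/guest1.cfg -- every value here is a placeholder
name    = "guest1"
builder = "hvm"            # boot as HVM instead of PV
memory  = 2048
vcpus   = 2

# Expose the Xen platform PCI device so the guest kernel can attach its
# PV disk and network frontends (this is normally the default for HVM)
xen_platform_pci = 1

disk = [ 'phy:/dev/vg0/guest1,xvda,w' ]
vif  = [ 'bridge=xenbr0' ]
```

Note that an existing PV guest booted via pygrub may need a bootloader
installed on its disk before it will start as HVM, so test on a copy
first.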
