[CentOS-virt] Xen CentOS 7.3 server + CentOS 7.3 VM fails to boot after CR updates (applied to VM)!

Thu Sep 7 20:17:32 UTC 2017
Kevin Stange <kevin at steadfast.net>

On 09/06/2017 05:21 PM, Kevin Stange wrote:
> On 09/06/2017 08:40 AM, Johnny Hughes wrote:
>> On 09/05/2017 02:26 PM, Kevin Stange wrote:
>>> On 09/04/2017 05:27 PM, Johnny Hughes wrote:
>>>> On 09/04/2017 03:59 PM, Kevin Stange wrote:
>>>>> On 09/02/2017 08:11 AM, Johnny Hughes wrote:
>>>>>> On 09/01/2017 02:41 PM, Kevin Stange wrote:
>>>>>>> On 08/31/2017 07:50 AM, PJ Welsh wrote:
>>>>>>>> A recently created and fully functional CentOS 7.3 VM fails to boot
>>>>>>>> after applying CR updates:
>>>>>>> <snip>
>>>>>>>> Server OS is CentOS 7.3 using Xen (no CR updates):
>>>>>>>> rpm -qa xen\*
>>>>>>>> xen-hypervisor-4.6.3-15.el7.x86_64
>>>>>>>> xen-4.6.3-15.el7.x86_64
>>>>>>>> xen-licenses-4.6.3-15.el7.x86_64
>>>>>>>> xen-libs-4.6.3-15.el7.x86_64
>>>>>>>> xen-runtime-4.6.3-15.el7.x86_64
>>>>>>>>
>>>>>>>> uname -a
>>>>>>>> Linux tsxen2.xx.com 4.9.39-29.el7.x86_64 #1 SMP
>>>>>>>> Fri Jul 21 15:09:00 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
>>>>>>>>
>>>>>>>> Sadly, the other issue is that the grub menu will not display for me to
>>>>>>>> select another kernel to see if it is just a kernel issue.
>>>>>>>>
>>>>>>>> The dracut prompt does not show any /dev/disk folder either.
>>>>>>>>
>>>>>>>
>>>>>>> I'm seeing this as well.  My host is 4.9.44-29 and Xen 4.4.4-26 from
>>>>>>> testing repo, my guest is 3.10.0-693.1.1.  Guest boots fine with
>>>>>>> 514.26.2.  The kernel messages that appear to kick off the failure for
>>>>>>> me start with a page allocation failure.  It eventually reaches dracut
>>>>>>> failures due to systemd/udev not setting up properly, but I think the
>>>>>>> root is this:
>>>>>>>
> <snip>
>>>>>>
>>>>>> Do any of you guys have access to RHEL to try the RHEL 7.4 Kernel?
>>>>>
>>>>> I think I may.  I haven't tried yet, but I'll see if I can get my hands
>>>>> on one and test it tomorrow when I'm back at the office.
>>>>>
>>>>> RH closed my bug as "WONTFIX" so far, saying Red Hat Quality Engineering
>>>>> Management declined the request.  I started to look at the Red Hat
>>>>> source browser to see the list of patches from 693 to 514, but getting
>>>>> the full list seems impossible because the change log only goes back to
>>>>> 644 and there doesn't seem to be a way to obtain full builds of
>>>>> unreleased kernels.  Unless I'm mistaken.
>>>>>
>>>>> I will also do some digging via RH support if I can.
>>>>>
>>>> I would think that RH would want AWS support for RHEL 7.4 and I thought
>>>> AWS was run on Xen // Note:  I could be wrong about that.
>>>>
>>>> In any event, at the very least, we can make a kernel that boots PV for
>>>> 7.4 at some point.
>>>
>>> AWS does run on Xen, but I don't know what modifications they make to
>>> Xen or which version of Xen they use.  They may also run the
>>> domains as HVM, which seems to mitigate the issue here.
>>>
>>> I just verified this kernel issue exists on a RHEL 7.3 system image
>>> under the same conditions, when it's updated to RHEL 7.4 and kernel
>>> 3.10.0-693.2.1.el7.x86_64.
>>>
>>
>> One other option is to run the DomUs as PVHVM:
>> https://wiki.xen.org/wiki/Xen_Linux_PV_on_HVM_drivers
>>
>> That should be much better performance than HVM and may be a workable
>> solution for people who don't want to modify their VM kernel.
>>
>> Here is more info on PVHVM:
>> https://wiki.xen.org/wiki/PV_on_HVM
>>
>> ================
>> Also heard from someone to try this Config file change to the base
>> kernel and rebuild:
>>
>> CONFIG_RANDOMIZE_BASE=n
> 
> This suggestion was mirrored in the RH bugzilla as well.  It worked, but
> the same issue does not exist in newer kernels which have the option on.
> I've posted updated findings in the CentOS bug, including a patch
> that I found which seems to fix the issue:
> 
> https://bugs.centos.org/view.php?id=13763#c30014
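
For anyone who wants to confirm how CONFIG_RANDOMIZE_BASE is set in a
given kernel build before or after trying the rebuild suggested above,
here is a minimal sketch.  It assumes the kernel's config file is
installed as /boot/config-<release>, as it normally is on CentOS, and
the function name kaslr_setting is only a hypothetical helper, not an
existing tool.  Note that an option turned off at build time shows up in
the final config as a "# ... is not set" comment rather than "=n".

```shell
#!/bin/sh
# kaslr_setting: hypothetical helper (not a standard tool).
# Prints "enabled", "disabled", or "unknown" depending on how
# CONFIG_RANDOMIZE_BASE appears in the given kernel config file.
kaslr_setting() {
    config_file="$1"
    if grep -q '^CONFIG_RANDOMIZE_BASE=y' "$config_file" 2>/dev/null; then
        echo "enabled"
    elif grep -q '^# CONFIG_RANDOMIZE_BASE is not set' "$config_file" 2>/dev/null; then
        # A disabled option is recorded as a comment in the final config.
        echo "disabled"
    else
        # Option not mentioned, or the file could not be read.
        echo "unknown"
    fi
}

# Example use inside a guest (assumes the usual /boot/config-<release> path):
#   kaslr_setting "/boot/config-$(uname -r)"
```

This only reports the build-time setting; it says nothing about whether
a particular kernel will actually boot as a PV guest.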

With many thanks to hughesjr and toracat, I was able to find a patch
that seems to resolve this issue and get it into CentOS Plus
3.10.0-693.2.1.  I've asked Red Hat to apply it to some future kernel
update, but that is only a dream for now.

In the meantime, if anyone who has been experiencing the issue with PV
domains can try out the CentOS Plus kernel here and provide feedback,
I'd appreciate it!

https://buildlogs.centos.org/c7-plus/kernel-plus/20170907163005/3.10.0-693.2.1.el7.centos.plus.x86_64/
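
When reporting feedback, it helps to say which kernel the PV guest was
actually running.  The sketch below sorts a kernel release string into
the cases mentioned in this thread.  It is an assumption-laden
illustration: the function name classify_guest_kernel is hypothetical,
and the patterns cover only the versions reported here (514.26.2 booting,
693.1.1 and 693.2.1 failing, and the 693.2.1 CentOS Plus build carrying
the patch).  Anything else is reported as unknown rather than guessed.

```shell
#!/bin/sh
# classify_guest_kernel: hypothetical helper based only on the kernel
# versions reported in this thread for Xen PV guests.
classify_guest_kernel() {
    case "$1" in
        # CentOS Plus build that carries the candidate patch.
        3.10.0-693.2.1.el7.centos.plus.*) echo "patched" ;;
        # Stock kernels reported as failing to boot as PV guests.
        3.10.0-693.1.1.*|3.10.0-693.2.1.*) echo "failing" ;;
        # Older kernel reported as booting fine.
        3.10.0-514.26.2.*) echo "working" ;;
        # Not mentioned in the thread; make no claim.
        *) echo "unknown" ;;
    esac
}

# Example use inside the guest:
#   classify_guest_kernel "$(uname -r)"
```

The "patched" label only means the build includes the patch; whether it
actually boots on your host is exactly the feedback being asked for.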

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
kevin at steadfast.net | www.steadfast.net