[CentOS-virt] Xen CentOS 7.3 server + CentOS 7.3 VM fails to boot after CR updates (applied to VM)!

Thu Sep 14 11:00:08 UTC 2017
Adi Pircalabu <adi at ddns.com.au>

On 14-09-2017 20:57, Adi Pircalabu wrote:
> On 08-09-2017 6:17, Kevin Stange wrote:
>> On 09/06/2017 05:21 PM, Kevin Stange wrote:
>>> On 09/06/2017 08:40 AM, Johnny Hughes wrote:
>>>> On 09/05/2017 02:26 PM, Kevin Stange wrote:
>>>>> On 09/04/2017 05:27 PM, Johnny Hughes wrote:
>>>>>> On 09/04/2017 03:59 PM, Kevin Stange wrote:
>>>>>>> On 09/02/2017 08:11 AM, Johnny Hughes wrote:
>>>>>>>> On 09/01/2017 02:41 PM, Kevin Stange wrote:
>>>>>>>>> On 08/31/2017 07:50 AM, PJ Welsh wrote:
>>>>>>>>>> A recently created and fully functional CentOS 7.3 VM fails to 
>>>>>>>>>> boot
>>>>>>>>>> after applying CR updates:
>>>>>>>>> <snip>
>>>>>>>>>> Server OS is CentOS 7.3 using Xen (no CR updates):
>>>>>>>>>> rpm -qa xen\*
>>>>>>>>>> xen-hypervisor-4.6.3-15.el7.x86_64
>>>>>>>>>> xen-4.6.3-15.el7.x86_64
>>>>>>>>>> xen-licenses-4.6.3-15.el7.x86_64
>>>>>>>>>> xen-libs-4.6.3-15.el7.x86_64
>>>>>>>>>> xen-runtime-4.6.3-15.el7.x86_64
>>>>>>>>>> 
>>>>>>>>>> uname -a
>>>>>>>>>> Linux tsxen2.xx.com 4.9.39-29.el7.x86_64 #1 SMP
>>>>>>>>>> Fri Jul 21 15:09:00 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
>>>>>>>>>> 
>>>>>>>>>> Sadly, the other issue is that the grub menu will not display 
>>>>>>>>>> for me to
>>>>>>>>>> select another kernel to see if it is just a kernel issue.
>>>>>>>>>> 
>>>>>>>>>> The dracut prompt does not show any /dev/disk folder either.
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I'm seeing this as well.  My host is 4.9.44-29 and Xen 4.4.4-26 
>>>>>>>>> from
>>>>>>>>> testing repo, my guest is 3.10.0-693.1.1.  Guest boots fine 
>>>>>>>>> with
>>>>>>>>> 514.26.2.  The kernel messages that appear to kick off the 
>>>>>>>>> failure for
>>>>>>>>> me start with a page allocation failure.  It eventually reaches 
>>>>>>>>> dracut
>>>>>>>>> failures due to systemd/udev not setting up properly, but I 
>>>>>>>>> think the
>>>>>>>>> root is this:
>>>>>>>>> 
>>> <snip>
>>>>>>>> 
>>>>>>>> Do any of you guys have access to RHEL to try the RHEL 7.4 
>>>>>>>> Kernel?
>>>>>>> 
>>>>>>> I think I may.  I haven't tried yet, but I'll see if I can get my 
>>>>>>> hands
>>>>>>> on one and test it when I'm back at the office tomorrow.
>>>>>>> 
>>>>>>> RH closed my bug as "WONTFIX" so far, saying Red Hat Quality 
>>>>>>> Engineering
>>>>>>> Management declined the request.  I started to look at the Red 
>>>>>>> Hat
>>>>>>> source browser to see the list of patches from 514 to 693, but 
>>>>>>> getting
>>>>>>> the full list seems impossible because the change log only goes 
>>>>>>> back to
>>>>>>> 644 and there doesn't seem to be a way to obtain full builds of
>>>>>>> unreleased kernels.  Unless I'm mistaken.
>>>>>>> 
>>>>>>> I will also do some digging via RH support if I can.
>>>>>>> 
>>>>>> I would think that RH would want AWS support for RHEL 7.4 and I 
>>>>>> thought
>>>>>> AWS was run on Xen // Note:  I could be wrong about that.
>>>>>> 
>>>>>> In any event, at the very least, we can make a kernel that boots 
>>>>>> PV for
>>>>>> 7.4 at some point.
>>>>> 
>>>>> AWS does run on Xen, but I don't know what modifications they make
>>>>> to Xen or which version of Xen they use.  They may also run the
>>>>> domains as HVM, which seems to mitigate the issue here.
>>>>> 
>>>>> I just verified this kernel issue exists on a RHEL 7.3 system image
>>>>> under the same conditions, when it's updated to RHEL 7.4 and kernel
>>>>> 3.10.0-693.2.1.el7.x86_64.
>>>>> 
>>>> 
>>>> One other option is to run the DomUs as PVHVM:
>>>> https://wiki.xen.org/wiki/Xen_Linux_PV_on_HVM_drivers
>>>> 
>>>> That should be much better performance than HVM and may be a 
>>>> workable
>>>> solution for people who don't want to modify their VM kernel.
>>>> 
>>>> Here is more info on PVHVM:
>>>> https://wiki.xen.org/wiki/PV_on_HVM
>>>> 
>>>> ================
>>>> Also heard from someone to try this Config file change to the base
>>>> kernel and rebuild:
>>>> 
>>>> CONFIG_RANDOMIZE_BASE=n
>>> 
>>> This suggestion was mirrored in the RH bugzilla as well, and it
>>> worked.  However, the same issue does not exist in newer kernels
>>> that have the option on.  I've posted updated findings in the
>>> CentOS bug, including a patch that I found which seems to fix the
>>> issue:
>>> 
>>> https://bugs.centos.org/view.php?id=13763#c30014
>> 
>> With many thanks to hughesjr and toracat, I was able to find a patch
>> that seems to resolve this issue and get it into CentOS Plus
>> 3.10.0-693.2.1.  I've asked Red Hat to apply it to some future kernel
>> update, but that is only a dream for now.
>> 
>> In the meantime, if anyone who has been experiencing the issue with PV
>> domains can try out the CentOS Plus kernel here and provide feedback,
>> I'd appreciate it!
>> 
>> https://buildlogs.centos.org/c7-plus/kernel-plus/20170907163005/3.10.0-693.2.1.el7.centos.plus.x86_64/
> 
> Loaded 3.10.0-693.2.2.el7.centos.plus.x86_64 successfully on two
> CentOS 7.4 PV domUs which failed previously on
> kernel-3.10.0-693.2.2.el7.x86_64.  The 2 hypervisors tested are:
> 1. CentOS 6.9, kernel 4.9.13-22.el6.x86_64, Xen 4.6.3-8.el6
> 2. CentOS 7.3, kernel 4.9.31-27.el7.x86_64, Xen 4.9.31-27.el7.x86_64

Should read:
1. CentOS 6.9, kernel 4.9.13-22.el6.x86_64, Xen 4.6.3-8.el6
2. CentOS 7.3, kernel 4.9.31-27.el7.x86_64, Xen 4.6.3-15.el7
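
For anyone repeating the test, here is a minimal sketch for confirming
from inside the domU that it booted the CentOS Plus kernel rather than
the stock one. The function name is_plus_kernel is made up for
illustration. It only inspects the release string from uname -r and
assumes Plus builds carry "centos.plus" in the release, as in the
version strings above.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): succeeds when the given
# kernel release string looks like a CentOS Plus build.
# Assumption: Plus builds contain ".centos.plus." in the release,
# e.g. 3.10.0-693.2.2.el7.centos.plus.x86_64.
is_plus_kernel() {
    case "$1" in
        *.centos.plus.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report on the kernel this machine is currently running.
release=$(uname -r)
if is_plus_kernel "$release"; then
    echo "$release: CentOS Plus kernel"
else
    echo "$release: not a CentOS Plus kernel"
fi
```

This only tells you which kernel is running, not whether the guest is
PV or HVM.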

---
Adi Pircalabu, System Administrator