[CentOS-virt] Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.

Tue Apr 18 17:39:49 UTC 2017
PJ Welsh <pjwelsh at gmail.com>

Here is something interesting... I went through the BIOS options and found
that the R710 that *is* functioning differed from the one that is *not* only
in that "Logical Processor"/Hyperthreading was *enabled* on the working
system and *disabled* on the failing one. I enabled Logical Processor on the
failing system and it now starts without issue! I've rebooted 3 times now
without issue.
Dell R710 BIOS version 6.4.0
2x Intel(R) Xeon(R) CPU L5639  @ 2.13GHz
4.9.20-26.el7.x86_64 #1 SMP Tue Apr 4 11:19:26 CDT 2017 x86_64 x86_64
x86_64 GNU/Linux
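
To compare two hosts without a trip into the BIOS, something like the
following could report whether Hyperthreading is visible to the running
kernel. This is only a sketch: it assumes `lscpu` from util-linux is
installed and that the system is booted on the plain (non-Xen) kernel, since
a dom0 may see a different CPU topology. The function name
`threads_per_core` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read lscpu-style output on stdin and print the
# "Thread(s) per core" value. A value greater than 1 suggests that
# Hyperthreading / "Logical Processor" is enabled.
threads_per_core() {
    awk -F: '/^Thread\(s\) per core/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Example (not run here):
#   lscpu | threads_per_core
```

Reading from stdin keeps the helper easy to try against saved `lscpu`
output from another machine.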



On Tue, Apr 18, 2017 at 8:44 AM, PJ Welsh <pjwelsh at gmail.com> wrote:

> Apologies: I installed the newer -26 kernel and had not rebooted into it.
> The grub2 menu item should have been "CentOS Linux (4.9.20-25.el7.x86_64) 7
> (Core)". I am currently restarting that remote affected system (unmodified
> grub2 entry first).
> Thanks
> PJ
>
> On Tue, Apr 18, 2017 at 8:39 AM, PJ Welsh <pjwelsh at gmail.com> wrote:
>
>> Just to note, the same pattern happens on C7:
>> "CentOS Linux, with Xen hypervisor" = reboot
>> "CentOS Linux (4.9.20-26.el7.x86_64) 7 (Core)" = boot
>>
>> [root at XXX ~]# uname -a
>> Linux XXX 4.9.20-25.el7.x86_64 #1 SMP Fri Mar 31 08:53:28 CDT 2017 x86_64
>> x86_64 x86_64
>>
>> On Tue, Apr 18, 2017 at 8:36 AM, PJ Welsh <pjwelsh at gmail.com> wrote:
>>
>>> There was a note that the non-Xen kernel at the same kernel version did
>>> indeed boot:
>>> "CentOS-6 4.9.20-26 kernel exhibits the same constant
>>> kernel-start-then-reboot issue when booting under the "CentOS Linux, with
>>> Xen hypervisor" grub2 menu option. However, it *does* properly boot under
>>> the "CentOS Linux (4.9.20-25.el7.x86_64) 7 (Core)" grub2 menu option!"
>>>
>>> Trying to get back into being able to test this more.
>>>
>>> Thanks
>>> PJ
>>>
>>> On Tue, Apr 18, 2017 at 8:30 AM, Johnny Hughes <johnny at centos.org>
>>> wrote:
>>>
>>>> On 04/14/2017 03:26 PM, Anderson, Dave wrote:
>>>> > Sad to say that I already tested 4.9.20-26 from your repo
>>>> > yesterday...it does look a little cleaner before it dies, but still
>>>> > dies. I have not tested it with the vcpu=4 workaround, but I can
>>>> > tonight if you would like. Relevant bits below:
>>>> >
>>>> > Loading Xen 4.6.3-12.el7 ...
>>>> > Loading Linux 4.9.20-26.el7.x86_64 ...
>>>> > Loading initial ramdisk ...
>>>> > [    0.000000] Linux version 4.9.20-26.el7.x86_64 (mockbuild@) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Tue Apr 4 11:19:26 CDT 2017
>>>> >
>>>> > <snip>
>>>> >
>>>> > [    6.195089] smpboot: Max logical packages: 1
>>>> > [    6.199549] VPMU disabled by hypervisor.
>>>> > [    6.203663] Performance Events: SandyBridge events, PMU not available due to virtualization, using software events only.
>>>> > [    6.215436] NMI watchdog: disabled (cpu0): hardware events not enabled
>>>> > [    6.222139] NMI watchdog: Shutting down hard lockup detector on all cpus
>>>> > [    6.229165] installing Xen timer for CPU 1
>>>> > [    6.233849] installing Xen timer for CPU 2
>>>> > [    6.238504] installing Xen timer for CPU 3
>>>> > [    6.243139] installing Xen timer for CPU 4
>>>> > [    6.247836] installing Xen timer for CPU 5
>>>> > [    6.252478] installing Xen timer for CPU 6
>>>> > [    6.257155] installing Xen timer for CPU 7
>>>> > [    6.261795] installing Xen timer for CPU 8
>>>> > [    6.266358] smpboot: Package 1 of CPU 8 exceeds BIOS package data 1.
>>>> > [    6.272736] ------------[ cut here ]------------
>>>> > [    6.277358] kernel BUG at arch/x86/kernel/cpu/common.c:997!
>>>> > [    6.280104] random: fast init done
>>>> > [    6.286333] invalid opcode: 0000 [#1] SMP
>>>> > [    6.290343] Modules linked in:
>>>> > [    6.293430] CPU: 8 PID: 0 Comm: swapper/8 Not tainted 4.9.20-26.el7.x86_64 #1
>>>> > [    6.300568] Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.2a 08/04/2015
>>>> > [    6.307183] task: ffff880058a68000 task.stack: ffffc900400c0000
>>>> > [    6.313103] RIP: e030:[<ffffffff8103e7e7>]  [<ffffffff8103e7e7>] identify_secondary_cpu+0x57/0x80
>>>> > [    6.322019] RSP: e02b:ffffc900400c3f08  EFLAGS: 00010086
>>>> > [    6.327333] RAX: 00000000ffffffe4 RBX: ffff88005d80a020 RCX: ffffffff81e5ffc8
>>>> > [    6.334473] RDX: 0000000000000001 RSI: 0000000000000005 RDI: 0000000000000005
>>>> > [    6.341607] RBP: ffffc900400c3f18 R08: 00000000000000ce R09: 0000000000000000
>>>> > [    6.348738] R10: 0000000000000005 R11: 0000000000000006 R12: 0000000000000008
>>>> > [    6.355873] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>>>> > [    6.363006] FS:  0000000000000000(0000) GS:ffff88005d800000(0000) knlGS:0000000000000000
>>>> > [    6.371090] CS:  e033 DS: 002b ES: 002b CR0: 0000000080050033
>>>> > [    6.376837] CR2: 0000000000000000 CR3: 0000000001e07000 CR4: 0000000000042660
>>>> > [    6.383970] Stack:
>>>> > [    6.386004]  0000000000000008 0000000000000000 ffffc900400c3f28 ffffffff8104ebce
>>>> > [    6.393483]  ffffc900400c3f40 ffffffff81029855 0000000000000000 ffffc900400c3f50
>>>> > [    6.400963]  ffffffff810298d0 0000000000000000 0000000000000000 0000000000000000
>>>> > [    6.408450] Call Trace:
>>>> > [    6.410907]  [<ffffffff8104ebce>] smp_store_cpu_info+0x3e/0x40
>>>> > [    6.416753]  [<ffffffff81029855>] cpu_bringup+0x35/0x90
>>>> > [    6.421981]  [<ffffffff810298d0>] cpu_bringup_and_idle+0x20/0x40
>>>> > [    6.427987] Code: 44 89 e7 ff 50 68 0f b7 93 d2 00 00 00 39 d0 75 1c 0f b7 bb da 00 00 00 44 89 e6 e8 e4 02 01 00 85 c0 75 07 5b 41 5c 5d c3 0f 0b <0f> 0b 0f b7 8b d4 00 00 00 89 c2 44 89 e6 48 c7 c7 e8 ce ca 81
>>>> > [    6.448249] RIP  [<ffffffff8103e7e7>] identify_secondary_cpu+0x57/0x80
>>>> > [    6.454801]  RSP <ffffc900400c3f08>
>>>> > [    6.458305] ---[ end trace 2f9b62c5c7050204 ]---
>>>> >
>>>> >
>>>> > So basically, it removes the "[Firmware Bug]: CPU1: APIC id mismatch.
>>>> > Firmware: 0 APIC: 1" lines, but otherwise dies the same way. I included
>>>> > a few extra lines up from the panic because the "[    6.195089] smpboot:
>>>> > Max logical packages: 1" could possibly be relevant. I need to go look
>>>> > at a clean boot to see if that was in there on this machine.
>>>> >
>>>> >
>>>> > Even more strangely, in addition to the machine I'm talking about
>>>> > which panics and reboots, I had a second nearly identical machine
>>>> > (different CPU/RAM config, everything else the same) which booted but
>>>> > had some kind of hw conflict with 4.9.x that I never had before. It
>>>> > appears to be between Intel SCU and an Intel PCIe NVMe SSD (luckily I
>>>> > wasn't using SCU, so I disabled that). Had that other machine not
>>>> > booted I would have just assumed 4.9.x was totally broken and sat on
>>>> > 3.18...so I'm glad that one machine booted at least :)
>>>> >
>>>> > Thanks,
>>>> > -Dave
>>>>
>>>> Dave,
>>>>
>>>> Just for testing purposes, can you try booting the kernel in the normal
>>>> way on the machine that does not work (a normal grub entry on the kernel
>>>> with no xen.gz line)?
>>>>
>>>> That way, we can hopefully narrow the issue down to a hypervisor issue
>>>> or a kernel config issue.
>>>>
>>>> Thanks,
>>>> Johnny Hughes
>>>>
>>>> >
>>>> >
>>>> >> On Apr 14, 2017, at 05:39, Johnny Hughes <johnny at centos.org> wrote:
>>>> >>
>>>> >> Dave,
>>>> >>
>>>> >> Take a look at this kernel as it is the one I think we are going to
>>>> >> release (or a slightly newer 4.9.2x from kernel.org LTS). This version
>>>> >> has some newer settings that are closer to the Red Hat/Fedora/CentOS
>>>> >> base kernel WRT what is a module and what is built into the kernel, etc.
>>>> >>
>>>> >> https://people.centos.org/hughesjr/4.9.x/
>>>> >>
>>>> >> Thanks,
>>>> >> Johnny Hughes
>>>> >>
>>>> >> On 04/14/2017 05:16 AM, Anderson, Dave wrote:
>>>> >>> List moderator: feel free to delete my previous large message with
>>>> >>> attachments that's in the moderation queue...it's now obsolete anyway.
>>>> >>>
>>>> >>>
>>>> >>> I have found a fix/workaround for my reboot issues with Xen
>>>> >>> 4.6.3-12 + Kernel 4.9.13:
>>>> >>>
>>>> >>> Once I finally got serial output all the way through the boot
>>>> >>> process (xen+dom0) I discovered the stack trace:
>>>> >>>
>>>> >>> [Firmware Bug]: CPU7: APIC id mismatch. Firmware: 0 APIC: 7
>>>> >>> installing Xen timer for CPU 8
>>>> >>> [Firmware Bug]: CPU8: APIC id mismatch. Firmware: 0 APIC: 20
>>>> >>> smpboot: Package 1 of CPU 8 exceeds BIOS package data 1.
>>>> >>> ------------[ cut here ]------------
>>>> >>> kernel BUG at arch/x86/kernel/cpu/common.c:997!
>>>> >>> invalid opcode: 0000 [#1] SMP
>>>> >>> Modules linked in:
>>>> >>> CPU: 8 PID: 0 Comm: swapper/8 Not tainted 4.9.13-22.el7.x86_64 #1
>>>> >>> Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.2a 08/04/2015
>>>> >>> random: fast init done
>>>> >>> task: ffff880058a8c4c0 task.stack: ffffc900400b4000
>>>> >>> RIP: e030:[<ffffffff8103e527>]  [<ffffffff8103e527>] identify_secondary_cpu+0x57/0x80
>>>> >>> RSP: e02b:ffffc900400b7f08  EFLAGS: 00010086
>>>> >>> RAX: 00000000ffffffe4 RBX: ffff88005d80a020 RCX: ffffffff81c5be68
>>>> >>> RDX: 0000000000000001 RSI: 0000000000000005 RDI: 0000000000000005
>>>> >>> RBP: ffffc900400b7f18 R08: 00000000000000cb R09: 0000000000000004
>>>> >>> R10: 0000000000000000 R11: 0000000000000006 R12: 0000000000000008
>>>> >>> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>>>> >>> FS:  0000000000000000(0000) GS:ffff88005d800000(0000) knlGS:0000000000000000
>>>> >>> CS:  e033 DS: 002b ES: 002b CR0: 0000000080050033
>>>> >>> CR2: 0000000000000000 CR3: 0000000001c07000 CR4: 0000000000042660
>>>> >>> Stack:
>>>> >>> 0000000000000008 0000000000000000 ffffc900400b7f28 ffffffff8104e94e
>>>> >>> ffffc900400b7f40 ffffffff81029925 0000000000000000 ffffc900400b7f50
>>>> >>> ffffffff810299a0 0000000000000000 0000000000000000 0000000000000000
>>>> >>> Call Trace:
>>>> >>> [<ffffffff8104e94e>] smp_store_cpu_info+0x3e/0x40
>>>> >>> [<ffffffff81029925>] cpu_bringup+0x35/0x90
>>>> >>> [<ffffffff810299a0>] cpu_bringup_and_idle+0x20/0x40
>>>> >>> Code: 44 89 e7 ff 50 68 0f b7 93 d2 00 00 00 39 d0 75 1c 0f b7 bb da 00 00 00 44 89 e6 e8 24 03 01 00 85 c0 75 07 5b 41 5c 5d c3 0f 0b <0f> 0b 0f b7 8b d4 00 00 00 89 c2 44 89 e6 48 c7 c7 98 87 a6 81
>>>> >>> RIP  [<ffffffff8103e527>] identify_secondary_cpu+0x57/0x80
>>>> >>> RSP <ffffc900400b7f08>
>>>> >>> ---[ end trace dc5563100443876e ]---
>>>> >>>
>>>> >>> I surmised that reducing the number of dom0 vCPUs might solve this
>>>> >>> issue (they were unbounded).
>>>> >>>
>>>> >>> In testing, adding "dom0_max_vcpus=4 dom0_vcpus_pin" to the
>>>> >>> GRUB_CMDLINE_XEN_DEFAULT line in /etc/default/grub and re-running
>>>> >>> grub2-mkconfig has resulted in the system that never booted Xen
>>>> >>> 4.6.3-12 + Kernel 4.9.13 booting every single time out of 5-10 tests.
>>>> >>>
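
A minimal sketch of the edit described above, for anyone who wants to
script it. Assumptions, none of which come from Dave's message: the helper
name `add_xen_opts` is invented, the grub defaults file is passed as an
argument so the function can be tried on a copy first, GNU sed is
available, and the existing value is double-quoted on a single line.

```shell
#!/bin/sh
# Hypothetical helper: append Xen hypervisor options to the
# GRUB_CMDLINE_XEN_DEFAULT line of a grub defaults file.
# Usage: add_xen_opts FILE "opt1 opt2"
# Note: the options must not contain '/' or '&', which are special to sed.
add_xen_opts() {
    file=$1
    opts=$2
    if grep -q '^GRUB_CMDLINE_XEN_DEFAULT=' "$file"; then
        # Insert the new options just before the closing double quote.
        sed -i "s/^\(GRUB_CMDLINE_XEN_DEFAULT=\"[^\"]*\)\"/\1 $opts\"/" "$file"
    else
        # No such line yet: add one at the end of the file.
        printf 'GRUB_CMDLINE_XEN_DEFAULT="%s"\n' "$opts" >> "$file"
    fi
}

# Example (as root, not run here; the grub.cfg path assumes a BIOS-booted
# CentOS 7 system):
#   add_xen_opts /etc/default/grub "dom0_max_vcpus=4 dom0_vcpus_pin"
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```

The helper is not idempotent; running it twice appends the options twice,
so check the file before regenerating the grub configuration.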
>>>> >>>
>>>> >>> So...I don't know if there's a race condition somewhere, or
>>>> >>> what...but...so far this workaround has not failed me.
>>>> >>>
>>>> >>> Thanks,
>>>> >>> -Dave
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>> On Fri, Apr 7, 2017 at 6:58 AM, PJ Welsh <pjwelsh at gmail.com> wrote:
>>>> >>>>> I've not gotten any bites from my posting on the xen-devel
>>>> >>>>> mailing list. Here is the only one to-date:
>>>> >>>>> https://lists.xen.org/archives/html/xen-devel/2017-04/msg01069.html
>>>> >>>>>
>>>> >>>>> According to that email, some hypervisor messages are needed.
>>>> >>>>>
>>>> >>>>> Does anyone know how to produce the hypervisor messages? I've
>>>> >>>>> already removed the rhgb and quiet options from the boot.
>>>> >>>>
>>>> >>>>>
>>>> >>>>> Thanks
>>>> >>>>> PJ
>>>> >>>>
>>>> >>>>
>>>> >>>> I spoke too soon. To get more information: Please see
>>>> >>>>
>>>> >>>> https://wiki.xenproject.org/wiki/Reporting_Bugs_against_Xen_Project
>>>> >>>>
>>>> >>>> and
>>>> >>>>
>>>> >>>> https://wiki.xenproject.org/wiki/Xen_Serial_Console
>>>> >>>>
>>>> >>>> or alternatively at least add "vga=keep".
>>>> >>>>
>>>>