Re: [CentOS-virt] Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.


Anderson, Dave

14 Apr 2017

10:16 a.m.

List moderator: feel free to delete my previous large message with attachments that's in the moderation queue...it's now obsolete anyway.

I have found a fix/workaround for my reboot issues with Xen 4.6.3-12 + Kernel 4.9.13:

Once I finally got serial output all the way through the boot process (xen+dom0) I discovered the stack trace:

[Firmware Bug]: CPU7: APIC id mismatch. Firmware: 0 APIC: 7
installing Xen timer for CPU 8
[Firmware Bug]: CPU8: APIC id mismatch. Firmware: 0 APIC: 20
smpboot: Package 1 of CPU 8 exceeds BIOS package data 1.
------------[ cut here ]------------
kernel BUG at arch/x86/kernel/cpu/common.c:997!
invalid opcode: 0000 [#1] SMP
Modules linked in:
CPU: 8 PID: 0 Comm: swapper/8 Not tainted 4.9.13-22.el7.x86_64 #1
Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.2a 08/04/2015
random: fast init done
task: ffff880058a8c4c0 task.stack: ffffc900400b4000
RIP: e030:[<ffffffff8103e527>] [<ffffffff8103e527>] identify_secondary_cpu+0x57/0x80
RSP: e02b:ffffc900400b7f08 EFLAGS: 00010086
RAX: 00000000ffffffe4 RBX: ffff88005d80a020 RCX: ffffffff81c5be68
RDX: 0000000000000001 RSI: 0000000000000005 RDI: 0000000000000005
RBP: ffffc900400b7f18 R08: 00000000000000cb R09: 0000000000000004
R10: 0000000000000000 R11: 0000000000000006 R12: 0000000000000008
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88005d800000(0000) knlGS:0000000000000000
CS: e033 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000001c07000 CR4: 0000000000042660
Stack:
 0000000000000008 0000000000000000 ffffc900400b7f28 ffffffff8104e94e
 ffffc900400b7f40 ffffffff81029925 0000000000000000 ffffc900400b7f50
 ffffffff810299a0 0000000000000000 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff8104e94e>] smp_store_cpu_info+0x3e/0x40
 [<ffffffff81029925>] cpu_bringup+0x35/0x90
 [<ffffffff810299a0>] cpu_bringup_and_idle+0x20/0x40
Code: 44 89 e7 ff 50 68 0f b7 93 d2 00 00 00 39 d0 75 1c 0f b7 bb da 00 00 00 44 89 e6 e8 24 03 01 00 85 c0 75 07 5b 41 5c 5d c3 0f 0b <0f> 0b 0f b7 8b d4 00 00 00 89 c2 44 89 e6 48 c7 c7 98 87 a6 81
RIP [<ffffffff8103e527>] identify_secondary_cpu+0x57/0x80
 RSP <ffffc900400b7f08>
---[ end trace dc5563100443876e ]---

I surmised that reducing the number of dom0 vCPUs might solve this issue (they were previously unbounded).

In testing, adding "dom0_max_vcpus=4 dom0_vcpus_pin" to the GRUB_CMDLINE_XEN_DEFAULT line in /etc/default/grub and re-running grub2-mkconfig has made the system that previously never booted Xen 4.6.3-12 + Kernel 4.9.13 boot successfully every single time, across 5-10 tests.
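For anyone wanting to script the same change, here is a minimal Python sketch of the edit described above. It only rewrites the text of a GRUB defaults file; the function name `add_xen_args` is made up for illustration, and it assumes the GRUB_CMDLINE_XEN_DEFAULT value is a single double-quoted string on one line, which may not hold for every setup.

```python
import re

# Xen hypervisor options taken from the message above.
XEN_ARGS = ("dom0_max_vcpus=4", "dom0_vcpus_pin")

# Assumption: the value is one double-quoted string on a single line.
_LINE_RE = re.compile(r'^(GRUB_CMDLINE_XEN_DEFAULT=")([^"]*)(")[ \t]*$', re.M)


def add_xen_args(grub_text, args=XEN_ARGS):
    """Return grub_text with args appended to GRUB_CMDLINE_XEN_DEFAULT.

    Options whose name is already present are left alone, so running
    the function twice gives the same result as running it once.
    """
    match = _LINE_RE.search(grub_text)
    if match is None:
        # No existing line: append a fresh one at the end of the file.
        sep = "" if (not grub_text or grub_text.endswith("\n")) else "\n"
        return grub_text + sep + 'GRUB_CMDLINE_XEN_DEFAULT="%s"\n' % " ".join(args)

    current = match.group(2).split()
    present = {token.split("=", 1)[0] for token in current}
    merged = current + [a for a in args if a.split("=", 1)[0] not in present]
    new_line = match.group(1) + " ".join(merged) + match.group(3)
    return grub_text[:match.start()] + new_line + grub_text[match.end():]
```

To use it, read the defaults file (normally /etc/default/grub on CentOS 7), pass its contents through `add_xen_args`, write the result back, and then re-run grub2-mkconfig as described above. The output path for grub2-mkconfig differs between BIOS and UEFI installs, so it is deliberately not hard-coded here.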

So...I don't know if there's a race condition somewhere, or what...but...so far this workaround has not failed me.

Thanks, -Dave

On Fri, Apr 7, 2017 at 6:58 AM, PJ Welsh <pjwelsh at gmail.com> wrote:

I've not gotten any bites from my posting on the xen-devel mailing list. Here is the only one to date: https://lists.xen.org/archives/html/xen-devel/2017-04/msg01069.html

From that email, there needs to be some hypervisor messages.

Does anyone know how to produce the hypervisor messages? I've already removed the rhgb and quiet options from the boot.

Thanks, PJ

I spoke too soon. To get more information: Please see

https://wiki.xenproject.org/wiki/Reporting_Bugs_against_Xen_Project

and

https://wiki.xenproject.org/wiki/Xen_Serial_Console

or alternatively at least add "vga=keep".

pjwelsh


Johnny Hughes

14 Apr

12:39 p.m.

New subject: Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.

Dave,

Take a look at this kernel, as it is the one I think we are going to release (or a slightly newer 4.9.2x from kernel.org LTS). This version has some newer settings that bring it closer to the Red Hat/Fedora/CentOS base kernel with respect to what is a module and what is built into the kernel, etc.

https://people.centos.org/hughesjr/4.9.x/

Thanks, Johnny Hughes

CentOS-virt mailing list CentOS-virt@centos.org https://lists.centos.org/mailman/listinfo/centos-virt

PJ Welsh

2:33 p.m.

New subject: Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.

I am on holiday until Sunday, but will download the kernel now and test it when I get back into work. Thanks


Anderson, Dave

8:26 p.m.

New subject: Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.

Sad to say that I already tested 4.9.20-26 from your repo yesterday...it does look a little cleaner before it dies, but it still dies. I have not tested it with the vcpu=4 workaround, but I can tonight if you would like. Relevant bits below:

Loading Xen 4.6.3-12.el7 ...
Loading Linux 4.9.20-26.el7.x86_64 ...
Loading initial ramdisk ...
[ 0.000000] Linux version 4.9.20-26.el7.x86_64 (mockbuild@) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Tue Apr 4 11:19:26 CDT 2017

<snip>

[ 6.195089] smpboot: Max logical packages: 1
[ 6.199549] VPMU disabled by hypervisor.
[ 6.203663] Performance Events: SandyBridge events, PMU not available due to virtualization, using software events only.
[ 6.215436] NMI watchdog: disabled (cpu0): hardware events not enabled
[ 6.222139] NMI watchdog: Shutting down hard lockup detector on all cpus
[ 6.229165] installing Xen timer for CPU 1
[ 6.233849] installing Xen timer for CPU 2
[ 6.238504] installing Xen timer for CPU 3
[ 6.243139] installing Xen timer for CPU 4
[ 6.247836] installing Xen timer for CPU 5
[ 6.252478] installing Xen timer for CPU 6
[ 6.257155] installing Xen timer for CPU 7
[ 6.261795] installing Xen timer for CPU 8
[ 6.266358] smpboot: Package 1 of CPU 8 exceeds BIOS package data 1.
[ 6.272736] ------------[ cut here ]------------
[ 6.277358] kernel BUG at arch/x86/kernel/cpu/common.c:997!
[ 6.280104] random: fast init done
[ 6.286333] invalid opcode: 0000 [#1] SMP
[ 6.290343] Modules linked in:
[ 6.293430] CPU: 8 PID: 0 Comm: swapper/8 Not tainted 4.9.20-26.el7.x86_64 #1
[ 6.300568] Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.2a 08/04/2015
[ 6.307183] task: ffff880058a68000 task.stack: ffffc900400c0000
[ 6.313103] RIP: e030:[<ffffffff8103e7e7>] [<ffffffff8103e7e7>] identify_secondary_cpu+0x57/0x80
[ 6.322019] RSP: e02b:ffffc900400c3f08 EFLAGS: 00010086
[ 6.327333] RAX: 00000000ffffffe4 RBX: ffff88005d80a020 RCX: ffffffff81e5ffc8
[ 6.334473] RDX: 0000000000000001 RSI: 0000000000000005 RDI: 0000000000000005
[ 6.341607] RBP: ffffc900400c3f18 R08: 00000000000000ce R09: 0000000000000000
[ 6.348738] R10: 0000000000000005 R11: 0000000000000006 R12: 0000000000000008
[ 6.355873] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 6.363006] FS: 0000000000000000(0000) GS:ffff88005d800000(0000) knlGS:0000000000000000
[ 6.371090] CS: e033 DS: 002b ES: 002b CR0: 0000000080050033
[ 6.376837] CR2: 0000000000000000 CR3: 0000000001e07000 CR4: 0000000000042660
[ 6.383970] Stack:
[ 6.386004] 0000000000000008 0000000000000000 ffffc900400c3f28 ffffffff8104ebce
[ 6.393483] ffffc900400c3f40 ffffffff81029855 0000000000000000 ffffc900400c3f50
[ 6.400963] ffffffff810298d0 0000000000000000 0000000000000000 0000000000000000
[ 6.408450] Call Trace:
[ 6.410907] [<ffffffff8104ebce>] smp_store_cpu_info+0x3e/0x40
[ 6.416753] [<ffffffff81029855>] cpu_bringup+0x35/0x90
[ 6.421981] [<ffffffff810298d0>] cpu_bringup_and_idle+0x20/0x40
[ 6.427987] Code: 44 89 e7 ff 50 68 0f b7 93 d2 00 00 00 39 d0 75 1c 0f b7 bb da 00 00 00 44 89 e6 e8 e4 02 01 00 85 c0 75 07 5b 41 5c 5d c3 0f 0b <0f> 0b 0f b7 8b d4 00 00 00 89 c2 44 89 e6 48 c7 c7 e8 ce ca 81
[ 6.448249] RIP [<ffffffff8103e7e7>] identify_secondary_cpu+0x57/0x80
[ 6.454801] RSP <ffffc900400c3f08>
[ 6.458305] ---[ end trace 2f9b62c5c7050204 ]---

So basically, it removes the "[Firmware Bug]: CPU1: APIC id mismatch. Firmware: 0 APIC: 1" lines, but otherwise dies the same way. I included a few extra lines above the panic because the "[ 6.195089] smpboot: Max logical packages: 1" line could possibly be relevant. I need to go look at a clean boot to see if that line was there on this machine.
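When comparing several captured consoles, it can help to pull out just the package-related lines. The following is a rough Python sketch, not part of any existing tool: it scans serial-console text for the three message formats seen in the logs above, and the function and key names are invented for illustration.

```python
import re

# Message formats copied from the console output quoted in this thread.
MAX_PKG_RE = re.compile(r"smpboot: Max logical packages: (\d+)")
EXCEEDS_RE = re.compile(
    r"smpboot: Package (\d+) of CPU (\d+) exceeds BIOS package data (\d+)"
)
BUG_RE = re.compile(r"kernel BUG at (\S+):(\d+)!")


def summarize_boot_log(text):
    """Collect the package-count and BUG lines from a console log."""
    summary = {"max_logical_packages": None, "exceeds": [], "bugs": []}

    match = MAX_PKG_RE.search(text)
    if match:
        summary["max_logical_packages"] = int(match.group(1))

    for package, cpu, limit in EXCEEDS_RE.findall(text):
        summary["exceeds"].append(
            {"package": int(package), "cpu": int(cpu), "limit": int(limit)}
        )

    # Each entry is (source file, line number) of a BUG() that fired.
    summary["bugs"] = [(path, int(line)) for path, line in BUG_RE.findall(text)]
    return summary
```

Running this over the failing boot and over a clean boot of the same machine would show at a glance whether "Max logical packages" differs between the two, which is the open question here.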

Even more strangely, in addition to the machine I'm talking about, which panics and reboots, I have a second nearly identical machine (different CPU/RAM config, everything else the same) which booted but had some kind of hardware conflict with 4.9.x that I never had before. It appears to be between the Intel SCU and an Intel PCIe NVMe SSD (luckily I wasn't using the SCU, so I disabled it). Had that other machine not booted, I would have just assumed 4.9.x was totally broken and stayed on 3.18...so I'm glad that one machine booted at least :)

Thanks, -Dave


Anderson, Dave

8:39 p.m.

New subject: Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.

So, strangely, I have two _identical_ dual-processor Xeon motherboards (same BIOS/IPMI versions; they even share an enclosure, one on the right side and the other on the left), each with a different CPU/memory configuration:

Using 4.9.13 with vcpu limited to 4, early in the boot process, the one that _was_ booting before setting the xen vcpu args says: "[ 7.060720] smpboot: Max logical packages: 2",

and the other one says "[ 6.195089] smpboot: Max logical packages: 1"

They both have dual procs, known working/good.

The first (the one that worked unmodified) has dual 8-core CPUs (16 hyperthreads each) and correctly detects "[ 0.000000] smpboot: Allowing 32 CPUs, 0 hotplug CPUs". It's a Xeon E5-2665v1.

The second machine (the one that didn't work without the Xen vcpu args) has dual 4-core CPUs (8 hyperthreads each) and also correctly detects "[ 0.000000] smpboot: Allowing 16 CPUs, 0 hotplug CPUs". It's a Xeon E5-2643v1...so it seems like this one does OK until it decides there's only one CPU package?
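The comparison between the two machines can be written down as a small consistency check. This Python sketch is only an illustration of the arithmetic: the `Host` class and its field names are made up, and the numbers are the ones reported in this message (two threads per core is inferred from the stated hyperthread counts).

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    sockets: int
    cores_per_socket: int
    threads_per_core: int
    allowed_cpus: int          # from "smpboot: Allowing N CPUs"
    max_logical_packages: int  # from "smpboot: Max logical packages: N"

    def expected_cpus(self):
        """Logical CPUs implied by the physical configuration."""
        return self.sockets * self.cores_per_socket * self.threads_per_core

    def problems(self):
        """List every mismatch between hardware and kernel messages."""
        found = []
        if self.allowed_cpus != self.expected_cpus():
            found.append(
                "CPUs: expected %d, kernel allowed %d"
                % (self.expected_cpus(), self.allowed_cpus)
            )
        if self.max_logical_packages != self.sockets:
            found.append(
                "packages: %d sockets installed, kernel reported %d"
                % (self.sockets, self.max_logical_packages)
            )
        return found


hosts = [
    Host("dual E5-2665v1", 2, 8, 2, allowed_cpus=32, max_logical_packages=2),
    Host("dual E5-2643v1", 2, 4, 2, allowed_cpus=16, max_logical_packages=1),
]

for host in hosts:
    print(host.name, host.problems() or "consistent")
```

On these figures the CPU counts agree with the hardware on both machines, and only the package count on the E5-2643v1 machine disagrees with the number of installed sockets. That matches the observation above but says nothing about why the kernel arrives at that value.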

Thanks, -Dave


Anderson, Dave

9:57 p.m.

New subject: Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.

I also just realized the C6 portion of the title/subject line here refers to CentOS 6, so I'd like to clarify that all my testing/issues/etc was under CentOS 7.3 with all patches applied.

Thanks, -Dave

...

On Apr 14, 2017, at 1:39 PM, Anderson, Dave daveanderson@wsu.edu wrote:

So, strangely,

I have two _identical_ dualproc xeon mobos (same bios/ipmi versions, they even share an enclosure, one is right side, other is left), each with different cpu/memory:

Using 4.9.13 with vcpu limited to 4, early in the boot process, the one that _was_ booting before setting the xen vcpu args says: "[ 7.060720] smpboot: Max logical packages: 2",

and the other one says "[ 6.195089] smpboot: Max logical packages: 1"

They both have dual procs, known working/good.

The first (the one that worked unmodified) has dual 8 core (16 HT/ea) and correctly detects "[ 0.000000] smpboot: Allowing 32 CPUs, 0 hotplug CPUs". It's a Xeon E5-2665v1.

The second machine (didn't work without the xen vcpu args) has dual 4 core (8ht/ea) and also correctly detects "[ 0.000000] smpboot: Allowing 16 CPUs, 0 hotplug CPUs". It's a Xeon E5-2643v1...so it seems like this one does ok until it decides there's only one cpu package?

Thanks, -Dave

...
On Apr 14, 2017, at 13:26, Anderson, Dave daveanderson@wsu.edu wrote:

Sad to say that I already tested 4.9.20-26 from your repo yesterday...it does look a little cleaner before it dies, but still dies. I have not tested it with the vcpu=4 wokaround, but I can tonight if you would like. Relevant bits below:

Loading Xen 4.6.3-12.el7 ... Loading Linux 4.9.20-26.el7.x86_64 ... Loading initial ramdisk ... [ 0.000000] Linux version 4.9.20-26.el7.x86_64 (mockbuild@) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Tue Apr 4 11:19:26 CDT 2017

<snip>

[ 6.195089] smpboot: Max logical packages: 1
[ 6.199549] VPMU disabled by hypervisor.
[ 6.203663] Performance Events: SandyBridge events, PMU not available due to virtualization, using software events only.
[ 6.215436] NMI watchdog: disabled (cpu0): hardware events not enabled
[ 6.222139] NMI watchdog: Shutting down hard lockup detector on all cpus
[ 6.229165] installing Xen timer for CPU 1
[ 6.233849] installing Xen timer for CPU 2
[ 6.238504] installing Xen timer for CPU 3
[ 6.243139] installing Xen timer for CPU 4
[ 6.247836] installing Xen timer for CPU 5
[ 6.252478] installing Xen timer for CPU 6
[ 6.257155] installing Xen timer for CPU 7
[ 6.261795] installing Xen timer for CPU 8
[ 6.266358] smpboot: Package 1 of CPU 8 exceeds BIOS package data 1.
[ 6.272736] ------------[ cut here ]------------
[ 6.277358] kernel BUG at arch/x86/kernel/cpu/common.c:997!
[ 6.280104] random: fast init done
[ 6.286333] invalid opcode: 0000 [#1] SMP
[ 6.290343] Modules linked in:
[ 6.293430] CPU: 8 PID: 0 Comm: swapper/8 Not tainted 4.9.20-26.el7.x86_64 #1
[ 6.300568] Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.2a 08/04/2015
[ 6.307183] task: ffff880058a68000 task.stack: ffffc900400c0000
[ 6.313103] RIP: e030:[<ffffffff8103e7e7>] [<ffffffff8103e7e7>] identify_secondary_cpu+0x57/0x80
[ 6.322019] RSP: e02b:ffffc900400c3f08 EFLAGS: 00010086
[ 6.327333] RAX: 00000000ffffffe4 RBX: ffff88005d80a020 RCX: ffffffff81e5ffc8
[ 6.334473] RDX: 0000000000000001 RSI: 0000000000000005 RDI: 0000000000000005
[ 6.341607] RBP: ffffc900400c3f18 R08: 00000000000000ce R09: 0000000000000000
[ 6.348738] R10: 0000000000000005 R11: 0000000000000006 R12: 0000000000000008
[ 6.355873] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 6.363006] FS: 0000000000000000(0000) GS:ffff88005d800000(0000) knlGS:0000000000000000
[ 6.371090] CS: e033 DS: 002b ES: 002b CR0: 0000000080050033
[ 6.376837] CR2: 0000000000000000 CR3: 0000000001e07000 CR4: 0000000000042660
[ 6.383970] Stack:
[ 6.386004] 0000000000000008 0000000000000000 ffffc900400c3f28 ffffffff8104ebce
[ 6.393483] ffffc900400c3f40 ffffffff81029855 0000000000000000 ffffc900400c3f50
[ 6.400963] ffffffff810298d0 0000000000000000 0000000000000000 0000000000000000
[ 6.408450] Call Trace:
[ 6.410907] [<ffffffff8104ebce>] smp_store_cpu_info+0x3e/0x40
[ 6.416753] [<ffffffff81029855>] cpu_bringup+0x35/0x90
[ 6.421981] [<ffffffff810298d0>] cpu_bringup_and_idle+0x20/0x40
[ 6.427987] Code: 44 89 e7 ff 50 68 0f b7 93 d2 00 00 00 39 d0 75 1c 0f b7 bb da 00 00 00 44 89 e6 e8 e4 02 01 00 85 c0 75 07 5b 41 5c 5d c3 0f 0b <0f> 0b 0f b7 8b d4 00 00 00 89 c2 44 89 e6 48 c7 c7 e8 ce ca 81
[ 6.448249] RIP [<ffffffff8103e7e7>] identify_secondary_cpu+0x57/0x80
[ 6.454801] RSP <ffffc900400c3f08>
[ 6.458305] ---[ end trace 2f9b62c5c7050204 ]---

So basically, it removes the "[Firmware Bug]: CPU1: APIC id mismatch. Firmware: 0 APIC: 1" lines, but otherwise dies the same way. I included a few extra lines above the panic because the "[ 6.195089] smpboot: Max logical packages: 1" line could be relevant; I need to look at a clean boot to see whether it also appears there on this machine.
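For comparing a crashing boot against a clean one, something like the following could pull out just the topology-related messages from a captured serial log. This is only a rough sketch: the function name `summarize_boot_log` is made up for illustration, and the patterns are simply copied from the messages quoted in this thread, so other kernel versions may word them differently.

```python
import re

# Patterns copied from the messages quoted in this thread; everything else
# in the log is ignored.
MAX_PKG = re.compile(r"smpboot: Max logical packages: (\d+)")
EXCEEDS = re.compile(
    r"smpboot: Package (\d+) of CPU (\d+) exceeds BIOS package data (\d+)"
)
APIC_MISMATCH = re.compile(r"\[Firmware Bug\]: CPU(\d+): APIC id mismatch")


def summarize_boot_log(text):
    """Collect the topology-related messages from one captured boot log."""
    max_pkg = MAX_PKG.search(text)
    return {
        # None when the "Max logical packages" line is absent from the log
        "max_logical_packages": int(max_pkg.group(1)) if max_pkg else None,
        # (package, cpu, bios_package_data) for each "exceeds" message
        "exceeds": [tuple(int(g) for g in m.groups())
                    for m in EXCEEDS.finditer(text)],
        # CPU numbers that reported an APIC id mismatch
        "apic_mismatch_cpus": [int(m.group(1))
                               for m in APIC_MISMATCH.finditer(text)],
        # True when the BUG in arch/x86/kernel/cpu/common.c was hit
        "hit_bug": "kernel BUG at arch/x86/kernel/cpu/common.c" in text,
    }
```

Running it over the text of a good boot and a bad boot and comparing the two dictionaries should show quickly whether "Max logical packages: 1" is present in both.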

Even more strangely, in addition to the machine I'm talking about, which panics and reboots, I had a second, nearly identical machine (different CPU/RAM config, everything else the same) which booted but had some kind of hardware conflict with 4.9.x that I never had before. It appears to be between the Intel SCU and an Intel PCIe NVMe SSD (luckily I wasn't using the SCU, so I disabled it). Had that other machine not booted, I would have just assumed 4.9.x was totally broken and sat on 3.18...so I'm glad that one machine booted at least :)

Thanks, -Dave

On Apr 14, 2017, at 05:39, Johnny Hughes johnny@centos.org wrote:

Dave,

Take a look at this kernel, as it is the one I think we are going to release (or a slightly newer 4.9.2x from kernel.org LTS). This version has some newer settings that are closer to the Red Hat/Fedora/CentOS base kernel with respect to what is a module and what is built into the kernel, etc.

https://people.centos.org/hughesjr/4.9.x/

Thanks, Johnny Hughes

On 04/14/2017 05:16 AM, Anderson, Dave wrote:

List moderator: feel free to delete my previous large message with attachments that's in the moderation queue...it's now obsolete anyway.

I have found a fix/workaround for my reboot issues with Xen 4.6.3-12 + Kernel 4.9.13:

Once I finally got serial output all the way through the boot process (xen+dom0) I discovered the stack trace:

[Firmware Bug]: CPU7: APIC id mismatch. Firmware: 0 APIC: 7
installing Xen timer for CPU 8
[Firmware Bug]: CPU8: APIC id mismatch. Firmware: 0 APIC: 20
smpboot: Package 1 of CPU 8 exceeds BIOS package data 1.
------------[ cut here ]------------
kernel BUG at arch/x86/kernel/cpu/common.c:997!
invalid opcode: 0000 [#1] SMP
Modules linked in:
CPU: 8 PID: 0 Comm: swapper/8 Not tainted 4.9.13-22.el7.x86_64 #1
Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.2a 08/04/2015
random: fast init done
task: ffff880058a8c4c0 task.stack: ffffc900400b4000
RIP: e030:[<ffffffff8103e527>] [<ffffffff8103e527>] identify_secondary_cpu+0x57/0x80
RSP: e02b:ffffc900400b7f08 EFLAGS: 00010086
RAX: 00000000ffffffe4 RBX: ffff88005d80a020 RCX: ffffffff81c5be68
RDX: 0000000000000001 RSI: 0000000000000005 RDI: 0000000000000005
RBP: ffffc900400b7f18 R08: 00000000000000cb R09: 0000000000000004
R10: 0000000000000000 R11: 0000000000000006 R12: 0000000000000008
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88005d800000(0000) knlGS:0000000000000000
CS: e033 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000001c07000 CR4: 0000000000042660
Stack:
0000000000000008 0000000000000000 ffffc900400b7f28 ffffffff8104e94e
ffffc900400b7f40 ffffffff81029925 0000000000000000 ffffc900400b7f50
ffffffff810299a0 0000000000000000 0000000000000000 0000000000000000
Call Trace:
[<ffffffff8104e94e>] smp_store_cpu_info+0x3e/0x40
[<ffffffff81029925>] cpu_bringup+0x35/0x90
[<ffffffff810299a0>] cpu_bringup_and_idle+0x20/0x40
Code: 44 89 e7 ff 50 68 0f b7 93 d2 00 00 00 39 d0 75 1c 0f b7 bb da 00 00 00 44 89 e6 e8 24 03 01 00 85 c0 75 07 5b 41 5c 5d c3 0f 0b <0f> 0b 0f b7 8b d4 00 00 00 89 c2 44 89 e6 48 c7 c7 98 87 a6 81
RIP [<ffffffff8103e527>] identify_secondary_cpu+0x57/0x80
RSP <ffffc900400b7f08>
---[ end trace dc5563100443876e ]---

I surmised that reducing the number of dom0 vCPUs might solve this issue (they were unbounded).

In testing, adding "dom0_max_vcpus=4 dom0_vcpus_pin" to the GRUB_CMDLINE_XEN_DEFAULT line in /etc/default/grub and re-running grub2-mkconfig has resulted in the system that never booted Xen 4.6.3-12 + Kernel 4.9.13 booting every single time across 5-10 tests.

So...I don't know if there's a race condition somewhere, or what...but...so far this workaround has not failed me.

Thanks, -Dave
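For anyone who wants to script the workaround, here is a minimal sketch of the edit to the GRUB defaults file. It assumes the GRUB_CMDLINE_XEN_DEFAULT value is written in double quotes on a single line; the helper name `add_xen_options` is hypothetical and not part of any CentOS or Xen tooling.

```python
import re


def add_xen_options(grub_defaults, options):
    """Return the defaults text with each missing option appended to
    GRUB_CMDLINE_XEN_DEFAULT; the line is added if it is absent."""
    pattern = re.compile(r'^GRUB_CMDLINE_XEN_DEFAULT="(.*)"[ \t]*$',
                         re.MULTILINE)
    match = pattern.search(grub_defaults)
    if match is None:
        line = 'GRUB_CMDLINE_XEN_DEFAULT="%s"\n' % " ".join(options)
        needs_newline = grub_defaults and not grub_defaults.endswith("\n")
        return grub_defaults + ("\n" if needs_newline else "") + line
    current = match.group(1).split()
    # Compare on the option name so an existing dom0_max_vcpus=8 is kept
    # as-is rather than duplicated with a second value.
    names = {opt.split("=", 1)[0] for opt in current}
    merged = current + [o for o in options
                        if o.split("=", 1)[0] not in names]
    new_line = 'GRUB_CMDLINE_XEN_DEFAULT="%s"' % " ".join(merged)
    return grub_defaults[:match.start()] + new_line + grub_defaults[match.end():]
```

The result would be written back to the defaults file and grub2-mkconfig re-run, as described in the message. Where the generated grub.cfg lives depends on the install (BIOS vs. UEFI), so that path is left out here on purpose.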

On Fri, Apr 7, 2017 at 6:58 AM, PJ Welsh <pjwelsh at gmail.com> wrote:

I've not gotten any bites from my posting on the xen-devel mailing list. Here is the only one to-date: https://lists.xen.org/archives/html/xen-devel/2017-04/msg01069.html

From that email, there needs to be some hypervisor messages.

Does anyone know how to produce the hypervisor messages? I've already removed the rhgb and quiet options from the boot.

Thanks PJ

I spoke too soon. To get more information: Please see

https://wiki.xenproject.org/wiki/Reporting_Bugs_against_Xen_Project

and

https://wiki.xenproject.org/wiki/Xen_Serial_Console

or alternatively at least add "vga=keep".

pjwelsh
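As a rough aid for the "how do I get hypervisor messages" question, the sketch below checks a command line for a set of logging-related options. The option names (loglvl, guest_loglvl, com1, console, noreboot for Xen; console=hvc0 and earlyprintk=xen for dom0) are documented boot options, but the particular port and baud rate are assumptions about the hardware, and the helper `missing_options` is purely illustrative.

```python
# Illustrative set only: these are documented Xen boot options, but the
# serial port (com1) and 115200,8n1 settings are assumptions about the box.
WANTED_XEN = ["loglvl=all", "guest_loglvl=all", "com1=115200,8n1",
              "console=com1,vga", "noreboot"]

# Options for the dom0 kernel line so its messages reach the Xen console.
WANTED_DOM0 = ["console=hvc0", "earlyprintk=xen"]


def missing_options(cmdline, wanted):
    """Return the entries of wanted that do not appear verbatim in cmdline."""
    present = set(cmdline.split())
    return [opt for opt in wanted if opt not in present]
```

For example, `missing_options(xen_cmdline, WANTED_XEN)` lists what would still need adding to the Xen line. With noreboot set, the hypervisor should stay on the crash output instead of restarting straight away, which helps when the only symptom is a reboot loop.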

CentOS-virt mailing list CentOS-virt@centos.org https://lists.centos.org/mailman/listinfo/centos-virt


Johnny Hughes

18 Apr 2017

1:30 p.m.

New subject: Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.

On 04/14/2017 03:26 PM, Anderson, Dave wrote:


Sad to say that I already tested 4.9.20-26 from your repo yesterday...it does look a little cleaner before it dies, but still dies. I have not tested it with the vcpu=4 workaround, but I can tonight if you would like. Relevant bits below:

Loading Xen 4.6.3-12.el7 ...
Loading Linux 4.9.20-26.el7.x86_64 ...
Loading initial ramdisk ...
[ 0.000000] Linux version 4.9.20-26.el7.x86_64 (mockbuild@) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Tue Apr 4 11:19:26 CDT 2017

<snip>


Dave,

Just for testing purposes, can you try booting the kernel in the normal way on the machine that does not work (a normal grub entry for the kernel with no xen.gz line)?

That way, we can hopefully narrow the issue down to a hypervisor issue or a kernel config issue.

Thanks, Johnny Hughes
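When switching between the Xen and plain grub entries, it helps to record what actually booted. A minimal sketch, assuming the kernel exposes the hypervisor type under /sys/hypervisor/type (typically present when running as a Xen dom0); the function name `boot_environment` is made up for illustration.

```python
import platform


def boot_environment(hypervisor_type_path="/sys/hypervisor/type"):
    """Report the running kernel release and the hypervisor type, if any."""
    try:
        with open(hypervisor_type_path) as fh:
            hypervisor = fh.read().strip() or None
    except OSError:
        # The file is usually missing on a plain bare-metal boot.
        hypervisor = None
    return {"kernel": platform.release(), "hypervisor": hypervisor}
```

Printing `boot_environment()` after each test boot gives the same kernel release that `uname -r` reports, together with "xen" or None, so results from the two menu entries are harder to mix up.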


PJ Welsh

1:36 p.m.

New subject: Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.

There was a note that the non-Xen kernel at the same kernel version did indeed boot: "CentOS-6 4.9.20-26 kernel exhibits the same constant kernel-start-then-reboot issue when booting under the "CentOS Linux, with Xen hypervisor" grub2 menu option. However, it *does* properly boot under the "CentOS Linux (4.9.20-25.el7.x86_64) 7 (Core)" grub2 menu option!"

Trying to get back into being able to test this more.

Thanks PJ


PJ Welsh

1:39 p.m.

New subject: Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.

Just to note, the same pattern happens on C7:

"CentOS Linux, with Xen hypervisor" = reboot
"CentOS Linux (4.9.20-26.el7.x86_64) 7 (Core)" = boot

[root@XXX ~]# uname -a
Linux XXX 4.9.20-25.el7.x86_64 #1 SMP Fri Mar 31 08:53:28 CDT 2017 x86_64 x86_64 x86_64

On Tue, Apr 18, 2017 at 8:36 AM, PJ Welsh pjwelsh@gmail.com wrote:

...

There was a note that the non-Xen kernel at the same kernel version did indeed boot: "CentOS-6 4.9.20-26 kernel exhibits the same constant kernel-start-then-reboot issue when booting under the "CentOS Linux, with Xen hypervisor" grub2 menu option. However, it *does* properly boot under the "CentOS Linux (4.9.20-25.el7.x86_64) 7 (Core)" grub2 menu option!"

Trying to get back into being able to test this more.

Thanks PJ

On Tue, Apr 18, 2017 at 8:30 AM, Johnny Hughes johnny@centos.org wrote:

...
On 04/14/2017 03:26 PM, Anderson, Dave wrote:

...
Sad to say that I already tested 4.9.20-26 from your repo

yesterday...it does look a little cleaner before it dies, but still dies. I have not tested it with the vcpu=4 wokaround, but I can tonight if you would like. Relevant bits below:

...
Loading Xen 4.6.3-12.el7 ... Loading Linux 4.9.20-26.el7.x86_64 ... Loading initial ramdisk ... [ 0.000000] Linux version 4.9.20-26.el7.x86_64 (mockbuild@) (gcc

version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Tue Apr 4 11:19:26 CDT 2017

...
<snip>

[ 6.195089] smpboot: Max logical packages: 1 [ 6.199549] VPMU disabled by hypervisor. [ 6.203663] Performance Events: SandyBridge events, PMU not

available due to virtualization, using software events only.

...
[ 6.215436] NMI watchdog: disabled (cpu0): hardware events not

enabled

...
[ 6.222139] NMI watchdog: Shutting down hard lockup detector on all

cpus

...
[ 6.229165] installing Xen timer for CPU 1 [ 6.233849] installing Xen timer for CPU 2 [ 6.238504] installing Xen timer for CPU 3 [ 6.243139] installing Xen timer for CPU 4 [ 6.247836] installing Xen timer for CPU 5 [ 6.252478] installing Xen timer for CPU 6 [ 6.257155] installing Xen timer for CPU 7 [ 6.261795] installing Xen timer for CPU 8 [ 6.266358] smpboot: Package 1 of CPU 8 exceeds BIOS package data 1. [ 6.272736] ------------[ cut here ]------------ [ 6.277358] kernel BUG at arch/x86/kernel/cpu/common.c:997! [ 6.280104] random: fast init done [ 6.286333] invalid opcode: 0000 [#1] SMP [ 6.290343] Modules linked in: [ 6.293430] CPU: 8 PID: 0 Comm: swapper/8 Not tainted

4.9.20-26.el7.x86_64 #1

...
[ 6.300568] Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.2a

08/04/2015

...
[ 6.307183] task: ffff880058a68000 task.stack: ffffc900400c0000 [ 6.313103] RIP: e030:[<ffffffff8103e7e7>] [<ffffffff8103e7e7>]

identify_secondary_cpu+0x57/0x80

...
[ 6.322019] RSP: e02b:ffffc900400c3f08 EFLAGS: 00010086 [ 6.327333] RAX: 00000000ffffffe4 RBX: ffff88005d80a020 RCX:

ffffffff81e5ffc8

...
[ 6.334473] RDX: 0000000000000001 RSI: 0000000000000005 RDI:

0000000000000005

...
[ 6.341607] RBP: ffffc900400c3f18 R08: 00000000000000ce R09:

0000000000000000

...
[ 6.348738] R10: 0000000000000005 R11: 0000000000000006 R12:

0000000000000008

...
[ 6.355873] R13: 0000000000000000 R14: 0000000000000000 R15:

0000000000000000

...
[ 6.363006] FS: 0000000000000000(0000) GS:ffff88005d800000(0000)

knlGS:0000000000000000

...
[ 6.371090] CS: e033 DS: 002b ES: 002b CR0: 0000000080050033 [ 6.376837] CR2: 0000000000000000 CR3: 0000000001e07000 CR4:

0000000000042660

...
[ 6.383970] Stack: [ 6.386004] 0000000000000008 0000000000000000 ffffc900400c3f28

ffffffff8104ebce

...
[ 6.393483] ffffc900400c3f40 ffffffff81029855 0000000000000000

ffffc900400c3f50

...
[ 6.400963] ffffffff810298d0 0000000000000000 0000000000000000

0000000000000000

...
[ 6.408450] Call Trace: [ 6.410907] [<ffffffff8104ebce>] smp_store_cpu_info+0x3e/0x40 [ 6.416753] [<ffffffff81029855>] cpu_bringup+0x35/0x90 [ 6.421981] [<ffffffff810298d0>] cpu_bringup_and_idle+0x20/0x40 [ 6.427987] Code: 44 89 e7 ff 50 68 0f b7 93 d2 00 00 00 39 d0 75 1c

0f b7 bb da 00 00 00 44 89 e6 e8 e4 02 01 00 85 c0 75 07 5b 41 5c 5d c3 0f 0b <0f> 0b 0f b7 8b d4 00 00 00 89 c2 44 89 e6 48 c7 c7 e8 ce ca 81

...
[ 6.448249] RIP [<ffffffff8103e7e7>] identify_secondary_cpu+0x57/0x

80

...
[ 6.454801] RSP <ffffc900400c3f08> [ 6.458305] ---[ end trace 2f9b62c5c7050204 ]---

So basically, it removes the "[Firmware Bug]: CPU1: APIC id mismatch.

Firmware: 0 APIC: 1" lines, but otherwise dies the same way. I included a few extra lines up from the panic because the "[ 6.195089] smpboot: Max logical packages: 1" could possibly be relevant, I need to go look at a clean boot to see if that was in there on this machine.

...
Even more strangely, in addition to the machine I'm talking about which

panics and reboots, I had a second nearly identical machine (different CPU/ram config, everything else the same) which booted but had some kind of hw conflict with 4.9.x that I never had before. It appears to be between Intel SCU and an intel PCIe NVMe SSD (luckily I wasn't using SCU, so I disabled that). Had that other machine not booted I would have just assumed 4.9.X was totally broken and sat on 3.18...so I'm glad that one machine booted at least :)

...
Thanks, -Dave

Dave,

Just for testing purposes, can you try booting the kernel in the normal way on the machine does does not work (a normal grub entry on the kernel with no xen.gz line)

That way, we can hopefully narrow the issue down to a hypervisor issue or a kernel config issue.

Thanks, Johnny Hughes

...
...
On Apr 14, 2017, at 05:39, Johnny Hughes johnny@centos.org wrote:

Dave,

Take a look at this kernel as it is the one I think we are going to release (or a slightly newer 4.9.2x from kernel.org LTS). This version has some newer settings that are more redhat/fedora/centos base kernel like WRT what is a module and what is built into the kernel, etc.

https://people.centos.org/hughesjr/4.9.x/

Thanks, Johnny Hughes

On 04/14/2017 05:16 AM, Anderson, Dave wrote:

...
List moderator: feel free to delete my previous large message with

attachments that's in the moderation queue...it's now obsolete anyway.

...
...
...
I have found a fix/workaround for my reboot issues with Xen 4.6.3-12

Kernel 4.9.13:

...
...
...
Once I finally got serial output all the way through the boot process

(xen+dom0) I discovered the stack trace:

...
...
...
[Firmware Bug]: CPU7: APIC id mismatch. Firmware: 0 APIC: 7 installing Xen timer for CPU 8 [Firmware Bug]: CPU8: APIC id mismatch. Firmware: 0 APIC: 20 smpboot: Package 1 of CPU 8 exceeds BIOS package data 1. ------------[ cut here ]------------ kernel BUG at arch/x86/kernel/cpu/common.c:997! invalid opcode: 0000 [#1] SMP Modules linked in: CPU: 8 PID: 0 Comm: swapper/8 Not tainted 4.9.13-22.el7.x86_64 #1 Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.2a 08/04/2015 random: fast init done task: ffff880058a8c4c0 task.stack: ffffc900400b4000 RIP: e030:[<ffffffff8103e527>] [<ffffffff8103e527>]

identify_secondary_cpu+0x57/0x80

...
...
...
RSP: e02b:ffffc900400b7f08 EFLAGS: 00010086 RAX: 00000000ffffffe4 RBX: ffff88005d80a020 RCX: ffffffff81c5be68 RDX: 0000000000000001 RSI: 0000000000000005 RDI: 0000000000000005 RBP: ffffc900400b7f18 R08: 00000000000000cb R09: 0000000000000004 R10: 0000000000000000 R11: 0000000000000006 R12: 0000000000000008 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88005d800000(0000)

knlGS:0000000000000000

...
...
...
CS: e033 DS: 002b ES: 002b CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000001c07000 CR4: 0000000000042660 Stack: 0000000000000008 0000000000000000 ffffc900400b7f28 ffffffff8104e94e ffffc900400b7f40 ffffffff81029925 0000000000000000 ffffc900400b7f50 ffffffff810299a0 0000000000000000 0000000000000000 0000000000000000 Call Trace: [<ffffffff8104e94e>] smp_store_cpu_info+0x3e/0x40 [<ffffffff81029925>] cpu_bringup+0x35/0x90 [<ffffffff810299a0>] cpu_bringup_and_idle+0x20/0x40 Code: 44 89 e7 ff 50 68 0f b7 93 d2 00 00 00 39 d0 75 1c 0f b7 bb da

00 00 00 44 89 e6 e8 24 03 01 00 85 c0 75 07 5b 41 5c 5d c3 0f 0b <0f> 0b 0f b7 8b d4 00 00 00 89 c2 44 89 e6 48 c7 c7 98 87 a6 81

...
...
...
RIP [<ffffffff8103e527>] identify_secondary_cpu+0x57/0x80 RSP <ffffc900400b7f08> ---[ end trace dc5563100443876e ]---

I surmised that reducing the number of dom0 vcpus might solve this issue (they were unbounded).

In testing, adding "dom0_max_vcpus=4 dom0_vcpus_pin" to the GRUB_CMDLINE_XEN_DEFAULT line in /etc/default/grub and re-running grub2-mkconfig has resulted in the system I have that never booted Xen 4.6.3-12 + Kernel 4.9.13 booting every single time out of 5-10 tests.
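For anyone who wants to script the workaround described above, here is a minimal Python sketch that appends the two options to GRUB_CMDLINE_XEN_DEFAULT in the text of the GRUB defaults file. The function name add_xen_options and the sample file contents are hypothetical and only for illustration; the sketch works on a string and does not write any file or run grub2-mkconfig itself.

```python
import re

# Matches a line such as: GRUB_CMDLINE_XEN_DEFAULT="dom0_mem=1024M"
XEN_LINE = re.compile(r'^(GRUB_CMDLINE_XEN_DEFAULT=)"(.*)"[ \t]*$', re.MULTILINE)


def add_xen_options(grub_text, options):
    """Return grub_text with each option added to GRUB_CMDLINE_XEN_DEFAULT.

    An option whose name (the part before '=') is already present is left
    untouched, so running the function twice gives the same result.
    If the variable is missing, a new line is appended at the end.
    """
    def merge(existing):
        current = existing.split()
        names = {item.split("=", 1)[0] for item in current}
        for opt in options:
            if opt.split("=", 1)[0] not in names:
                current.append(opt)
        return " ".join(current)

    match = XEN_LINE.search(grub_text)
    if match is None:
        separator = "" if (not grub_text or grub_text.endswith("\n")) else "\n"
        return grub_text + separator + 'GRUB_CMDLINE_XEN_DEFAULT="%s"\n' % merge("")

    new_line = '%s"%s"' % (match.group(1), merge(match.group(2)))
    return grub_text[:match.start()] + new_line + grub_text[match.end():]


if __name__ == "__main__":
    # Hypothetical file contents, not taken from any machine in this thread.
    sample = 'GRUB_TIMEOUT=5\nGRUB_CMDLINE_XEN_DEFAULT="dom0_mem=1024M,max:1024M"\n'
    print(add_xen_options(sample, ["dom0_max_vcpus=4", "dom0_vcpus_pin"]))
```

After writing the result back (on CentOS the file is normally /etc/default/grub), the boot configuration still has to be regenerated, for example with grub2-mkconfig -o /boot/grub2/grub.cfg on a BIOS-booted CentOS 7 host. That output path is an assumption and differs on EFI systems.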

So...I don't know if there's a race condition somewhere, or what...but...so far this workaround has not failed me.

Thanks, -Dave

On Fri, Apr 7, 2017 at 6:58 AM, PJ Welsh <pjwelsh at gmail.com> wrote:
> I've not gotten any bites from my posting on the xen-devel mailing list.
> Here is the only one to-date:
> https://lists.xen.org/archives/html/xen-devel/2017-04/msg01069.html
>
> From that email, there needs to be some hypervisor messages.
>
> Does anyone know how to produce the hypervisor messages? I've already
> removed the rhgb and quiet options from the boot.
>
> Thanks
> PJ

I spoke too soon. To get more information: Please see

https://wiki.xenproject.org/wiki/Reporting_Bugs_against_Xen_Project

and

https://wiki.xenproject.org/wiki/Xen_Serial_Console

or alternatively at least add "vga=keep".

CentOS-virt mailing list CentOS-virt@centos.org https://lists.centos.org/mailman/listinfo/centos-virt

PJ Welsh

1:44 p.m.

New subject: Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.

Apologies: I installed the newer -26 kernel and had not rebooted into it. The grub2 menu item should have been "CentOS Linux (4.9.20-25.el7.x86_64) 7 (Core)". I am currently restarting that remote affected system (unmodified grub2 entry first). Thanks PJ

On Tue, Apr 18, 2017 at 8:39 AM, PJ Welsh pjwelsh@gmail.com wrote:

Just to note, the same pattern happens on C7:
"CentOS Linux, with Xen hypervisor" = reboot
"CentOS Linux (4.9.20-26.el7.x86_64) 7 (Core)" = boot

[root@XXX ~]# uname -a
Linux XXX 4.9.20-25.el7.x86_64 #1 SMP Fri Mar 31 08:53:28 CDT 2017 x86_64 x86_64 x86_64

On Tue, Apr 18, 2017 at 8:36 AM, PJ Welsh pjwelsh@gmail.com wrote:

...
There was a note that the non-Xen kernel at the same kernel version did indeed boot: "CentOS-6 4.9.20-26 kernel exhibits the same constant kernel-start-then-reboot issue when booting under the "CentOS Linux, with Xen hypervisor" grub2 menu option. However, it *does* properly boot under the "CentOS Linux (4.9.20-25.el7.x86_64) 7 (Core)" grub2 menu option!"

Trying to get back into being able to test this more.

Thanks PJ

On Tue, Apr 18, 2017 at 8:30 AM, Johnny Hughes johnny@centos.org wrote:

On 04/14/2017 03:26 PM, Anderson, Dave wrote:

Sad to say that I already tested 4.9.20-26 from your repo yesterday...it does look a little cleaner before it dies, but still dies. I have not tested it with the vcpu=4 workaround, but I can tonight if you would like. Relevant bits below:

Loading Xen 4.6.3-12.el7 ...
Loading Linux 4.9.20-26.el7.x86_64 ...
Loading initial ramdisk ...
[ 0.000000] Linux version 4.9.20-26.el7.x86_64 (mockbuild@) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Tue Apr 4 11:19:26 CDT 2017
<snip>
[ 6.195089] smpboot: Max logical packages: 1
[ 6.199549] VPMU disabled by hypervisor.
[ 6.203663] Performance Events: SandyBridge events, PMU not available due to virtualization, using software events only.
[ 6.215436] NMI watchdog: disabled (cpu0): hardware events not enabled
[ 6.222139] NMI watchdog: Shutting down hard lockup detector on all cpus
[ 6.229165] installing Xen timer for CPU 1
[ 6.233849] installing Xen timer for CPU 2
[ 6.238504] installing Xen timer for CPU 3
[ 6.243139] installing Xen timer for CPU 4
[ 6.247836] installing Xen timer for CPU 5
[ 6.252478] installing Xen timer for CPU 6
[ 6.257155] installing Xen timer for CPU 7
[ 6.261795] installing Xen timer for CPU 8
[ 6.266358] smpboot: Package 1 of CPU 8 exceeds BIOS package data 1.
[ 6.272736] ------------[ cut here ]------------
[ 6.277358] kernel BUG at arch/x86/kernel/cpu/common.c:997!
[ 6.280104] random: fast init done
[ 6.286333] invalid opcode: 0000 [#1] SMP
[ 6.290343] Modules linked in:
[ 6.293430] CPU: 8 PID: 0 Comm: swapper/8 Not tainted 4.9.20-26.el7.x86_64 #1
[ 6.300568] Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.2a 08/04/2015
[ 6.307183] task: ffff880058a68000 task.stack: ffffc900400c0000
[ 6.313103] RIP: e030:[<ffffffff8103e7e7>] [<ffffffff8103e7e7>] identify_secondary_cpu+0x57/0x80
[ 6.322019] RSP: e02b:ffffc900400c3f08 EFLAGS: 00010086
[ 6.327333] RAX: 00000000ffffffe4 RBX: ffff88005d80a020 RCX: ffffffff81e5ffc8
[ 6.334473] RDX: 0000000000000001 RSI: 0000000000000005 RDI: 0000000000000005
[ 6.341607] RBP: ffffc900400c3f18 R08: 00000000000000ce R09: 0000000000000000
[ 6.348738] R10: 0000000000000005 R11: 0000000000000006 R12: 0000000000000008
[ 6.355873] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 6.363006] FS: 0000000000000000(0000) GS:ffff88005d800000(0000) knlGS:0000000000000000
[ 6.371090] CS: e033 DS: 002b ES: 002b CR0: 0000000080050033
[ 6.376837] CR2: 0000000000000000 CR3: 0000000001e07000 CR4: 0000000000042660
[ 6.383970] Stack:
[ 6.386004] 0000000000000008 0000000000000000 ffffc900400c3f28 ffffffff8104ebce
[ 6.393483] ffffc900400c3f40 ffffffff81029855 0000000000000000 ffffc900400c3f50
[ 6.400963] ffffffff810298d0 0000000000000000 0000000000000000 0000000000000000
[ 6.408450] Call Trace:
[ 6.410907] [<ffffffff8104ebce>] smp_store_cpu_info+0x3e/0x40
[ 6.416753] [<ffffffff81029855>] cpu_bringup+0x35/0x90
[ 6.421981] [<ffffffff810298d0>] cpu_bringup_and_idle+0x20/0x40
[ 6.427987] Code: 44 89 e7 ff 50 68 0f b7 93 d2 00 00 00 39 d0 75 1c 0f b7 bb da 00 00 00 44 89 e6 e8 e4 02 01 00 85 c0 75 07 5b 41 5c 5d c3 0f 0b <0f> 0b 0f b7 8b d4 00 00 00 89 c2 44 89 e6 48 c7 c7 e8 ce ca 81
[ 6.448249] RIP [<ffffffff8103e7e7>] identify_secondary_cpu+0x57/0x80
[ 6.454801] RSP <ffffc900400c3f08>
[ 6.458305] ---[ end trace 2f9b62c5c7050204 ]---

So basically, it removes the "[Firmware Bug]: CPU1: APIC id mismatch. Firmware: 0 APIC: 1" lines, but otherwise dies the same way. I included a few extra lines up from the panic because the "[ 6.195089] smpboot: Max logical packages: 1" could possibly be relevant, I need to go look at a clean boot to see if that was in there on this machine.

Even more strangely, in addition to the machine I'm talking about which panics and reboots, I had a second nearly identical machine (different CPU/ram config, everything else the same) which booted but had some kind of hw conflict with 4.9.x that I never had before. It appears to be between Intel SCU and an intel PCIe NVMe SSD (luckily I wasn't using SCU, so I disabled that). Had that other machine not booted I would have just assumed 4.9.X was totally broken and sat on 3.18...so I'm glad that one machine booted at least :)

Thanks, -Dave

Dave,

Just for testing purposes, can you try booting the kernel in the normal way on the machine that does not work (a normal grub entry on the kernel with no xen.gz line)?

That way, we can hopefully narrow the issue down to a hypervisor issue or a kernel config issue.

Thanks, Johnny Hughes

...
...
On Apr 14, 2017, at 05:39, Johnny Hughes johnny@centos.org wrote:

Dave,

Take a look at this kernel as it is the one I think we are going to release (or a slightly newer 4.9.2x from kernel.org LTS). This version has some newer settings that are more redhat/fedora/centos base kernel like WRT what is a module and what is built into the kernel, etc.

https://people.centos.org/hughesjr/4.9.x/

Thanks, Johnny Hughes


PJ Welsh

5:39 p.m.

New subject: Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.

Here is something interesting... I went through the BIOS options and found that one R710 that *is* functioning only differed in that "Logical Processor"/Hyperthreading was *enabled* while the one that is *not* functioning had HT *disabled*. Enabled Logical Processor and the system starts without issue! I've rebooted 3 times now without issue.
Dell R710 BIOS version 6.4.0
2x Intel(R) Xeon(R) CPU L5639 @ 2.13GHz
4.9.20-26.el7.x86_64 #1 SMP Tue Apr 4 11:19:26 CDT 2017 x86_64 x86_64 x86_64 GNU/Linux
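To compare the Hyperthreading state of several hosts without going into the BIOS each time, something like the following Python sketch could be used. It reads the "Thread(s) per core" field that lscpu prints; the function names are hypothetical. As an assumption to verify, the value is most trustworthy when the host is booted on the plain (non-Xen) kernel, since a dom0 may see a topology presented by the hypervisor rather than the real one.

```python
import re

# lscpu prints a line such as: "Thread(s) per core:    2"
THREADS_RE = re.compile(r"^Thread\(s\) per core:\s*(\d+)", re.MULTILINE)


def threads_per_core(lscpu_output):
    """Return the 'Thread(s) per core' value from lscpu text, or None if absent."""
    match = THREADS_RE.search(lscpu_output)
    return int(match.group(1)) if match else None


def hyperthreading_enabled(lscpu_output):
    """True if more than one thread per core is reported, None if unknown."""
    count = threads_per_core(lscpu_output)
    return None if count is None else count > 1


if __name__ == "__main__":
    import subprocess

    # Runs the real lscpu tool; requires util-linux to be installed.
    text = subprocess.run(["lscpu"], capture_output=True, text=True).stdout
    print("Hyperthreading enabled:", hyperthreading_enabled(text))
```

Collecting this value from the working and the failing machines gives a quick way to check whether the pattern reported here (HT disabled = reboot loop under Xen) holds on other hardware.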


Johnny Hughes

19 Apr

10:40 a.m.

New subject: Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.

On 04/18/2017 12:39 PM, PJ Welsh wrote:

Here is something interesting... I went through the BIOS options and found that one R710 that *is* functioning only differed in that "Logical Processor"/Hyperthreading was *enabled* while the one that is *not* functioning had HT *disabled*. Enabled Logical Processor and the system starts without issue! I've rebooted 3 times now without issue.
Dell R710 BIOS version 6.4.0
2x Intel(R) Xeon(R) CPU L5639 @ 2.13GHz
4.9.20-26.el7.x86_64 #1 SMP Tue Apr 4 11:19:26 CDT 2017 x86_64 x86_64 x86_64 GNU/Linux

Outstanding .. I have now released a 4.9.23-26.el6 and .el7 to the system as normal updates. It should be available later today.

<snip>

PJ Welsh

5:18 p.m.

New subject: Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.

On Wed, Apr 19, 2017 at 5:40 AM, Johnny Hughes johnny@centos.org wrote:


Outstanding .. I have now released a 4.9.23-26.el6 and .el7 to the system as normal updates. It should be available later today.

<snip>

I've verified with a second Dell R710 that disabling Hyperthreading/Logical Processor causes the primary xen booting kernel to fail and reboot. Conversely, enabling it allows the system to start as expected and without any issue. Current tested kernel was:
4.9.13-22.el7.x86_64 #1 SMP Sun Feb 26 22:15:59 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

I just attempted an update and the 4.9.23-26 is not yet up. Does this update address the Hyperthreading issue in any way?

Thanks PJ

Johnny Hughes

5:33 p.m.

New subject: Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.

On 04/19/2017 12:18 PM, PJ Welsh wrote:

On Wed, Apr 19, 2017 at 5:40 AM, Johnny Hughes <johnny@centos.org> wrote:
On 04/18/2017 12:39 PM, PJ Welsh wrote:
> Here is something interesting... I went through the BIOS options and
> found that one R710 that *is* functioning only differed in that "Logical
> Processor"/Hyperthreading was *enabled* while the one that is *not*
> functioning had HT *disabled*. Enabled Logical Processor and the system
> starts without issue! I've rebooted 3 times now without issue.
> Dell R710 BIOS version 6.4.0
> 2x Intel(R) Xeon(R) CPU L5639  @ 2.13GHz
> 4.9.20-26.el7.x86_64 #1 SMP Tue Apr 4 11:19:26 CDT 2017 x86_64 x86_64
> x86_64 GNU/Linux
>

Outstanding .. I have now released a 4.9.23-26.el6 and .el7 to the
system as normal updates.  It should be available later today.

<snip>
I've verified with a second Dell R710 that disabling Hyperthreading/Logical Processor causes the primary xen booting kernel to fail and reboot. Conversely, enabling it allows the system to start as expected and without any issue. Current tested kernel was:
4.9.13-22.el7.x86_64 #1 SMP Sun Feb 26 22:15:59 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

I just attempted an update and the 4.9.23-26 is not yet up. Does this update address the Hyperthreading issue in any way?

I don't think so .. at least I did not specifically add anything to do so.

You can get it here for testing:

https://buildlogs.centos.org/centos/7/virt/x86_64/xen/

(or from /6/ as well for CentOS-6)

Not sure why it did not go out on the signing run .. will check that server.

Anderson, Dave

21 Apr

1:40 a.m.

New subject: Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.

Good news/bad news testing the new kernel on CentOS7 with my now notoriously finicky machines:

Good news: 4.9.23-26.el7 (grabbed today via yum update) isn't any worse than 4.9.13-22 was on my xen hosts (as far as I can tell so far at least)

Bad news: It isn't any better than 4.9.13 was for me either, if I don't set vcpu limit in the grub/xen config, it still panics like so:

[ 6.716016] CPU: Physical Processor ID: 0
[ 6.720199] CPU: Processor Core ID: 0
[ 6.724046] mce: CPU supports 2 MCE banks
[ 6.728239] Last level iTLB entries: 4KB 512, 2MB 8, 4MB 8
[ 6.733884] Last level dTLB entries: 4KB 512, 2MB 32, 4MB 32, 1GB 0
[ 6.740770] Freeing SMP alternatives memory: 32K (ffffffff821a8000 - ffffffff821b0000)
[ 6.750638] ftrace: allocating 34344 entries in 135 pages
[ 6.771888] smpboot: Max logical packages: 1
[ 6.776363] VPMU disabled by hypervisor.
[ 6.780479] Performance Events: SandyBridge events, PMU not available due to virtualization, using software events only.
[ 6.792237] NMI watchdog: disabled (cpu0): hardware events not enabled
[ 6.798943] NMI watchdog: Shutting down hard lockup detector on all cpus
[ 6.805949] installing Xen timer for CPU 1
[ 6.810659] installing Xen timer for CPU 2
[ 6.815317] installing Xen timer for CPU 3
[ 6.819947] installing Xen timer for CPU 4
[ 6.824618] installing Xen timer for CPU 5
[ 6.829282] installing Xen timer for CPU 6
[ 6.833935] installing Xen timer for CPU 7
[ 6.838565] installing Xen timer for CPU 8
[ 6.843110] smpboot: Package 1 of CPU 8 exceeds BIOS package data 1.
[ 6.849475] ------------[ cut here ]------------
[ 6.854091] kernel BUG at arch/x86/kernel/cpu/common.c:997!
[ 6.855864] random: fast init done
[ 6.863070] invalid opcode: 0000 [#1] SMP
[ 6.867088] Modules linked in:
[ 6.870168] CPU: 8 PID: 0 Comm: swapper/8 Not tainted 4.9.23-26.el7.x86_64 #1
[ 6.877298] Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.2a 08/04/2015
[ 6.883920] task: ffff880058a6a5c0 task.stack: ffffc900400c0000
[ 6.889840] RIP: e030:[<ffffffff8103e7e7>] [<ffffffff8103e7e7>] identify_secondary_cpu+0x57/0x80
[ 6.898756] RSP: e02b:ffffc900400c3f08 EFLAGS: 00010086
[ 6.904069] RAX: 00000000ffffffe4 RBX: ffff88005d80a020 RCX: ffffffff81e5ffc8
[ 6.911201] RDX: 0000000000000001 RSI: 0000000000000005 RDI: 0000000000000005
[ 6.918335] RBP: ffffc900400c3f18 R08: 00000000000000ce R09: 0000000000000000
[ 6.925466] R10: 0000000000000005 R11: 0000000000000006 R12: 0000000000000008
[ 6.932599] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 6.939735] FS: 0000000000000000(0000) GS:ffff88005d800000(0000) knlGS:0000000000000000
[ 6.947819] CS: e033 DS: 002b ES: 002b CR0: 0000000080050033
[ 6.953565] CR2: 0000000000000000 CR3: 0000000001e07000 CR4: 0000000000042660
[ 6.960696] Stack:
[ 6.962731] 0000000000000008 0000000000000000 ffffc900400c3f28 ffffffff8104ebce
[ 6.970205] ffffc900400c3f40 ffffffff81029855 0000000000000000 ffffc900400c3f50
[ 6.977691] ffffffff810298d0 0000000000000000 0000000000000000 0000000000000000
[ 6.985164] Call Trace:
[ 6.987626] [<ffffffff8104ebce>] smp_store_cpu_info+0x3e/0x40
[ 6.993480] [<ffffffff81029855>] cpu_bringup+0x35/0x90
[ 6.998700] [<ffffffff810298d0>] cpu_bringup_and_idle+0x20/0x40
[ 7.004706] Code: 44 89 e7 ff 50 68 0f b7 93 d2 00 00 00 39 d0 75 1c 0f b7 bb da 00 00 00 44 89 e6 e8 e4 02 01 00 85 c0 75 07 5b 41 5c 5d c3 0f 0b <0f> 0b 0f b7 8b d4 00 00 00 89 c2 44 89 e6 48 c7 c7 90 d3 ca 81
[ 7.024976] RIP [<ffffffff8103e7e7>] identify_secondary_cpu+0x57/0x80
[ 7.031528] RSP <ffffc900400c3f08>
[ 7.035032] ---[ end trace f2a8d75941398d9f ]---
[ 7.039658] Kernel panic - not syncing: Attempted to kill the idle task!

So...other than my workaround (which still works), I'm not sure what else I can provide in the way of feedback/testing. But if you want anything else gathered, let me know.
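Since the same panic has now been pasted for several kernel builds, a small Python sketch like the one below can help confirm that two captures really are the same failure. It pulls out the BUG location, the CPU, the kernel version and the faulting function, and ignores addresses and timestamps. The names summarize_oops and same_failure are hypothetical, and the patterns only cover the "Not tainted" report format seen in this thread.

```python
import re

BUG_RE = re.compile(r"kernel BUG at (\S+):(\d+)!")
CPU_RE = re.compile(r"CPU: (\d+) PID: \d+ Comm: \S+ Not tainted (\S+)")
# Matches the trailing "RIP [<addr>] function+0xOFF/0xLEN" line of the report.
RIP_RE = re.compile(r"RIP\s+\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize_oops(log_text):
    """Extract the key fields of a kernel BUG report; missing fields stay None."""
    summary = {"file": None, "line": None, "cpu": None,
               "kernel": None, "function": None}
    bug = BUG_RE.search(log_text)
    if bug:
        summary["file"], summary["line"] = bug.group(1), int(bug.group(2))
    cpu = CPU_RE.search(log_text)
    if cpu:
        summary["cpu"], summary["kernel"] = int(cpu.group(1)), cpu.group(2)
    rip = RIP_RE.search(log_text)
    if rip:
        summary["function"] = rip.group(1)
    return summary


def same_failure(a, b):
    """Compare two summaries on everything except the kernel version."""
    return all(a[key] == b[key] for key in ("file", "line", "cpu", "function"))


if __name__ == "__main__":
    report = (
        "[ 6.854091] kernel BUG at arch/x86/kernel/cpu/common.c:997!\n"
        "[ 6.870168] CPU: 8 PID: 0 Comm: swapper/8 Not tainted 4.9.23-26.el7.x86_64 #1\n"
        "[ 7.024976] RIP [<ffffffff8103e7e7>] identify_secondary_cpu+0x57/0x80\n"
    )
    print(summarize_oops(report))
```

Because the patterns use search rather than line anchors, the sketch also works on captures where the console output was pasted as one long line.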

Thanks, -Dave

-- Dave Anderson


Mark L Sung

10:01 a.m.

New subject: Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.

Hmm, it seems there are still stability issues with "4.9.23-26.el7.x86_64". I have recently heard of many issues related to Supermicro boards! :-(

Peace!!!

On Fri, Apr 21, 2017 at 9:40 AM, Anderson, Dave daveanderson@wsu.edu wrote:

...

Good news/bad news testing the new kernel on CentOS7 with my now notoriously finicky machines:

Good news: 4.9.23-26.el7 (grabbed today via yum update) isn't any worse than 4.9.13-22 was on my xen hosts (as far as I can tell so far at least)

Bad news: It isn't any better than 4.9.13 was for me either. If I don't set a vcpu limit in the grub/xen config, it still panics like so:

[    6.716016] CPU: Physical Processor ID: 0
[    6.720199] CPU: Processor Core ID: 0
[    6.724046] mce: CPU supports 2 MCE banks
[    6.728239] Last level iTLB entries: 4KB 512, 2MB 8, 4MB 8
[    6.733884] Last level dTLB entries: 4KB 512, 2MB 32, 4MB 32, 1GB 0
[    6.740770] Freeing SMP alternatives memory: 32K (ffffffff821a8000 - ffffffff821b0000)
[    6.750638] ftrace: allocating 34344 entries in 135 pages
[    6.771888] smpboot: Max logical packages: 1
[    6.776363] VPMU disabled by hypervisor.
[    6.780479] Performance Events: SandyBridge events, PMU not available due to virtualization, using software events only.
[    6.792237] NMI watchdog: disabled (cpu0): hardware events not enabled
[    6.798943] NMI watchdog: Shutting down hard lockup detector on all cpus
[    6.805949] installing Xen timer for CPU 1
[    6.810659] installing Xen timer for CPU 2
[    6.815317] installing Xen timer for CPU 3
[    6.819947] installing Xen timer for CPU 4
[    6.824618] installing Xen timer for CPU 5
[    6.829282] installing Xen timer for CPU 6
[    6.833935] installing Xen timer for CPU 7
[    6.838565] installing Xen timer for CPU 8
[    6.843110] smpboot: Package 1 of CPU 8 exceeds BIOS package data 1.
[    6.849475] ------------[ cut here ]------------
[    6.854091] kernel BUG at arch/x86/kernel/cpu/common.c:997!
[    6.855864] random: fast init done
[    6.863070] invalid opcode: 0000 [#1] SMP
[    6.867088] Modules linked in:
[    6.870168] CPU: 8 PID: 0 Comm: swapper/8 Not tainted 4.9.23-26.el7.x86_64 #1
[    6.877298] Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.2a 08/04/2015
[    6.883920] task: ffff880058a6a5c0 task.stack: ffffc900400c0000
[    6.889840] RIP: e030:[<ffffffff8103e7e7>] [<ffffffff8103e7e7>] identify_secondary_cpu+0x57/0x80
[    6.898756] RSP: e02b:ffffc900400c3f08 EFLAGS: 00010086
[    6.904069] RAX: 00000000ffffffe4 RBX: ffff88005d80a020 RCX: ffffffff81e5ffc8
[    6.911201] RDX: 0000000000000001 RSI: 0000000000000005 RDI: 0000000000000005
[    6.918335] RBP: ffffc900400c3f18 R08: 00000000000000ce R09: 0000000000000000
[    6.925466] R10: 0000000000000005 R11: 0000000000000006 R12: 0000000000000008
[    6.932599] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[    6.939735] FS: 0000000000000000(0000) GS:ffff88005d800000(0000) knlGS:0000000000000000
[    6.947819] CS: e033 DS: 002b ES: 002b CR0: 0000000080050033
[    6.953565] CR2: 0000000000000000 CR3: 0000000001e07000 CR4: 0000000000042660
[    6.960696] Stack:
[    6.962731] 0000000000000008 0000000000000000 ffffc900400c3f28 ffffffff8104ebce
[    6.970205] ffffc900400c3f40 ffffffff81029855 0000000000000000 ffffc900400c3f50
[    6.977691] ffffffff810298d0 0000000000000000 0000000000000000 0000000000000000
[    6.985164] Call Trace:
[    6.987626] [<ffffffff8104ebce>] smp_store_cpu_info+0x3e/0x40
[    6.993480] [<ffffffff81029855>] cpu_bringup+0x35/0x90
[    6.998700] [<ffffffff810298d0>] cpu_bringup_and_idle+0x20/0x40
[    7.004706] Code: 44 89 e7 ff 50 68 0f b7 93 d2 00 00 00 39 d0 75 1c 0f b7 bb da 00 00 00 44 89 e6 e8 e4 02 01 00 85 c0 75 07 5b 41 5c 5d c3 0f 0b <0f> 0b 0f b7 8b d4 00 00 00 89 c2 44 89 e6 48 c7 c7 90 d3 ca 81
[    7.024976] RIP [<ffffffff8103e7e7>] identify_secondary_cpu+0x57/0x80
[    7.031528] RSP <ffffc900400c3f08>
[    7.035032] ---[ end trace f2a8d75941398d9f ]---
[    7.039658] Kernel panic - not syncing: Attempted to kill the idle task!

So...other than my work around...that still works...not sure what else I can provide in the way of feedback/testing. But if you want anything else gathered, let me know.

Thanks, -Dave

-- Dave Anderson

...
On Apr 19, 2017, at 10:33 AM, Johnny Hughes johnny@centos.org wrote:


Kevin Stange

4:22 p.m.

New subject: Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.

For some additional context, all my hardware is Supermicro and working great on 4.9.13 - 26. I have dom0_max_vcpus=2 because of issues I was having with deadlocked CPU cores before setting that option on 3.18 kernels. In my experience setting that value doesn't cause any detriment to the dom0, which isn't doing most of the work anyway.

These are all the motherboards I'm running the kernel stably on:

Supermicro X8DT3
Supermicro X8DT6
Supermicro X9DRD-iF/LF
Supermicro X9DRT
Supermicro X9SCL/X9SCM

I'm on CentOS 6 across the board.

On 04/21/2017 05:01 AM, Mark L Sung wrote:

...


-- Kevin Stange Chief Technology Officer Steadfast | Managed Infrastructure, Datacenter and Cloud Services 800 S Wells, Suite 190 | Chicago, IL 60607 312.602.2689 X203 | Fax: 312.602.2688 kevin@steadfast.net | www.steadfast.net

PJ Welsh

14 Apr 2017

2:34 p.m.

New subject: Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.

Very nice on the sleuthing! Thanks

On Fri, Apr 14, 2017 at 5:16 AM, Anderson, Dave daveanderson@wsu.edu wrote:

...

List moderator: feel free to delete my previous large message with attachments that's in the moderation queue...it's now obsolete anyway.

I have found a fix/workaround for my reboot issues with Xen 4.6.3-12 + Kernel 4.9.13:

Once I finally got serial output all the way through the boot process (xen+dom0) I discovered the stack trace:

[Firmware Bug]: CPU7: APIC id mismatch. Firmware: 0 APIC: 7
installing Xen timer for CPU 8
[Firmware Bug]: CPU8: APIC id mismatch. Firmware: 0 APIC: 20
smpboot: Package 1 of CPU 8 exceeds BIOS package data 1.
------------[ cut here ]------------
kernel BUG at arch/x86/kernel/cpu/common.c:997!
invalid opcode: 0000 [#1] SMP
Modules linked in:
CPU: 8 PID: 0 Comm: swapper/8 Not tainted 4.9.13-22.el7.x86_64 #1
Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.2a 08/04/2015
random: fast init done
task: ffff880058a8c4c0 task.stack: ffffc900400b4000
RIP: e030:[<ffffffff8103e527>] [<ffffffff8103e527>] identify_secondary_cpu+0x57/0x80
RSP: e02b:ffffc900400b7f08 EFLAGS: 00010086
RAX: 00000000ffffffe4 RBX: ffff88005d80a020 RCX: ffffffff81c5be68
RDX: 0000000000000001 RSI: 0000000000000005 RDI: 0000000000000005
RBP: ffffc900400b7f18 R08: 00000000000000cb R09: 0000000000000004
R10: 0000000000000000 R11: 0000000000000006 R12: 0000000000000008
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88005d800000(0000) knlGS:0000000000000000
CS: e033 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000001c07000 CR4: 0000000000042660
Stack:
0000000000000008 0000000000000000 ffffc900400b7f28 ffffffff8104e94e
ffffc900400b7f40 ffffffff81029925 0000000000000000 ffffc900400b7f50
ffffffff810299a0 0000000000000000 0000000000000000 0000000000000000
Call Trace:
[<ffffffff8104e94e>] smp_store_cpu_info+0x3e/0x40
[<ffffffff81029925>] cpu_bringup+0x35/0x90
[<ffffffff810299a0>] cpu_bringup_and_idle+0x20/0x40
Code: 44 89 e7 ff 50 68 0f b7 93 d2 00 00 00 39 d0 75 1c 0f b7 bb da 00 00 00 44 89 e6 e8 24 03 01 00 85 c0 75 07 5b 41 5c 5d c3 0f 0b <0f> 0b 0f b7 8b d4 00 00 00 89 c2 44 89 e6 48 c7 c7 98 87 a6 81
RIP [<ffffffff8103e527>] identify_secondary_cpu+0x57/0x80
RSP <ffffc900400b7f08>
---[ end trace dc5563100443876e ]---

I surmised that reducing the number of dom0 vcpus might solve this issue (they were unbounded).

In testing, adding "dom0_max_vcpus=4 dom0_vcpus_pin" to the GRUB_CMDLINE_XEN_DEFAULT line in /etc/default/grub and re-running grub2-mkconfig has resulted in the system that never booted Xen 4.6.3-12 + Kernel 4.9.13 booting every single time across 5-10 tests.
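
A rough sketch of the edit described above, in case it helps anyone trying the same workaround. The helper name add_dom0_vcpu_limit is made up for illustration and is not from the thread. It assumes the GRUB_CMDLINE_XEN_DEFAULT value is double-quoted, and it only touches the file it is given, so it can be tried on a copy first:

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): add the dom0 vCPU options from the
# workaround to GRUB_CMDLINE_XEN_DEFAULT in the given grub defaults file.
add_dom0_vcpu_limit() {
    file=$1
    opts="dom0_max_vcpus=4 dom0_vcpus_pin"

    if grep -q '^GRUB_CMDLINE_XEN_DEFAULT=.*dom0_max_vcpus=' "$file"; then
        # A vCPU limit is already set; leave the file alone.
        return 0
    fi

    if grep -q '^GRUB_CMDLINE_XEN_DEFAULT="' "$file"; then
        # Insert the options at the start of the existing double-quoted value,
        # keeping whatever options were already there.
        sed "s/^GRUB_CMDLINE_XEN_DEFAULT=\"/&${opts} /" "$file" > "$file.new" &&
            mv "$file.new" "$file"
    else
        # Variable missing (or not double-quoted): append a fresh assignment.
        # Note that a later assignment overrides an earlier one when the file
        # is sourced, so check the result by hand in that case.
        printf 'GRUB_CMDLINE_XEN_DEFAULT="%s"\n' "$opts" >> "$file"
    fi
}
```

After editing the real file (normally /etc/default/grub on CentOS 7), the grub config still has to be regenerated, for example with "grub2-mkconfig -o /boot/grub2/grub.cfg" on a BIOS-boot host. That output path is an assumption and differs on EFI systems.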

So...I don't know if there's a race condition somewhere, or what...but...so far this workaround has not failed me.

Thanks, -Dave

...
On Fri, Apr 7, 2017 at 6:58 AM, PJ Welsh <pjwelsh at gmail.com> wrote:

I've not gotten any bites from my posting on the xen-devel mailing list. Here is the only one to-date: https://lists.xen.org/archives/html/xen-devel/2017-04/msg01069.html

From that email, there needs to be some hypervisor messages.

Does anyone know how to produce the hypervisor messages? I've already removed the rhgb and quiet options from the boot.

Thanks PJ

I spoke too soon. To get more information: Please see

https://wiki.xenproject.org/wiki/Reporting_Bugs_against_Xen_Project

and

https://wiki.xenproject.org/wiki/Xen_Serial_Console

or alternatively at least add "vga=keep".
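
For anyone else trying to capture the hypervisor messages, the serial console setup from the wiki page above boils down to something like the fragment below. This is only a sketch: the port (com1), the baud rate (115200) and the use of GRUB_CMDLINE_LINUX for the dom0 kernel options are assumptions that depend on the hardware and the grub setup, and any options already on those lines should be kept.

```shell
# /etc/default/grub (fragment); the values are assumptions, adjust to your hardware.

# Xen hypervisor: log everything and send the console to the first serial
# port as well as VGA.
GRUB_CMDLINE_XEN_DEFAULT="loglvl=all guest_loglvl=all com1=115200,8n1 console=com1,vga"

# dom0 kernel: use the Xen console (hvc0) so dom0 boot messages end up on the
# same serial line as the hypervisor output.
GRUB_CMDLINE_LINUX="console=hvc0 earlyprintk=xen"
```

As with any change to this file, the grub config has to be regenerated before the options take effect.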

pjwelsh


virt@lists.centos.org

17 comments

5 participants

participants (5)

Anderson, Dave
Johnny Hughes
Kevin Stange
Mark L Sung
PJ Welsh