[CentOS-virt] Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.

Johnny Hughes johnny at centos.org
Tue Apr 18 13:30:13 UTC 2017


On 04/14/2017 03:26 PM, Anderson, Dave wrote:
> Sad to say that I already tested 4.9.20-26 from your repo yesterday. It does look a little cleaner before it dies, but it still dies. I have not tested it with the vcpu=4 workaround, but I can tonight if you would like. Relevant bits below:
> 
> Loading Xen 4.6.3-12.el7 ...
> Loading Linux 4.9.20-26.el7.x86_64 ...
> Loading initial ramdisk ...
> [    0.000000] Linux version 4.9.20-26.el7.x86_64 (mockbuild@) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Tue Apr 4 11:19:26 CDT 2017
> 
> <snip>
> 
> [    6.195089] smpboot: Max logical packages: 1
> [    6.199549] VPMU disabled by hypervisor.
> [    6.203663] Performance Events: SandyBridge events, PMU not available due to virtualization, using software events only.
> [    6.215436] NMI watchdog: disabled (cpu0): hardware events not enabled
> [    6.222139] NMI watchdog: Shutting down hard lockup detector on all cpus
> [    6.229165] installing Xen timer for CPU 1
> [    6.233849] installing Xen timer for CPU 2
> [    6.238504] installing Xen timer for CPU 3
> [    6.243139] installing Xen timer for CPU 4
> [    6.247836] installing Xen timer for CPU 5
> [    6.252478] installing Xen timer for CPU 6
> [    6.257155] installing Xen timer for CPU 7
> [    6.261795] installing Xen timer for CPU 8
> [    6.266358] smpboot: Package 1 of CPU 8 exceeds BIOS package data 1.
> [    6.272736] ------------[ cut here ]------------
> [    6.277358] kernel BUG at arch/x86/kernel/cpu/common.c:997!
> [    6.280104] random: fast init done
> [    6.286333] invalid opcode: 0000 [#1] SMP
> [    6.290343] Modules linked in:
> [    6.293430] CPU: 8 PID: 0 Comm: swapper/8 Not tainted 4.9.20-26.el7.x86_64 #1
> [    6.300568] Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.2a 08/04/2015
> [    6.307183] task: ffff880058a68000 task.stack: ffffc900400c0000
> [    6.313103] RIP: e030:[<ffffffff8103e7e7>]  [<ffffffff8103e7e7>] identify_secondary_cpu+0x57/0x80
> [    6.322019] RSP: e02b:ffffc900400c3f08  EFLAGS: 00010086
> [    6.327333] RAX: 00000000ffffffe4 RBX: ffff88005d80a020 RCX: ffffffff81e5ffc8
> [    6.334473] RDX: 0000000000000001 RSI: 0000000000000005 RDI: 0000000000000005
> [    6.341607] RBP: ffffc900400c3f18 R08: 00000000000000ce R09: 0000000000000000
> [    6.348738] R10: 0000000000000005 R11: 0000000000000006 R12: 0000000000000008
> [    6.355873] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [    6.363006] FS:  0000000000000000(0000) GS:ffff88005d800000(0000) knlGS:0000000000000000
> [    6.371090] CS:  e033 DS: 002b ES: 002b CR0: 0000000080050033
> [    6.376837] CR2: 0000000000000000 CR3: 0000000001e07000 CR4: 0000000000042660
> [    6.383970] Stack:
> [    6.386004]  0000000000000008 0000000000000000 ffffc900400c3f28 ffffffff8104ebce
> [    6.393483]  ffffc900400c3f40 ffffffff81029855 0000000000000000 ffffc900400c3f50
> [    6.400963]  ffffffff810298d0 0000000000000000 0000000000000000 0000000000000000
> [    6.408450] Call Trace:
> [    6.410907]  [<ffffffff8104ebce>] smp_store_cpu_info+0x3e/0x40
> [    6.416753]  [<ffffffff81029855>] cpu_bringup+0x35/0x90
> [    6.421981]  [<ffffffff810298d0>] cpu_bringup_and_idle+0x20/0x40
> [    6.427987] Code: 44 89 e7 ff 50 68 0f b7 93 d2 00 00 00 39 d0 75 1c 0f b7 bb da 00 00 00 44 89 e6 e8 e4 02 01 00 85 c0 75 07 5b 41 5c 5d c3 0f 0b <0f> 0b 0f b7 8b d4 00 00 00 89 c2 44 89 e6 48 c7 c7 e8 ce ca 81 
> [    6.448249] RIP  [<ffffffff8103e7e7>] identify_secondary_cpu+0x57/0x80
> [    6.454801]  RSP <ffffc900400c3f08>
> [    6.458305] ---[ end trace 2f9b62c5c7050204 ]---
> 
> 
> So basically, it removes the "[Firmware Bug]: CPU1: APIC id mismatch. Firmware: 0 APIC: 1" lines, but otherwise dies the same way. I included a few extra lines above the panic because "[    6.195089] smpboot: Max logical packages: 1" could be relevant; I need to look at a clean boot to see whether that line also appears on this machine.
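
The two traces in this thread can be compared mechanically rather than by eye. The following is a minimal sketch, assuming the serial console output has been saved as plain text; the function name summarize_oops and the dictionary keys are illustrative, not part of any kernel or Xen tool. It pulls out the BUG location, the CPU, the kernel version, and the faulting symbol.

```python
import re

# Patterns match the oops lines as they appear in the logs quoted in this thread.
BUG_RE = re.compile(r"kernel BUG at (\S+):(\d+)!")
CPU_RE = re.compile(r"CPU: (\d+) PID: \d+ Comm: (\S+) .* (\S+) #\d+")
RIP_RE = re.compile(r"RIP\s+\[<[0-9a-f]+>\]\s+(\S+)")


def summarize_oops(log_text):
    """Return the first BUG location, CPU, kernel version, and RIP symbol found."""
    summary = {}
    for line in log_text.splitlines():
        m = BUG_RE.search(line)
        if m:
            summary.setdefault("bug_at", f"{m.group(1)}:{m.group(2)}")
        m = CPU_RE.search(line)
        if m:
            summary.setdefault("cpu", int(m.group(1)))
            summary.setdefault("kernel", m.group(3))
        m = RIP_RE.search(line)
        if m:
            summary.setdefault("rip", m.group(1))
    return summary
```

Running this over each saved log and comparing the resulting dictionaries shows at a glance whether two kernels die at the same source line and symbol, as the 4.9.13 and 4.9.20 traces here do.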
> 
> 
> Even more strangely, in addition to the machine I'm talking about, which panics and reboots, I had a second, nearly identical machine (different CPU/RAM config, everything else the same) that booted but had some kind of hardware conflict with 4.9.x that I never had before. It appears to be between Intel SCU and an Intel PCIe NVMe SSD (luckily I wasn't using SCU, so I disabled that). Had that other machine not booted, I would have just assumed 4.9.x was totally broken and stayed on 3.18, so I'm glad that one machine booted at least :)
> 
> Thanks,
> -Dave

Dave,

Just for testing purposes, can you try booting the kernel in the normal
way on the machine that does not work (a normal grub entry for the kernel
with no xen.gz line)?

That way, we can hopefully narrow the issue down to a hypervisor issue
or a kernel config issue.
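
To confirm which case a given test boot actually falls into, the running system can be asked whether a hypervisor is present. A minimal sketch, assuming /sys/hypervisor/type and /proc/xen/capabilities behave as they do on typical Xen hosts (the function name xen_status and its return labels are illustrative only):

```python
from pathlib import Path


def xen_status(sys_root="/sys", proc_root="/proc"):
    """Return 'no-xen', 'xen-dom0', or 'xen-domU' for the running kernel.

    The roots are parameters so the check can be run against a fake tree.
    """
    hv_type = Path(sys_root) / "hypervisor" / "type"
    caps = Path(proc_root) / "xen" / "capabilities"

    # /sys/hypervisor/type reads "xen" when the kernel runs on top of Xen.
    try:
        on_xen = hv_type.read_text().strip() == "xen"
    except OSError:
        on_xen = False
    if not on_xen:
        return "no-xen"

    # dom0 advertises "control_d" here; this file needs xenfs to be mounted.
    try:
        is_dom0 = "control_d" in caps.read_text()
    except OSError:
        is_dom0 = False
    return "xen-dom0" if is_dom0 else "xen-domU"


if __name__ == "__main__":
    print(xen_status())
```

A result of "no-xen" after booting the plain grub entry indicates the kernel came up without the hypervisor, which is the case this test is meant to isolate.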

Thanks,
Johnny Hughes

> 
> 
>> On Apr 14, 2017, at 05:39, Johnny Hughes <johnny at centos.org> wrote:
>>
>> Dave,
>>
>> Take a look at this kernel as it is the one I think we are going to
>> release (or a slightly newer 4.9.2x from kernel.org LTS). This version
>> has some newer settings that are more redhat/fedora/centos base kernel
>> like WRT what is a module and what is built into the kernel, etc.
>>
>> https://people.centos.org/hughesjr/4.9.x/
>>
>> Thanks,
>> Johnny Hughes
>>
>> On 04/14/2017 05:16 AM, Anderson, Dave wrote:
>>> List moderator: feel free to delete my previous large message with attachments that's in the moderation queue...it's now obsolete anyway.
>>>
>>>
>>> I have found a fix/workaround for my reboot issues with Xen 4.6.3-12 + Kernel 4.9.13:
>>>
>>> Once I finally got serial output all the way through the boot process (xen+dom0) I discovered the stack trace:
>>>
>>> [Firmware Bug]: CPU7: APIC id mismatch. Firmware: 0 APIC: 7
>>> installing Xen timer for CPU 8
>>> [Firmware Bug]: CPU8: APIC id mismatch. Firmware: 0 APIC: 20
>>> smpboot: Package 1 of CPU 8 exceeds BIOS package data 1.
>>> ------------[ cut here ]------------
>>> kernel BUG at arch/x86/kernel/cpu/common.c:997!
>>> invalid opcode: 0000 [#1] SMP
>>> Modules linked in:
>>> CPU: 8 PID: 0 Comm: swapper/8 Not tainted 4.9.13-22.el7.x86_64 #1
>>> Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.2a 08/04/2015
>>> random: fast init done
>>> task: ffff880058a8c4c0 task.stack: ffffc900400b4000
>>> RIP: e030:[<ffffffff8103e527>]  [<ffffffff8103e527>] identify_secondary_cpu+0x57/0x80
>>> RSP: e02b:ffffc900400b7f08  EFLAGS: 00010086
>>> RAX: 00000000ffffffe4 RBX: ffff88005d80a020 RCX: ffffffff81c5be68
>>> RDX: 0000000000000001 RSI: 0000000000000005 RDI: 0000000000000005
>>> RBP: ffffc900400b7f18 R08: 00000000000000cb R09: 0000000000000004
>>> R10: 0000000000000000 R11: 0000000000000006 R12: 0000000000000008
>>> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>>> FS:  0000000000000000(0000) GS:ffff88005d800000(0000) knlGS:0000000000000000
>>> CS:  e033 DS: 002b ES: 002b CR0: 0000000080050033
>>> CR2: 0000000000000000 CR3: 0000000001c07000 CR4: 0000000000042660
>>> Stack:
>>> 0000000000000008 0000000000000000 ffffc900400b7f28 ffffffff8104e94e
>>> ffffc900400b7f40 ffffffff81029925 0000000000000000 ffffc900400b7f50
>>> ffffffff810299a0 0000000000000000 0000000000000000 0000000000000000
>>> Call Trace:
>>> [<ffffffff8104e94e>] smp_store_cpu_info+0x3e/0x40
>>> [<ffffffff81029925>] cpu_bringup+0x35/0x90
>>> [<ffffffff810299a0>] cpu_bringup_and_idle+0x20/0x40
>>> Code: 44 89 e7 ff 50 68 0f b7 93 d2 00 00 00 39 d0 75 1c 0f b7 bb da 00 00 00 44 89 e6 e8 24 03 01 00 85 c0 75 07 5b 41 5c 5d c3 0f 0b <0f> 0b 0f b7 8b d4 00 00 00 89 c2 44 89 e6 48 c7 c7 98 87 a6 81 
>>> RIP  [<ffffffff8103e527>] identify_secondary_cpu+0x57/0x80
>>> RSP <ffffc900400b7f08>
>>> ---[ end trace dc5563100443876e ]---
>>>
>>> I surmised that reducing the number of dom0 vcpus might solve this issue (they were unbounded).
>>>
>>> In testing, adding "dom0_max_vcpus=4 dom0_vcpus_pin" to the GRUB_CMDLINE_XEN_DEFAULT line in /etc/default/grub and re-running grub2-mkconfig has resulted in the system that never booted Xen 4.6.3-12 + Kernel 4.9.13 booting every single time out of 5-10 tests.
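
The edit described above can be sketched as a small text transformation on the grub defaults file (/etc/default/grub on CentOS). This is only an illustration under stated assumptions: the helper name add_xen_options is hypothetical, the file is assumed to use a double-quoted GRUB_CMDLINE_XEN_DEFAULT="..." line, and an option whose name is already present is left untouched rather than overwritten.

```python
import re

# Matches a double-quoted GRUB_CMDLINE_XEN_DEFAULT="..." line.
XEN_LINE_RE = re.compile(
    r'^(GRUB_CMDLINE_XEN_DEFAULT=")([^"\n]*)(")[ \t]*$', re.MULTILINE
)


def add_xen_options(grub_defaults, options):
    """Return the file text with any missing Xen options appended."""

    def merge(match):
        current = match.group(2).split()
        # Compare by option name, so an existing dom0_max_vcpus=N is kept.
        names = {opt.split("=", 1)[0] for opt in current}
        extra = [opt for opt in options if opt.split("=", 1)[0] not in names]
        return match.group(1) + " ".join(current + extra) + match.group(3)

    new_text, count = XEN_LINE_RE.subn(merge, grub_defaults, count=1)
    if count == 0:
        # No existing line: add one at the end of the file.
        if new_text and not new_text.endswith("\n"):
            new_text += "\n"
        new_text += 'GRUB_CMDLINE_XEN_DEFAULT="' + " ".join(options) + '"\n'
    return new_text


if __name__ == "__main__":
    sample = 'GRUB_TIMEOUT=5\nGRUB_CMDLINE_XEN_DEFAULT="dom0_mem=1024M"\n'
    print(add_xen_options(sample, ["dom0_max_vcpus=4", "dom0_vcpus_pin"]))
```

The function only produces the new text; writing it back and re-running grub2-mkconfig remain manual steps, which keeps a boot-critical file from being changed without review.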
>>>
>>>
>>> So...I don't know if there's a race condition somewhere, or what...but...so far this workaround has not failed me.
>>>
>>> Thanks,
>>> -Dave
>>>
>>>
>>>
>>>> On Fri, Apr 7, 2017 at 6:58 AM, PJ Welsh <pjwelsh at gmail.com> wrote:
>>>>> I've not gotten any bites from my posting on the xen-devel mailing list.
>>>>> Here is the only one to-date:
>>>>> https://lists.xen.org/archives/html/xen-devel/2017-04/msg01069.html
>>>>>
>>>>> From that email, there needs to be some hypervisor messages.
>>>>>
>>>>> Does anyone know how to produce the hypervisor messages? I've already
>>>>> removed the rhgb and quiet options from the boot.
>>>>>
>>>>> Thanks
>>>>> PJ
>>>>
>>>>
>>>> I spoke too soon. To get more information: Please see
>>>>
>>>> https://wiki.xenproject.org/wiki/Reporting_Bugs_against_Xen_Project
>>>>
>>>> and
>>>>
>>>> https://wiki.xenproject.org/wiki/Xen_Serial_Console
>>>>
>>>> or alternatively at least add "vga=keep".
>>>>

