[CentOS-virt] Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.
Anderson, Dave
daveanderson at wsu.edu
Fri Apr 14 21:57:53 UTC 2017
I also just realized that the C6 portion of the subject line refers to CentOS 6, so I'd like to clarify that all of my testing and issues were under CentOS 7.3 with all patches applied.
Thanks,
-Dave
> On Apr 14, 2017, at 1:39 PM, Anderson, Dave <daveanderson at wsu.edu> wrote:
>
> So, strangely,
>
> I have two _identical_ dual-processor Xeon motherboards (same BIOS/IPMI versions; they even share an enclosure, one on the right side and the other on the left), each with a different CPU/memory configuration:
>
>
> Using 4.9.13 with vcpu limited to 4, early in the boot process, the one that _was_ booting before setting the xen vcpu args says:
> "[ 7.060720] smpboot: Max logical packages: 2",
>
> and the other one says
> "[ 6.195089] smpboot: Max logical packages: 1"
>
>
>
> They both have dual procs, known working/good.
>
>
> The first (the one that worked unmodified) has dual 8 core (16 HT/ea) and correctly detects "[ 0.000000] smpboot: Allowing 32 CPUs, 0 hotplug CPUs". It's a Xeon E5-2665v1.
>
> The second machine (which didn't work without the Xen vcpu args) has dual 4 core (8 HT/ea) and also correctly detects "[ 0.000000] smpboot: Allowing 16 CPUs, 0 hotplug CPUs". It's a Xeon E5-2643v1, so it seems like this one does OK until it decides there's only one CPU package?
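
As a sanity check on those counts, here is a minimal sketch of the arithmetic (sockets x cores x threads) compared against the "Allowing N CPUs" line. The function names are made up for illustration, and threads_per_core=2 assumes Hyper-Threading is enabled on both machines, which matches the HT figures quoted above.

```python
import re


def expected_logical_cpus(sockets, cores_per_socket, threads_per_core=2):
    """Logical CPUs the kernel should see for a given socket/core layout."""
    return sockets * cores_per_socket * threads_per_core


def allowed_cpus(log_line):
    """Extract N from an 'smpboot: Allowing N CPUs' line, or None if absent."""
    match = re.search(r"smpboot: Allowing (\d+) CPUs", log_line)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Log lines copied from the two machines described above.
    first = "[ 0.000000] smpboot: Allowing 32 CPUs, 0 hotplug CPUs"
    second = "[ 0.000000] smpboot: Allowing 16 CPUs, 0 hotplug CPUs"
    print(allowed_cpus(first) == expected_logical_cpus(2, 8))
    print(allowed_cpus(second) == expected_logical_cpus(2, 4))
```

If both comparisons hold, the total CPU count is being detected as expected on each machine, which would point at the package count rather than the CPU count as the odd value.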
>
> Thanks,
> -Dave
>
>
>> On Apr 14, 2017, at 13:26, Anderson, Dave <daveanderson at wsu.edu> wrote:
>>
>> Sad to say that I already tested 4.9.20-26 from your repo yesterday...it does look a little cleaner before it dies, but still dies. I have not tested it with the vcpu=4 workaround, but I can tonight if you would like. Relevant bits below:
>>
>> Loading Xen 4.6.3-12.el7 ...
>> Loading Linux 4.9.20-26.el7.x86_64 ...
>> Loading initial ramdisk ...
>> [ 0.000000] Linux version 4.9.20-26.el7.x86_64 (mockbuild@) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Tue Apr 4 11:19:26 CDT 2017
>>
>> <snip>
>>
>> [ 6.195089] smpboot: Max logical packages: 1
>> [ 6.199549] VPMU disabled by hypervisor.
>> [ 6.203663] Performance Events: SandyBridge events, PMU not available due to virtualization, using software events only.
>> [ 6.215436] NMI watchdog: disabled (cpu0): hardware events not enabled
>> [ 6.222139] NMI watchdog: Shutting down hard lockup detector on all cpus
>> [ 6.229165] installing Xen timer for CPU 1
>> [ 6.233849] installing Xen timer for CPU 2
>> [ 6.238504] installing Xen timer for CPU 3
>> [ 6.243139] installing Xen timer for CPU 4
>> [ 6.247836] installing Xen timer for CPU 5
>> [ 6.252478] installing Xen timer for CPU 6
>> [ 6.257155] installing Xen timer for CPU 7
>> [ 6.261795] installing Xen timer for CPU 8
>> [ 6.266358] smpboot: Package 1 of CPU 8 exceeds BIOS package data 1.
>> [ 6.272736] ------------[ cut here ]------------
>> [ 6.277358] kernel BUG at arch/x86/kernel/cpu/common.c:997!
>> [ 6.280104] random: fast init done
>> [ 6.286333] invalid opcode: 0000 [#1] SMP
>> [ 6.290343] Modules linked in:
>> [ 6.293430] CPU: 8 PID: 0 Comm: swapper/8 Not tainted 4.9.20-26.el7.x86_64 #1
>> [ 6.300568] Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.2a 08/04/2015
>> [ 6.307183] task: ffff880058a68000 task.stack: ffffc900400c0000
>> [ 6.313103] RIP: e030:[<ffffffff8103e7e7>] [<ffffffff8103e7e7>] identify_secondary_cpu+0x57/0x80
>> [ 6.322019] RSP: e02b:ffffc900400c3f08 EFLAGS: 00010086
>> [ 6.327333] RAX: 00000000ffffffe4 RBX: ffff88005d80a020 RCX: ffffffff81e5ffc8
>> [ 6.334473] RDX: 0000000000000001 RSI: 0000000000000005 RDI: 0000000000000005
>> [ 6.341607] RBP: ffffc900400c3f18 R08: 00000000000000ce R09: 0000000000000000
>> [ 6.348738] R10: 0000000000000005 R11: 0000000000000006 R12: 0000000000000008
>> [ 6.355873] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>> [ 6.363006] FS: 0000000000000000(0000) GS:ffff88005d800000(0000) knlGS:0000000000000000
>> [ 6.371090] CS: e033 DS: 002b ES: 002b CR0: 0000000080050033
>> [ 6.376837] CR2: 0000000000000000 CR3: 0000000001e07000 CR4: 0000000000042660
>> [ 6.383970] Stack:
>> [ 6.386004] 0000000000000008 0000000000000000 ffffc900400c3f28 ffffffff8104ebce
>> [ 6.393483] ffffc900400c3f40 ffffffff81029855 0000000000000000 ffffc900400c3f50
>> [ 6.400963] ffffffff810298d0 0000000000000000 0000000000000000 0000000000000000
>> [ 6.408450] Call Trace:
>> [ 6.410907] [<ffffffff8104ebce>] smp_store_cpu_info+0x3e/0x40
>> [ 6.416753] [<ffffffff81029855>] cpu_bringup+0x35/0x90
>> [ 6.421981] [<ffffffff810298d0>] cpu_bringup_and_idle+0x20/0x40
>> [ 6.427987] Code: 44 89 e7 ff 50 68 0f b7 93 d2 00 00 00 39 d0 75 1c 0f b7 bb da 00 00 00 44 89 e6 e8 e4 02 01 00 85 c0 75 07 5b 41 5c 5d c3 0f 0b <0f> 0b 0f b7 8b d4 00 00 00 89 c2 44 89 e6 48 c7 c7 e8 ce ca 81
>> [ 6.448249] RIP [<ffffffff8103e7e7>] identify_secondary_cpu+0x57/0x80
>> [ 6.454801] RSP <ffffc900400c3f08>
>> [ 6.458305] ---[ end trace 2f9b62c5c7050204 ]---
>>
>>
>> So basically, it removes the "[Firmware Bug]: CPU1: APIC id mismatch. Firmware: 0 APIC: 1" lines, but otherwise dies the same way. I included a few extra lines above the panic because the "[ 6.195089] smpboot: Max logical packages: 1" line could be relevant; I need to look at a clean boot to see whether that line also appears on this machine.
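
For comparing a failing boot against a clean one, a small script can pull just the package-related smpboot lines out of a captured serial log. This is only a sketch: the function name and dictionary keys are hypothetical, and the regular expressions are written against the exact message wording shown in the logs in this thread, so other kernel versions may word them differently.

```python
import re

# Patterns match the message text exactly as it appears in the logs above.
MAX_PKG_RE = re.compile(r"smpboot: Max logical packages: (\d+)")
EXCEEDS_RE = re.compile(
    r"smpboot: Package (\d+) of CPU (\d+) exceeds BIOS package data (\d+)"
)


def summarize_topology(log_text):
    """Collect the package count and any 'exceeds BIOS package data' warnings."""
    max_pkgs = [int(m.group(1)) for m in MAX_PKG_RE.finditer(log_text)]
    overflows = [
        {
            "package": int(m.group(1)),
            "cpu": int(m.group(2)),
            "bios_package_data": int(m.group(3)),
        }
        for m in EXCEEDS_RE.finditer(log_text)
    ]
    return {
        "max_logical_packages": max_pkgs[0] if max_pkgs else None,
        "overflows": overflows,
    }


if __name__ == "__main__":
    # Lines copied from the 4.9.20-26 boot quoted above.
    sample = (
        "[ 6.195089] smpboot: Max logical packages: 1\n"
        "[ 6.261795] installing Xen timer for CPU 8\n"
        "[ 6.266358] smpboot: Package 1 of CPU 8 exceeds BIOS package data 1.\n"
    )
    print(summarize_topology(sample))
```

Running this over the logs from both machines would show side by side whether the "Max logical packages" value differs and which CPU triggers the warning.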
>>
>>
>> Even more strangely, in addition to the machine I'm talking about, which panics and reboots, I had a second, nearly identical machine (different CPU/RAM config, everything else the same) which booted but had some kind of hardware conflict with 4.9.x that I never had before. It appears to be between the Intel SCU and an Intel PCIe NVMe SSD (luckily I wasn't using the SCU, so I disabled it). Had that other machine not booted, I would have just assumed 4.9.x was totally broken and stayed on 3.18...so I'm glad that one machine booted at least :)
>>
>> Thanks,
>> -Dave
>>
>>
>>> On Apr 14, 2017, at 05:39, Johnny Hughes <johnny at centos.org> wrote:
>>>
>>> Dave,
>>>
>>> Take a look at this kernel as it is the one I think we are going to
>>> release (or a slightly newer 4.9.2x from kernel.org LTS). This version
>>> has some newer settings that are more redhat/fedora/centos base kernel
>>> like WRT what is a module and what is built into the kernel, etc.
>>>
>>> https://people.centos.org/hughesjr/4.9.x/
>>>
>>> Thanks,
>>> Johnny Hughes
>>>
>>> On 04/14/2017 05:16 AM, Anderson, Dave wrote:
>>>> List moderator: feel free to delete my previous large message with attachments that's in the moderation queue...it's now obsolete anyway.
>>>>
>>>>
>>>> I have found a fix/workaround for my reboot issues with Xen 4.6.3-12 + Kernel 4.9.13:
>>>>
>>>> Once I finally got serial output all the way through the boot process (xen+dom0) I discovered the stack trace:
>>>>
>>>> [Firmware Bug]: CPU7: APIC id mismatch. Firmware: 0 APIC: 7
>>>> installing Xen timer for CPU 8
>>>> [Firmware Bug]: CPU8: APIC id mismatch. Firmware: 0 APIC: 20
>>>> smpboot: Package 1 of CPU 8 exceeds BIOS package data 1.
>>>> ------------[ cut here ]------------
>>>> kernel BUG at arch/x86/kernel/cpu/common.c:997!
>>>> invalid opcode: 0000 [#1] SMP
>>>> Modules linked in:
>>>> CPU: 8 PID: 0 Comm: swapper/8 Not tainted 4.9.13-22.el7.x86_64 #1
>>>> Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.2a 08/04/2015
>>>> random: fast init done
>>>> task: ffff880058a8c4c0 task.stack: ffffc900400b4000
>>>> RIP: e030:[<ffffffff8103e527>] [<ffffffff8103e527>] identify_secondary_cpu+0x57/0x80
>>>> RSP: e02b:ffffc900400b7f08 EFLAGS: 00010086
>>>> RAX: 00000000ffffffe4 RBX: ffff88005d80a020 RCX: ffffffff81c5be68
>>>> RDX: 0000000000000001 RSI: 0000000000000005 RDI: 0000000000000005
>>>> RBP: ffffc900400b7f18 R08: 00000000000000cb R09: 0000000000000004
>>>> R10: 0000000000000000 R11: 0000000000000006 R12: 0000000000000008
>>>> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>>>> FS: 0000000000000000(0000) GS:ffff88005d800000(0000) knlGS:0000000000000000
>>>> CS: e033 DS: 002b ES: 002b CR0: 0000000080050033
>>>> CR2: 0000000000000000 CR3: 0000000001c07000 CR4: 0000000000042660
>>>> Stack:
>>>> 0000000000000008 0000000000000000 ffffc900400b7f28 ffffffff8104e94e
>>>> ffffc900400b7f40 ffffffff81029925 0000000000000000 ffffc900400b7f50
>>>> ffffffff810299a0 0000000000000000 0000000000000000 0000000000000000
>>>> Call Trace:
>>>> [<ffffffff8104e94e>] smp_store_cpu_info+0x3e/0x40
>>>> [<ffffffff81029925>] cpu_bringup+0x35/0x90
>>>> [<ffffffff810299a0>] cpu_bringup_and_idle+0x20/0x40
>>>> Code: 44 89 e7 ff 50 68 0f b7 93 d2 00 00 00 39 d0 75 1c 0f b7 bb da 00 00 00 44 89 e6 e8 24 03 01 00 85 c0 75 07 5b 41 5c 5d c3 0f 0b <0f> 0b 0f b7 8b d4 00 00 00 89 c2 44 89 e6 48 c7 c7 98 87 a6 81
>>>> RIP [<ffffffff8103e527>] identify_secondary_cpu+0x57/0x80
>>>> RSP <ffffc900400b7f08>
>>>> ---[ end trace dc5563100443876e ]---
>>>>
>>>> I surmised that reducing the number of dom0 vcpus (they were unbounded) might solve this issue.
>>>>
>>>> In testing, adding "dom0_max_vcpus=4 dom0_vcpus_pin" to the GRUB_CMDLINE_XEN_DEFAULT line in /etc/default/grub and re-running grub2-mkconfig has resulted in the system that previously never booted Xen 4.6.3-12 + Kernel 4.9.13 booting every single time across 5-10 tests.
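
The edit to the GRUB_CMDLINE_XEN_DEFAULT line can be sketched as a small text transformation. This is an illustration only: add_xen_options is a hypothetical helper, it works on the file contents as a string rather than touching /etc/default/grub itself, and it assumes the variable is written on one line with double quotes. The existing option in the sample is made up. grub2-mkconfig still has to be re-run afterwards as described above.

```python
import re

# Assumes a single-line, double-quoted assignment in the grub defaults file.
XEN_LINE_RE = re.compile(
    r'^(GRUB_CMDLINE_XEN_DEFAULT=")([^"]*)(")[ \t]*$', re.MULTILINE
)


def add_xen_options(grub_text, options):
    """Return grub_text with each option appended to GRUB_CMDLINE_XEN_DEFAULT.

    Options whose name (the part before '=') is already present are skipped.
    If the variable is missing entirely, a new line is appended.
    """

    def merge(current):
        words = current.split()
        names = {word.split("=", 1)[0] for word in words}
        for opt in options:
            if opt.split("=", 1)[0] not in names:
                words.append(opt)
        return " ".join(words)

    match = XEN_LINE_RE.search(grub_text)
    if match is None:
        sep = "" if (not grub_text or grub_text.endswith("\n")) else "\n"
        return grub_text + sep + 'GRUB_CMDLINE_XEN_DEFAULT="%s"\n' % merge("")
    # Replace only the quoted value, leaving the rest of the file untouched.
    return grub_text[:match.start(2)] + merge(match.group(2)) + grub_text[match.end(2):]


if __name__ == "__main__":
    # Sample contents are hypothetical, not taken from the machines above.
    sample = 'GRUB_TIMEOUT=5\nGRUB_CMDLINE_XEN_DEFAULT="dom0_mem=1024M"\n'
    print(add_xen_options(sample, ["dom0_max_vcpus=4", "dom0_vcpus_pin"]))
```

Matching on the option name keeps the helper idempotent, so running it twice does not duplicate dom0_max_vcpus or dom0_vcpus_pin.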
>>>>
>>>>
>>>> So...I don't know if there's a race condition somewhere, or what...but...so far this workaround has not failed me.
>>>>
>>>> Thanks,
>>>> -Dave
>>>>
>>>>
>>>>
>>>>> On Fri, Apr 7, 2017 at 6:58 AM, PJ Welsh <pjwelsh at gmail.com> wrote:
>>>>>> I've not gotten any bites from my posting on the xen-devel mailing list.
>>>>>> Here is the only one to-date:
>>>>>> https://lists.xen.org/archives/html/xen-devel/2017-04/msg01069.html
>>>>>>
>>>>>> According to that email, some hypervisor messages are needed.
>>>>>>
>>>>>> Does anyone know how to produce the hypervisor messages? I've already
>>>>>> removed the rhgb and quiet options from the boot.
>>>>>>
>>>>>> Thanks
>>>>>> PJ
>>>>>
>>>>>
>>>>> I spoke too soon. To get more information, please see
>>>>>
>>>>> https://wiki.xenproject.org/wiki/Reporting_Bugs_against_Xen_Project
>>>>>
>>>>> and
>>>>>
>>>>> https://wiki.xenproject.org/wiki/Xen_Serial_Console
>>>>>
>>>>> or alternatively at least add "vga=keep".
>>>>>
>>>>> pjwelsh
>>>>
>>>>
>>>> _______________________________________________
>>>> CentOS-virt mailing list
>>>> CentOS-virt at centos.org
>>>> https://lists.centos.org/mailman/listinfo/centos-virt
>>>>