[CentOS-virt] Xen PV DomU running Kernel 4.14.5-1.el7.elrepo.x86_64: xl -v vcpu-set <domU> <val> triggers domU kernel WARNING, then domU becomes unresponsive

Thu Dec 14 22:12:27 UTC 2017
Adi Pircalabu <adi at ddns.com.au>

On 15-12-2017 9:01, Adi Pircalabu wrote:
> On 15-12-2017 4:10, Akemi Yagi wrote:
>> On Mon, Dec 11, 2017 at 4:52 PM, Adi Pircalabu <adi at ddns.com.au>
>> wrote:
>> 
>>> Has anyone seen this recently? I couldn't replicate it on:
>>> - CentOS 6 running kernel-2.6.32-696.16.1.el6.x86_64,
>>> kernel-lt-4.4.105-1.el6.elrepo.x86_64
>>> - CentOS 7 running 4.9.67-1.el7.centos.x86_64
>>> 
>>> But I can replicate it consistently running "xl -v vcpu-set <domU>
>>> <val>" on:
>>> - CentOS 6 running 4.14.5-1.el6.elrepo.x86_64
>>> - CentOS 7 running 4.14.5-1.el7.elrepo.x86_64
>>> 
>>> dom0 versions tested with similar results in the domU:
>>> - 4.6.6-6.el7 on kernel 4.9.63-29.el7.x86_64
>>> - 4.6.3-15.el6 on kernel 4.9.37-29.el6.x86_64
>>> 
>>> Noticed behaviour:
>>> - These commands stall:
>>> top
>>> ls -l /var/tmp
>>> ls -l /tmp
>>> - Stuck in D state on the CentOS 7 domU:
>>> root         5  0.0  0.0      0     0 ?        D    11:20   0:00 [kworker/u8:0]
>>> root       316  0.0  0.0      0     0 ?        D    11:20   0:00 [jbd2/xvda1-8]
>>> root      1145  0.0  0.2 116636  4776 ?        Ds   11:20   0:00 -bash
>>> root      1289  0.0  0.1  25852  2420 ?        Ds   11:35   0:00 /usr/bin/systemd-tmpfiles --clean
>>> root      1290  0.0  0.1 125248  2696 pts/1    D+   11:44   0:00 ls --color=auto -l /tmp/
>>> root      1293  0.0  0.1 125248  2568 pts/2    D+   11:44   0:00 ls --color=auto -l /var/tmp
>>> root      1296  0.0  0.2 116636  4908 pts/3    Ds+  11:44   0:00 -bash
>>> root      1358  0.0  0.1 125248  2612 pts/4    D+   11:47   0:00 ls --color=auto -l /var/tmp
>>> 
>>> At first glance, the issue appears to be in the 4.14.5 kernel. Stack
>>> traces follow:
>>> 
>>> Adi Pircalabu
>> 
>> Can you test-install 4.15-rcX to see if the problem persists in the
>> latest kernel?
>>
>> http://elrepo.org/people/ajb/devel/kernel-ml/el7/x86_64/RPMS/
>> 
>> Akemi
> 
> Thanks for that. I tested it on both CentOS 6 and CentOS 7 PV domUs and
> got similar panics:
> 
> -----CentOS 6-----
> [...]
> dracut: Switching root
> 		Welcome to CentOS
> Starting udev: udev: starting version 147
> input: PC Speaker as /devices/platform/pcspkr/input/input0
> xen_netfront: Initialising Xen virtual ethernet driver
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
> IP: coretemp_cpu_online+0x116/0x190 [coretemp]
> PGD 7b5c7067 P4D 7b5c7067 PUD 7b5cd067 PMD 0
> Oops: 0002 [#1] SMP
> Modules linked in: coretemp(+) hwmon xen_netfront pcspkr ext4 jbd2
> mbcache xen_blkfront dm_mirror dm_region_hash dm_log dm_mod dax
> CPU: 0 PID: 12 Comm: cpuhp/0 Not tainted 4.15.0-0.rc1.el6.elrepo.x86_64 #1
> task: ffff88007c8f43c0 task.stack: ffffc90040390000
> RIP: e030:coretemp_cpu_online+0x116/0x190 [coretemp]
> RSP: e02b:ffffc90040393cd8 EFLAGS: 00010246
> RAX: 0000000000000010 RBX: 0000000000000000 RCX: ffff88007c87c248
> RDX: 0000000000000000 RSI: ffff880077720c28 RDI: ffff8800069ea020
> RBP: ffffc90040393d18 R08: 0000000000000000 R09: ffffc90040393a08
> R10: 0000000000000000 R11: 000000000000005f R12: 0000000000000000
> R13: ffff8800069ea000 R14: ffff88007f60a040 R15: 0000000000000000
> FS:  00007f685ca0a700(0000) GS:ffff88007f600000(0000) knlGS:0000000000000000
> CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000010 CR3: 000000000683a000 CR4: 0000000000042660
> Call Trace:
>  ? coretemp_add_core+0x50/0x50 [coretemp]
>  cpuhp_invoke_callback+0xe9/0x700
>  ? put_prev_task_fair+0x26/0x40
>  ? __schedule+0x2d0/0x6e0
>  ? __wake_up_common+0x84/0x130
>  ? __wake_up_common+0x84/0x130
>  cpuhp_thread_fun+0xee/0x170
>  smpboot_thread_fn+0x10c/0x160
>  ? smpboot_create_threads+0x80/0x80
>  kthread+0x10a/0x140
>  ? kthread_probe_data+0x40/0x40
>  ret_from_fork+0x1f/0x30
> Code: 11 15 41 e1 49 89 c5 b8 f4 ff ff ff 4d 85 ed 0f 84 66 ff ff ff
> 4c 89 ef e8 88 11 41 e1 85 c0 75 6e 48 8b 05 75 17 00 00 4d 63 ff <4e>
> 89 2c f8 49 81 fd 00 f0 ff ff 44 89 e8 0f 87 3c ff ff ff 49
> RIP: coretemp_cpu_online+0x116/0x190 [coretemp] RSP: ffffc90040393cd8
> CR2: 0000000000000010
> ---[ end trace 8253bafacf228cf2 ]---
> -----CentOS 6-----
> -----CentOS 7-----
> [...]
> [  OK  ] Found device /dev/xvda2.
>          Activating swap /dev/xvda2...
> [    4.998940] alg: No test for pcbc(aes) (pcbc-aes-aesni)
> [    5.001054] Adding 1048572k swap on /dev/xvda2.  Priority:-2 extents:1 across:1048572k SSFS
> [  OK  ] Activated swap /dev/xvda2.
> [  OK  ] Reached target Swap.
> [    5.020760] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
> [    5.020767] IP: coretemp_cpu_online+0xf8/0x1f7 [coretemp]
> [    5.020769] PGD 0 P4D 0
> [    5.020771] Oops: 0002 [#1] SMP
> [    5.020773] Modules linked in: coretemp(+) crct10dif_pclmul
> crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd
> glue_helper cryptd pcspkr intel_rapl_perf nfsd auth_rpcgss nfs_acl
> lockd grace sunrpc ip_tables ext4 mbcache jbd2 xen_netfront
> xen_blkfront crc32c_intel
> [    5.020786] CPU: 0 PID: 12 Comm: cpuhp/0 Not tainted 4.15.0-0.rc3.el7.elrepo.x86_64 #1
> [    5.020789] RIP: e030:coretemp_cpu_online+0xf8/0x1f7 [coretemp]
> [    5.020790] RSP: e02b:ffffc90040387e10 EFLAGS: 00010246
> [    5.020793] RAX: 0000000000000010 RBX: ffff8800040d8800 RCX: 0000000000000000
> [    5.020794] RDX: ffff880079761e70 RSI: ffff88007c438cc8 RDI: ffff8800040d8820
> [    5.020796] RBP: ffffc90040387e40 R08: 0000000000000000 R09: ffffffff81f8aff0
> [    5.020798] R10: ffff88007d01f400 R11: 0000000000000000 R12: 0000000000000000
> [    5.020800] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88007d00a020
> [    5.020804] FS:  0000000000000000(0000) GS:ffff88007d000000(0000) knlGS:0000000000000000
> [    5.020806] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    5.020808] CR2: 0000000000000010 CR3: 0000000004731000 CR4: 0000000000042660
> [    5.020810] Call Trace:
> [    5.020814]  ? create_core_data+0x5f0/0x5f0 [coretemp]
> [    5.020817]  cpuhp_invoke_callback+0xae/0x5c0
> [    5.020820]  ? __schedule+0x295/0x880
> [    5.020823]  cpuhp_thread_fun+0xce/0x170
> [    5.020825]  smpboot_thread_fn+0x110/0x160
> [    5.020827]  kthread+0x102/0x140
> [    5.020828]  ? sort_range+0x30/0x30
> [    5.020831]  ? kthread_associate_blkcg+0xa0/0xa0
> [    5.020833]  ret_from_fork+0x1f/0x30
> [    5.020834] Code: 21 a0 41 0f b7 f6 e8 38 73 30 e1 48 89 c3 b8 f4
> ff ff ff 48 85 db 74 c0 48 89 df e8 d3 69 30 e1 85 c0 75 7c 48 8b 05
> 40 18 00 00 <4a> 89 1c f0 48 81 fb 00 f0 ff ff 0f 87 e7 00 00 00 49 8b
> 47 4c
> [    5.020852] RIP: coretemp_cpu_online+0xf8/0x1f7 [coretemp] RSP: ffffc90040387e10
> [    5.020854] CR2: 0000000000000010
> [    5.020856] ---[ end trace 9ce91afe6b362317 ]---
> [    5.020858] Kernel panic - not syncing: Fatal exception
> [    5.020861] Kernel Offset: disabled
> -----CentOS 7-----
> 
> For CentOS 7 I've also tried booting it using vcpus = 1 and no
> maxvcpus with the same outcome. Looks like 4.15-rc is a no go for me
> :)

We might be getting a bit sidetracked here, but I also noticed this boot
warning on CentOS 6 with 4.14.5-1.el6.elrepo.x86_64:

xen_netfront: Initialising Xen virtual ethernet driver
sysfs: cannot create duplicate filename '/devices/platform/coretemp.0/hwmon/hwmon0/temp2_label'
------------[ cut here ]------------
WARNING: CPU: 1 PID: 13 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x63/0x80
Modules linked in: coretemp(+) hwmon xen_netfront pcspkr ext4 jbd2 
mbcache xen_blkfront dm_mirror dm_region_hash dm_log dm_mod dax
CPU: 1 PID: 13 Comm: cpuhp/1 Not tainted 4.14.5-1.el6.elrepo.x86_64 #1
task: ffff88007c8f8400 task.stack: ffffc90040398000
RIP: e030:sysfs_warn_dup+0x63/0x80
RSP: e02b:ffffc9004039bb18 EFLAGS: 00010296
RAX: 000000000000005f RBX: ffff88000683f000 RCX: ffffffff81e5f308
RDX: 0000000000000001 RSI: 00000000000000e2 RDI: 0000000000000200
RBP: ffffc9004039bb38 R08: 0000000000000000 R09: 0720072007200720
R10: 0720072007200720 R11: 0000000000000000 R12: ffff880006aa8cf8
R13: ffff880006b27180 R14: 0000000000000000 R15: ffff880006b27180
FS:  00007f9f1475d700(0000) GS:ffff88007f680000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000001381000 CR3: 000000007caa3000 CR4: 0000000000042660
Call Trace:
  sysfs_add_file_mode_ns+0x150/0x190
  ? smp_call_function_single+0xc3/0x100
  create_files+0x7e/0x1c0
  internal_create_group+0x8e/0x130
  sysfs_create_group+0x13/0x20
  create_core_data+0x29d/0x390 [coretemp]
  ? coretemp_add_core+0x50/0x50 [coretemp]
  coretemp_add_core+0x20/0x50 [coretemp]
  coretemp_cpu_online+0xce/0x190 [coretemp]
  ? coretemp_add_core+0x50/0x50 [coretemp]
  cpuhp_invoke_callback+0xe9/0x700
  ? __schedule+0x2cd/0x6e0
  ? __wake_up_common+0x84/0x130
  ? __wake_up_common+0x84/0x130
  cpuhp_thread_fun+0xeb/0x170
  smpboot_thread_fn+0x10c/0x160
  ? smpboot_create_threads+0x80/0x80
  kthread+0x111/0x150
  ? __kthread_init_worker+0x40/0x40
  ret_from_fork+0x25/0x30
Code: 48 89 c3 74 12 b9 00 10 00 00 48 89 c2 31 f6 4c 89 ef e8 61 d1 ff 
ff 4c 89 e2 48 89 de 48 c7 c7 b8 4c cb 81 31 c0 e8 31 07 e1 ff <0f> ff 
48 89 df e8 63 c4 f4 ff 48 8b 5d e8 4c 8b 65 f0 4c 8b 6d
---[ end trace af2e9ac25142c3a0 ]---
coretemp coretemp.0: Adding Core 1 failed
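
For anyone comparing the traces in this thread, here is a minimal Python
sketch, assuming the domU console output has been saved as plain text. It
pulls the faulting symbol out of the "IP:" / "RIP:" lines and collects the
modules named in stack frames. The function name summarize_oops and both
regular expressions are illustrative assumptions, not part of any kernel or
Xen tooling, and they only cover the line shapes seen in the logs above.

```python
import re

# Faulting-location lines as printed in the traces above, for example:
#   "IP: coretemp_cpu_online+0x116/0x190 [coretemp]"
#   "RIP: e030:sysfs_warn_dup+0x63/0x80"
# An optional "> " quote prefix and "[    5.020767] " timestamp are tolerated.
FAULT_RE = re.compile(
    r"^[ \t]*(?:>[ \t]*)*(?:\[[ \t]*\d+\.\d+\][ \t]*)?R?IP:[ \t]+"
    r"(?:[0-9a-f]{4}:)?(?P<symbol>[A-Za-z_]\w*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:[ \t]+\[(?P<module>\w+)\])?",
    re.MULTILINE,
)

# Any "symbol+0xoff/0xlen [module]" frame, wherever it appears in the text.
FRAME_MODULE_RE = re.compile(
    r"[A-Za-z_]\w*\+0x[0-9a-f]+/0x[0-9a-f]+[ \t]+\[(?P<module>\w+)\]"
)


def summarize_oops(text):
    """Return the fault sites and the modules named in stack frames."""
    fault_sites = []
    for match in FAULT_RE.finditer(text):
        site = (match.group("symbol"), match.group("module"))
        if site not in fault_sites:  # the RIP line is usually printed twice
            fault_sites.append(site)
    modules = sorted(
        {match.group("module") for match in FRAME_MODULE_RE.finditer(text)}
    )
    return {"fault_sites": fault_sites, "modules": modules}


if __name__ == "__main__":
    # A few lines copied from the 4.14.5 warning above.
    sample = (
        "RIP: e030:sysfs_warn_dup+0x63/0x80\n"
        "Call Trace:\n"
        "  sysfs_add_file_mode_ns+0x150/0x190\n"
        "  create_core_data+0x29d/0x390 [coretemp]\n"
        "  coretemp_cpu_online+0xce/0x190 [coretemp]\n"
        "  cpuhp_invoke_callback+0xe9/0x700\n"
    )
    print(summarize_oops(sample))
```

The point of separating fault sites from frame modules is that the 4.14.5
warning fires in sysfs_warn_dup, which is not in a module, while the frames
underneath it still name coretemp.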


---
Adi Pircalabu