[CentOS-virt] Xen PV DomU running Kernel 4.14.5-1.el7.elrepo.x86_64: xl -v vcpu-set <domU> <val> triggers domU kernel WARNING, then domU becomes unresponsive

On 20-12-2017 9:17, Johnny Hughes wrote:
[...]
> 
> OK .. I have built and pushed to the testing tag the following dom0 
> kernels:
> 
> kernel-4.9.70-29.el7
> kernel-4.9.70-29.el6
> 
> they will show up in a couple hours here:
> 
> https://buildlogs.centos.org/centos/6/virt/x86_64/xen/
> 
> https://buildlogs.centos.org/centos/7/virt/x86_64/xen/

Johnny, thanks for that. Unfortunately the bug is still there. Only 
tested it on C7 for both dom0 and domU.

dom0 xl info:
release                : 4.9.70-29.el7.x86_64
version                : #1 SMP Tue Dec 19 15:25:38 UTC 2017
machine                : x86_64
nr_cpus                : 8
max_cpu_id             : 7
nr_nodes               : 1
cores_per_socket       : 4
threads_per_core       : 2
cpu_mhz                : 3606
hw_caps                : 
bfebfbff:2c100800:00000000:00007f00:77fafbff:00000000:00000121:029c6fbf
virt_caps              : hvm
total_memory           : 16275
free_memory            : 13020
sharing_freed_memory   : 0
sharing_used_memory    : 0
outstanding_claims     : 0
free_cpus              : 0
xen_major              : 4
xen_minor              : 6
xen_extra              : .6-8.el7
xen_version            : 4.6.6-8.el7
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 
hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler          : credit
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : Tue Dec 12 11:42:15 2017 +0000 
git:07e9f39-dirty
xen_commandline        : placeholder dom0_mem=1024M,max:1024M 
dom0_max_vcpus=1 dom0_vcpus_pin cpuinfo com1=115200,8n1 console=com1,tty 
loglvl=all guest_loglvl=all
cc_compiler            : gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
cc_compile_by          : mockbuild
cc_compile_domain      : centos.org
cc_compile_date        : Tue Dec 12 12:15:46 UTC 2017
xend_config_format     : 4

domU kernel 4.14.7-1.el7.elrepo.x86_64

dom0 stacktrace:

[   14.881838] ip_set: protocol 6
[  281.806649] installing Xen timer for CPU 2
[  281.807377] cpu 2 spinlock event irq 25
[  281.812667] installing Xen timer for CPU 3
[  281.813383] cpu 3 spinlock event irq 33
[  287.710812] ------------[ cut here ]------------
[  287.710842] WARNING: CPU: 2 PID: 35 at block/blk-mq.c:1144 
__blk_mq_run_hw_queue+0x89/0xa0
[  287.710853] Modules linked in: ip_set_hash_ip ip_set nfnetlink 
x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel pcbc aesni_intel crypto_simd glue_helper cryptd 
intel_rapl_perf pcspkr nfsd auth_rpcgss nfs_acl lockd grace sunrpc 
ip_tables ext4 mbcache jbd2 xen_netfront xen_blkfront crc32c_intel
[  287.710913] CPU: 2 PID: 35 Comm: kworker/2:1H Not tainted 
4.14.7-1.el7.elrepo.x86_64 #1
[  287.710927] Workqueue: kblockd blk_mq_run_work_fn
[  287.710936] task: ffff88007c6a0000 task.stack: ffffc90040474000
[  287.710948] RIP: e030:__blk_mq_run_hw_queue+0x89/0xa0
[  287.710956] RSP: e02b:ffffc90040477e30 EFLAGS: 00010202
[  287.710968] RAX: 0000000000000001 RBX: ffff880003aa5400 RCX: 
ffff88007d11bca0
[  287.710977] RDX: ffff88007c656d98 RSI: 00000000000000a0 RDI: 
ffff880003aa5400
[  287.710986] RBP: ffffc90040477e48 R08: 0000000000000000 R09: 
0000000000000000
[  287.710996] R10: 0000000000007ff0 R11: 000000000000018e R12: 
ffff88007c570000
[  287.711005] R13: ffff88007d11bc80 R14: ffff88007d121b00 R15: 
ffff880003aa5448
[  287.711025] FS:  0000000000000000(0000) GS:ffff88007d100000(0000) 
knlGS:ffff88007d100000
[  287.711036] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[  287.711045] CR2: 00007f8750908000 CR3: 00000000796a4000 CR4: 
0000000000042660
[  287.711057] Call Trace:
[  287.711071]  blk_mq_run_work_fn+0x2c/0x30
[  287.711086]  process_one_work+0x149/0x360
[  287.711098]  worker_thread+0x4d/0x3e0
[  287.711108]  kthread+0x109/0x140
[  287.711119]  ? rescuer_thread+0x380/0x380
[  287.711128]  ? kthread_park+0x60/0x60
[  287.711140]  ret_from_fork+0x25/0x30
[  287.711148] Code: 00 e8 4c e8 45 00 4c 89 e7 e8 34 4a d7 ff 48 89 df 
41 89 c5 e8 19 66 00 00 44 89 ee 4c 89 e7 e8 4e 4a d7 ff 5b 41 5c 41 5d 
5d c3 <0f> ff eb b4 48 89 df e8 fb 65 00 00 5b 41 5c 41 5d 5d c3 0f ff
[  287.711235] ---[ end trace 7b31b11d076677d1 ]---

However I've just found domU is recoverable by reverting to the previous 
number of vcpus when it's blocked.

---
Adi Pircalabu

> 
> 
> 
>> On 12/11/2017 06:52 PM, Adi Pircalabu wrote:
>>> Has anyone seen this recently? I couldn't replicate it on:
>>> - CentOS 6 running kernel-2.6.32-696.16.1.el6.x86_64,
>>> kernel-lt-4.4.105-1.el6.elrepo.x86_64
>>> - CentOS 7 running 4.9.67-1.el7.centos.x86_64
>>> 
>>> But I can replicate it consistently running "xl -v vcpu-set <domU>
>>> <val>" on:
>>> - CentOS 6 running 4.14.5-1.el6.elrepo.x86_64
>>> - CentOS 7 running 4.14.5-1.el7.elrepo.x86_64
>>> 
>>> dom0 versions tested with similar results in the domU:
>>> - 4.6.6-6.el7 on kernel 4.9.63-29.el7.x86_64
>>> - 4.6.3-15.el6 on kernel 4.9.37-29.el6.x86_64
>>> 
>>> Noticed behaviour:
>>> - These commands stall:
>>> top
>>> ls -l /var/tmp
>>> ls -l /tmp
>>> - Stuck in D state on the CentOS 7 domU:
>>> root         5  0.0  0.0      0     0 ?        D    11:20   0:00
>>> [kworker/u8:0]
>>> root       316  0.0  0.0      0     0 ?        D    11:20   0:00
>>> [jbd2/xvda1-8]
>>> root      1145  0.0  0.2 116636  4776 ?        Ds   11:20   0:00 
>>> -bash
>>> root      1289  0.0  0.1  25852  2420 ?        Ds   11:35   0:00
>>> /usr/bin/systemd-tmpfiles --clean
>>> root      1290  0.0  0.1 125248  2696 pts/1    D+   11:44   0:00 ls
>>> --color=auto -l /tmp/
>>> root      1293  0.0  0.1 125248  2568 pts/2    D+   11:44   0:00 ls
>>> --color=auto -l /var/tmp
>>> root      1296  0.0  0.2 116636  4908 pts/3    Ds+  11:44   0:00 
>>> -bash
>>> root      1358  0.0  0.1 125248  2612 pts/4    D+   11:47   0:00 ls
>>> --color=auto -l /var/tmp
>>> 
>>> At a first glance it appears the issue is in 4.14.5 kernel. Stack 
>>> traces
>>> follow:
>>> 
>>> -----CentOS 6 kernel-ml-4.14.5-1.el6.elrepo.x86_64 start here-----
>>> ------------[ cut here ]------------
>>> WARNING: CPU: 4 PID: 60 at block/blk-mq.c:1144
>>> __blk_mq_run_hw_queue+0x9e/0xc0
>>> Modules linked in: intel_cstate(-) ipt_REJECT nf_reject_ipv4
>>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_multiport iptable_filter 
>>> ip_tables
>>> ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state
>>> nf_conntrack libcrc32c ip6table_filter ip6_tables dm_mod dax
>>> xen_netfront crc32_pclmul crct10dif_pclmul ghash_clmulni_intel
>>> crc32c_intel pcbc aesni_intel glue_helper crypto_simd cryptd 
>>> aes_x86_64
>>> coretemp hwmon x86_pkg_temp_thermal sb_edac intel_rapl_perf pcspkr 
>>> ext4
>>> jbd2 mbcache xen_blkfront
>>> CPU: 4 PID: 60 Comm: kworker/4:1H Not tainted 
>>> 4.14.5-1.el6.elrepo.x86_64 #1
>>> Workqueue: kblockd blk_mq_run_work_fn
>>> task: ffff8802711a2780 task.stack: ffffc90041af4000
>>> RIP: e030:__blk_mq_run_hw_queue+0x9e/0xc0
>>> RSP: e02b:ffffc90041af7c48 EFLAGS: 00010202
>>> RAX: 0000000000000001 RBX: ffff88027117fa80 RCX: 0000000000000001
>>> RDX: ffff88026b053ee0 RSI: ffff88027351bca0 RDI: ffff88026b072800
>>> RBP: ffffc90041af7c68 R08: ffffc90041af7eb8 R09: ffff8802711a2810
>>> R10: 0000000000007ff0 R11: 0000000000000001 R12: ffff88026b072800
>>> R13: ffffe8ffffd04d00 R14: 0000000000000000 R15: ffffe8ffffd04d05
>>> FS:  00002b7b7c89b700(0000) GS:ffff880273500000(0000)
>>> knlGS:0000000000000000
>>> CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: ffffffffff600400 CR3: 000000026d953000 CR4: 0000000000042660
>>> Call Trace:
>>>  blk_mq_run_work_fn+0x31/0x40
>>>  process_one_work+0x174/0x440
>>>  ? xen_mc_flush+0xad/0x1b0
>>>  ? schedule+0x3a/0xa0
>>>  worker_thread+0x6b/0x410
>>>  ? default_wake_function+0x12/0x20
>>>  ? __wake_up_common+0x84/0x130
>>>  ? maybe_create_worker+0x120/0x120
>>>  ? schedule+0x3a/0xa0
>>>  ? _raw_spin_unlock_irqrestore+0x16/0x20
>>>  ? maybe_create_worker+0x120/0x120
>>>  kthread+0x111/0x150
>>>  ? __kthread_init_worker+0x40/0x40
>>>  ret_from_fork+0x25/0x30
>>> Code: 89 df e8 06 2f d9 ff 4c 89 e7 41 89 c5 e8 0b 6e 00 00 44 89 ee 
>>> 48
>>> 89 df e8 20 2f d9 ff 48 8b 5d e8 4c 8b 65 f0 4c 8b 6d f8 c9 c3 <0f> 
>>> ff
>>> eb aa 4c 89 e7 e8 e6 6d 00 00 48 8b 5d e8 4c 8b 65 f0 4c
>>> ---[ end trace fe2aaf4e723042fd ]---
>>> -----CentOS 6 kernel-ml-4.14.5-1.el6.elrepo.x86_64 end here-----
>>> 
>>> -----CentOS 7 kernel-ml-4.14.5-1.el7.elrepo.x86_64 start here-----
>>> [  116.528885] ------------[ cut here ]------------
>>> [  116.528894] WARNING: CPU: 3 PID: 38 at block/blk-mq.c:1144
>>> __blk_mq_run_hw_queue+0x89/0xa0
>>> [  116.528898] Modules linked in: intel_cstate(-) ip_set_hash_ip 
>>> ip_set
>>> nfnetlink x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul
>>> ghash_clmulni_intel pcbc aesni_intel crypto_simd glue_helper cryptd
>>> intel_rapl_perf pcspkr nfsd auth_rpcgss nfs_acl lockd grace sunrpc
>>> ip_tables ext4 mbcache jbd2 xen_netfront xen_blkfront crc32c_intel
>>> [  116.528919] CPU: 3 PID: 38 Comm: kworker/3:1H Not tainted
>>> 4.14.5-1.el7.elrepo.x86_64 #1
>>> [  116.529007] Code: 00 e8 7c c5 45 00 4c 89 e7 e8 14 4b d7 ff 48 89 
>>> df
>>> 41 89 c5 e8 19 66 00 00 44 89 ee 4c 89 e7 e8 2e 4b d7 ff 5b 41 5c 41 
>>> 5d
>>> 5d c3 <0f> ff eb b4 48 89 df e8 fb 65 00 00 5b 41 5c 41 5d 5d c3 0f 
>>> ff
>>> [  116.529034] ---[ end trace a7814e3ec9a330c6 ]---
>>> [  147.424117] ------------[ cut here ]------------
>>> [  147.424150] WARNING: CPU: 2 PID: 24 at block/blk-mq.c:1144
>>> __blk_mq_run_hw_queue+0x89/0xa0
>>> [  147.424160] Modules linked in: ip_set_hash_ip ip_set nfnetlink
>>> x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul
>>> ghash_clmulni_intel pcbc aesni_intel crypto_simd glue_helper cryptd
>>> intel_rapl_perf pcspkr nfsd auth_rpcgss nfs_acl lockd grace sunrpc
>>> ip_tables ext4 mbcache jbd2 xen_netfront xen_blkfront crc32c_intel
>>> [  147.424222] CPU: 2 PID: 24 Comm: kworker/2:0H Tainted: G        
>>> W    
>>>   4.14.5-1.el7.elrepo.x86_64 #1
>>> [  147.424238] Workqueue: kblockd blk_mq_run_work_fn
>>> [  147.424247] task: ffff88007c539840 task.stack: ffffc900403e4000
>>> [  147.424259] RIP: e030:__blk_mq_run_hw_queue+0x89/0xa0
>>> [  147.424270] RSP: e02b:ffffc900403e7e30 EFLAGS: 00010202
>>> [  147.424279] RAX: 0000000000000001 RBX: ffff880003b83800 RCX:
>>> ffff88007d11bca0
>>> [  147.424288] RDX: ffff88007c656c88 RSI: 00000000000000a0 RDI:
>>> ffff880003b83800
>>> [  147.424298] RBP: ffffc900403e7e48 R08: 0000000000000000 R09:
>>> 0000000000000000
>>> [  147.424309] R10: 0000000000007ff0 R11: 00000000000074e5 R12:
>>> ffff88007c436900
>>> [  147.424319] R13: ffff88007d11bc80 R14: ffff88007d121b00 R15:
>>> ffff880003b83848
>>> [  147.424340] FS:  0000000000000000(0000) GS:ffff88007d100000(0000)
>>> knlGS:ffff88007d100000
>>> [  147.424350] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  147.424359] CR2: 00007f504f19a700 CR3: 0000000079bed000 CR4:
>>> 0000000000042660
>>> [  147.424370] Call Trace:
>>> [  147.424384]  blk_mq_run_work_fn+0x2c/0x30
>>> [  147.424400]  process_one_work+0x149/0x360
>>> [  147.424411]  worker_thread+0x4d/0x3e0
>>> [  147.424421]  kthread+0x109/0x140
>>> [  147.424432]  ? rescuer_thread+0x380/0x380
>>> [  147.424441]  ? kthread_park+0x60/0x60
>>> [  147.424455]  ret_from_fork+0x25/0x30
>>> [  147.424463] Code: 00 e8 7c c5 45 00 4c 89 e7 e8 14 4b d7 ff 48 89 
>>> df
>>> 41 89 c5 e8 19 66 00 00 44 89 ee 4c 89 e7 e8 2e 4b d7 ff 5b 41 5c 41 
>>> 5d
>>> 5d c3 <0f> ff eb b4 48 89 df e8 fb 65 00 00 5b 41 5c 41 5d 5d c3 0f 
>>> ff
>>> [  147.424554] ---[ end trace a7814e3ec9a330c7 ]---
>>> -----CentOS 7 kernel-ml-4.14.5-1.el7.elrepo.x86_64 end here-----
>>> 
>> 
>> 
>> 
>> 
>> _______________________________________________
>> CentOS-virt mailing list
>> CentOS-virt at centos.org
>> https://lists.centos.org/mailman/listinfo/centos-virt
>> 
> 
> 
> 
> _______________________________________________
> CentOS-virt mailing list
> CentOS-virt at centos.org
> https://lists.centos.org/mailman/listinfo/centos-virt