I've been running 3.18.25-18.el6.x86_64 + our build of xen 4.4.3-9 on one host for the last couple of weeks and have gotten several soft lockups within the last 24 hours. I am posting here first in case anyone else has experienced the same issue.
Here is the first instance:
sched: RT throttling activated NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0] Modules linked in: ebt_arp xen_pciback xen_gntalloc ebt_ip ebtable_filter ebtables ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables xt_physdev br_netfilter bridge stp llc ip6t_REJECT nf_reject_ipv6 nf_c onntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 xen_acpi_processor blktap xen_netback xen_blkback xen_gntdev xen_evtchn xenfs xen_privcmd joydev sg 8250_fintek serio_raw gpio_ich iTCO_wdt iTCO_vendor_su pport coretemp intel_powerclamp crct10dif_pclmul crc32_pclmul crc32c_intel pcspkr i2c_i801 lpc_ich igb ptp pps_core hwmon ioatdma dca i7core_edac edac_core shpchp ext3 jbd mbcache raid10 raid1 sd_mod mptsas mptscsih mptbase scsi_transpor t_sas aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 ahci libahci mgag200 ttm drm_kms_helper dm_mirror dm_region_hash dm_log dm_mod CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.18.25-18.el6.x86_64 #1 Hardware name: Supermicro X8DTN+-F/X8DTN+-F, BIOS 080016 08/03/2011 task: ffffffff81c1b4c0 ti: ffffffff81c00000 task.ti: ffffffff81c00000 RIP: e030:[<ffffffffa02811d5>] [<ffffffffa02811d5>] xenvif_tx_build_gops+0xa5/0x890 [xen_netback] RSP: e02b:ffff88013f403ca8 EFLAGS: 00000206 RAX: 000000000000003c RBX: ffffc90012c28000 RCX: ffffc90012c280d0 RDX: 0000000000071ea1 RSI: 0000000000000040 RDI: ffffc90012c28000 RBP: ffff88013f403e38 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000071e65 R13: ffff88013f403e50 R14: 000000000000003c R15: 0000000000000032 FS: 00007fe942ac7980(0000) GS:ffff88013f400000(0000) knlGS:0000000000000000 CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: ffff8000007f6800 CR3: 00000000bcd4c000 CR4: 0000000000002660 Stack: ffff88013f403d30 ffff880006de6800 ffff8800adfee1c0 ffff88013f403e54 000000015786e75a 00000040a0351878 ffffc90012c2def0 000000015786e73c ffffc90012c280d0 ffffc90012c2def0 ffffffffa03516c0 ffff8800bf29e000 Call Trace: <IRQ> [<ffffffffa03516c0>] ? br_handle_frame_finish+0x3f0/0x3f0 [bridge] [<ffffffff815a2b9e>] ? __netif_receive_skb_core+0x1ee/0x640 [<ffffffff815a3017>] ? __netif_receive_skb+0x27/0x70 [<ffffffff815a326d>] ? netif_receive_skb_internal+0x2d/0x90 [<ffffffffa01d5053>] ? igb_alloc_rx_buffers+0x63/0xe0 [igb] [<ffffffffa0281a0d>] xenvif_tx_action+0x4d/0xa0 [xen_netback] [<ffffffffa02843b5>] xenvif_poll+0x35/0x68 [xen_netback] [<ffffffff815a39c2>] net_rx_action+0x112/0x2a0 [<ffffffff81076b7c>] __do_softirq+0xfc/0x2b0 [<ffffffff81076e3d>] irq_exit+0xbd/0xd0 [<ffffffff813b2cbc>] xen_evtchn_do_upcall+0x3c/0x50 [<ffffffff8167659e>] xen_do_hypervisor_callback+0x1e/0x40 <EOI> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20 [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20 [<ffffffff8100a830>] ? xen_safe_halt+0x10/0x20 [<ffffffff8101ec84>] ? default_idle+0x24/0xc0 [<ffffffff8101e28f>] ? arch_cpu_idle+0xf/0x20 [<ffffffff810b2276>] ? cpuidle_idle_call+0xd6/0x1d0 [<ffffffff81091312>] ? __atomic_notifier_call_chain+0x12/0x20 [<ffffffff810b24a5>] ? cpu_idle_loop+0x135/0x1e0 [<ffffffff810b256b>] ? cpu_startup_entry+0x1b/0x70 [<ffffffff810b25b0>] ? cpu_startup_entry+0x60/0x70 [<ffffffff81667c57>] ? rest_init+0x77/0x80 [<ffffffff81d774c9>] ? start_kernel+0x441/0x448 [<ffffffff81d76ea6>] ? set_init_arg+0x5d/0x5d [<ffffffff81d76603>] ? x86_64_start_reservations+0x2a/0x2c [<ffffffff81d7aae1>] ? xen_start_kernel+0x5ef/0x5f1 Code: 00 0f 87 06 07 00 00 44 8b b3 b8 00 00 00 44 03 b3 c0 00 00 00 45 29 e6 41 39 c6 44 0f 47 f0 45 85 f6 0f 84 8f 00 00 00 0f ae e8 <8b> 83 c0 00 00 00 83 e8 01 44 21 e0 48 8d 04 40 48 c1 e0 02 48
Of the remaining lockups, here is the common backtrace with the exception that there have been two instances of RIP being in net_rx_action:
[<ffffffff815a39c2>] net_rx_action+0x112/0x2a0 [<ffffffff81076b7c>] __do_softirq+0xfc/0x2b0 [<ffffffff81076e3d>] irq_exit+0xbd/0xd0 [<ffffffff813b2cbc>] xen_evtchn_do_upcall+0x3c/0x50 [<ffffffff8167659e>] xen_do_hypervisor_callback+0x1e/0x40 <EOI> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20 [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20 [<ffffffff8100a830>] ? xen_safe_halt+0x10/0x20 [<ffffffff8101ec84>] ? default_idle+0x24/0xc0 [<ffffffff8101e28f>] ? arch_cpu_idle+0xf/0x20 [<ffffffff810b2276>] ? cpuidle_idle_call+0xd6/0x1d0 [<ffffffff81091312>] ? __atomic_notifier_call_chain+0x12/0x20 [<ffffffff810b24a5>] ? cpu_idle_loop+0x135/0x1e0 [<ffffffff810b256b>] ? cpu_startup_entry+0x1b/0x70 [<ffffffff810b25b0>] ? cpu_startup_entry+0x60/0x70 [<ffffffff81667c57>] ? rest_init+0x77/0x80 [<ffffffff81d774c9>] ? start_kernel+0x441/0x448 [<ffffffff81d76ea6>] ? set_init_arg+0x5d/0x5d [<ffffffff81d76603>] ? x86_64_start_reservations+0x2a/0x2c [<ffffffff81d7aae1>] ? xen_start_kernel+0x5ef/0x5f1
I can post more complete backtraces if that information would be useful to someone.
On 03/09/2016 08:15 PM, Sarah Newman wrote:
I've been running 3.18.25-18.el6.x86_64 + our build of xen 4.4.3-9 on one host for the last couple of weeks and have gotten several soft lockups within the last 24 hours. I am posting here first in case anyone else has experienced the same issue.
Here is mpstat from around the time of the issue:
0:08:56 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle 10:09:10 PM all 0.00 0.00 66.67 0.00 0.00 33.33 0.00 0.00 0.00 10:09:11 PM all 2.17 0.00 5.43 32.61 0.00 58.70 1.09 0.00 0.00 10:09:12 PM all 0.00 0.00 1.15 0.00 0.00 85.06 0.00 0.00 13.79 10:09:13 PM all 0.00 0.00 1.08 0.00 0.00 83.87 0.00 0.00 15.05 10:09:14 PM all 0.00 0.00 1.10 0.00 0.00 83.52 0.00 0.00 15.38 10:09:15 PM all 1.09 0.00 1.09 0.00 0.00 85.87 0.00 0.00 11.96 10:09:51 PM all 0.00 0.00 1.09 0.00 0.00 84.78 1.09 0.00 13.04 Message from syslogd at Mar 9 22:09:51 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0] 10:10:02 PM all 0.00 0.00 33.33 50.00 0.00 16.67 0.00 0.00 0.00 10:10:03 PM all 3.16 0.00 10.53 8.42 0.00 2.11 1.05 0.00 74.74 10:10:04 PM all 0.00 0.00 3.23 38.71 0.00 1.08 1.08 0.00 55.91 10:10:05 PM all 0.00 0.00 4.30 11.83 0.00 3.23 1.08 0.00 79.57
Typical load:
10:22:15 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle 10:22:16 PM all 0.00 0.00 1.02 0.00 0.00 1.02 0.00 0.00 97.96 10:22:17 PM all 0.00 0.00 0.00 0.00 0.00 0.00 1.04 0.00 98.96 10:22:18 PM all 0.00 0.00 0.00 0.00 0.00 1.01 1.01 0.00 97.98 10:22:19 PM all 0.00 0.00 1.01 0.00 0.00 1.01 0.00 0.00 97.98 10:22:20 PM all 0.00 0.00 0.00 0.00 0.00 0.00 1.02 0.00 98.98 10:22:21 PM all 0.00 0.00 1.02 0.00 0.00 1.02 0.00 0.00 97.96 10:22:22 PM all 0.00 0.00 0.00 0.00 0.00 1.01 1.01 0.00 97.98
I reverted to an older kernel since the older kernel had run for a couple of months without issues.
On 03/10/2016 12:05 AM, Sarah Newman wrote:
On 03/09/2016 08:15 PM, Sarah Newman wrote:
I've been running 3.18.25-18.el6.x86_64 + our build of xen 4.4.3-9 on one host for the last couple of weeks and have gotten several soft lockups within the last 24 hours. I am posting here first in case anyone else has experienced the same issue.
Here is mpstat from around the time of the issue:
0:08:56 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle 10:09:10 PM all 0.00 0.00 66.67 0.00 0.00 33.33 0.00 0.00 0.00 10:09:11 PM all 2.17 0.00 5.43 32.61 0.00 58.70 1.09 0.00 0.00 10:09:12 PM all 0.00 0.00 1.15 0.00 0.00 85.06 0.00 0.00 13.79 10:09:13 PM all 0.00 0.00 1.08 0.00 0.00 83.87 0.00 0.00 15.05 10:09:14 PM all 0.00 0.00 1.10 0.00 0.00 83.52 0.00 0.00 15.38 10:09:15 PM all 1.09 0.00 1.09 0.00 0.00 85.87 0.00 0.00 11.96 10:09:51 PM all 0.00 0.00 1.09 0.00 0.00 84.78 1.09 0.00 13.04 Message from syslogd at Mar 9 22:09:51 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0] 10:10:02 PM all 0.00 0.00 33.33 50.00 0.00 16.67 0.00 0.00 0.00 10:10:03 PM all 3.16 0.00 10.53 8.42 0.00 2.11 1.05 0.00 74.74 10:10:04 PM all 0.00 0.00 3.23 38.71 0.00 1.08 1.08 0.00 55.91 10:10:05 PM all 0.00 0.00 4.30 11.83 0.00 3.23 1.08 0.00 79.57
Typical load:
10:22:15 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle 10:22:16 PM all 0.00 0.00 1.02 0.00 0.00 1.02 0.00 0.00 97.96 10:22:17 PM all 0.00 0.00 0.00 0.00 0.00 0.00 1.04 0.00 98.96 10:22:18 PM all 0.00 0.00 0.00 0.00 0.00 1.01 1.01 0.00 97.98 10:22:19 PM all 0.00 0.00 1.01 0.00 0.00 1.01 0.00 0.00 97.98 10:22:20 PM all 0.00 0.00 0.00 0.00 0.00 0.00 1.02 0.00 98.98 10:22:21 PM all 0.00 0.00 1.02 0.00 0.00 1.02 0.00 0.00 97.96 10:22:22 PM all 0.00 0.00 0.00 0.00 0.00 1.01 1.01 0.00 97.98
I reverted to an older kernel since the older kernel had run for a couple of months without issues.
This did not fix it. I isolated the issue to a vif rate limit of 100Mb/s being applied to one of the guests and am now able to reproduce on a different machine.
I will look into whether this has been fixed already; if so I will submit a pull request for the Xen4CentOS kernel and if not I will take it up with the xen-devel list.
On Sat, Mar 12, 2016 at 11:47 PM, Sarah Newman srn@prgmr.com wrote:
On 03/10/2016 12:05 AM, Sarah Newman wrote:
On 03/09/2016 08:15 PM, Sarah Newman wrote:
I've been running 3.18.25-18.el6.x86_64 + our build of xen 4.4.3-9 on one host for the last couple of weeks and have gotten several soft lockups within the last 24 hours. I am posting here first in case anyone else has experienced the same issue.
Here is mpstat from around the time of the issue:
0:08:56 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle 10:09:10 PM all 0.00 0.00 66.67 0.00 0.00 33.33 0.00 0.00 0.00 10:09:11 PM all 2.17 0.00 5.43 32.61 0.00 58.70 1.09 0.00 0.00 10:09:12 PM all 0.00 0.00 1.15 0.00 0.00 85.06 0.00 0.00 13.79 10:09:13 PM all 0.00 0.00 1.08 0.00 0.00 83.87 0.00 0.00 15.05 10:09:14 PM all 0.00 0.00 1.10 0.00 0.00 83.52 0.00 0.00 15.38 10:09:15 PM all 1.09 0.00 1.09 0.00 0.00 85.87 0.00 0.00 11.96 10:09:51 PM all 0.00 0.00 1.09 0.00 0.00 84.78 1.09 0.00 13.04 Message from syslogd at Mar 9 22:09:51 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0] 10:10:02 PM all 0.00 0.00 33.33 50.00 0.00 16.67 0.00 0.00 0.00 10:10:03 PM all 3.16 0.00 10.53 8.42 0.00 2.11 1.05 0.00 74.74 10:10:04 PM all 0.00 0.00 3.23 38.71 0.00 1.08 1.08 0.00 55.91 10:10:05 PM all 0.00 0.00 4.30 11.83 0.00 3.23 1.08 0.00 79.57
Typical load:
10:22:15 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle 10:22:16 PM all 0.00 0.00 1.02 0.00 0.00 1.02 0.00 0.00 97.96 10:22:17 PM all 0.00 0.00 0.00 0.00 0.00 0.00 1.04 0.00 98.96 10:22:18 PM all 0.00 0.00 0.00 0.00 0.00 1.01 1.01 0.00 97.98 10:22:19 PM all 0.00 0.00 1.01 0.00 0.00 1.01 0.00 0.00 97.98 10:22:20 PM all 0.00 0.00 0.00 0.00 0.00 0.00 1.02 0.00 98.98 10:22:21 PM all 0.00 0.00 1.02 0.00 0.00 1.02 0.00 0.00 97.96 10:22:22 PM all 0.00 0.00 0.00 0.00 0.00 1.01 1.01 0.00 97.98
I reverted to an older kernel since the older kernel had run for a couple of months without issues.
This did not fix it. I isolated the issue to a vif rate limit of 100Mb/s being applied to one of the guests and am now able to reproduce on a different machine.
I will look into whether this has been fixed already; if so I will submit a pull request for the Xen4CentOS kernel and if not I will take it up with the xen-devel list.
Yes, I was going to suggest posting this to xen-users -- it's not unlikely someone has already run across this.
-George