Soft lockups with Xen4CentOS 3.18.25-18.el6.x86_64 - virt

10 Mar 2016


I've been running kernel 3.18.25-18.el6.x86_64 with our build of Xen 4.4.3-9 on one host for the last couple of weeks, and it has hit several soft lockups within the last 24 hours. I am posting here first in case anyone else has experienced the same issue.
Here is the first instance:
sched: RT throttling activated
NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
Modules linked in: ebt_arp xen_pciback xen_gntalloc ebt_ip ebtable_filter ebtables ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables xt_physdev br_netfilter bridge stp llc ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 xen_acpi_processor blktap xen_netback xen_blkback xen_gntdev xen_evtchn xenfs xen_privcmd joydev sg 8250_fintek serio_raw gpio_ich iTCO_wdt iTCO_vendor_support coretemp intel_powerclamp crct10dif_pclmul crc32_pclmul crc32c_intel pcspkr i2c_i801 lpc_ich igb ptp pps_core hwmon ioatdma dca i7core_edac edac_core shpchp ext3 jbd mbcache raid10 raid1 sd_mod mptsas mptscsih mptbase scsi_transport_sas aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 ahci libahci mgag200 ttm drm_kms_helper dm_mirror dm_region_hash dm_log dm_mod
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.18.25-18.el6.x86_64 #1
Hardware name: Supermicro X8DTN+-F/X8DTN+-F, BIOS 080016  08/03/2011
task: ffffffff81c1b4c0 ti: ffffffff81c00000 task.ti: ffffffff81c00000
RIP: e030:[<ffffffffa02811d5>]  [<ffffffffa02811d5>] xenvif_tx_build_gops+0xa5/0x890 [xen_netback]
RSP: e02b:ffff88013f403ca8  EFLAGS: 00000206
RAX: 000000000000003c RBX: ffffc90012c28000 RCX: ffffc90012c280d0
RDX: 0000000000071ea1 RSI: 0000000000000040 RDI: ffffc90012c28000
RBP: ffff88013f403e38 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000071e65
R13: ffff88013f403e50 R14: 000000000000003c R15: 0000000000000032
FS:  00007fe942ac7980(0000) GS:ffff88013f400000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff8000007f6800 CR3: 00000000bcd4c000 CR4: 0000000000002660
Stack:
 ffff88013f403d30 ffff880006de6800 ffff8800adfee1c0 ffff88013f403e54
 000000015786e75a 00000040a0351878 ffffc90012c2def0 000000015786e73c
 ffffc90012c280d0 ffffc90012c2def0 ffffffffa03516c0 ffff8800bf29e000
Call Trace:
 <IRQ>
 [<ffffffffa03516c0>] ? br_handle_frame_finish+0x3f0/0x3f0 [bridge]
 [<ffffffff815a2b9e>] ? __netif_receive_skb_core+0x1ee/0x640
 [<ffffffff815a3017>] ? __netif_receive_skb+0x27/0x70
 [<ffffffff815a326d>] ? netif_receive_skb_internal+0x2d/0x90
 [<ffffffffa01d5053>] ? igb_alloc_rx_buffers+0x63/0xe0 [igb]
 [<ffffffffa0281a0d>] xenvif_tx_action+0x4d/0xa0 [xen_netback]
 [<ffffffffa02843b5>] xenvif_poll+0x35/0x68 [xen_netback]
 [<ffffffff815a39c2>] net_rx_action+0x112/0x2a0
 [<ffffffff81076b7c>] __do_softirq+0xfc/0x2b0
 [<ffffffff81076e3d>] irq_exit+0xbd/0xd0
 [<ffffffff813b2cbc>] xen_evtchn_do_upcall+0x3c/0x50
 [<ffffffff8167659e>] xen_do_hypervisor_callback+0x1e/0x40
 <EOI>
 [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
 [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
 [<ffffffff8100a830>] ? xen_safe_halt+0x10/0x20
 [<ffffffff8101ec84>] ? default_idle+0x24/0xc0
 [<ffffffff8101e28f>] ? arch_cpu_idle+0xf/0x20
 [<ffffffff810b2276>] ? cpuidle_idle_call+0xd6/0x1d0
 [<ffffffff81091312>] ? __atomic_notifier_call_chain+0x12/0x20
 [<ffffffff810b24a5>] ? cpu_idle_loop+0x135/0x1e0
 [<ffffffff810b256b>] ? cpu_startup_entry+0x1b/0x70
 [<ffffffff810b25b0>] ? cpu_startup_entry+0x60/0x70
 [<ffffffff81667c57>] ? rest_init+0x77/0x80
 [<ffffffff81d774c9>] ? start_kernel+0x441/0x448
 [<ffffffff81d76ea6>] ? set_init_arg+0x5d/0x5d
 [<ffffffff81d76603>] ? x86_64_start_reservations+0x2a/0x2c
 [<ffffffff81d7aae1>] ? xen_start_kernel+0x5ef/0x5f1
Code: 00 0f 87 06 07 00 00 44 8b b3 b8 00 00 00 44 03 b3 c0 00 00 00 45 29 e6 41 39 c6 44 0f 47 f0 45 85 f6 0f 84 8f 00 00 00 0f ae e8 <8b> 83 c0 00 00 00 83 e8 01 44 21 e0 48 8d 04 40 48 c1 e0 02 48
The remaining lockups share the backtrace below; the only difference is that in two instances RIP was in net_rx_action:
 [<ffffffff815a39c2>] net_rx_action+0x112/0x2a0
 [<ffffffff81076b7c>] __do_softirq+0xfc/0x2b0
 [<ffffffff81076e3d>] irq_exit+0xbd/0xd0
 [<ffffffff813b2cbc>] xen_evtchn_do_upcall+0x3c/0x50
 [<ffffffff8167659e>] xen_do_hypervisor_callback+0x1e/0x40
 <EOI>
 [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
 [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
 [<ffffffff8100a830>] ? xen_safe_halt+0x10/0x20
 [<ffffffff8101ec84>] ? default_idle+0x24/0xc0
 [<ffffffff8101e28f>] ? arch_cpu_idle+0xf/0x20
 [<ffffffff810b2276>] ? cpuidle_idle_call+0xd6/0x1d0
 [<ffffffff81091312>] ? __atomic_notifier_call_chain+0x12/0x20
 [<ffffffff810b24a5>] ? cpu_idle_loop+0x135/0x1e0
 [<ffffffff810b256b>] ? cpu_startup_entry+0x1b/0x70
 [<ffffffff810b25b0>] ? cpu_startup_entry+0x60/0x70
 [<ffffffff81667c57>] ? rest_init+0x77/0x80
 [<ffffffff81d774c9>] ? start_kernel+0x441/0x448
 [<ffffffff81d76ea6>] ? set_init_arg+0x5d/0x5d
 [<ffffffff81d76603>] ? x86_64_start_reservations+0x2a/0x2c
 [<ffffffff81d7aae1>] ? xen_start_kernel+0x5ef/0x5f1
I can post more complete backtraces if that information would be useful to someone.
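For anyone who wants to compare their own logs, here is a rough sketch of how the RIP locations could be tallied across several lockup reports. It assumes the console or dmesg output has been saved as text and that the RIP lines have the same layout as the one above; the helper name `tally_rip_symbols` is made up for illustration and is not part of any existing tool.

```python
import re
from collections import Counter

# Extracts the function name from an x86_64 "RIP:" line such as:
#   RIP: e030:[<ffffffffa02811d5>]  [<ffffffffa02811d5>] xenvif_tx_build_gops+0xa5/0x890 [xen_netback]
# Assumption: the symbol follows the last bracketed address and has the
# usual "name+0xoffset/0xsize" form.
RIP_RE = re.compile(r"RIP:.*\]\s+([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def tally_rip_symbols(log_text):
    """Count how often each function appears as RIP in the given log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = RIP_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Reading a saved log into a string and passing it to `tally_rip_symbols` returns a `Counter`, so `most_common()` lists the functions where the CPU was most often stuck.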