Hi,
We had a potentially network related crash on a dom0 with Linux 4.9.39 / Xen 4.8 and as of today I can't find any fixes in stable/linux-4.9.y, xen/staging-4.8, or CPU microcode updates that look like a smoking gun. I can't rule out that it's Xen related. The backtraces are:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at net/ipv4/af_inet.c:1473 inet_gro_complete+0xbb/0xd0
Call Trace:
 <IRQ>
 dump_stack+0x63/0x8e
 __warn+0xd1/0xf0
 warn_slowpath_null+0x1d/0x20
 inet_gro_complete+0xbb/0xd0
 napi_gro_complete+0x73/0xa0
 napi_gro_flush+0x5f/0x80
 napi_complete_done+0x6a/0xb0
 igb_poll+0x38d/0x720 [igb]
 ? igb_msix_ring+0x2e/0x40 [igb]
 ? __handle_irq_event_percpu+0x4b/0x1a0
 net_rx_action+0x158/0x360
 __do_softirq+0xd1/0x283
 irq_exit+0xe9/0x100
 xen_evtchn_do_upcall+0x35/0x50
 xen_do_hypervisor_callback+0x1e/0x40
 <EOI>
 ? xen_hypercall_sched_op+0xa/0x20
 ? xen_hypercall_sched_op+0xa/0x20
 ? xen_safe_halt+0x10/0x20
 ? default_idle+0x1e/0xd0
 ? arch_cpu_idle+0xf/0x20
 ? default_idle_call+0x2c/0x40
 ? cpu_startup_entry+0x1ac/0x240
 ? rest_init+0x77/0x80
 ? start_kernel+0x4a7/0x4b4
 ? set_init_arg+0x55/0x55
 ? x86_64_start_reservations+0x24/0x26
 ? xen_start_kernel+0x555/0x561
general protection fault: 0000 [#1] SMP
Call Trace:
 <IRQ>
 ? napi_gro_complete+0x5e/0xa0
 skb_release_all+0x24/0x30
 kfree_skb+0x32/0x90
 napi_gro_complete+0x5e/0xa0
 napi_gro_flush+0x5f/0x80
 napi_complete_done+0x6a/0xb0
 igb_poll+0x38d/0x720 [igb]
 ? igb_msix_ring+0x2e/0x40 [igb]
 ? __handle_irq_event_percpu+0x4b/0x1a0
 net_rx_action+0x158/0x360
 __do_softirq+0xd1/0x283
 irq_exit+0xe9/0x100
 xen_evtchn_do_upcall+0x35/0x50
 xen_do_hypervisor_callback+0x1e/0x40
 <EOI>
 ? xen_hypercall_sched_op+0xa/0x20
 ? xen_hypercall_sched_op+0xa/0x20
 ? xen_safe_halt+0x10/0x20
 ? default_idle+0x1e/0xd0
 ? arch_cpu_idle+0xf/0x20
 ? default_idle_call+0x2c/0x40
 ? cpu_startup_entry+0x1ac/0x240
 ? rest_init+0x77/0x80
 ? start_kernel+0x4a7/0x4b4
 ? set_init_arg+0x55/0x55
 ? x86_64_start_reservations+0x24/0x26
 ? xen_start_kernel+0x555/0x561
RIP skb_release_data+0x73/0xf0
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
(XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
If anyone has had a similar backtrace or knows of a potential fix please respond.
This server has ECC and there were no ECC or other errors in the BIOS event log, nor were there any indications of any problems in the serial console log leading up to the warning.
This particular server had an uptime of about a month and a half, and so far we've had this error exactly once across all our servers since switching to 4.9.39 in August, so I don't think it's going to be easy to reproduce.
---
It looks to me like in the first backtrace, the check that follows this lookup in inet_gro_complete failed:
ops = rcu_dereference(inet_offloads[proto]);
I'm guessing that means the packet didn't have a valid layer 4 protocol definition, or we don't have that protocol enabled. Then, when attempting to handle that failure, there was a GPF, I believe from accessing invalid data in shinfo->frag_list. "skb_release_data+0x73" is in __read_once_size, which I think is generated by "kfree_skb: if (likely(atomic_read(&skb->users) == 1))".
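To make the suspected path concrete, here is a minimal userspace sketch of it. It is only a model under stated assumptions, not the kernel code: every model_* name is hypothetical, the table stands in for inet_offloads[], and the -ENOSYS return plus the "free the packet on error" step reflect my reading of the 4.9 source rather than anything confirmed by the trace.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the kernel's struct net_offload. */
struct model_offload {
    int (*gro_complete)(int nhoff);
};

#define MODEL_MAX_PROTOS 256

/* Stand-in for inet_offloads[]: one slot per IP protocol number. */
static const struct model_offload *model_offloads[MODEL_MAX_PROTOS];

/* Dummy completion handler that always succeeds. */
static int model_tcp_gro_complete(int nhoff)
{
    (void)nhoff;
    return 0;
}

static const struct model_offload model_tcp_offload = {
    .gro_complete = model_tcp_gro_complete,
};

/* Register a handler for one protocol number. */
static void model_register(uint8_t proto, const struct model_offload *ops)
{
    assert(ops != NULL);
    model_offloads[proto] = ops;
}

/*
 * Model of the lookup: returns -ENOSYS when no usable handler is
 * registered, which is the case where the real function would warn.
 */
static int model_inet_gro_complete(uint8_t proto, int nhoff)
{
    const struct model_offload *ops = model_offloads[proto];

    if (ops == NULL || ops->gro_complete == NULL)
        return -ENOSYS;
    return ops->gro_complete(nhoff);
}

/*
 * Model of the caller: any error means the packet is dropped.
 * Returns 1 if the packet would be freed, 0 if it is passed up.
 */
static int model_napi_gro_complete(uint8_t proto)
{
    return model_inet_gro_complete(proto, 20) != 0;
}
```

With a handler registered only for protocol 6 (TCP), model_napi_gro_complete(6) reports the packet as passed up, while any unregistered protocol number takes the drop path, which is where the second backtrace's kfree_skb call would come in.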
--Sarah
On Thu, Nov 09, 2017 at 01:36:44PM -0800, Sarah Newman wrote:
> Hi,
> We had a potentially network related crash on a dom0 with Linux 4.9.39 / Xen 4.8 and as of today I can't find any fixes in stable/linux-4.9.y, xen/staging-4.8, or CPU microcode updates that look like a smoking gun. I can't rule out that it's Xen related. The backtraces are:
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 0 at net/ipv4/af_inet.c:1473 inet_gro_complete+0xbb/0xd0
Did you try tweaking the network settings, such as disabling GRO on the network interface in question, to see if that changes anything?
Thanks,
-- Pasi