Hi,
We had a potentially network-related crash on a dom0 running Linux 4.9.39 / Xen 4.8, and as of today I can't find any fixes in stable/linux-4.9.y,
xen/staging-4.8, or CPU microcode updates that look like a smoking gun. I can't rule out that it's Xen-related. The backtraces are:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at net/ipv4/af_inet.c:1473 inet_gro_complete+0xbb/0xd0
Call Trace:
<IRQ> dump_stack+0x63/0x8e
__warn+0xd1/0xf0
warn_slowpath_null+0x1d/0x20
inet_gro_complete+0xbb/0xd0
napi_gro_complete+0x73/0xa0
napi_gro_flush+0x5f/0x80
napi_complete_done+0x6a/0xb0
igb_poll+0x38d/0x720 [igb]
? igb_msix_ring+0x2e/0x40 [igb]
? __handle_irq_event_percpu+0x4b/0x1a0
net_rx_action+0x158/0x360
__do_softirq+0xd1/0x283
irq_exit+0xe9/0x100
xen_evtchn_do_upcall+0x35/0x50
xen_do_hypervisor_callback+0x1e/0x40
<EOI> ? xen_hypercall_sched_op+0xa/0x20
? xen_hypercall_sched_op+0xa/0x20
? xen_safe_halt+0x10/0x20
? default_idle+0x1e/0xd0
? arch_cpu_idle+0xf/0x20
? default_idle_call+0x2c/0x40
? cpu_startup_entry+0x1ac/0x240
? rest_init+0x77/0x80
? start_kernel+0x4a7/0x4b4
? set_init_arg+0x55/0x55
? x86_64_start_reservations+0x24/0x26
? xen_start_kernel+0x555/0x561
general protection fault: 0000 [#1] SMP
Call Trace:
<IRQ> ? napi_gro_complete+0x5e/0xa0
skb_release_all+0x24/0x30
kfree_skb+0x32/0x90
napi_gro_complete+0x5e/0xa0
napi_gro_flush+0x5f/0x80
napi_complete_done+0x6a/0xb0
igb_poll+0x38d/0x720 [igb]
? igb_msix_ring+0x2e/0x40 [igb]
? __handle_irq_event_percpu+0x4b/0x1a0
net_rx_action+0x158/0x360
__do_softirq+0xd1/0x283
irq_exit+0xe9/0x100
xen_evtchn_do_upcall+0x35/0x50
xen_do_hypervisor_callback+0x1e/0x40
<EOI> ? xen_hypercall_sched_op+0xa/0x20
? xen_hypercall_sched_op+0xa/0x20
? xen_safe_halt+0x10/0x20
? default_idle+0x1e/0xd0
? arch_cpu_idle+0xf/0x20
? default_idle_call+0x2c/0x40
? cpu_startup_entry+0x1ac/0x240
? rest_init+0x77/0x80
? start_kernel+0x4a7/0x4b4
? set_init_arg+0x55/0x55
? x86_64_start_reservations+0x24/0x26
? xen_start_kernel+0x555/0x561
RIP skb_release_data+0x73/0xf0
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
(XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
If anyone has seen a similar backtrace or knows of a potential fix, please respond.
This server has ECC and there were no ECC or other errors in the BIOS event log, nor were there any indications of any problems in the serial console
log leading up to the warning.
This particular server had an uptime of about a month and a half, and so far we've had this error exactly once across all our servers since switching
to 4.9.39 in August, so I don't think it's going to be easy to reproduce.
---
It looks to me like the warning in the first backtrace came from the check on the result of this lookup in inet_gro_complete:
ops = rcu_dereference(inet_offloads[proto]);
I'm guessing that means the packet didn't carry a valid layer 4 protocol number, or that we don't have that protocol enabled. Then, while
handling that failure, there was a GPF, I believe from accessing invalid data in shinfo->frag_list. "skb_release_data+0x73" is in __read_once_size, which
I think is generated by "kfree_skb: if (likely(atomic_read(&skb->users) == 1))".
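To make the first guess concrete, here is a rough user-space model of a per-protocol handler table indexed by the 8-bit IPv4 protocol field. It is only a sketch of the idea, not the kernel source: every name starting with "model_" is made up for illustration, and the real code uses RCU and struct net_offload rather than a plain array. The point is just that a protocol byte with no registered handler (or a handler without a completion callback) lands in the error path instead of being dispatched.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's per-protocol offload ops. */
struct model_offload {
    int (*gro_complete)(int nhoff);
};

/* One slot per possible value of the 8-bit IPv4 protocol field. */
static const struct model_offload *model_offloads[256];

/* Register a handler for one protocol number (for example 6 for TCP). */
void model_register(uint8_t proto, const struct model_offload *ops)
{
    assert(ops != NULL);
    model_offloads[proto] = ops;
}

/* Trivial completion callback used only for this illustration. */
int model_tcp_complete(int nhoff)
{
    (void)nhoff;
    return 0;
}

const struct model_offload model_tcp = { .gro_complete = model_tcp_complete };

/*
 * Look up the handler for proto and dispatch to it. Returns -ENOSYS when
 * the slot is empty or has no completion callback, which is the case
 * where I believe the real code hits the warning.
 */
int model_gro_complete(uint8_t proto, int nhoff)
{
    const struct model_offload *ops = model_offloads[proto];

    if (ops == NULL || ops->gro_complete == NULL)
        return -ENOSYS;

    return ops->gro_complete(nhoff);
}
```

With only protocol 6 registered via model_register(6, &model_tcp), a call for protocol 6 is dispatched normally, while a call for any other protocol value returns -ENOSYS. If the protocol byte in the merged packet's IP header were corrupted, I would expect the real lookup to fail in the same way.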
--Sarah