[CentOS-virt] Crash in network stack under Xen

Thu Nov 9 21:36:44 UTC 2017
Sarah Newman <srn at prgmr.com>

Hi,

We had a potentially network-related crash on a dom0 running Linux 4.9.39 / Xen 4.8. As of today I can't find any fixes in stable/linux-4.9.y,
xen/staging-4.8, or CPU microcode updates that look like a smoking gun. I can't rule out that it's Xen-related. The backtraces are:

 ------------[ cut here ]------------
 WARNING: CPU: 0 PID: 0 at net/ipv4/af_inet.c:1473 inet_gro_complete+0xbb/0xd0
 Call Trace:
  <IRQ>    dump_stack+0x63/0x8e
   __warn+0xd1/0xf0
   warn_slowpath_null+0x1d/0x20
   inet_gro_complete+0xbb/0xd0
   napi_gro_complete+0x73/0xa0
   napi_gro_flush+0x5f/0x80
   napi_complete_done+0x6a/0xb0
   igb_poll+0x38d/0x720 [igb]
   ? igb_msix_ring+0x2e/0x40 [igb]
   ? __handle_irq_event_percpu+0x4b/0x1a0
   net_rx_action+0x158/0x360
   __do_softirq+0xd1/0x283
   irq_exit+0xe9/0x100
   xen_evtchn_do_upcall+0x35/0x50
   xen_do_hypervisor_callback+0x1e/0x40
   <EOI>    ? xen_hypercall_sched_op+0xa/0x20
   ? xen_hypercall_sched_op+0xa/0x20
   ? xen_safe_halt+0x10/0x20
   ? default_idle+0x1e/0xd0
   ? arch_cpu_idle+0xf/0x20
   ? default_idle_call+0x2c/0x40
   ? cpu_startup_entry+0x1ac/0x240
   ? rest_init+0x77/0x80
   ? start_kernel+0x4a7/0x4b4
   ? set_init_arg+0x55/0x55
   ? x86_64_start_reservations+0x24/0x26
   ? xen_start_kernel+0x555/0x561

 general protection fault: 0000 [#1] SMP
 Call Trace:
  <IRQ>    ? napi_gro_complete+0x5e/0xa0
   skb_release_all+0x24/0x30
   kfree_skb+0x32/0x90
   napi_gro_complete+0x5e/0xa0
   napi_gro_flush+0x5f/0x80
   napi_complete_done+0x6a/0xb0
   igb_poll+0x38d/0x720 [igb]
   ? igb_msix_ring+0x2e/0x40 [igb]
   ? __handle_irq_event_percpu+0x4b/0x1a0
   net_rx_action+0x158/0x360
   __do_softirq+0xd1/0x283
   irq_exit+0xe9/0x100
   xen_evtchn_do_upcall+0x35/0x50
   xen_do_hypervisor_callback+0x1e/0x40
   <EOI>    ? xen_hypercall_sched_op+0xa/0x20
   ? xen_hypercall_sched_op+0xa/0x20
   ? xen_safe_halt+0x10/0x20
   ? default_idle+0x1e/0xd0
   ? arch_cpu_idle+0xf/0x20
   ? default_idle_call+0x2c/0x40
   ? cpu_startup_entry+0x1ac/0x240
   ? rest_init+0x77/0x80
   ? start_kernel+0x4a7/0x4b4
   ? set_init_arg+0x55/0x55
   ? x86_64_start_reservations+0x24/0x26
   ? xen_start_kernel+0x555/0x561
 RIP   skb_release_data+0x73/0xf0
 Kernel panic - not syncing: Fatal exception in interrupt
 Kernel Offset: disabled
(XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.

If anyone has seen a similar backtrace or knows of a potential fix, please respond.

This server has ECC and there were no ECC or other errors in the BIOS event log, nor were there any indications of any problems in the serial console
log leading up to the warning.

This particular server had an uptime of about a month and a half, and so far we've had this error exactly once across all our servers since switching
to 4.9.39 in August, so I don't think it's going to be easy to reproduce.

---

It looks to me like in the first backtrace, the check on the result of this lookup in inet_gro_complete failed:

ops = rcu_dereference(inet_offloads[proto]);

Which I'm guessing means the packet didn't have a valid layer 4 protocol number, or we don't have that protocol enabled. Then, while attempting to
handle that failure, there was a GPF, I believe from accessing invalid data in shinfo->frag_list. "skb_release_data+0x73" is in __read_once_size, which
I think is generated by "kfree_skb: if (likely(atomic_read(&skb->users) == 1))".
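
To make the suspected sequence concrete, here is a toy user-space model in Python. It is not kernel code, only a sketch of the control flow described above: a per-protocol handler lookup that warns on a miss, and a caller that frees the packet when the lookup reports an error. The function names ending in _model, the INET_OFFLOADS dictionary, the choice of registered protocols, and the -ENOSYS return value are all assumptions made for illustration.

```python
import errno

# Hypothetical stand-in for the kernel's inet_offloads[] table, keyed by IP
# protocol number. Which protocols a real kernel registers depends on its
# configuration; TCP and UDP are used here purely for illustration.
INET_OFFLOADS = {
    6: {"gro_complete": lambda skb: 0},   # IPPROTO_TCP
    17: {"gro_complete": lambda skb: 0},  # IPPROTO_UDP
}


def inet_gro_complete_model(proto, warnings):
    """Look up the layer 4 handler; warn and return an error if it is missing."""
    ops = INET_OFFLOADS.get(proto)
    if ops is None or ops.get("gro_complete") is None:
        # Stands in for the warning seen in the first backtrace.
        warnings.append("no gro_complete handler for protocol %d" % proto)
        return -errno.ENOSYS  # assumed error code
    return ops["gro_complete"](None)


def napi_gro_complete_model(proto, warnings):
    """Return which path the caller takes: deliver the packet or free it."""
    err = inet_gro_complete_model(proto, warnings)
    if err:
        # Error path: the second backtrace shows kfree_skb called from here.
        return "kfree_skb"
    return "deliver"


if __name__ == "__main__":
    seen = []
    print(napi_gro_complete_model(6, seen))    # registered protocol
    print(napi_gro_complete_model(253, seen))  # unregistered protocol
    print(seen)
```

The model only shows how an unexpected protocol value would lead from the warning to the free path. It says nothing about why the data being freed was invalid, which is the part I can't explain.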

--Sarah