Hi All,
One of our developers managed to trigger a kernel oops on a 4.4.2 dom0. The oops text is attached. He was setting up network namespaces / bridging inside a CentOS domU, activated the bridge, and lost networking (probably a config error), so he rebooted the VM. The oops appeared on reboot, along with various Xen processes hanging:
root 23388 0.0 1.6 132796 32264 ? SLsl Oct02 0:04 /usr/sbin/xl create /mnt/xen/gx/xen/metrixc7
root 26119 0.0 0.0 0 0 ? Z Oct02 0:00 _ [block] <defunct>
root 26127 0.0 0.0 0 0 ? Z Oct02 0:00 _ [block] <defunct>
root 26137 0.0 0.0 0 0 ? Z Oct02 0:00 _ [block] <defunct>
root 26157 0.0 0.0 0 0 ? Z Oct02 0:00 _ [block] <defunct>
root 26169 0.0 0.0 0 0 ? Z Oct02 0:00 _ [block] <defunct>
root 26195 0.0 0.0 0 0 ? Z Oct02 0:00 _ [vif-bridge] <defunct>
root 24625 0.0 0.0 0 0 ? Ds Oct02 0:06 [tapdisk]
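For anyone collecting the same data on another host, the stuck entries in a listing like the one above can be picked out mechanically. This is only a rough sketch: it assumes `ps aux`-style columns (USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND), and the helper name `stuck_processes` is made up for illustration. It flags anything whose STAT starts with Z (zombie) or D (uninterruptible sleep).

```python
def stuck_processes(ps_lines):
    """Return (pid, stat, command) for zombie (Z) or uninterruptible (D) processes.

    Assumes `ps aux`-style columns; the STAT field is the 8th column.
    """
    stuck = []
    for line in ps_lines:
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # skip the header row or malformed lines
        pid, stat, command = int(fields[1]), fields[7], fields[10]
        if stat[0] in ("Z", "D"):
            stuck.append((pid, stat, command))
    return stuck


# Three lines taken from the listing above.
sample = [
    "root 23388 0.0 1.6 132796 32264 ? SLsl Oct02 0:04 /usr/sbin/xl create /mnt/xen/gx/xen/metrixc7",
    "root 26195 0.0 0.0 0 0 ? Z Oct02 0:00 _ [vif-bridge] <defunct>",
    "root 24625 0.0 0.0 0 0 ? Ds Oct02 0:06 [tapdisk]",
]

for pid, stat, command in stuck_processes(sample):
    print(pid, stat, command)
```

On the sample lines this reports the defunct vif-bridge script and the tapdisk process in D state, but not the sleeping xl parent.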
At this point the dom0 is still up and running existing VMs fine. I can also live-migrate VMs off of it successfully, although the post-migration cleanup fails and hangs:
libxl: error: libxl_device.c:935:device_backend_callback: unable to remove device with path /local/domain/0/backend/vbd/17/51712
The host is running CentOS 7 with the 4.4.2-7 package and kernel 3.10.68-11.el6.centos.alt.x86_64. I've also attached xl dmesg.
This is the first time I've seen anything like this, and I'm not sure whether his networking/bridging in the domU is related or just coincidental. Any thoughts / ideas? I'm going to try to reproduce it on a test dom0 later this week, so I'm happy to grab any additional debugging output if required.
- Nathan
On Mon, Oct 5, 2015 at 10:55 PM, Nathan March nathan@gt.net wrote:
> Hi All,
> One of our developers managed to trigger a kernel oops on a 4.4.2 dom0.. Oops text is attached. He was working on setting up network namespaces / bridging inside a centos domU, had activated the bridge and lost networking (probably config error) so he rebooted the VM. On reboot is when we saw the oops, along with various xen procs hanging:
Nathan,
Thanks for the detailed report. Would you mind re-posting this to the xen-users mailing list?
Realistically though, at the moment I don't have the bandwidth to maintain more than one kernel package for CentOS. So if you can trigger it reliably, my only advice (other than backporting a patch yourself, if one is available) would be to try it again with the 3.18 kernel currently in the tree. If it still faults there, I can pursue backporting a fix to that kernel.
-George