[CentOS-virt] Stability issues since moving to 4.6 - Kernel paging request bug + VM left in null state

Since moving from Xen 4.4 to 4.6, I've been seeing an increasing number of
stability issues on our hypervisors. I'm not clear if there's a single
root cause here, or if I'm dealing with multiple bugs.

One of the more common ones I've seen is that a VM remains in the (null)
state on shutdown and a kernel bug is thrown:

xen001 log # xl list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  6144    24     r-----    6639.7
(null)                                       3     0     1     --pscd      36.3

[89920.839074] BUG: unable to handle kernel paging request at ffff88020ee9a000
[89920.839546] IP: [<ffffffff81430922>] __memcpy+0x12/0x20
[89920.839933] PGD 2008067
[89920.840022] PUD 17f43f067
[89920.840390] PMD 1e0976067
[89920.840469] PTE 0
[89920.840833]
[89920.841123] Oops: 0000 [#1] SMP
[89920.841417] Modules linked in: ebt_ip ebtable_filter ebtables arptable_filter arp_tables bridge xen_pciback xen_gntalloc nfsd auth_rpcgss nfsv3 nfs_acl nfs fscache lockd sunrpc grace 8021q mrp garp stp llc bonding xen_acpi_processor blktap xen_netback xen_blkback xen_gntdev xen_evtchn xenfs xen_privcmd dcdbas fjes pcspkr ipmi_devintf ipmi_si ipmi_msghandler joydev i2c_i801 i2c_smbus lpc_ich shpchp mei_me mei ioatdma ixgbe mdio igb dca ptp pps_core uas usb_storage wmi ttm
[89920.847080] CPU: 4 PID: 1471 Comm: loop6 Not tainted 4.9.58-29.el6.x86_64 #1
[89920.847381] Hardware name: Dell Inc. PowerEdge C6220/03C9JJ, BIOS 2.7.1 03/04/2015
[89920.847893] task: ffff8801b75e0700 task.stack: ffffc900460e0000
[89920.848192] RIP: e030:[<ffffffff81430922>]  [<ffffffff81430922>] __memcpy+0x12/0x20
[89920.848783] RSP: e02b:ffffc900460e3b20  EFLAGS: 00010246
[89920.849081] RAX: ffff88018916d000 RBX: ffff8801b75e0700 RCX: 0000000000000200
[89920.849384] RDX: 0000000000000000 RSI: ffff88020ee9a000 RDI: ffff88018916d000
[89920.849686] RBP: ffffc900460e3b38 R08: ffff88011da9fcf8 R09: 0000000000000002
[89920.849989] R10: ffff88019535bddc R11: ffffea0006245b5c R12: 0000000000001000
[89920.850294] R13: ffff88018916e000 R14: 0000000000001000 R15: ffffc900460e3b68
[89920.850605] FS:  00007fb865c30700(0000) GS:ffff880204b00000(0000) knlGS:0000000000000000
[89920.851118] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[89920.851418] CR2: ffff88020ee9a000 CR3: 00000001ef03b000 CR4: 0000000000042660
[89920.851720] Stack:
[89920.852009]  ffffffff814375ca ffffc900460e3b38 ffffc900460e3d08 ffffc900460e3bb8
[89920.852821]  ffffffff814381c5 ffffc900460e3b68 ffffc900460e3d08 0000000000001000
[89920.853633]  ffffc900460e3d88 0000000000000000 0000000000001000 ffffea0000000000
[89920.854445] Call Trace:
[89920.854741]  [<ffffffff814375ca>] ? memcpy_from_page+0x3a/0x70
[89920.855043]  [<ffffffff814381c5>] iov_iter_copy_from_user_atomic+0x265/0x290
[89920.855354]  [<ffffffff811cf633>] generic_perform_write+0xf3/0x1d0
[89920.855673]  [<ffffffff8101e39a>] ? xen_load_tls+0xaa/0x160
[89920.855992]  [<ffffffffc025cf2b>] nfs_file_write+0xdb/0x200 [nfs]
[89920.856297]  [<ffffffff81269062>] vfs_iter_write+0xa2/0xf0
[89920.856599]  [<ffffffff815fa365>] lo_write_bvec+0x65/0x100
[89920.856899]  [<ffffffff815fc375>] do_req_filebacked+0x195/0x300
[89920.857202]  [<ffffffff815fc53b>] loop_queue_work+0x5b/0x80
[89920.857505]  [<ffffffff810c6898>] kthread_worker_fn+0x98/0x1b0
[89920.857808]  [<ffffffff818d9dca>] ? schedule+0x3a/0xa0
[89920.858108]  [<ffffffff818ddbb6>] ? _raw_spin_unlock_irqrestore+0x16/0x20
[89920.858411]  [<ffffffff810c6800>] ? kthread_probe_data+0x40/0x40
[89920.858713]  [<ffffffff810c63f5>] kthread+0xe5/0x100
[89920.859014]  [<ffffffff810c6310>] ? __kthread_init_worker+0x40/0x40
[89920.859317]  [<ffffffff818de2d5>] ret_from_fork+0x25/0x30
[89920.859615] Code: 81 f3 00 00 00 00 e9 1e ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 66 90 66 90 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3
[89920.864410] RIP  [<ffffffff81430922>] __memcpy+0x12/0x20
[89920.864749]  RSP <ffffc900460e3b20>
[89920.865021] CR2: ffff88020ee9a000
[89920.865294] ---[ end trace b77d2ce5646284d1 ]---

Does anyone have advice on how to troubleshoot the above, or some insight
into what the issue could be? This hypervisor had only been up for a day
and had run almost no VMs since boot. I booted a single Windows test VM,
which BSOD'ed, and then this happened.
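
To help tell one root cause from several, it may be worth reducing each
oops to its call-trace function names and comparing those across hosts.
The following is only a rough sketch: `trace_signature` and `FRAME_RE` are
hypothetical names made up for illustration, and it assumes the oops is
available as plain text (for example, copied from dmesg or a log file).

```python
import re

# Matches call-trace frames such as
#   [<ffffffff811cf633>] generic_perform_write+0xf3/0x1d0
# An optional "? " before the symbol marks a frame the kernel's stack
# unwinder was not sure about.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def trace_signature(log_text, include_unreliable=False):
    """Return the ordered function names listed under 'Call Trace:'."""
    _, found, tail = log_text.partition("Call Trace:")
    if not found:
        return []
    # Stop at the "Code:" line so the trailing RIP line is not counted
    # as a call-trace frame.
    tail = tail.split("Code:", 1)[0]
    names = []
    for match in FRAME_RE.finditer(tail):
        unreliable = match.group(1) is not None
        if unreliable and not include_unreliable:
            continue
        names.append(match.group(2))
    return names
```

Two oopses that yield the same list (here it would start with
iov_iter_copy_from_user_atomic, generic_perform_write, nfs_file_write) are
at least candidates for being the same bug; differing lists suggest
separate problems.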

This is on Xen 4.6.6-4.el6 with kernel 4.9.58-29.el6.x86_64. I see these
issues across a wide range of systems from both Dell and Supermicro,
although we run the same Intel X540 10Gb NICs in each system with the same
NetApp NFS backend storage.
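
For tracking how often this happens across hosts, something like the
sketch below could flag leftover domains from captured `xl list` output.
It is an illustration under assumptions, not a tested tool: `stuck_domains`
is a hypothetical helper name, the column order is taken from the output
shown above, and a domain is treated as stuck if it is named (null) or has
the "d" (dying) flag set in the State column.

```python
def stuck_domains(xl_list_output):
    """Return (name, domid, state) for domains that look stuck.

    Expects text in the `xl list` layout:
    Name  ID  Mem  VCPUs  State  Time(s)
    """
    stuck = []
    for line in xl_list_output.splitlines():
        fields = line.split()
        # Skip the header, blank lines, and anything without a numeric ID.
        if len(fields) < 5 or not fields[1].isdigit():
            continue
        name, domid, state = fields[0], int(fields[1]), fields[4]
        # "d" in the State column is xl's dying flag.
        if name == "(null)" or "d" in state:
            stuck.append((name, domid, state))
    return stuck
```

The text passed in could come from running `xl list` on each hypervisor
and collecting the output centrally; how it is gathered is left open here.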

Cheers,

Nathan
