[CentOS-virt] Stability issues since moving to 4.6 - Kernel paging request bug + VM left in null state

Wed Nov 15 15:09:43 UTC 2017
George Dunlap <dunlapg at umich.edu>

Nathan,

Thanks for the report.  Would you mind re-posting this to the
xen-users mailing list?  You're much more likely to get someone there
who's seen such a bug before.

 -George

On Tue, Nov 7, 2017 at 11:12 PM, Nathan March <nathan at gt.net> wrote:
> Since moving from 4.4 to 4.6, I’ve been seeing an increasing number of
> stability issues on our hypervisors. I’m not clear whether there’s a single
> root cause here or whether I’m dealing with multiple bugs…
>
>
>
> One of the more common ones I’ve seen is that a VM remains in the null
> state on shutdown and a kernel bug is thrown:
>
>
>
> xen001 log # xl list
> Name                                        ID   Mem VCPUs      State   Time(s)
> Domain-0                                     0  6144    24     r-----    6639.7
> (null)                                       3     0     1     --pscd      36.3
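
A minimal sketch for spotting leftover domains like the one in the listing above, assuming `xl list` keeps the six-column layout shown there and that domain names contain no spaces. The helper names (`parse_xl_list`, `stuck_domains`) are hypothetical, and the state letters follow the xl man page (r, b, p, s, c, d):

```python
# Hypothetical helpers; the column layout is assumed from the quoted output.
STATE_FLAGS = {
    "r": "running", "b": "blocked", "p": "paused",
    "s": "shutdown", "c": "crashed", "d": "dying",
}

def parse_xl_list(text):
    """Parse `xl list` text into a list of dicts, skipping the header row."""
    domains = []
    for line in text.splitlines():
        fields = line.split()
        # Expect: Name ID Mem VCPUs State Time(s); skip header and blank lines.
        if len(fields) != 6 or fields[0] == "Name":
            continue
        name, domid, mem, vcpus, state, secs = fields
        domains.append({
            "name": name,
            "id": int(domid),
            "mem": int(mem),
            "vcpus": int(vcpus),
            # Keep only the flags that are set; "-" means the flag is clear.
            "state": [STATE_FLAGS[c] for c in state if c in STATE_FLAGS],
            "time": float(secs),
        })
    return domains

def stuck_domains(domains):
    """Return domains whose name has already been torn down to "(null)"."""
    return [d for d in domains if d["name"] == "(null)"]

SAMPLE = """\
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  6144    24     r-----    6639.7
(null)                                       3     0     1     --pscd      36.3
"""

if __name__ == "__main__":
    for dom in stuck_domains(parse_xl_list(SAMPLE)):
        print(dom["id"], ",".join(dom["state"]))
```

On a live host the text would come from running `xl list` rather than from the `SAMPLE` string, which is only the listing quoted above.
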
>
>
>
> [89920.839074] BUG: unable to handle kernel paging request at ffff88020ee9a000
> [89920.839546] IP: [<ffffffff81430922>] __memcpy+0x12/0x20
> [89920.839933] PGD 2008067
> [89920.840022] PUD 17f43f067
> [89920.840390] PMD 1e0976067
> [89920.840469] PTE 0
> [89920.840833]
> [89920.841123] Oops: 0000 [#1] SMP
> [89920.841417] Modules linked in: ebt_ip ebtable_filter ebtables arptable_filter arp_tables bridge xen_pciback xen_gntalloc nfsd auth_rpcgss nfsv3 nfs_acl nfs fscache lockd sunrpc grace 8021q mrp garp stp llc bonding xen_acpi_processor blktap xen_netback xen_blkback xen_gntdev xen_evtchn xenfs xen_privcmd dcdbas fjes pcspkr ipmi_devintf ipmi_si ipmi_msghandler joydev i2c_i801 i2c_smbus lpc_ich shpchp mei_me mei ioatdma ixgbe mdio igb dca ptp pps_core uas usb_storage wmi ttm
> [89920.847080] CPU: 4 PID: 1471 Comm: loop6 Not tainted 4.9.58-29.el6.x86_64 #1
> [89920.847381] Hardware name: Dell Inc. PowerEdge C6220/03C9JJ, BIOS 2.7.1 03/04/2015
> [89920.847893] task: ffff8801b75e0700 task.stack: ffffc900460e0000
> [89920.848192] RIP: e030:[<ffffffff81430922>]  [<ffffffff81430922>] __memcpy+0x12/0x20
> [89920.848783] RSP: e02b:ffffc900460e3b20  EFLAGS: 00010246
> [89920.849081] RAX: ffff88018916d000 RBX: ffff8801b75e0700 RCX: 0000000000000200
> [89920.849384] RDX: 0000000000000000 RSI: ffff88020ee9a000 RDI: ffff88018916d000
> [89920.849686] RBP: ffffc900460e3b38 R08: ffff88011da9fcf8 R09: 0000000000000002
> [89920.849989] R10: ffff88019535bddc R11: ffffea0006245b5c R12: 0000000000001000
> [89920.850294] R13: ffff88018916e000 R14: 0000000000001000 R15: ffffc900460e3b68
> [89920.850605] FS:  00007fb865c30700(0000) GS:ffff880204b00000(0000) knlGS:0000000000000000
> [89920.851118] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> [89920.851418] CR2: ffff88020ee9a000 CR3: 00000001ef03b000 CR4: 0000000000042660
> [89920.851720] Stack:
> [89920.852009]  ffffffff814375ca ffffc900460e3b38 ffffc900460e3d08 ffffc900460e3bb8
> [89920.852821]  ffffffff814381c5 ffffc900460e3b68 ffffc900460e3d08 0000000000001000
> [89920.853633]  ffffc900460e3d88 0000000000000000 0000000000001000 ffffea0000000000
> [89920.854445] Call Trace:
> [89920.854741]  [<ffffffff814375ca>] ? memcpy_from_page+0x3a/0x70
> [89920.855043]  [<ffffffff814381c5>] iov_iter_copy_from_user_atomic+0x265/0x290
> [89920.855354]  [<ffffffff811cf633>] generic_perform_write+0xf3/0x1d0
> [89920.855673]  [<ffffffff8101e39a>] ? xen_load_tls+0xaa/0x160
> [89920.855992]  [<ffffffffc025cf2b>] nfs_file_write+0xdb/0x200 [nfs]
> [89920.856297]  [<ffffffff81269062>] vfs_iter_write+0xa2/0xf0
> [89920.856599]  [<ffffffff815fa365>] lo_write_bvec+0x65/0x100
> [89920.856899]  [<ffffffff815fc375>] do_req_filebacked+0x195/0x300
> [89920.857202]  [<ffffffff815fc53b>] loop_queue_work+0x5b/0x80
> [89920.857505]  [<ffffffff810c6898>] kthread_worker_fn+0x98/0x1b0
> [89920.857808]  [<ffffffff818d9dca>] ? schedule+0x3a/0xa0
> [89920.858108]  [<ffffffff818ddbb6>] ? _raw_spin_unlock_irqrestore+0x16/0x20
> [89920.858411]  [<ffffffff810c6800>] ? kthread_probe_data+0x40/0x40
> [89920.858713]  [<ffffffff810c63f5>] kthread+0xe5/0x100
> [89920.859014]  [<ffffffff810c6310>] ? __kthread_init_worker+0x40/0x40
> [89920.859317]  [<ffffffff818de2d5>] ret_from_fork+0x25/0x30
> [89920.859615] Code: 81 f3 00 00 00 00 e9 1e ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 66 90 66 90 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3
> [89920.864410] RIP  [<ffffffff81430922>] __memcpy+0x12/0x20
> [89920.864749]  RSP <ffffc900460e3b20>
> [89920.865021] CR2: ffff88020ee9a000
> [89920.865294] ---[ end trace b77d2ce5646284d1 ]---
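
As a first step in reading the trace above, here is a rough sketch that decodes the `Oops: 0000` error code and cross-checks a few registers. The bit meanings are the standard x86 page-fault error-code bits. The function name `decode_pf_error` and the `oops` dict are only illustrative, with values copied by hand from the log. This narrows down what faulted; it does not identify the root cause.

```python
def decode_pf_error(code):
    """Decode the low bits of an x86 page-fault error code."""
    return {
        "cause": "protection violation" if code & 0x1 else "page not present",
        "access": "write" if code & 0x2 else "read",
        "mode": "user" if code & 0x4 else "kernel",
        "reserved_bit": bool(code & 0x8),
        "instruction_fetch": bool(code & 0x10),
    }

# Values transcribed from the quoted oops (illustrative container only).
oops = {
    "error_code": 0x0000,
    "RCX": 0x0000000000000200,   # qwords left for `rep movsq` (bytes f3 48 a5)
    "RSI": 0xffff88020ee9a000,   # memcpy source pointer
    "RDI": 0xffff88018916d000,   # memcpy destination pointer
    "CR2": 0xffff88020ee9a000,   # faulting address
    "R12": 0x0000000000001000,   # 0x1000, which also appears on the stack
}

info = decode_pf_error(oops["error_code"])

# `rep movsq` copies RCX quadwords of 8 bytes each.
copy_bytes = oops["RCX"] * 8

if oops["CR2"] == oops["RSI"]:
    faulting_operand = "source (RSI)"
elif oops["CR2"] == oops["RDI"]:
    faulting_operand = "destination (RDI)"
else:
    faulting_operand = "unknown"

if __name__ == "__main__":
    print(info["mode"], info["access"], "fault:", info["cause"])
    print("faulting operand:", faulting_operand)
    print("bytes still to copy:", copy_bytes)
```

Under these assumptions the fault is a kernel-mode read of a not-present page, which matches `PTE 0`. The faulting address equals the copy's source pointer, and a full 4096 bytes were still outstanding, so the copy faulted on its first quadword.
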
>
>
>
> Does anyone have advice on how to troubleshoot the above, or any insight
> into what the issue could be? This hypervisor had only been up for a day and
> had run almost no VMs since boot. I booted a single Windows test VM, which
> BSOD’ed, and then this happened.
>
>
>
> This is on Xen 4.6.6-4.el6 with kernel 4.9.58-29.el6.x86_64. I see these
> issues across a wide range of systems from both Dell and Supermicro, although
> we run the same Intel X540 10Gb NICs in each system with the same NetApp
> NFS backend storage.
>
>
>
> Cheers,
>
> Nathan
>
>
> _______________________________________________
> CentOS-virt mailing list
> CentOS-virt at centos.org
> https://lists.centos.org/mailman/listinfo/centos-virt
>