[CentOS-virt] Soft lockups with Xen4CentOS 3.18.25-18.el6.x86_64

Sat Mar 12 23:47:16 UTC 2016
Sarah Newman <srn at prgmr.com>

On 03/10/2016 12:05 AM, Sarah Newman wrote:
> On 03/09/2016 08:15 PM, Sarah Newman wrote:
>> I've been running 3.18.25-18.el6.x86_64 + our build of xen 4.4.3-9 on one host for the last couple of weeks and have gotten several soft lockups
>> within the last 24 hours. I am posting here first in case anyone else has experienced the same issue.
>>
> 
> Here is mpstat from around the time of the issue:
> 
> 10:08:56 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
> 10:09:10 PM  all    0.00    0.00   66.67    0.00    0.00   33.33    0.00    0.00    0.00
> 10:09:11 PM  all    2.17    0.00    5.43   32.61    0.00   58.70    1.09    0.00    0.00
> 10:09:12 PM  all    0.00    0.00    1.15    0.00    0.00   85.06    0.00    0.00   13.79
> 10:09:13 PM  all    0.00    0.00    1.08    0.00    0.00   83.87    0.00    0.00   15.05
> 10:09:14 PM  all    0.00    0.00    1.10    0.00    0.00   83.52    0.00    0.00   15.38
> 10:09:15 PM  all    1.09    0.00    1.09    0.00    0.00   85.87    0.00    0.00   11.96
> 10:09:51 PM  all    0.00    0.00    1.09    0.00    0.00   84.78    1.09    0.00   13.04
> Message from syslogd at Mar  9 22:09:51 ...
>  kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
> 10:10:02 PM  all    0.00    0.00   33.33   50.00    0.00   16.67    0.00    0.00    0.00
> 10:10:03 PM  all    3.16    0.00   10.53    8.42    0.00    2.11    1.05    0.00   74.74
> 10:10:04 PM  all    0.00    0.00    3.23   38.71    0.00    1.08    1.08    0.00   55.91
> 10:10:05 PM  all    0.00    0.00    4.30   11.83    0.00    3.23    1.08    0.00   79.57
> 
> Typical load:
> 
> 10:22:15 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
> 10:22:16 PM  all    0.00    0.00    1.02    0.00    0.00    1.02    0.00    0.00   97.96
> 10:22:17 PM  all    0.00    0.00    0.00    0.00    0.00    0.00    1.04    0.00   98.96
> 10:22:18 PM  all    0.00    0.00    0.00    0.00    0.00    1.01    1.01    0.00   97.98
> 10:22:19 PM  all    0.00    0.00    1.01    0.00    0.00    1.01    0.00    0.00   97.98
> 10:22:20 PM  all    0.00    0.00    0.00    0.00    0.00    0.00    1.02    0.00   98.98
> 10:22:21 PM  all    0.00    0.00    1.02    0.00    0.00    1.02    0.00    0.00   97.96
> 10:22:22 PM  all    0.00    0.00    0.00    0.00    0.00    1.01    1.01    0.00   97.98
> 
> 
> I reverted to an older kernel since the older kernel had run for a couple of months without issues.


Reverting to the older kernel did not fix it. I isolated the issue to a vif rate limit of 100Mb/s applied to one of the guests, and I am now able to
reproduce it on a different machine.
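
For anyone trying to reproduce this, a rate limit of this kind is normally set with the rate= key on the vif line of the guest config. The
fragment below is only a sketch of what such a line looks like, not the config from the affected host: the bridge name xenbr0 is a placeholder,
and it assumes the toolstack accepts the usual xm/xl-style vif specification.

```python
# Guest config fragment (xm/xl-style syntax, which is also valid Python).
# ASSUMPTION: "xenbr0" is a placeholder bridge name, not the real one.
# rate=100Mb/s is the limit described in this thread.
vif = [ 'bridge=xenbr0,rate=100Mb/s' ]
```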

I will look into whether this has been fixed already; if so, I will submit a pull request for the Xen4CentOS kernel, and if not, I will take it up
with the xen-devel list.
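
If anyone wants to check their own hosts for the same pattern, the mpstat output quoted above shows %soft sitting above 80% in the samples just
before the watchdog message. Below is a rough sketch of a script that screens mpstat text for such samples. The function names and the 50%
threshold are my own choices for illustration, and it assumes the same 12-hour timestamp and nine-column layout as the output above.

```python
import re

# A few lines copied from the mpstat output quoted earlier in this thread.
SAMPLE = """\
10:22:15 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
10:09:11 PM  all    2.17    0.00    5.43   32.61    0.00   58.70    1.09    0.00    0.00
10:09:12 PM  all    0.00    0.00    1.15    0.00    0.00   85.06    0.00    0.00   13.79
10:22:16 PM  all    0.00    0.00    1.02    0.00    0.00    1.02    0.00    0.00   97.96
"""

# Column order as printed in the mpstat header above.
COLUMNS = ["usr", "nice", "sys", "iowait", "irq", "soft", "steal", "guest", "idle"]

# Timestamp, CPU label, then the remaining numeric fields.
LINE_RE = re.compile(r"^(\d{1,2}:\d{2}:\d{2} [AP]M)\s+(\S+)\s+(.*)$")


def parse_mpstat(text):
    """Yield (timestamp, cpu, {column: percent}) for each data line."""
    for line in text.splitlines():
        # Tolerate email quote prefixes such as "> ".
        line = line.lstrip("> ").rstrip()
        match = LINE_RE.match(line)
        if not match:
            continue  # syslog messages, blank lines, etc.
        fields = match.group(3).split()
        if len(fields) != len(COLUMNS):
            continue
        try:
            values = [float(field) for field in fields]
        except ValueError:
            continue  # header row ("%usr", "%nice", ...)
        yield match.group(1), match.group(2), dict(zip(COLUMNS, values))


def softirq_spikes(text, threshold=50.0):
    """Return (timestamp, %soft) for samples at or above the threshold.

    ASSUMPTION: 50% is an arbitrary cut-off chosen for this sketch.
    """
    return [
        (timestamp, row["soft"])
        for timestamp, _cpu, row in parse_mpstat(text)
        if row["soft"] >= threshold
    ]


if __name__ == "__main__":
    for timestamp, soft in softirq_spikes(SAMPLE):
        print(timestamp, soft)
```

Output saved from something like "mpstat 1" can be passed to softirq_spikes() as a string; the threshold will likely need adjusting for hosts
with heavier normal network load.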