[CentOS-virt] Soft lockups with Xen4CentOS 3.18.25-18.el6.x86_64

Tue Mar 15 10:55:52 UTC 2016
George Dunlap <dunlapg at umich.edu>

On Sat, Mar 12, 2016 at 11:47 PM, Sarah Newman <srn at prgmr.com> wrote:
> On 03/10/2016 12:05 AM, Sarah Newman wrote:
>> On 03/09/2016 08:15 PM, Sarah Newman wrote:
>>> I've been running 3.18.25-18.el6.x86_64 + our build of xen 4.4.3-9 on one host for the last couple of weeks and have gotten several soft lockups
>>> within the last 24 hours. I am posting here first in case anyone else has experienced the same issue.
>>>
>>
>> Here is mpstat from around the time of the issue:
>>
>> 10:08:56 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
>> 10:09:10 PM  all    0.00    0.00   66.67    0.00    0.00   33.33    0.00    0.00    0.00
>> 10:09:11 PM  all    2.17    0.00    5.43   32.61    0.00   58.70    1.09    0.00    0.00
>> 10:09:12 PM  all    0.00    0.00    1.15    0.00    0.00   85.06    0.00    0.00   13.79
>> 10:09:13 PM  all    0.00    0.00    1.08    0.00    0.00   83.87    0.00    0.00   15.05
>> 10:09:14 PM  all    0.00    0.00    1.10    0.00    0.00   83.52    0.00    0.00   15.38
>> 10:09:15 PM  all    1.09    0.00    1.09    0.00    0.00   85.87    0.00    0.00   11.96
>> 10:09:51 PM  all    0.00    0.00    1.09    0.00    0.00   84.78    1.09    0.00   13.04
>> Message from syslogd at Mar  9 22:09:51 ...
>>  kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
>> 10:10:02 PM  all    0.00    0.00   33.33   50.00    0.00   16.67    0.00    0.00    0.00
>> 10:10:03 PM  all    3.16    0.00   10.53    8.42    0.00    2.11    1.05    0.00   74.74
>> 10:10:04 PM  all    0.00    0.00    3.23   38.71    0.00    1.08    1.08    0.00   55.91
>> 10:10:05 PM  all    0.00    0.00    4.30   11.83    0.00    3.23    1.08    0.00   79.57
>>
>> Typical load:
>>
>> 10:22:15 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
>> 10:22:16 PM  all    0.00    0.00    1.02    0.00    0.00    1.02    0.00    0.00   97.96
>> 10:22:17 PM  all    0.00    0.00    0.00    0.00    0.00    0.00    1.04    0.00   98.96
>> 10:22:18 PM  all    0.00    0.00    0.00    0.00    0.00    1.01    1.01    0.00   97.98
>> 10:22:19 PM  all    0.00    0.00    1.01    0.00    0.00    1.01    0.00    0.00   97.98
>> 10:22:20 PM  all    0.00    0.00    0.00    0.00    0.00    0.00    1.02    0.00   98.98
>> 10:22:21 PM  all    0.00    0.00    1.02    0.00    0.00    1.02    0.00    0.00   97.96
>> 10:22:22 PM  all    0.00    0.00    0.00    0.00    0.00    1.01    1.01    0.00   97.98
>>
>>
>> I reverted to an older kernel since the older kernel had run for a couple of months without issues.
>
>
> This did not fix it. I isolated the issue to a vif rate limit of 100Mb/s being applied to one of the guests and am now able to reproduce on a
> different machine.
>
> I will look into whether this has been fixed already; if so I will submit a pull request for the Xen4CentOS kernel and if not I will take it up with
> the xen-devel list.

Yes, I was going to suggest posting this to xen-users -- it's not
unlikely someone has already run across this.
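
If it helps when comparing hosts or kernels, below is a minimal sketch
(not something from the original report) that scans mpstat output like
the samples quoted above and lists the intervals where %soft is
unusually high. The function names, the 50% threshold, and the
assumption of a 12-hour timestamp with the nine columns shown above are
all illustrative choices; other mpstat versions or locales may print
different columns.

```python
import re

# Matches an mpstat data row such as
# "10:09:12 PM  all    0.00    0.00    1.15 ..." (12-hour clock assumed).
ROW = re.compile(r"(\d{1,2}:\d{2}:\d{2} [AP]M)\s+(all|\d+)\s+(.*)")

# Column order taken from the header line in the quoted output.
COLUMNS = ["usr", "nice", "sys", "iowait", "irq", "soft", "steal", "guest", "idle"]


def parse_mpstat(text):
    """Return one dict per mpstat data row; headers and other lines are skipped."""
    rows = []
    for line in text.splitlines():
        # Drop mail quoting markers ("> ", ">> ") and surrounding whitespace.
        line = line.lstrip("> ").strip()
        match = ROW.match(line)
        if not match:
            continue
        fields = match.group(3).split()
        if len(fields) != len(COLUMNS):
            continue
        try:
            values = [float(field) for field in fields]
        except ValueError:
            continue
        row = {"time": match.group(1), "cpu": match.group(2)}
        row.update(zip(COLUMNS, values))
        rows.append(row)
    return rows


def high_softirq(rows, threshold=50.0):
    """Return the rows whose %soft is at or above the (arbitrary) threshold."""
    return [row for row in rows if row["soft"] >= threshold]


# A few rows copied from the output quoted above.
SAMPLE = """\
>> 10:09:10 PM  all    0.00    0.00   66.67    0.00    0.00   33.33    0.00    0.00    0.00
>> 10:09:11 PM  all    2.17    0.00    5.43   32.61    0.00   58.70    1.09    0.00    0.00
>> 10:09:12 PM  all    0.00    0.00    1.15    0.00    0.00   85.06    0.00    0.00   13.79
>> 10:22:16 PM  all    0.00    0.00    1.02    0.00    0.00    1.02    0.00    0.00   97.96
"""

if __name__ == "__main__":
    for row in high_softirq(parse_mpstat(SAMPLE)):
        print(row["time"], row["cpu"], row["soft"])
```

Running it against a saved mpstat log from a host with and without the
vif rate limit should make the difference in softirq time easy to see.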

 -George