[CentOS-virt] BUG: soft lockup detected on CPU#?

Tue Jan 22 19:50:21 UTC 2008
Ross S. W. Walker <rwalker at medallion.com>

Ross S. W. Walker wrote:
> 
> Brett Worth wrote:
> > 
> > Hello All.
> > 
> > I've just started looking into Xen and have a test 
> > environment in place.  I'm seeing an
> > annoying problem that I thought worthy of a post.
> > 
> > Config:
> > 
> > I have 2 x HP DL585 servers each with 4 Dual core Opterons 
> > (non-vmx) and 16GB RAM
> > configured as Xen servers.  These run CentOS 5.1 with the 
> > latest updates applied.  These
> > systems both attach to an iSCSI target which is an HP DL385 
> > running ietd and serving SAN
> > based storage.
> > 
> > I have a test VM running CentOS 5.1 also updated.
> > 
> > Problem:
> > 
> > If I run the VM on a single server everything is OK.  If I do 
> > a migrate of the VM to the
> > other server I start getting random "BUG: soft lockup 
> > detected on CPU#?" messages on the
> > VM console.  The messages seem to happen with IO but not 
> > every time.  A reboot of the VM
> > on the new server will stop these messages.
> > 
> > I've also left the VM running overnight a couple of times and 
> > when I do I find that any
> > external sessions (ssh) are hung in the morning but the 
> > console session is not.  New ssh
> > sessions can be started and seem to work.
> > 
> > After much googling it looks like the kernel messages can 
> > occur if dom0 is very busy but
> > mine is not.
> > 
> > Any suggestions?
> 
> The soft lockup is technically not a BUG.
> 
> You will see these errors if an IRQ takes more than 10 seconds
> to respond.
> 
> In your case I would take a look at your iSCSI setup and the
> time it takes to migrate the VM from one node to another along
> with SCSI reserve/release setup on the iSCSI target.
> 
> I have also been using the Xen 3.2 RPMs from xen.org on CentOS
> 5.1 with good results. VM migration may run smoother and quicker
> in Xen 3.2, but in doing so you take Xen off the reservation. If
> you're OK with that, it may fix your issues.

After seeing this same issue on my Xen 3.2 install, with NO
migration or iSCSI happening, I concluded it is probably NOT iSCSI's
fault. I researched it a little more and this is what I found:

http://docs.xensource.com/XenServer/4.0.1/guest/ch04s08.html#rhel5_limitations

XenSource does provide a repo of CentOS 5 kernels that have been
patched to fix this though:

http://updates.xensource.com/XenServer/4.0.1/centos5x/

But these seem to be woefully out of date.

I wonder if a kind soul would add XenSource's patch to the centosplus
kernel so that those rogue Xen users could benefit from this fix
until upstream decides to include it.

I suppose the centosplus patch would need to be flagged as interim,
in case it needs to be removed when upstream has its own fix.
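
For anyone who wants to check whether a patched kernel actually makes
the messages go away, a rough way to compare is to count the soft
lockup lines per CPU in the guest's console or kernel log before and
after. The sketch below assumes the message text quoted earlier in this
thread ("BUG: soft lockup detected on CPU#N"); the function name and
the sample log lines are made up for illustration and are not from a
real system.

```python
import re
from collections import Counter

# Pattern follows the message text quoted in this thread; the digits
# stand in for the "?" placeholder in the subject line.
SOFT_LOCKUP_RE = re.compile(r"BUG: soft lockup detected on CPU#(\d+)")


def count_soft_lockups(log_text):
    """Return a Counter mapping CPU number -> count of soft lockup lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = SOFT_LOCKUP_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample input, only to show the expected line shape.
    sample = "\n".join([
        "BUG: soft lockup detected on CPU#1!",
        "some unrelated kernel message",
        "BUG: soft lockup detected on CPU#0!",
        "BUG: soft lockup detected on CPU#1!",
    ])
    for cpu, n in sorted(count_soft_lockups(sample).items()):
        print(f"CPU#{cpu}: {n} soft lockup message(s)")
```

Feeding it the saved console output from before and after a kernel
change gives a simple per-CPU comparison; it says nothing about the
cause, only the frequency.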


-Ross
