[CentOS-virt] KVM instance keep crashing

Thu Oct 14 17:27:44 UTC 2010
Eric Searcy <emsearcy at gmail.com>

On Oct 14, 2010, at 1:38 AM, Poh Yong Hwang wrote:

> Hi,
> 
> I have one KVM instance (CentOS 5) that keeps crashing, and I see the following in the message log:
> 
> Oct 14 16:24:48 localhost kernel: psmouse.c: Explorer Mouse at isa0060/serio1/input0 lost synchronization, throwing 1 bytes away.
> Oct 14 16:24:49 localhost kernel: BUG: soft lockup - CPU#0 stuck for 12s! [ntpd:2363]
> Oct 14 16:24:49 localhost kernel: CPU 0:
> Oct 14 16:24:49 localhost kernel: Modules linked in: backupdriver(PU) ipv6 xfrm_nalgo crypto_api autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc talpa_pedevice(U) dm_mirror dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy virtio_balloon virtio_pci ide_cd i2c_piix4 virtio_ring 8139too cdrom 8139cp pcspkr i2c_core virtio mii serio_raw dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
> Oct 14 16:24:49 localhost kernel: Pid: 2363, comm: ntpd Tainted: P      2.6.18-194.3.1.el5 #1
[...]
> After which the instance becomes very sluggish and unresponsive. Please advise what could be the issue.

I'm no expert on kernel stuff, but I thought I'd throw in a couple of clarifying questions, since parts of your report aren't clear to me.

Is the above in /var/log/messages on the guest or on the host?

Is it always an "ntpd" process on the CPU#0 stuck/soft lockup line?  Does the soft lockup always occur after a psmouse.c warning?  (Even so, the psmouse.c warning could maybe be a symptom of the CPU being stuck, not the cause...)
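If it helps answer those questions, here is a rough Python sketch that tallies the soft lockup lines in a log by process name and CPU. The regular expression is modeled only on the one log line quoted above, and the function name count_lockups is mine, so treat it as a starting point rather than a tested tool.

```python
import re
from collections import Counter

# Pattern modeled on the quoted line:
#   "BUG: soft lockup - CPU#0 stuck for 12s! [ntpd:2363]"
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[(.+):(\d+)\]"
)


def count_lockups(lines):
    """Count soft lockup reports, keyed by (process name, CPU number)."""
    counts = Counter()
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match:
            cpu, _seconds, comm, _pid = match.groups()
            counts[(comm, int(cpu))] += 1
    return counts
```

You could feed it the open log file, e.g. count_lockups(open("/var/log/messages")), and see whether anything other than ntpd on CPU 0 shows up.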

What type of hardware is this?  I notice that it says "Tainted", and I'm assuming this refers to the kernel (as I have no idea how a userland process, ntpd, could be "tainted"!).  If so, you have a binary-distributed kernel module loaded, and you should probably try with it unloaded to see if the issue goes away.  The modules carrying flags in your "Modules linked in" line, backupdriver(PU) and talpa_pedevice(U), would be the first ones I'd look at.  It could be a machine check error instead, but I think that's less likely.  To double check, run the following on both the host and the guest:

cat /proc/sys/kernel/tainted

The value printed is a bitwise OR of taint flags, which can be checked against the flag list given in http://www.kernel.org/doc/Documentation/sysctl/kernel.txt
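To save doing the bit arithmetic by hand, here is a small Python sketch that splits the value into its individual flags. The flag table is my own partial summary of the lower bits as I understand them from the kernel documentation, so the linked kernel.txt remains the authority; the names decode_tainted and read_tainted are just ones I made up.

```python
# Partial table of taint bits (an assumption from memory; verify against
# Documentation/sysctl/kernel.txt for your kernel version).
TAINT_FLAGS = {
    1: "proprietary module was loaded",
    2: "module was force loaded",
    4: "unsafe SMP configuration",
    8: "module was force unloaded",
    16: "machine check exception occurred",
    32: "bad page referenced",
    64: "taint requested by userspace",
}


def decode_tainted(value):
    """Return descriptions for every flag bit set in the tainted value."""
    found = [desc for bit, desc in sorted(TAINT_FLAGS.items()) if value & bit]
    # Report any bits this partial table does not cover instead of hiding them.
    unknown = value & ~sum(TAINT_FLAGS)
    if unknown:
        found.append("unrecognized bits: %d" % unknown)
    return found


def read_tainted(path="/proc/sys/kernel/tainted"):
    """Read the current taint value from procfs (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())
```

Calling decode_tainted(read_tainted()) on the host and on the guest gives a readable list; a value of 0 decodes to an empty list, meaning the kernel is not tainted.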

Eric