Hi,
I have one KVM instance (CentOS 5) that keeps crashing, and I see the following in the message log:
Oct 14 16:24:48 localhost kernel: psmouse.c: Explorer Mouse at isa0060/serio1/input0 lost synchronization, throwing 1 bytes away.
Oct 14 16:24:49 localhost kernel: BUG: soft lockup - CPU#0 stuck for 12s! [ntpd:2363]
Oct 14 16:24:49 localhost kernel: CPU 0:
Oct 14 16:24:49 localhost kernel: Modules linked in: backupdriver(PU) ipv6 xfrm_nalgo crypto_api autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc talpa_pedevice(U) dm_mirror dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy virtio_balloon virtio_pci ide_cd i2c_piix4 virtio_ring 8139too cdrom 8139cp pcspkr i2c_core virtio mii serio_raw dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Oct 14 16:24:49 localhost kernel: Pid: 2363, comm: ntpd Tainted: P 2.6.18-194.3.1.el5 #1
Oct 14 16:24:49 localhost kernel: RIP: 0010:[<ffffffff80064b50>] [<ffffffff80064b50>] _spin_unlock_irqrestore+0x8/0x9
Oct 14 16:24:49 localhost kernel: RSP: 0018:ffffffff80446ee0 EFLAGS: 00000296
Oct 14 16:24:49 localhost kernel: RAX: 00000000000002fd RBX: ffff81007cb46b40 RCX: ffff81006975b978
Oct 14 16:24:49 localhost kernel: RDX: 0000000000000060 RSI: 0000000000000296 RDI: ffffffff80348e58
Oct 14 16:24:49 localhost kernel: RBP: ffffffff80446e60 R08: ffff81007cb46a70 R09: 0000000000000020
Oct 14 16:24:49 localhost kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff8005dc8e
Oct 14 16:24:49 localhost kernel: R13: 000000000000003d R14: ffffffff8007820e R15: ffffffff80446e60
Oct 14 16:24:49 localhost kernel: FS: 00002b3519a1c030(0000) GS:ffffffff803ca000(0000) knlGS:0000000000000000
Oct 14 16:25:06 localhost kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Oct 14 16:25:06 localhost kernel: CR2: 00002b4f3abac3d8 CR3: 0000000069726000 CR4: 00000000000006e0
Oct 14 16:25:06 localhost kernel:
Oct 14 16:25:06 localhost kernel: Call Trace:
Oct 14 16:25:06 localhost kernel: <IRQ> [<ffffffff80209e43>] i8042_interrupt+0x92/0x1e9
Oct 14 16:25:06 localhost kernel: [<ffffffff80010bd1>] handle_IRQ_event+0x51/0xa6
Oct 14 16:25:07 localhost kernel: [<ffffffff800baec9>] __do_IRQ+0xa4/0x103
Oct 14 16:25:07 localhost kernel: [<ffffffff8006ca11>] do_IRQ+0xe7/0xf5
Oct 14 16:25:07 localhost kernel: [<ffffffff8005d615>] ret_from_intr+0x0/0xa
Oct 14 16:25:07 localhost kernel: <EOI> [<ffffffff8002f73b>] dev_queue_xmit+0x0/0x271
Oct 14 16:25:07 localhost kernel: [<ffffffff881987c6>] :8139cp:cp_start_xmit+0x4ef/0x511
Oct 14 16:25:07 localhost kernel: [<ffffffff8819842d>] :8139cp:cp_start_xmit+0x156/0x511
Oct 14 16:25:07 localhost kernel: [<ffffffff8022eede>] dev_hard_start_xmit+0x1b7/0x28a
Oct 14 16:25:08 localhost kernel: [<ffffffff8023f0b8>] __qdisc_run+0x136/0x1f9
Oct 14 16:25:08 localhost kernel: [<ffffffff8002f88b>] dev_queue_xmit+0x150/0x271
Oct 14 16:25:08 localhost kernel: [<ffffffff80031f87>] ip_output+0x2ae/0x2dd
Oct 14 16:25:08 localhost kernel: [<ffffffff8024d651>] ip_push_pending_frames+0x37d/0x465
Oct 14 16:25:08 localhost kernel: [<ffffffff8025daad>] udp_push_pending_frames+0x21e/0x243
Oct 14 16:25:08 localhost kernel: [<ffffffff8005297d>] udp_sendmsg+0x4d8/0x5ef
Oct 14 16:25:08 localhost kernel: [<ffffffff80055336>] sock_sendmsg+0xf8/0x14a
Oct 14 16:25:09 localhost kernel: [<ffffffff800a0abe>] autoremove_wake_function+0x0/0x2e
Oct 14 16:25:09 localhost kernel: [<ffffffff80098f3b>] __dequeue_signal+0x12d/0x193
Oct 14 16:25:09 localhost kernel: [<ffffffff8009899d>] recalc_sigpending+0xe/0x25
Oct 14 16:25:09 localhost kernel: [<ffffffff8009a0db>] dequeue_signal+0x47/0xcd
Oct 14 16:25:09 localhost kernel: [<ffffffff80070b89>] init_fpu+0x62/0x7f
Oct 14 16:25:09 localhost kernel: [<ffffffff8006beee>] math_state_restore+0x23/0x4c
Oct 14 16:25:09 localhost kernel: [<ffffffff8005dde9>] error_exit+0x0/0x84
Oct 14 16:25:09 localhost kernel: [<ffffffff802264ac>] sys_sendto+0x11c/0x14f
Oct 14 16:25:10 localhost kernel: [<ffffffff8006b011>] __switch_to+0xfe/0x22f
Oct 14 16:25:10 localhost kernel: [<ffffffff80062ff8>] thread_return+0x62/0xfe
Oct 14 16:25:10 localhost kernel: [<ffffffff80043b84>] sys_rt_sigreturn+0x323/0x356
Oct 14 16:25:10 localhost kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Oct 14 16:25:10 localhost kernel:
After that, the instance becomes very sluggish and unresponsive. Please advise what the issue could be.
Thanks
YongSan
On Oct 14, 2010, at 1:38 AM, Poh Yong Hwang wrote:
Hi,
I have one KVM instance (CentOS 5) that keeps crashing, and I see the following in the message log:
Oct 14 16:24:48 localhost kernel: psmouse.c: Explorer Mouse at isa0060/serio1/input0 lost synchronization, throwing 1 bytes away.
Oct 14 16:24:49 localhost kernel: BUG: soft lockup - CPU#0 stuck for 12s! [ntpd:2363]
Oct 14 16:24:49 localhost kernel: CPU 0:
Oct 14 16:24:49 localhost kernel: Modules linked in: backupdriver(PU) ipv6 xfrm_nalgo crypto_api autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc talpa_pedevice(U) dm_mirror dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy virtio_balloon virtio_pci ide_cd i2c_piix4 virtio_ring 8139too cdrom 8139cp pcspkr i2c_core virtio mii serio_raw dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Oct 14 16:24:49 localhost kernel: Pid: 2363, comm: ntpd Tainted: P 2.6.18-194.3.1.el5 #1
[...]
After that, the instance becomes very sluggish and unresponsive. Please advise what the issue could be.
I'm no expert on kernel stuff, but I thought I'd throw in a couple of suggested points of clarification on your request, since the above is not clear to me.
Is the above in /var/log/messages on the guest or the host?
Is it always an "ntpd" process on the CPU#0 stuck/soft lockup line? Does the soft lockup always occur after a psmouse.c warning? (Even so, the psmouse.c warning could maybe be a symptom of the CPU being stuck, not the cause...)
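One way to answer the ntpd question is to tally the process names that appear on the soft lockup lines. This is only a sketch, assuming the syslog line format shown in the log above (one entry per line, ending in "[name:pid]"); the function name lockup_procs is made up for illustration:

```shell
# Count which processes appear on "BUG: soft lockup" lines.
# Reads syslog text on stdin and prints "count name" pairs.
lockup_procs() {
    grep 'BUG: soft lockup' |
        # Keep only the process name from the trailing "[name:pid]".
        sed -n 's/.*\[\([^]:]*\):[0-9]*][[:space:]]*$/\1/p' |
        sort | uniq -c
}

# Usage (on the guest): lockup_procs < /var/log/messages
```

If more than one name shows up in the output, ntpd is probably just the process that happened to be running when the CPU got stuck.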
What type of hardware is this? I notice that it says "Tainted", and I'm assuming this refers to the kernel (I have no idea how a userland process such as ntpd could be "tainted"!). If so, you have a binary-distributed kernel module loaded, and you should probably try with it unloaded to see if the issue goes away. It could be a machine check error, but I think that's less likely. To double-check, run the following on both the host and the guest:
cat /proc/sys/kernel/tainted
This ORed value can be checked against the flags given in http://www.kernel.org/doc/Documentation/sysctl/kernel.txt
Eric
Hi,
The message log belongs to the guest, which becomes unresponsive from time to time. I ran the command, and it reports the same value on both the host and the guest:
[root@localhost conf]# cat /proc/sys/kernel/tainted
65
YongSan
On Fri, Oct 15, 2010 at 1:27 AM, Eric Searcy emsearcy@gmail.com wrote:
[...]
_______________________________________________
CentOS-virt mailing list
CentOS-virt@centos.org
http://lists.centos.org/mailman/listinfo/centos-virt
On Fri, Oct 15, 2010 at 2:57 AM, Poh Yong Hwang yongsan@gmail.com wrote:
Hi,
The message log belongs to the guest, which becomes unresponsive from time to time. I ran the command, and it reports the same value on both the host and the guest:
[root@localhost conf]# cat /proc/sys/kernel/tainted
65
65 = 1 + 64
1 - A module with a non-GPL license has been loaded, this includes modules with no license.
64 - The user has asked that the system be marked "tainted". This could be because they are running software that directly modifies the hardware, or for other reasons.
So, you won't be able to get any help from kernel people (probably) unless you can reproduce the problem without any binary kernel modules.
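The 65 = 1 + 64 breakdown can be reproduced mechanically for any value read from /proc/sys/kernel/tainted. Here is a minimal sketch; taint_bits is a hypothetical helper name, and it only splits the number into the powers of two that are set. What each flag means still has to be looked up in the kernel.txt file for the kernel you are running:

```shell
# Split a taint value into the individual power-of-two flags that are set.
taint_bits() {
    value=$1
    bit=1
    out=""
    while [ "$value" -gt 0 ]; do
        # If the lowest bit is set, record the flag it stands for.
        if [ $((value & 1)) -eq 1 ]; then
            out="${out:+$out }$bit"
        fi
        value=$((value >> 1))
        bit=$((bit * 2))
    done
    printf '%s\n' "$out"
}

taint_bits 65    # prints: 1 64
# On a live system: taint_bits "$(cat /proc/sys/kernel/tainted)"
```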
http://www.linuxfoundation.org/collaborate/workgroups/technical-advisory-boa... http://www.linuxfoundation.org/collaborate/workgroups/technical-advisory-boa...
Look up some of GregKH's keynote addresses for more background.
Eric
On Sat, Oct 16, 2010 at 3:37 PM, Eric Searcy emsearcy@gmail.com wrote:
On Fri, Oct 15, 2010 at 2:57 AM, Poh Yong Hwang yongsan@gmail.com wrote:
Hi,
The message log belongs to the guest, which becomes unresponsive from time to time. I ran the command, and it reports the same value on both the host and the guest:
[root@localhost conf]# cat /proc/sys/kernel/tainted
65
65 = 1 + 64
1 - A module with a non-GPL license has been loaded, this includes modules with no license.
64 - The user has asked that the system be marked "tainted". This could be because they are running software that directly modifies the hardware, or for other reasons.
So, you won't be able to get any help from kernel people (probably) unless you can reproduce the problem without any binary kernel modules.
The OP's kernel is apparently "Tainted" as seen on this line:
Oct 14 16:24:49 localhost kernel: Pid: 2363, comm: ntpd Tainted: P 2.6.18-194.3.1.el5 #1
However, what is strange is (now this is going to be off-topic here) that systems loaded with kmod-kvm show:
$ cat /proc/sys/kernel/tainted
64
$ rpm -qi kmod-kvm | grep License
Size : 4614945 License: GPLv2
Despite the fact that the kvm module is GPL'd, the value of tainted is non-zero. This kernel is supposed to be NOT tainted. Could someone using KVM confirm this?
Akemi
On Oct 16, 2010, at 4:47 PM, Akemi Yagi wrote: [trim]
However, what is strange is (now this is going to be off-topic here) that systems loaded with kmod-kvm show:
$ cat /proc/sys/kernel/tainted
64
$ rpm -qi kmod-kvm | grep License
Size : 4614945 License: GPLv2
Despite the fact that the kvm module is GPL'd, the value of tainted is non-zero. This kernel is supposed to be NOT tainted. Could someone using KVM confirm this?
Following that thought: I hadn't thought to check, but yes, my KVM systems have it set to 64 as well. It's probably not due to licensing, since that would set the 1 bit, which neither of us has (the OP does).
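Whether a given flag is present in a taint value can be checked with a bitwise AND. A small sketch, where has_taint_flag is a made-up helper name and the values are the ones reported in this thread:

```shell
# Succeed (exit status 0) if flag $2 is set in taint value $1.
has_taint_flag() {
    [ $(( $1 & $2 )) -ne 0 ]
}

has_taint_flag 64 1 || echo "64: flag 1 (non-GPL module) is not set"
has_taint_flag 65 1 && echo "65: flag 1 (non-GPL module) is set"
```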
Note: one of my KVM systems that shows 64 has the HP ProLiant Support Pack installed, which includes some kernel modules (GPLed though they be). The rest are Dell machines with OMSA packages installed, but I don't think OMSA loads any kernel modules (though 64 does appear to be related to userland, so PSP/OMSA could be related). I guess a third person with KVM and only distribution-based hardware support could chime in...
Also, there wasn't a taint warning in the kernel traces on my machine (I just had some kernel errors last week due to a perennial problem of mine: RHEL Cluster Suite), so "64" must be less significant (i.e. I still think the OP is going to need to get rid of the "1" flag before going to LKML).
Maybe there wouldn't be a taint flag in this kind of message anyway, but here it is. And remember, this is from a machine that has a tainted value of 64:
Oct 15 07:57:44 ha7 kernel: INFO: task clvmd:6804 blocked for more than 120 seconds.
Oct 15 07:57:44 ha7 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 15 07:57:44 ha7 kernel: clvmd D ffff810001003420 0 6804 1 6488 (NOTLB)
Oct 15 07:57:44 ha7 kernel: ffff81033dfefdb8 0000000000000082 000a81000000000a 0000000000000202
Oct 15 07:57:44 ha7 kernel: 00000110000000f5 0000000000000009 ffff81012d6fc040 ffff81010b734080
Oct 15 07:57:44 ha7 kernel: 00089588d2b6e851 000000000000adc7 ffff81012d6fc228 0000000100000000
Oct 15 07:57:44 ha7 kernel: Call Trace:
Oct 15 07:57:44 ha7 kernel: [<ffffffff800646ac>] __down_read+0x7a/0x92
Oct 15 07:57:44 ha7 kernel: [<ffffffff88505468>] :dlm:dlm_user_request+0x2d/0x175
Oct 15 07:57:44 ha7 kernel: [<ffffffff8008c7d2>] deactivate_task+0x28/0x5f
Oct 15 07:57:44 ha7 kernel: [<ffffffff8012abc9>] file_has_perm+0x94/0xa3
Oct 15 07:57:44 ha7 kernel: [<ffffffff8850c707>] :dlm:device_write+0x2f5/0x5e5
Oct 15 07:57:44 ha7 kernel: [<ffffffff80016a17>] vfs_write+0xce/0x174
Oct 15 07:57:44 ha7 kernel: [<ffffffff800988b7>] recalc_sigpending+0xe/0x25
Oct 15 07:57:44 ha7 kernel: [<ffffffff800172e4>] sys_write+0x45/0x6e
Oct 15 07:57:44 ha7 kernel: [<ffffffff8005d116>] system_call+0x7e/0x83
Further thought: I have other machines with HP PSP and Dell OMSA, why not check them?
HP PSP, no virt: 0
Dell OMSA, no virt: 0
Dell OMSA, Xen Dom0: 0
HP PSP, VMware Server 2: 66
So I guess 64 is from kmod-kvm. (It's hard to search online or in the code for a "64" taint flag when "64 bit" is already all over the place...)
Eric