Hi All, I've been trying to trace the cause of a hang on a 5.6 i386 system.
After running for almost a year, it hung last week, when I plugged in a screen it was blank, machine was unresponsive to the keyboard, over the network ssh and other daemons didn't respond but the thing has two network cards and routing from one to the other was still working. So the kernel was up and I suspected a dying disk. But smartctl -a revealed nothing untoward. Didn't see anything significant in the log files at the time. Logging had stopped when the machine hung.
The machine rebooted normally and has run for almost a week and hung again with the same symptoms. Again rebooted, nothing untoward in the logs and smartctl still OK. But shortly after I left site this was logged in /var/log/messages
Mar 27 16:52:04 cjcsrv kernel: INFO: task hald-addon-stor:2179 blocked for more than 120 seconds. Mar 27 16:52:04 cjcsrv kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Mar 27 16:52:04 cjcsrv kernel: hald-addon-st D 00000197 2552 2179 2160 2170 (NOTLB) Mar 27 16:52:04 cjcsrv kernel: d895bbbc 00000086 a3374d80 00000197 e89436f0 d895bbbc c084e894 0000000a Mar 27 16:52:04 cjcsrv kernel: d891aaa0 a3468fc0 00000197 000f4240 00000000 d891abac c1506800 e43383c0 Mar 27 16:52:04 cjcsrv kernel: 00000000 00000086 00000000 e7ad2b80 c061f7ca 00000000 c1506844 d895bc0c Mar 27 16:52:04 cjcsrv kernel: Call Trace: Mar 27 16:52:04 cjcsrv kernel: [<e89436f0>] cdrom_do_pc_continuation+0x0/0x2c [ide_cd] Mar 27 16:52:04 cjcsrv kernel: [<c061f7ca>] schedule+0x9c6/0xa4f Mar 27 16:52:04 cjcsrv kernel: [<c061f905>] wait_for_completion+0x6b/0x8f Mar 27 16:52:04 cjcsrv kernel: [<c041f80f>] default_wake_function+0x0/0xc Mar 27 16:52:04 cjcsrv kernel: [<c0575b1b>] ide_do_drive_cmd+0xd7/0xfa Mar 27 16:52:04 cjcsrv kernel: [<e894071c>] cdrom_queue_packet_command+0x35/0xbc [ide_cd] Mar 27 16:52:05 cjcsrv kernel: [<c0488406>] poll_freewait+0x18/0x4c Mar 27 16:52:05 cjcsrv kernel: [<c048874e>] do_sys_poll+0x314/0x339 Mar 27 16:52:05 cjcsrv kernel: [<e8940c16>] cdrom_check_status+0x52/0x5d [ide_cd] Mar 27 16:52:05 cjcsrv kernel: [<c04e29ee>] blk_end_sync_rq+0x0/0x1d Mar 27 16:52:05 cjcsrv kernel: [<e8940c3b>] ide_cdrom_check_media_change_real+0x1a/0x34 [ide_cd] Mar 27 16:52:05 cjcsrv kernel: [<e88da06e>] media_changed+0x40/0x6e [cdrom] Mar 27 16:52:05 cjcsrv kernel: [<c047de20>] check_disk_change+0x13/0x3b Mar 27 16:52:05 cjcsrv kernel: [<e88ddfe4>] cdrom_open+0x833/0x876 [cdrom] Mar 27 16:52:05 cjcsrv kernel: [<c04c95c3>] avc_has_perm+0x3c/0x46 Mar 27 16:52:05 cjcsrv kernel: [<c04c95c3>] avc_has_perm+0x3c/0x46 Mar 27 16:52:05 cjcsrv kernel: [<c048c42f>] __d_lookup+0x98/0xdb Mar 27 16:52:05 cjcsrv kernel: [<c04c95c3>] avc_has_perm+0x3c/0x46 Mar 27 16:52:05 cjcsrv kernel: [<c04c9c29>] inode_has_perm+0x54/0x5c Mar 27 16:52:05 cjcsrv kernel: [<c04eef8a>] kobject_get+0xf/0x13 Mar 27 16:52:05 cjcsrv kernel: [<c04e5e51>] get_disk+0x35/0x6e Mar 27 16:52:05 cjcsrv kernel: [<c04e5e91>] exact_lock+0x7/0xd Mar 27 16:52:05 cjcsrv kernel: [<c056291d>] kobj_lookup+0x10d/0x168 Mar 27 16:52:05 cjcsrv kernel: [<e8941042>] idecd_open+0x7b/0xa8 [ide_cd] Mar 27 16:52:05 cjcsrv kernel: [<c047e448>] do_open+0x89/0x2cc Mar 27 16:52:05 cjcsrv kernel: [<c047e7f7>] blkdev_open+0x0/0x44 Mar 27 16:52:05 cjcsrv kernel: [<c047e813>] blkdev_open+0x1c/0x44 Mar 27 16:52:05 cjcsrv kernel: [<c0475937>] __dentry_open+0xc7/0x1ab Mar 27 16:52:05 cjcsrv kernel: [<c0475a7f>] nameidata_to_filp+0x19/0x28 Mar 27 16:52:05 cjcsrv kernel: [<c0475ab9>] do_filp_open+0x2b/0x31 Mar 27 16:52:05 cjcsrv kernel: [<c0475afd>] do_sys_open+0x3e/0xae Mar 27 16:52:05 cjcsrv kernel: [<c0475b9a>] sys_open+0x16/0x18 Mar 27 16:52:05 cjcsrv kernel: [<c0404f4b>] syscall_call+0x7/0xb Mar 27 16:52:05 cjcsrv kernel: ======================= Mar 27 16:52:19 cjcsrv kernel: ide1: reset timed-out, status=0xd0
ide1 has a CD attached. Not essential, the CD could be unplugged. ide0 has the hard disk, hda, attached. Looking back through the logs there is another of these recorded before the previous hang. Any clues as to what this is telling me - other than something crashed.
Thanks
Ken
On 03/27/2012 11:59 PM, Ken Smith wrote:
Hi All, I've been trying to trace the cause of a hang on a 5.6 i386 system.
After running for almost a year, it hung last week, when I plugged in a screen it was blank, machine was unresponsive to the keyboard, over the network ssh and other daemons didn't respond but the thing has two network cards and routing from one to the other was still working. So the kernel was up and I suspected a dying disk. But smartctl -a revealed nothing untoward. Didn't see anything significant in the log files at the time. Logging had stopped when the machine hung.
The machine rebooted normally and has run for almost a week and hung again with the same symptoms. Again rebooted, nothing untoward in the logs and smartctl still OK. But shortly after I left site this was logged in /var/log/messages
Mar 27 16:52:04 cjcsrv kernel: INFO: task hald-addon-stor:2179 blocked for more than 120 seconds. Mar 27 16:52:04 cjcsrv kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Mar 27 16:52:04 cjcsrv kernel: hald-addon-st D 00000197 2552 2179 2160 2170 (NOTLB) Mar 27 16:52:04 cjcsrv kernel: d895bbbc 00000086 a3374d80 00000197 e89436f0 d895bbbc c084e894 0000000a Mar 27 16:52:04 cjcsrv kernel: d891aaa0 a3468fc0 00000197 000f4240 00000000 d891abac c1506800 e43383c0 Mar 27 16:52:04 cjcsrv kernel: 00000000 00000086 00000000 e7ad2b80 c061f7ca 00000000 c1506844 d895bc0c Mar 27 16:52:04 cjcsrv kernel: Call Trace: Mar 27 16:52:04 cjcsrv kernel: [<e89436f0>] cdrom_do_pc_continuation+0x0/0x2c [ide_cd] Mar 27 16:52:04 cjcsrv kernel: [<c061f7ca>] schedule+0x9c6/0xa4f Mar 27 16:52:04 cjcsrv kernel: [<c061f905>] wait_for_completion+0x6b/0x8f Mar 27 16:52:04 cjcsrv kernel: [<c041f80f>] default_wake_function+0x0/0xc Mar 27 16:52:04 cjcsrv kernel: [<c0575b1b>] ide_do_drive_cmd+0xd7/0xfa Mar 27 16:52:04 cjcsrv kernel: [<e894071c>] cdrom_queue_packet_command+0x35/0xbc [ide_cd] Mar 27 16:52:05 cjcsrv kernel: [<c0488406>] poll_freewait+0x18/0x4c Mar 27 16:52:05 cjcsrv kernel: [<c048874e>] do_sys_poll+0x314/0x339 Mar 27 16:52:05 cjcsrv kernel: [<e8940c16>] cdrom_check_status+0x52/0x5d [ide_cd] Mar 27 16:52:05 cjcsrv kernel: [<c04e29ee>] blk_end_sync_rq+0x0/0x1d Mar 27 16:52:05 cjcsrv kernel: [<e8940c3b>] ide_cdrom_check_media_change_real+0x1a/0x34 [ide_cd] Mar 27 16:52:05 cjcsrv kernel: [<e88da06e>] media_changed+0x40/0x6e [cdrom] Mar 27 16:52:05 cjcsrv kernel: [<c047de20>] check_disk_change+0x13/0x3b Mar 27 16:52:05 cjcsrv kernel: [<e88ddfe4>] cdrom_open+0x833/0x876 [cdrom] Mar 27 16:52:05 cjcsrv kernel: [<c04c95c3>] avc_has_perm+0x3c/0x46 Mar 27 16:52:05 cjcsrv kernel: [<c04c95c3>] avc_has_perm+0x3c/0x46 Mar 27 16:52:05 cjcsrv kernel: [<c048c42f>] __d_lookup+0x98/0xdb Mar 27 16:52:05 cjcsrv kernel: [<c04c95c3>] avc_has_perm+0x3c/0x46 Mar 27 16:52:05 cjcsrv kernel: [<c04c9c29>] inode_has_perm+0x54/0x5c Mar 27 16:52:05 cjcsrv kernel: [<c04eef8a>] kobject_get+0xf/0x13 Mar 27 16:52:05 cjcsrv kernel: [<c04e5e51>] get_disk+0x35/0x6e Mar 27 16:52:05 cjcsrv kernel: [<c04e5e91>] exact_lock+0x7/0xd Mar 27 16:52:05 cjcsrv kernel: [<c056291d>] kobj_lookup+0x10d/0x168 Mar 27 16:52:05 cjcsrv kernel: [<e8941042>] idecd_open+0x7b/0xa8 [ide_cd] Mar 27 16:52:05 cjcsrv kernel: [<c047e448>] do_open+0x89/0x2cc Mar 27 16:52:05 cjcsrv kernel: [<c047e7f7>] blkdev_open+0x0/0x44 Mar 27 16:52:05 cjcsrv kernel: [<c047e813>] blkdev_open+0x1c/0x44 Mar 27 16:52:05 cjcsrv kernel: [<c0475937>] __dentry_open+0xc7/0x1ab Mar 27 16:52:05 cjcsrv kernel: [<c0475a7f>] nameidata_to_filp+0x19/0x28 Mar 27 16:52:05 cjcsrv kernel: [<c0475ab9>] do_filp_open+0x2b/0x31 Mar 27 16:52:05 cjcsrv kernel: [<c0475afd>] do_sys_open+0x3e/0xae Mar 27 16:52:05 cjcsrv kernel: [<c0475b9a>] sys_open+0x16/0x18 Mar 27 16:52:05 cjcsrv kernel: [<c0404f4b>] syscall_call+0x7/0xb Mar 27 16:52:05 cjcsrv kernel: ======================= Mar 27 16:52:19 cjcsrv kernel: ide1: reset timed-out, status=0xd0
ide1 has a CD attached. Not essential, the CD could be unplugged. ide0 has the hard disk, hda, attached. Looking back through the logs there is another of these recorded before the previous hang. Any clues as to what this is telling me - other than something crashed.
The process "hald-addon-storage" got stuck while trying to access the cd ("ide-cd"). It probably tried to poll the drive to check if there was a cd inserted. Unplugging the drive should do the trick although you could try to disable the polling by creating a file "/etc/hal/fdi/policy/99-custom.fdi" with the following content:
<?xml version="1.0" encoding="UTF-8"?> <deviceinfo version="0.2"> <device> <match key="storage.removable" bool="true"> <remove key="info.addons" type="strlist">hald-addon-storage</remove> </match> </device> </deviceinfo>
After doing so restart hald or reboot. hald should no longer poll the drive after this.
Regards, Dennis