Hi All.
I have a server which hanged two times because we could not connect to the box, it was not responding. In /var/log/messages I saw:
Apr 11 10:13:29 server kernel: INFO: task imap:5855 blocked for more than 120 seconds. Apr 11 10:13:29 server kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Apr 11 10:13:29 server kernel: imap D 000124D0 1716 5855 14393 6320 5839 (NOTLB) Apr 11 10:13:29 server kernel: d1adeed0 00000082 632f304d 000124d0 000124cf 0000000e 00000000 00000009 Apr 11 10:13:29 server kernel: c750b000 632f379a 000124d0 0000074d 00000000 c750b10c c17f6d00 d2637580 Apr 11 10:13:29 server kernel: f75d3e68 00000000 00000000 d1adeecc c041ec40 00000000 d1adeed8 f75d3e50 Apr 11 10:13:29 server kernel: Call Trace: Apr 11 10:13:29 server kernel: [<c041ec40>] __wake_up+0x2a/0x3d Apr 11 10:13:29 server kernel: [<c0436d23>] prepare_to_wait+0x24/0x46 Apr 11 10:13:29 server kernel: [<f885f1e2>] log_wait_commit+0x80/0xc7 [jbd] Apr 11 10:13:29 server kernel: [<c0436bd7>] autoremove_wake_function+0x0/0x2d Apr 11 10:13:29 server kernel: [<f885a661>] journal_stop+0x195/0x1ba [jbd] Apr 11 10:13:29 server kernel: [<c0495caa>] __writeback_single_inode+0x197/0x2a3 Apr 11 10:13:29 server kernel: [<c045ee18>] do_writepages+0x2b/0x32 Apr 11 10:13:29 server kernel: [<c045a96c>] __filemap_fdatawrite_range+0x66/0x72 Apr 11 10:13:29 server kernel: [<c0496346>] sync_inode+0x19/0x24 Apr 11 10:13:29 server kernel: [<f8983009>] ext3_sync_file+0xb1/0xdc [ext3] Apr 11 10:13:29 server kernel: [<c047956c>] do_fsync+0x41/0x83 Apr 11 10:13:29 server kernel: [<c04795cb>] __do_fsync+0x1d/0x2b Apr 11 10:13:29 server kernel: [<c0404f4b>] syscall_call+0x7/0xb Apr 11 10:13:29 server kernel: ======================= Apr 11 10:13:29 server kernel: INFO: task imap:6882 blocked for more than 120 seconds. Apr 11 10:13:29 server kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Apr 11 10:13:29 server kernel: imap D 000124D0 1772 6882 14393 6999 6053 (NOTLB) Apr 11 10:13:29 server kernel: cf1d0ed0 00000082 632f379a 000124d0 000124cc 0000000e 00000000 00000009 Apr 11 10:13:29 server kernel: f52e9000 632f3f0a 000124d0 00000770 00000000 f52e910c c17f6d00 cbe44740 Apr 11 10:13:29 server kernel: f75d3e68 00000000 00000000 cf1d0ecc c041ec40 00000000 cf1d0ed8 f75d3e50 Apr 11 10:13:29 server kernel: Call Trace: Apr 11 10:13:29 server kernel: [<c041ec40>] __wake_up+0x2a/0x3d Apr 11 10:13:29 server kernel: [<c0436d23>] prepare_to_wait+0x24/0x46 Apr 11 10:13:31 server kernel: [<f885f1e2>] log_wait_commit+0x80/0xc7 [jbd] Apr 11 10:13:31 server kernel: [<c0436bd7>] autoremove_wake_function+0x0/0x2d Apr 11 10:13:31 server kernel: [<f885a661>] journal_stop+0x195/0x1ba [jbd] Apr 11 10:13:31 server kernel: [<c0495caa>] __writeback_single_inode+0x197/0x2a3 Apr 11 10:13:31 server kernel: [<c045ee18>] do_writepages+0x2b/0x32 Apr 11 10:13:31 server kernel: [<c045a96c>] __filemap_fdatawrite_range+0x66/0x72 Apr 11 10:13:31 server kernel: [<c0496346>] sync_inode+0x19/0x24 Apr 11 10:13:31 server kernel: [<f8983009>] ext3_sync_file+0xb1/0xdc [ext3] Apr 11 10:13:31 server kernel: [<c047956c>] do_fsync+0x41/0x83 Apr 11 10:13:31 server kernel: [<c04795cb>] __do_fsync+0x1d/0x2b Apr 11 10:13:31 server kernel: [<c0404f4b>] syscall_call+0x7/0xb Apr 11 10:13:31 server kernel: =======================
I have found: http://www.nico.schottelius.org/blog/reboot-linux-if-task-blocked-for-more-t..., so I think that the processes took all resources of the machine.
What should be done to troubleshoot the situation? I have information that probably imap:5855 and imap:6882 caused the problem. How the output:
Apr 11 10:13:29 server kernel: imap D 000124D0 1772 6882 14393 6999 6053 (NOTLB) Apr 11 10:13:29 server kernel: cf1d0ed0 00000082 632f379a 000124d0 000124cc 0000000e 00000000 00000009 Apr 11 10:13:29 server kernel: f52e9000 632f3f0a 000124d0 00000770 00000000 f52e910c c17f6d00 cbe44740 Apr 11 10:13:29 server kernel: f75d3e68 00000000 00000000 cf1d0ecc c041ec40 00000000 cf1d0ed8 f75d3e50 Apr 11 10:13:29 server kernel: Call Trace: Apr 11 10:13:29 server kernel: [<c041ec40>] __wake_up+0x2a/0x3d Apr 11 10:13:29 server kernel: [<c0436d23>] prepare_to_wait+0x24/0x46 Apr 11 10:13:31 server kernel: [<f885f1e2>] log_wait_commit+0x80/0xc7 [jbd] Apr 11 10:13:31 server kernel: [<c0436bd7>] autoremove_wake_function+0x0/0x2d Apr 11 10:13:31 server kernel: [<f885a661>] journal_stop+0x195/0x1ba [jbd] Apr 11 10:13:31 server kernel: [<c0495caa>] __writeback_single_inode+0x197/0x2a3 Apr 11 10:13:31 server kernel: [<c045ee18>] do_writepages+0x2b/0x32 Apr 11 10:13:31 server kernel: [<c045a96c>] __filemap_fdatawrite_range+0x66/0x72 Apr 11 10:13:31 server kernel: [<c0496346>] sync_inode+0x19/0x24 Apr 11 10:13:31 server kernel: [<f8983009>] ext3_sync_file+0xb1/0xdc [ext3] Apr 11 10:13:31 server kernel: [<c047956c>] do_fsync+0x41/0x83 Apr 11 10:13:31 server kernel: [<c04795cb>] __do_fsync+0x1d/0x2b Apr 11 10:13:31 server kernel: [<c0404f4b>] syscall_call+0x7/0xb should be read? Any howto?
Best regards, Rafal Radecki.
The host is: Linux server 2.6.18-238.19.1.el5 #1 SMP Fri Jul 15 07:32:29 EDT 2011 i686 i686 i386 GNU/Linux CentOS release 5.6 (Final) dovecot 1.2.16
Best regards, R.
W dniu 11 kwietnia 2012 11:48 użytkownik Rafał Radecki < radecki.rafal@gmail.com> napisał:
Hi All.
I have a server which hanged two times because we could not connect to the box, it was not responding. In /var/log/messages I saw:
Apr 11 10:13:29 server kernel: INFO: task imap:5855 blocked for more than 120 seconds. Apr 11 10:13:29 server kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Apr 11 10:13:29 server kernel: imap D 000124D0 1716 5855 14393 6320 5839 (NOTLB) Apr 11 10:13:29 server kernel: d1adeed0 00000082 632f304d 000124d0 000124cf 0000000e 00000000 00000009 Apr 11 10:13:29 server kernel: c750b000 632f379a 000124d0 0000074d 00000000 c750b10c c17f6d00 d2637580 Apr 11 10:13:29 server kernel: f75d3e68 00000000 00000000 d1adeecc c041ec40 00000000 d1adeed8 f75d3e50 Apr 11 10:13:29 server kernel: Call Trace: Apr 11 10:13:29 server kernel: [<c041ec40>] __wake_up+0x2a/0x3d Apr 11 10:13:29 server kernel: [<c0436d23>] prepare_to_wait+0x24/0x46 Apr 11 10:13:29 server kernel: [<f885f1e2>] log_wait_commit+0x80/0xc7 [jbd] Apr 11 10:13:29 server kernel: [<c0436bd7>] autoremove_wake_function+0x0/0x2d Apr 11 10:13:29 server kernel: [<f885a661>] journal_stop+0x195/0x1ba [jbd] Apr 11 10:13:29 server kernel: [<c0495caa>] __writeback_single_inode+0x197/0x2a3 Apr 11 10:13:29 server kernel: [<c045ee18>] do_writepages+0x2b/0x32 Apr 11 10:13:29 server kernel: [<c045a96c>] __filemap_fdatawrite_range+0x66/0x72 Apr 11 10:13:29 server kernel: [<c0496346>] sync_inode+0x19/0x24 Apr 11 10:13:29 server kernel: [<f8983009>] ext3_sync_file+0xb1/0xdc [ext3] Apr 11 10:13:29 server kernel: [<c047956c>] do_fsync+0x41/0x83 Apr 11 10:13:29 server kernel: [<c04795cb>] __do_fsync+0x1d/0x2b Apr 11 10:13:29 server kernel: [<c0404f4b>] syscall_call+0x7/0xb Apr 11 10:13:29 server kernel: ======================= Apr 11 10:13:29 server kernel: INFO: task imap:6882 blocked for more than 120 seconds. Apr 11 10:13:29 server kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Apr 11 10:13:29 server kernel: imap D 000124D0 1772 6882 14393 6999 6053 (NOTLB) Apr 11 10:13:29 server kernel: cf1d0ed0 00000082 632f379a 000124d0 000124cc 0000000e 00000000 00000009 Apr 11 10:13:29 server kernel: f52e9000 632f3f0a 000124d0 00000770 00000000 f52e910c c17f6d00 cbe44740 Apr 11 10:13:29 server kernel: f75d3e68 00000000 00000000 cf1d0ecc c041ec40 00000000 cf1d0ed8 f75d3e50 Apr 11 10:13:29 server kernel: Call Trace: Apr 11 10:13:29 server kernel: [<c041ec40>] __wake_up+0x2a/0x3d Apr 11 10:13:29 server kernel: [<c0436d23>] prepare_to_wait+0x24/0x46 Apr 11 10:13:31 server kernel: [<f885f1e2>] log_wait_commit+0x80/0xc7 [jbd] Apr 11 10:13:31 server kernel: [<c0436bd7>] autoremove_wake_function+0x0/0x2d Apr 11 10:13:31 server kernel: [<f885a661>] journal_stop+0x195/0x1ba [jbd] Apr 11 10:13:31 server kernel: [<c0495caa>] __writeback_single_inode+0x197/0x2a3 Apr 11 10:13:31 server kernel: [<c045ee18>] do_writepages+0x2b/0x32 Apr 11 10:13:31 server kernel: [<c045a96c>] __filemap_fdatawrite_range+0x66/0x72 Apr 11 10:13:31 server kernel: [<c0496346>] sync_inode+0x19/0x24 Apr 11 10:13:31 server kernel: [<f8983009>] ext3_sync_file+0xb1/0xdc [ext3] Apr 11 10:13:31 server kernel: [<c047956c>] do_fsync+0x41/0x83 Apr 11 10:13:31 server kernel: [<c04795cb>] __do_fsync+0x1d/0x2b Apr 11 10:13:31 server kernel: [<c0404f4b>] syscall_call+0x7/0xb Apr 11 10:13:31 server kernel: =======================
I have found: http://www.nico.schottelius.org/blog/reboot-linux-if-task-blocked-for-more-t..., so I think that the processes took all resources of the machine.
What should be done to troubleshoot the situation? I have information that probably imap:5855 and imap:6882 caused the problem. How the output:
Apr 11 10:13:29 server kernel: imap D 000124D0 1772 6882 14393 6999 6053 (NOTLB) Apr 11 10:13:29 server kernel: cf1d0ed0 00000082 632f379a 000124d0 000124cc 0000000e 00000000 00000009 Apr 11 10:13:29 server kernel: f52e9000 632f3f0a 000124d0 00000770 00000000 f52e910c c17f6d00 cbe44740 Apr 11 10:13:29 server kernel: f75d3e68 00000000 00000000 cf1d0ecc c041ec40 00000000 cf1d0ed8 f75d3e50 Apr 11 10:13:29 server kernel: Call Trace: Apr 11 10:13:29 server kernel: [<c041ec40>] __wake_up+0x2a/0x3d Apr 11 10:13:29 server kernel: [<c0436d23>] prepare_to_wait+0x24/0x46 Apr 11 10:13:31 server kernel: [<f885f1e2>] log_wait_commit+0x80/0xc7 [jbd] Apr 11 10:13:31 server kernel: [<c0436bd7>] autoremove_wake_function+0x0/0x2d Apr 11 10:13:31 server kernel: [<f885a661>] journal_stop+0x195/0x1ba [jbd] Apr 11 10:13:31 server kernel: [<c0495caa>] __writeback_single_inode+0x197/0x2a3 Apr 11 10:13:31 server kernel: [<c045ee18>] do_writepages+0x2b/0x32 Apr 11 10:13:31 server kernel: [<c045a96c>] __filemap_fdatawrite_range+0x66/0x72 Apr 11 10:13:31 server kernel: [<c0496346>] sync_inode+0x19/0x24 Apr 11 10:13:31 server kernel: [<f8983009>] ext3_sync_file+0xb1/0xdc [ext3] Apr 11 10:13:31 server kernel: [<c047956c>] do_fsync+0x41/0x83 Apr 11 10:13:31 server kernel: [<c04795cb>] __do_fsync+0x1d/0x2b Apr 11 10:13:31 server kernel: [<c0404f4b>] syscall_call+0x7/0xb should be read? Any howto?
Best regards, Rafal Radecki.
We've been having the same problem on centos 5.8, with a 2.6.18-308.1.1.el5 kernel. It locked up twice and became unresponsive and had to be rebooted. After it came back up we discovered a number of instances of the 'task blocked' message - happening several times a day.
We reflashed the raid controller's firmware last week, and the box has stayed up since. I was about to declare victory when I double-checked the logs and found that we got three 'task blocked' messages overnight. The times of the messages match up with the times that we were transferring large iso images to the box.
Ali
Ali Corbin wrote:
We've been having the same problem on centos 5.8, with a 2.6.18-308.1.1.el5 kernel. It locked up twice and became unresponsive
and had to be
rebooted. After it came back up we discovered a number of instances of the 'task blocked' message - happening several times a day.
We reflashed the raid controller's firmware last week, and the box has stayed up since. I was about to declare victory when I double-checked the logs and found that we got three 'task blocked' messages overnight. The times of the messages match up with the times that we were transferring large iso images to the box.
Been seeing that on 6.2, also, though I've not noticed them locking up (the ones that have were in a cluster, and someone could have hosed memory). It got less with the latest kernel, but I'm still seeing them occasionally.
mark
On Wed, 11 Apr 2012, m.roth@5-cent.us wrote:
Been seeing that on 6.2, also, though I've not noticed them locking up (the ones that have were in a cluster, and someone could have hosed memory). It got less with the latest kernel, but I'm still seeing them occasionally.
This is most likely a problem due to data arriving too quickly for the VM subsystem to flush dirty pages to disk to keep up; the kernel switches to synchronous mode when the dirty_ratio percentage of memory pages is used. You might try upping your dirty_ratio, or reducing it drastically.
Steve
Steve Thompson wrote:
On Wed, 11 Apr 2012, m.roth@5-cent.us wrote:
Been seeing that on 6.2, also, though I've not noticed them locking up (the ones that have were in a cluster, and someone could have hosed memory). It got less with the latest kernel, but I'm still seeing them occasionally.
This is most likely a problem due to data arriving too quickly for the VM subsystem to flush dirty pages to disk to keep up; the kernel switches to synchronous mode when the dirty_ratio percentage of memory pages is used. You might try upping your dirty_ratio, or reducing it drastically.
But mine are *not* VMs.
mark
On Wed, 11 Apr 2012, m.roth@5-cent.us wrote:
Steve Thompson wrote:
On Wed, 11 Apr 2012, m.roth@5-cent.us wrote:
Been seeing that on 6.2, also, though I've not noticed them locking up (the ones that have were in a cluster, and someone could have hosed memory). It got less with the latest kernel, but I'm still seeing them occasionally.
This is most likely a problem due to data arriving too quickly for the VM subsystem to flush dirty pages to disk to keep up; the kernel switches to synchronous mode when the dirty_ratio percentage of memory pages is used. You might try upping your dirty_ratio, or reducing it drastically.
But mine are *not* VMs.
VM = Virtual Memory.
Steve Thompson wrote:
On Wed, 11 Apr 2012, m.roth@5-cent.us wrote:
Steve Thompson wrote:
On Wed, 11 Apr 2012, m.roth@5-cent.us wrote:
Been seeing that on 6.2, also, though I've not noticed them locking up (the ones that have were in a cluster, and someone could have hosed memory). It got less with the latest kernel, but I'm still seeing them occasionally.
This is most likely a problem due to data arriving too quickly for the VM subsystem to flush dirty pages to disk to keep up; the kernel switches to synchronous mode when the dirty_ratio percentage of memory pages is used. You might try upping your dirty_ratio, or reducing it drastically.
But mine are *not* VMs.
VM = Virtual Memory.
Sorry, VM (virtual memory) is *so* different than VM (virtual machine), and in such a different context, dunno how I could have confused the two.... <g>
I'll try that, though IIRC, I've seen blocked tasks with someone running emacs.
mark