Every so often, one of our servers will go into what I can only describe as an undefined state: it pings, but there's zero access - you can't ssh in, and if I go plug a keyboard and monitor into the server itself, you can see the monitor's live, it's not the "monitor turned off" color, but there is zero response to the keyboard. The upshot is that I wind up having to power cycle it.
Well, it just happened again on one of our servers Friday evening, as I found this morning. Looking at the logs, I see that the last sar entry is:
10:20:01 PM all 34.38 0.00 8.29 0.00 0.00 57.33
One of my users dropped me an email at 22:45 that it was "off", and the last things I see in /var/log/messages are one of those annoying
Feb 21 22:26:23 <server> kernel: INFO: task perl:20596 blocked for more than 120 seconds.
Feb 21 22:26:23 <server> kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
I also see
Feb 21 22:26:23 <server> kernel: perl D ffffffff80158250 0 20596 20557
which, as I just found by googling perl NOTLD, means that this is in a kernel uninterruptible state. In addition, in the stack trace, there are some NFS messages:
Feb 21 22:26:23 <server> kernel: [<ffffffff886b58d1>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
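For anyone wanting to pull these warnings out of a large log, here is a minimal Python sketch. It assumes the message wording is exactly as quoted above; the function name find_hung_tasks is made up for illustration.

```python
import re

# Matches the kernel's hung-task warning as quoted in the post above.
HUNG_TASK = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)

def find_hung_tasks(lines):
    """Return (command, pid, seconds) for every hung-task warning in lines."""
    hits = []
    for line in lines:
        m = HUNG_TASK.search(line)
        if m:
            hits.append((m.group("comm"), int(m.group("pid")), int(m.group("secs"))))
    return hits
```

To run it against a real log, pass an open file, e.g. find_hung_tasks(open("/var/log/messages", errors="replace")).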
So, it *appears* to be either an NFS issue or a NIC issue. The user's home directory server is running CentOS 6.5, and the server that hung is running 5.10. Running mount on the formerly hung server, su'd to his account, shows merely nfs, so I'm guessing it's NFSv3. Looking at lsmod and /var/log/dmesg, I see it's running the tg3 NIC driver.
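Rather than guessing the NFS version from the filesystem type, the mount options in /proc/mounts can be checked. A rough Python sketch follows. It assumes the version shows up as an option spelled v3, vers=3 or nfsvers=3 (the spelling differs between kernel versions), and the function name nfs_versions is hypothetical.

```python
import re

def nfs_versions(mounts_text):
    """Map each NFS mount point to its version string, or None if not listed."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        mount_point, options = fields[1], fields[3].split(",")
        # An nfs4 filesystem type implies version 4 even with no explicit option.
        version = "4" if fields[2] == "nfs4" else None
        for opt in options:
            # Assumed spellings of the version option: v3, vers=3, nfsvers=3.
            m = re.fullmatch(r"(?:nfsvers=|vers=|v)(\d+(?:\.\d+)?)", opt)
            if m:
                version = m.group(1)
                break
        result[mount_point] = version
    return result
```

Typical use would be nfs_versions(open("/proc/mounts").read()) on the client that hung.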
Anyone else seeing this, and if so, any thoughts on the matter? Note that I've had this on Penguins, which are all Supermicro, and they're using the igb NIC driver, but the one this past weekend is a Dell, so it's not just one system.
mark
On Mon, 24 Feb 2014, m.roth@5-cent.us wrote:
CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos
What CPUs do these systems have? AMD or Intel?
What kernel are the server and client running?
-Connie Sieh
Connie Sieh wrote:
What CPUs do these systems have? AMD or Intel?
What kernel are the server and client running?
In the case from this weekend, the NFS home directory server is running an AMD Opteron - that's the Dell with the Broadcom NIC; the one that's running 5.10 and hung is running an Intel Xeon and the tg3 NIC driver. The other server - I don't remember which of them it was, several have done this, so I'm just picking one Penguin - is a different model AMD Opteron, with an Intel NIC and igb as the driver.
mark
It seems the system was in a hung state. The message
Feb 21 22:26:23 <server> kernel: INFO: task perl:20596 blocked for more than 120 seconds.
just indicates that process 20596 was blocked (stuck in the uninterruptible D state) for more than 120 seconds. To begin troubleshooting, I would suggest you check what this process does, and whether it requires any remote storage/disk access. By the way, is it always the same perl process that goes into D state first?
If no remote storage/disk access is required for this perl application, and you are running it as root, try running it as a normal user to whom resource limits apply via limits.conf.
If the process requires storage/NFS access, you may want to check the disk/storage status at the time the application moved to D state. I understand that you can't predict when the issue will occur and perform all the checks mentioned above. AFAIK, to find the root cause of this problem, you may want to analyse a core dump collected at the time of the issue.
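One way to see which processes are in D state at a given moment is to read the state field from /proc/<pid>/stat. The Python sketch below does that; it could be run periodically from cron so there is a record from before the machine stops responding. The function names parse_stat and blocked_processes are made up for illustration.

```python
import os

def parse_stat(stat_text):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split on the first "(" and the last ")".
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, comm, state

def blocked_processes(proc_root="/proc"):
    """List (pid, command) for every process currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while we were looking, or unreadable
        if state == "D":
            blocked.append((pid, comm))
    return sorted(blocked)
```

A process that shows up in this list run after run, like the perl process in the log, is the one worth investigating first.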
Cheers, Dominic
Here are some suggestions:
1. Enable and configure kdump
2. Enable Magic SysRq
3. Consider enabling "kernel.softlockup_panic" and "vm.panic_on_oom", but doing so will cause your server to crash sooner than it normally would. It depends on whether you want to capture the first instance (e.g. the smoking gun) or wait until the system is completely hosed (and may have more evidence of the issue).
Then test and verify that Magic SysRq can be used to generate a kernel core dump.
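As a quick sanity check before waiting for the next hang, the current values of the settings mentioned above can be read back from /proc/sys. This is only a read-only Python sketch: it does not configure kdump or change anything, and the names SETTINGS, read_sysctl and report are hypothetical.

```python
import os

# Settings named in the suggestions above, in sysctl dotted form.
SETTINGS = ("kernel.sysrq", "kernel.softlockup_panic", "vm.panic_on_oom")

def read_sysctl(name, root="/proc/sys"):
    """Return the current value of a sysctl as a string, or None if absent."""
    path = os.path.join(root, *name.split("."))
    try:
        with open(path) as fh:
            return fh.read().strip()
    except OSError:
        return None

def report(root="/proc/sys"):
    """Return {setting: value} for the settings listed in SETTINGS."""
    return {name: read_sysctl(name, root) for name in SETTINGS}
```

A value of None means the running kernel does not expose that setting, which is worth knowing before relying on it.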
Then, sit back and wait .....
I do this on all my production servers -- saving the pain of having to do this under pressure plus capturing the vmcore on the first instance is very much worth the effort ....
HTH
-rak-