[CentOS] system hangs

Mon Feb 24 18:27:17 UTC 2014
m.roth at 5-cent.us <m.roth at 5-cent.us>

Connie Sieh wrote:
> On Mon, 24 Feb 2014, m.roth at 5-cent.us wrote:
>
>> Every so often, one of our servers will go into what I can only describe
>> as an undefined state: it pings, but there's zero access - you can't ssh
>> in, and if I go plug a keyboard and monitor into the server itself, you
>> can see the monitor's live, it's not the "monitor turned off" color, but
>> there is zero response to the keyboard. The upshot is that I wind up
>> having to power cycle it.
>>
>> Well, it just happened again on one of our servers Friday evening, as I
>> found this morning. Looking at the logs this morning, I see that sar
>> last shows
>> 10:20:01 PM       all     34.38      0.00      8.29      0.00      0.00     57.33
>>
>> One of my users dropped me an email at 22:45 that it was "off", and the
>> last things I see in /var/log/messages are one of those annoying
>> Feb 21 22:26:23 <server> kernel: INFO: task perl:20596 blocked for more than 120 seconds.
>> Feb 21 22:26:23 <server> kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>
>> I also see
>> Feb 21 22:26:23 <server> kernel: perl          D ffffffff80158250     0 20596  20557
>> which, as I just found by googling perl NOTLD, means that this is in a
>> kernel uninterruptible state.
>> In addition, the stack trace shows some nfs messages:
>> Feb 21 22:26:23 <server> kernel:  [<ffffffff886b58d1>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
>>
>> So, it *appears* to be either an NFS issue, or a NIC issue. The user's
>> home directory server is running CentOS 6.5, and the server that hung is
>> 5.10. Running mount on the formerly hung server, su'd to his account,
>> shows merely nfs, so I'm guessing it's NFS3. Looking at lsmod and
>> /var/log/dmesg, I see it's running the tg3 NIC driver.
>>
>> Anyone else seeing this, and if so, any thoughts on the matter? Note
>> that I've had this on Penguins, which are all Supermicro, and they're
>> using the igb NIC driver, but the one this past weekend is a Dell, so
>> it's not just one system.
>>
> What CPUs do these systems have?  AMD or Intel?
>
> What kernel are the server and client running?

In the case from this weekend, the NFS home directory server is running an
AMD Opteron - that's the Dell with the Broadcom NIC; the one that's running
5.10 and hung is running Intel Xeon, and the tg3 NIC driver. The other
server - don't remember which of them it was, several have done this, so
I'm just picking one Penguin - is a different model AMD Opteron, with the
Intel NIC using igb as the driver.
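
For anyone who wants to spot tasks stuck in that "D" (uninterruptible)
state before the box wedges completely, here is a rough sketch that reads
the state field out of /proc/<pid>/stat. It assumes a Python 3 interpreter
is available on the box; the function names parse_stat and
uninterruptible_tasks are made up for illustration, not from any existing
tool.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    lparen = stat_text.index("(")
    rparen = stat_text.rindex(")")
    pid = int(stat_text[:lparen].strip())
    comm = stat_text[lparen + 1:rparen]
    state = stat_text[rparen + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm) for every process currently in state 'D'."""
    tasks = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process went away, or the file was unreadable
        if state == "D":
            tasks.append((pid, comm))
    return tasks


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

Run from cron every minute or so and appended to a log, something like
this would at least show which processes were blocked just before a hang.
A task briefly in "D" is normal; it's the ones that stay there that matter.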

       mark