[CentOS] system hangs

Mon Feb 24 15:20:05 UTC 2014
m.roth at 5-cent.us <m.roth at 5-cent.us>

Every so often, one of our servers will go into what I can only describe
as an undefined state: it pings, but there's zero access. You can't ssh
in, and if I plug a keyboard and monitor into the server itself, I can
see that the monitor is live (it's not the "monitor turned off" color), but
there is zero response to the keyboard. The upshot is that I wind up
having to power cycle it.

Well, it just happened again on one of our servers Friday evening, as I
found this morning. Looking at the logs, I see that the last sar entry is:
10:20:01 PM       all     34.38      0.00      8.29      0.00      0.00   

One of my users dropped me an email at 22:45 saying that it was "off", and
the last thing I see in /var/log/messages is one of those annoying messages:
Feb 21 22:26:23 <server> kernel: INFO: task perl:20596 blocked for more
than 120 seconds.
Feb 21 22:26:23 <server> kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
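
To see how often this has happened before a hang, reports like the one above can be pulled out of /var/log/messages with a few lines of Python. This is only a sketch: it assumes the syslog wording shown above (each report on a single line in the real log), and the name find_hung_tasks is made up for illustration.

```python
import re

# Matches the kernel's hung-task report as it appears in syslog, e.g.
# "... kernel: INFO: task perl:20596 blocked for more than 120 seconds."
# Assumption: the wording is the same as in the log excerpt quoted above.
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return a list of (command, pid, seconds) for each hung-task report."""
    hits = []
    for line in lines:
        m = HUNG_RE.search(line)
        if m:
            hits.append((m.group("comm"), int(m.group("pid")), int(m.group("secs"))))
    return hits
```

Typical use would be to pass an open file, e.g. find_hung_tasks(open("/var/log/messages")), and look at which commands and PIDs keep showing up.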

I also see
Feb 21 22:26:23 <server> kernel: perl          D ffffffff80158250     0
20596  20557
which, as I just found by googling perl NOTLD, means that the process is in
an uninterruptible kernel state (the "D").
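
While the box is still reachable, processes stuck in that D state can be listed straight from /proc. A minimal sketch, assuming the usual /proc/<pid>/stat layout of "pid (comm) state ..."; the function names are hypothetical, and this obviously can't run once the machine has stopped responding, so it would have to be run (or cron'd) beforehand.

```python
import os


def parse_stat(text):
    """Split the contents of /proc/<pid>/stat into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so take everything between the first '(' and the last ')'.
    lparen = text.index("(")
    rparen = text.rindex(")")
    pid = int(text[:lparen].strip())
    comm = text[lparen + 1:rparen]
    state = text[rparen + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc="/proc"):
    """Return (pid, comm) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or entry unreadable; skip it
        if state == "D":
            found.append((pid, comm))
    return found
```

A process that shows up here briefly is normal (disk or NFS I/O in progress); one that stays in the list for minutes is the kind the hung-task message is complaining about.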
In addition, the stack trace includes some NFS messages:
Feb 21 22:26:23 <server> kernel:  [<ffffffff886b58d1>]

So, it *appears* to be either an NFS issue or a NIC issue. The user's
home directory server is running CentOS 6.5, and the server that hung is
5.10. Running mount on the formerly hung server, su'd to his account, shows
merely nfs, so I'm guessing it's NFSv3. Looking at lsmod and
/var/log/dmesg, I see it's running the tg3 NIC driver.
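
Rather than guessing the NFS version from mount output, the negotiated options can usually be read from /proc/mounts. The sketch below is an assumption-laden illustration: it expects the version to appear either as vers=N (newer kernels) or, as I recall for 2.6.18-era kernels like CentOS 5's, as a bare vN option. The function name nfs_versions is hypothetical.

```python
import re


def nfs_versions(mounts_text):
    """Map each NFS mount point to its version string, or None if not shown."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        mountpoint, fstype, options = fields[1], fields[2], fields[3].split(",")
        version = None
        for opt in options:
            # Accept "vers=3", "nfsvers=3", "vers=4.0", or the older bare "v3".
            m = re.fullmatch(r"(?:nfs)?vers=([\d.]+)|v(\d)", opt)
            if m:
                version = m.group(1) or m.group(2)
                break
        if version is None and fstype == "nfs4":
            version = "4"  # fs type alone implies NFSv4
        result[mountpoint] = version
    return result
```

Usage would be something like nfs_versions(open("/proc/mounts").read()); a None value just means the kernel didn't list a version for that mount.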

Anyone else seeing this, and if so, any thoughts on the matter? Note that
I've had this on Penguins, which are all Supermicro, and they're using the
igb NIC driver, but the one this past weekend is a Dell, so it's not just
one system.