Connie Sieh wrote:
> On Mon, 24 Feb 2014, m.roth at 5-cent.us wrote:
>
>> Every so often, one of our servers will go into what I can only describe
>> as an undefined state: it pings, but there's zero access - you can't ssh
>> in, and if I go plug a keyboard and monitor into the server itself, you
>> can see the monitor's live, it's not the "monitor turned off" color, but
>> there is zero response to the keyboard. The upshot is that I wind up
>> having to power cycle it.
>>
>> Well, it just happened again on one of our servers Friday evening, as I
>> found this morning. Looking at the logs this morning, I see that sar
>> last shows
>> 10:20:01 PM  all  34.38  0.00  8.29  0.00  0.00  57.33
>>
>> One of my users dropped me an email at 22:45 that it was "off", and the
>> last things I see in /var/log/messages are one of those annoying
>> Feb 21 22:26:23 <server> kernel: INFO: task perl:20596 blocked for more than 120 seconds.
>> Feb 21 22:26:23 <server> kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>
>> I also see
>> Feb 21 22:26:23 <server> kernel: perl  D ffffffff80158250  0 20596 20557
>> which, as I just found by googling perl NOTLD, means that this is in a
>> kernel uninterruptible state.
>> In addition, in the stack trace, some nfs messages
>> Feb 21 22:26:23 <server> kernel: [<ffffffff886b58d1>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
>>
>> So, it *appears* to be either an NFS issue, or a NIC issue. The user's
>> home directory server is CentOS running 6.5, and the server that hung is
>> 5.10. Mount on the formerly hung server, su-d to his account shows
>> merely nfs, so I'm guessing it's NFS3. Looking at lsmod and
>> /var/log/dmesg, I see it's running the tg3 NIC driver.
>>
>> Anyone else seeing this, and if so, any thoughts on the matter? Note
>> that I've had this on Penguins, which are all Supermicro, and they're
>> using the igb NIC driver, but the one this past weekend is a Dell, so
>> it's not just one system.
>>
> What CPU's do these systems have? AMD or Intel.
>
> What kernel are the server and client running?

In the case from this weekend, the NFS home directory server is running
an AMD Opteron - that's the Dell with the Broadcom NIC; the one that's
running 5.10 and hung is running Intel Xeon, and the tg3 NIC driver, and
the other server - don't remember which of them it was, several have
done this, so I'm just picking one Penguin - is a different model AMD
Opteron, and the Intel with igb as the driver.

        mark
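
For anyone triaging the same symptom, here is a minimal Python sketch that pulls the hung-task reports and the NFS stack frames out of a syslog file, so the times of the hangs can be lined up across machines. It assumes the log lines look like the ones quoted above; the function names (find_hung_tasks, mentions_nfs) are made up for this example and are not part of any tool.

```python
import re

# Assumed format, taken from the lines quoted in this thread, e.g.
# "Feb 21 22:26:23 host kernel: INFO: task perl:20596 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return (timestamp, command, pid, seconds) for each hung-task report."""
    hits = []
    for line in lines:
        m = HUNG_TASK.match(line)
        if m:
            hits.append(
                (m.group("stamp"), m.group("comm"),
                 int(m.group("pid")), int(m.group("secs")))
            )
    return hits


def mentions_nfs(lines):
    """Return the lines that reference the nfs module (':nfs:' as in the quoted trace)."""
    return [line.rstrip("\n") for line in lines if ":nfs:" in line]
```

To use it, read the log with something like `open("/var/log/messages", errors="replace").readlines()` and pass the list to both functions. If the hung-task timestamps on several clients cluster around the same time, that would point more toward the NFS server or the network than toward any one client.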