Jumping in late on this thread, so pardon my ignorance of some details...
On Wed, Apr 18, 2012 at 4:35 PM, Steve Thompson smt@vgersoft.com wrote:
Interesting. It looks like some kind of RPC failure. During the hang, I cannot contact the nfs service via RPC:
# rpcinfo -t <server> nfs
rpcinfo: RPC: Timed out
program 100003 version 0 is not available
Did you run this command during "the hang", or does it always return that?
If the latter, are you blocking UDP on either the server or the client?
# rpcinfo -p <server>
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp   1007  status
    100024    1   tcp   1010  status
    100021    1   udp  35077  nlockmgr
    100021    3   udp  35077  nlockmgr
    100021    4   udp  35077  nlockmgr
    100021    1   tcp  56622  nlockmgr
    100021    3   tcp  56622  nlockmgr
    100021    4   tcp  56622  nlockmgr
    100011    1   udp   1009  rquotad
    100011    2   udp   1009  rquotad
    100011    1   tcp   1012  rquotad
    100011    2   tcp   1012  rquotad
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100003    4   udp   2049  nfs
    100003    2   tcp   2049  nfs
    100003    3   tcp   2049  nfs
    100003    4   tcp   2049  nfs
    100005    1   udp    605  mountd
    100005    1   tcp    608  mountd
    100005    2   udp    605  mountd
    100005    2   tcp    608  mountd
    100005    3   udp    605  mountd
    100005    3   tcp    608  mountd
However, I can connect to the service via telnet:
# telnet <server> nfs
Trying <ipaddr>...
Connected to <server> (<ipaddr>).
Escape character is '^]'.
If you don't specify a transport protocol, rpcinfo will use whatever is defined in the /etc/netconfig database, and that's usually UDP.
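To take the guesswork out of which transport is being used, you could force each one in turn and compare. A rough sketch in Python, assuming rpcinfo is in the PATH and accepts -t (TCP) and -u (UDP) as it does in the command quoted above; the function names and the 30-second timeout are just my own picks:

```python
import subprocess


def rpcinfo_command(transport, host, program="nfs"):
    """Build an rpcinfo call that forces the transport: 'tcp' -> -t, 'udp' -> -u."""
    flags = {"tcp": "-t", "udp": "-u"}
    return ["rpcinfo", flags[transport], host, program]


def probe(transport, host, program="nfs", timeout=30):
    """Run one rpcinfo call and return (ok, output).

    A non-zero exit status, no answer within `timeout` seconds
    (an arbitrary value), or a missing rpcinfo binary all count as failure.
    """
    cmd = rpcinfo_command(transport, host, program)
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False, "no answer within %d seconds" % timeout
    except FileNotFoundError:
        return False, "rpcinfo not found in PATH"
    return result.returncode == 0, (result.stdout + result.stderr).strip()
```

Calling probe("tcp", host) and probe("udp", host) with your server's name during "the hang" and again afterwards would show whether only one transport is affected or both.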
A couple of ideas/questions:
- Is it happening at the exact same minute each time (e.g. 2:15, 2:45, 3:15, 3:45)? That might help you identify a script/program that follows that schedule.
- Is any configuration different between this server and the others? /etc/system, root crontab, etc.
- When you say everything else BUT NFS is working fine, are pings answered properly, without increased latency, during "the hang"?
- What about other services? Can you set up a monitoring script that connects to some other service (e.g. ftp with ls and exit, or ssh) and reports the total run time?
- Can you set up a monitoring script running "rpcinfo" on localhost to make sure both local and remote communications hang?
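For the monitoring idea, something along these lines might be a starting point. It is only a sketch: it times a bare TCP connect rather than a full ftp or ssh session, and the function names, the 10-second timeout and the 60-second interval are my own assumptions, not anything from your setup. The timestamps in the log should also make it easy to see whether the slow spots line up with a fixed schedule.

```python
import socket
import time


def time_tcp_connect(host, port, timeout=10.0):
    """Return the seconds taken to open a TCP connection, or None on failure/timeout."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def log_line(service, elapsed, now=None):
    """Format one timestamped result line; now=None means the current time."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    result = "FAILED" if elapsed is None else "%.3fs" % elapsed
    return "%s %s %s" % (stamp, service, result)


def monitor(host, services, interval=60, rounds=None):
    """Probe each (name, port) pair every `interval` seconds.

    rounds=None keeps going until interrupted.
    """
    done = 0
    while rounds is None or done < rounds:
        for name, port in services:
            print(log_line(name, time_tcp_connect(host, port)), flush=True)
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

You would run it as, say, monitor(host, [("ssh", 22), ("nfs", 2049)]) with the server's name as host, redirect the output to a file, and leave it going across a couple of hangs. Keep in mind that your telnet test already shows a plain TCP connect to the nfs port succeeding during the hang, so a fast connect time here does not prove the RPC layer is answering; it only tells you whether the network and the listening sockets are still healthy.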