[CentOS] Help needed with NFS issue

Thu Apr 19 13:22:18 UTC 2012
Giovanni Tirloni <gtirloni at sysdroid.com>

Jumping in late on this thread, so pardon my ignorance of some details...

On Wed, Apr 18, 2012 at 4:35 PM, Steve Thompson <smt at vgersoft.com> wrote:

> Interesting. It looks like some kind of RPC failure. During the hang, I
> cannot contact the nfs service via RPC:
>
> # rpcinfo -t <server> nfs
> rpcinfo: RPC: Timed out
> program 100003 version 0 is not available
>


Did you run this command during "the hang", or does it always return
that?

If the latter, are you blocking UDP on either the server or the client?


> # rpcinfo -p <server>
>    program vers proto   port
>     100000    2   tcp    111  portmapper
>     100000    2   udp    111  portmapper
>     100024    1   udp   1007  status
>     100024    1   tcp   1010  status
>     100021    1   udp  35077  nlockmgr
>     100021    3   udp  35077  nlockmgr
>     100021    4   udp  35077  nlockmgr
>     100021    1   tcp  56622  nlockmgr
>     100021    3   tcp  56622  nlockmgr
>     100021    4   tcp  56622  nlockmgr
>     100011    1   udp   1009  rquotad
>     100011    2   udp   1009  rquotad
>     100011    1   tcp   1012  rquotad
>     100011    2   tcp   1012  rquotad
>     100003    2   udp   2049  nfs
>     100003    3   udp   2049  nfs
>     100003    4   udp   2049  nfs
>     100003    2   tcp   2049  nfs
>     100003    3   tcp   2049  nfs
>     100003    4   tcp   2049  nfs
>     100005    1   udp    605  mountd
>     100005    1   tcp    608  mountd
>     100005    2   udp    605  mountd
>     100005    2   tcp    608  mountd
>     100005    3   udp    605  mountd
>     100005    3   tcp    608  mountd
>
> However, I can connect to the service via telnet:
>
> # telnet <server> nfs
> Trying <ipaddr>...
> Connected to <server> (<ipaddr>).
> Escape character is '^]'.
>

If you don't specify a transport protocol, rpcinfo will use whatever is
defined in the /etc/netconfig database, and that's usually UDP.
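
To take the default transport out of the picture, you could probe TCP and
UDP separately. This is only a rough sketch: probe_nfs is a made-up helper
name, the host argument is a placeholder, and version 3 is simply picked
from your rpcinfo -p listing above. It relies on nothing beyond the
standard -t and -u options of rpcinfo.

```shell
#!/bin/sh
# Hypothetical helper: probe the nfs RPC program (100003), version 3, over
# TCP (-t) and UDP (-u) separately, so that a failure on one transport is
# not hidden by the other.
probe_nfs() {
    host=$1
    for flag in -t -u; do
        if rpcinfo "$flag" "$host" nfs 3 >/dev/null 2>&1; then
            echo "$host nfs $flag: ok"
        else
            echo "$host nfs $flag: FAILED"
        fi
    done
}
```

Calling "probe_nfs <server>" during the hang and again outside of it should
show whether only one of the two transports is affected.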

A couple of ideas/questions:

- Is it happening at the exact same minute each time (e.g. 2:15, 2:45,
3:15, 3:45)? That might help you identify a script/program that follows
that schedule.
- Is there any configuration difference between this server and the others
(/etc/system, root crontab, etc.)?
- When you say everything else BUT NFS is working fine, are pings answered
properly, without increased latency, during "the hang"?
- What about other services? Can you set up a monitoring script that
connects to some other service (e.g. an ftp login that runs ls and exits,
or ssh) and reports the total run time?
- Can you set up a monitoring script running "rpcinfo" on localhost to
check whether both local and remote communications hang?
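
The monitoring ideas above could be sketched roughly like this. Everything
here is an assumption on my part, not something from your setup:
timed_probe and monitor_loop are made-up names, the 60-second default
interval is arbitrary, and the ssh check assumes key-based login is already
working. Each log line carries a timestamp, so hangs can be lined up
against cron schedules afterwards.

```shell
#!/bin/sh
# Hypothetical helper: run a command, then print one log line with a
# timestamp, a label, the exit status, and the elapsed wall-clock seconds.
timed_probe() {
    label=$1
    shift
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    rc=$?
    end=$(date +%s)
    echo "$(date '+%Y-%m-%d %H:%M:%S') $label rc=$rc elapsed=$((end - start))s"
}

# Hypothetical loop: probe every host given as an argument, then sleep.
# Pass "localhost" on the server and the server name on a client.
monitor_loop() {
    while :; do
        for host in "$@"; do
            timed_probe "ping $host" ping -c 1 "$host"
            timed_probe "rpcinfo $host" rpcinfo -t "$host" nfs 3
            # Assumes key-based ssh login; BatchMode avoids password prompts.
            timed_probe "ssh $host" ssh -o BatchMode=yes \
                -o ConnectTimeout=10 "$host" true
        done
        sleep "${INTERVAL:-60}"
    done
}
```

For example, "monitor_loop localhost >> /tmp/nfs-probe.log" on the server
and "monitor_loop <server> >> /tmp/nfs-probe.log" on a client (the log path
is just an example) would give you two logs to compare after the next hang.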

-- 
Giovanni