I have four NFS servers running on Dell hardware (PE2900) under CentOS 5.7, x86_64. The number of NFS clients is about 170.
A few days ago, one of the four, with no apparent changes, stopped responding to NFS requests for two minutes every half an hour (approx). Let's call this "the hang". It has been doing this for four days now. There are no log messages of any kind pertaining to this. The other three servers are fine, although they are less loaded. Between hangs, performance is excellent. Load is more or less constant, not peaky.
NFS clients do get the usual "not responding, still trying" message during a hang.
There are no cron or other jobs that launch every half an hour.
All hardware on the affected server seems to be good. Disk volumes being served are RAID-5 sets with write-back cache enabled (BBU is good). RAID controller logs are free of errors.
The NFS servers use dual bonded gigabit links in balance-alb mode. Turning off one interface in the bond made no difference.
Relevant /etc/sysctl.conf parameters:
vm.dirty_ratio = 50
vm.dirty_background_ratio = 1
vm.dirty_expire_centisecs = 1000
vm.dirty_writeback_centisecs = 100
vm.min_free_kbytes = 65536
net.core.rmem_default = 262144
net.core.rmem_max = 262144
net.core.wmem_default = 262144
net.core.wmem_max = 262144
net.core.netdev_max_backlog = 25000
net.ipv4.tcp_reordering = 127
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.tcp_no_metrics_save = 1
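(For what it's worth, the file is re-applied and the live values spot-checked with the usual:

# sysctl -p /etc/sysctl.conf
# sysctl net.core.rmem_max net.ipv4.tcp_rmem

so the running kernel matches the file.)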
The {r,w}mem_{max,default} values are twice what they were previously; changing these had no effect.
The number of dirty pages is nowhere near the dirty_ratio when the hangs occur; there may be only 50MB of dirty memory.
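(That figure comes straight from /proc/meminfo, and can be watched through a hang with, for example:

# watch -n1 "egrep 'Dirty|Writeback' /proc/meminfo"

neither value spikes when the hang starts.)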
A local process on the NFS server is reading from disk at around 40-50 MB/sec on average; this continues unaffected during the hang, as do all other network services on the host (eg an LDAP server). During the hang the server seems to be quite snappy in all respects apart from NFS. The network itself is fine as far as I can tell, and all NFS-related processes on the server are intact.
Client NFS mounts are made with either UDP or TCP, with no difference in results. During the hang, a new client mount cannot be completed ("timed out"), and access to an already-mounted volume stalls (both automounted and manual mounts).
NFS block size is 32768 r and w; using 16384 makes no difference.
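(For reference, a representative client mount; server and export names here are placeholders:

# mount -t nfs -o tcp,rsize=32768,wsize=32768 <server>:/vol /mnt/vol

with the udp and 16384 variants being the obvious substitutions.)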
Tcpdump shows no NFS packets exchanged between client and server during a hang.
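(The capture was nothing exotic; something along these lines, with the interface name being whatever the bond is called:

# tcpdump -n -i bond0 port 2049 or port 111

shows normal traffic between hangs and silence during one.)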
I have not rebooted the affected server yet, but I have restarted NFS with no change.
Help! I cannot figure out what is wrong, and I cannot find anything amiss. I'm running out of something but I don't know what it is (except perhaps brains). Hints, please!
Steve
Just a shot in the dark here.
Take a look at the NIC and switch port flow control status during an outage; they may be paused due to switch load.
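Something like this on the server side will show the pause settings and, if the driver exposes them, pause frame counters (interface name is a guess):

# ethtool -a eth0                    # flow control (pause) negotiation and state
# ethtool -S eth0 | grep -i pause    # pause frame counters, driver permitting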
Is there anything else on the network switches that might flood them every half hour for a two-minute duration?
-Ross
Let me also add that constant spanning tree convergence can cause this too. Make sure your choice of protocol and priority suit your topology and equipment.
-Ross
On Tue, 17 Apr 2012, Ross Walker wrote:
Let me also add that constant spanning tree convergence can cause this too. Make sure your choice of protocol and priority suit your topology and equipment.
That gives me an idea! The switch is under the control of different people, and I did have a new VLAN created for an unrelated purpose two days before this all started. Hmmm...
Steve
On Apr 17, 2012, at 6:57 PM, Steve Thompson wrote:
That gives me an idea! The switch is under the control of different people, and I did have a new VLAN created for an unrelated purpose two days before this all started. Hmmm...
Maybe one of the ports of the bonded interface was assigned to this VLAN, breaking the bond.
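The slave state is visible from the server side (assuming the bond device is bond0):

# cat /proc/net/bonding/bond0

Each slave's MII Status and Link Failure Count are listed there; a link flapping every half hour would show up as a climbing failure count.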
-Ross
On Tue, 17 Apr 2012, Ross Walker wrote:
Take a look at the NIC and switch port flow control status during an outage; they may be paused due to switch load. Is there anything else on the network switches that might flood them every half hour for a two-minute duration?
Unfortunately not. All of the NFS servers are on the same switch (an HP ProCurve), and only the one is having issues. The hang is always the same length, too. Nice try, though!
Steve
Also a shot in the dark from me. There may be some IP conflict on the network.
On Wed, 18 Apr 2012, Fajar Priyanto wrote:
Also a shot in the dark from me. There may be some IP conflict on the network.
Yes, I thought of that one too. I am in control of all IPs on the network, so I am sure that nothing changed around the time the trouble started. I checked for that anyway :-(
Steve
Interesting. It looks like some kind of RPC failure. During the hang, I cannot contact the nfs service via RPC:
# rpcinfo -t <server> nfs
rpcinfo: RPC: Timed out
program 100003 version 0 is not available
even though it is supposedly available:
# rpcinfo -p <server>
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp   1007  status
    100024    1   tcp   1010  status
    100021    1   udp  35077  nlockmgr
    100021    3   udp  35077  nlockmgr
    100021    4   udp  35077  nlockmgr
    100021    1   tcp  56622  nlockmgr
    100021    3   tcp  56622  nlockmgr
    100021    4   tcp  56622  nlockmgr
    100011    1   udp   1009  rquotad
    100011    2   udp   1009  rquotad
    100011    1   tcp   1012  rquotad
    100011    2   tcp   1012  rquotad
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100003    4   udp   2049  nfs
    100003    2   tcp   2049  nfs
    100003    3   tcp   2049  nfs
    100003    4   tcp   2049  nfs
    100005    1   udp    605  mountd
    100005    1   tcp    608  mountd
    100005    2   udp    605  mountd
    100005    2   tcp    608  mountd
    100005    3   udp    605  mountd
    100005    3   tcp    608  mountd
However, I can connect to the service via telnet:
# telnet <server> nfs
Trying <ipaddr>...
Connected to <server> (<ipaddr>).
Escape character is '^]'.
so the service is running but internally borked in some way.
Steve
On Apr 18, 2012, at 3:35 PM, Steve Thompson smt@vgersoft.com wrote:
Interesting. It looks like some kind of RPC failure. During the hang, I cannot contact the nfs service via RPC.
Is iptables disabled? If not, could there be a problem with the rules or an RPC helper?
What about SELinux?
-Ross
Jumping in late on this thread, so pardon my ignorance of some details...
On Wed, Apr 18, 2012 at 4:35 PM, Steve Thompson smt@vgersoft.com wrote:
Interesting. It looks like some kind of RPC failure. During the hang, I cannot contact the nfs service via RPC:
# rpcinfo -t <server> nfs
rpcinfo: RPC: Timed out
program 100003 version 0 is not available
Did you run this command during "the hang", or does it return that constantly?
If the latter, are you blocking UDP on either the server or the client?
If you don't specify a transport protocol, rpcinfo will use whatever is defined in the /etc/netconfig database, and that's usually UDP.
A couple of ideas/questions:
- Is it happening at the exact same minute (eg. 2:15, 2:45, 3:15, 3:45)? This might help you to identify a script/program that follows that schedule.
- Is there any configuration difference between this server and the others? /etc/system, root crontab, etc.
- When you say everything else BUT NFS is working fine, are pings answered properly, without increased latency, during "the hang"?
- What about other services? Can you set up a monitoring script connecting to some other service (eg. ftp, ls, exit or ssh) and reporting the total run time?
- Can you set up a monitoring script running "rpcinfo" on localhost to make sure both local and remote communications hang?
On Thu, 19 Apr 2012, Giovanni Tirloni wrote:
Did you run this command during "the hang", or does it return that constantly?
It returns the timeout only during the hang; the rest of the time it works normally.
If the latter, are you blocking UDP on either the server or the client?
No blocking.
If you don't specify a transport protocol, rpcinfo will use whatever is defined in the /etc/netconfig database, and that's usually UDP.
Using UDP or TCP makes no difference. "rpcinfo -{u,t} host nfs" both give a timeout during the hang, and work normally during other times.
- Is it happening at the exact same minute (eg. 2:15, 2:45, 3:15, 3:45)? This might help you to identify a script/program that follows that schedule.
It is not related to any script that I can find. It does not happen at _exactly_ the same time every time, although it is consistent to within a few minutes.
- Is there any configuration difference between this server and the others? /etc/system, root crontab, etc.
No differences that I can find.
- When you say everything else BUT NFS is working fine, are pings answered properly, without increased latency, during "the hang"?
Yes. I can even run an iperf server on the host during the hang, and from a client I run iperf -c and get normal performance.
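(ie along these lines, during a hang:

# iperf -s                  # on the NFS server
# iperf -c <server> -t 10   # on a client

and the throughput comes back normal.)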
- What about other services? Can you set up a monitoring script connecting to some other service (eg. ftp, ls, exit or ssh) and reporting the total run time?
No other service appears to be impacted at all.
- Can you set up a monitoring script running "rpcinfo" on localhost to make sure both local and remote communications hang?
Yes, can do.
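A first cut would be something like this (the one-minute interval and log path are arbitrary, and <server> is a placeholder as above):

#!/bin/sh
# probe the NFS RPC service locally and remotely once a minute;
# log a timestamped line whenever a probe fails or times out
while sleep 60; do
    for h in localhost <server>; do
        rpcinfo -t $h nfs > /dev/null 2>&1 ||
            echo "`date '+%F %T'` rpcinfo nfs FAILED on $h"
    done
done >> /var/log/nfs-probe.log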
-Steve
Have you looked at the rpc daemon processes with top or ps to see what state they are in? What about running strace? What about your DNS server, or any other (reverse) client-lookup services you might have enabled?
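(The user-space daemons such as rpc.mountd and portmap can be straced; the nfsd's themselves are kernel threads on this platform, so for those the wait channel from ps is the nearest thing:

# ps -eo pid,stat,wchan:30,comm | egrep 'nfsd|rpc'

A pile of nfsd's stuck on the same wchan during a hang would be telling.)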
Nataraj
All,
Many thanks to everyone who commented on this issue. I believe that I have solved it.
It turns out that the number of nfsd's that I was running (32) was way too low. I observed that adding more nfsd's while NFS was hung always made the hang clear immediately. Now I am in the tuning stage, adding more nfsd's until there are no more hangs. I am up to 172 of them now, and the hang frequency has decreased by about a factor of six. Evidently my workload changed while I wasn't looking closely enough. I'll probably end up with about 256 nfsd's.
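(For anyone else chasing this: the kernel keeps thread-usage stats for nfsd, and the th line is the one to watch:

# grep th /proc/net/rpc/nfsd

The first number is the thread count, the second is the number of times all threads have been busy at once, and the trailing ten are, roughly, a histogram of how heavily loaded the pool has been; a steadily climbing second number means the pool is too small.)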
For the sake of completeness, here's how to change the number of nfsd's on the fly:
echo 172 > /proc/fs/nfsd/threads
and, of course, edit /etc/sysconfig/nfs to change RPCNFSDCOUNT to set the value for the next boot.
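ie a line like:

RPCNFSDCOUNT=256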
Steve