[CentOS] Help needed with NFS issue

Tue Apr 17 21:40:18 UTC 2012
Steve Thompson <smt at vgersoft.com>

I have four NFS servers running on Dell hardware (PE2900) under CentOS 
5.7, x86_64. The number of NFS clients is about 170.

A few days ago, one of the four, with no apparent changes, began stalling: 
it stops responding to NFS requests for about two minutes, roughly every 
half hour. Let's call this "the hang". It has been doing this for four 
days now.
There are no log messages of any kind pertaining to this. The other three 
servers are fine, although they are less loaded. Between hangs, 
performance is excellent. Load is more or less constant, not peaky.

NFS clients do get the usual "not responding, still trying" message during 
a hang.

There are no cron or other jobs that launch every half an hour.

All hardware on the affected server seems to be good. Disk volumes being 
served are RAID-5 sets with write-back cache enabled (BBU is good). RAID 
controller logs are free of errors.

The NFS servers use dual bonded gigabit links in balance-alb mode. Turning 
off one interface in the bond made no difference.
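
For reference, "turning off one interface" meant something along these 
lines (the interface names here are just placeholders for whatever the 
bond actually uses):

# check bonding mode and slave status
cat /proc/net/bonding/bond0

# detach one slave from the bond (and re-attach it later)
ifenslave -d bond0 eth1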

Relevant /etc/sysctl.conf parameters:

vm.dirty_ratio = 50
vm.dirty_background_ratio = 1
vm.dirty_expire_centisecs = 1000
vm.dirty_writeback_centisecs = 100
vm.min_free_kbytes = 65536
net.core.rmem_default = 262144
net.core.rmem_max = 262144
net.core.wmem_default = 262144
net.core.wmem_max = 262144
net.core.netdev_max_backlog = 25000
net.ipv4.tcp_reordering = 127
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.tcp_no_metrics_save = 1

The {r,w}mem_{max,default} values are twice what they were previously; 
changing these had no effect.
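
To be clear about how the changes were applied, it was roughly this (a 
sketch of the procedure, not a verbatim transcript):

# raise the socket buffer limits at runtime
sysctl -w net.core.rmem_default=262144
sysctl -w net.core.rmem_max=262144
sysctl -w net.core.wmem_default=262144
sysctl -w net.core.wmem_max=262144

# or edit /etc/sysctl.conf and re-read it
sysctl -p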

The number of dirty pages is nowhere near the dirty_ratio when the hangs 
occur; there may be only 50MB of dirty memory.
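
I'm judging the dirty page count from /proc/meminfo, roughly like this 
(the ~50MB figure is an eyeball average, not a precise measurement):

# watch dirty/writeback memory during a hang
watch -n 1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'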

A local process on the NFS server is reading from disk at around 40-50 
MB/sec on average; this continues unaffected during the hang, as do all 
other network services on the host (e.g. an LDAP server). During the hang 
the server seems to be quite snappy in all respects apart from NFS. The 
network itself is fine as far as I can tell, and all NFS-related processes 
on the server are intact.
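
By "intact" I mean the nfsd threads are all still present and what they 
are waiting on looks normal; this is roughly how I'm checking (a sketch):

# nfsd kernel threads: state and wait channel
ps -eo pid,stat,wchan:30,cmd | grep '[n]fsd'

# RPC thread pool usage and call counts
cat /proc/net/rpc/nfsd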

NFS mounts on clients have been tried over both UDP and TCP, with no 
difference in results. During the hang, a new client mount cannot be 
completed ("timed out"), and access to already-mounted NFS volumes stalls 
(both automounted and manual mounts).

The NFS block size (rsize/wsize) is 32768 for both reads and writes; using 
16384 makes no difference.
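
The client mounts look more or less like this (server name, export and 
mount point are made up for the example):

# TCP mount, 32k block size (hypothetical server/export names)
mount -t nfs -o tcp,rsize=32768,wsize=32768 server1:/export/home /mnt/home

# same mount over UDP
mount -t nfs -o udp,rsize=32768,wsize=32768 server1:/export/home /mnt/home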

Tcpdump shows no NFS packets exchanged between client and server during a 
hang.
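
The capture was along these lines (the interface name and client address 
are stand-ins):

# NFS traffic between the server and one client
tcpdump -i bond0 -n host 192.168.1.50 and port 2049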

I have not rebooted the affected server yet, but I have restarted NFS
with no change.
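
By "restarted NFS" I mean the stock init script, i.e. roughly:

# restart the NFS server daemons (CentOS 5)
service nfs restart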

Help! I cannot figure out what is wrong, and I cannot find anything amiss. 
I'm running out of something but I don't know what it is (except perhaps
brains). Hints, please!

Steve