On Apr 17, 2012, at 5:40 PM, Steve Thompson smt@vgersoft.com wrote:
I have four NFS servers running on Dell hardware (PE2900) under CentOS 5.7, x86_64. The number of NFS clients is about 170.
A few days ago, one of the four, with no apparent changes, stopped responding to NFS requests for two minutes approximately every half hour. Let's call this "the hang". It has been doing this for four days now. There are no log messages of any kind pertaining to it. The other three servers are fine, although they are less loaded. Between hangs, performance is excellent. Load is more or less constant, not peaky.
NFS clients do get the usual "not responding, still trying" message during a hang.
There are no cron or other jobs that launch every half an hour.
All hardware on the affected server seems to be good. Disk volumes being served are RAID-5 sets with write-back cache enabled (BBU is good). RAID controller logs are free of errors.
The NFS servers use dual bonded gigabit links in balance-alb mode. Turning off one interface in the bond made no difference.
Relevant /etc/sysctl.conf parameters:
vm.dirty_ratio = 50
vm.dirty_background_ratio = 1
vm.dirty_expire_centisecs = 1000
vm.dirty_writeback_centisecs = 100
vm.min_free_kbytes = 65536
net.core.rmem_default = 262144
net.core.rmem_max = 262144
net.core.wmem_default = 262144
net.core.wmem_max = 262144
net.core.netdev_max_backlog = 25000
net.ipv4.tcp_reordering = 127
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.tcp_no_metrics_save = 1
The {r,w}mem_{max,default} values are twice what they were previously; changing these had no effect.
The number of dirty pages is nowhere near the dirty_ratio when the hangs occur; there may be only 50MB of dirty memory.
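For what it's worth, a quick way to watch dirty memory during a hang is to sample /proc/meminfo (the one-second interval and three samples are arbitrary):

```shell
# Sample dirty and writeback page totals (in kB) a few
# times, one second apart, to see whether they spike
# when the hang starts.
for i in 1 2 3; do
    grep -E '^(Dirty|Writeback):' /proc/meminfo
    sleep 1
done
```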
A local process on the NFS server is reading from disk at around 40-50 MB/s on average; this continues unaffected during the hang, as do all other network services on the host (e.g. an LDAP server). During the hang the server seems to be quite snappy in all respects apart from NFS. The network itself is fine as far as I can tell, and all NFS-related processes on the server are intact.
NFS mounts on clients are made over UDP or TCP, with no difference in results. During a hang, a new client mount cannot be completed ("timed out") and access to an already-mounted NFS volume stalls (both automounted and manual mounts).
NFS block size is 32768 for both reads and writes (rsize/wsize); using 16384 makes no difference.
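For concreteness, the client mounts are along these lines (server:/export and /mnt are placeholders; timeo is in tenths of a second):

```shell
# TCP mount with explicit 16 KB read/write block size.
# Swap tcp for udp to test the other transport.
mount -t nfs -o tcp,rsize=16384,wsize=16384,timeo=600 server:/export /mnt
```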
Tcpdump shows no NFS packets exchanged between client and server during a hang.
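The capture was along these lines (the interface name and client address are placeholders; 2049 is the standard NFS port):

```shell
# Watch NFS traffic (TCP or UDP, port 2049) between the
# server and one client; during a hang this goes silent.
tcpdump -i bond0 -n 'port 2049 and host 10.0.0.42'
```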
I have not rebooted the affected server yet, but I have restarted NFS with no change.
Help! I cannot figure out what is wrong, and I cannot find anything amiss. I'm running out of something but I don't know what it is (except perhaps brains). Hints, please!
Just a shot in the dark here.
Take a look at the NIC and switch port flow control status during an outage; the ports may be paused due to switch load.
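Something like this on each bonded slave would show whether pause frames are in play (interface names are assumptions, and the pause counter names vary by NIC driver):

```shell
# Show negotiated flow control (autoneg / RX pause / TX pause).
ethtool -a eth0
ethtool -a eth1

# Many drivers expose pause-frame counters in the NIC stats;
# a steadily climbing count during a hang would be telling.
ethtool -S eth0 | grep -i pause
ethtool -S eth1 | grep -i pause
```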
Is there anything else on the network switches that might flood them every half hour for a two minute duration?
-Ross