[CentOS] NFS/RDMA connection closed

Thu Aug 2 02:02:51 UTC 2018
admin at genome.arizona.edu <admin at genome.arizona.edu>

Hi, we are having a problem with NFS using the RDMA protocol over our FDR10 
Infiniband network.  I previously wrote to the NFS mailing list about this, 
so you may find our discussion there.  I have taken some load off the 
server, which was using NFS for backups, by converting the backups to use 
SSH, but we are still having critical problems with NFS clients losing 
their connection to the server, causing the clients to hang and need a 
reboot.  I wanted to check in here before filing a bug with CentOS.

Our setup is a cluster with one head node (the NFS server) and 9 compute 
nodes (NFS clients).  All the machines run CentOS 6.9 with kernel 
2.6.32-696.30.1.el6.x86_64 and use the "inbox" CentOS RDMA 
implementation/drivers (not Mellanox OFED).  (We also have other NFS 
clients, but they use 1GbE for their NFS connection and, while they still 
hang with messages like "NFS server not responding, retrying" or "timed 
out", they eventually recover and don't need a reboot.)

On the server (which is named pac) I will see messages like this:
Jul 30 18:19:38 pac kernel: svcrdma: failed to send reply chunks, rc=-5
Jul 30 18:19:38 pac kernel: svcrdma: failed to send write chunks, rc=-5
Jul 31 15:03:05 pac kernel: svcrdma: failed to send write chunks, rc=-5
Jul 31 15:09:06 pac kernel: svcrdma: failed to send write chunks, rc=-5
Jul 31 15:16:09 pac kernel: svcrdma: failed to send write chunks, rc=-5
Jul 31 15:23:31 pac kernel: svcrdma: Error -107 posting RDMA_READ
Jul 31 15:53:55 pac kernel: svcrdma: failed to send write chunks, rc=-5
Jul 31 16:09:19 pac kernel: svcrdma: failed to send reply chunks, rc=-5
Jul 31 16:09:19 pac kernel: svcrdma: failed to send reply chunks, rc=-5

Previously I had also seen messages like "Jul 11 21:09:56 pac kernel: 
nfsd: peername failed (err 107)!", but I have not seen that during this 
latest hangup.

And on the clients (named n001-n009) I will see messages like this:
Jul 30 18:17:26 n001 kernel: RPC:       rpcrdma_sendcq_process_wc: frmr 
ffff8810674024c0 (stale): WR flushed
Jul 30 18:17:26 n001 kernel: RPC:       rpcrdma_sendcq_process_wc: frmr 
ffff88106638a640 (stale): WR flushed
Jul 30 18:19:26 n001 kernel: nfs: server 10.10.11.100 not responding, 
still trying
Jul 30 18:19:36 n001 kernel: nfs: server 10.10.10.100 not responding, 
timed out
Jul 30 18:19:38 n001 kernel: rpcrdma: connection to 10.10.11.100:20049 
on mlx4_0, memreg 5 slots 32 ird 16
Jul 30 18:19:38 n001 kernel: nfs: server 10.10.11.100 OK
Jul 31 14:42:08 n001 kernel: RPC:       rpcrdma_sendcq_process_wc: frmr 
ffff8810671f02c0 (stale): WR flushed
Jul 31 14:42:08 n001 kernel: RPC:       rpcrdma_sendcq_process_wc: frmr 
ffff8810677bda40 (stale): WR flushed
Jul 31 14:42:08 n001 kernel: RPC:       rpcrdma_sendcq_process_wc: frmr 
ffff8810677bd940 (stale): WR flushed
Jul 31 14:42:08 n001 kernel: RPC:       rpcrdma_sendcq_process_wc: frmr 
ffff8810671f0240 (stale): WR flushed
Jul 31 14:43:35 n001 kernel: rpcrdma: connection to 10.10.11.100:20049 
on mlx4_0, memreg 5 slots 32 ird 16
Jul 31 15:01:53 n001 kernel: RPC:       rpcrdma_sendcq_process_wc: frmr 
ffff881065133140 (stale): WR flushed
Jul 31 15:01:53 n001 kernel: RPC:       rpcrdma_sendcq_process_wc: frmr 
ffff8810666e3f00 (stale): WR flushed
Jul 31 15:01:53 n001 kernel: RPC:       rpcrdma_sendcq_process_wc: frmr 
ffff881063ea0dc0 (stale): WR flushed
Jul 31 15:01:53 n001 kernel: RPC:       rpcrdma_sendcq_process_wc: frmr 
ffff8810677bdb40 (stale): WR flushed
Jul 31 15:03:05 n001 kernel: rpcrdma: connection to 10.10.11.100:20049 
on mlx4_0, memreg 5 slots 32 ird 16
Jul 31 15:07:07 n001 kernel: RPC:       rpcrdma_sendcq_process_wc: frmr 
ffff881060e59d40 (stale): WR flushed
Jul 31 15:07:07 n001 kernel: RPC:       rpcrdma_sendcq_process_wc: frmr 
ffff8810677efac0 (stale): WR flushed
Jul 31 15:07:07 n001 kernel: RPC:       rpcrdma_sendcq_process_wc: frmr 
ffff88106638a640 (stale): WR flushed
Jul 31 15:07:07 n001 kernel: RPC:       rpcrdma_sendcq_process_wc: frmr 
ffff8810671f03c0 (stale): WR flushed
Jul 31 15:09:06 n001 kernel: rpcrdma: connection to 10.10.11.100:20049 
on mlx4_0, memreg 5 slots 32 ird 16
Jul 31 15:16:09 n001 kernel: rpcrdma: connection to 10.10.11.100:20049 
closed (-103)
Jul 31 15:53:32 n001 kernel: nfs: server 10.10.10.100 not responding, 
timed out
Jul 31 16:08:56 n001 kernel: nfs: server 10.10.10.100 not responding, 
timed out

Jul 30 18:17:26 n002 kernel: RPC:       rpcrdma_sendcq_process_wc: frmr 
ffff881064461500 (stale): WR flushed
Jul 30 18:17:26 n002 kernel: RPC:       rpcrdma_sendcq_process_wc: frmr 
ffff8810604b2600 (stale): WR flushed
Jul 30 18:19:26 n002 kernel: nfs: server 10.10.11.100 not responding, 
still trying
Jul 30 18:19:38 n002 kernel: rpcrdma: connection to 10.10.11.100:20049 
on mlx4_0, memreg 5 slots 32 ird 16
Jul 30 18:19:38 n002 kernel: nfs: server 10.10.11.100 OK
Jul 31 14:43:35 n002 kernel: rpcrdma: connection to 10.10.11.100:20049 
closed (-103)
Jul 31 16:08:56 n002 kernel: nfs: server 10.10.10.100 not responding, 
timed out

Similar messages show up on the other clients n003-n009.  After these 
messages appear, a client's load climbs steadily (visible through 
Ganglia), presumably because processes are blocked waiting for the NFS 
mounts to come back.  The clients are no longer reachable over SSH, and 
root cannot log in on the console via the IPMI web applet either (it just 
hangs after the password is entered, and may eventually reach a prompt, 
but the load is so high the machine is unusable), so they have to be 
rebooted through the IPMI interface.
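If the console responds at all next time, a sysrq task dump might capture 
what the hung tasks are blocked on for the bug report; a rough sketch 
(untested on our setup), run as root:

```shell
# Sketch: dump all task stacks to the kernel log, then save dmesg
# to a timestamped file to attach to the bug report.
capture_traces() {
  out="nfs-hang-$(hostname)-$(date +%Y%m%d-%H%M%S).log"
  echo 1 > /proc/sys/kernel/sysrq    # enable all sysrq functions
  echo t > /proc/sysrq-trigger       # 't' = dump task states and stacks to dmesg
  dmesg > "$out"
  echo "$out"                        # name of the file to attach
}
```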

Here is /etc/fstab on the server,
UUID=f15df051-ffb8-408c-8ad2-1987b6f082a2	/	ext3	defaults	0 1
UUID=c854ee27-32cf-445d-8308-4e6f1a87d364	/boot	ext3	defaults	0 2
UUID=b92a100f-2521-408b-9b15-93671c6ae056	swap	swap	defaults	0 0
UUID=a8a7b737-25ed-43a7-ae4b-391c71aa8c08	/data	xfs	defaults	0 2
UUID=d5692ec2-d5dc-4bb8-98d4-a4fb2ff54748	/projects xfs	defaults	0 2
/dev/drbd0					/newwing xfs	noauto	0 0
UUID=a305f309-d997-43ec-8e4f-78e26b07652f	/working xfs	defaults	0 2
tmpfs	/dev/shm	tmpfs   defaults        0 0
devpts	/dev/pts	devpts  gid=5,mode=620  0 0
sysfs	/sys		sysfs   defaults        0 0
proc	/proc		proc    defaults        0 0

I have read that adding "inode64,nobarrier" to the xfs mount options may 
help; that is something I can try the next time the server can be rebooted.
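For reference, the /working entry with inode64 added would look like this 
(same UUID as in the fstab above; per the mount list below, /working 
already has nobarrier, so inode64 would be the only new option there, and 
it only takes effect on a fresh mount):

UUID=a305f309-d997-43ec-8e4f-78e26b07652f	/working xfs	defaults,inode64,nobarrier	0 2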

Here are the current mounts on the server,
/dev/sda3 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/sda1 on /boot type ext3 (rw)
/dev/sdc1 on /data type xfs (rw)
/dev/sdb1 on /projects type xfs (rw)
/dev/sde1 on /working type xfs (rw,nobarrier)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
nfsd on /proc/fs/nfsd type nfsd (rw)
/dev/drbd0 on /newwing type xfs (rw)

Here is /etc/exports on the server,
/data    10.10.10.0/24(rw,no_root_squash,async)
/data    10.10.11.0/24(rw,no_root_squash,async)
/data    150.x.x.192/27(rw,no_root_squash,async)
/data    150.x.x.64/26(rw,no_root_squash,async)
/home    10.10.10.0/24(rw,no_root_squash,async)
/home    10.10.11.0/24(rw,no_root_squash,async)
/opt    10.10.10.0/24(rw,no_root_squash,async)
/opt    10.10.11.0/24(rw,no_root_squash,async)
/projects    10.10.10.0/24(rw,no_root_squash,async)
/projects    10.10.11.0/24(rw,no_root_squash,async)
/projects    150.x.x.192/27(rw,no_root_squash,async)
/projects    150.x.x.64/26(rw,no_root_squash,async)
/tools    10.10.10.0/24(rw,no_root_squash,async)
/tools    10.10.11.0/24(rw,no_root_squash,async)
/usr/share/gridengine     10.10.10.10/24(rw,no_root_squash,async)
/usr/share/gridengine     10.10.11.10/24(rw,no_root_squash,async)
/usr/local    10.10.10.10/24(rw,no_root_squash,async)
/usr/local    10.10.11.10/24(rw,no_root_squash,async)
/working    10.10.10.0/24(rw,no_root_squash,async)
/working    10.10.11.0/24(rw,no_root_squash,async)
/working    150.x.x.192/27(rw,no_root_squash,async)
/working    150.x.x.64/26(rw,no_root_squash,async)
/newwing    10.10.10.0/24(rw,no_root_squash,async)
/newwing    10.10.11.0/24(rw,no_root_squash,async)
/newwing    150.x.x.192/27(rw,no_root_squash,async)
/newwing    150.x.x.64/26(rw,no_root_squash,async)

The 10.10.10.0/24 network is 1GbE and 10.10.11.0/24 is the Infiniband; 
the other networks are also 1GbE.  The cluster nodes normally mount all 
of these over the Infiniband with RDMA.  Computation jobs mostly use 
/working, which sees the most reading and writing, but /newwing, 
/projects, and /data are also used.

Here is an /etc/fstab from the nodes,
#NFS/RDMA
#10.10.11.100:/opt			/opt			nfs	rdma,port=20049	0 0
#10.10.11.100:/data			/data			nfs	rdma,port=20049	0 0
#10.10.11.100:/tools			/tools			nfs	rdma,port=20049	0 0
#10.10.11.100:/home			/home			nfs	rdma,port=20049	0 0
#10.10.11.100:/usr/local			/usr/local		nfs	rdma,port=20049	0 0
#10.10.11.100:/usr/share/gridengine	/usr/share/gridengine	nfs 
rdma,port=20049 0 0
#10.10.11.100:/projects   		/projects		nfs	rdma,port=20049	0 0
#10.10.11.100:/working			/working		nfs	rdma,port=20049 0 0
#10.10.11.100:/newwing			/newwing		nfs	rdma,port=20049 0 0

#NFS/IPoIB
10.10.11.100:/opt			/opt			nfs	tcp		0 0
10.10.11.100:/data			/data			nfs	tcp		0 0
10.10.11.100:/tools			/tools			nfs	tcp		0 0
10.10.11.100:/home			/home			nfs	tcp		0 0
10.10.11.100:/usr/local		/usr/local		nfs	tcp		0 0
10.10.11.100:/usr/share/gridengine	/usr/share/gridengine	nfs	tcp		0 0
10.10.11.100:/projects   		/projects		nfs	tcp		0 0
10.10.11.100:/working			/working		nfs	tcp		0 0
10.10.11.100:/newwing			/newwing		nfs	tcp		0 0

#NFS/TCP
#10.10.10.100:/opt			/opt			nfs	defaults	0 0
#10.10.10.100:/data			/data			nfs	defaults	0 0
#10.10.10.100:/tools			/tools			nfs	defaults	0 0
#10.10.10.100:/home			/home			nfs	defaults	0 0
#10.10.10.100:/usr/local			/usr/local		nfs	defaults	0 0
#10.10.10.100:/usr/share/gridengine	/usr/share/gridengine	nfs	defaults	0 0
#10.10.10.100:/projects			/projects		nfs	defaults	0 0
#10.10.10.100:/working			/working		nfs	defaults	0 0
#10.10.10.100:/newwing			/newwing		nfs	defaults	0 0

With this I can switch the NFS mounts between the different 
interfaces/protocols.  Currently we are trying IPoIB; we haven't started 
a cluster job yet, so I'm not sure how it will perform.  With NFS/TCP 
over 1GbE the server and nodes would hang from time to time but at least 
did not crash, although it was of course slow, being limited by 1GbE.
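One way to double-check which transport the clients actually ended up on 
after switching is the proto= option in /proc/mounts; a quick sed helper 
(the sample line below is illustrative, not copied from a node):

```shell
# Pull the proto= option out of an nfs line from /proc/mounts.
nfs_proto() { sed -n 's/.* nfs4\{0,1\} .*proto=\([a-z]*\).*/\1/p'; }

# Illustrative line; on a real node use:  nfs_proto < /proc/mounts
echo '10.10.11.100:/working /working nfs rw,vers=3,proto=tcp,port=0 0 0' | nfs_proto
# prints: tcp
```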

We didn't have this problem until recently.  I upgraded our cluster to 
add the two additional nodes (n008 and n009), and we also added more 
storage to the server (/newwing and /working).  The new nodes are on the 
AMD EPYC platform whereas the server and nodes n001-n007 are on the Intel 
Xeon platform; I'm not sure whether that could cause such a crash.  The 
new nodes were cloned from n001, and only the kernel command line and 
network parameters were changed.

Jobs are submitted to the cluster via Sun Grid Engine, and in total about 
61 jobs may start at once and open connections to the NFS server, which 
sounds like a system overload.  However, the load on the server normally 
remains low, under 10%, and even while it hangs the load only rises to 
around 80%.  The server is a few years old but still has 2x 6-core Intel 
Xeon E5-2620 v2 @ 2.10GHz with 128GB of RAM.
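Since about 61 jobs can open connections at once, one more thing I may try 
is raising the number of nfsd threads on the server (the CentOS 6 default 
in /etc/sysconfig/nfs is RPCNFSDCOUNT=8; 64 below is just a guess, not a 
tested value):

RPCNFSDCOUNT=64

or at runtime,

echo 64 > /proc/fs/nfsd/threads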

I would appreciate your assistance in troubleshooting this critical 
problem and, if needed, in gathering the required information to submit a 
bug to the tracker!

Thanks,
-- 
Chandler
Arizona Genomics Institute
www.genome.arizona.edu