Hi, we are having a problem with NFS using the RDMA protocol over our FDR10 InfiniBand network. I previously wrote to the NFS mailing list about this, so you may find our discussion there. I have taken some load off the server by converting a backup job that used NFS to use SSH instead, but we are still having critical problems with NFS clients losing their connection to the server, causing the clients to hang and need a reboot. I wanted to check in here before filing a bug with CentOS.
Our setup is a cluster with one head node (the NFS server) and 9 compute nodes (NFS clients). All of the machines are running CentOS 6.9 with kernel 2.6.32-696.30.1.el6.x86_64 and use the "inbox" CentOS RDMA implementation/drivers (not Mellanox OFED). (We also have other NFS clients, but they use 1GbE for their NFS connection and, while they will still hang with messages like "NFS server not responding, retrying" or "timed out", they eventually recover and don't need a reboot.)
On the server (which is named pac) I will see messages like this:

Jul 30 18:19:38 pac kernel: svcrdma: failed to send reply chunks, rc=-5
Jul 30 18:19:38 pac kernel: svcrdma: failed to send write chunks, rc=-5
Jul 31 15:03:05 pac kernel: svcrdma: failed to send write chunks, rc=-5
Jul 31 15:09:06 pac kernel: svcrdma: failed to send write chunks, rc=-5
Jul 31 15:16:09 pac kernel: svcrdma: failed to send write chunks, rc=-5
Jul 31 15:23:31 pac kernel: svcrdma: Error -107 posting RDMA_READ
Jul 31 15:53:55 pac kernel: svcrdma: failed to send write chunks, rc=-5
Jul 31 16:09:19 pac kernel: svcrdma: failed to send reply chunks, rc=-5
Jul 31 16:09:19 pac kernel: svcrdma: failed to send reply chunks, rc=-5
Previously I had also seen messages like "Jul 11 21:09:56 pac kernel: nfsd: peername failed (err 107)!", however I have not seen that in this latest hangup.
And on the clients (named n001-n009) I will see messages like this:

Jul 30 18:17:26 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810674024c0 (stale): WR flushed
Jul 30 18:17:26 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff88106638a640 (stale): WR flushed
Jul 30 18:19:26 n001 kernel: nfs: server 10.10.11.100 not responding, still trying
Jul 30 18:19:36 n001 kernel: nfs: server 10.10.10.100 not responding, timed out
Jul 30 18:19:38 n001 kernel: rpcrdma: connection to 10.10.11.100:20049 on mlx4_0, memreg 5 slots 32 ird 16
Jul 30 18:19:38 n001 kernel: nfs: server 10.10.11.100 OK
Jul 31 14:42:08 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810671f02c0 (stale): WR flushed
Jul 31 14:42:08 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810677bda40 (stale): WR flushed
Jul 31 14:42:08 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810677bd940 (stale): WR flushed
Jul 31 14:42:08 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810671f0240 (stale): WR flushed
Jul 31 14:43:35 n001 kernel: rpcrdma: connection to 10.10.11.100:20049 on mlx4_0, memreg 5 slots 32 ird 16
Jul 31 15:01:53 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff881065133140 (stale): WR flushed
Jul 31 15:01:53 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810666e3f00 (stale): WR flushed
Jul 31 15:01:53 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff881063ea0dc0 (stale): WR flushed
Jul 31 15:01:53 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810677bdb40 (stale): WR flushed
Jul 31 15:03:05 n001 kernel: rpcrdma: connection to 10.10.11.100:20049 on mlx4_0, memreg 5 slots 32 ird 16
Jul 31 15:07:07 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff881060e59d40 (stale): WR flushed
Jul 31 15:07:07 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810677efac0 (stale): WR flushed
Jul 31 15:07:07 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff88106638a640 (stale): WR flushed
Jul 31 15:07:07 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810671f03c0 (stale): WR flushed
Jul 31 15:09:06 n001 kernel: rpcrdma: connection to 10.10.11.100:20049 on mlx4_0, memreg 5 slots 32 ird 16
Jul 31 15:16:09 n001 kernel: rpcrdma: connection to 10.10.11.100:20049 closed (-103)
Jul 31 15:53:32 n001 kernel: nfs: server 10.10.10.100 not responding, timed out
Jul 31 16:08:56 n001 kernel: nfs: server 10.10.10.100 not responding, timed out
Jul 30 18:17:26 n002 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff881064461500 (stale): WR flushed
Jul 30 18:17:26 n002 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810604b2600 (stale): WR flushed
Jul 30 18:19:26 n002 kernel: nfs: server 10.10.11.100 not responding, still trying
Jul 30 18:19:38 n002 kernel: rpcrdma: connection to 10.10.11.100:20049 on mlx4_0, memreg 5 slots 32 ird 16
Jul 30 18:19:38 n002 kernel: nfs: server 10.10.11.100 OK
Jul 31 14:43:35 n002 kernel: rpcrdma: connection to 10.10.11.100:20049 closed (-103)
Jul 31 16:08:56 n002 kernel: nfs: server 10.10.10.100 not responding, timed out
Similar messages show up on the other clients n003-n009. After these messages appear, the clients' load climbs continuously (visible through Ganglia), I would guess because they are waiting for the NFS mounts to reappear. They are no longer reachable through SSH, and root cannot log in on the console via the IPMI web applet either (it just hangs after entering the password; it may eventually reach a prompt, but the system load is so high), so they have to be rebooted through the IPMI interface.
Here is /etc/fstab on the server:

UUID=f15df051-ffb8-408c-8ad2-1987b6f082a2 / ext3 defaults 0 1
UUID=c854ee27-32cf-445d-8308-4e6f1a87d364 /boot ext3 defaults 0 2
UUID=b92a100f-2521-408b-9b15-93671c6ae056 swap swap defaults 0 0
UUID=a8a7b737-25ed-43a7-ae4b-391c71aa8c08 /data xfs defaults 0 2
UUID=d5692ec2-d5dc-4bb8-98d4-a4fb2ff54748 /projects xfs defaults 0 2
/dev/drbd0 /newwing xfs noauto 0 0
UUID=a305f309-d997-43ec-8e4f-78e26b07652f /working xfs defaults 0 2
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
I read that adding "inode64,nobarrier" to the XFS mount options may help? That is something I can try the next time the server can be rebooted.
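For reference, I think the fstab entry would then look something like this (using /data from the server's fstab above as an example; this is just my reading of the option syntax, not something we have tested yet):

UUID=a8a7b737-25ed-43a7-ae4b-391c71aa8c08 /data xfs inode64,nobarrier 0 2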
Here are the current mounts on the server:

/dev/sda3 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/sda1 on /boot type ext3 (rw)
/dev/sdc1 on /data type xfs (rw)
/dev/sdb1 on /projects type xfs (rw)
/dev/sde1 on /working type xfs (rw,nobarrier)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
nfsd on /proc/fs/nfsd type nfsd (rw)
/dev/drbd0 on /newwing type xfs (rw)
Here is /etc/exports on the server:

/data 10.10.10.0/24(rw,no_root_squash,async)
/data 10.10.11.0/24(rw,no_root_squash,async)
/data 150.x.x.192/27(rw,no_root_squash,async)
/data 150.x.x.64/26(rw,no_root_squash,async)
/home 10.10.10.0/24(rw,no_root_squash,async)
/home 10.10.11.0/24(rw,no_root_squash,async)
/opt 10.10.10.0/24(rw,no_root_squash,async)
/opt 10.10.11.0/24(rw,no_root_squash,async)
/projects 10.10.10.0/24(rw,no_root_squash,async)
/projects 10.10.11.0/24(rw,no_root_squash,async)
/projects 150.x.x.192/27(rw,no_root_squash,async)
/projects 150.x.x.64/26(rw,no_root_squash,async)
/tools 10.10.10.0/24(rw,no_root_squash,async)
/tools 10.10.11.0/24(rw,no_root_squash,async)
/usr/share/gridengine 10.10.10.10/24(rw,no_root_squash,async)
/usr/share/gridengine 10.10.11.10/24(rw,no_root_squash,async)
/usr/local 10.10.10.10/24(rw,no_root_squash,async)
/usr/local 10.10.11.10/24(rw,no_root_squash,async)
/working 10.10.10.0/24(rw,no_root_squash,async)
/working 10.10.11.0/24(rw,no_root_squash,async)
/working 150.x.x.192/27(rw,no_root_squash,async)
/working 150.x.x.64/26(rw,no_root_squash,async)
/newwing 10.10.10.0/24(rw,no_root_squash,async)
/newwing 10.10.11.0/24(rw,no_root_squash,async)
/newwing 150.x.x.192/27(rw,no_root_squash,async)
/newwing 150.x.x.64/26(rw,no_root_squash,async)
The 10.10.10.0/24 network is 1GbE and 10.10.11.0/24 is the InfiniBand network. The other networks are also 1GbE. Our cluster nodes normally mount all of these exports over InfiniBand with RDMA. The computation jobs mostly use /working, which sees the most reading/writing, but /newwing, /projects, and /data are also used.
Here is an /etc/fstab from the nodes:

#NFS/RDMA
#10.10.11.100:/opt /opt nfs rdma,port=20049 0 0
#10.10.11.100:/data /data nfs rdma,port=20049 0 0
#10.10.11.100:/tools /tools nfs rdma,port=20049 0 0
#10.10.11.100:/home /home nfs rdma,port=20049 0 0
#10.10.11.100:/usr/local /usr/local nfs rdma,port=20049 0 0
#10.10.11.100:/usr/share/gridengine /usr/share/gridengine nfs rdma,port=20049 0 0
#10.10.11.100:/projects /projects nfs rdma,port=20049 0 0
#10.10.11.100:/working /working nfs rdma,port=20049 0 0
#10.10.11.100:/newwing /newwing nfs rdma,port=20049 0 0
#NFS/IPoIB
10.10.11.100:/opt /opt nfs tcp 0 0
10.10.11.100:/data /data nfs tcp 0 0
10.10.11.100:/tools /tools nfs tcp 0 0
10.10.11.100:/home /home nfs tcp 0 0
10.10.11.100:/usr/local /usr/local nfs tcp 0 0
10.10.11.100:/usr/share/gridengine /usr/share/gridengine nfs tcp 0 0
10.10.11.100:/projects /projects nfs tcp 0 0
10.10.11.100:/working /working nfs tcp 0 0
10.10.11.100:/newwing /newwing nfs tcp 0 0
#NFS/TCP
#10.10.10.100:/opt /opt nfs defaults 0 0
#10.10.10.100:/data /data nfs defaults 0 0
#10.10.10.100:/tools /tools nfs defaults 0 0
#10.10.10.100:/home /home nfs defaults 0 0
#10.10.10.100:/usr/local /usr/local nfs defaults 0 0
#10.10.10.100:/usr/share/gridengine /usr/share/gridengine nfs defaults 0 0
#10.10.10.100:/projects /projects nfs defaults 0 0
#10.10.10.100:/working /working nfs defaults 0 0
#10.10.10.100:/newwing /newwing nfs defaults 0 0
Here I can switch between the different interfaces/protocols for the NFS mounts. Currently we are trying IPoIB; we haven't started a cluster job yet, so I am not sure how it will perform. With NFS/TCP over 1GbE the server/nodes would still hang from time to time, but at least they did not crash; however, it was of course slow, being limited by 1GbE.
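For testing, I can also remount a single export by hand instead of editing fstab and rebooting, something like this (assuming the mount point is unmounted first):

mount -t nfs -o rdma,port=20049 10.10.11.100:/working /working    # NFS/RDMA
mount -t nfs -o proto=tcp 10.10.11.100:/working /working          # NFS over IPoIB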
We did not have this problem until recently. I upgraded our cluster to add the two additional nodes (n008 and n009), and we also added more storage to the server (/newwing and /working). The new nodes are an AMD EPYC platform whereas the server and nodes n001-n007 are an Intel Xeon platform; I am not sure if that could cause such a crash. The new nodes were cloned from n001, and only the kernel command line and network parameters were changed.
The jobs are submitted to the cluster via Sun Grid Engine, and in total there are about 61 jobs that may start at once and open connections to the NFS server... it sounds like a system overload, although the load on the server normally remains low (under 10%), and even while it is hanging the load may only increase to 80%. The server is a few years old but still has 2x 6-core Intel Xeon E5-2620 v2 @ 2.10GHz with 128GB of RAM.
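One thing I haven't ruled out on my side is the number of nfsd threads on the server, in case 61 simultaneous jobs is simply too many for the default. If I understand the CentOS 6 setup correctly, it can be checked and raised roughly like this (RPCNFSDCOUNT in /etc/sysconfig/nfs being the persistent setting, I believe):

cat /proc/fs/nfsd/threads          # current number of nfsd threads
echo 64 > /proc/fs/nfsd/threads    # raise it on the running server
# or set RPCNFSDCOUNT=64 in /etc/sysconfig/nfs and restart the nfs service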
I would appreciate your assistance troubleshooting this critical problem and, if needed, gathering the information required to submit a bug to the tracker!
Thanks,
Hi, I forgot to add the following information, which was discussed on the NFS mailing list with Chuck Lever and leads us to believe there is a software bug in the kernel, not necessarily a server overload.
On the NFS server, we also mount some other NFS shares from other NFS servers, over 1GbE:

150.x.x.116:/wing on /wing type nfs (rw,addr=150.x.x.116)
10.10.10.201:/opt/ftproot on /opt/ftproot type nfs (rw,vers=4,addr=10.10.10.201,clientaddr=10.10.10.100)
150.x.x.202:/archive on /archive type nfs (rw,vers=4,addr=150.x.x.202,clientaddr=128.x.x.2)
This hangup/bug seems to occur when we are reading/writing to these other shares from the NFS server while the NFS server is also busy processing our work from the cluster via the RDMA exports. There used to be two other NFS mounts, used as targets for backups that were scheduled every night at 8 PM, and I noticed the RDMA errors from my original post were all showing up shortly after 8 PM. So we decided to get rid of those NFS mounts and convert the backup to transfer via SSH instead. The RDMA errors stopped appearing after 8 PM when the backup ran, but the errors still show up when we are reading/writing to the other NFS mounts above, which we still need.
It seems we should be able to use these different mounts and exports without issue, which leads us to believe there is a software bug somewhere.
Are there any other suggested solutions to this problem? Perhaps some system, network, and/or filesystem tuning? Any comments on adding the "inode64,nobarrier" XFS mount options? Is there any extra information I can gather to help with a bug report, such as debug output?
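For example, before the next hang I could turn on RPC/NFS debug logging with rpcdebug, something along these lines (and clear the flags again with -c afterwards, since the output is very verbose):

rpcdebug -m rpc -s all     # on both the server and the clients
rpcdebug -m nfsd -s all    # on the server
rpcdebug -m nfs -s all     # on the clients

Would that kind of output be useful for a bug report, or is there something more specific to the RDMA transport that I should capture?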
Thanks