[CentOS] Slow performance with NFSv4.1 on CentOS 7.5 ?

Wed May 8 13:44:19 UTC 2019
James Pearson <james-p at moving-picture.com>

James Pearson wrote:
> 
> We have a number of identical NFS clients mounting a server using
> NFSv4.1 - server and clients are all running CentOS 7.5 (kernel
> 3.10.0-862.14.4.el7.x86_64)
> 
> However, on some clients, the NFS performance 'degrades' with time ...
> 
> Running a simple test - a python script that just imports a module
> (python and its modules are installed on the NFS share) - can be an order
> of magnitude or more slower on some clients, i.e. very little data is
> transferred; it is the rate of stat'ing and opening files on the NFS
> server that is 'slow'
> 
> Running a tcpdump on a 'slow' client shows that the NFS traffic
> generated on the 'slow' client is again an order of magnitude or more
> greater than that generated by a 'fast' client
> 
> The majority of the extra NFS traffic in the slow case appears to be a
> large number of NFS 'TEST_STATEID' calls made by the client - which are
> not there in the tcpdump on the fast client
> 
> The issue can be 'fixed' in the short term by rebooting the affected
> client - and after a reboot, running the same tcpdump shows no
> TEST_STATEID calls - however after a while (several days), the
> performance might degrade again
> 
> I've found a number of reports of excessive TEST_STATEID calls - but
> most seem to relate to NFSv4 client hangs - which is not happening here
> - things are working, but much slower than they should be ...
> 
> Has anyone come across this issue - and have any fixes/workarounds?
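
A 'slow' and a 'fast' client can also be compared without a packet capture, using the per-operation counters that the Linux NFS client exposes in /proc/self/mountstats. What follows is only a minimal sketch: it assumes the usual "OPNAME: ops ..." layout of that file, where the first number after the operation name is the number of operations issued. The helper name count_nfs_op is made up for illustration.

```shell
# Sum the "ops" column for one NFS operation across all mounts.
# Reads /proc/self/mountstats-formatted text on stdin.
# count_nfs_op is a hypothetical helper name.
count_nfs_op() {
    awk -v op="$1:" '$1 == op { total += $2 } END { print total + 0 }'
}

# On a client:
#   count_nfs_op TEST_STATEID < /proc/self/mountstats
```

Running it before and after the import test on each client should show whether the TEST_STATEID count rises on the 'slow' client but stays flat on the 'fast' one.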

After a bit of further poking about, it looks like we are hitting the 
issue described in:

  https://bugzilla.redhat.com/show_bug.cgi?id=1552203

The fix should be in 7.7, but the linked:

  https://access.redhat.com/solutions/3915571

(login required)

has a few suggested workarounds, which include mounting with NFSv4.0 or 
disabling 'NFSv4 delegations' on the server via:

  sysctl -w fs.leases-enable=0

I've used the above sysctl which appears to have fixed the issue - 
although I had to restart NFS on the server to notice any change with 
the affected clients - so I'm not sure if the sysctl change or the NFS 
restart 'fixed' the issue ...
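
A setting made with 'sysctl -w' does not survive a reboot, so it is worth checking what the server currently has and persisting it if it is to stay. The following is a rough sketch rather than a recommended procedure: leases_state is a hypothetical helper, and the file name under /etc/sysctl.d/ is an assumption.

```shell
# Report whether file leases (which the sysctl above turns off in order
# to disable NFSv4 delegations) are currently enabled.
# The optional argument exists only so the function can be tried on a
# copy of the file; leases_state is a hypothetical helper name.
leases_state() {
    f="${1:-/proc/sys/fs/leases-enable}"
    case "$(cat "$f")" in
        0) echo "disabled" ;;
        *) echo "enabled" ;;
    esac
}

# On the server, as root (file name below is an assumption):
#   leases_state
#   echo 'fs.leases-enable = 0' > /etc/sysctl.d/90-nfs-leases.conf
#   sysctl --system
```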

James Pearson