[CentOS] nfs (or tcp or scheduler) changes between centos 5 and 6?

Thu Apr 30 12:24:27 UTC 2015
Peter van Hooft <hooft at natlab.research.philips.com>

> Date: Wed, 29 Apr 2015 08:35:29 -0500
> From: Matt Garman <matthew.garman at gmail.com>
> To: CentOS mailing list <centos at centos.org>
> Subject: [CentOS] nfs (or tcp or scheduler) changes between centos 5
> 	and 6?
> Message-ID:
> 	<CAJvUf-CyTg8ZiGq3OXRLKw7s1K2dGx1gqo_2XwOAXXQty=RHZQ at mail.gmail.com>
> 
> We have a "compute cluster" of about 100 machines that do a read-only
> NFS mount to a big NAS filer (a NetApp FAS6280).  The jobs running on
> these boxes are analysis/simulation jobs that constantly read data off
> the NAS.
> 
> We recently upgraded all these machines from CentOS 5.7 to CentOS 6.5.
> We did a "piecemeal" upgrade, usually upgrading five or so machines at
> a time, every few days.  We noticed improved performance on the CentOS
> 6 boxes.  But as the number of CentOS 6 boxes increased, we actually
> saw performance on the CentOS 5 boxes decrease.  By the time we had
> only a few CentOS 5 boxes left, they were performing so badly as to be
> effectively worthless.
> 
> What we observed in parallel to this upgrade process was that the read
> latency on our NetApp device skyrocketed.  This in turn caused all
> compute jobs to actually run slower, as it seemed to move the
> bottleneck from the client servers' OS to the NetApp.  This is
> somewhat counter-intuitive: CentOS 6 performs faster, but actually
> results in net performance loss because it creates a bottleneck on our
> centralized storage.
> 
> All indications are that CentOS 6 seems to be much more "aggressive"
> in how it does NFS reads.  And likewise, CentOS 5 was very "polite",
> to the point that it basically got starved out by the introduction of
> the 6.5 boxes.
> 
> What I'm looking for is a "deep dive" list of changes to the NFS
> implementation between CentOS 5 and CentOS 6.  Or maybe this is due to
> a change in the TCP stack?  Or maybe the scheduler?  We've tried a lot
> of sysctl tcp tunings, various nfs mount options, anything that's
> obviously different between 5 and 6... But so far we've been unable to
> find the "smoking gun" that causes the obvious behavior change between
> the two OS versions.
> 
> Just hoping that maybe someone else out there has seen something like
> this, or can point me to some detailed documentation that might clue
> me in on what to look for next.
> 
> Thanks!
> 


You may want to try reducing sunrpc.tcp_max_slot_table_entries.
Each slot holds one outstanding RPC request, so the slot count limits how
many requests a client can have in flight to the server at once.
In CentOS 5 the number of slots is fixed: sunrpc.tcp_slot_table_entries = 16
In CentOS 6, this number is dynamic, with a maximum of
sunrpc.tcp_max_slot_table_entries, which defaults to 65536.
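
To see what a given client is currently using, something like the
following Python sketch could help. It assumes the two sysctls appear as
files under /proc/sys/sunrpc once the sunrpc module is loaded; the helper
name read_slot_settings is made up for this example, and a key that is
missing (for instance the max setting on a CentOS 5 kernel) is reported
as not available.

```python
from pathlib import Path

# Sysctl names taken from the text above; the second one is only
# expected on kernels that allocate RPC slots dynamically.
SLOT_KEYS = ("tcp_slot_table_entries", "tcp_max_slot_table_entries")


def read_slot_settings(base="/proc/sys/sunrpc"):
    """Return {key: int or None} for each sunrpc slot-table sysctl under base."""
    settings = {}
    for key in SLOT_KEYS:
        path = Path(base) / key
        try:
            settings[key] = int(path.read_text().strip())
        except (OSError, ValueError):
            # File missing (module not loaded, older kernel) or unreadable.
            settings[key] = None
    return settings


if __name__ == "__main__":
    for key, value in read_slot_settings().items():
        shown = value if value is not None else "not available"
        print(f"sunrpc.{key} = {shown}")
```

Running this on one CentOS 5 box and one CentOS 6 box should make the
difference in slot limits visible side by side.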

We put that in /etc/modprobe.d/sunrpc.conf:

    options sunrpc tcp_max_slot_table_entries=128

You can't put this in /etc/sysctl.conf because the sunrpc kernel module
is loaded before sysctl -p is done.
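
After a reboot it is worth checking that the module option actually took
effect. The sketch below is one way to do that in Python, under two
assumptions: that the options line has the usual "options <module>
key=value" form, and that the running kernel exposes the parameter under
/sys/module/sunrpc/parameters/. Both helper names are invented for this
example.

```python
from pathlib import Path


def parse_modprobe_options(text, module="sunrpc"):
    """Collect key=value pairs from 'options <module> ...' lines."""
    params = {}
    for line in text.splitlines():
        # Drop trailing comments, then split on whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 3 and fields[0] == "options" and fields[1] == module:
            for item in fields[2:]:
                key, sep, value = item.partition("=")
                if sep:
                    params[key] = value
    return params


def live_module_param(name, module="sunrpc", sys_root="/sys/module"):
    """Return the value the running kernel reports, or None if unavailable."""
    try:
        return (Path(sys_root) / module / "parameters" / name).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    # The options line from this post, used here as sample input.
    wanted = parse_modprobe_options("options sunrpc tcp_max_slot_table_entries=128")
    for name, value in wanted.items():
        print(f"{name}: configured={value} live={live_module_param(name)}")
```

If the live value differs from the configured one, the option was not
picked up when the module loaded.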

peter