[CentOS] nfs (or tcp or scheduler) changes between centos 5 and 6?

Thu Apr 30 12:31:38 UTC 2015
Peter van Hooft <hooft at natlab.research.philips.com>

On Thu, Apr 30, 2015 at 02:24:27PM +0200, Peter van Hooft wrote:
> > Date: Wed, 29 Apr 2015 08:35:29 -0500
> > From: Matt Garman <matthew.garman at gmail.com>
> > To: CentOS mailing list <centos at centos.org>
> > Subject: [CentOS] nfs (or tcp or scheduler) changes between centos 5 and 6?
> > 
> > We have a "compute cluster" of about 100 machines that do a read-only
> > NFS mount to a big NAS filer (a NetApp FAS6280).  The jobs running on
> > these boxes are analysis/simulation jobs that constantly read data off
> > the NAS.
> > 
> > We recently upgraded all these machines from CentOS 5.7 to CentOS 6.5.
> > We did a "piecemeal" upgrade, usually upgrading five or so machines at
> > a time, every few days.  We noticed improved performance on the CentOS
> > 6 boxes.  But as the number of CentOS 6 boxes increased, we actually
> > saw performance on the CentOS 5 boxes decrease.  By the time we had
> > only a few CentOS 5 boxes left, they were performing so badly as to be
> > effectively worthless.
> > 
> > What we observed in parallel to this upgrade process was that the read
> > latency on our NetApp device skyrocketed.  This in turn caused all
> > compute jobs to actually run slower, as it seemed to move the
> > bottleneck from the client servers' OS to the NetApp.  This is
> > somewhat counter-intuitive: CentOS 6 performs faster, but actually
> > results in net performance loss because it creates a bottleneck on our
> > centralized storage.
> > 
> > All indications are that CentOS 6 seems to be much more "aggressive"
> > in how it does NFS reads.  And likewise, CentOS 5 was very "polite",
> > to the point that it basically got starved out by the introduction of
> > the 6.5 boxes.
> > 
> > What I'm looking for is a "deep dive" list of changes to the NFS
> > implementation between CentOS 5 and CentOS 6.  Or maybe this is due to
> > a change in the TCP stack?  Or maybe the scheduler?  We've tried a lot
> > of sysctl tcp tunings, various nfs mount options, anything that's
> > obviously different between 5 and 6... But so far we've been unable to
> > find the "smoking gun" that causes the obvious behavior change between
> > the two OS versions.
> > 
> > Just hoping that maybe someone else out there has seen something like
> > this, or can point me to some detailed documentation that might clue
> > me in on what to look for next.
> > 
> > Thanks!
> > 
> 
> 
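> You may want to try reducing sunrpc.tcp_max_slot_table_entries.
> In CentOS 5 the number of slots is fixed: sunrpc.tcp_slot_table_entries = 16.
> In CentOS 6 this number is dynamic, with a maximum of
> sunrpc.tcp_max_slot_table_entries, which has a default value of 65536.
> Each slot is one outstanding RPC request, so a CentOS 6 client can keep
> far more reads in flight against the filer than a CentOS 5 client can.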
> 
> We put that in /etc/sysconfig/modprobe.d/sunrpc.conf:
>
>     options sunrpc tcp_max_slot_table_entries=128

Make that /etc/modprobe.d/sunrpc.conf, of course.
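
For reference, a minimal sketch of checking and pinning the limit on a
client. The two function names are made up for illustration, the default
paths are the usual sysctl and modprobe.d locations, and 128 is simply the
value that worked for us, not a general recommendation.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, paths are the usual defaults.

# Print the slot table limit the running kernel currently uses.
# $1: sysctl directory (assumed default: /proc/sys/sunrpc, which only
#     exists once the sunrpc module is loaded)
show_slot_limit() {
    cat "${1:-/proc/sys/sunrpc}/tcp_max_slot_table_entries"
}

# Write the module option so it is applied whenever sunrpc is loaded.
# $1: limit to set, e.g. 128
# $2: target file (assumed default: /etc/modprobe.d/sunrpc.conf)
write_sunrpc_conf() {
    printf 'options sunrpc tcp_max_slot_table_entries=%s\n' "$1" \
        > "${2:-/etc/modprobe.d/sunrpc.conf}"
}
```

As root, `write_sunrpc_conf 128` writes the file, and the option is applied
the next time the sunrpc module is loaded (in practice, after a reboot). On a
running client, `sysctl -w sunrpc.tcp_max_slot_table_entries=128` changes the
limit right away, but as far as I know only mounts made after the change pick
it up, so remount the NFS filesystems afterwards.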

peter