[CentOS] nfs (or tcp or scheduler) changes between centos 5 and 6?

Wed Apr 29 16:32:26 UTC 2015

On Wed, Apr 29, 2015 at 10:36 AM, Devin Reade <gdr at gno.org> wrote:
> Have you looked at the client-side NFS cache?  Perhaps the C6 cache
> is either disabled, has fewer resources, or is invalidating faster?
> (I don't think that would explain the C5 starvation, though, unless
> it's a secondary effect from retransmits, etc.)

Do you know where the NFS cache settings are specified?  I've looked
at the various nfs mount options.  Anything cache-related appears to
be the same between the two OSes, assuming I didn't miss anything.  We
did experiment with the "noac" mount option, though that had no effect
in our tests.
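
A minimal sketch of one way to double-check that comparison: the
options actually in effect appear in /proc/mounts on each client, so
they can be dumped on both OSes and diffed. The function names below
are made up for illustration, and it assumes the usual /proc/mounts
layout of "device mountpoint fstype options dump pass".

```python
def nfs_mount_options(mounts_text):
    """Map each NFS mount point to the set of options in effect.

    mounts_text is the content of /proc/mounts (assumed layout:
    device, mount point, fs type, comma-separated options, ...).
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mountpoint, fstype, options = fields[:4]
        if fstype in ("nfs", "nfs4"):
            result[mountpoint] = set(options.split(","))
    return result


def diff_options(opts_a, opts_b):
    """Return (only in a, only in b) as two sorted lists."""
    return sorted(opts_a - opts_b), sorted(opts_b - opts_a)


if __name__ == "__main__":
    # Print the effective options for every NFS mount on this client.
    with open("/proc/mounts") as f:
        for mountpoint, opts in sorted(nfs_mount_options(f.read()).items()):
            print(mountpoint, ",".join(sorted(opts)))
```

Running this on a 5 client and a 6 client and comparing the output
would show defaults that differ even when the fstab entries are
identical.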

FWIW, we've run tcpdump on both OSes while performing the same tasks,
and 5 actually appears to have more "chatter".  Going by packet counts
alone, 5 has about 17% more packets than 6 for the same workload.  I
haven't dug very deeply into the tcpdump files: we need a pretty big
workload to trigger a measurable performance discrepancy, so the
resulting pcap files are on the order of 5 GB.
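
As a rough sketch of how a packet-count comparison like this can be
done without loading a 5 GB capture into memory: the code below walks
the record headers of a classic libpcap file and skips over the packet
data. It assumes the classic pcap format (not pcapng); the function
names are hypothetical.

```python
import struct

# Magic numbers of the classic libpcap file format, as they appear on disk.
_MAGICS = {
    b"\xd4\xc3\xb2\xa1": "<",  # little-endian, microsecond timestamps
    b"\xa1\xb2\xc3\xd4": ">",  # big-endian, microsecond timestamps
    b"\x4d\x3c\xb2\xa1": "<",  # little-endian, nanosecond timestamps
    b"\xa1\xb2\x3c\x4d": ">",  # big-endian, nanosecond timestamps
}


def count_packets(stream):
    """Count packet records in a seekable binary stream of pcap data."""
    header = stream.read(24)  # the global file header is 24 bytes
    if len(header) < 24 or header[:4] not in _MAGICS:
        raise ValueError("not a classic pcap file")
    endian = _MAGICS[header[:4]]
    count = 0
    while True:
        record = stream.read(16)  # each per-packet header is 16 bytes
        if len(record) < 16:
            break  # end of file (or a truncated trailing header)
        _sec, _frac, incl_len, _orig_len = struct.unpack(endian + "IIII", record)
        stream.seek(incl_len, 1)  # skip the captured bytes without reading them
        count += 1
    return count


def percent_more(a, b):
    """How many percent larger a is than b."""
    return 100.0 * (a - b) / b
```

With the two captures opened in binary mode, percent_more(count_on_5,
count_on_6) gives the kind of figure quoted above.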

> Regarding the cache, do you have multiple mount points on a client
> that resolve to the same server filesystem?  If so, do they have
> different mount options?  If so, that can result in multiple caches
> instead of a single disk cache.  The client cache can also be bypassed
> if your application is doing direct I/O on the files.  Perhaps there
> is a difference in the application between C5 and C6, including
> whether or not it was just recompiled?  (If so, can you try a C5 version
> on the C6 machines?)

No multiple mount points to the same server.

No application differences.  We're still compiling on 5, regardless of
target platform.

> If you determine that C6 is doing aggressive caching, does this match
> the needs of your application?  That is, do you have the situation
> where the client NFS layer does an aggressive read-ahead that is never
> used by the application?

That was one of our early theories.  On 6, you can adjust this via
/sys/class/bdi/X:Y/read_ahead_kb, where X:Y is the major:minor device
number of the mount (use stat on the mountpoint to determine X and Y).
This file doesn't exist on 5.  But we tried both increasing and
decreasing it from the default (960), and didn't see any changes.
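
A minimal sketch of the lookup described above, assuming the bdi
directory is named after the device number that stat reports for the
mountpoint. The helper names are made up for illustration, and writing
a new value requires root.

```python
import os


def bdi_read_ahead_path(mountpoint):
    """Build the read_ahead_kb path for the filesystem at mountpoint."""
    dev = os.stat(mountpoint).st_dev  # device number, as shown by stat
    return "/sys/class/bdi/%d:%d/read_ahead_kb" % (os.major(dev), os.minor(dev))


def get_read_ahead_kb(mountpoint):
    """Read the current read-ahead size in KB."""
    with open(bdi_read_ahead_path(mountpoint)) as f:
        return int(f.read().strip())


def set_read_ahead_kb(mountpoint, kilobytes):
    """Write a new read-ahead size in KB (needs root)."""
    with open(bdi_read_ahead_path(mountpoint), "w") as f:
        f.write("%d\n" % kilobytes)
```

For example, get_read_ahead_kb("/mnt/data") would return the current
value for a mount at the hypothetical path /mnt/data.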

> Are C5 and C6 using the same NFS protocol version?  How about TCP vs
> UDP?  If UDP is in play, have a look at fragmentation stats under load.

Yup, both are using tcp, protocol version 3.

> Are both using the same authentication method (ie: maybe just
> UID-based)?

Yup, sec=sys.

> And, like always, is DNS sane for all your clients and servers?  Everything
> (including clients) has proper PTR records, consistent with A records,
> et al?  DNS is so fundamental to everything that if it is out of whack
> you can get far-reaching symptoms that don't seem to have anything to do
> with DNS.

I believe so.  I wouldn't bet my life on it.  But there were certainly
no changes to our DNS before, during or since the OS upgrade.
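
One way to take some of the guesswork out of this is a quick
forward/reverse consistency check, sketched below. It goes through the
system resolver (so /etc/hosts and nsswitch settings apply, not DNS
alone), and the function name and its forward/reverse parameters are
hypothetical conveniences so that the lookups can be swapped out.

```python
import socket


def ptr_matches_a(hostname,
                  forward=socket.gethostbyname_ex,
                  reverse=socket.gethostbyaddr):
    """For each address of hostname, check that it maps back to hostname.

    Returns a dict of {ip_address: True or False}.
    """
    _name, _aliases, addresses = forward(hostname)
    wanted = hostname.lower().rstrip(".")
    results = {}
    for ip in addresses:
        try:
            rname, raliases, _addrs = reverse(ip)
        except socket.herror:
            results[ip] = False  # no reverse entry at all
            continue
        names = {n.lower().rstrip(".") for n in [rname] + list(raliases)}
        results[ip] = wanted in names
    return results
```

Calling ptr_matches_a() with the fully qualified name of each client
and server flags any address whose reverse lookup is missing or points
at a different name.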

> You may want to look at NFSometer and see if it can help.

Haven't seen that, will definitely give it a try!

Thanks for your thoughts and suggestions!