On Wed, Apr 29, 2015 at 10:36 AM, Devin Reade <gdr@gno.org> wrote:
Have you looked at the client-side NFS cache? Perhaps the C6 cache is either disabled, has fewer resources, or is invalidating faster? (I don't think that would explain the C5 starvation, though, unless it's a secondary effect from retransmits, etc.)
Do you know where the NFS cache settings are specified? I've looked at the various NFS mount options, and anything cache-related appears to be the same between the two OSes, assuming I didn't miss anything. We did experiment with the "noac" mount option, but that had no effect in our tests.
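One way to double-check the "same options" assumption is to diff the effective options the kernel reports rather than what was typed into fstab. A minimal sketch, assuming you save a copy of /proc/mounts from one client of each OS; the function names are made up for illustration:

```python
def parse_nfs_mounts(mounts_text):
    """Map each NFS mount point to its set of mount options.

    Expects text in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Keep only NFS client mounts; skip everything else (including nfsd).
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        result[fields[1]] = set(fields[3].split(","))
    return result


def diff_options(opts_a, opts_b):
    """Return (only_in_a, only_in_b) as sorted lists of option strings."""
    return sorted(opts_a - opts_b), sorted(opts_b - opts_a)
```

Feeding it the two saved files and calling diff_options on the same mount point from each side lists exactly the options that differ, which is easier to trust than comparing by eye.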
FWIW, we've done a tcpdump on both OSes, performing the same tasks, and it appears that CentOS 5 actually has more "chatter". Just looking at packet counts, 5 has about 17% more packets than 6 for the same workload. I haven't dug too deep into the tcpdump files: we need a pretty big workload to trigger a measurable performance discrepancy, so the resulting pcap files are on the order of 5 GB.
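For captures of that size, a packet count can be had without loading the file into memory. A rough sketch, assuming the files are in the classic libpcap format (not pcapng); count_packets and percent_more are illustrative names, not part of any tool:

```python
import struct

# Magic numbers of the classic libpcap file format, as they appear on disk.
_MAGICS = {
    b"\xd4\xc3\xb2\xa1": "<",  # little-endian, microsecond timestamps
    b"\xa1\xb2\xc3\xd4": ">",  # big-endian, microsecond timestamps
    b"\x4d\x3c\xb2\xa1": "<",  # little-endian, nanosecond timestamps
    b"\xa1\xb2\x3c\x4d": ">",  # big-endian, nanosecond timestamps
}


def count_packets(path):
    """Count packet records in a classic pcap file by walking record headers."""
    with open(path, "rb") as f:
        header = f.read(24)  # the global file header is 24 bytes
        if len(header) < 24 or header[:4] not in _MAGICS:
            raise ValueError("not a classic pcap file: %s" % path)
        endian = _MAGICS[header[:4]]
        count = 0
        while True:
            rec = f.read(16)  # ts_sec, ts_frac, incl_len, orig_len
            if len(rec) < 16:
                break
            incl_len = struct.unpack(endian + "IIII", rec)[2]
            f.seek(incl_len, 1)  # skip the captured bytes of this packet
            count += 1
        return count


def percent_more(a, b):
    """Return how many percent larger a is than b."""
    return 100.0 * (a - b) / b
```

Calling percent_more(count_packets(c5_file), count_packets(c6_file)) gives the kind of figure quoted above; it says nothing about which NFS operations account for the extra packets.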
Regarding the cache, do you have multiple mount points on a client that resolve to the same server filesystem? If so, do they have different mount options? If so, that can result in multiple caches instead of a single disk cache. The client cache can also be bypassed if your application is doing direct I/O on the files. Perhaps there is a difference in the application between C5 and C6, including whether or not it was just recompiled? (If so, can you try a C5 version on the C6 machines?)
No multiple mount points to the same server.
No application differences. We're still compiling on 5, regardless of target platform.
If you determine that C6 is doing aggressive caching, does this match the needs of your application? That is, do you have the situation where the client NFS layer does an aggressive read-ahead that is never used by the application?
That was one of our early theories. On 6, you can adjust this via /sys/class/bdi/X:Y/read_ahead_kb, where X:Y is the major:minor device number of the mount (use stat on the mountpoint to determine X and Y). This file doesn't exist on 5. We tried both increasing and decreasing it from the default (960), and didn't see any change.
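The stat step can be scripted. A small sketch, assuming a kernel that exposes /sys/class/bdi (as on 6); the helper names are hypothetical:

```python
import os


def bdi_path_for_dev(st_dev):
    """Build the read_ahead_kb path for a device number as returned by stat."""
    return "/sys/class/bdi/%d:%d/read_ahead_kb" % (os.major(st_dev), os.minor(st_dev))


def bdi_readahead_path(mountpoint):
    """Return the read_ahead_kb path for the filesystem mounted at mountpoint."""
    return bdi_path_for_dev(os.stat(mountpoint).st_dev)


def read_readahead_kb(mountpoint):
    """Read the current read-ahead setting (in KB) for the given mountpoint."""
    with open(bdi_readahead_path(mountpoint)) as f:
        return int(f.read().strip())
```

read_readahead_kb raises an error where the file is absent, which is what you would expect on 5.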
Are C5 and C6 using the same NFS protocol version? How about TCP vs UDP? If UDP is in play, have a look at fragmentation stats under load.
Yup, both are using NFS version 3 over TCP.
Are both using the same authentication method (i.e., maybe just UID-based)?
Yup, sec=sys.
And, like always, is DNS sane for all your clients and servers? Everything (including clients) has proper PTR records, consistent with A records, et al? DNS is so fundamental to everything that if it is out of whack you can get far-reaching symptoms that don't seem to have anything to do with DNS.
I believe so. I wouldn't bet my life on it. But there were certainly no changes to our DNS before, during or since the OS upgrade.
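The PTR/A consistency part can be checked mechanically instead of taken on faith. A minimal sketch, assuming IPv4 and the standard library's resolver calls, which go through the system resolver (so /etc/hosts and nsswitch settings apply, not only DNS); check_fcrdns is an illustrative name:

```python
import socket


def check_fcrdns(ip, reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """Check that ip resolves to a name that resolves back to ip.

    Returns (ok, detail). The reverse and forward arguments default to the
    system resolver and can be replaced for testing.
    """
    try:
        name = reverse(ip)[0]
    except (socket.herror, socket.gaierror) as exc:
        return False, "no reverse entry for %s: %s" % (ip, exc)
    try:
        addrs = forward(name)[2]
    except (socket.herror, socket.gaierror) as exc:
        return False, "%s does not resolve forward: %s" % (name, exc)
    if ip not in addrs:
        return False, "%s -> %s -> %s (mismatch)" % (ip, name, addrs)
    return True, "%s <-> %s" % (ip, name)
```

Running it over the addresses of the NFS servers and clients, from both a 5 and a 6 machine, would show whether either side sees inconsistent records.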
You may want to look at NFSometer and see if it can help.
Haven't seen that, will definitely give it a try!
Thanks for your thoughts and suggestions!