On Wed, Apr 29, 2015 at 10:36 AM, Devin Reade <gdr at gno.org> wrote:
> Have you looked at the client-side NFS cache? Perhaps the C6 cache
> is either disabled, has fewer resources, or is invalidating faster?
> (I don't think that would explain the C5 starvation, though, unless
> it's a secondary effect from retransmits, etc.)

Do you know where the NFS cache settings are specified? I've looked at
the various nfs mount options. Anything cache-related appears to be the
same between the two OSes, assuming I didn't miss anything. We did
experiment with the "noac" mount option, though that had no effect in
our tests.

FWIW, we've done a tcpdump on both OSes, performing the same tasks, and
it appears that 5 actually has more "chatter". Just looking at packet
counts, 5 has about 17% more packets than 6, for the same workload. I
haven't dug too deep into the tcpdump files, since we need a pretty big
workload to trigger the measurable performance discrepancy. So the
resulting pcap files are on the order of 5 GB.

> Regarding the cache, do you have multiple mount points on a client
> that resolve to the same server filesystem? If so, do they have
> different mount options? If so, that can result in multiple caches
> instead of a single disk cache. The client cache can also be bypassed
> if your application is doing direct I/O on the files. Perhaps there
> is a difference in the application between C5 and C6, including
> whether or not it was just recompiled? (If so, can you try a C5 version
> on the C6 machines?)

No multiple mount points to the same server.

No application differences. We're still compiling on 5, regardless of
target platform.

> If you determine that C6 is doing aggressive caching, does this match
> the needs of your application? That is, do you have the situation
> where the client NFS layer does an aggressive read-ahead that is never
> used by the application?

That was one of our early theories.
On 6, you can adjust this via /sys/class/bdi/X:Y/read_ahead_kb (use
stat on the mountpoint to determine X and Y). This file doesn't exist
on 5. But we tried increasing and decreasing it from the default (960),
and didn't see any changes.

> Are C5 and C6 using the same NFS protocol version? How about TCP vs
> UDP? If UDP is in play, have a look at fragmentation stats under load.

Yup, both are using tcp, protocol version 3.

> Are both using the same authentication method (ie: maybe just
> UID-based)?

Yup, sec=sys.

> And, like always, is DNS sane for all your clients and servers? Everything
> (including clients) has proper PTR records, consistent with A records,
> et al? DNS is so fundamental to everything that if it is out of whack
> you can get far-reaching symptoms that don't seem to have anything to do
> with DNS.

I believe so. I wouldn't bet my life on it. But there were certainly
no changes to our DNS before, during or since the OS upgrade.

> You may want to look at NFSometer and see if it can help.

Haven't seen that, will definitely give it a try!

Thanks for your thoughts and suggestions!
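
P.S. For anyone who wants to check the read-ahead setting mentioned
above, here is a rough Python sketch of the "stat the mountpoint to get
X and Y" step. It assumes X:Y is the major:minor device number of the
mount, which is what stat reports for it. The function names are made
up for illustration, and this is not the exact procedure we ran, just
one way to do the lookup.

```python
import os


def bdi_id(st_dev: int) -> str:
    """Format a device number as the MAJOR:MINOR name used under /sys/class/bdi."""
    return f"{os.major(st_dev)}:{os.minor(st_dev)}"


def read_ahead_path(mountpoint: str) -> str:
    """Build the read_ahead_kb path for the filesystem mounted at mountpoint."""
    st_dev = os.stat(mountpoint).st_dev  # same device number that stat(1) shows
    return f"/sys/class/bdi/{bdi_id(st_dev)}/read_ahead_kb"


def read_ahead_kb(mountpoint: str):
    """Return the current read-ahead in KB, or None if the sysfs file is absent."""
    path = read_ahead_path(mountpoint)
    if not os.path.exists(path):
        # Hypothetical handling: the file is missing on 5, so report None.
        return None
    with open(path) as f:
        return int(f.read().strip())
```

Calling read_ahead_kb() on an NFS mountpoint should print the current
value on 6 and None on 5. Changing the value means writing a new number
to the same file as root.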