Message: 4
Date: Wed, 29 Apr 2015 08:35:29 -0500
From: Matt Garman matthew.garman@gmail.com
To: CentOS mailing list centos@centos.org
Subject: [CentOS] nfs (or tcp or scheduler) changes between centos 5 and 6?
Message-ID: CAJvUf-CyTg8ZiGq3OXRLKw7s1K2dGx1gqo_2XwOAXXQty=RHZQ@mail.gmail.com
Content-Type: text/plain; charset=UTF-8
We have a "compute cluster" of about 100 machines that do a read-only NFS mount to a big NAS filer (a NetApp FAS6280). The jobs running on these boxes are analysis/simulation jobs that constantly read data off the NAS.
We recently upgraded all these machines from CentOS 5.7 to CentOS 6.5. We did a "piecemeal" upgrade, usually upgrading five or so machines at a time, every few days. We noticed improved performance on the CentOS 6 boxes. But as the number of CentOS 6 boxes increased, we actually saw performance on the CentOS 5 boxes decrease. By the time we had only a few CentOS 5 boxes left, they were performing so badly as to be effectively worthless.
What we observed in parallel to this upgrade process was that the read latency on our NetApp device skyrocketed. This in turn caused all compute jobs to actually run slower, as it seemed to move the bottleneck from the client servers' OS to the NetApp. This is somewhat counter-intuitive: CentOS 6 performs faster, but actually results in net performance loss because it creates a bottleneck on our centralized storage.
All indications are that CentOS 6 seems to be much more "aggressive" in how it does NFS reads. And likewise, CentOS 5 was very "polite", to the point that it basically got starved out by the introduction of the 6.5 boxes.
What I'm looking for is a "deep dive" list of changes to the NFS implementation between CentOS 5 and CentOS 6. Or maybe this is due to a change in the TCP stack? Or maybe the scheduler? We've tried a lot of sysctl tcp tunings, various nfs mount options, anything that's obviously different between 5 and 6... But so far we've been unable to find the "smoking gun" that causes the obvious behavior change between the two OS versions.
Just hoping that maybe someone else out there has seen something like this, or can point me to some detailed documentation that might clue me in on what to look for next.
Thanks!
You may want to try reducing sunrpc.tcp_max_slot_table_entries. In CentOS 5 the number of slots is fixed:

    sunrpc.tcp_slot_table_entries = 16

In CentOS 6, this number is dynamic, with a maximum of sunrpc.tcp_max_slot_table_entries, which by default has a value of 65536.

We put that in /etc/sysconfig/modprobe.d/sunrpc.conf:

    options sunrpc tcp_max_slot_table_entries=128
You can't put this in /etc/sysctl.conf because the sunrpc kernel module is loaded before sysctl -p is done.
peter
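To put the slot numbers above in perspective, here is a rough back-of-the-envelope sketch in Python. It is not from the thread: it assumes one TCP transport per client and that every client fills its slot table completely, so it gives a theoretical upper bound rather than a measurement. The function name max_in_flight is made up for illustration.

```python
# Upper bound on NFS RPC requests in flight at the filer.
# Assumptions (not from the thread): one TCP transport per client,
# and every client keeps its slot table completely full.

def max_in_flight(clients: int, slots_per_client: int) -> int:
    """Worst-case number of concurrent RPC requests the server can see."""
    return clients * slots_per_client

CLIENTS = 100  # "about 100 machines" in the original post

for label, slots in [
    ("CentOS 5 fixed table", 16),
    ("modprobe override", 128),
    ("CentOS 6 default maximum", 65536),
]:
    total = max_in_flight(CLIENTS, slots)
    print(f"{label:26s} {slots:6d} slots -> {total:8d} requests")
```

Even as a crude bound, the gap between the fixed table and the dynamic maximum is several orders of magnitude, which fits the observation that the CentOS 6 clients crowd out the CentOS 5 ones.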
On Thu, Apr 30, 2015 at 02:24:27PM +0200, Peter van Hooft wrote:
> You may want to try reducing sunrpc.tcp_max_slot_table_entries. In CentOS 5 the number of slots is fixed:
>
>     sunrpc.tcp_slot_table_entries = 16
>
> In CentOS 6, this number is dynamic, with a maximum of sunrpc.tcp_max_slot_table_entries, which by default has a value of 65536.
>
> We put that in /etc/sysconfig/modprobe.d/sunrpc.conf:
>
>     options sunrpc tcp_max_slot_table_entries=128
Make that /etc/modprobe.d/sunrpc.conf, of course.
peter
On Thu, Apr 30, 2015 at 7:31 AM, Peter van Hooft hooft@natlab.research.philips.com wrote:
> You may want to try reducing sunrpc.tcp_max_slot_table_entries. In CentOS 5 the number of slots is fixed:
>
>     sunrpc.tcp_slot_table_entries = 16
>
> In CentOS 6, this number is dynamic, with a maximum of sunrpc.tcp_max_slot_table_entries, which by default has a value of 65536.
>
> We put that in /etc/sysconfig/modprobe.d/sunrpc.conf:
>
>     options sunrpc tcp_max_slot_table_entries=128
>
> Make that /etc/modprobe.d/sunrpc.conf, of course.
This appears to be the "smoking gun" we were looking for, or at least a significant piece of the puzzle.
We actually tried this early on in our investigation, but were changing it via sysctl, which apparently has no effect. Your email convinced me to try again, but this time configuring the parameters via modprobe.
In our case, 128 was still too high, so we dropped it all the way down to 16. Our understanding is that 16 is the CentOS 5 value. What we're seeing now is that our apps are starved for data, so it looks like we might have to nudge it up. In other words, either there's something else at play that we're not aware of, or the meaning of that parameter differs between CentOS 5 and CentOS 6.
Anyway, thank you very much for the suggestion. You turned on the light at the end of the tunnel!
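Since the sysctl route apparently had no effect, it may help to read the values back after a reboot and confirm what the kernel is actually using. The following is a minimal Python sketch, assuming the sunrpc module is loaded and exposes its settings under /proc/sys/sunrpc (the usual mapping for sunrpc.* sysctl names). The helper name read_sunrpc_slots is hypothetical.

```python
from pathlib import Path

# sysctl names from the thread, as they appear under /proc/sys/sunrpc
SUNRPC_KEYS = ("tcp_slot_table_entries", "tcp_max_slot_table_entries")

def read_sunrpc_slots(base="/proc/sys/sunrpc"):
    """Return {key: int or None}; None means the value could not be read
    (for example because the sunrpc module is not loaded)."""
    values = {}
    for key in SUNRPC_KEYS:
        path = Path(base) / key
        try:
            values[key] = int(path.read_text().strip())
        except (OSError, ValueError):
            values[key] = None
    return values

if __name__ == "__main__":
    for key, value in read_sunrpc_slots().items():
        shown = value if value is not None else "unavailable"
        print(f"sunrpc.{key} = {shown}")
```

If the modprobe option was picked up, the reported tcp_max_slot_table_entries should match the number in /etc/modprobe.d/sunrpc.conf.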