Hi all,
This might not be CentOS related at all. Sorry about that.
I have lots of C6 & C7 machines in use, and all of them have the default swappiness of 60. The problem is that a lot of those machines swap even though there is no memory pressure. I'm now thinking about lowering swappiness to 1, but I'd still like to find out why this happens. The only thing all those machines have in common is that nightly backups are done with Bacula. I once came across issues where the fs-cache caused Linux to start paging out. Any hints, explanations and suggestions would be much appreciated.
Cheers, Shorty
On 04.06.2015 22:18, Markus "Shorty" Uckelmann wrote:
Hi all,
This might not be CentOS related at all. Sorry about that.
I have lots of C6 & C7 machines in use, and all of them have the default swappiness of 60. The problem is that a lot of those machines swap even though there is no memory pressure. I'm now thinking about lowering swappiness to 1, but I'd still like to find out why this happens. The only thing all those machines have in common is that nightly backups are done with Bacula. I once came across issues where the fs-cache caused Linux to start paging out. Any hints, explanations and suggestions would be much appreciated.
If I had to venture a guess, I'd say there are memory pages that are never touched by any process, and the kernel has decided that it's more effective to swap those pages out to disk and use the freed RAM for the page cache. Swap usage isn't inherently evil; what you really want to check is the si/so columns in the output of the "vmstat" command. If the system is using swap space but these columns are mostly 0, that means memory was swapped out in the past but there is no actual swap activity happening right now, and there should be no performance impact. If, however, these numbers are consistently larger than 0, the system is under acute memory pressure and has to constantly move pages between RAM and disk, which will have a large negative performance impact. That is the moment when swap usage becomes bad.
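For example (a minimal check; the first line of output shows averages since boot, so watch the later samples):

    # report every 5 seconds, 10 samples; si/so show the amount of memory
    # swapped in from disk and out to disk per second
    vmstat 5 10

If si/so stay at 0 while the swpd column is non-zero, the swapped-out pages are just sitting there and nothing is actively thrashing.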
Regards, Dennis
Am 05.06.2015 um 00:23 schrieb Dennis Jacobfeuerborn:
If I had to venture a guess, I'd say there are memory pages that are never touched by any process, and the kernel has decided that it's more effective to swap those pages out to disk and use the freed RAM for the page cache.
That's my guess too.
[...]
impact. If, however, these numbers are consistently larger than 0, the system is under acute memory pressure and has to constantly move pages between RAM and disk, which will have a large negative performance impact. That is the moment when swap usage becomes bad.
Thankfully I don't have constant paging on all systems, and if there is paging activity it's very low. As far as I can tell it's, as you already suggested, just that some (probably unused) parts are swapped out. But some of those parts belong to salt-minion, php-fpm or mysqld, all services which are important to us and which suffer badly from being swapped out. I already ran some tests with swappiness 10, which made it mildly better, but there was still swap usage. So I tend towards setting swappiness to 1, which I don't like to do, since those default values aren't there for nothing.
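For what it's worth, the change itself would just be a sysctl (this assumes the stock /etc/sysctl.conf mechanism; on C7 a drop-in under /etc/sysctl.d/ works as well):

    sysctl -w vm.swappiness=1                     # takes effect immediately
    echo 'vm.swappiness = 1' >> /etc/sysctl.conf  # persists across reboots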
Is it possible that this happens because the servers are VMs on an ESX server? How could that affect this? How can I further debug this problem and find out what the culprit is? I will go back to our metrics and see if I can find any patterns/correlations.
Cheers, Shorty
On Fri, Jun 05, 2015 at 12:29:04PM +0200, Markus "Shorty" Uckelmann wrote:
How can I further debug this problem and find out what the culprit is?
It's working as designed.
Linux does not treat various kinds of memory pages differently. If you want a daemon to be fully in core, call mlockall(). Here's one way to do that without changing the daemon's source:
http://superuser.com/questions/196725/how-to-mlock-all-pages-of-a-process-tr...
(I've always only done this with my own code explicitly calling mlock)
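Presumably the trick from that link boils down to something like this (a sketch, untested here; the process may need a suitable memlock ulimit or CAP_IPC_LOCK for mlockall() to succeed):

    # build a tiny constructor library that locks all pages at load time
    cat > mlock_shim.c <<'EOF'
    #include <sys/mman.h>
    __attribute__((constructor))
    static void lock_all(void)
    {
        /* lock current and future pages into RAM */
        mlockall(MCL_CURRENT | MCL_FUTURE);
    }
    EOF
    gcc -shared -fPIC -o mlock_shim.so mlock_shim.c
    LD_PRELOAD=$PWD/mlock_shim.so your-daemon   # "your-daemon" is a placeholder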
If you don't explicitly lock things into memory, file I/O can and will cause idle pages to get pushed out. It happens less often if you manipulate swappiness.
-- greg
On 05.06.2015 at 17:40, Greg Lindahl wrote:
On Fri, Jun 05, 2015 at 12:29:04PM +0200, Markus "Shorty" Uckelmann wrote:
How can I further debug this problem and find out what the culprit is?
It's working as designed.
Sadly. It's just the first time I've seen this behaviour to this extent/on so many servers. So you could say I'm kind of a newbie when it comes to swapping ;)
If you don't explicitly lock things into memory, file I/O can and will cause idle pages to get pushed out. It happens less often if you manipulate swappiness.
So, is a swappiness value of 60 not recommended for servers? I worked with hundreds of servers (swappiness 60) on a social platform, and swapping very rarely happened, and then only on databases (which had swappiness set to 0). The only two differences (that I can see) from my current servers are that I used Debian and there was no "extra" I/O from backups.
I might be overstating the swapping thing. That's what I'm trying to find out.
Cheers, Shorty
On Fri, Jun 05, 2015 at 09:21:43PM +0200, Markus "Shorty" Uckelmann wrote:
If you don't explicitly lock things into memory, file I/O can and will cause idle pages to get pushed out. It happens less often if you manipulate swappiness.
So, is a swappiness value of 60 not recommended for servers?
It's probably a fine default. For my most recent purposes, a web-scale search engine, I locked a ton of daemons into memory with mlockall (on latency-optimized clusters) and set swappiness to 0 on all clusters (including batch-optimized clusters).
This last bit was because I don't expect my systems to ever swap... I only have a small amount of swap configured, to reduce the mayhem caused by OOMs and to give my home-grown OOM daemon (which is locked into memory, of course) time to open fire on my choice of offending process.
-- greg
On Fri, Jun 05, 2015 at 08:40:27AM -0700, Greg Lindahl wrote:
Linux does not treat various kinds of memory pages differently. If you want a daemon to be fully in core, call mlockall(). Here's one way to do that without changing the daemon's source:
Another way to do this is to put the services into a named cgroup and set memory.swappiness=1 for that cgroup.
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/htm...
It's not necessarily as effective as mlock(), but you might want to set some of the other cgroup features as well.
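Roughly like this (a cgroup-v1 sketch; the path assumes the memory controller is mounted under /sys/fs/cgroup/memory as on C7, on C6 it is often /cgroup/memory, and the "noswap" name is made up):

    mkdir /sys/fs/cgroup/memory/noswap
    echo 1 > /sys/fs/cgroup/memory/noswap/memory.swappiness
    echo $PID > /sys/fs/cgroup/memory/noswap/tasks   # move a running process in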
On 06/05/2015 03:29 AM, Markus "Shorty" Uckelmann wrote:
some (probably unused) parts are swapped out. But some of those parts belong to salt-minion, php-fpm or mysqld, all services which are important to us and which suffer badly from being swapped out.
Those two things can't really both be true. If the pages swapped out are unused, then the application won't suffer as a result.
On Fri, Jun 05, 2015 at 09:33:11AM -0700, Gordon Messmer wrote:
On 06/05/2015 03:29 AM, Markus "Shorty" Uckelmann wrote:
some (probably unused) parts are swapped out. But some of those parts belong to salt-minion, php-fpm or mysqld, all services which are important to us and which suffer badly from being swapped out.
Those two things can't really both be true. If the pages swapped out are unused, then the application won't suffer as a result.
No.
Let's say the application only uses the page once per hour. If there is also I/O going on, it's easy to see that the kernel could decide to page it out after 50 minutes, leaving the application to page it back in 10 minutes later.
-- greg
On 05.06.2015 19:47, Greg Lindahl wrote:
On Fri, Jun 05, 2015 at 09:33:11AM -0700, Gordon Messmer wrote:
On 06/05/2015 03:29 AM, Markus "Shorty" Uckelmann wrote:
some (probably unused) parts are swapped out. But some of those parts belong to salt-minion, php-fpm or mysqld, all services which are important to us and which suffer badly from being swapped out.
Those two things can't really both be true. If the pages swapped out are unused, then the application won't suffer as a result.
No.
Let's say the application only uses the page once per hour. If there is also I/O going on, it's easy to see that the kernel could decide to page it out after 50 minutes, leaving the application to page it back in 10 minutes later.
That's true, but it also means that if you lock that page so it cannot be swapped out, it is no longer available for the page cache, so you incur the I/O hit either way, and it's probably going to be worse because the system no longer has the option to optimize its memory management. I wouldn't worry about it until there is actual permanent swap activity going on; then you have to decide whether to add more RAM to the system or to find a way to tell e.g. Bacula to use direct I/O and not pollute the page cache. For applications that don't let you specify this, a wrapper could be used, such as this one: http://arighi.blogspot.de/2007/04/how-to-bypass-buffer-cache-in-linux.html
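To illustrate the direct I/O part (the path is hypothetical; direct I/O simply bypasses the page cache for that transfer):

    # read a big file without polluting the page cache
    dd if=/backup/full.tar of=/dev/null iflag=direct bs=1M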
Regards, Dennis
On 06.06.2015 04:48, Dennis Jacobfeuerborn wrote:
On 05.06.2015 19:47, Greg Lindahl wrote:
On Fri, Jun 05, 2015 at 09:33:11AM -0700, Gordon Messmer wrote:
On 06/05/2015 03:29 AM, Markus "Shorty" Uckelmann wrote:
some (probably unused) parts are swapped out. But some of those parts belong to salt-minion, php-fpm or mysqld, all services which are important to us and which suffer badly from being swapped out.
Those two things can't really both be true. If the pages swapped out are unused, then the application won't suffer as a result.
No.
Let's say the application only uses the page once per hour. If there is also I/O going on, it's easy to see that the kernel could decide to page it out after 50 minutes, leaving the application to page it back in 10 minutes later.
That's true, but it also means that if you lock that page so it cannot be swapped out, it is no longer available for the page cache, so you incur the I/O hit either way, and it's probably going to be worse because the system no longer has the option to optimize its memory management. I wouldn't worry about it until there is actual permanent swap activity going on; then you have to decide whether to add more RAM to the system or to find a way to tell e.g. Bacula to use direct I/O and not pollute the page cache. For applications that don't let you specify this, a wrapper could be used, such as this one: http://arighi.blogspot.de/2007/04/how-to-bypass-buffer-cache-in-linux.html
Actually I found better links:
https://code.google.com/p/pagecache-mangagement/
http://lwn.net/Articles/224653/
"It is to address the "waah, backups fill my memory with pagecache" and the "waah, updatedb swapped everything out" and the "waah, copying a DVD gobbled all my memory" problems."
Regards, Dennis
On 06.06.2015 at 05:06, Dennis Jacobfeuerborn wrote:
That's true, but it also means that if you lock that page so it cannot be swapped out, it is no longer available for the page cache, so you incur the I/O hit either way, and it's probably going to be worse because the system no longer has the option to optimize its memory management. I wouldn't worry about it until there is actual permanent swap activity going on; then you have to decide whether to add more RAM to the system or to find a way to tell e.g. Bacula to use direct I/O and not pollute the page cache. For applications that don't let you specify this, a wrapper could be used, such as this one: http://arighi.blogspot.de/2007/04/how-to-bypass-buffer-cache-in-linux.html
Actually I found better links:
https://code.google.com/p/pagecache-mangagement/
http://lwn.net/Articles/224653/
"It is to address the "waah, backups fill my memory with pagecache" and the "waah, updatedb swapped everything out" and the "waah, copying a DVD gobbled all my memory" problems."
Dennis, thanks for the links. I hope I can avoid having to use these tools. But it's good to have them in my "arsenal" ;)
Cheers, Shorty
On 05.06.2015 at 18:33, Gordon Messmer wrote:
On 06/05/2015 03:29 AM, Markus "Shorty" Uckelmann wrote:
some (probably unused) parts are swapped out. But some of those parts belong to salt-minion, php-fpm or mysqld, all services which are important to us and which suffer badly from being swapped out.
Those two things can't really both be true. If the pages swapped out are unused, then the application won't suffer as a result.
Why not? If you have an application which sees action only every 12 to 24 hours, I think this can happen. Our salt-minion would be a candidate for this. Although we constantly check whether it's alive, we only do something "heavy" like a deployment once or twice a day. And very often we have to run those deployments twice, because the first time we get a lot of timeouts. Sure, it might be the software itself. But I think it's possible that it's because of swapped-out pages.
I can't be sure about this. That's why I want to find out what is happening and why. But first I need to find the right tools to do this ;)
Cheers, Shorty
On 06/05/2015 12:09 PM, Markus "Shorty" Uckelmann wrote:
On 05.06.2015 at 18:33, Gordon Messmer wrote:
On 06/05/2015 03:29 AM, Markus "Shorty" Uckelmann wrote:
some (probably unused) parts are swapped out. But some of those parts belong to salt-minion, php-fpm or mysqld, all services which are important to us and which suffer badly from being swapped out.
Those two things can't really both be true. If the pages swapped out are unused, then the application won't suffer as a result.
Why not? If you have an application which sees action only every 12 to 24 hours, I think this can happen.
Well, that's not "unused," then.
To measure the swap use of your processes, install "smem". It will show you the amount of swap that each process is using.
For more specific information, make a copy of /proc/<pid>/smaps.
To quantify your problem, let Bacula run, then save the output of smem, or /proc/<pid>/smaps for each of your critical services, or both. Then access each of the services and quantify the latency relative to the normal latency. Finally, after collecting latency information, get the output of smem and/or /proc/<pid>/smaps again. You can compare swap use before and after accessing the service to see how much was swapped out beforehand (presumably because of the backup), and how much had to be brought back in for your test query.
I'd suggest collecting that information at the normal swappiness setting and at 0.
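For example, one way to capture the numbers (<pid> is a placeholder):

    smem -s swap -r | head -20           # top processes by swap use
    cp /proc/<pid>/smaps smaps.before    # keep the per-mapping detail
    # ...exercise the service and measure latency...
    awk '/^Swap:/ {s += $2} END {print s " kB in swap"}' /proc/<pid>/smaps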
If the kernel is swapping out processes in favor of filesystem cache when swappiness is 0, I believe that would be a bug, and should be reported to the kernel developers.
Our salt-minion would be a candidate for this. Although we constantly check whether it's alive, we only do something "heavy" like a deployment once or twice a day. And very often we have to run those deployments twice, because the first time we get a lot of timeouts. Sure, it might be the software itself. But I think it's possible that it's because of swapped-out pages.
"Timeouts" is pretty vague. Very generally, it's possible that you have a timeout configured somewhere that is failing on the first run because the filesystem cache now contains content from your backup, and your process only completes in time when the files needed for the deployment are in the filesystem cache. That's a stretch as far as explanations go, but if that is the case, then swappiness isn't going to fix the problem. You need to fix your timeout so that it allows enough time for the deployment to finish when the server is cold booted (using no cache), or prime your caches before doing deployments.
On 05.06.2015 at 23:32, Gordon Messmer wrote:
Those two things can't really both be true. If the pages swapped out are unused, then the application won't suffer as a result.
Why not? If you have an application which sees action only every 12 to 24 hours, I think this can happen.
Well, that's not "unused," then.
In a manner of speaking it's not "unused". But in the "rarely used" case it is possible that parts of the program which are needed end up in swap.
To measure the swap use of your processes, install "smem". It will show you the amount of swap that each process is using.
Brilliant! Until now I was using the script under [1].
For more specific information, make a copy of /proc/<pid>/smaps.
To quantify your problem, let Bacula run, then save the output of smem, or /proc/<pid>/smaps for each of your critical services, or both. Then access each of the services and quantify the latency relative to the normal latency. Finally, after collecting latency information, get the output of smem and/or /proc/<pid>/smaps again. You can compare swap use before and after accessing the service to see how much was swapped out beforehand (presumably because of the backup), and how much had to be brought back in for your test query.
I'd suggest collecting that information at the normal swappiness setting and at 0.
Thank you, this will get me further.
If the kernel is swapping out processes in favor of filesystem cache when swappiness is 0, I believe that would be a bug, and should be reported to the kernel developers.
Because of what I read in [2] I'm planning to use 1 rather than 0. But please correct me if I'm wrong.
"Timeouts" is pretty vague. Very generally, it's possible that you have a timeout configured somewhere that is failing on the first run because the filesystem cache now contains content from your backup, and your process only completes in time when the files needed for the deployment are in the filesystem cache. That's a stretch as far as explanations go, but if that is the case, then swappiness isn't going to fix the problem. You need to fix your timeout so that it allows enough time for the deployment to finish when the server is cold booted (using no cache), or prime your caches before doing deployments.
By "timeouts" I meant that the salt master tries to contact the salt-minion to send it the payload, and that is where the timeout happens: the minion doesn't get back to the master within the configured timeout, which is currently set to 20 seconds. When we start a job for the first time in several hours we get a lot of timeouts; a second run mostly helps. I think it's possible that the parts of the minion process needed to handle the payload are swapped out. On the first run it takes too long to get those pages back into RAM, but eventually all pages are paged in, so the second run works. This is just an assumption, though. On Monday I'll try to find out whether I'm right or wrong.
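One quick experiment would be to raise the timeout for a single run from the salt CLI (the target and job below are made up; -t sets the per-invocation timeout):

    salt -t 60 'web*' state.highstate

If the first run after a long pause succeeds with a generous timeout, that would at least support the paging theory.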
BTW: Is there a way to find out which parts of a program are swapped out without using monsters like Valgrind? Damn, sounds like an interesting start of the week...
[1] http://northernmost.org/blog/find-out-what-is-using-your-swap/
[2] http://www.mysqlperformanceblog.com/2014/04/28/oom-relation-vm-swappiness0-n...
Cheers to all for the feedback and help, Shorty
On 06/06/2015 02:23 AM, Markus "Shorty" Uckelmann wrote:
When we start a job for the first time in several hours we get a lot of timeouts; a second run mostly helps.
In addition to capturing swap use before and after a run that times out, I'd cold boot all of the systems involved and see if that job times out as well. If that times out, it's likely that you need to prime your caches before a job, or break the job into smaller bits, or extend your timeout.
BTW: Is there a way to find out which parts of a program are swapped out without using monsters like Valgrind? Damn, sounds like an interesting start of the week...
The smaps file has that information, in a general sense. If you want to know what variables hold references to the areas that are swapped out, you'll need a debugger.
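Something like the following pulls the mappings that have swapped pages out of smaps (a sketch; <pid> is a placeholder, and you'll likely need root):

    awk '/^[0-9a-f]+-[0-9a-f]+ /{map = $0}
         /^Swap:/ && $2 > 0 {print $2 " kB\t" map}' /proc/<pid>/smaps

The mapping line shows the address range and the backing file, which tells you roughly which part of the program is out on disk.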
On Thu, Jun 4, 2015 at 4:18 PM, Markus "Shorty" Uckelmann shorty@koeln.de wrote:
I have lots of C6 & C7 machines in use, and all of them have the default swappiness of 60. The problem is that a lot of those machines swap even though there is no memory pressure. I'm now thinking about lowering swappiness to 1, but I'd still like to find out why this happens.
Thanks for this thread. I'm actually looking at the same settings for a different reason. Most of our environment is VMware-based, and one major difference between the Linux and Windows clients is how they use "free" memory. Linux grabs it for cache ("Free memory is wasted memory.") but Windows doesn't appear to touch it at all. This means the VMware hypervisor can over-commit memory.
On 04.06.2015 at 22:18, Markus "Shorty" Uckelmann wrote:
Hi all,
Thanks for all your help!
I just found a few additional things one can or should do when investigating swap-related issues:
* dmesg - always do that!
* Look for a RAM-disk. These things are kernel memory, so they don't show up in "smem" et al.
* Take a look at some of swap's contents: strings -f /dev/sda3 | shuf -n 100000 | less
Cheers, Shorty