I have a CentOS 7 server that is running out of memory and I can't figure out why.
Running "free -h" gives me this: total used free shared buff/cache available Mem: 3.4G 2.4G 123M 5.9M 928M 626M Swap: 1.9G 294M 1.6G
The problem is that I can't find 2.4G of usage. If I look at resident memory usage using "top", the top 5 processes are using a total of 390M. The next highest process is using 8M. For simplicity, if I assume the other 168 processes are all using 8M (which is WAY too high), that still only gives a total of 1.7G. The tmpfs filesystems are only using 18M, so that shouldn't be an issue.
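For a rough cross-check, the total resident usage can be summed directly instead of estimated from the top entries (note this counts shared pages once per process, so it overstates a bit):

ps axo rss= | awk '{ sum += $1 } END { printf "%.1f MB total RSS\n", sum/1024 }'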
Yesterday, the available memory was down around 300M when I checked it. After checking some things and stopping all of the major processes, available memory was still low. I gave up and rebooted the machine, which brought available memory back up to 2.8G with everything running.
How can I track what is using the memory when the usage doesn't show up in top?
On Fri, Jul 27, 2018, 10:10 AM Bowie Bailey Bowie_Bailey@buc.com wrote:
I have a CentOS 7 server that is running out of memory and I can't figure out why.
<snip> The problem is that I can't find 2.4G of usage. If I look at resident memory usage using "top", the top 5 processes are using a total of 390M.
On a lark, what kind of file systems is the system using and how long had it been up before you rebooted?
On 7/27/2018 11:14 AM, Jon Pruente wrote:
On Fri, Jul 27, 2018, 10:10 AM Bowie Bailey Bowie_Bailey@buc.com wrote:
I have a CentOS 7 server that is running out of memory and I can't figure out why.
<snip> The problem is that I can't find 2.4G of usage. If I look at resident memory usage using "top", the top 5 processes are using a total of 390M.
On a lark, what kind of file systems is the system using and how long had it been up before you rebooted?
The filesystems are all XFS. I don't know for sure how long it had been up previously, I'd guess at least 2 weeks. Current uptime is about 25 hours and the system has already started getting into swap.
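To see which processes the swap usage belongs to, something like this (VmSwap is the per-process figure from /proc/<pid>/status) should do it:

grep VmSwap /proc/[0-9]*/status 2>/dev/null | sort -k2 -n | tail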
On Fri, Jul 27, 2018 at 10:41 AM, Bowie Bailey Bowie_Bailey@buc.com wrote:
On a lark, what kind of file systems is the system using and how long had it been up before you rebooted?
The filesystems are all XFS. I don't know for sure how long it had been up previously, I'd guess at least 2 weeks. Current uptime is about 25 hours and the system has already started getting into swap.
I've had multiple systems (and VMs) with XFS filesystems that had trouble on the 693 series of kernels. Eventually the kernel xfs driver deadlocks and blocks writes, which then pile up in memory waiting for the block to clear. Eventually you run out of RAM and the OOM killer kicks in. The only solution I had at the time was to revert to booting a 514 series kernel or converting to EXT4, depending on the needs of the particular server. Everything I've converted to EXT4 has been rock stable since, and the very few I had to run a 514 kernel on have been stable, just not ideal. It may be fixed on the newer 8xx series kernels, but I haven't dug into them on those systems yet.
If it happens again, then look for processes in the D state and see if logging is continuing or if it just cuts off (which marks when the block started).
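Something along these lines will list anything stuck in uninterruptible sleep and roughly where in the kernel it is blocked:

ps axo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'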
On Fri, Jul 27, 2018 at 12:07 PM, Jon Pruente jpruente@riskanalytics.com wrote:
to revert to booting a 514 series kernel or converting to EXT4, depending on the needs of the particular server. Everything I've converted to EXT4 has been rock
Scratch that, I just looked and it was reverting to a 327 kernel that helped.
On Jul 27, 2018, at 9:10 AM, Bowie Bailey Bowie_Bailey@BUC.com wrote:
I have a CentOS 7 server that is running out of memory
How do you know that? Give a specific symptom.
Running "free -h" gives me this: total used free shared buff/cache available Mem: 3.4G 2.4G 123M 5.9M
This is such a common misunderstanding that it has its own web site:
https://www.linuxatemyram.com/
On 07/27/2018 08:50 AM, Warren Young wrote:
This is such a common misunderstanding that it has its own web site: https://www.linuxatemyram.com/
The misunderstanding was mostly related to an older version of "free" that included buffers/cache in the "used" column. "used" in this case does not include buffers/cache, so it should be possible to account for the used memory by examining application and kernel memory use.
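The raw numbers that free works from are all in /proc/meminfo, so the kernel-side pieces can be checked directly with something like this (MemAvailable may be missing on older 7.x kernels):

grep -E 'MemTotal|MemFree|MemAvailable|Buffers|^Cached|^Slab|SReclaimable|SUnreclaim|^Shmem' /proc/meminfo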
On 7/27/2018 11:50 AM, Warren Young wrote:
On Jul 27, 2018, at 9:10 AM, Bowie Bailey Bowie_Bailey@BUC.com wrote:
I have a CentOS 7 server that is running out of memory
How do you know that? Give a specific symptom.
This was brought to my attention because one program was killed by the kernel to free memory and another program failed because it was unable to allocate enough memory.
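The kernel logs the kill along with a per-process memory table, which helps with the accounting after the fact, e.g.:

dmesg -T | grep -iE 'out of memory|killed process'

(or grep the same strings in /var/log/messages if the dmesg buffer has wrapped).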
Running "free -h" gives me this: total used free shared buff/cache available Mem: 3.4G 2.4G 123M 5.9M
This is such a common misunderstanding that it has its own web site:
https://www.linuxatemyram.com/
Right, and that website says that you should look at the "available" number in the results from "free", which is what I was referencing. They say a healthy system should have at least 20% of its memory available. Mine was down to 17% in what I posted in my email, and it was at about 8% when I rebooted yesterday.
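For the record, that percentage can be pulled straight out of free with something like:

free | awk '/^Mem:/ { printf "%.0f%% available\n", $7/$2*100 }'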
Bowie Bailey wrote:
On 7/27/2018 11:50 AM, Warren Young wrote:
On Jul 27, 2018, at 9:10 AM, Bowie Bailey Bowie_Bailey@BUC.com wrote:
I have a CentOS 7 server that is running out of memory
How do you know that? Give a specific symptom.
This was brought to my attention because one program was killed by the kernel to free memory and another program failed because it was unable to allocate enough memory.
<snip> Um, wait a minute - are you saying the oom-killer was invoked? My reaction to that is to consider the system, at that point, to be in an undefined state, because you don't know which processes or threads were killed.
mark
On 7/27/2018 12:58 PM, mark wrote:
Bowie Bailey wrote:
On 7/27/2018 11:50 AM, Warren Young wrote:
On Jul 27, 2018, at 9:10 AM, Bowie Bailey Bowie_Bailey@BUC.com wrote:
I have a CentOS 7 server that is running out of memory
How do you know that? Give a specific symptom.
This was brought to my attention because one program was killed by the kernel to free memory and another program failed because it was unable to allocate enough memory.
<snip> Um, wait a minute - are you saying the oom-killer was invoked? My reaction to that is to consider the system, at that point, to be in an undefined state, because you don't know which processes or threads were killed.
Probably true, but the system has been rebooted since then and the oom-killer has not been activated again. When I first noticed the problem, I also found that my swap partition had been deactivated, which is why the oom-killer got involved in the first place instead of swap usage just slowing the system to a crawl.
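A quick way to confirm swap is still active:

swapon -s

(cat /proc/swaps shows the same thing.)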
I think I have identified the program that is causing the problem (memory usage went back to normal when the process ended), but I'm still not sure how it ended up using 10x the memory that top reported for it.
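Next time it grows, the suspect process's smaps gives a fuller breakdown than the single RSS number in top (PID here is a placeholder):

awk '/^(Rss|Pss|Swap|Shared|Private)/ { sum[$1] += $2 } END { for (k in sum) printf "%-16s %8.1f MB\n", k, sum[k]/1024 }' /proc/PID/smaps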
On 07/27/2018 08:10 AM, Bowie Bailey wrote:
The problem is that I can't find 2.4G of usage.
Are your results from "top" similar to:
ps axu | sort -nr -k +6
If you don't see 2.4G of use from applications, maybe the kernel is using a lot of memory. Check /proc/slabinfo. You can simplify its content to bytes per object type and a total:
grep -v ^# /proc/slabinfo | awk 'BEGIN {t=0;} {print $1 " " ($3 * $4); t=t+($3 * $4)} END {print "total " t/(1024 * 1024) " MB";}' | column -t
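slabtop (also from procps-ng) shows the same data as a sorted view if that's easier to read, e.g. a one-shot dump sorted by cache size:

slabtop -o -s c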
On 7/27/2018 12:13 PM, Gordon Messmer wrote:
On 07/27/2018 08:10 AM, Bowie Bailey wrote:
The problem is that I can't find 2.4G of usage.
Are your results from "top" similar to:
ps axu | sort -nr -k +6
That looks the same.
If you don't see 2.4G of use from applications, maybe the kernel is using a lot of memory. Check /proc/slabinfo. You can simplify its content to bytes per object type and a total:
grep -v ^# /proc/slabinfo | awk 'BEGIN {t=0;} {print $1 " " ($3 * $4); t=t+($3 * $4)} END {print "total " t/(1024 * 1024) " MB";}' | column -t
The total number from that report is about 706M.
My available memory has now jumped up from 640M to 1.5G after one of the processes (which was reportedly using about 100M) finished.
I'll have to wait until the problem recurs and see what it looks like then, but for now I used the numbers from "ps axu" to add up a real total, added the 706M to it, and got within 300M of the memory currently reported as used by free.
What could account for a process actually using much more memory than is reported by ps or top?
On 07/27/2018 09:38 AM, Bowie Bailey wrote:
The total number from that report is about 706M.
Did it move at all after killing that "one process"?
My available memory has now jumped up from 640M to 1.5G after one of the processes (which was reportedly using about 100M) finished.
I'll have to wait until the problem re-occurs and see what it looks like then, but for now I used the numbers from "ps axu" to add up a real total and then added the 706M to it and got within 300M of the memory currently reported used by free.
What could account for a process actually using much more memory than is reported by ps or top?
Are you counting both resident and shared memory? If the process that you terminated had around 900MB of shared memory, and you aren't looking at that value, then that'd explain your memory use.
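If it was SysV or POSIX shared memory, it won't necessarily stand out in RSS. Two quick checks:

ipcs -m
df -h /dev/shm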
We're kinda guessing without seeing any of your command output.