Just had a server hang on me... it seems pretty clear that some process stole all the RAM (clamd?)
Jul 30 16:26:04 srv1 kernel: auditd invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=-17
Jul 30 16:26:08 srv1 kernel: [<c0457d36>] out_of_memory+0x72/0x1a4
Jul 30 16:26:08 srv1 kernel: [<c0459161>] __alloc_pages+0x201/0x282
Jul 30 16:26:08 srv1 kernel: [<c045a3bf>] __do_page_cache_readahead+0xc4/0x1c6
Jul 30 16:26:08 srv1 kernel: [<c0438ecf>] wake_futex+0x3a/0x44
Jul 30 16:26:08 srv1 kernel: [<c043a2a9>] do_futex+0x738/0xb15
Jul 30 16:26:08 srv1 kernel: [<f88d0b96>] dm_any_congested+0x2f/0x35 [dm_mod]
Jul 30 16:26:08 srv1 kernel: [<c04572d8>] filemap_nopage+0x151/0x315
Jul 30 16:26:08 srv1 kernel: [<c045fda3>] __handle_mm_fault+0x172/0x87b
Jul 30 16:26:08 srv1 kernel: [<c0606c2b>] do_page_fault+0x20a/0x4b8
Jul 30 16:26:08 srv1 kernel: [<c0606a21>] do_page_fault+0x0/0x4b8
Jul 30 16:26:08 srv1 kernel: [<c0405a71>] error_code+0x39/0x40
Jul 30 16:26:08 srv1 kernel: =======================
snip...
Jul 30 16:26:21 srv1 kernel: Free pages: 12044kB (124kB HighMem)
Jul 30 16:26:21 srv1 kernel: Active:119791 inactive:120544 dirty:0 writeback:12 unstable:0 free:3011 slab:7550 mapped-file:1746 mapped-anon:215165 pagetables:3364
Jul 30 16:26:32 srv1 kernel: DMA free:4096kB min:68kB low:84kB high:100kB active:3948kB inactive:3608kB present:16384kB pages_scanned:17135 all_unreclaimable? yes
Jul 30 16:26:32 srv1 kernel: lowmem_reserve[]: 0 0 880 1007
Jul 30 16:26:33 srv1 kernel: DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Jul 30 16:26:33 srv1 kernel: lowmem_reserve[]: 0 0 880 1007
Jul 30 16:26:33 srv1 kernel: Normal free:7824kB min:3756kB low:4692kB high:5632kB active:413832kB inactive:416972kB present:901120kB pages_scanned:2000672 all_unreclaimable? yes
Jul 30 16:26:33 srv1 kernel: lowmem_reserve[]: 0 0 0 1023
Jul 30 16:26:33 srv1 kernel: HighMem free:124kB min:128kB low:264kB high:400kB active:61512kB inactive:61412kB present:130944kB pages_scanned:271904 all_unreclaimable? yes
Jul 30 16:26:33 srv1 kernel: lowmem_reserve[]: 0 0 0 0
Jul 30 16:26:33 srv1 kernel: DMA: 0*4kB 0*8kB 24*16kB 4*32kB 0*64kB 0*128kB 2*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 4096kB
Jul 30 16:26:34 srv1 kernel: DMA32: empty
Jul 30 16:26:34 srv1 kernel: Normal: 44*4kB 4*8kB 234*16kB 7*32kB 1*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 7824kB
Jul 30 16:26:34 srv1 kernel: HighMem: 3*4kB 0*8kB 3*16kB 0*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 124kB
Jul 30 16:26:34 srv1 kernel: Swap cache: add 61987202, delete 61986895, find 48568667/56549915, race 0+19406
Jul 30 16:26:34 srv1 kernel: Free swap = 0kB
Jul 30 16:26:34 srv1 kernel: Total swap = 2031608kB
Jul 30 16:26:34 srv1 kernel: Free swap: 0kB
Jul 30 16:26:34 srv1 kernel: 262112 pages of RAM
Jul 30 16:26:34 srv1 kernel: 32736 pages of HIGHMEM
Jul 30 16:26:34 srv1 kernel: 3327 reserved pages
Jul 30 16:26:34 srv1 kernel: 8186 pages shared
Jul 30 16:26:34 srv1 kernel: 313 pages swap cached
Jul 30 16:26:34 srv1 kernel: 0 pages dirty
Jul 30 16:26:34 srv1 kernel: 12 pages writeback
Jul 30 16:26:34 srv1 kernel: 1746 pages mapped
Jul 30 16:26:34 srv1 kernel: 7550 pages slab
Jul 30 16:26:34 srv1 kernel: 3364 pages pagetables
how does one determine who the culprit was?
Craig
On Wed, Jul 30, 2008 at 20:31, Craig White craigwhite@azapple.com wrote:
> how does one determine who the culprit was?
Very hard... the kernel tries to "guess" which process is causing the issue, but from what I've seen (and I see OOMs every week) it guesses wrong most of the time. In my case, the victim ends up being "nscd" most of the time, even when I'm sure it's not using a lot of memory nor leaking.
In my case, usually when I start having OOMs I have them on several machines running the same programs (it's a grid) so it's more or less easy to find the culprit by looking at the jobs that were running on all affected machines.
In any case, my policy is to always reboot a machine after an OOM, since it may be in an incoherent state.
HTH, Filipe
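For anyone who wants to see how the kernel currently ranks its candidates, here is a minimal sketch. It assumes a Linux /proc that exposes /proc/<pid>/oom_score and a procps-style ps; the function name top_oom_scores is made up for illustration. It only shows the ranking at the moment it runs, not what the kernel saw when the OOM happened, so treat the output as a hint rather than a verdict.

```shell
#!/bin/sh
# top_oom_scores [N]
# Print "score pid command" for the N processes that the kernel currently
# rates as the most likely OOM-killer victims (highest score first).
# Assumption: /proc/<pid>/oom_score is readable on this kernel.
top_oom_scores() {
    n=${1:-10}
    for f in /proc/[0-9]*/oom_score; do
        pid=${f#/proc/}
        pid=${pid%/oom_score}
        # A process may exit while we iterate; skip it quietly.
        score=$(cat "$f" 2>/dev/null) || continue
        comm=$(ps -o comm= -p "$pid" 2>/dev/null) || comm='?'
        printf '%s %s %s\n' "$score" "$pid" "$comm"
    done | sort -rn | head -n "$n"
}
```

Calling top_oom_scores 5 prints at most five lines. Comparing that list with what actually uses the most memory is one way to spot the kind of wrong guess described above.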
On Wed, 2008-07-30 at 22:19 -0400, Filipe Brandenburger wrote:
> On Wed, Jul 30, 2008 at 20:31, Craig White craigwhite@azapple.com wrote:
> > how does one determine who the culprit was?
> Very hard... the kernel tries to "guess" which process is causing the issue, but from what I've seen (and I see OOMs every week) it guesses wrong most of the time. In my case, the victim ends up being "nscd" most of the time, even when I'm sure it's not using a lot of memory nor leaking.
> In my case, usually when I start having OOMs I have them on several machines running the same programs (it's a grid) so it's more or less easy to find the culprit by looking at the jobs that were running on all affected machines.
> In any case, my policy is to always reboot a machine after an OOM, since it may be in an incoherent state.
---- Well, I stopped using nscd a few years ago. It is definitely off after the reboot, and chkconfig says it shouldn't start by itself, so I put it in the realm of possible but unlikely.
I did update to 5.2 on Sunday and updated nss-ldap yesterday, and today: boink. I have no way to know what actually caused this, though, as the logs don't reveal enough as far as I can tell. The system has been up for quite some time.
I suppose I could run some type of cron script that does something like...
top -n 1 -b >> /tmp/top.log
so if it happens again, I get a memory snapshot history...is there a better idea?
Craig
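A variation on the top idea, as a rough sketch only: log a timestamp plus the processes with the largest resident set size, so each snapshot is short and already sorted by memory use. It assumes the procps ps found on CentOS (the --sort option is procps-specific). The function name mem_snapshot and the paths in the comment are hypothetical.

```shell
#!/bin/sh
# mem_snapshot [N]
# Print a timestamp line followed by the N processes with the largest
# resident set size (RSS, in kB), biggest first.
# Assumption: procps ps with --sort support.
mem_snapshot() {
    n=${1:-15}
    printf '=== %s ===\n' "$(date '+%Y-%m-%d %H:%M:%S')"
    # n + 1 lines: one header row from ps plus N process rows.
    ps -eo pid,user,rss,vsz,comm --sort=-rss | head -n "$((n + 1))"
}

# Hypothetical crontab entry, assuming this file is saved as
# /usr/local/lib/mem_snapshot.sh:
# */5 * * * * . /usr/local/lib/mem_snapshot.sh && mem_snapshot >> /var/log/mem_snapshot.log 2>&1
```

Logging somewhere other than /tmp keeps the history from being cleaned up before anyone reads it. A five-minute interval is only a guess at a sensible default.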
On Wed, 2008-07-30 at 19:55 -0700, Craig White wrote:
<snip>
> I suppose I could run some type of cron script that does something like...
> top -n 1 -b >> /tmp/top.log
> so if it happens again, I get a memory snapshot history...is there a better idea?
If you have the sar packages installed the available reports will nail it for you.
> Craig
<snip sig stuff>
HTH
On Thu, 2008-07-31 at 06:47 -0400, William L. Maltby wrote:
> On Wed, 2008-07-30 at 19:55 -0700, Craig White wrote:
> <snip>
> > I suppose I could run some type of cron script that does something like...
> > top -n 1 -b >> /tmp/top.log
> > so if it happens again, I get a memory snapshot history...is there a better idea?
> If you have the sar packages installed the available reports will nail it for you.
---- hmmm....seems pretty clear that I've got something leaking memory
from this morning (sar -r)
06:30:01 AM kbmemfree kbmemused  %memused kbbuffers  kbcached kbswpfree kbswpused  %swpused  kbswpcad
06:40:01 AM     17456   1017668     98.31     23520    222468   1600880    430728     21.20    133388
1600880 kbswpfree
from yesterday the 30 minutes to the moment of death...
05:00:01 PM     22672   1012468     97.81     35868    131760      1052   2030556     99.95     29452
05:10:01 PM     16228   1018912     98.43     31596    167148       108   2031500     99.99     12288
05:20:02 PM     12136   1023004     98.83     55064     76868      6860   2024748     99.66     55768
05:30:01 PM     12472   1022668     98.80     18608     81296         0   2031608    100.00     48364
So you can see that kbswpfree went from 1052 => 108 => 6860 => 0
and on July 25 (two days before I updated to 5.2) but there were users in the office (same time period)...
05:00:01 PM     21092   1014048     97.96     47580    133536     82320   1949288     95.95     67468
05:10:01 PM     50332    984808     95.14     60632    107352     83560   1948048     95.89     49740
05:20:01 PM     26060   1009080     97.48     51484    123264     87192   1944416     95.71     56560
05:30:01 PM     55480    979660     94.64     24660    123368     87952   1943656     95.67     58716
but on July 27 - the day I updated - no users in office - same time period, the kbswpfree started swinging wildly.
sar doesn't tell me which program is leaking memory, but perhaps it was just the update without a reboot that was the issue.
Craig
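Reading kbswpfree off the sar -r samples by eye can be automated. The sketch below is one possible way, assuming the column layout shown in the samples quoted above. Newer sysstat releases report swap under sar -S instead, so the input command would need adjusting there. The function name low_swap and the threshold are illustrative.

```shell
#!/bin/sh
# low_swap THRESHOLD_KB
# Read `sar -r` style output on stdin and print "time kbswpfree" for every
# sample whose kbswpfree is below THRESHOLD_KB.
# Assumption: a header row naming the kbswpfree column precedes the data.
low_swap() {
    awk -v limit="$1" '
        # Summary rows have a different field layout; ignore them.
        $1 == "Average:" { next }
        # Header row: remember which field holds kbswpfree.
        { for (i = 1; i <= NF; i++) if ($i == "kbswpfree") { col = i; next } }
        # Data row: report it when free swap is under the limit.
        col && $col ~ /^[0-9]+$/ && $col + 0 < limit {
            t = $1
            if ($2 == "AM" || $2 == "PM") t = t " " $2
            print t, $col
        }
    '
}
```

For example, sar -r | low_swap 10000 would list only the samples where free swap dropped under roughly 10 MB. Like sar itself, this shows when memory ran out, not which program took it.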
On Thu, 2008-07-31 at 10:16 -0700, Craig White wrote:
<snip>
> but on July 27 - the day I updated - no users in office - same time period, the kbswpfree started swinging wildly.
> sar doesn't tell me which program is leaking memory, but perhaps it was just the update without a reboot that was the issue.
Regardless of users in the office, if they left themselves logged in, the old versions of libs/programs would have to be kept on the system and in memory (or swap). If your users were like mine used to be, that's likely. Then as users began logging in/out, the system would have two versions of many libs/programs active at the same time. The result is likely at least a doubling of mem/swap usage because very little code is shared.
> Craig
<snip>
On Thu, 2008-07-31 at 17:18 -0400, William L. Maltby wrote:
> On Thu, 2008-07-31 at 10:16 -0700, Craig White wrote:
> <snip>
> > but on July 27 - the day I updated - no users in office - same time period, the kbswpfree started swinging wildly.
> > sar doesn't tell me which program is leaking memory, but perhaps it was just the update without a reboot that was the issue.
> Regardless of users in the office, if they left themselves logged in, the old versions of libs/programs would have to be kept on the system and in memory (or swap). If your users were like mine used to be, that's likely. Then as users began logging in/out, the system would have two versions of many libs/programs active at the same time. The result is likely at least a doubling of mem/swap usage because very little code is shared.
---- nothing like a system death and cold reboot to make certain that all users are logged out I guess ;-)
Yeah, I have some users who despite my occasional begging to get them to shut down or at least log off, simply don't.
FWIW - according to sar, the kbswpfree has remained high all day so I suspect that this is something that will bite me again a long time in the future when I least expect it.
Thanks for the tip on sar...it's sort of cool and it's been taking snapshots all along which does confirm the problem but of course didn't identify the source of the problem.
Craig
Craig White wrote:
> Yeah, I have some users who despite my occasional begging to get them to shut down or at least log off, simply don't.
While it can be done with grep, I like the slay command, though I don't think it's available in RHEL.
NAME
    slay - kill all processes belonging to a user
SYNOPSIS
    slay [-signal] name [name...]
DESCRIPTION
    Slay sends given signal (KILL by default) to all processes belonging to user(s) given on the command line. When called without arguments it displays short help.
    You can use -clean as a signal name, in that case a "clean kill" is done, that is processes are first sent TERM signal and after 10 seconds those that haven't terminated yet are killed with KILL
[..]
FILES
    /etc/slay_mode - contains keywords describing the mode slay works in, separated by newlines:
    mean      turns mean mode on. In mean mode attempts to slay people without root privileges are punished. This is the default.
    nice      turns mean mode off.
    butthead  switches slay to Butt-head messages mode.
    normal    switches slay to normal messages mode. This is the default.
    You can only use one of mean/nice keywords and one of butthead/normal keywords.
--
nate
On Thu, 31 Jul 2008 15:08:16 -0700 (PDT) nate centos@linuxpowered.net wrote:
> While it can be done with grep, I like the slay command, though I don't think it's available in RHEL.
pkill
On Thu, 31 Jul 2008, nate wrote:
> Craig White wrote:
> > Yeah, I have some users who despite my occasional begging to get them to shut down or at least log off, simply don't.
> While it can be done with grep, I like the slay command, though I don't think it's available in RHEL.
> NAME slay - kill all processes belonging to a user
CentOS includes pkill (in the procps rpm), which performs the same function:
pkill -U username
or, more severely,
pkill -KILL -U username
That's what I use to get rid of open filehandles, esp. on NFS mounts, before shutting down shared servers.
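The two pkill calls can be combined into something like the "clean kill" that the slay man page describes (TERM first, KILL after a grace period). This is only a sketch of that idea, not slay's actual implementation. The function name clean_kill and the PKILL override variable are hypothetical, the latter added so the function can be dry-run without signalling anything.

```shell
#!/bin/sh
# clean_kill USER [GRACE_SECONDS]
# Send TERM to every process owned by USER, wait, then KILL what is left.
# PKILL is a hypothetical override hook (defaults to pkill) for dry runs.
clean_kill() {
    if [ -z "$1" ]; then
        echo "usage: clean_kill USER [GRACE_SECONDS]" >&2
        return 2
    fi
    user=$1
    grace=${2:-10}    # the slay man page quotes a 10 second wait for -clean
    pkill_cmd=${PKILL:-pkill}

    # Polite request first; pkill exits non-zero when nothing matched,
    # which is not an error for our purposes.
    $pkill_cmd -TERM -U "$user" || true
    sleep "$grace"
    # Anything still running gets the signal it cannot ignore.
    $pkill_cmd -KILL -U "$user" || true
}
```

Run as root, clean_kill username gives processes ten seconds to exit before they are killed. As in the commands above, -U matches on the real user ID.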
On Thu, Jul 31, 2008 at 2:59 PM, Craig White craigwhite@azapple.com wrote:
> nothing like a system death and cold reboot to make certain that all users are logged out I guess ;-)
> Yeah, I have some users who despite my occasional begging to get them to shut down or at least log off, simply don't.
init 1 does that without bringing the system all the way down....
mhr
On Wed, 2008-07-30 at 19:55 -0700, Craig White wrote:
<snip>
P.S. And running sar. Use a short sample interval if you suspect a rapidly inflating hog. If the hog is only slowly "bloating", a slow sample rate will do.
If it's a "memory leak", I can't recall (been years... no, decades since I dinked with this stuff) if sar will help or not. Some sampling of memory pool information may be needed in that case. But sar might provide some clues in that case.