On Wed, 2008-07-30 at 22:19 -0400, Filipe Brandenburger wrote:
On Wed, Jul 30, 2008 at 20:31, Craig White <craigwhite@azapple.com> wrote:
how does one determine who the culprit was?
Very hard... the kernel tries to "guess" which process is causing the issue, but from what I've seen (and I see OOMs every week) it guesses wrong most of the time. In my case, the victim usually ends up being "nscd", even when I'm sure it's neither using a lot of memory nor leaking.
When I start having OOMs, I usually have them on several machines running the same programs (it's a grid), so it's fairly easy to find the culprit by looking at the jobs that were running on all affected machines.
In any case, my policy is to always reboot a machine after an OOM, since it may be in an incoherent state.
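On the kernel's "guess" mentioned above: Linux exposes the score it would use for each process in /proc/<pid>/oom_score, so the likely victim can be inspected before anything is killed. The following is only a rough sketch; the function name oom_scores and the cutoff of 10 lines are illustrative assumptions, and a high score only shows whom the kernel would pick, not who is actually responsible.

```shell
#!/bin/sh
# Hypothetical helper: list the processes the OOM killer currently
# rates highest, as "score pid name", highest score first.
oom_scores() {
    for d in /proc/[0-9]*; do
        # A process may exit while we iterate; skip it if the read fails.
        score=$(cat "$d/oom_score" 2>/dev/null) || continue
        # Take the process name from the "Name:" line of the status file.
        name=$(sed -n 's/^Name:[[:space:]]*//p' "$d/status" 2>/dev/null)
        echo "$score ${d#/proc/} $name"
    done | sort -rn | head -n 10
}

oom_scores
```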
---- Well, I stopped using nscd a few years ago. It is definitely off after the reboot, and chkconfig says it shouldn't start by itself, so I put that in the realm of possible but unlikely.
I did update to 5.2 on Sunday and updated nss-ldap yesterday, and today - boink. I have no way to know what actually caused this, though, since the logs don't reveal enough as far as I can tell. The system has been up for quite some time.
I suppose I could run some type of cron script that does something like...
top -n 1 -b >> /tmp/top.log
so if it happens again, I get a memory snapshot history... Is there a better idea?
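A slightly more targeted variant of that idea might look like the sketch below. It stamps each snapshot with the time and records only overall memory use plus the processes with the largest resident set size, which keeps the log easier to scan than full top output. The function name snapshot, the LOG variable, the default log path and the 15-process cutoff are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical cron helper: append a timestamped memory snapshot to a log.
# LOG can be overridden from the environment; the default is an assumption.
LOG="${LOG:-/tmp/mem-snapshot.log}"

snapshot() {
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
        # Overall memory and swap usage, in megabytes.
        free -m
        # Header line plus the 15 processes with the largest resident
        # set size, biggest first.
        ps -eo pid,user,rss,vsz,comm --sort=-rss | head -n 16
        echo
    } >> "$LOG"
}

snapshot
```

Run from cron every few minutes, this builds the same kind of history. Since the machine may be rebooted after an OOM, a log location that is not cleaned up automatically may be a safer choice than /tmp, and the file will grow until it is rotated or trimmed.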
Craig