on 07:54 Thu 27 Jan, John Hodrien (J.H.Hodrien@leeds.ac.uk) wrote:
> On Wed, 26 Jan 2011, Dr. Ed Morbius wrote:
> > I'd suggest the automount route as well (you're only open to NFS issues while the filesystem is mounted), but you then have to maintain automount maps and run the risk of issues with the automounter (I've seen large production environments in which the OOM killer would arbitrarily select processes to kill ....).
> Once you're into an OOM state, you're screwed anyway. Is turning off overcommit a sane option these days or not?
Our suggested fix was to dramatically reduce overcommit, or disable it. I don't recall what was ultimately decided.
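For anyone wondering what "reducing overcommit" works out to in numbers, here is a rough sketch. With vm.overcommit_memory=2 (strict accounting), the kernel documentation gives the commit limit as (RAM minus huge pages) * vm.overcommit_ratio / 100 + swap. The function name and the sample host below are made up for illustration; they are not figures from the environment described here.

```python
def commit_limit_kb(mem_total_kb, swap_total_kb, overcommit_ratio, hugetlb_kb=0):
    """Approximate CommitLimit in kB under vm.overcommit_memory=2.

    Follows the formula in the kernel's overcommit accounting notes:
    (RAM - huge pages) * overcommit_ratio / 100 + swap.
    """
    return (mem_total_kb - hugetlb_kb) * overcommit_ratio // 100 + swap_total_kb


if __name__ == "__main__":
    # Hypothetical host: 16 GiB RAM, 4 GiB swap, no huge pages.
    ram_kb = 16 * 1024 * 1024
    swap_kb = 4 * 1024 * 1024
    for ratio in (50, 80, 100):
        print(ratio, commit_limit_kb(ram_kb, swap_kb, ratio))
```

Lowering the ratio shrinks the address space the kernel will promise, so allocations fail up front with ENOMEM rather than the OOM killer picking a victim later.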
Frankly, bouncing the box would generally be better than letting it get in some weird wedge state (and was what we usually ended up doing in this instance anyway). Environment was a distributed batch-process server farm. Engineers were disciplined to either improve memory management or request host resources appropriately.
Now, if you were to run monit, out of init, and restart critical services as they failed, you might get around some of the borkage, but yeah, generally, what OOM is trying to tell you is that you're Doing It Wrong[tm].
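To illustrate the "restart it when it dies" idea only, a minimal sketch follows. This is not monit's configuration or how monit is implemented; `supervise` and its parameters are hypothetical names, and a real supervisor would also want logging, backoff, and protection from being OOM-killed itself.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, delay=0.0):
    """Run cmd; restart it after any non-zero exit, up to max_restarts times.

    Returns the number of restarts performed. A process killed by a signal
    (as with the OOM killer's SIGKILL) shows up as a negative return code,
    which also counts as a failure here.
    """
    restarts = 0
    while True:
        rc = subprocess.call(cmd)
        if rc == 0 or restarts >= max_restarts:
            return restarts
        restarts += 1
        time.sleep(delay)


if __name__ == "__main__":
    # Stand-in "service" that exits cleanly straight away.
    print(supervise([sys.executable, "-c", "pass"]))
```

Note that this papers over the symptom; a service that keeps getting killed still needs its memory use or host sizing fixed.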