PJ wrote:
> On Fri, Mar 11, 2011 at 10:06 AM, <m.roth at 5-cent.us> wrote:
>> PJ wrote:
>> <snip>
>>> I'm running a CentOS 5.5 server, running the latest kernel
>>> 2.6.18-194.32.1.el5.
>>>
>>> Almost every day around 3:30 AM the server completely locks up and has
>>> to be power cycled before it will come back online.
>>> (this means someone had to wake up and reboot the server, oh how I love
>>> being an internet janitor! :)
>> <snip>
>>> I was able to pull this from /var/log/messages, this happens just
>>> seconds before locking up completely...
>>>
>>> Mar 8 03:33:18 web1 kernel: INFO: task wget:13608 blocked for more than 120 seconds.
>>> Mar 8 03:33:19 web1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> Mar 8 03:33:19 web1 kernel: wget D ffff810001004420 0 13608 13607 (NOTLB)
>>> Mar 8 03:33:19 web1 kernel: ffff81007bc7bc78 0000000000000086 ffff81007bc7bd88 ffff81000100d3f8
>>> Mar 8 03:33:19 web1 kernel: ffff81007bc7bbf0 0000000000000007 ffff8100849db0c0 ffffffff80308b60
>>> Mar 8 03:33:19 web1 kernel: 00013a2964cdf439 0000000000003237 ffff8100849db2a8 0000000064c82eae
>>> Mar 8 03:33:19 web1 kernel: Call Trace:
>>> Mar 8 03:33:20 web1 kernel: [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
>> <snip>
>> Anyone else smell an OOM killer? But it's clearly whatever the wget's
>> after that's killing the system.
>
> What makes no sense to me is this runs every 5 minutes all day, but
> only around 3:30 AM does it lock up.
>
> There is nothing in the log that suggests the kernel is having to kill
> processes because it is out of resources.
>
> No "httpd invoked oom-killer" etc... which I have seen before in other
> situations.
>
> http://bugs.centos.org/view.php?id=4515 sounds like what I have going
> on, but not with kjournald of course...

A couple of things. A few weeks ago we were hitting the OOM killer with no log entries, but that was because someone had started a parallel processing job that wanted all the cores... and, near the end, half again as much memory. *All* the threads apparently hit that point so fast that the OOM killer didn't have the time or memory to run.

Another thing: it may be running every five minutes, but you might want to look at what it gets at 03:30 that might be different from the rest of the day, such as a major backup, or an entire day's reconciliations, complete with gigabytes of scans....

      mark
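
One way to confirm that the hangs really cluster around 03:30, and that it is always the same task, is to tally the hung-task messages by hour. Below is a minimal sketch in Python, assuming the syslog line format shown in the quoted /var/log/messages excerpt. The function name hung_tasks_by_hour is made up for illustration, and the code assumes a reasonably recent Python 3, so on a stock CentOS 5 box it may be easier to copy the log elsewhere first.

```python
import re
from collections import Counter

# Matches lines like the one quoted above:
#   Mar 8 03:33:18 web1 kernel: INFO: task wget:13608 blocked for more than 120 seconds.
# The exact layout is an assumption based on that single sample.
HUNG_TASK = re.compile(
    r"^\w{3}\s+\d{1,2}\s+"                     # month and day
    r"(?P<hour>\d{2}):\d{2}:\d{2}\s+"          # time of day
    r"\S+\s+kernel:\s+"                        # hostname, then the kernel tag
    r"INFO: task (?P<task>.+?):(?P<pid>\d+) "  # task name and pid
    r"blocked for more than \d+ seconds"
)


def hung_tasks_by_hour(lines):
    """Count hung-task warnings per (hour, task name) pair."""
    counts = Counter()
    for line in lines:
        match = HUNG_TASK.match(line)
        if match:
            counts[(int(match.group("hour")), match.group("task"))] += 1
    return counts
```

Feeding it the open log file (for example, hung_tasks_by_hour(open("/var/log/messages", errors="replace"))) gives a Counter keyed by hour and task name. If everything lands in the 03:00 bucket under wget, the next step is to see what that wget is fetching at that hour.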