[CentOS] Server locking up everyday around 3:30 AM

Fri Mar 11 18:40:22 UTC 2011
m.roth at 5-cent.us <m.roth at 5-cent.us>

PJ wrote:
> On Fri, Mar 11, 2011 at 10:06 AM,  <m.roth at 5-cent.us> wrote:
>> PJ wrote:
<snip>
>>> I'm running a CentOS 5.5 server, running the latest kernel
>>> 2.6.18-194.32.1.el5.
>>>
>>> Almost everyday around 3:30 AM the server completely locks up and has
>>> to be power cycled before it will come back online.
>>> (this means someone had to wake up and reboot the server, oh how I love
>>> being an internet janitor! :)
>> <snip>
>>> I was able to pull this from /var/log/messages, this happens just
>>> seconds before locking up completely...
>>>
>>> Mar  8 03:33:18 web1 kernel: INFO: task wget:13608 blocked for more
>>> than 120 seconds.
>>> Mar  8 03:33:19 web1 kernel: "echo 0 >
>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> Mar  8 03:33:19 web1 kernel: wget          D ffff810001004420     0
>>> 13608  13607                     (NOTLB)
>>> Mar  8 03:33:19 web1 kernel:  ffff81007bc7bc78 0000000000000086
>>> ffff81007bc7bd88 ffff81000100d3f8
>>> Mar  8 03:33:19 web1 kernel:  ffff81007bc7bbf0 0000000000000007
>>> ffff8100849db0c0 ffffffff80308b60
>>> Mar  8 03:33:19 web1 kernel:  00013a2964cdf439 0000000000003237
>>> ffff8100849db2a8 0000000064c82eae
>>> Mar  8 03:33:19 web1 kernel: Call Trace:
>>> Mar  8 03:33:20 web1 kernel:  [<ffffffff80063c6f>]
>>> __mutex_lock_slowpath+0x60/0x9b
>> <snip>
>> Anyone else smell an OOM killer? But it's clearly whatever the wget's
>> after that's killing the system.
>
> What makes no sense to me is this runs every 5 minutes all day, but
> only around 3:30 AM does it lock up.
>
> There is nothing in the log that suggests the kernel is having to kill
> processes because it is out of resources.
>
> No "httpd invoked oom-killer" etc... which I have seen before in other
> situations.
>
> http://bugs.centos.org/view.php?id=4515 sounds like what I have going
> on, but not with kjournald of course...

A couple of things: a few weeks ago, we were hitting the OOM killer with no
log entries at all. That turned out to be someone starting a parallel
processing job that wanted all the cores and, near the end, half again as
much memory. Apparently *all* the threads hit that point so fast that the
OOM killer didn't have the time or the memory to run.
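
If memory exhaustion with no log trail is a suspect, one option is to leave
your own trail on disk before the box dies. The following is only a rough
sketch, assuming a Python 3 interpreter is available (the stock Python on
CentOS 5 is much older and would need adjustments). The function names and
the choice of fields are illustrative, not anything from the original
report. It parses /proc/meminfo and formats a single timestamped line that
could be appended to a file from cron once a minute.

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: int}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def snapshot_line(info, now=None):
    """Format one log line with the fields most useful after a lockup."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    # Field selection is an assumption; add whatever else looks relevant.
    fields = ("MemTotal", "MemFree", "Cached", "SwapTotal", "SwapFree")
    return stamp + " " + " ".join("%s=%s" % (f, info.get(f, "?")) for f in fields)


if __name__ == "__main__":
    # Only meaningful on Linux, where /proc/meminfo exists.
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as fh:
            print(snapshot_line(parse_meminfo(fh.read())))
```

If the last few lines written before a lockup show MemFree and SwapFree
collapsing, that points toward memory; if they look normal, it is probably
something else.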

Another thing: it may be running every five minutes, but you might want to
look at what it gets at 03:30 that might be different than the rest of the
day, such as a major backup, or an entire day's reconciliations, complete
with gigabytes of scans....
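
One way to see what else is going on around that time is to count which
programs are logging in a window around 03:30. This is a minimal sketch,
assuming Python 3 and the classic syslog line layout shown in the quoted
/var/log/messages excerpt. The function name and the default window are
made up for illustration.

```python
import re
from collections import Counter

# Classic syslog layout: "Mar  8 03:33:18 web1 kernel: message"
# Captures HH:MM and the program tag (without any "[pid]" suffix).
LINE = re.compile(r"^\w{3}\s+\d+\s+(\d{2}:\d{2}):\d{2}\s+\S+\s+([^\s:\[]+)")


def programs_in_window(lines, start="03:20", end="03:40"):
    """Count log lines per program whose time of day falls in [start, end].

    Times are zero-padded "HH:MM" strings, so plain string comparison
    orders them correctly. The date is ignored, so every day in the log
    contributes to the same window.
    """
    counts = Counter()
    for line in lines:
        m = LINE.match(line)
        if m and start <= m.group(1) <= end:
            counts[m.group(2)] += 1
    return counts
```

Feeding it an open log file, e.g. programs_in_window(open("/var/log/messages")),
and then the same for the cron log (typically /var/log/cron on CentOS)
should show whether a backup or some other heavy job starts in that window.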

   mark