Updating the kernel will probably fix your problem.

Best Regards
Sinux

On 2011-3-12, at 10:08, "Ross Walker" <rswwalker at gmail.com> wrote:

> On Mar 11, 2011, at 12:33 PM, PJ <pauljerome at gmail.com> wrote:
>
>> This may or may not be CentOS related, but I am out of ideas at this
>> point and wanted to bounce this off the list.
>>
>> I'm running a CentOS 5.5 server, running the latest kernel 2.6.18-194.32.1.el5.
>>
>> Almost every day around 3:30 AM the server completely locks up and has
>> to be power cycled before it will come back online.
>> (This means someone had to wake up and reboot the server, oh how I
>> love being an internet janitor! :)
>>
>> Smells like a hardware issue to me too, but I went through all of the
>> Dell diagnostics, updated the firmware, and everything checks out as being
>> okay: RAID, disks, RAM, etc. I spent an hour on the phone with a Dell
>> tech. No hardware issues, at least none that we were able to find.
>>
>> There are no cron jobs that run at 3:30, no backups, the server has a
>> load of 0, nothing is scheduled around that time...
>>
>> The only crontab entry at all is
>> "*/5 * * * * wget -q www.websitedomain.com/cron.php >/dev/null 2>&1"
>> They are running Magento for commerce purposes and this runs every 5 minutes.
>>
>> Why does the server only lock up around 3:30 AM? Because it knows I
>> am fast asleep?
>>
>> I was able to pull this from /var/log/messages; it is logged just
>> seconds before the server locks up completely...
>>
>> Mar 8 03:33:18 web1 kernel: INFO: task wget:13608 blocked for more than 120 seconds.
>> Mar 8 03:33:19 web1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Mar 8 03:33:19 web1 kernel: wget D ffff810001004420 0 13608 13607 (NOTLB)
>> Mar 8 03:33:19 web1 kernel: ffff81007bc7bc78 0000000000000086 ffff81007bc7bd88 ffff81000100d3f8
>> Mar 8 03:33:19 web1 kernel: ffff81007bc7bbf0 0000000000000007 ffff8100849db0c0 ffffffff80308b60
>> Mar 8 03:33:19 web1 kernel: 00013a2964cdf439 0000000000003237 ffff8100849db2a8 0000000064c82eae
>> Mar 8 03:33:19 web1 kernel: Call Trace:
>> Mar 8 03:33:20 web1 kernel: [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
>> Mar 8 03:33:20 web1 kernel: [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
>> Mar 8 03:33:20 web1 kernel: [<ffffffff8000cf82>] do_lookup+0x90/0x1e6
>> Mar 8 03:33:20 web1 kernel: [<ffffffff8000a29c>] __link_path_walk+0xa01/0xf5b
>> Mar 8 03:33:20 web1 kernel: [<ffffffff8000ea4b>] link_path_walk+0x42/0xb2
>> Mar 8 03:33:20 web1 kernel: [<ffffffff8000cd72>] do_path_lookup+0x275/0x2f1
>> Mar 8 03:33:23 web1 kernel: [<ffffffff80012851>] getname+0x15b/0x1c2
>> Mar 8 03:33:23 web1 kernel: [<ffffffff800239d1>] __user_walk_fd+0x37/0x4c
>> Mar 8 03:33:23 web1 kernel: [<ffffffff80028905>] vfs_stat_fd+0x1b/0x4a
>> Mar 8 03:33:23 web1 kernel: [<ffffffff80023703>] sys_newstat+0x19/0x31
>> Mar 8 03:33:23 web1 kernel: [<ffffffff8005d116>] system_call+0x7e/0x83
>>
>> If anyone has some advice on where to go from here it would be greatly
>> appreciated.
>
> Do an fsck of the file system wget is writing to, as there might be
> corruption that it hits only on the 3:30 AM run, since that's when the
> other vendor dumps data to be downloaded.
>
> You could also check whether a RAID patrol read (scrub/predictive
> failure detection) is happening around this time, and disable or
> reschedule it.
>
> -Ross
>
> _______________________________________________
> CentOS mailing list
> CentOS at centos.org
> http://lists.centos.org/mailman/listinfo/centos
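
Before changing anything, it may help to confirm that the hung-task reports really do cluster around 3:30 AM across several nights. Below is a minimal Python sketch that scans syslog-style lines for the "blocked for more than N seconds" message quoted above and counts the reports per hour. The names `HungTask`, `find_hung_tasks` and `count_by_hour` are made up for this example, and the regular expression assumes the line layout shown in the quoted /var/log/messages excerpt (task names without spaces); other syslog formats may need adjusting.

```python
import re
from collections import Counter
from typing import Iterable, List, NamedTuple

# Assumed layout, taken from the excerpt quoted in this thread:
#   Mar 8 03:33:18 web1 kernel: INFO: task wget:13608 blocked for more than 120 seconds.
HUNG_TASK_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+"
    r"(?P<hour>\d{2}):(?P<minute>\d{2}):\d{2}\s+\S+\s+kernel:\s+"
    r"INFO: task (?P<task>\S+):(?P<pid>\d+) blocked for more than "
    r"(?P<secs>\d+) seconds"
)


class HungTask(NamedTuple):
    """One hung-task report (hypothetical record type for this sketch)."""
    month: str
    day: int
    hour: int
    minute: int
    task: str
    pid: int
    secs: int


def find_hung_tasks(lines: Iterable[str]) -> List[HungTask]:
    """Return one record per 'blocked for more than N seconds' line."""
    events = []
    for line in lines:
        m = HUNG_TASK_RE.match(line)
        if m is None:
            continue  # not a hung-task report; skip it
        events.append(HungTask(
            month=m.group("month"),
            day=int(m.group("day")),
            hour=int(m.group("hour")),
            minute=int(m.group("minute")),
            task=m.group("task"),
            pid=int(m.group("pid")),
            secs=int(m.group("secs")),
        ))
    return events


def count_by_hour(events: Iterable[HungTask]) -> Counter:
    """Count hung-task reports per hour of day (0-23)."""
    return Counter(e.hour for e in events)
```

To use it, pass an open log file, for example `count_by_hour(find_hung_tasks(open("/var/log/messages", errors="replace")))`. If nearly every report lands in hour 3, that would support looking for something scheduled outside cron, such as the RAID patrol read mentioned above.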