Dr. Ed Morbius wrote:
on 09:24 Tue 08 Mar, Michael Eager (eager@eagerm.com) wrote:
Hi --
I'm running a server which is usually stable, but every once in a while it hangs. The server is used as a file store using NFS and to run VMware machines.
I don't see anything in /var/log/messages or elsewhere to indicate any problem or offer any clue why the system was hung.
Any suggestions where I might look for a clue?
I'd very strongly recommend you configure netconsole. Though not entirely clear from the name, it's actually an in-kernel network logging module, which is very useful for kicking out kernel panics which otherwise aren't logged to disk and can't be seen on a (nonresponsive) monitor.
I'll take a look at netconsole.
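For reference, netconsole sends kernel messages as plain UDP datagrams, so anything that listens on a UDP port on a second machine can capture them. On the server that hangs, the module is loaded with a parameter of the general shape netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]; the actual addresses and interface depend on the network and are not given in this thread. What follows is only a minimal sketch of the receiving side in Python, assuming the default target port 6666. The function names and the log path are made up for illustration.

```python
import socket
import time


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket. 6666 is netconsole's default target port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def read_message(sock, bufsize=4096):
    """Block until one datagram arrives and return (sender_ip, text)."""
    data, (sender_ip, _port) = sock.recvfrom(bufsize)
    return sender_ip, data.decode("utf-8", errors="replace")


def log_messages(sock, path, limit=None):
    """Append timestamped messages to path; stop after `limit` if given."""
    seen = 0
    with open(path, "a", encoding="utf-8") as log:
        while limit is None or seen < limit:
            sender_ip, text = read_message(sock)
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"{stamp} {sender_ip} {text.rstrip()}\n")
            # Flush each line so the file is current if the receiver is stopped.
            log.flush()
            seen += 1
    return seen
```

Calling log_messages(open_listener(), "netconsole.log") on the receiving machine would keep appending whatever the hung server managed to emit before it stopped responding. A standard tool such as netcat or a syslog daemon listening on UDP would do the same job; the point is only that the last kernel messages end up on a machine that is still alive.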
Alternately, a serial console which actually retains all output sent to it (some remote access systems support this, some don't) may help.
Barring that, I'd start looking at individual HW components, starting with RAM.
The problem with randomly replacing various components, other than the downtime and nuisance, is that there's no way to know that the change actually fixed any problem. When the base rate is one unknown system hang every few weeks, how many weeks should I wait without a failure to conclude that the replaced component was the cause? A failure which happens infrequently isn't really amenable to a random diagnostic approach.
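To put a rough number on that question: if the hangs are modeled as independent random events with a constant average rate (a Poisson process, which is an assumption and not something measured on this server), then the chance of seeing no hang for t weeks purely by luck is exp(-t/m), where m is the mean number of weeks between hangs. The sketch below, with hypothetical function names, computes how long a quiet period has to be before that chance drops below a chosen threshold.

```python
import math


def chance_of_quiet_period(weeks_waited, mean_weeks_between_hangs):
    """Probability of zero hangs in weeks_waited if nothing was actually fixed."""
    return math.exp(-weeks_waited / mean_weeks_between_hangs)


def weeks_to_wait(mean_weeks_between_hangs, alpha=0.05):
    """Quiet weeks needed so that a lucky streak has probability <= alpha."""
    return mean_weeks_between_hangs * math.log(1.0 / alpha)
```

Under this model, a 5% threshold works out to about three times the mean interval, since ln(20) is roughly 3. With one hang every three weeks on average, that is about nine quiet weeks per swapped component before the swap can be credited with any confidence.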