[CentOS] Server hangs on CentOS 5.5

Wed Mar 9 18:43:14 UTC 2011
Dr. Ed Morbius <dredmorbius at gmail.com>

on 07:06 Wed 09 Mar, Michael Eager (eager at eagerm.com) wrote:
> Dr. Ed Morbius wrote:
> >on 09:24 Tue 08 Mar, Michael Eager (eager at eagerm.com) wrote:
> >>Hi --
> >>
> >>I'm running a server which is usually stable, but every
> >>once in a while it hangs.  The server is used as a file
> >>store using NFS and to run VMware machines.
> >>
> >>I don't see anything in /var/log/messages or elsewhere
> >>to indicate any problem or offer any clue why the system
> >>was hung.
> >>
> >>Any suggestions where I might look for a clue?
> >
> >I'd very strongly recommend you configure netconsole.  Though not entirely
> >clear from the name, it's actually an in-kernel network logging module,
> >which is very useful for kicking out kernel panics which otherwise
> >aren't logged to disk and can't be seen on a (nonresponsive) monitor.
> 
> I'll take a look at netconsole.
> 
> >Alternately, a serial console which actually retains all output sent to
> >it (some remote access systems support this, some don't) may help.
> >
> >Barring that, I'd start looking at individual HW components, starting
> >with RAM.
> 
> The problem with randomly replacing various components, other than the
> downtime and nuisance, is that there's no way to know that the change
> actually fixed any problem.  When the base rate is one unknown system
> hang every few weeks, how many weeks should I wait without a failure to
> conclude that the replaced component was the cause?  A failure which
> happens infrequently isn't really amenable to a random diagnostic
> approach.

This is where vendor management/relations starts coming into the
picture.

Your architecture should also tolerate single-point failures.

If the issue is repeated but rare system failures on one of a set of
similarly configured hosts, I'd RMA the box and get a replacement.  End
of story.

If that's not the case, well, then, I suppose YOUR problem is to figure
out when you've resolved the issue.  I've outlined the steps I'd take.
If this means weeks of uncertainty, then I'd communicate this fact, in
no uncertain terms, to my manager, along with the financial implications
of downtime.
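
For a rough number to put in front of a manager, here is a minimal
sketch of the "how long do I wait" arithmetic.  It assumes the hangs
arrive independently at a steady average rate (a Poisson process),
which may not hold for your box, and the three-week interval is a
placeholder for "every few weeks", not a measured value.  The function
names are made up for illustration.

```python
import math


def chance_of_quiet_run(mean_weeks_between_hangs, weeks):
    """Probability of seeing no hang for `weeks` if NOTHING was fixed.

    Assumes a Poisson process: P(no hang in t) = exp(-t / mean).
    """
    if mean_weeks_between_hangs <= 0:
        raise ValueError("mean interval must be positive")
    return math.exp(-weeks / mean_weeks_between_hangs)


def weeks_without_hang(mean_weeks_between_hangs, confidence):
    """Hang-free weeks needed so that an unfixed box would have hung
    by then with probability `confidence` (e.g. 0.95)."""
    if mean_weeks_between_hangs <= 0:
        raise ValueError("mean interval must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Solve exp(-t / mean) = 1 - confidence for t.
    return -mean_weeks_between_hangs * math.log(1.0 - confidence)


if __name__ == "__main__":
    mean = 3.0  # assumed: one hang every three weeks on average
    for confidence in (0.90, 0.95, 0.99):
        weeks = weeks_without_hang(mean, confidence)
        print("%.0f%%: wait %.1f weeks" % (confidence * 100, weeks))
```

Under those assumptions, a box that hangs about once every three weeks
needs roughly nine clean weeks per swapped component before a quiet
run means much at the 95% level.  That is the cost to weigh against
simply replacing the host.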

If downtime is more expensive than system replacement costs, the
decision is pretty obvious, even if painful.

Note that most system problems /are/ single-source.  If you'd post
details of the host, more logging information, netconsole panic logs,
etc., it might be possible to narrow down possible causes.

With what you've posted to date, it's not.
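
To collect those netconsole logs you need something listening on a
second machine, since netconsole just sends kernel messages as UDP
packets.  Syslog or netcat is the usual choice; the following is only
a minimal sketch of a receiver, assuming the hung server is set up to
send to UDP port 6666 on the log host.  The port, the log file name
and the function names are placeholders, not anything from your
setup.

```python
import socket
import time


def format_record(sender_ip, payload, timestamp):
    """Turn one received UDP payload into a timestamped log line."""
    text = payload.decode("ascii", errors="replace").rstrip("\r\n")
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(timestamp))
    return "%s %s %s" % (stamp, sender_ip, text)


def listen(port=6666, logfile="netconsole.log", max_packets=None):
    """Append every UDP packet received on `port` to `logfile`.

    Runs until interrupted, or until `max_packets` have been logged.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    received = 0
    try:
        with open(logfile, "a") as log:
            while max_packets is None or received < max_packets:
                payload, (sender_ip, _sender_port) = sock.recvfrom(65535)
                log.write(format_record(sender_ip, payload, time.time()) + "\n")
                # Flush each line so nothing is lost if this host dies too.
                log.flush()
                received += 1
    finally:
        sock.close()
```

Call listen() on the log host and leave it running.  The timestamps
are in UTC, which makes it easier to line up the last kernel messages
against whatever else was going on when the server hung.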

-- 
Dr. Ed Morbius, Chief Scientist /            |
  Robot Wrangler / Staff Psychologist        | When you seek unlimited power
Krell Power Systems Unlimited                |                  Go to Krell!