[CentOS] Server hangs on CentOS 5.5

Tue Mar 8 21:48:59 UTC 2011
Dr. Ed Morbius <dredmorbius at gmail.com>

on 10:31 Tue 08 Mar, Michael Eager (eager at eagerm.com) wrote:
> Les Mikesell wrote:
> > On 3/8/2011 11:24 AM, Michael Eager wrote:
> >> Hi --
> >>
> >> I'm running a server which is usually stable, but every
> >> once in a while it hangs.  The server is used as a file
> >> store using NFS and to run VMware machines.
> >>
> >> I don't see anything in /var/log/messages or elsewhere
> >> to indicate any problem or offer any clue why the system
> >> was hung.
> >>
> >> Any suggestions where I might look for a clue?
> > 
> > Probably something hardware related.  Bad memory, overheating, power 
> > supply, etc.  I've even seen some rare cases where a bios update would 
> > fix it although it didn't make much sense for a machine to run for 
> > years, then need a firmware change.
> 
> The system is on a UPS and temps seem reasonable.
> Locating a transient memory problem is time consuming.

Disable or remove half your RAM.  If the hangs persist, put that half
back and remove the other half.  If the hangs stop, the fault is likely
in the half you've removed.  You can binary search through it, or RMA
the lot if it's under warranty.
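
The binary search can be sketched as below.  This is only an
illustration of the bookkeeping, assuming exactly one faulty module;
the slot names and the hangs_with() callback are hypothetical stand-ins
for "boot with only these modules installed, wait long enough to see a
hang, and report what happened".

```python
from typing import Callable, List, Sequence


def find_bad_module(
    modules: Sequence[str],
    hangs_with: Callable[[List[str]], bool],
) -> str:
    """Narrow a single faulty module down by repeated halving.

    modules    -- labels for the installed modules (hypothetical names)
    hangs_with -- hypothetical callback: returns True if the machine
                  still hangs when only the given modules are installed
    """
    if not modules:
        raise ValueError("no modules to test")

    candidates = list(modules)
    while len(candidates) > 1:
        # Run with only the first half of the current suspects installed.
        half = candidates[: len(candidates) // 2]
        if hangs_with(half):
            # Still hangs: the fault is among the installed half.
            candidates = half
        else:
            # Hangs stopped: the fault is among the modules left out.
            candidates = candidates[len(half):]
    return candidates[0]
```

With eight modules this takes three rounds of testing.  Each round is
only as trustworthy as the time you let the box run, so with a hang
that shows up "every once in a while" the rounds will be long.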

> Identifying a power supply which sometimes spikes is even more
> difficult.  

Same drill.  Replace the power supply, or on a dual-PS system, disable
one, then the other.  Follow the same procedure as for RAM.

> I'd like to have a clue about the likely problem before shutting down
> the server for an extended period.

If the server is critical, get a vendor loaner and bench-test the
equipment until the fault can be identified.
 
> I'll set up sar and sensord to periodically log system status and see
> if this gives me a clue for the next time this happens.

At best, sar will tell you whether or not you're experiencing resource
exhaustion.  It's a valuable tool, but fairly coarse-grained.  Cacti
will give you better resolution and visualization than sar,
particularly on CentOS: some distros now include sar graphing
utilities, but to the best of my recollection CentOS does not.
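
If you want something finer-grained in the meantime, a small sampler
that appends a line every few seconds and forces it to disk will at
least show what load and memory looked like just before a hang.  The
following is a minimal sketch, not a replacement for sar or Cacti.
Assumptions: the log path, interval and function names are made up for
illustration, and it wants a reasonably current Python 3 rather than
the Python that CentOS 5 ships.

```python
import os
import time


def parse_meminfo(text):
    """Return {field: value_in_kB} from /proc/meminfo-style text."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def format_sample(now, loadavg_text, meminfo_text):
    """Build one log line from raw /proc/loadavg and /proc/meminfo text."""
    load1 = loadavg_text.split()[0]  # first field is the 1-minute load
    mem = parse_meminfo(meminfo_text)
    return "%d load1=%s memfree_kb=%s swapfree_kb=%s" % (
        int(now),
        load1,
        mem.get("MemFree", "?"),
        mem.get("SwapFree", "?"),
    )


def log_forever(path="/var/tmp/hangwatch.log", interval=10):
    """Append a sample every `interval` seconds (path is hypothetical)."""
    while True:
        with open("/proc/loadavg") as f:
            loadavg_text = f.read()
        with open("/proc/meminfo") as f:
            meminfo_text = f.read()
        with open(path, "a") as out:
            out.write(format_sample(time.time(), loadavg_text, meminfo_text) + "\n")
            out.flush()
            os.fsync(out.fileno())  # push the line to disk before a hang can lose it
        time.sleep(interval)
```

Call log_forever() from a screen session or an init script.  If the
last samples before a hang show free memory and swap draining away,
you're looking at resource exhaustion; if they look normal, that
points back toward hardware.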

-- 
Dr. Ed Morbius, Chief Scientist /            |
  Robot Wrangler / Staff Psychologist        | When you seek unlimited power
Krell Power Systems Unlimited                |                  Go to Krell!