[CentOS] Server hangs on CentOS 5.5

Wed Mar 9 17:32:54 UTC 2011

John Hodrien wrote:
> On Wed, 9 Mar 2011, Michael Eager wrote:
> 
>> The problem with randomly replacing various components, other than
>> the downtime and nuisance, is that there's no way to know that the
>> change actually fixed any problem.  When the base rate is one
>> unknown system hang every few weeks, how many weeks should I wait
>> without a failure to conclude that the replaced component was the
>> cause?  A failure which happens infrequently isn't really amenable
>> to a random diagnostic approach.
> 
> So you pitch the whole thing over to being a test rig, and buy all new
> hardware?

I'll repeat from my original post:

    I don't see anything in /var/log/messages or elsewhere
    to indicate any problem or offer any clue why the system
    was hung.

    Any suggestions where I might look for a clue?

I'm looking for diagnostics to focus on the cause of the crash.
My thanks for the several suggestions in this area.

I'm not particularly interested in a listing of the myriad
hypothetical causes for which there is no observable evidence,
some of which are contradicted by the evidence (such as
overheating).

I've encountered my share of bad power supplies, bad RAM,
poorly seated cards, etc.  I've replaced failing capacitors
in monitors (never on a motherboard).  I've replaced video
cards, hard drives, bad cables.  And so forth.  Each of these
had characteristics which pointed to the problem: kernel oops,
POST failures, flickering screens, etc.  The problem I have is
that there is a lack of diagnostic information to focus on the
cause of the server failure.

I don't mean to appear unappreciative, but suggestions which
amount to spending many hours making a series of unfocused
modifications to the server, hoping that one of these random
alterations fixes an infrequent problem, don't strike me as
useful.  At the other extreme, the suggestions that I not look
for the cause of the system failure and instead replace the
server with one or three servers don't seem to be a useful
diagnostic approach either.
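
To put a rough number on the "how many weeks should I wait"
question quoted above, here is a minimal sketch.  It assumes
hangs arrive independently at a constant average rate (a Poisson
process), which is only a modelling assumption, and it uses a
mean of one hang every 3 weeks purely as an illustrative figure
for "every few weeks".  The function name is hypothetical.

```python
import math


def weeks_without_failure(mean_weeks_between_hangs, confidence):
    """Weeks of clean uptime needed before concluding a change helped.

    Under the assumed Poisson model, if nothing was actually fixed,
    the chance of seeing no hang for t weeks is exp(-t / mean).
    We solve exp(-t / mean) = 1 - confidence for t.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return -mean_weeks_between_hangs * math.log(1.0 - confidence)


if __name__ == "__main__":
    # 3.0 weeks between hangs is an assumed, illustrative base rate.
    for confidence in (0.90, 0.95, 0.99):
        weeks = weeks_without_failure(3.0, confidence)
        print(f"{confidence:.0%} confidence: about {weeks:.1f} weeks")
```

Under these assumptions, each replaced component would need
roughly three times the mean interval between hangs of
trouble-free running to reach 95% confidence, which is why
swapping parts one at a time is so slow for a rare failure.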

During the next server downtime, I'll re-seat RAM and
cables, check for excess dust, and do normal maintenance
as folks have suggested.  I might also run a memory diag.
I'll also look at the several excellent and appreciated
suggestions (some of which I've already installed) on how
to get a better picture of the state of the server when/if
there is a future failure.

Thanks all!

-- 
Michael Eager	 eager at eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077