[CentOS] Server hangs on CentOS 5.5

Wed Mar 9 17:52:03 UTC 2011
Les Mikesell <lesmikesell at gmail.com>

On 3/9/2011 11:32 AM, Michael Eager wrote:
>
> I'm not particularly interested in a listing of the myriad of
> hypothetical causes absent observable evidence and some of
> which are contradicted by evidence (such as overheating).

Note that overheating can be localized, for example from a badly mounted 
heat sink or a failing fan on a CPU.

> I've encountered my share of bad power supplies, bad RAM,
> poorly seated cards, etc.  I've replaced failing capacitors
> in monitors (never on a motherboard).  I've replaced video
> cards, hard drives, bad cables.  And so forth.  Each of these
> had characteristics which pointed to the problem: kernel oops,
> POST failures, flickering screens, etc.  The problem I have is
> that there is a lack of diagnostic information to focus on the
> cause of the server failure.

Anything that happens quickly isn't going to show up in a log, because 
the machine can hang before the message is ever written to disk.
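
One cheap way to at least narrow down when it died is a heartbeat that 
appends a timestamped line every few seconds and forces it to disk.  
This is only a rough sketch, not something I've run on your box: the 
function names, the /var/tmp/heartbeat.log path and the 10 second 
interval are all made up for illustration, and it records nothing but 
the load averages.

```python
import os
import time


def format_heartbeat(now, loadavg):
    """Build one log line: UTC timestamp plus 1/5/15-minute load averages."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(now))
    return "%s load=%.2f,%.2f,%.2f\n" % ((stamp,) + tuple(loadavg))


def write_heartbeat(path, line):
    """Append a line and fsync it so it is on disk before we sleep again."""
    f = open(path, "a")
    try:
        f.write(line)
        f.flush()              # push Python's buffer to the kernel
        os.fsync(f.fileno())   # ask the kernel to push it to the disk
    finally:
        f.close()


def run(path="/var/tmp/heartbeat.log", interval=10):
    """Loop forever; path and interval are hypothetical defaults."""
    while True:
        write_heartbeat(path, format_heartbeat(time.time(), os.getloadavg()))
        time.sleep(interval)
```

Call run() from a small script started at boot.  After a hang, the last 
line in the file tells you roughly when the machine stopped and what the 
load looked like just before, which is more than syslog usually gives you.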

> I don't mean to appear unappreciative, but suggestions which
> amount to spending many hours making a series of unfocused
> modifications to the server, hoping that one of these random
> alterations fixes an infrequent problem, doesn't strike me as
> useful.  At the other extreme, the suggestions that I not look
> for the cause of the system failure and instead replace the
> server with one or three servers also doesn't seem to be a
> useful diagnostic approach either.

There's not really a good way to approach intermittent failures.  It may 
only break when you aren't looking.  Swapping major components, or taking 
it offline for extended diagnostics in the hope of catching a glimpse of 
the cause when it fails, is about all you can do.

> During the next server downtime, I'll re-seat RAM and
> cables, check for excess dust, and do normal maintenance
> as folks have suggested.  I might also run a memory diag.
> I'll also look at the several excellent and appreciated
> suggestions (some of which I've already installed) on how
> to get a better picture on the state of the server when/if
> there is a future failure.

Memory diagnostics may take days to catch a problem.  Did you check for 
a newer BIOS for your motherboard?  I mentioned before that it seemed 
strange, but I've seen a BIOS update fix mysterious problems even after 
the machines had previously been reliable for a long time (and even more 
oddly, not all the machines in the lot were affected).

-- 
   Les Mikesell
     lesmikesell at gmail.com