[CentOS] Server hangs on CentOS 5.5
Les Mikesell
lesmikesell at gmail.com
Wed Mar 9 17:52:03 UTC 2011
On 3/9/2011 11:32 AM, Michael Eager wrote:
>
> I'm not particularly interested in a listing of the myriad of
> hypothetical causes absent observable evidence and some of
> which are contradicted by evidence (such as overheating).
Note that overheating can be localized, for example from a badly
mounted heat sink or a failing fan on a CPU.
> I've encountered my share of bad power supplies, bad RAM,
> poorly seated cards, etc. I've replaced failing capacitors
> in monitors (never on a motherboard). I've replaced video
> cards, hard drives, bad cables. And so forth. Each of these
> had characteristics which pointed to the problem: kernel oops,
> POST failures, flickering screens, etc. The problem I have is
> that there is a lack of diagnostic information to focus on the
> cause of the server failure.
Anything that happens quickly isn't going to show up in a log.
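
One cheap way to at least bracket the time of a hang is a heartbeat
written to disk every few seconds, so the last entry tells you roughly
when the machine stopped. A rough sketch follows; the file path, the
interval, and the function names are my own placeholders, not anything
from this thread, and it only narrows down the time of a hang rather
than explaining it.

```python
import os
import time


def heartbeat_line(now=None):
    """Build one log line: UTC timestamp plus 1/5/15-minute load averages."""
    if now is None:
        now = time.gmtime()
    load1, load5, load15 = os.getloadavg()
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", now)
    return "%s load=%.2f,%.2f,%.2f\n" % (stamp, load1, load5, load15)


def write_heartbeat(path, line):
    """Append a line and fsync it so it is on disk before a sudden hang."""
    f = open(path, "a")
    try:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())
    finally:
        f.close()


def run(path="/var/tmp/heartbeat.log", interval=10, count=None):
    """Write a heartbeat every `interval` seconds (path is a placeholder).

    With count=None this loops until the process is killed.
    """
    written = 0
    while count is None or written < count:
        write_heartbeat(path, heartbeat_line())
        written += 1
        time.sleep(interval)
```

Call run() from a small script started in the background. After a
hang, the timestamp of the last line in the file gives the failure time
to within one interval, and the load figures show whether the box was
busy just before it stopped.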
> I don't mean to appear unappreciative, but suggestions which
> amount to spending many hours making a series of unfocused
> modifications to the server, hoping that one of these random
> alterations fixes an infrequent problem, doesn't strike me as
> useful. At the other extreme, the suggestions that I not look
> for the cause of the system failure and instead replace the
> server with one or three servers also doesn't seem to be a
> useful diagnostic approach either.
There's not really a good way to approach intermittent failures. It may
only break when you aren't looking. Major component swaps, or taking it
offline for extended diagnostics in the hope of catching a glimpse of
the cause when it fails, are about all you can do.
> During the next server downtime, I'll re-seat RAM and
> cables, check for excess dust, and do normal maintenance
> as folks have suggested. I might also run a memory diag.
> I'll also look at the several excellent and appreciated
> suggestions (some of which I've already installed) on how
> to get a better picture on the state of the server when/if
> there is a future failure.
Memory diagnostics may take days to catch a problem. Did you check for
a newer BIOS for your motherboard? I mentioned before that it seemed
strange, but I've seen that fix mysterious problems even after the
machines had previously been reliable for a long time (and even more
oddly, not all the machines in the lot were affected).
--
Les Mikesell
lesmikesell at gmail.com