Michael Eager wrote:
> John Hodrien wrote:
>> On Wed, 9 Mar 2011, Michael Eager wrote:
>>
>>> The problem with randomly replacing various components, other than
>>> the downtime and nuisance, is that there's no way to know that the
>>> change actually fixed any problem. When the base rate is one
>>> unknown system hang every few weeks, how many weeks should I wait
>>> without a failure to conclude that the replaced component was the
>>> cause? A failure which happens infrequently isn't really amenable
>>> to a random diagnostic approach.
>>
>> So you pitch the whole thing over to being a test rig, and buy all new
>> hardware?
>
> I'll repeat from my original post:
>
>    I don't see anything in /var/log/messages or elsewhere
>    to indicate any problem or offer any clue why the system
>    was hung.
>
>    Any suggestions where I might look for a clue?
>
> I'm looking for diagnostics to focus on the cause of the crash.
> My thanks for the several suggestions in this area.
>
> I'm not particularly interested in a listing of the myriad of
> hypothetical causes absent observable evidence, some of which
> are contradicted by evidence (such as overheating).

<snip>

Here's one more, off-the-wall thought: do the setterm --powersave off, and
find some way to make it work, so that you can see what's on the screen
when it dies.

What may be very important here is that I recently had a problem with a
honkin' big server crashing... and it turned out that a user was running a
parallel processing job that kicked off three? four? dozen threads, and
towards the end of the job, every single thread wanted 10G... on a system
with 256G RAM (which size still boggles my mind). The OOM-Killer didn't
even have a chance to do its thing....

Yes, he's limited what his job requests, and the system hasn't crashed since.

     mark
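
P.S. A minimal sketch of the setterm idea, assuming a util-linux setterm
that takes the long option spellings (older ones use a single dash) and a
text console on tty1 -- the device is just an example, adjust to taste:

    # Keep the first virtual console from blanking or powering down, so
    # whatever the kernel last printed is still on screen after a hang.
    # The redirects aim setterm at tty1 instead of the terminal you type in.
    TERM=linux setterm --blank 0 --powersave off --powerdown 0 \
        > /dev/tty1 < /dev/tty1

    # Same effect from boot onward, if you'd rather not rerun it:
    # add  consoleblank=0  to the kernel command line.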
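
P.P.S. On the memory-hog side, a rough bash sketch of the sort of cap that
user now puts on his runs; run_job.sh and the 10 GiB figure are placeholders
for whatever the job actually is, not a recommendation:

    # Cap each process's virtual address space at ~10 GiB (ulimit -v takes
    # KiB). Anything the job forks inherits the limit, so a runaway thread
    # pool gets allocation failures instead of taking the whole box down.
    ulimit -v $((10 * 1024 * 1024))
    ./run_job.sh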