[CentOS] Server hangs on CentOS 5.5

Wed Mar 9 18:07:41 UTC 2011
Michael Eager <eager at eagercon.com>

m.roth at 5-cent.us wrote:
> Michael Eager wrote:
>> John Hodrien wrote:
>>> On Wed, 9 Mar 2011, Michael Eager wrote:
>>>
>>>> The problem with randomly replacing various components, other than
>>>> the downtime and nuisance, is that there's no way to know that the
>>>> change actually fixed any problem.  When the base rate is one
>>>> unknown system hang every few weeks, how many weeks should I wait
>>>> without a failure to conclude that the replaced component was the
>>>> cause?  A failure which happens infrequently isn't really amenable
>>>> to a random diagnostic approach.
>>> So you pitch the whole thing over to being a test rig, and buy all new
>>> hardware?
>> I'll repeat from my original post:
>>
>>     I don't see anything in /var/log/messages or elsewhere
>>     to indicate any problem or offer any clue why the system
>>     was hung.
>>
>>     Any suggestions where I might look for a clue?
>>
>> I'm looking for diagnostics to focus on the cause of the crash.
>> My thanks for the several suggestions in this area.
>>
>> I'm not particularly interested in a listing of the myriad of
>> hypothetical causes absent observable evidence and some of
>> which are contradicted by evidence (such as overheating).
> <snip>
> Here's one more, off-the-wall thought: do the setterm --powersave off, and
> find some way to make it work, so that you can see what's on the screen
> when it dies. 

Yes, I did this.  Switched to console screen.  The correct command
is "setterm -powersave off -blank 0", otherwise the screen gets
blanked.  Turned the monitor off.  I hope it shows something
useful on the next fault.

> What may be very important here is I recently had a problem
> with a honkin' big server crashing... and it turned out that a user was
> running a parallel processing job that kicked off three? four? dozen
> threads, and towards the end of the job, every single thread wanted 10G...
> on a system with 256G RAM (which size still boggles my mind). The
> OOM-Killer didn't even have a chance to do its thing.... Yes, he's limited
> what his job requests, and the system hasn't crashed since.

Strange.  OOM-Killer should get priority.  That's what it's for.
Although it usually seems to kill the innocent bystanders before
it gets around to killing the offenders.
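
To see which processes the OOM-Killer would currently pick first, one
option is to read the per-process badness score the kernel exposes in
/proc/<pid>/oom_score. The following is a minimal sketch, not a tested
tool: it assumes the kernel provides oom_score and a "Name:" line in
/proc/<pid>/status, and it assumes Python 3 (CentOS 5 ships an older
Python, so it would need adapting there). The function name
rank_oom_candidates and its parameters are made up for illustration.

```python
import os


def rank_oom_candidates(proc_root="/proc", top=10):
    """Return up to `top` (oom_score, pid, name) tuples, highest score first.

    Hypothetical helper: `proc_root` is a parameter only so the scan can be
    pointed at a different directory tree than the live /proc.
    """
    candidates = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directory names are processes
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "oom_score")) as f:
                score = int(f.read().strip())
            name = "?"
            with open(os.path.join(base, "status")) as f:
                for line in f:
                    if line.startswith("Name:"):
                        name = line.split(":", 1)[1].strip()
                        break
        except (OSError, ValueError):
            continue  # process exited mid-scan, or the file was unreadable
        candidates.append((score, int(entry), name))
    # Highest score first; ties are broken by ascending pid.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    return candidates[:top]


if __name__ == "__main__":
    for score, pid, name in rank_oom_candidates():
        print("%8d  %6d  %s" % (score, pid, name))
```

Run periodically while a large job is active, the output shows whether
the job's processes or unrelated ones sit at the top of the list. It
only reports scores; it does not explain a hang in which the
OOM-Killer never ran.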

-- 
Michael Eager	 eager at eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077