[CentOS] Server hangs on CentOS 5.5

Michael Eager eager at eagerm.com
Wed Mar 9 18:07:41 UTC 2011

m.roth at 5-cent.us wrote:
> Michael Eager wrote:
>> John Hodrien wrote:
>>> On Wed, 9 Mar 2011, Michael Eager wrote:
>>>> The problem with randomly replacing various components, other than
>>>> the downtime and nuisance, is that there's no way to know that the
>>>> change actually fixed any problem.  When the base rate is one
>>>> unknown system hang every few weeks, how many weeks should I wait
>>>> without a failure to conclude that the replaced component was the
>>>> cause?  A failure which happens infrequently isn't really amenable
>>>> to a random diagnostic approach.
>>> So you pitch the whole thing over to being a test rig, and buy all new
>>> hardware?
>> I'll repeat from my original post:
>>     I don't see anything in /var/log/messages or elsewhere
>>     to indicate any problem or offer any clue why the system
>>     was hung.
>>     Any suggestions where I might look for a clue?
>> I'm looking for diagnostics to focus on the cause of the crash.
>> My thanks for the several suggestions in this area.
>> I'm not particularly interested in a listing of the myriad of
>> hypothetical causes absent observable evidence and some of
>> which are contradicted by evidence (such as overheating).
> <snip>
> Here's one more, off-the-wall thought: do the setterm --powersave off, and
> find some way to make it work, so that you can see what's on the screen
> when it dies. 

Yes, I did this.  Switched to console screen.  The correct command
is "setterm -powersave off -blank off", otherwise the screen gets
blanked.  Turned the monitor off.  I hope it shows something
useful on the next fault.
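
On the question quoted above of how many weeks to wait after swapping
a component: here is a rough sketch, assuming hangs arrive independently
at a constant average rate (a Poisson process, which is an assumption
and not something the logs establish).  Under that model, the chance of
seeing no hang in t weeks even though nothing was fixed is exp(-t/m),
where m is the mean number of weeks between hangs.  The function name
and the 3-week figure are illustrative only.

```python
import math

def weeks_to_wait(mean_weeks_between_hangs, confidence=0.95):
    """Quiet weeks needed before a hang-free run is unlikely to be luck.

    Assumes hangs follow a Poisson process with the given mean spacing.
    Solves exp(-t / mean) = 1 - confidence for t.
    """
    return -mean_weeks_between_hangs * math.log(1.0 - confidence)

# Hypothetical base rate: one hang every 3 weeks.
print(round(weeks_to_wait(3), 1))  # about 9.0 weeks
```

So with a hang roughly every 3 weeks, each swapped component would need
about 9 quiet weeks before a clean run means much, which is why random
replacement is so slow at this failure rate.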

> What may be very important here is I recently had a problem
> with a honkin' big server crashing... and it turned out that a user was
> running a parallel processing job that kicked off three? four? dozen
> threads, and towards the end of the job, every single thread wanted 10G...
> on a system with 256G RAM (which size still boggles my mind). The
> OOM-Killer didn't even have a chance to do its thing.... Yes, he's limited
> what his job requests, and the system hasn't crashed since.

Strange.  OOM-Killer should get priority.  That's what it's for.
Although it usually seems to kill the innocent bystanders before
it gets around to killing the offenders.
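
To see ahead of time which processes the OOM-Killer is likely to pick,
the kernel exposes a per-process score in /proc/<pid>/oom_score, where
a higher score means the process is more likely to be killed.  A
minimal sketch that ranks processes by that score follows.  The
function name and the proc_root parameter are made up for illustration,
and I have not tried it on the Python shipped with CentOS 5.

```python
import os

def _read(path):
    # Plain open/close so the sketch does not depend on newer syntax.
    f = open(path)
    try:
        return f.read()
    finally:
        f.close()

def rank_by_oom_score(proc_root="/proc", top=10):
    """Return up to `top` (score, pid, cmdline) tuples, highest score first."""
    scores = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            score = int(_read(os.path.join(proc_root, entry, "oom_score")).strip())
            cmd = _read(os.path.join(proc_root, entry, "cmdline"))
        except (IOError, OSError, ValueError):
            continue  # process exited or is unreadable; ignore it
        # cmdline separates arguments with NUL bytes.
        scores.append((score, int(entry), cmd.replace("\0", " ").strip()))
    scores.sort(reverse=True)
    return scores[:top]

if __name__ == "__main__":
    for score, pid, cmd in rank_by_oom_score():
        print("%8d %6d %s" % (score, pid, cmd))
```

If a large but innocent daemon sits at the top of that list, that would
explain why the bystanders get killed first.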

Michael Eager	 eager at eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077
