m.roth@5-cent.us wrote:
Michael Eager wrote:
John Hodrien wrote:
On Wed, 9 Mar 2011, Michael Eager wrote:
The problem with randomly replacing various components, other than the downtime and nuisance, is that there's no way to know that the change actually fixed any problem. When the base rate is one unknown system hang every few weeks, how many weeks should I wait without a failure to conclude that the replaced component was the cause? A failure which happens infrequently isn't really amenable to a random diagnostic approach.
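To put a rough number on that question, here is a minimal sketch. It assumes the hangs behave like a Poisson process with a fixed average rate, which is a modelling assumption and not something established in this thread. The function name and the 95% threshold are illustrative only. Under that assumption, the chance of an unfixed system running t weeks without a hang is exp(-t / mean), so the quiet period needed for a given confidence follows directly.

```python
import math

def quiet_weeks_needed(mean_weeks_between_hangs, confidence):
    """Failure-free weeks needed before the quiet run is unlikely to be luck.

    Assumption: hangs arrive as a Poisson process, so an unfixed system
    survives t weeks without a hang with probability exp(-t / mean).
    """
    return -mean_weeks_between_hangs * math.log(1.0 - confidence)

# "One hang every few weeks" is vague, so try a few plausible means.
for mean in (2, 3, 4):
    weeks = quiet_weeks_needed(mean, 0.95)
    print(f"one hang per {mean} weeks: wait about {weeks:.1f} weeks")
```

With one hang every three weeks on average, that works out to roughly nine failure-free weeks per swapped component before the result means much, which is the point being made above.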
So you pitch the whole thing over to being a test rig, and buy all new hardware?
I'll repeat from my original post:
I don't see anything in /var/log/messages or elsewhere to indicate any problem or offer any clue why the system was hung. Any suggestions where I might look for a clue?
I'm looking for diagnostics to focus on the cause of the crash. My thanks for the several suggestions in this area.
I'm not particularly interested in a listing of the myriad hypothetical causes for which there is no observable evidence, some of which are contradicted by the evidence (such as overheating).
<snip> Here's one more, off-the-wall thought: do the setterm --powersave off, and find some way to make it work, so that you can see what's on the screen when it dies.
Yes, I did this. Switched to console screen. The correct command is "setterm -powersave off -blank off", otherwise the screen gets blanked. Turned the monitor off. I hope it shows something useful on the next fault.
What may be very important here is I recently had a problem with a honkin' big server crashing... and it turned out that a user was running a parallel processing job that kicked off three? four? dozen threads, and towards the end of the job, every single thread wanted 10G... on a system with 256G RAM (which size still boggles my mind). The OOM-Killer didn't even have a chance to do its thing.... Yes, he's limited what his job requests, and the system hasn't crashed since.
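For scale, a quick back-of-the-envelope check of those numbers. It reads "three? four? dozen" as 36 to 48 threads, treats "G" as GB, and ignores swap and everything else running on the box; all three are assumptions made for the sketch.

```python
RAM_GB = 256         # physical RAM, from the post
PER_THREAD_GB = 10   # what each thread asked for near the end of the job

def peak_demand_gb(threads, per_thread_gb=PER_THREAD_GB):
    """Total memory requested if every thread asks at the same time."""
    return threads * per_thread_gb

for threads in (36, 48):  # "three? four? dozen"
    demand = peak_demand_gb(threads)
    over = demand - RAM_GB
    print(f"{threads} threads: {demand} GB requested, {over} GB over physical RAM")
```

Either way the job asks for well over the installed RAM at about the same moment.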
Strange. OOM-Killer should get priority. That's what it's for. Although it usually seems to kill the innocent bystanders before it gets around to killing the offenders.