m.roth at 5-cent.us wrote:
> Michael Eager wrote:
>> John Hodrien wrote:
>>> On Wed, 9 Mar 2011, Michael Eager wrote:
>>>
>>>> The problem with randomly replacing various components, other than
>>>> the downtime and nuisance, is that there's no way to know that the
>>>> change actually fixed any problem. When the base rate is one
>>>> unknown system hang every few weeks, how many weeks should I wait
>>>> without a failure to conclude that the replaced component was the
>>>> cause? A failure which happens infrequently isn't really amenable
>>>> to a random diagnostic approach.
>>> So you pitch the whole thing over to being a test rig, and buy all new
>>> hardware?
>> I'll repeat from my original post:
>>
>>    I don't see anything in /var/log/messages or elsewhere
>>    to indicate any problem or offer any clue why the system
>>    was hung.
>>
>>    Any suggestions where I might look for a clue?
>>
>> I'm looking for diagnostics to focus on the cause of the crash.
>> My thanks for the several suggestions in this area.
>>
>> I'm not particularly interested in a listing of the myriad of
>> hypothetical causes absent observable evidence and some of
>> which are contradicted by evidence (such as overheating).
> <snip>
> Here's one more, off-the-wall thought: do the setterm --powersave off, and
> find some way to make it work, so that you can see what's on the screen
> when it dies.

Yes, I did this. Switched to console screen. The correct command
is "setterm -powersave off -blank off", otherwise the screen gets blanked.
Turned the monitor off. I hope it shows something useful on the next fault.

> What may be very important here is I recently had a problem
> with a honkin' big server crashing... and it turned out that a user was
> running a parallel processing job that kicked off three? four? dozen
> threads, and towards the end of the job, every single thread wanted 10G...
> on a system with 256G RAM (which size still boggles my mind). The
> OOM-Killer didn't even have a chance to do its thing.... Yes, he's limited
> what his job requests, and the system hasn't crashed since.

Strange. OOM-Killer should get priority. That's what it's for.
Although it usually seems to kill the innocent bystanders before
it gets around to killing the offenders.

-- 
Michael Eager    eager at eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077
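
On the "how many weeks should I wait" question quoted above, a rough sketch
may help put a number on it. It assumes, purely for illustration, that hangs
arrive independently at a steady average rate (a Poisson process) and that
"every few weeks" means about one hang per three weeks. Neither assumption
comes from the thread, and the function name is made up for this example.
Under that model, the chance of seeing no hang in t weeks when nothing was
actually fixed is exp(-rate * t).

```python
import math


def weeks_without_failure(hangs_per_week, confidence):
    """Failure-free weeks needed before a quiet stretch is unlikely to be luck.

    Assumes hangs follow a Poisson process with the given average rate
    (an illustrative assumption, not something measured on the system).
    Solves exp(-rate * t) = 1 - confidence for t.
    """
    if hangs_per_week <= 0:
        raise ValueError("hangs_per_week must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    return -math.log(1.0 - confidence) / hangs_per_week


if __name__ == "__main__":
    assumed_rate = 1.0 / 3.0  # assumption: about one hang every three weeks
    for confidence in (0.90, 0.95, 0.99):
        weeks = weeks_without_failure(assumed_rate, confidence)
        print(f"{confidence:.0%} confidence: {weeks:.1f} failure-free weeks")
```

With the assumed rate, 95% confidence takes roughly nine failure-free weeks
per replaced component, which is why swapping parts at random is such a slow
way to diagnose an infrequent hang.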