Michael Eager wrote:
> John Hodrien wrote:
>> On Wed, 9 Mar 2011, Michael Eager wrote:
>>
>>> The problem with randomly replacing various components, other than
>>> the downtime and nuisance, is that there's no way to know that the
>>> change actually fixed any problem. When the base rate is one
>>> unknown system hang every few weeks, how many weeks should I wait
>>> without a failure to conclude that the replaced component was the
>>> cause? A failure which happens infrequently isn't really amenable
>>> to a random diagnostic approach.
>>
>> So you pitch the whole thing over to being a test rig, and buy all new
>> hardware?
>
> I'll repeat from my original post:
>
>    I don't see anything in /var/log/messages or elsewhere
>    to indicate any problem or offer any clue why the system
>    was hung.
>
>    Any suggestions where I might look for a clue?
>
> I'm looking for diagnostics to focus on the cause of the crash.
> My thanks for the several suggestions in this area.
>
> I'm not particularly interested in a listing of the myriad of
> hypothetical causes absent observable evidence, some of which
> are contradicted by evidence (such as overheating).

<snip>

Here's one more, off-the-wall thought: do the setterm --powersave off, and
find some way to make it work, so that you can see what's on the screen
when it dies.

What may be very important here is that I recently had a problem with a
honkin' big server crashing... and it turned out that a user was running a
parallel processing job that kicked off three? four? dozen threads, and
towards the end of the job, every single thread wanted 10G... on a system
with 256G RAM (which size still boggles my mind). The OOM-Killer didn't
even have a chance to do its thing....

Yes, he's limited what his job requests, and the system hasn't crashed since.

     mark
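
P.S. A minimal sketch of the setterm idea, assuming a util-linux setterm
that takes the long option spellings (older ones use a single dash) and a
text console on tty1 -- the device is just an example, adjust to taste:

    # Keep the first virtual console from blanking or powering down, so
    # whatever the kernel last printed is still on screen after a hang.
    # The redirects aim setterm at tty1 instead of the terminal you type in.
    TERM=linux setterm --blank 0 --powersave off --powerdown 0 \
        > /dev/tty1 < /dev/tty1

    # Same effect from boot onward, if you'd rather not rerun it:
    # add  consoleblank=0  to the kernel command line.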
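
P.P.S. On the memory-hog side, a rough bash sketch of the sort of cap that
user now puts on his runs; run_job.sh and the 10 GiB figure are placeholders
for whatever the job actually is, not a recommendation:

    # Cap each process's virtual address space at ~10 GiB (ulimit -v takes
    # KiB). Anything the job forks inherits the limit, so a runaway thread
    # pool gets allocation failures instead of taking the whole box down.
    ulimit -v $((10 * 1024 * 1024))
    ./run_job.sh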