John Hodrien wrote:
> On Wed, 9 Mar 2011, Michael Eager wrote:
>
>> The problem with randomly replacing various components, other than
>> the downtime and nuisance, is that there's no way to know that the
>> change actually fixed any problem. When the base rate is one
>> unknown system hang every few weeks, how many weeks should I wait
>> without a failure to conclude that the replaced component was the
>> cause? A failure which happens infrequently isn't really amenable
>> to a random diagnostic approach.
>
> So you pitch the whole thing over to being a test rig, and buy all new
> hardware?

I'll repeat from my original post:

    I don't see anything in /var/log/messages or elsewhere to indicate
    any problem or offer any clue why the system was hung. Any
    suggestions where I might look for a clue?

I'm looking for diagnostics to focus on the cause of the crash. My thanks for the several suggestions in this area.

I'm not particularly interested in a listing of the myriad hypothetical causes offered absent observable evidence, some of which are contradicted by the evidence (such as overheating).

I've encountered my share of bad power supplies, bad RAM, poorly seated cards, etc. I've replaced failing capacitors in monitors (never on a motherboard). I've replaced video cards, hard drives, bad cables. And so forth. Each of these had characteristics which pointed to the problem: kernel oops, POST failures, flickering screens, etc. The problem I have is that there is a lack of diagnostic information to focus on the cause of the server failure.

I don't mean to appear unappreciative, but suggestions which amount to spending many hours making a series of unfocused modifications to the server, hoping that one of these random alterations fixes an infrequent problem, don't strike me as useful.
At the other extreme, the suggestions that I not look for the cause of the system failure and instead replace the server with one or three servers don't seem to be a useful diagnostic approach either.

During the next server downtime, I'll re-seat RAM and cables, check for excess dust, and do normal maintenance as folks have suggested. I might also run a memory diag.

I'll also look at the several excellent and appreciated suggestions (some of which I've already installed) on how to get a better picture of the state of the server when/if there is a future failure.

Thanks all!

--
Michael Eager    eager at eagercon.com
1960 Park Blvd., Palo Alto, CA 94306    650-325-8077
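
A rough way to put numbers on the "how many weeks should I wait" question raised above. This is only a sketch under an assumption that is not measured on this server: hangs arrive independently at a constant average rate (a Poisson model). The function name `weeks_to_wait` and the three-week mean are hypothetical, chosen to stand in for "every few weeks".

```python
import math


def weeks_to_wait(mean_weeks_between_hangs, confidence=0.95):
    """Hang-free weeks needed before a clean run is unlikely to be luck.

    Assumes a Poisson model: if nothing was fixed, the chance of seeing
    no hang in t weeks is exp(-t / mean). Solving
    exp(-t / mean) = 1 - confidence for t gives the wait time.
    """
    if mean_weeks_between_hangs <= 0:
        raise ValueError("mean_weeks_between_hangs must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    return -mean_weeks_between_hangs * math.log(1.0 - confidence)


if __name__ == "__main__":
    # Assumed mean intervals between hangs, in weeks (illustrative only).
    for mean in (2, 3, 4):
        print(f"mean {mean} weeks between hangs: "
              f"wait about {weeks_to_wait(mean):.1f} hang-free weeks")
```

With an assumed mean of three weeks between hangs, this comes to roughly nine hang-free weeks per replaced component before a quiet period means much at the 95% level, which is why swapping parts one at a time is so slow to confirm anything.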