On 3/9/2011 11:32 AM, Michael Eager wrote:
>
> I'm not particularly interested in a listing of the myriad of
> hypothetical causes absent observable evidence and some of
> which are contradicted by evidence (such as overheating).

Note that overheating can be localized, for example from a badly
mounted heat sink or a failing fan on a CPU.

> I've encountered my share of bad power supplies, bad RAM,
> poorly seated cards, etc. I've replaced failing capacitors
> in monitors (never on a motherboard). I've replaced video
> cards, hard drives, bad cables. And so forth. Each of these
> had characteristics which pointed to the problem: kernel oops,
> POST failures, flickering screens, etc. The problem I have is
> that there is a lack of diagnostic information to focus on the
> cause of the server failure.

Anything that happens quickly isn't going to show up in a log.

> I don't mean to appear unappreciative, but suggestions which
> amount to spending many hours making a series of unfocused
> modifications to the server, hoping that one of these random
> alterations fixes an infrequent problem, doesn't strike me as
> useful. At the other extreme, the suggestions that I not look
> for the cause of the system failure and instead replace the
> server with one or three servers also doesn't seem to be a
> useful diagnostic approach either.

There's not really a good way to approach intermittent failures. It may
only break when you aren't looking. Swapping major components, or taking
the machine offline for extended diagnostics in the hope of catching a
glimpse of the cause when it fails, is about all you can do.

> During the next server downtime, I'll re-seat RAM and
> cables, check for excess dust, and do normal maintenance
> as folks have suggested. I might also run a memory diag.
> I'll also look at the several excellent and appreciated
> suggestions (some of which I've already installed) on how
> to get a better picture on the state of the server when/if
> there is a future failure.
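
On getting a better picture of the server state: since anything that
happens quickly won't reach a log, one option is a small heartbeat that
writes a timestamp and the load averages every few seconds and forces
each line to disk. The last line then roughly brackets when the box
died. This is only a rough sketch, not something proposed earlier in the
thread: the function names, the log path and the interval are made up
for illustration, and os.getloadavg() is Unix-only.

```python
import os
import time


def write_heartbeat(path):
    """Append one timestamped line with the load averages and force it to disk."""
    load1, load5, load15 = os.getloadavg()  # Unix only
    line = "%s load=%.2f,%.2f,%.2f\n" % (
        time.strftime("%Y-%m-%d %H:%M:%S"), load1, load5, load15)
    with open(path, "a") as f:
        f.write(line)
        f.flush()
        # Ask the kernel to write the data out now instead of leaving it
        # in the page cache, where a sudden hang would lose it.
        os.fsync(f.fileno())
    return line


def run(path, interval_seconds, count=None):
    """Write a heartbeat every interval_seconds; count=None runs until killed."""
    written = 0
    while count is None or written < count:
        write_heartbeat(path)
        written += 1
        if count is None or written < count:
            time.sleep(interval_seconds)
    return written
```

Something like run("/var/log/heartbeat.log", 10) left running in the
background would do; the path and the 10-second interval are arbitrary.
The fsync on every write costs a little I/O, but it gives the last line
a good chance of surviving a hard lockup.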

Memory diagnostics may take days to catch a problem. Did you check for
a newer BIOS for your motherboard? I mentioned before that it seemed
strange, but I've seen that fix mysterious problems even after the
machines had previously been reliable for a long time (and even more
oddly, not all the machines in the lot were affected).

-- 
   Les Mikesell
    lesmikesell at gmail.com