On 3/9/2011 11:32 AM, Michael Eager wrote:
I'm not particularly interested in a listing of the myriad hypothetical causes for which there is no observable evidence, some of which are contradicted by the evidence (such as overheating).
Note that overheating can be localized, for example from a badly mounted heat sink or a failing fan on a CPU.
I've encountered my share of bad power supplies, bad RAM, poorly seated cards, etc. I've replaced failing capacitors in monitors (never on a motherboard). I've replaced video cards, hard drives, bad cables. And so forth. Each of these had characteristics which pointed to the problem: kernel oops, POST failures, flickering screens, etc. The problem I have is that there is a lack of diagnostic information to focus on the cause of the server failure.
Anything that happens quickly isn't going to show up in a log.
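One way to narrow that window, sketched here under assumptions rather than offered as a fix: a small heartbeat script that appends a timestamped line at a short interval and forces each line to disk with fsync, so the last entry written before a hang or reset gives a rough time of death and the load at that moment. The function names, the log path and the choice of load average as the sampled value are all hypothetical, and a failure that occurs between two samples still goes unrecorded.

```python
import os
import time


def append_heartbeat(path, fields):
    """Append one timestamped line to path and force it to disk."""
    body = " ".join("%s=%s" % (k, fields[k]) for k in sorted(fields))
    line = "%.3f %s\n" % (time.time(), body)
    # O_APPEND keeps every write at the end of the file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, line.encode("utf-8"))
        # Ask the kernel to flush the data to the device before returning,
        # so the entry is not left sitting in the page cache at crash time.
        os.fsync(fd)
    finally:
        os.close(fd)
    return line


def read_loadavg():
    """Return the 1/5/15 minute load averages, or an empty dict if unavailable."""
    try:
        one, five, fifteen = os.getloadavg()
    except (AttributeError, OSError):
        return {}
    return {"load1": "%.2f" % one, "load5": "%.2f" % five, "load15": "%.2f" % fifteen}


def run(path, interval_s=1.0, count=None):
    """Write a heartbeat every interval_s seconds; count=None means run forever."""
    written = 0
    while count is None or written < count:
        append_heartbeat(path, read_loadavg())
        written += 1
        time.sleep(interval_s)
```

Started with something like run("/var/tmp/heartbeat.log", interval_s=1.0) (the path is only an example), the last line in the file after a failure shows when the machine was last alive. The fsync on every line costs some disk I/O, and a drive with a volatile write cache may still lose the final entry.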
I don't mean to appear unappreciative, but suggestions which amount to spending many hours making a series of unfocused modifications to the server, hoping that one of these random alterations fixes an infrequent problem, don't strike me as useful. At the other extreme, the suggestions that I not look for the cause of the system failure and instead replace the server with one or three servers don't seem to be a useful diagnostic approach either.
There's not really a good way to approach intermittent failures. It may only break when you aren't looking. Swapping major components, or taking it offline for extended diagnostics in the hope of catching a glimpse of the cause when it fails, is about all you can do.
During the next server downtime, I'll re-seat RAM and cables, check for excess dust, and do normal maintenance as folks have suggested. I might also run a memory diagnostic. I'll also look at the several excellent and appreciated suggestions (some of which I've already installed) on how to get a better picture of the state of the server when/if there is a future failure.
Memory diagnostics may take days to catch a problem. Did you check for a newer BIOS for your motherboard? I mentioned before that it seemed strange, but I've seen that fix mysterious problems even after the machines had previously been reliable for a long time (and even more oddly, not all the machines in the lot were affected).