Les Mikesell wrote:
> Note that overheating can be localized or a bad heat sink mounting or
> fan on a CPU.

I'll re-seat the CPU, heatsink, and fan on the next downtime.

Heat-related problems usually present as a system that fails and
will not reboot immediately, but will after it sits for a while
to cool down. This system doesn't do that. I'll install sensord
to log CPU temps in case this is a problem.

> There's not really a good way to approach intermittent failures. It may
> only break when you aren't looking. Major component swaps or taking it
> offline for extended diagnostics hoping to catch a glimpse of the cause
> when it fails is about all you can do.
>
>> During the next server downtime, I'll re-seat RAM and
>> cables, check for excess dust, and do normal maintenance
>> as folks have suggested. I might also run a memory diag.
>> I'll also look at the several excellent and appreciated
>> suggestions (some of which I've already installed) on how
>> to get a better picture on the state of the server when/if
>> there is a future failure.
>
> Memory diagnostics may take days to catch a problem. Did you check for
> a newer bios for your MB? I mentioned before that it seemed strange,
> but I've seen that fix mysterious problems even after the machines had
> previously been reliable for a long time (and even more oddly, all the
> machines in the lot weren't affected).

Yes, most memory diagnostics are not very effective.

I'll have to stop the server to find out what the installed BIOS
version is and see whether there is an update. Most BIOS updates
appear to only change supported CPUs. Something else for the next
downtime.

--
Michael Eager    eager at eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077