on 10:29 Wed 09 Mar, Michael Eager (eager at eagerm.com) wrote:
> Les Mikesell wrote:
> > Note that overheating can be localized or a bad heat sink mounting or 
> > fan on a CPU.
> I'll re-seat the CPU, heatsink, and fan on the next downtime.

Very strongly advised.  It's a simple and very cheap approach.  I'd
check /all/ cables (power, disk) as well.

Look over the board for bad capacitors (bulging tops, leaked
electrolyte) while you're in there.  The capacitor plague of the
mid-2000s seems to have abated, but a bad cap can still ruin your
whole day.

> Heat related problems usually present as a system which fails
> and will not reboot immediately, but will after they sit for a
> while to cool down.  This system doesn't do that.

Maybe, maybe not.

> I'll install sensord to log CPU temps in case this is a problem.

Good call.
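
If you want something to cross-check sensord against, here is a minimal
sketch that reads the kernel's hwmon files directly.  It assumes a Linux
box with the sensor drivers loaded, so that temp*_input files appear
under /sys/class/hwmon.  The function names (read_temps, format_line)
are mine, not part of lm_sensors.

```python
import glob
import os
import time


def read_temps(hwmon_root="/sys/class/hwmon"):
    """Return {sensor_path: degrees_C} for every readable temp*_input file."""
    temps = {}
    pattern = os.path.join(hwmon_root, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                # The kernel reports millidegrees Celsius as an integer.
                temps[path] = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            # Some sensors fail to read or return junk; skip those.
            continue
    return temps


def format_line(temps, now=None):
    """Render one timestamped log line from a read_temps() result."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    parts = []
    for path, celsius in temps.items():
        chip = os.path.basename(os.path.dirname(path))
        parts.append(f"{chip}/{os.path.basename(path)}={celsius:.1f}C")
    return stamp + " " + " ".join(parts)
```

Calling format_line(read_temps()) from cron once a minute and appending
the result to a file gives you a temperature history to line up against
the time of the next failure.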

> > There's not really a good way to approach intermittent failures.  It
> > may only break when you aren't looking.  Major component swaps or
> > taking it offline for extended diagnostics hoping to catch a glimpse
> > of the cause when it fails is about all you can do.

I disagree.  You start with the bleeding obvious and easy (the cheap
diagnostics), same as any garage mechanic or doctor.  You add
instrumentation and watch the logs more closely.  And you make damned
sure you're logging to a remote host, because one of the first things a
hosed system does is stop writing to disk.
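
For the remote logging part, a rough sketch of the sending side using
Python's standard SysLogHandler over UDP.  The logger name "hwwatch"
and the helper make_remote_logger are made up for illustration, and it
assumes the log host runs a syslog daemon that is configured to accept
network input.

```python
import logging
import logging.handlers


def make_remote_logger(host, port=514, name="hwwatch"):
    """Return a logger that sends each record to a remote syslog daemon (UDP)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # SysLogHandler defaults to UDP when given a (host, port) address.
    handler = logging.handlers.SysLogHandler(address=(host, port))
    handler.setFormatter(
        logging.Formatter("%(name)s: %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

Each record goes out on the wire as it is logged, so the last few
messages before a hang end up on the log host even if the local disk
has already stopped accepting writes.  UDP is fire-and-forget, so a
lost packet goes unnoticed.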

> Yes, most memory diagnostics are not very effective.
> I'll have to stop the server to find out what the installed bios version
> is and see whether there is an update.  Most bios updates appear to only
> change supported CPUs.  Something else for the next downtime.

You haven't said who built this system, but many LOM / OMC systems
will report basic information such as this without taking the server
down.  dmidecode and lshw are also very helpful here, and both run on
the live system.
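
As a sketch of how little is needed: on Linux the kernel exposes the
same DMI strings under /sys/class/dmi/id, so something like the
following should give the BIOS version with the server running.  The
helper name read_dmi and the choice of fields are mine.  "dmidecode -s
bios-version" (as root) is the command-line route to the same value.

```python
import os

# DMI identification files the kernel publishes under /sys/class/dmi/id.
DMI_FIELDS = (
    "bios_vendor",
    "bios_version",
    "bios_date",
    "board_vendor",
    "board_name",
)


def read_dmi(dmi_root="/sys/class/dmi/id", fields=DMI_FIELDS):
    """Return {field: value or None} for the requested DMI fields."""
    info = {}
    for field in fields:
        try:
            with open(os.path.join(dmi_root, field)) as f:
                info[field] = f.read().strip()
        except OSError:
            # Missing or unreadable field: record it as unknown.
            info[field] = None
    return info
```

Comparing bios_version and bios_date from read_dmi() against the board
vendor's download page tells you whether an update exists before you
schedule any downtime.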

