[CentOS] how to debug hardware lockups?

nate centos at linuxpowered.net
Sat Nov 15 18:17:39 UTC 2008

Rudi Ahlers wrote:

> Unfortunately, I can't leave a monitor attached to the server all the
> time. The server is in a shared cabinet @ a 3rd party ISP, and they
> lock the cabinets once we're done working with it. The last lockup was
> about 6 days ago, and previous one about 8 days ago. There's no
> consistency.
> How can I redirect all console output to a file instead?

Configure a serial console, connect it to another system, and
use something like minicom to log the console output to a file.
You can't really log to the local system in this situation, as
you likely won't capture the event (if you did, you would have
seen the error in the system logs).
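
If minicom isn't handy on the logging box, the same idea can be
sketched in a few lines of Python. This is only a rough sketch, not
a replacement for a proper terminal program: the function names, the
/dev/ttyS0 device path and the console.log file name are my own
placeholders, and it assumes the port speed has already been set to
match the server's console (for instance with stty). It timestamps
each line, which helps pin down when the lockup happened.

```python
import time


def log_console(stream, log, clock=time.time):
    """Copy lines from a console byte stream to a log, one timestamp per line.

    `stream` is any binary file-like object (e.g. an opened serial device),
    `log` is a text file-like object. Returns the number of lines written.
    """
    count = 0
    # readline() returns b"" at end of stream, which ends the loop.
    for raw in iter(stream.readline, b""):
        text = raw.decode("ascii", errors="replace").rstrip("\r\n")
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(clock()))
        log.write(f"{stamp} {text}\n")
        # Flush per line so the last messages before a hang reach the file.
        log.flush()
        count += 1
    return count


def main(device="/dev/ttyS0", log_path="console.log"):
    # Hypothetical defaults; adjust both for the machine doing the logging.
    with open(device, "rb", buffering=0) as port, open(log_path, "a") as log:
        log_console(port, log)
```

Calling main() on the second system would append timestamped
console lines (UTC) to console.log until interrupted.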

In my experience, most of these kinds of problems are related
to bad RAM.

If you're running CentOS 4.x, configure netdump to send the kernel
dumps to another server. If you're using CentOS 5.x, configure
kdump (which, as far as I know, replaced diskdump there) to store
the dump to local disk.

Run memtest86 on the system for a few days. Replace the system
with a known working one so you can take the broken system off
site from the ISP for diagnostics.

I like running Cerberus http://sourceforge.net/projects/va-ctcs/
as a burn-in tool. If the system can survive that running for
a couple of days, it should be good. Running it against a hundred
or so systems, I don't recall it taking longer than a few hours
to crash the system if there was a problem.


