[CentOS] how to debug hardware lockups?

Rudi Ahlers rudiahlers at gmail.com
Sat Nov 15 16:11:04 UTC 2008

On Sat, Nov 15, 2008 at 4:47 PM, Richard Karhuse <rkarhuse at gmail.com> wrote:
> On Sat, Nov 15, 2008 at 3:16 AM, Rudi Ahlers <rudiahlers at gmail.com> wrote:
>> Hi,
>> We have a server which locks up about once a week (for the past 3
>> weeks now), without any warning, and the only way to recover it, is to
>> reset the server. This causes unwanted downtime, and often software
>> loss as well.
>> How do I debug the server, which runs CentOS 5.2 to see why it locks
>> up? The CPU is an Intel Q9300 Core 2 Quad, with 8 GB RAM, on an Intel
>> Motherboard
> Attach a local console to the video port and let us know what it says -->
> that will (probably) be very insightful.  E.G., Kernel panic, MCE, ....
> Next, run memtest86+ -- at least overnight.  [Note: I've had less than
> stellar results with memtest86 recently, but if it shows errors, you've got
> a problem big time; if it doesn't show errors, you're still not 100% sure that
> memory is good:-):-).]  Is it ECC memory??  If not, why not -- particularly
> given it is a critical server ....
> Are all the fans spinning -- particularly the CPU fan??  Do you have
> lm-sensors enabled??  Either create a script or use something like munin to
> track things and see if fans, temperature, voltages are all stable & within
> range up to death.
> Can you easily swap power supplies??  (Is the unit dual powered or just
> one unit?)
> Clearly, just a start, but you get the idea of elementary, 101 problem
> solving ....
Unfortunately, I can't leave a monitor attached to the server all the
time. The server is in a shared cabinet @ a 3rd party ISP, and they
lock the cabinets once we're done working with it. The last lockup was
about 6 days ago, and previous one about 8 days ago. There's no

How can I redirect all console output to a file instead?

I have lm-sensors installed, but it doesn't pick up the
motherboard's sensors. All fans were working when I last checked,
but it's a 1U chassis, so it's got limited airflow. I don't know
whether it gets too hot. When I rebooted it, the temperature was about
45 degrees Celsius, but the lockup only happened about 6 days later.
So, I can't even sit there 24/7 to see what happens.
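
For the "create a script" suggestion quoted above, something along these lines might do as a starting point. It is only a rough sketch: it assumes a Python 3 interpreter is available on the box and that some command (here `sensors` from lm-sensors) actually prints readings for this board, which is not yet the case here. The log path /var/log/hwmon.log and the function names are made up for illustration. Each record is timestamped and forced to disk with fsync, so the last readings before a freeze should still be in the file after the reset.

```python
import os
import subprocess
import time


def read_command(cmd):
    """Run cmd (a list of arguments) and return its output, or the error text."""
    try:
        result = subprocess.run(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT,
                                universal_newlines=True, timeout=30)
        return result.stdout
    except (OSError, subprocess.SubprocessError) as exc:
        return "error running %r: %s\n" % (cmd, exc)


def format_record(timestamp, text):
    """Prefix every line of text with the timestamp so records can be grepped."""
    lines = text.splitlines() or ["(no output)"]
    return "".join("%s %s\n" % (timestamp, line) for line in lines)


def append_record(path, record):
    """Append record and force it to disk so it survives a hard lockup."""
    with open(path, "a") as log:
        log.write(record)
        log.flush()
        os.fsync(log.fileno())


def monitor(path, cmd, interval=60, iterations=None):
    """Log the output of cmd every interval seconds; iterations=None loops forever."""
    count = 0
    while iterations is None or count < iterations:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        append_record(path, format_record(stamp, read_command(cmd)))
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval)


# Example (hypothetical path and command; runs until interrupted):
# monitor("/var/log/hwmon.log", ["sensors"], interval=60)
```

After a lockup and reset, the tail of the log file would show the last temperatures, fan speeds and voltages recorded before the machine froze, which is the "up to death" trend the reply asks about.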


