On Sat, Nov 15, 2008 at 4:47 PM, Richard Karhuse rkarhuse@gmail.com wrote:
On Sat, Nov 15, 2008 at 3:16 AM, Rudi Ahlers rudiahlers@gmail.com wrote:
Hi,
We have a server which locks up about once a week (for the past 3 weeks now), without any warning, and the only way to recover it is to reset the server. This causes unwanted downtime, and often software loss as well.
How do I debug the server, which runs CentOS 5.2, to see why it locks up? The CPU is an Intel Q9300 Core 2 Quad, with 8 GB RAM, on an Intel motherboard.
Attach a local console to the video port and let us know what it says --> that will (probably) be very insightful, e.g., kernel panic, MCE, ....
Next, run memtest86+ -- at least overnight. [Note: I've had less than stellar results with memtest86 recently, but if it shows errors, you've got a problem big time; if it doesn't show errors, you're still not 100% sure that memory is good :-).] Is it ECC memory?? If not, why not -- particularly given it is a critical server ....
Are all the fans spinning -- particularly the CPU fan?? Do you have lm-sensors enabled?? Either create a script or use something like munin to track things and see if fans, temperatures, and voltages are all stable & within range right up to death.
Can you easily swap power supplies?? (Is the unit dual powered or just one unit?)
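A tracking script along those lines could look roughly like the sketch below. It is only an illustration: it assumes the `sensors` command from lm-sensors prints lines such as "Core 0: +45.0°C" or "fan1: 3000 RPM" (the exact format varies between versions and chips), and the log file name and 60-second interval are made up. Each record is flushed and fsync'ed so that the last reading before a hard lockup has a better chance of reaching the disk.

```python
import os
import re
import subprocess
import time
from datetime import datetime

# Assumed line format: "<label>: [+]<number> <unit>", unit one of °C, C, RPM, V.
READING = re.compile(r"^([^:]+):\s*\+?(-?\d+(?:\.\d+)?)\s*(°C|C|RPM|V)\b")


def parse_sensors(text):
    """Return {label: (value, unit)} for every line that looks like a reading."""
    readings = {}
    for line in text.splitlines():
        match = READING.match(line.strip())
        if match:
            label, value, unit = match.groups()
            readings[label.strip()] = (float(value), unit)
    return readings


def append_record(path, readings):
    """Append one timestamped line and force it out to disk."""
    fields = " ".join(f"{k}={v}{u}" for k, (v, u) in sorted(readings.items()))
    with open(path, "a", encoding="utf-8") as log:
        log.write(f"{datetime.now().isoformat()} {fields}\n")
        log.flush()
        os.fsync(log.fileno())


if __name__ == "__main__":
    # Hypothetical log path and interval; adjust to taste.
    while True:
        result = subprocess.run(["sensors"], capture_output=True, text=True)
        append_record("sensor-history.log", parse_sensors(result.stdout))
        time.sleep(60)
```

After the next lockup, the tail of the log shows whether temperature, fan speed, or a voltage was drifting beforehand. If `sensors` reports nothing for the board, the log will simply contain timestamps, which at least narrows down when the machine died.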
Clearly, just a start, but you get the idea of elementary, 101 problem solving ....
-rak-
Unfortunately, I can't leave a monitor attached to the server all the time. The server is in a shared cabinet @ a 3rd party ISP, and they lock the cabinets once we're done working with it. The last lockup was about 6 days ago, and the previous one about 8 days ago. There's no consistency.
How can I redirect all console output to a file instead?
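One possible approach (not something confirmed in this thread) is the kernel's netconsole module, which sends kernel messages over UDP to another machine, so a panic message can be captured even when the local disk is no longer being written. The sketch below covers only the receiving side, to be run on a second host. It assumes the locked-up server has netconsole configured to send to that host on UDP port 6666 (the port netconsole commonly targets); the log file name is made up.

```python
import socket
from datetime import datetime


def open_listener(bind_addr="0.0.0.0", port=6666):
    """Create a UDP socket bound to the address netconsole sends to."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    return sock


def receive(sock, log_path, max_packets=None):
    """Append each received datagram to log_path with a timestamp and sender.

    Runs forever unless max_packets is given. Returns the packet count.
    """
    count = 0
    with open(log_path, "a", encoding="utf-8") as log:
        while max_packets is None or count < max_packets:
            data, (host, _port) = sock.recvfrom(65535)
            text = data.decode("utf-8", errors="replace").rstrip("\n")
            log.write(f"{datetime.now().isoformat()} {host} {text}\n")
            log.flush()  # keep the file current in case this host is interrupted
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical log file name.
    receive(open_listener(), "netconsole.log")
```

A syslog daemon that accepts remote messages could do the same job; the script just makes the idea concrete. A serial console cabled to a neighbouring machine is another option if the ISP allows it.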
I have got lm-sensors installed, but it doesn't pick up the motherboard's sensors. All fans were working when I last checked, but it's a 1U chassis, so it's got limited airflow. I don't know if it gets too hot or not. When I rebooted it, the temperature was about 45 degrees Celsius, but the lockup only happened about 6 days later. So, I can't even sit there 24/7 to see what happens.