On Sat, Nov 15, 2008 at 3:16 AM, Rudi Ahlers <rudiahlers at gmail.com> wrote:

> Hi,
>
> We have a server which locks up about once a week (for the past 3
> weeks now), without any warning, and the only way to recover it, is to
> reset the server. This causes unwanted downtime, and often software
> loss as well.
>
> How do I debug the server, which runs CentOS 5.2 to see why it locks
> up? The CPU is an Intel Q9300 Core 2 Quad, with 8 GB RAM, on an Intel
> Motherboard
>

Attach a local console to the video port and let us know what it shows;
that will (probably) be very insightful, e.g., a kernel panic, an MCE
(machine check exception), ...

Next, run memtest86+, at least overnight. [Note: I've had less than
stellar results with memtest86 recently. If it shows errors, you've got a
problem big time; if it doesn't show errors, you still can't be 100% sure
that the memory is good :-).]

Is it ECC memory? If not, why not, particularly given that it is a
critical server?

Are all the fans spinning, particularly the CPU fan? Do you have
lm-sensors enabled? Either create a script or use something like munin to
track things and see whether fans, temperatures, and voltages are all
stable and within range right up to the moment of death.

Can you easily swap power supplies? (Does the unit have dual power
supplies or just one?)

Clearly, this is just a start, but you get the idea: elementary,
101-level problem solving.

-rak-
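
For the "create a script" option mentioned above, here is a minimal sketch of what such a logger might look like. It assumes lm-sensors is installed and configured so that the `sensors` command prints readings, and that Python 3.5 or newer is available (newer than what CentOS 5.2 ships by default). The script name, log path, and interval are made up for illustration. Each sample is flushed and fsync'ed so that the last readings before a hard lockup have a better chance of reaching the disk.

```python
import os
import subprocess
import sys
import time
from datetime import datetime


def read_sensors(command=("sensors",)):
    """Return the text printed by the lm-sensors `sensors` command.

    If the command is missing or hangs, return a short error note instead,
    so the log still records that a sample was attempted.
    """
    try:
        result = subprocess.run(
            list(command),
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=30,
        )
        return result.stdout
    except (OSError, subprocess.TimeoutExpired) as exc:
        return "sensors failed: %s\n" % exc


def format_record(timestamp, text):
    """Prefix one sample with a timestamp header line."""
    header = "=== %s ===" % timestamp.strftime("%Y-%m-%d %H:%M:%S")
    return header + "\n" + text.rstrip("\n") + "\n"


def append_record(path, record):
    """Append a record and push it to disk before returning."""
    with open(path, "a") as log:
        log.write(record)
        log.flush()
        os.fsync(log.fileno())


def watch(path, interval_seconds=60, samples=None, command=("sensors",)):
    """Log a sample every interval_seconds; loop forever if samples is None."""
    taken = 0
    while samples is None or taken < samples:
        append_record(path, format_record(datetime.now(), read_sensors(command)))
        taken += 1
        if samples is None or taken < samples:
            time.sleep(interval_seconds)


if __name__ == "__main__":
    if len(sys.argv) > 1:
        # Hypothetical usage: python3 sensors_watch.py /var/log/sensors-watch.log
        watch(sys.argv[1])
    else:
        print("usage: sensors_watch.py LOGFILE")
```

After the next lockup and reset, the tail of the log file shows the last fan, temperature, and voltage readings taken before the box died. Something like munin gives you graphs of the same data with less effort, if you would rather not maintain a script.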