[CentOS] how to debug hardware lockups?

Rudi Ahlers rudiahlers at gmail.com
Thu Nov 20 08:27:35 UTC 2008


On Thu, Nov 20, 2008 at 10:09 AM, Nifty Cluster Mitch
<niftycluster at niftyegg.com> wrote:
> On Sat, Nov 15, 2008 at 08:13:24PM +0200, Rudi Ahlers wrote:
>> On Sat, Nov 15, 2008 at 7:26 PM, Vandaman <vandaman2002-sk at yahoo.co.uk> wrote:
>> > Rudi Ahlers  wrote:
>> >
>> >> We have a server which locks up about once a week (for the
>> >> past 3
> ......
>> >> How do I debug the server, which runs CentOS 5.2 to see why
>> >> it locks
>> >> up?
>
> Jumping in the middle of a long list of good ideas.
> Other things to try --
>   change the run level
>        if 5 switch to 3
>        if 3 switch to 5
>
> Reinstall the processor--
>   remove the processor
>   clean the heat sink and processor of thermal compound
>   correctly apply the best thermal grease you can get (I like Arctic Silver)
>   reinstall the heat sink
>   consider upgrading the processor heat sink if the chassis permits (more Cu is good).
>
> Add thermal spreaders to your RAM.  You want all the chips on a RAM stick at the same temp.
>
> Run "chkconfig cpuspeed off" if it is on (powersaved on some distros); if it is off, toggle it on.
>
> Turn off any special system monitoring software tools.  Things like I2C serial buses
> do not isolate simple read-only activity from things that might modify (shut
> down) the system. I have seen sites install bluesmoke tools when the kernel already had
> EDAC installed.   The two tools had overlapping, uncoordinated interactions with
> the hardware and would randomly shut down the system.  Very new boards are almost
> never supported well, so consider going blind (no monitoring).  Read the EDAC info on
> the CentOS and RH sites.
>
> Inspect, then tidy, all cables; they can mess up air flow and cause thermal issues.
>
> Reset the BIOS and check all the BIOS options.  Check for a BIOS update from the vendor.
> When updating the BIOS, do an NVRAM reset.  The data structures of the old BIOS and the
> new one may differ.  The keyboard sequence to reset a BIOS to all defaults may require
> a call to tech support.   Call the vendor; you have a warranty on a new board.
>
> Since a hardware tty is not possible, log in (ssh) and run a "while /bin/true" script
> that lets you see memory, processes and the exact time things fail, or just run "top".
> It is possible to have syslog also log to the pty of an ssh session.
> When you return to the cage, plug in a terminal.  If there is no screen saver or
> screen blanking, the GFX card may still display the last key bits of info,
> so long as X is not running.
>
>
> --
>        T o m  M i t c h e l l
>        Found me a new hat, now what?


Thanks Tom,

You gave some good ideas, and I've been through all of them. As a
general rule of thumb, I only purchase RAM with factory-fitted
heatsinks. The chassis is 1U, so space is limited, and only the
necessary cables are installed, and they are already tidied up.

After spending another 2 days in the datacentre trying to figure this
one out, I thought I'd take the machine to the office instead. It's
just so much nicer working in the office :)

Top didn't help much, since I couldn't see what was wrong. But while
sitting at my desk and running some tests, I noticed that the fan was
running so loud at times that I couldn't even talk to someone on the
phone. This is when I realized that the Q9300 CPU could be too big a
processor for the fan that I have installed.

The fan that I have is:
http://www.dynatron-corp.com/products/cpucooler/cpucooler_model.asp?id=165

So, it looks like it's not really made for a Q9300 CPU, although their
specs say it is.
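
A minimal sketch of the kind of logging loop suggested earlier in the
thread, for catching the moment a thermal problem turns into a lockup.
It is an illustration only, not something from this thread: the function
names and the log file name are made up, it assumes a Python 3
interpreter, and it assumes the kernel exposes temperatures under
/sys/class/thermal, which the stock CentOS 5.2 kernel may not do (on
such a system the temperature list simply stays empty).

```python
import glob
import os
import time


def read_loadavg(path="/proc/loadavg"):
    """Return the 1-minute load average as a float."""
    with open(path) as f:
        return float(f.read().split()[0])


def read_temps(pattern="/sys/class/thermal/thermal_zone*/temp"):
    """Return temperatures in degrees C; empty list if no sensors are exposed.

    Assumption: each matching file holds an integer in millidegrees C.
    """
    temps = []
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                temps.append(int(f.read().strip()) / 1000.0)
        except (OSError, ValueError):
            continue  # sensor unreadable; skip it
    return temps


def format_sample(now, load, temps):
    """Build one log line: local timestamp, load, then each temperature."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    temp_text = " ".join("%.1fC" % t for t in temps) or "no-sensors"
    return "%s load=%.2f temps=%s" % (stamp, load, temp_text)


def log_forever(logfile="lockup-watch.log", interval=5):
    """Append a sample every `interval` seconds, forcing each line to disk."""
    with open(logfile, "a") as out:
        while True:
            line = format_sample(time.time(), read_loadavg(), read_temps())
            out.write(line + "\n")
            out.flush()
            os.fsync(out.fileno())  # so the last line survives a hard lockup
            time.sleep(interval)

# Call log_forever() to start sampling; it is deliberately left uncalled here.
```

Started from an ssh session before the next lockup, the last line in the
log gives the approximate time of failure and whether load or
temperature was climbing just before it. The fsync after every line
matters, since data still sitting in a buffer is lost when the machine
hangs.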


-- 

Kind Regards
Rudi Ahlers


