[CentOS] how to debug hardware lockups?

Rudi Ahlers rudiahlers at gmail.com
Thu Nov 20 08:30:53 UTC 2008


On Thu, Nov 20, 2008 at 10:27 AM, Rudi Ahlers <rudiahlers at gmail.com> wrote:
> On Thu, Nov 20, 2008 at 10:09 AM, Nifty Cluster Mitch
> <niftycluster at niftyegg.com> wrote:
>> On Sat, Nov 15, 2008 at 08:13:24PM +0200, Rudi Ahlers wrote:
>>> On Sat, Nov 15, 2008 at 7:26 PM, Vandaman <vandaman2002-sk at yahoo.co.uk> wrote:
>>> > Rudi Ahlers  wrote:
>>> >
>>> >> We have a server which locks up about once a week (for the
>>> >> past 3
>> ......
>>> >> How do I debug the server, which runs CentOS 5.2 to see why
>>> >> it locks
>>> >> up?
>>
>> Jumping in the middle of a long list of good ideas.
>> Other things to try --
>>   change the run level
>>        if 5 switch to 3
>>        if 3 switch to 5
>>
>> Reinstall the processor--
>>   remove the processor
>>   clean the heat sink and processor of thermal compound
>>   correctly apply the best thermal grease you can get (I like Arctic Silver)
>>   reinstall the heat sink
>>   consider upgrading the processor heat sink if the chassis permits (more Cu is good).
>>
>> Add thermal spreaders to your RAM.  You want all the chips on a RAM stick at the same temp.
>>
>> If cpuspeed is on, turn it off with chkconfig (the service is powersaved on
>> some distros); if it is off, toggle it on.
>>
>> Turn off any special system monitoring software tools.  Things like I2C serial buses
>> do not isolate simple read only activity from things that might modify (shut
>> down) the system. I have seen sites install bluesmoke tools yet the kernel had EDAC
>> installed.   The two tools had overlapping uncoordinated interactions with
>> the hardware and would randomly shut down the system.  Very new boards are almost
>> never supported well so consider going blind.  Read EDAC info on CentOS and RH sites.
>>
>> Inspect, then tidy, all cables; they can mess up air flow and cause thermal issues.
>>
>> Reset the BIOS and check all the BIOS options.  Check for a BIOS update from the vendor.
>> When updating the BIOS do an NVRAM reset.  The data structures of the old BIOS and new
>> may differ.  The keyboard sequence to reset a BIOS to all defaults may require
>> a call to tech support.   Call the vendor.. you have a warranty on a new board.
>>
>> Since a hardware tty is not possible, log in (ssh) and run a "while /bin/true"
>> script that lets you see memory, processes and the exact time things fail, or
>> just run "top". It is possible to have syslog also log to the pty of an ssh session.
>> When you return to the cage, plug in a terminal.  If there is no screen saver or
>> screen blanking the GFX card may still display the last key bits of info
>> so long as X is not running.
>>
>>
>> --
>>        T o m  M i t c h e l l
>>        Found me a new hat, now what?
>>
>
>
> Thanx Tom,
>
> You gave some good ideas, and I've been through all of them. As a
> general rule of thumb, I only purchase RAM with factory fitted
> heatsinks attached to them. The chassis is a 1U chassis, so space is
> limited, and only the necessary cables are installed & tidied up
> already.
>
> After spending another 2 days in the datacentre trying to figure this
> one out, I thought I'd take the machine to the office instead. It's
> just so much nicer working in the office :)
>
> Top didn't help much, since I couldn't see what's wrong. But, sitting
> at my desk and running some tests, I noticed that the fan was running
> so loud at times that I couldn't even talk to someone on the phone.
> This is when I realized that the Q9300 CPU could be too big a
> processor for the fan that I have installed.
>
> The fan that I have, is:
> http://www.dynatron-corp.com/products/cpucooler/cpucooler_model.asp?id=165
>
> So, it looks like it's not really made for a Q9300 CPU, although their
> specs say it is.
>
>
> --
>

As an interesting side note, with all the other servers & cabinets in
the datacentre, the dB level is so high that it's difficult to pick out
a fan that's blowing at full force the whole time. Only at the office
could I hear it. My own PC is totally fan & noise free, so I could
easily hear when the fan was running normally and when it was running
at full speed. It only hit full speed once I started the VPS's on the
server, at which point I couldn't ping / SSH it over the network. Top
reported load to be 12 - 15, which is normally still workable, but with
the overheating CPU, I couldn't do a thing.
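
For anyone hitting the same thing: the "while /bin/true" logging loop Tom
suggested can be sketched in Python roughly as below. This is only a sketch
under some assumptions, not something that was run on the server in this
thread. It assumes a modern Python 3 (the stock Python on CentOS 5 is older
and would need changes), and it assumes the CPU temperature is exposed at
/sys/class/thermal/thermal_zone0/temp, which may not exist on every kernel;
if it is missing, the temperature is simply logged as n/a. The function names
and the log path are made up for illustration. Each sample is flushed and
fsync'ed so the last line written before a lockup should survive on disk.

```python
import os
import time


def read_meminfo(path="/proc/meminfo"):
    """Return MemFree/SwapFree in kB from /proc/meminfo, or {} if unreadable."""
    wanted = ("MemFree", "SwapFree")
    info = {}
    try:
        with open(path) as f:
            for line in f:
                key, _, rest = line.partition(":")
                if key in wanted:
                    info[key] = int(rest.split()[0])
    except (OSError, ValueError, IndexError):
        pass
    return info


def read_temp_c(path="/sys/class/thermal/thermal_zone0/temp"):
    """Return the temperature in degrees C, or None if it cannot be read.

    The default path is an assumption; sysfs reports millidegrees Celsius.
    """
    try:
        with open(path) as f:
            return int(f.read().strip()) / 1000.0
    except (OSError, ValueError):
        return None


def format_sample(now, load, mem, temp_c):
    """Build one log line from a timestamp, load averages, memory and temp."""
    temp = "n/a" if temp_c is None else "%.1f" % temp_c
    return "%s load=%.2f,%.2f,%.2f memfree_kb=%s swapfree_kb=%s temp_c=%s" % (
        time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now)),
        load[0], load[1], load[2],
        mem.get("MemFree", "n/a"), mem.get("SwapFree", "n/a"), temp)


def log_samples(log_path, interval=5, count=None):
    """Append a sample every `interval` seconds; loop forever if count is None."""
    written = 0
    with open(log_path, "a") as log:
        while count is None or written < count:
            try:
                load = os.getloadavg()
            except OSError:
                load = (float("nan"),) * 3
            line = format_sample(time.time(), load, read_meminfo(), read_temp_c())
            log.write(line + "\n")
            log.flush()
            os.fsync(log.fileno())  # push the line to disk before the next sample
            written += 1
            time.sleep(interval)
```

Calling log_samples("/var/log/lockup-watch.log", interval=5) from an ssh
session would keep sampling until the box hangs. The last timestamp in the
file then brackets the time of the failure, and a temp_c value climbing
together with the load would point at cooling rather than software.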


-- 

Kind Regards
Rudi Ahlers


