[CentOS] how to debug hardware lockups?
rudiahlers at gmail.com
Thu Nov 20 08:27:35 UTC 2008
On Thu, Nov 20, 2008 at 10:09 AM, Nifty Cluster Mitch
<niftycluster at niftyegg.com> wrote:
> On Sat, Nov 15, 2008 at 08:13:24PM +0200, Rudi Ahlers wrote:
>> On Sat, Nov 15, 2008 at 7:26 PM, Vandaman <vandaman2002-sk at yahoo.co.uk> wrote:
>> > Rudi Ahlers wrote:
>> >> We have a server which locks up about once a week (for the
>> >> past 3
>> >> How do I debug the server, which runs CentOS 5.2 to see why
>> >> it locks
>> >> up?
> Jumping in the middle of a long list of good ideas.
> Other things to try --
> change the run level
>   if 5 switch to 3
>   if 3 switch to 5
> Reinstall the processor--
>   remove the processor
>   clean the heat sink and processor of thermal compound
>   correctly apply the best thermal grease you can get (I like Arctic Silver)
>   reinstall the heat sink
>   consider upgrading the processor heat sink if the chassis permits (more Cu is good).
> Add thermal spreaders to your RAM. You want all the chips on a RAM stick at the same temp.
> Chkconfig cpuspeed off if it is on (powersaved on some distros); if it is off, toggle it on.
> Turn off any special system monitoring software tools. Things like I2C serial buses
> do not isolate simple read only activity from things that might modify (shut
> down) the system. I have seen sites install bluesmoke tools yet the kernel had EDAC
> installed. The two tools had overlapping uncoordinated interactions with
> the hardware and would randomly shut down the system. Very new boards are almost
> never supported well so consider going blind. Read EDAC info on CentOS and RH sites.
> Inspect then tidy all cables; they can mess up air flow and cause thermal issues.
> Reset the BIOS and check all the BIOS options. Check for a BIOS update from the vendor.
> When updating the BIOS do a NVRAM reset. The data structures of the old BIOS and new
> may differ. The keyboard sequence to reset a BIOS to all defaults may require
> a call to tech support. Call the vendor.. you have a warranty on a new board.
> Since a hardware tty is not possible, log in (ssh) and run a "while /bin/true" script
> that lets you see memory, processes and the exact time things fail or just "top".
> It is possible to have syslog also log to the pty of a ssh session.
> When you return to the cage plug in a terminal. If there is no screen saver or
> screen blanking the GFX card may still display the last key bits of info
> so long as X is not running.
> T o m M i t c h e l l
> Found me a new hat, now what?
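The "while /bin/true" loop suggested above could be sketched roughly like this. It is only a sketch, assuming Python 3 is available on the box and that /proc/loadavg and /proc/meminfo are readable. The names parse_meminfo, format_sample and run, and the log file name lockup-watch.log, are made up for illustration and are not from Tom's mail.

```python
import os
import time


def parse_meminfo(text):
    """Return {field: value} from the contents of /proc/meminfo (values mostly in kB)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def format_sample(now, loadavg_text, meminfo_text):
    """Build one log line: timestamp, 1-minute load, free memory in kB."""
    load1 = loadavg_text.split()[0]
    mem_free = parse_meminfo(meminfo_text).get("MemFree", -1)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return "%s load1=%s memfree_kb=%d" % (stamp, load1, mem_free)


def read_file(path):
    with open(path) as handle:
        return handle.read()


def run(log_path="lockup-watch.log", interval=5):
    """Append one sample every `interval` seconds until interrupted (Ctrl-C)."""
    with open(log_path, "a") as log:
        while True:
            line = format_sample(
                time.time(),
                read_file("/proc/loadavg"),
                read_file("/proc/meminfo"),
            )
            print(line)  # visible in the ssh session
            log.write(line + "\n")
            log.flush()
            os.fsync(log.fileno())  # push to disk so the last line survives a hang
            time.sleep(interval)
```

Calling run() from an ssh session prints each sample and also appends it to the log file, so after a lockup the last line gives an approximate time of failure. Whether the final sample actually reaches the disk before a hard hang is not guaranteed.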
You gave some good ideas, and I've been through all of them. As a
general rule of thumb, I only purchase RAM with factory fitted
heatsinks attached to them. The chassis is a 1U chassis, so space is
limited, and only the necessary cables are installed & tidied up.
After spending another 2 days in the datacentre trying to figure this
one out, I thought I'd take the machine to the office instead. It's
just so much nicer working in the office :)
Top didn't help much, since I couldn't see what's wrong. But while
sitting at my desk and running some tests, I noticed that the fan was
running so loud at times that I couldn't even talk to someone on the phone.
This is when I realized that the Q9300 CPU could be too big a
processor for the fan that I have installed.
The fan that I have, is:
So, it looks like it's not really made for a Q9300 CPU, although their
specs say it is.
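One way to check a suspicion like this would be to log the CPU temperature while the tests run. The following is a rough sketch only, assuming Python 3 and a kernel that exposes temperatures in millidegrees Celsius under /sys/class/thermal. That path is an assumption and may well not exist on a stock CentOS 5.2 kernel, where lm_sensors would be the usual route. The function names are made up for illustration.

```python
import glob


def parse_millidegrees(text):
    """Convert a sysfs reading such as '45000' to degrees Celsius."""
    return int(text.strip()) / 1000.0


def read_zone_temps(pattern="/sys/class/thermal/thermal_zone*/temp"):
    """Return {path: celsius} for every readable thermal zone (assumed layout)."""
    temps = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as handle:
                temps[path] = parse_millidegrees(handle.read())
        except (OSError, ValueError):
            continue  # skip zones that cannot be read or parsed
    return temps


def hottest(temps):
    """Return (path, celsius) for the hottest zone, or None if there are no readings."""
    if not temps:
        return None
    return max(temps.items(), key=lambda item: item[1])
```

Printing hottest(read_zone_temps()) every few seconds during a load test would show whether the temperature climbs when the fan gets loud.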