[CentOS] how to debug hardware lockups?
rudiahlers at gmail.com
Thu Nov 20 08:30:53 UTC 2008
On Thu, Nov 20, 2008 at 10:27 AM, Rudi Ahlers <rudiahlers at gmail.com> wrote:
> On Thu, Nov 20, 2008 at 10:09 AM, Nifty Cluster Mitch
> <niftycluster at niftyegg.com> wrote:
>> On Sat, Nov 15, 2008 at 08:13:24PM +0200, Rudi Ahlers wrote:
>>> On Sat, Nov 15, 2008 at 7:26 PM, Vandaman <vandaman2002-sk at yahoo.co.uk> wrote:
>>> > Rudi Ahlers wrote:
>>> >> We have a server which locks up about once a week (for the
>>> >> past 3
>>> >> How do I debug the server, which runs CentOS 5.2 to see why
>>> >> it locks
>>> >> up?
>> Jumping in the middle of a long list of good ideas.
>> Other things to try --
>> change the run level
>> if 5 switch to 3
>> if 3 switch to 5
>> Reinstall the processor--
>> remove the processor
>> clean the heat sink and processor of thermal compound
>> correctly apply the best thermal grease you can get (I like Arctic Silver)
>> reinstall the heat sink
>> consider upgrading the processor heat sink if the chassis permits (more Cu is good).
>> Add thermal spreaders to your RAM. You want all the chips on a RAM stick at the same temp.
>> Run "chkconfig cpuspeed off" if it is on (powersaved on some distros); if it is off, toggle it on.
>> Turn off any special system monitoring software tools. Things like I2C serial buses
>> do not isolate simple read only activity from things that might modify (shut
>> down) the system. I have seen sites install bluesmoke tools yet the kernel had EDAC
>> installed. The two tools had overlapping uncoordinated interactions with
>> the hardware and would randomly shut down the system. Very new boards are almost
>> never supported well so consider going blind. Read EDAC info on CentOS and RH sites.
>> Inspect then tidy all cables; they can mess up air flow and cause thermal issues.
>> Reset the BIOS and check all the BIOS options. Check for a BIOS update from the vendor.
>> When updating the BIOS do an NVRAM reset. The data structures of the old BIOS and new
>> may differ. The keyboard sequence to reset a BIOS to all defaults may require
>> a call to tech support. Call the vendor.. you have a warranty on a new board.
>> Since a hardware tty is not possible, log in (ssh) and run a "while /bin/true" script
>> that lets you see memory, processes and the exact time things fail, or just run "top".
>> It is possible to have syslog also log to the pty of a ssh session.
>> When you return to the cage, plug in a terminal. If there is no screen saver or
>> screen blanking the GFX card may still display the last key bits of info
>> so long as X is not running.
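
A minimal sketch of the "while /bin/true" idea above, assuming a Linux host with /proc mounted. The function names log_snapshot and watch_system are made up for illustration, and the loop uses the shell's own "true" rather than /bin/true. It prints one timestamped line per interval, so the last line that reaches the ssh session shows roughly when the machine stopped responding.

```shell
#!/bin/sh
# Print one timestamped line: 1/5/15-minute load averages and free memory in kB.
# Reads /proc/loadavg and /proc/meminfo, so this assumes a Linux host.
log_snapshot() {
    printf '%s load=%s mem_free_kb=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" \
        "$(cut -d' ' -f1-3 /proc/loadavg)" \
        "$(awk '/^MemFree:/ {print $2}' /proc/meminfo)"
}

# watch_system INTERVAL [COUNT]
# Print a snapshot every INTERVAL seconds (default 5).
# COUNT limits the number of snapshots; 0 (the default) means run until killed.
watch_system() {
    interval=${1:-5}
    count=${2:-0}
    n=0
    while true; do
        log_snapshot
        n=$((n + 1))
        if [ "$count" -gt 0 ] && [ "$n" -ge "$count" ]; then
            break
        fi
        sleep "$interval"
    done
}
```

Run over ssh as, for example, "watch_system 5 | tee -a /tmp/watch.log" (the log path is only a placeholder). The copy on the remote disk may lose its last lines in a hard lockup, which is why the lines already shown in the ssh session are the ones to trust.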
>> T o m M i t c h e l l
>> Found me a new hat, now what?
> Thanx Tom,
> You gave some good ideas, and I've been through all of them. As a
> general rule of thumb, I only purchase RAM with factory fitted
> heatsinks attached to them. The chassis is a 1U chassis, so space is
> limited, and only the necessary cables are installed & tidied up.
> After spending another 2 days in the datacentre trying to figure this
> one out, I thought I'd take the machine to the office instead. It's
> just so much nicer working in the office :)
> Top didn't help much, since I couldn't see what was wrong. But, sitting
> at my desk and running some tests, I noticed that the fan was running
> so loud at times that I couldn't even talk to someone on the phone.
> This is when I realized that the Q9300 CPU could be too big a
> processor for the fan that I have installed.
> The fan that I have, is:
> So, it looks like it's not really made for a Q9300 CPU, although their
> specs say it is.
As an interesting side note, with all the other servers & cabinets in
the datacentre, the dB level is so high that it's difficult to pick out
a fan that's blowing at full force the whole time. Only when I was at
the office could I hear it. My own PC is totally fan & noise free, so I
could easily hear when the fan was running fine, and when it was
running at full speed. It only ran at full speed once I started the
VPSes on the server, at which point I couldn't ping / SSH it over the
network. Top reported load to be 12 - 15, which is normally still
workable, but with the overheating CPU, I couldn't do a thing.
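
Since the fan noise only gave the overheating away by ear, a rough sketch of reading the temperature directly may be useful. It assumes a kernel that exposes /sys/class/thermal/thermal_zone*/temp in millidegrees Celsius, which an older kernel such as the one in CentOS 5 may not; there the reading would have to come from another source such as lm_sensors. The function names and the 80 degree limit in the usage note are illustrative assumptions, not values taken from this server.

```shell
#!/bin/sh
# Convert a millidegree reading (e.g. 45000) to degrees C with one decimal.
millideg_to_c() {
    awk -v m="$1" 'BEGIN { printf "%.1f\n", m / 1000 }'
}

# Print "zone temp_c" for each thermal zone the kernel exposes.
# Prints nothing when /sys/class/thermal has no readable zones.
read_temps() {
    for f in /sys/class/thermal/thermal_zone*/temp; do
        [ -r "$f" ] || continue
        raw=$(cat "$f" 2>/dev/null) || continue
        printf '%s %s\n' "$(basename "$(dirname "$f")")" "$(millideg_to_c "$raw")"
    done
}

# Succeed (exit 0) when temperature $1 in degrees C is at or above limit $2.
too_hot() {
    awk -v t="$1" -v lim="$2" 'BEGIN { exit !(t >= lim) }'
}
```

For example, "read_temps | while read zone t; do too_hot "$t" 80 && echo "$zone is at $t C"; done" would flag any zone at or above the assumed limit. Logging this next to the load average while the VPSes start would show whether the two rise together.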