[CentOS] how to debug hardware lockups?

Nifty Cluster Mitch niftycluster at niftyegg.com
Thu Nov 20 08:09:00 UTC 2008


On Sat, Nov 15, 2008 at 08:13:24PM +0200, Rudi Ahlers wrote:
> On Sat, Nov 15, 2008 at 7:26 PM, Vandaman <vandaman2002-sk at yahoo.co.uk> wrote:
> > Rudi Ahlers  wrote:
> >
> >> We have a server which locks up about once a week (for the
> >> past 3
......
> >> How do I debug the server, which runs CentOS 5.2 to see why
> >> it locks
> >> up?

Jumping into the middle of a long list of good ideas.
Other things to try --
   Change the run level:
	if 5, switch to 3
	if 3, switch to 5
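
A rough sketch of the run level swap, assuming a SysV-style /etc/inittab as on CentOS 5. The helper names default_runlevel and other_runlevel are made up for illustration; nothing here changes the system by itself.

```shell
#!/bin/sh
# Print the default run level from an inittab-style file (default: /etc/inittab).
default_runlevel() {
    file=${1:-/etc/inittab}
    # The initdefault entry looks like "id:5:initdefault:"; field 2 is the level.
    awk -F: '!/^#/ && $3 == "initdefault" { print $2; exit }' "$file"
}

# Suggest the level to try next, following the 5 <-> 3 swap described above.
# Any other level is echoed back unchanged.
other_runlevel() {
    case "$1" in
        5) echo 3 ;;
        3) echo 5 ;;
        *) echo "$1" ;;
    esac
}
```

As root, something like `telinit "$(other_runlevel "$(default_runlevel)")"` should switch the running system; to make the change survive a reboot, edit the initdefault line in /etc/inittab.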

Reinstall the processor--
   remove the processor
   clean the heat sink and processor of thermal compound
   correctly apply the best thermal grease you can get (I like Arctic Silver)
   reinstall the heat sink 
   consider upgrading the processor heat sink if the chassis permits (more Cu is good).

Add thermal spreaders to your RAM.  You want all the chips on a RAM stick at the same temp.

Toggle cpuspeed (powersaved on some distros): if it is on, run "chkconfig cpuspeed off"; if it is off, turn it on.
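
A small dry-run sketch of that toggle, assuming the usual "chkconfig --list" output format (name followed by 0:off 1:on ... per run level). The function name cpuspeed_toggle_plan is hypothetical; it only prints the commands so you can review them before running anything as root.

```shell
#!/bin/sh
# Given one line of `chkconfig --list cpuspeed` output and a run level,
# print the commands that would flip the service to the opposite state.
cpuspeed_toggle_plan() {
    line=$1
    level=${2:-3}
    case "$line" in
        *"${level}:on"*)
            echo "chkconfig cpuspeed off"
            echo "service cpuspeed stop"
            ;;
        *"${level}:off"*)
            echo "chkconfig cpuspeed on"
            echo "service cpuspeed start"
            ;;
        *)
            echo "# cpuspeed not listed for run level ${level}"
            ;;
    esac
}
```

Typical use would be `cpuspeed_toggle_plan "$(chkconfig --list cpuspeed)" 3`, then run the printed commands by hand.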

Turn off any special system monitoring software tools.  Things like I2C serial buses
do not isolate simple read-only activity from operations that might modify (shut
down) the system.  I have seen sites install bluesmoke tools even though the kernel
already had EDAC installed.  The two tools had overlapping, uncoordinated interactions
with the hardware and would randomly shut down the system.  Very new boards are almost
never supported well, so consider running without monitoring ("going blind").  Read the
EDAC info on the CentOS and RH sites.
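
Before adding a userspace monitoring tool, it may help to check whether kernel EDAC drivers are already loaded. A minimal sketch, assuming the drivers show up in /proc/modules with "edac" in their names (drivers built into the kernel will not appear there); list_edac_modules is a hypothetical helper name.

```shell
#!/bin/sh
# List loaded kernel modules whose name contains "edac".
# Reads a /proc/modules-style file; the module name is the first field.
list_edac_modules() {
    file=${1:-/proc/modules}
    awk '$1 ~ /edac/ { print $1 }' "$file"
}
```

If `list_edac_modules` prints anything, the kernel is already talking to the memory controller, and a second tool doing the same is a candidate for the kind of conflict described above.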

Inspect, then tidy, all cables; they can mess up air flow and cause thermal issues.

Reset the BIOS and check all the BIOS options.  Check for a BIOS update from the vendor.
When updating the BIOS, do an NVRAM reset, since the data structures of the old BIOS and
the new one may differ.  The keyboard sequence to reset a BIOS to all defaults may require
a call to tech support.  Call the vendor; you have a warranty on a new board.

Since a hardware tty is not possible, log in (ssh) and run a "while /bin/true" script
that lets you see memory, processes and the exact time things fail, or just run "top".
It is possible to have syslog also log to the pty of an ssh session.
When you return to the cage, plug in a terminal.  If there is no screen saver or
screen blanking, the GFX card may still display the last key bits of info,
so long as X is not running.
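
One possible shape for such a script, as a sketch only. It assumes the procps versions of free and ps (as shipped with CentOS), uses the shell's built-in true rather than /bin/true, and the function names snapshot and monitor are made up for illustration.

```shell
#!/bin/sh
# Print one timestamped snapshot: memory summary plus the top CPU consumers.
snapshot() {
    echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
    # free and ps come from procps; degrade quietly if either is missing.
    free -m 2>/dev/null || echo "free not available"
    ps -eo pid,pcpu,pmem,comm --sort=-pcpu 2>/dev/null | head -n 6
}

# Take COUNT snapshots INTERVAL seconds apart; COUNT=0 means run until killed.
monitor() {
    count=${1:-0}
    interval=${2:-5}
    i=0
    while true; do
        snapshot
        i=$((i + 1))
        if [ "$count" -gt 0 ] && [ "$i" -ge "$count" ]; then
            break
        fi
        sleep "$interval"
    done
}
```

Run `monitor 0 10` inside the ssh session and leave it open. When the box locks up, the last header line left on your local terminal gives the approximate time of failure, and the lines under it show what was busy just before.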


-- 
	T o m  M i t c h e l l 
	Found me a new hat, now what?



