On Thu, Nov 20, 2008 at 10:09 AM, Nifty Cluster Mitch niftycluster@niftyegg.com wrote:
On Sat, Nov 15, 2008 at 08:13:24PM +0200, Rudi Ahlers wrote:
On Sat, Nov 15, 2008 at 7:26 PM, Vandaman vandaman2002-sk@yahoo.co.uk wrote:
Rudi Ahlers wrote:
We have a server which locks up about once a week (for the past 3
......
How do I debug the server, which runs CentOS 5.2 to see why it locks up?
Jumping into the middle of a long list of good ideas. Other things to try: change the run level. If it is 5, switch to 3; if it is 3, switch to 5.
Reinstall the processor: remove the processor, clean the old thermal compound off the heat sink and processor, correctly apply the best thermal grease you can get (I like Arctic Silver), and reinstall the heat sink. Consider upgrading the processor heat sink if the chassis permits (more Cu is good).
Add thermal spreaders to your RAM. You want all the chips on a RAM stick at the same temp.
Run "chkconfig cpuspeed off" if it is on (powersaved on some distros); if it is off, toggle it to on.
Turn off any special system monitoring software tools. Things like I2C serial buses do not isolate simple read-only activity from things that might modify (shut down) the system. I have seen sites install bluesmoke tools when the kernel already had EDAC installed. The two tools had overlapping, uncoordinated interactions with the hardware and would randomly shut down the system. Very new boards are almost never supported well, so consider going blind. Read the EDAC info on the CentOS and RH sites.
Inspect, then tidy, all cables; they can mess up air flow and cause thermal issues.
Reset the BIOS and check all the BIOS options. Check for a BIOS update from the vendor. When updating the BIOS, do an NVRAM reset. The data structures of the old BIOS and the new one may differ. The keyboard sequence to reset a BIOS to all defaults may require a call to tech support. Call the vendor; you have a warranty on a new board.
Since a hardware tty is not possible, log in (ssh) and run a "while /bin/true" script that lets you see memory, processes and the exact time things fail, or just run "top". It is possible to have syslog also log to the pty of an ssh session. When you return to the cage, plug in a terminal. If there is no screen saver or screen blanking, the GFX card may still display the last key bits of info, so long as X is not running.
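One way to sketch the "while /bin/true" idea is the loop below. It is only an illustration under assumptions: the file name lockup-watch.sh, the LOG and INTERVAL variables and the snapshot function are made-up names, and the choice of /proc/loadavg, /proc/meminfo and ps output is a guess at what is worth recording. Each pass appends a timestamped entry to a file, so the last entry written before a hang gives an approximate time of failure.

```shell
#!/bin/sh
# Sketch of a "while /bin/true" logger for a box that hangs.
# LOG, INTERVAL and snapshot are hypothetical names for this example.
LOG="${LOG:-$HOME/lockup-watch.log}"   # where entries are appended
INTERVAL="${INTERVAL:-5}"              # seconds between entries

# Print one timestamped entry: load average, free memory and swap,
# and the five processes using the most CPU.
snapshot() {
    echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
    echo "loadavg: $(cat /proc/loadavg)"
    grep -E '^(MemFree|SwapFree):' /proc/meminfo
    ps -eo pcpu,pmem,pid,comm --sort=-pcpu | head -n 6
}

# Loop only when started as "sh lockup-watch.sh run", so the file can
# also be sourced to try snapshot by hand.
if [ "${1:-}" = "run" ]; then
    while /bin/true; do
        snapshot >> "$LOG"
        sync    # ask the kernel to flush the entry before a possible hang
        sleep "$INTERVAL"
    done
fi
```

Started over ssh as "nohup sh lockup-watch.sh run &", it keeps writing after the session drops. After the next lockup and reboot, the tail of the log shows the last time the machine was still alive and what was busy then. The sync on every pass costs some disk I/O, so a longer INTERVAL may be preferable on a loaded server.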
-- T o m M i t c h e l l Found me a new hat, now what?
Thanx Tom,
You gave some good ideas, and I've been through all of them. As a general rule of thumb, I only purchase RAM with factory fitted heatsinks attached to them. The chassis is a 1U chassis, so space is limited, and only the necessary cables are installed & tidied up already.
After spending another 2 days in the datacentre trying to figure this one out, I thought I'd take the machine to the office instead. It's just so much nicer working in the office :)
Top didn't help much, since I couldn't see what was wrong. But while sitting at my desk and running some tests, I noticed that the fan was running so loud at times that I couldn't even talk to someone on the phone. This is when I realized that the Q9300 CPU could be too big a processor for the fan that I have installed.
The fan that I have is: http://www.dynatron-corp.com/products/cpucooler/cpucooler_model.asp?id=165
So, it looks like it's not really made for a Q9300 CPU, although their specs say it is.