[CentOS] Server hangs on CentOS 5.5

Thu Mar 10 16:11:00 UTC 2011
Michael Eager <eager at eagerm.com>

Simon Matter wrote:

>> One fan is listed as 0 rpm.   Something to look into.
> 
> Hmm, much has been said now in this thread and I know how difficult it can
> be to find such an issue. However, I suggest not throwing in too many new
> tools in parallel. Also, be careful how you interpret any information
> gathered by tools like lm_sensors. They can only report as well as the
> mainboard and its sensors were designed and built, and both can be
> suboptimal. I've seen all kinds of things, like temp sensors not mounted
> where they should be. Of course, built-in sensors like those of a CPU
> should be taken very seriously.

Thanks for the suggestions.

> So, may I give some more tips on how I'd try to find what is wrong:
> - Take a vacuum cleaner and *carefully* clean the whole box. Dust can
> really do bad things because it is not a perfect insulator.
> - If you feel you have to remove any device like the CPU, make sure you
> clean up everything, have a good quality heat sink paste at hand, and make
> sure everything is seated well after mounting it again.
> - For the memory part, do you have ECC? If not, that really is a problem:
> if the box is used as a server, ECC is a must. If yes, then most errors
> will be corrected by ECC and, more importantly, memory errors are usually
> logged. You should be able to find a list of those errors in the BIOS,
> where you may see how many times errors occurred and where. Does
> something like that exist?

The MB docs/website don't mention ECC support, but I presume it is supported
as part of the DDR2 spec.  I'll check whether the memory has ECC.  If not,
this is a reasonable upgrade.
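
One way to check for ECC without opening the case is to read the SMBIOS
tables with "dmidecode -t memory" (run as root).  The sketch below is only
a rough illustration, assuming the usual "Error Correction Type", "Total
Width" and "Data Width" fields are present.  The function name and the
sample text are made up for the example, the sample is not output from
this server, and the BIOS tables may themselves be inaccurate.

```python
import re

def ecc_summary(dmidecode_text):
    """Summarize ECC-related fields from `dmidecode -t memory` output.

    Hypothetical helper for illustration only.
    """
    # Reported by the "Physical Memory Array" section, e.g. "None" or "Multi-bit ECC".
    correction = re.findall(r"Error Correction Type:\s*(.+)", dmidecode_text)
    # Each populated "Memory Device" lists Total Width followed by Data Width.
    widths = re.findall(
        r"Total Width:\s*(\d+) bits\s+Data Width:\s*(\d+) bits", dmidecode_text
    )
    # ECC DIMMs carry extra check bits, so total width exceeds data width
    # (typically 72 bits versus 64 bits).
    with_check_bits = sum(1 for total, data in widths if int(total) > int(data))
    return {
        "correction_types": [c.strip() for c in correction],
        "dimms_seen": len(widths),
        "dimms_with_check_bits": with_check_bits,
    }

# Made-up sample in the shape dmidecode normally prints.
SAMPLE = """Physical Memory Array
\tLocation: System Board Or Motherboard
\tError Correction Type: None
Memory Device
\tTotal Width: 64 bits
\tData Width: 64 bits
"""

if __name__ == "__main__":
    print(ecc_summary(SAMPLE))
```

If the correction type reads "None" and the DIMMs show equal total and
data widths, the memory is most likely non-ECC.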

> - For the temperatures, 87C is not so uncommon, but yes, it looks a little
> bit high. Someone else posted 80C as the max for your CPU, and that seems
> correct; at least our 12-core Opterons have "Caution: 75C; Critical: 80C",
> but they usually run at 45C-55C under normal load. So if 87C is really
> correct under normal load, that may already be too much, and then
> consider what happens at peak times.

The most recent crash was overnight and not discovered until morning.
Probably not related to load.  But if it really is running over temp,
then almost anything can happen.
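
Since the crash happened unattended, it may help to log flagged readings
periodically rather than look at them by hand.  The following is a minimal
sketch, assuming "sensors" prints lines shaped like "label: +87.0°C" and
"label: 0 RPM"; label names differ per board.  The 75C/80C defaults are
just the Opteron figures quoted above, not the limits for this CPU, and
the function name and sample text are invented for the example.

```python
import re

TEMP_RE = re.compile(r"^([^:\n]+):\s*([+-]?\d+(?:\.\d+)?)\s*°?C\b", re.M)
FAN_RE = re.compile(r"^([^:\n]+):\s*(\d+)\s*RPM", re.M)

def flag_readings(sensors_text, caution=75.0, critical=80.0):
    """Return (label, value, status) for readings that look suspicious.

    Thresholds are illustrative defaults, not this CPU's specification.
    """
    flagged = []
    for label, value in TEMP_RE.findall(sensors_text):
        temp = float(value)
        if temp >= critical:
            flagged.append((label.strip(), temp, "critical"))
        elif temp >= caution:
            flagged.append((label.strip(), temp, "caution"))
    for label, rpm in FAN_RE.findall(sensors_text):
        # 0 RPM can mean a dead fan, or simply an unconnected header.
        if int(rpm) == 0:
            flagged.append((label.strip(), 0, "no rotation reported"))
    return flagged

# Made-up sample in the general shape of `sensors` output.
SAMPLE = """CPU Temp:    +87.0°C  (high = +80.0°C, hyst = +75.0°C)
fan1:       3013 RPM  (min = 1500 RPM)
fan2:          0 RPM  (min = 1500 RPM)
"""

if __name__ == "__main__":
    for label, value, status in flag_readings(SAMPLE):
        print("%s: %s (%s)" % (label, value, status))
```

Run from cron against real "sensors" output and appended to a file, this
would leave a record of the last readings before a hang.  As noted above,
the values are only as trustworthy as the board's sensors.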

> - When you look at the lm_sensors values, do they correspond with what is
> shown in the BIOS (if it has this kind of diagnostics)?

Something I'll check when the system is taken down.

-- 
Michael Eager	 eager at eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077