[CentOS] Server hangs on CentOS 5.5

Thu Mar 10 10:04:23 UTC 2011
Simon Matter <simon.matter at invoca.ch>

> compdoc wrote:
>>> According to the man page, it apparently needs a kernel driver
>>> named OpenIMPI, which it claims is installed in standard
>>> distributions.  I don't find it on my system.
>>
>>
>> lm_sensors is another, and I think installs ready to use from the repos.
>
> sensors says that the three temp sensors read +36C, +39C, and +87C.
> These appear to be AMD K10 temp sensors, although I might be
> misreading sensors-detect.  Low/highs are (+127/+127, +127/+90,
> +127/+127) respectively.  (I'm not sure if these are alarm set
> points or something else.)
>
> One fan is listed as 0 rpm.   Something to look into.

Hmm, much has been said in this thread, and I know how difficult it can
be to track down such an issue. However, I suggest not throwing in too many
new tools in parallel. Also, be careful how you interpret any information
gathered by tools like lm_sensors. They can only report as well as the
mainboard and its sensors were designed and built, and both can be
suboptimal. I've seen all kinds of things, like temp sensors not mounted
where they should be. Of course, built-in sensors like those of a CPU
should be taken very seriously.

So, may I give some more tips on how I'd try to find what is wrong:
- Take a vacuum cleaner and *carefully* clean the whole box. Dust can
really do bad things because it is not a perfect insulator.
- If you feel you have to remove any device like the CPU, make sure you
clean up everything, have a good quality heat sink paste at hand and make
sure everything is seated well after mounting it again.
- For the memory part, do you have ECC? If not, that is really a
problem: if the box is used as a server, ECC is a must. If yes, then
most errors will be corrected by ECC and, more importantly, memory
errors are usually logged. You should be able to find a list of those
errors in the BIOS, showing how many times errors occurred and where.
Does something like that exist?
- For the temperatures, 87C is not so uncommon, but yes, it looks a little
bit high. Someone else posted 80C as the max for your CPU, and that seems
correct; at least our 12-core Opterons have "Caution: 75C; Critical: 80C",
but they usually run at 45C-55C under normal load. So if 87C is really
correct under normal load, that may already be too much, and then
consider what happens at peak times.
- When you look at the lm_sensors values, do they correspond with what is
shown in the BIOS (if it has this kind of diagnostics)?
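
As a rough sketch of the temperature and fan checks above, the following
Python snippet scans text in the style of `sensors` output and flags
readings at or above the "Caution: 75C; Critical: 80C" figures quoted for
our Opterons, plus any fan reporting 0 RPM. The thresholds are an
assumption for any other CPU, the sample text and its labels are
hypothetical, and the exact `sensors` line format varies between
lm_sensors versions and chips, so the patterns may need adjusting.

```python
import re

# Thresholds quoted above for 12-core Opterons; an assumption for other CPUs.
CAUTION_C = 75.0
CRITICAL_C = 80.0

# First reading on a line such as "temp1:  +36.0°C  (high = +70.0°C)".
TEMP_RE = re.compile(r"^(?P<label>[^:]+):\s*(?P<value>[+-]?\d+(?:\.\d+)?)\s*°?C\b")
# First reading on a line such as "fan1:  0 RPM  (min = 0 RPM)".
FAN_RE = re.compile(r"^(?P<label>[^:]+):\s*(?P<value>\d+)\s*RPM\b")


def check_sensors(text):
    """Return (label, status, value) tuples for readings worth a closer look."""
    findings = []
    for line in text.splitlines():
        line = line.strip()
        temp = TEMP_RE.match(line)
        if temp:
            value = float(temp.group("value"))
            if value >= CRITICAL_C:
                findings.append((temp.group("label"), "critical", value))
            elif value >= CAUTION_C:
                findings.append((temp.group("label"), "caution", value))
            continue
        fan = FAN_RE.match(line)
        if fan and int(fan.group("value")) == 0:
            findings.append((fan.group("label"), "fan stopped", 0.0))
    return findings


# Hypothetical sample modelled on the readings reported in this thread.
SAMPLE = """\
temp1:       +36.0°C  (low  = +127.0°C, high = +127.0°C)
temp2:       +39.0°C  (low  = +127.0°C, high =  +90.0°C)
temp3:       +87.0°C  (low  = +127.0°C, high = +127.0°C)
fan1:           0 RPM  (min =    0 RPM)
"""

for label, status, value in check_sensors(SAMPLE):
    print(label, status, value)
```

A flagged reading is only a hint: a 0 RPM fan may simply be an unconnected
header, so compare against the BIOS values before drawing conclusions.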

Simon