On Wed, 2006-01-18 at 13:38, Fong Vang wrote:
I have a total of 20 CentOS 4.1 systems running on fairly new hardware. About 6 of them are experiencing strange hangs without any indication -- nothing in /var/log/messages or on the console -- sometime within 10-30 minutes after a reboot. The systems still respond to ping but you can't ssh to them. At the console, you can type "root" at the login prompt but it hangs immediately after hitting enter.
A memory scan of all systems shows no errors.
Any idea how to troubleshoot this problem? The system isn't responsive enough to do any troubleshooting and nothing abnormal is in the log.
We are running this kernel: kernel-smp-2.6.9-11.EL.i686.rpm.
My first guess would be that something is consuming all available memory and pushing everything else into swap. The system may not be completely hung, but it can't respond in a reasonable amount of time. If the logs for whatever services you run don't show anything, I'd watch with top over a period of time to see if a single program is doing it, and run "ps ax" frequently to see if a large number of small processes are accumulating. You can get a hint about how fast new processes are being started by looking at the process ID of the ps process itself each time you run it: PIDs are handed out sequentially, so a big jump between runs means many processes were started in between. I assume from the fact that you have 20 boxes that you are doing something that causes substantial load - perhaps it needs to be distributed better.
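
Since the box stops responding before you can look at it, it may help to record the same information to disk from boot onward and read it after the next hang. Below is a rough sketch of that idea, not a tested tool. It assumes a Python 3 interpreter is available (the Python shipped with CentOS 4.1 is older and would need changes, or you could do the same thing with a shell loop around "ps ax" and "free"). The function names, the log path /var/tmp/hangwatch.log, and the 5-second interval are all made up for illustration. It reads MemFree and swap usage from /proc/meminfo, counts the numeric directories in /proc as a process count, and forks a throwaway child to see what PID the kernel is currently handing out.

```python
import os
import time


def parse_meminfo(text):
    """Parse the text of /proc/meminfo into a dict of integer values."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def count_processes(proc_dir="/proc"):
    """Count running processes: each one is a numeric directory in /proc."""
    return sum(1 for name in os.listdir(proc_dir) if name.isdigit())


def newest_pid():
    """Fork a child that exits at once; its PID shows how far PIDs have advanced."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)
    os.waitpid(pid, 0)
    return pid


def snapshot(proc_dir="/proc"):
    """Build one log line with memory, swap, process count and newest PID."""
    with open(os.path.join(proc_dir, "meminfo")) as f:
        mem = parse_meminfo(f.read())
    swap_used = mem.get("SwapTotal", 0) - mem.get("SwapFree", 0)
    return "%s memfree=%dkB swapused=%dkB procs=%d newpid=%d" % (
        time.strftime("%Y-%m-%d %H:%M:%S"),
        mem.get("MemFree", 0),
        swap_used,
        count_processes(proc_dir),
        newest_pid(),
    )


def watch(log_path="/var/tmp/hangwatch.log", interval=5, samples=None):
    """Append a snapshot every `interval` seconds; run forever if samples is None."""
    taken = 0
    with open(log_path, "a") as log:
        while samples is None or taken < samples:
            log.write(snapshot() + "\n")
            log.flush()
            os.fsync(log.fileno())  # push each line to disk before a possible hang
            taken += 1
            time.sleep(interval)
```

Calling watch() from an init script on one of the affected machines and reading the tail of the log after the next forced reboot should show whether swapused was climbing, whether procs was growing, or whether newpid was jumping by large amounts between samples. If none of those move before the hang, memory exhaustion is probably not the cause and I'd look elsewhere.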