[CentOS] System hangs silently

Wed Jan 18 20:24:51 UTC 2006
Fong Vang <sudoyang at gmail.com>

On 1/18/06, Les Mikesell <lesmikesell at gmail.com> wrote:
> On Wed, 2006-01-18 at 13:38, Fong Vang wrote:
> > I have a total of 20 CentOS 4.1 systems running on fairly new
> > hardware.  About 6 of them are experiencing strange hangs without any
> > indication -- nothing in /var/log/messages nor on the console --
> > sometime within 10-30 minutes after a reboot.  The systems still
> > respond to ping, but you can't ssh to them.  At the console, you can
> > type "root" at the user prompt, but it hangs immediately after hitting
> > enter.
> >
> > Memory scans of all systems show no errors.
> >
> > Any idea how to troubleshoot this problem?  The system isn't
> > responsive enough to do any troubleshooting, and nothing abnormal is
> > in the log.
> >
> > We're running this kernel: kernel-smp-2.6.9-11.EL.i686.rpm.
> My first guess would be that something is consuming all possible
> memory and pushing everything else into swap.  The system may
> not be completely hung, but it can't respond in a reasonable
> amount of time.  If the logs for whatever services you run
> don't show anything, I'd watch with top over a period of
> time to see if a single program is doing it, and run frequent
> "ps ax" checks to see if a large number of small processes
> are accumulating.  You can get a hint about how fast new
> processes are being started by looking at the process id
> of the ps process when you run it repeatedly.  I assume from
> the fact that you have 20 boxes that you are doing something
> that causes substantial load - perhaps it needs to be distributed
> better.
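Les's two checks (counting processes as "ps ax" would, and watching how far the PID of a freshly started process jumps between runs) can be scripted roughly as below. This is a minimal, Linux-only sketch that reads /proc. It assumes the default pid_max of 32768 and that allocation restarts at PID 300 after a wrap; check /proc/sys/kernel/pid_max on the actual boxes. Threads draw from the same PID space, so the rate is approximate. The names `probe_pid`, `pids_consumed` and `watch` are hypothetical. The script is written for Python 3, and the stock Python on CentOS 4 is a much older 2.x release, so it would need adjusting there. A more direct counter is the "processes" line in /proc/stat, which counts forks since boot.

```python
import os
import subprocess
import time

# Assumed defaults for a 2.6 kernel; check /proc/sys/kernel/pid_max on the box.
PID_MAX = 32768
WRAP_START = 300  # Linux restarts PID allocation here after wrapping


def probe_pid():
    """Start a trivial child (like running 'ps') and return the PID it got."""
    child = subprocess.Popen(["true"])
    child.wait()
    return child.pid


def pids_consumed(first, second, pid_max=PID_MAX, wrap_start=WRAP_START):
    """Approximate number of PIDs allocated between two probes (one wrap at most)."""
    if second >= first:
        return second - first
    return (pid_max - first) + (second - wrap_start)


def process_count():
    """Count live processes by listing the numeric entries in /proc."""
    return sum(1 for name in os.listdir("/proc") if name.isdigit())


def watch(interval=5.0, rounds=12):
    """Print the process count and the rate of other new PIDs every interval."""
    last_pid, last_t = probe_pid(), time.monotonic()
    for _ in range(rounds):
        time.sleep(interval)
        pid, now = probe_pid(), time.monotonic()
        # Subtract 1 for our own probe; the rest were started by something else.
        others = max(pids_consumed(last_pid, pid) - 1, 0)
        rate = others / (now - last_t)
        print("procs=%d  other new PIDs/s=%.1f" % (process_count(), rate))
        last_pid, last_t = pid, now


if __name__ == "__main__":
    watch()
```

A steadily climbing process count, or a large "other new PIDs/s" on an idle box, would point at the runaway-process theory.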

These systems will be doing a lot once we turn on the service, but
we're still in the setup mode.

So far, the only thing we've done is kickstart these systems from the
same image/profile.  We've turned off all services, so almost nothing is
running on them at all.  That's what's baffling about this.  The hang
is so silent that it's very difficult to troubleshoot (again, the
system responds to ping; load average, context switches, swap, network,
and I/O are all normal).
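Because nothing reaches /var/log/messages before the hang, one option is to record the same metrics mentioned above (load average, context switches, swap) to a file and force each line to disk. That way the last sample taken before the box stops responding survives a reboot. This is a minimal sketch, not a proven diagnostic. It assumes Linux /proc files and a writable disk up to the moment of the hang. The log path `/var/tmp/hangwatch.log` and the function names are hypothetical. It targets Python 3, not the Python 2.x shipped with CentOS 4.

```python
import os
import time


def read(path):
    with open(path) as f:
        return f.read()


def parse_loadavg(text):
    """1-minute load average from the contents of /proc/loadavg."""
    return float(text.split()[0])


def parse_ctxt(stat_text):
    """Cumulative context-switch count from the contents of /proc/stat."""
    for line in stat_text.splitlines():
        if line.startswith("ctxt "):
            return int(line.split()[1])
    raise ValueError("no ctxt line in /proc/stat")


def parse_swap_used_kb(meminfo_text):
    """SwapTotal - SwapFree, in kB, from the contents of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key] = int(parts[0])
    return fields["SwapTotal"] - fields["SwapFree"]


def log_forever(logfile="/var/tmp/hangwatch.log", interval=10.0):
    """Append one sample per interval and fsync it so it survives a hang."""
    prev_ctxt, prev_t = parse_ctxt(read("/proc/stat")), time.monotonic()
    with open(logfile, "a") as out:
        while True:
            time.sleep(interval)
            ctxt, now = parse_ctxt(read("/proc/stat")), time.monotonic()
            out.write("%s load1=%.2f ctxt/s=%.0f swap_used_kb=%d\n" % (
                time.strftime("%Y-%m-%d %H:%M:%S"),
                parse_loadavg(read("/proc/loadavg")),
                (ctxt - prev_ctxt) / (now - prev_t),
                parse_swap_used_kb(read("/proc/meminfo"))))
            out.flush()
            os.fsync(out.fileno())  # push the line to disk, not just the page cache
            prev_ctxt, prev_t = ctxt, now


if __name__ == "__main__":
    log_forever()
```

If the log simply stops with normal-looking values, that supports the theory that the hang is sudden rather than a slow resource exhaustion.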

We'll have to look at the hardware next to determine if they are
indeed the same.
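One way to check whether the hanging boxes really match the working ones is to save the output of hardware-listing tools such as `lspci -n` and `dmidecode` on a good host and a hanging host, then diff the two files. The sketch below just wraps Python's `difflib.unified_diff`. The script name `hwdiff.py` and the function `hardware_diff` are hypothetical.

```python
import difflib
import sys


def hardware_diff(good_text, bad_text, good_name="good", bad_name="bad"):
    """Unified diff of two saved hardware inventories; empty list means identical."""
    return list(difflib.unified_diff(
        good_text.splitlines(), bad_text.splitlines(),
        fromfile=good_name, tofile=bad_name, lineterm=""))


if __name__ == "__main__":
    # Usage: python hwdiff.py good-host.txt hanging-host.txt
    good_path, bad_path = sys.argv[1], sys.argv[2]
    with open(good_path) as g, open(bad_path) as b:
        for line in hardware_diff(g.read(), b.read(), good_path, bad_path):
            print(line)
```

Serial numbers and UUIDs in `dmidecode` output will always differ between machines. The interesting differences are BIOS versions, board revisions, and PCI device IDs.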

> --
>   Les Mikesell
>     lesmikesell at gmail.com
> _______________________________________________
> CentOS mailing list
> CentOS at centos.org
> http://lists.centos.org/mailman/listinfo/centos