System hangs silently


Fong Vang

18 Jan 2006

7:38 p.m.

I have a total of 20 CentOS 4.1 systems running on fairly new hardware. About 6 of them are experiencing strange hangs without any indication -- nothing in /var/log/messages or on the console -- sometime within 10-30 minutes after a reboot. The systems still respond to ping, but you can't ssh to them. At the console, you can type "root" at the user prompt, but it hangs immediately after hitting enter.

Memory scans of all systems show no errors.

Any idea how to troubleshoot this problem? The systems aren't responsive enough to do any troubleshooting, and nothing abnormal is in the logs.

We're running this kernel: kernel-smp-2.6.9-11.EL.i686.rpm.

Thanks for any help.

Robert Hanson

18 Jan

7:51 p.m.

greetings

I'm quite sure you are most intelligent, so you have probably done these things already.

The first two things that come to mind: do you have the latest stable firmware on those machines?

Are they all the same, or is there a common denominator besides CentOS 4.1?

Have you tried installing the latest kernels? New ones were published recently.

If they are connected to the internet, can you unplug them for testing?

- rh

-- Robert Hanson - Abba Communications Computer & Internet Services (509) 624-7159 - www.abbacomm.net

Fong Vang

8:11 p.m.

On 1/18/06, Robert Hanson roberth@abbacomm.net wrote:

> the first two things that come to mind are... do you have the latest stable "firmware" on those machines

I haven't double-checked this yet. We have a person from the hardware vendor here on site, so I'll have him check.

> are they all the same or is there a common denominator besides CentOS 4.1 ?
>
> and have you tried to install the latest kernels and such... there was recent publishing of them

These systems were ordered from the same batch (same PO/build spec). They're all using the same kernel -- the latest of what CentOS 4.1 provided at that time.

CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos

Leonard Isham

10:01 p.m.

On 1/18/06, Fong Vang sudoyang@gmail.com wrote:

> These systems are ordered from the same batch (same PO/build spec). They're all using the same kernel -- the latest of what CentOS 4.1 provided at that time.

I hate to say this, but I have found that this is not a guarantee of 100% duplication of the internals. Not even when the systems have the same model numbers. I won't mention a well known computer company with three letters... or big... or blue...

I've been bitten by this.

-- Leonard Isham, CISSP Ostendo non ostento.

Rodrigo Barbosa

10:07 p.m.

On Wed, Jan 18, 2006 at 05:01:15PM -0500, Leonard Isham wrote:

> I hate to say this, but I have found that this is not a guarantee of 100% duplication of the internals. Not even when the systems have the same model numbers.

I have to agree. Version numbers mean nothing. Most of the time, though, the Part Number will tell the truth.

With such a batch of machines, it would be interesting to try isolating the specifics of the ones giving problems, starting with the processors (check the P/N) and then the northbridge, which are the two most likely culprits.
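One low-effort way to start that comparison from software is to fingerprint what each box reports about its CPUs and then group the hosts. This is only a sketch: `cpu_fingerprint` is a hypothetical helper, and /proc/cpuinfo exposes family, model and stepping rather than the printed part number, so it is a proxy for the P/N check and not a replacement for it.

```sh
#!/bin/sh
# Hypothetical helper (not from the thread): print one line summarising the
# CPU identity fields found in a cpuinfo-style file.
# $1 = file to read; defaults to /proc/cpuinfo on the local machine.
cpu_fingerprint() {
    awk -F': *' '
        /^(model name|cpu family|model|stepping)[ \t]*:/ {
            key = $1
            sub(/[ \t]+$/, "", key)          # strip the tab padding after the key
            if (!(key in seen)) {            # report each field once, not per core
                seen[key] = 1
                printf "%s=%s;", key, $2
            }
        }
        END { print "" }
    ' "${1:-/proc/cpuinfo}"
}
```

Running this on every host (for example over ssh) and piping the collected lines through `sort | uniq -c` would show whether the hanging machines share a stepping that the healthy ones lack.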

We have been discussing this issue on-and-off on the linux-practices mailing list so, if you want to go there with some extra info, we might be able to help you on this without having people screaming "OFF TOPIC!" here on this list :)

Best Regards,

-- Rodrigo Barbosa rodrigob@suespammers.org "Quid quid Latine dictum sit, altum viditur" "Be excellent to each other ..." - Bill & Ted (Wyld Stallyns)

Fong Vang

19 Jan

12:19 a.m.

On 1/18/06, Rodrigo Barbosa rodrigob@suespammers.org wrote:

Hopefully we're not wandering off topic here. I have more information to share. Here's what I have learned since then:

* when the system appears to hang, you can't ssh to it, but if you already have a connection it works fine.
* high load average (~25)
* vmstat reports no heavy context switching, swapping, cpu utilization, paging, etc.
* iostat activity is normal (no long iowait or service time)
* netstat/ifconfig is normal (no collisions, errors, etc.)
* more than a dozen crond processes. It seems to start every 10 minutes to run sar. strace of crond shows it doing setup(). Shutting down crond caused it to hang more than 20 minutes before it came back.
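From a session that is still connected, the crond pile-up and the load average can be recorded together so their growth can be lined up in time. A rough sketch only: `count_name` and `pileup_summary` are hypothetical helper names, and the only system interfaces assumed are `ps -e -o comm=` and the first three fields of /proc/loadavg.

```sh
#!/bin/sh
# Hypothetical helper: count how many lines on stdin have the given command
# name as their first word.
count_name() {
    awk -v name="$1" '$1 == name { n++ } END { print n + 0 }'
}

# Print one summary line: timestamp, 1/5/15-minute load averages, crond count.
pileup_summary() {
    read -r l1 l5 l15 _rest < /proc/loadavg
    n=$(ps -e -o comm= | count_name crond)
    echo "$(date '+%F %T') load=$l1,$l5,$l15 crond=$n"
}
```

Calling `pileup_summary` in a loop with a `sleep` between iterations, from the already-open ssh session, gives a small timeline without relying on cron itself.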

Anyway, I'm having two systems shipped back here from a remote data center for further analysis.

Thank you all for your help and suggestions.

Rodrigo Barbosa

18 Jan

7:55 p.m.

On Wed, Jan 18, 2006 at 11:38:38AM -0800, Fong Vang wrote:

> We running htis kernel: kernel-smp-2.6.9-11.EL.i686.rpm.

I have a fairly good idea what it is: context switch storm.

We have been seeing it for quite some time. Always Intel hardware, usually Xeons, but sometimes P4 HTs too. It is a known condition, even though the bug itself is still elusive. It could be related to either the processor or the northbridge.

So far, the only way to stop the problem is switching to a non-SMP kernel.

Best Regards,

-- Rodrigo Barbosa rodrigob@suespammers.org "Quid quid Latine dictum sit, altum viditur" "Be excellent to each other ..." - Bill & Ted (Wyld Stallyns)

Fong Vang

8:15 p.m.

On 1/18/06, Rodrigo Barbosa rodrigob@suespammers.org wrote:

Are you supposed to see this anomaly when using vmstat to watch the context switch rate? I'm only seeing < 150 on these machines (but they're functioning now, so maybe we just haven't triggered the anomaly?).

We are using dual Xeons.

Which kernel would you recommend? I guess we'll have to try the kernel from CentOS 4.2.

Les Mikesell

8:19 p.m.

On Wed, 2006-01-18 at 14:15, Fong Vang wrote:

> Which kernel would you recommend? I guess we'll have to try the kernel from CentOS 4.2.

Did you have some reason to not do a full update to 4.2?

-- Les Mikesell lesmikesell@gmail.com

Fong Vang

8:29 p.m.

On 1/18/06, Les Mikesell lesmikesell@gmail.com wrote:

> Did you have some reason to not do a full update to 4.2?

We have to stick with 4.1 until the next release because our process requires it. A release starts with Engineering, then it goes into QA, and eventually hits Production. We can't simply switch to 4.2 without going through the formal process (which can take a while), but I can certainly try out the new kernel on one system and revert afterward.

Rodrigo Barbosa

8:54 p.m.

On Wed, Jan 18, 2006 at 12:29:48PM -0800, Fong Vang wrote:

> We can't simply switch to 4.2 without going through the formal process (which can take a while), but I can certainly try out the new kernel on one system and revert afterward.

I would recommend against that.

Simply use the non-SMP version of the kernel you are already running:

kernel-2.6.9-11.EL.i686.rpm

If you start changing too many things, or making too great a change, you will never be able to know what fixed the issue, let alone document it.

-- Rodrigo Barbosa rodrigob@suespammers.org "Quid quid Latine dictum sit, altum viditur" "Be excellent to each other ..." - Bill & Ted (Wyld Stallyns)

Rodrigo Barbosa

8:20 p.m.

On Wed, Jan 18, 2006 at 12:15:41PM -0800, Fong Vang wrote:

> Are you supposed to see this abnomally when using vmstat to see context switch rate? I'm only see < 150 on these machines (but they're functioning now so maybe we just haven't trigger the abnomally?).
>
> Which kernel would you recommend? I guess we'll have to try the kernel from CentOS 4.2.

Yes, when the problem starts, your cs rate should go to at least 2K. Sometimes you can get it past 100K (reported, not seen it myself).

Any non-smp kernel should solve your problem.
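To catch the moment the rate crosses that line, vmstat output can be filtered for samples at or above a threshold. A minimal sketch, assuming the usual vmstat layout in which the second header line names a `cs` column; `flag_cs_storm` is a hypothetical name, and the 2000 default simply mirrors the "at least 2K" figure above.

```sh
#!/bin/sh
# Hypothetical filter: read vmstat output on stdin and print every sample
# whose context-switch rate is at or above the threshold.
# $1 = threshold in context switches per second (default 2000).
# Exit status is 0 if at least one sample was flagged, 1 otherwise.
flag_cs_storm() {
    threshold="${1:-2000}"
    awk -v limit="$threshold" '
        # Until the "cs" column has been located, treat lines as headers.
        col == 0 { for (i = 1; i <= NF; i++) if ($i == "cs") col = i; next }
        # Data lines have a number in that column; repeated headers do not.
        $col ~ /^[0-9]+$/ && $col + 0 >= limit { print "cs=" $col; found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

For example, `vmstat 5 12 | flag_cs_storm 2000` would watch one minute of samples; locating the column by its header name avoids hard-coding a field number.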

-- Rodrigo Barbosa rodrigob@suespammers.org "Quid quid Latine dictum sit, altum viditur" "Be excellent to each other ..." - Bill & Ted (Wyld Stallyns)

Les Mikesell

8 p.m.

On Wed, 2006-01-18 at 13:38, Fong Vang wrote:

My first guess would be that something is consuming all possible memory and pushing everything else into swap. The system may not be completely hung, but it can't respond in a reasonable amount of time. If the logs for whatever services you run don't show anything, I'd watch with top over a period of time to see if a single program is doing it, and run frequent "ps ax" checks to see if a large number of small processes are accumulating. You can get a hint about how fast new processes are being started by looking at the process id of the ps process when you run it repeatedly. I assume from the fact that you have 20 boxes that you are doing something that causes substantial load - perhaps it needs to be distributed better.
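The process-id trick can be turned into a rough number. The sketch below assumes PIDs are handed out sequentially and wrap at the kernel's pid_max (32768 by default on Linux, readable from /proc/sys/kernel/pid_max); `sample_pid` and `pid_delta` are hypothetical helper names, and each sample itself consumes a PID or two, so small deltas are noise.

```sh
#!/bin/sh
# Hypothetical helper: start a throwaway shell and print the PID it received.
sample_pid() {
    sh -c 'echo $$'
}

# Number of PIDs handed out between two samples, allowing for wraparound.
# $1 = earlier PID, $2 = later PID, $3 = size of the PID space (default 32768).
pid_delta() {
    max="${3:-32768}"
    echo $(( ($2 - $1 + max) % max ))
}
```

Taking `a=$(sample_pid)`, sleeping ten seconds, taking `b=$(sample_pid)` and then running `pid_delta "$a" "$b"` gives an approximate count of processes started in that window.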

-- Les Mikesell lesmikesell@gmail.com

Fong Vang

8:24 p.m.

On 1/18/06, Les Mikesell lesmikesell@gmail.com wrote:

These systems will be doing a lot once we turn on the service, but we're still in the setup mode.

So far, the only thing we've done is kick these systems from the same image/profile. We've turned off all services, with almost nothing running on them at all. That's what's baffling about this. The hang is so silent that it is very difficult to troubleshoot (again, the system responds to ping, load average is normal, context switching is normal, swap is normal, and network and I/O are normal).

We'll have to look at the hardware next to determine if they are indeed the same.

Maciej Żenczykowski

8:13 p.m.

> We running htis kernel: kernel-smp-2.6.9-11.EL.i686.rpm.

I'd suggest updating to CentOS 4.2 and the newest kernel-smp-2.6.9-22.0.2.EL.i686 and verifying whether firmware/BIOS is up2date. Do the same machines always crash? Common hardware denominator?

Cheers, MaZe.

Paul Heinlein

8:59 p.m.

On Wed, 18 Jan 2006, Fong Vang wrote:

Other folks have hit on the best starting points. For diagnosis, however, you might want to cobble up a cron script that can run every minute:

#!/bin/sh
#
# season to taste...
(
  top -n 1 -b   # also provides a timestamp
  vmstat
  iostat
  ps axf
) >> /var/log/troubleshooting.log 2>&1

The resulting log will be verbose and will grow quickly, but it'll likely contain strong hints of any process-related problems. What it won't do, of course, is provide indications of hardware faults.
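On a machine that stalls, a per-minute job can itself back up, so it may be worth skipping a run while the previous one is still going. A sketch under stated assumptions: `run_snapshot` is a hypothetical wrapper, the lock is a plain directory created with `mkdir` (which fails atomically if it already exists), and the paths in the usage note are illustrative only.

```sh
#!/bin/sh
# Hypothetical wrapper: run one snapshot command unless the previous run
# still holds the lock.
# $1 = lock directory, $2 = log file, remaining arguments = command to run.
run_snapshot() {
    lockdir="$1"; logfile="$2"; shift 2
    # mkdir fails if the directory exists, i.e. if another run is in progress.
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "$(date) previous snapshot still running, skipping" >> "$logfile"
        return 1
    fi
    {
        echo "=== $(date) ==="       # explicit delimiter between snapshots
        "$@"
    } >> "$logfile" 2>&1
    status=$?
    rmdir "$lockdir"                 # release the lock for the next run
    return "$status"
}
```

It could be invoked from cron as, for instance, `run_snapshot /var/run/troubleshooting.lock /var/log/troubleshooting.log sh -c 'top -n 1 -b; vmstat; iostat; ps axf'`. The "skipping" lines in the log then double as a record of when the box was too slow to finish a snapshot within a minute.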

-- Paul Heinlein <> heinlein@madboa.com <> www.madboa.com
