Hello!
I am running CentOS-5 with the latest kernel available by default (2.6.23). I installed it on a Dell XPS machine with an Intel quad-core processor (4 CPUs). I use it to run a computational program, and I need to keep the program running continuously for 1-2 months. I generally boot into runlevel 3, with networking on and without X, and use ssh from another machine to connect and start the program with the "nohup" utility.
However, the system automatically gets suspended after 4-5 hours: the computational program stops, ssh stops working, and the whole OS seems to freeze. I have stopped the "acpid" daemon and booted the kernel with the "acpi=off" option in "grub.conf", but it did not help. The kernel log (/var/log/messages) doesn't show anything special. From the moment of suspension, the kernel also stops logging to "/var/log/messages".
Please help me out. I think this is a kernel problem. I have run programs continuously for days on end using FC5 (which had an older kernel). I can't use FC5 or an older version of CentOS because I need GCC 4.1.2+ to compile a parallel OpenMP program.
Thank you,
Chandra
On Tue, Feb 05, 2008 at 04:31:57PM +0900, Chandra wrote:
Hello!
I am running CentOS-5 with the latest kernel available by default (2.6.23).
NO! You are no longer running CentOS-5 if you change your kernel for your own version...
Tru
Dear Tru, Thank you for your mail. I didn't change anything at all. It is just the default installation. However, I AM running CentOS-5. In fact, when I ran "# rpm -q kernel", it didn't give me any result. I didn't go into the details, assuming that something was missing at the trailing end of my command, i.e. kernel-<version-no>.
Anyway, I would appreciate it if you could help me solve my problem.
Thanks a lot,
Chandra
Chandra wrote:
Dear Tru, Thank you for your mail. I didn't change anything at all. It is just the default installation. However, I AM running CentOS-5. In fact, when I ran "# rpm -q kernel", it didn't give me any result. I didn't go into the details, assuming that something was missing at the trailing end of my command, i.e. kernel-<version-no>.
So what does "uname -a" say? We never shipped a 2.6.23 kernel.
Cheers,
Ralph
On Tue, Feb 05, 2008 at 06:50:52PM +0900, Chandra wrote:
Dear Tru, Thank you for your mail. I didn't change anything at all. It is just the default installation.
The person who installed it for you lied.
However, I AM running CentOS-5. In fact, when I ran "# rpm -q kernel", it didn't give me any result. I didn't go into the details, assuming that something was missing at the trailing end of my command, i.e. kernel-<version-no>.
[tru@blackwilson ~]$ rpm -q kernel
kernel-2.6.18-53.1.6.el5
[tru@blackwilson ~]$ rpm -q centos-release
centos-release-5-1.0.el5.centos.1
[tru@blackwilson ~]$ uname -a
Linux blackwilson.xxx 2.6.18-53.1.6.el5 #1 SMP Wed Jan 23 11:28:47 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
That is what you should have on a current CentOS-5 x86_64 machine. There is no kernel 2.6.23 from CentOS.
Try to install the default kernel with:
# yum install kernel
And report any errors once you have rebooted with the CentOS kernel.
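The check described above can be scripted. What follows is a minimal sketch, assuming that stock CentOS-5 kernels are always 2.6.18-based and carry an "el5" tag, as in the outputs quoted in this thread. The function name is_centos5_kernel is made up for illustration.

```shell
#!/bin/sh
# Sketch: decide whether a kernel release string looks like a stock
# CentOS-5 kernel.
# Assumption: stock CentOS-5 kernels are 2.6.18-based and tagged "el5"
# (e.g. 2.6.18-53.1.6.el5 or 2.6.18-53.el5PAE).
is_centos5_kernel() {
    case "$1" in
        2.6.18-*el5*) return 0 ;;
        *)            return 1 ;;
    esac
}

# Check the kernel that is actually running right now.
release=$(uname -r)
if is_centos5_kernel "$release"; then
    echo "$release looks like a CentOS-5 kernel"
else
    echo "$release does not look like a CentOS-5 kernel"
fi
```

The pattern match only inspects the version string. It does not prove that the kernel came from a CentOS package, so "rpm -q kernel" remains the authoritative check.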
Anyway, I would appreciate if you can help me to get my problem solved.
There is no support for non-CentOS kernels, sorry.
Tru
Sorry for not writing very clearly. Here are the details:
$ rpm -qa|grep kernel
kernel-PAE-2.6.18-53.el5
$ uname -a
Linux localhost.localdomain 2.6.18-53.el5PAE #1 SMP Mon Nov 12 02:55:09 EST 2007 i686 i686 i386 GNU/Linux
$ rpm -q centos-release
centos-release-5-1.0.el5.centos.1
Also, I found that if I boot the system with nothing attached (i.e. no mouse, no keyboard, no monitor) and attach them later, none of them is recognized. All I want is a network-connected box to run the program; I can control the program running on it using ssh from another machine.
One more thing I would like to add: when I log into a GUI session (GNOME) and run my program in a terminal there, it lasts for 1-2 days before suspension. However, if I log in in console mode (no GUI), it lasts only 4-5 hours before suspension. Again, I think it is entirely a kernel problem. Would recompiling the kernel with some specific options help?
Thanks for your time!
I'll let others work with you on the kernel version .... (which we'll assume is OK and a true CentOS install).
I would put up a console on the local KVM port to capture the last set of messages before the system hangs -- which might help isolate the problem.
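If no monitor can stay attached, one alternative to a local console (a related technique, not the one suggested above) is the kernel's netconsole module, which sends kernel messages over UDP to another machine. The following is a rough sketch that only builds and prints the modprobe command. The IP addresses, interface name, MAC address and port numbers are placeholders, netconsole_param is a hypothetical helper, and whether the module is available on a given kernel is an assumption to verify.

```shell
#!/bin/sh
# Sketch: build the netconsole module parameter in the form
#   netconsole=<src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
# All addresses and ports below are placeholders, not values from this thread.
netconsole_param() {
    # $1 = local IP, $2 = local interface, $3 = remote IP, $4 = remote MAC
    echo "netconsole=6665@$1/$2,6666@$3/$4"
}

# Print the command instead of running it, so it can be reviewed first.
echo "modprobe netconsole $(netconsole_param 192.168.0.10 eth0 192.168.0.20 00:11:22:33:44:55)"
```

On the receiving machine, any UDP listener on the chosen target port (for example netcat or a syslog daemon) can record the messages that arrive just before the hang.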
From what we've seen so far, it sounds like you might have a hardware problem. The things that I would check are:
- power supply (aka losing voltage)
- all the system fans (aka thermal shutdown)
- memory (run memtest86+ overnight [or longer])
If not that (and still a hardware problem), it is a lot more subtle and will be "fun" to diagnose ....
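For the thermal item in the list above, one way to gather evidence is to log temperature readings at intervals and flush them to disk, so that the last entry before a freeze survives. This is a minimal sketch, assuming the "sensors" command from lm_sensors is installed and configured for this board; the log path, the 60-second interval and the helper name log_reading are illustrative.

```shell
#!/bin/sh
# Sketch: append a timestamped reading to a log file and flush it to disk,
# so the last entry is still there after a hard freeze.
# Assumption: "sensors" (lm_sensors) works on this machine; any other
# command that prints temperatures can be substituted.
log_reading() {
    logfile=$1
    shift
    {
        date '+%Y-%m-%d %H:%M:%S'
        "$@" 2>&1          # run the reading command, keep its errors too
    } >> "$logfile"
    sync                   # force the write to disk before a possible hang
}

# Usage (runs until interrupted):
#   while true; do log_reading /var/log/temps.log sensors; sleep 60; done
```

If the last readings before each freeze show steadily rising temperatures, that points toward a cooling problem; flat readings make a thermal shutdown less likely.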
Hope this helps (a little) ...
-rak-
| - power supply (aka losing voltage)
| - all the system fans (aka thermal shutdown)
| - memory (run memtest86+ overnight [or longer])
Thanks for your reply.
First, I can rule out any fluctuation in the power supply.
As far as thermal shutdown is concerned, I don't know how to tell whether that is the case. However, it is a brand-new PC, so I hope it is working as it should. But I will check it out properly.
memtest86+: I will run it overnight.
As David asked:
| What do you have to do to get the box out of "suspend?" If the system
| is frozen and you have to reboot the box to "unfreeze" it, I'd guess
| it's a heat issue.
Yes, I need to push the power button to restart it.
Thanks once again. I will report the outcome as soon as I can.
- Chandra
on 2/5/2008 7:48 AM Chandra spake the following:
| - power supply (aka losing voltage)
| - all the system fans (aka thermal shutdown)
| - memory (run memtest86+ overnight [or longer])
Thanks for your reply.
First, I can rule out any fluctuation in the power supply.
As far as thermal shutdown is concerned, I don't know how to tell whether that is the case. However, it is a brand-new PC, so I hope it is working as it should. But I will check it out properly.
memtest86+: I will run it overnight.
As David asked:
| What do you have to do to get the box out of "suspend?" If the system
| is frozen and you have to reboot the box to "unfreeze" it, I'd guess
| it's a heat issue.
Yes, I need to push the power button to restart it.
Thanks once again. I will report the outcome as soon as I can.
- Chandra
Do you have any power saving settings turned on in the bios?
This is a Dell computer and it has a power management option in the BIOS. However, I changed it to the non-power-saving mode. After this, I found that the computer restarts rather than hanging.
===========================================================
AN ERROR IS SHOWING UP AT BOOT TIME. It seems to be a BUG:
===========================================================
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
..MP-BIOS bug: 8254 timer not connected to IO-APIC
Red Hat nash version 5.1.19.6 starting
Welcome to CentOS release 5 (Final)
....
.....
and continues normal booting.
Any idea how to deal with it? Please note that it has 4 CPUs.
Thanks a lot, - Chandra
Check the Release Notes. It is apparently harmless. I see it on all my CentOS 5.1 machines.
B.J.
Ubuntu 7.10, Linux 2.6.22-14-generic unknown 07:57:57 up 21:37, 2 users, load average: 0.15, 0.15, 0.13
On Wed, Feb 06, 2008 at 09:48:35PM +0900, Chandra wrote:
Do you have any power saving settings turned on in the bios?
This is a Dell computer and it has a power management option in the BIOS. However, I changed it to the non-power-saving mode. After this, I found that the computer restarts rather than hanging.
Looks like some hardware crash to me, otherwise you would have some logs for oops/hangs.
Can you make your /var/log/messages available somewhere (don't send a file of a few MB to the list), along with the content of /proc/cmdline?
You said you used "acpi=off" and disabled acpid; is that still the case?
~> chkconfig --list cpuspeed
cpuspeed 0:off 1:on 2:off 3:off 4:off 5:off 6:off
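Whether "acpi=off" actually reached the running kernel can be verified from /proc/cmdline rather than from grub.conf. The following is a small sketch; the helper name has_boot_option is hypothetical, and it simply compares whitespace-separated words.

```shell
#!/bin/sh
# Sketch: test whether a kernel command line contains an exact boot option.
has_boot_option() {
    # $1 = full command line, $2 = option to look for (e.g. acpi=off)
    # $1 is left unquoted on purpose so it splits into words.
    for word in $1; do
        [ "$word" = "$2" ] && return 0
    done
    return 1
}

# Read the command line of the running kernel (empty if unavailable).
cmdline=$(cat /proc/cmdline 2>/dev/null)
if has_boot_option "$cmdline" "acpi=off"; then
    echo "acpi=off is active"
else
    echo "acpi=off is NOT on the running kernel's command line"
fi
```

If the option is missing here even though it appears in grub.conf, the machine was probably booted from a different grub entry than the one that was edited.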
Try a burn-in test with tools like prime95:
ftp://mersenne.org/gimps/mprime2414.tar.gz
ref:
http://www.playtool.com/pages/prime95/prime95.html
http://www.mersenne.org/freesoft.htm
Tru