I've got an odd situation that I need some advice on. I have two computers that I am planning to use as a cluster. I initially started with some leftover Compaq Presarios with 667MHz CPUs. I loaded CentOS 4.3 and later updated to 4.4. Things ran normally, albeit slowly. I had an opportunity to upgrade to a pair of IBM NetVistas with 2.26 GHz CPUs. I did this by transferring the 160GB Western Digital IDE disks and NICs, but did not re-install the OS; I just migrated the disks. Since then they have had the following symptoms:
-Systems frequently boot faster than the disks can spin up, and have to be soft-booted before they will recognize and boot from the disks.
-Systems will bog down critically after approximately 24 hours, losing system time at an increasing rate. For instance, a loop that runs date and hwclock and then sleeps for 10 minutes (a sketch appears below) will show the two in sync for the first few hours; then the system time begins to fall behind at an increasing rate, and after 24 hours it essentially stops elapsing. It almost feels like the box has trouble processing interrupts. Once it gets to this state, performance becomes very sluggish: top will take up to 90 seconds to display its first screen and will not update on its own, only when Enter is pressed. At times top shows zeros across the utilization line for everything, including idle. I have gone as far as booting one of the boxes into single-user mode and running the date/hwclock loop, and even in that state the system bogs down and gradually stops elapsing time after 18-24 hours. Even shutting down is impacted: a reboot will take well over an hour to complete.
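For the record, the loop is essentially the following (a sketch; field 4 assumes the stock util-linux hwclock output format, and hwclock needs root):

    #!/bin/sh
    # Log the system time next to the hardware clock every 10 minutes.
    # When the kernel starts losing timer ticks, the two drift apart.
    while true; do
        sys=$(date '+%H:%M:%S')
        hw=$(hwclock | awk '{print $4}')
        echo "sys: $sys   hw: $hw"
        sleep 600
    done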
I had a copy of Ubuntu on my desk and booted that distro from CD; it passes the date/hwclock test (it actually lost 2 seconds over a 24-hour period, but I can live with that via ntp). I'm downloading the CentOS 4.4 live CD and will try that as well, but at this point does anyone see how this could be something other than a disk incompatibility with the newer systems? Should I try re-installing? If it is the disks, any thoughts on something I could try to avoid buying new ones? I have tried setting the BIOS to both the high-performance and legacy disk modes (not entirely sure what's behind that IBMism).
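On the disk front, one thing I plan to verify is that the drives are still using DMA under the new chipset (a sketch; I'm assuming the drive shows up as /dev/hda):

    # A silent fallback to PIO would explain disk-related sluggishness.
    hdparm -d /dev/hda                  # "using_dma = 1 (on)" is the good case
    hdparm -i /dev/hda | grep -i dma    # negotiated DMA/UDMA modes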
Regards, Chuck
On Sun, 2007-02-18 at 08:52 -0500, Chuck Mattern wrote:
<snip>
-Systems frequently boot faster than the disks can spin up, and have to be soft-booted before they will recognize and boot from the disks.
I doubt this is related, but I had a similar situation with a couple of brand-new disks that I installed. Thought I would mention it, JIC. Us older folks used to store unused jumpers on pins on the HDs. A ground-to-ground connection never did any harm. I used this same scheme on the new disks. The bootable master had no problems. The secondary on IDE-2 did exactly what you described. One day I got sick of it, popped it out, noted that the pins were "undocumented", and removed the jumper.
Problem solved. *sigh* Back to Scotch-taping spare caps to the HD case and replacing the deteriorated tape every once in a while.
Anyway, could you have a similar situation that is causing some long-term effect that I did not see (either it wasn't there, or my load didn't cause it to become noticeable, or I am unobservant)?
<snip>
HTH -- Bill
On 2/18/07, Chuck Mattern camattern@acm.org wrote:
-Systems will bog down critically after approximately 24 hours, losing system time at an increasing rate,
We have several IBM boxes (mostly NetVistas) at work that would exhibit similar behavior: they run normally at first, then after a few hours the system clock practically stops. I measured 2 minutes of wall-clock time for a "sleep 1" to return, and up to 20 seconds for "usleep 1"... We tried updating the BIOS, the kernel (4.3, 4.4, updates), and various combinations of boot-time parameters (as in: clock={pit|pmtmr|..}, noapic, acpi=off and the like), without improvement. We eventually gave up on those boxes for lack of time.
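A sketch of one way to reproduce the measurement (the hardware clock serves as the reference, since the system clock itself is what is suspect):

    # Bracket the sleeps with hardware clock reads; the RTC keeps
    # ticking even when the kernel loses timer interrupts.
    hwclock ; sleep 1 ; hwclock
    hwclock ; usleep 1 ; hwclock    # usleep ships with initscripts on RHEL/CentOS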
See also: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=203818 The solution in comment 18 may help (that comment was posted after we'd given up, so I'm not sure it was tried).
Cheers, Zoran
First, a grateful thank-you to the individuals who posted responses, and an apology for not turning this around faster. My day job is plagued by a project that distinctly resembles the maiden voyage of RMS Titanic, and it has been getting the best of me.
I'm trying the later fixes suggested in the bugzilla entry Zoran pointed to below (booting with "noapic noacpi apic=off acpi=off"). Interestingly, I had already noted several of the issues covered in the responses: different IRQ handling, and IDE disk involvement (the onset of the time problems seems to be hastened by heavy disk I/O). I'll report back on the results.
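For anyone wanting to try the same thing, the parameters go on the kernel line in /boot/grub/grub.conf; a sketch (the kernel version and root= value here are placeholders, keep whatever your entry already has):

    title CentOS (2.6.9-42.0.8.EL)
            root (hd0,0)
            kernel /vmlinuz-2.6.9-42.0.8.EL ro root=LABEL=/ noapic noacpi apic=off acpi=off
            initrd /initrd-2.6.9-42.0.8.EL.img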
Regards, Chuck
Zoran Milojevic wrote:
<snip>
I now have both servers booting with no APIC and no ACPI and keeping perfect time (with ntp running; without it there was a minor drift of a couple of seconds over the course of a day). What is the downside to running this way? I noticed that certain interrupts are handled differently. One of my co-workers got the impression that this is part of a hardware abstraction layer that could lead to virtualization opportunities. Any idea what the purpose of moving interrupts (in my case for the network cards and USB) to virtual interrupts is? Sample below:
    [root@remus ~]# cat /proc/interrupts
               CPU0
      0:   36662232    IO-APIC-edge   timer
      1:          9    IO-APIC-edge   i8042
      8:         26    IO-APIC-edge   rtc
      9:          0    IO-APIC-level  acpi
     12:         67    IO-APIC-edge   i8042
     14:      22214    IO-APIC-edge   ide0
     15:     329561    IO-APIC-edge   ide1
    177:          0    IO-APIC-level  uhci_hcd
    185:        218    IO-APIC-level  uhci_hcd
    193:          0    IO-APIC-level  uhci_hcd
    201:          2    IO-APIC-level  ehci_hcd
    217:      26007    IO-APIC-level  eth0, eth1
    225:    1030795    IO-APIC-level  eth2
    NMI:          0
    LOC:   36663881
    ERR:          0
    MIS:          0
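As a rough check on the interrupt theory, a script like this samples the timer interrupt count (a sketch; it assumes a single-CPU layout like the sample above, with the count in the second column):

    #!/bin/sh
    # Sample the "timer" count from /proc/interrupts twice, nominally
    # 10 seconds apart, and print the rate. Stock CentOS 4 kernels tick
    # at roughly 1000/s. Note that sleep runs off the same clock that
    # is suspect here, so treat the result as an indicator, not exact.
    t1=$(awk '/timer/ {print $2}' /proc/interrupts)
    sleep 10
    t2=$(awk '/timer/ {print $2}' /proc/interrupts)
    echo "timer interrupts/sec: $(( (t2 - t1) / 10 ))"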
Regards, Chuck
Chuck Mattern wrote:
<snip>
Don't know if this is related, but I have a similar problem on my NetVista 2.53 GHz. However, it seems to be kernel-related: I can run 2.6.9-34.0.2 with no problems, but with any kernel after that I experience similar problems to yours. Sorry I can't help more, but just letting you know you're not alone. I have the 'latest' BIOS installed as well.
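If it helps in the meantime, the known-good kernel can be pinned by pointing "default" at its entry in /boot/grub/grub.conf; a sketch (the index is zero-based and depends on the order of your own entries):

    # /boot/grub/grub.conf (sketch)
    # default is a zero-based index into the title entries below;
    # here entry 1 is the known-good 2.6.9-34.0.2 kernel.
    default=1
    timeout=5

    # entry 0: newer kernel, shows the problem
    title CentOS-4 (2.6.9-42.0.8.EL)
            ...
    # entry 1: known-good kernel
    title CentOS-4 (2.6.9-34.0.2.EL)
            ...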
-Eddie
-----Original Message-----
From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Chuck Mattern
Sent: Sunday, February 18, 2007 8:53 AM
To: centos@centos.org
Subject: [CentOS] CentOS 4.4-IBM NetVista Performance Problems, help needed.
<snip>