I have two HP dc7800 convertible minitowers that are exhibiting the following issue: every 5-10 minutes, they will "freeze" for about 30 seconds, and then pick right back up again. During the freeze, it seems that nothing at all happens on the system; the clock doesn't even advance (it just picks up again with the next second, and that 30-or-so seconds are lost).
I've tried both CentOS 5.8 and 5.7, thinking it was a kernel incompatibility, but the problem happened with both versions. I have tried different hard drives, different memory, even swapped the entire machine, and the problem exists everywhere. I have tried adding "pci=nommconf" to the kernel line, as that was reported as being necessary back with 5.2 on these machines, but that made no difference (and it shouldn't be necessary now anyway, as I believe the issue has either been fixed or worked around).
I am stuck, and can't figure out where to even suspect the problem might actually be. There are no errors getting logged anywhere that I can find, probably because everything just "stops" temporarily, so there's nothing for the system to log.
Does anyone have any idea where I could look to fix this? I think I am next going to go back to 5.2, where the pci=nommconf is necessary, because at least back that far it appears to have been working for other people. However, I really would like to have this running 5.8.
Thanks!
--- Mike VanHorn Senior Computer Systems Administrator College of Engineering and Computer Science Wright State University 265 Russ Engineering Center 937-775-5157 michael.vanhorn@wright.edu http://www.cecs.wright.edu/~mvanhorn/
Vanhorn, Mike wrote:
<snip> When you say "swapped the entire machine", what did you do? Also, what's running on them? Have you tried running top -d 10 or smaller (that will update the screen every 10 seconds; I only recently found that current top allows tenths of a second)?
If you don't see anything, I'd suggest you call HP, assuming they're still under warranty.
mark
From: "m.roth@5-cent.us" m.roth@5-cent.us
Vanhorn, Mike wrote:
If you don't see anything, I'd suggest you call HP, assuming they're still under warranty.
They apparently do not support Linux on these models... So you might not get any help from HP support. Do you have the latest BIOS? Did you get a CD to run tests (like Insight Diagnostics Offline)?
JD
On 7/25/12 12:07 PM, "John Doe" jdmls@yahoo.com wrote:
Do you have the latest BIOS?
Yes.
Did you get a CD to run tests (like Insight Diagnostics Offline)?
Yes, I used my copy of the UBCD to run memory and hard drive diagnostics, and both passed.
On 7/25/12 11:24 AM, "m.roth@5-cent.us" m.roth@5-cent.us wrote:
When you say "swapped the entire machine", what did you do?
I have two of them, and thinking it was the hardware on the one, I moved the hard drive to the second, but the problem existed there, too. That points to something with the software, but, well, I haven't found anything yet.
Also, what's running on them? Have you tried running top -d 10 or smaller (that will update the screen every 10 seconds; I only recently found that current top allows tenths of a second)?
I haven't tried top, but that's a good idea. I usually have one window open that runs uptime every second in a continuous loop, mainly so I can tell exactly when it happens. When the problem was first noticed, we were running VLSI software on it, but at this point the only thing on the machine is the operating system, and I'm going through my step-by-step configuration until I notice the problem occurring.
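The uptime-every-second loop is essentially a stall detector. A minimal Python sketch of the same idea follows; the helper names `find_gaps` and `watch` and the 5-second tolerance are illustrative assumptions, not anything from the thread. It sleeps one second per tick and reports any tick that arrives much later than expected.

```python
import time


def find_gaps(timestamps, interval=1.0, tolerance=5.0):
    """Return (start, end, length) for each pair of consecutive
    timestamps that are more than interval + tolerance seconds apart."""
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        if delta > interval + tolerance:
            gaps.append((prev, cur, delta))
    return gaps


def watch(interval=1.0, tolerance=5.0, iterations=None):
    """Tick every `interval` seconds (forever if iterations is None) and
    print a line whenever a tick arrives much later than expected."""
    fmt = "%H:%M:%S"
    last = time.time()
    count = 0
    while iterations is None or count < iterations:
        time.sleep(interval)
        now = time.time()
        for start, end, length in find_gaps([last, now], interval, tolerance):
            print("stall of %.1fs between %s and %s" % (
                length,
                time.strftime(fmt, time.localtime(start)),
                time.strftime(fmt, time.localtime(end))))
        last = now
        count += 1


if __name__ == "__main__":
    watch()
```

It can only flag a freeze if the system clock keeps counting through it. Running one copy on the local console and another in a remote session would also help separate a local stall from a network one.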
On 07/25/2012 04:34 PM, Vanhorn, Mike wrote:
I have several HP dc7x00 machines, and I've never seen that problem with CentOS 5 or 6.
Do you also see the problem if you boot in runlevel 3, i.e. without X?
Mogens
On 7/25/12 12:04 PM, "Mogens Kjaer" mk@lemo.dk wrote:
I've several HP dc7x00 machines, and I've never seen that problem with centos 5 or 6.
I do, too. Things are fine on our 7900s, and the 8000-series machines we have. I'm only seeing it on these two 7800s.
Do you also see the problem if you boot in runlevel 3, i.e. without X?
Yes. I was thinking it maybe had something to do with the graphics card, so I left it in runlevel 3, but the problem still persisted. It still may be the graphics card, though, come to think of it, so I may need to try taking it out.
On Wed, 25 Jul 2012, Vanhorn, Mike wrote:
*snip*
Hi Mike. Are you on 32-bit or 64-bit?
If 32-bit, you might like to take a look at qps, a process viewer which I compiled and packaged for CentOS 5.5 32-bit; it works OK on 5.8 as well:
Package Signing Key: www.karsites.net/centos/downloads/5.6/karsites-GPG-public-key-2011-03-18.asc
32 bit binary RPM: www.karsites.net/centos/downloads/5.6/qps-1.9.18.6-1.i386.rpm
Fedora 6 source code I rebuilt qps from: www.karsites.net/centos/downloads/5.6/qps-1.9.18.6-1.fc6.src.rpm
If you click on the %MEM or %CPU headings, this will toggle the sort order of the running processes by highest to lowest and v/v for those headings - same applies to the other headings.
Kind Regards,
Keith Roberts
----------------------------------------------------------- Websites: http://www.karsites.net http://www.php-debuggers.net http://www.raised-from-the-dead.org.uk
All email addresses are challenge-response protected with TMDA [http://tmda.net] -----------------------------------------------------------
On 7/25/12 12:22 PM, "Keith Roberts" keith@karsites.net wrote:
Hi Mike. Are you on 32 or 64 bits ?
64-bit. I have thought of trying 32-bit, just to see if it made a difference, but even if it did, that wouldn't help me, because we need 64-bit for the software we're running anyway.
On 7/25/12 10:34 AM, "Vanhorn, Mike" michael.vanhorn@wright.edu wrote:
As a followup, I've determined that it is network related, but I'm still not sure what the problem is. I did go back to CentOS 5.2, but the problem still exists with that version, too.
Basically, what seems to be happening is that the network freezes for around 30 seconds and then picks right back up. There are no errors in any logs that I can find, and processes that run locally and depend only on local resources keep right on going without a problem.
I have tried using a different network card (as opposed to the one on the motherboard), but the problem happens with that, too. It almost has to be a configuration issue or a BIOS setting, but I don't get it.
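Once the symptom is narrowed to the network, the same kind of loop can time network round trips instead of the local clock. A rough Python sketch, assuming some reachable host that accepts TCP connections on a known port; the function names and thresholds are hypothetical choices for illustration:

```python
import socket
import time


def tcp_connect_time(host, port, timeout=60.0):
    """Return the seconds taken to open (and close) a TCP connection to
    host:port, or None if the attempt failed or timed out."""
    start = time.time()
    try:
        sock = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        return None
    sock.close()
    return time.time() - start


def probe(host, port, interval=1.0, slow=5.0, iterations=None):
    """Probe host:port every `interval` seconds (forever if iterations is
    None) and print any attempt that fails or takes over `slow` seconds."""
    count = 0
    while iterations is None or count < iterations:
        elapsed = tcp_connect_time(host, port)
        stamp = time.strftime("%H:%M:%S")
        if elapsed is None:
            print("%s connect to %s:%d failed" % (stamp, host, port))
        elif elapsed > slow:
            print("%s connect to %s:%d took %.1fs" % (stamp, host, port, elapsed))
        time.sleep(interval)
        count += 1
```

For example, `probe("192.0.2.10", 22)`, with the documentation address standing in for a real server, would print a line for every connection attempt that failed or took longer than five seconds, which a 30-second network freeze should produce.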
From: "Vanhorn, Mike" michael.vanhorn@wright.edu
Some kind of power-saving "feature", or Wake-on-LAN...?
JD
On 07/27/2012 07:23 AM, Vanhorn, Mike wrote:
That sounds like a timeout of some kind. Do you have many (thousands per minute) transient network connections in normal operation? If so, you might be running into the open file limit, if you haven't raised it.
Look at /etc/security/limits.conf and try adding a line like the following (the "*" applies it to all users, and "-" sets both the soft and the hard limit):
* - nofile 64000
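To check whether a process is actually near the limit before raising it, a small Python sketch can compare the current `RLIMIT_NOFILE` values with the number of descriptors listed under `/proc/<pid>/fd`. This is Linux-specific, and the helper names are hypothetical:

```python
import os
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count(pid="self"):
    """Count the descriptors a process has open, via Linux /proc.

    Inspecting another user's process usually requires root."""
    return len(os.listdir("/proc/%s/fd" % pid))


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print("open fds: %d  soft limit: %s  hard limit: %s"
          % (open_fd_count(), soft, hard))
```

Because pam_limits applies limits.conf when a login session starts, a changed value only shows up in processes started from a new login.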
It's not necessarily network hw or sw that's at fault. I once had a similar problem caused by the (3rd party) driver of the onboard "RAID" controller. Newer driver version fixed it.
It turned out to be something very simple, but not something that was obvious to check at first. There was another computer (a Windows machine) that was supposed to have been taken out of service a long time ago, but someone had recently put it back on the network. Because it was supposed to be out of use, its IP address had been reallocated (a year and a half ago!) to the machine I have been agonizing over all week.
On someone's suggestion, I decided to put the problem PC on a different subnet, because we thought something might be amiss with the new networking hardware installed a month or so ago, and suddenly the problem went away. After some more investigation, we discovered that the IP address was still in use by the old machine, and thus stumbled across the actual problem.
Thank you to all who responded!
It's always the simplest things, in the last place you look...
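For a single address, iputils' `arping -D` performs duplicate address detection. As a rough Python alternative for watching addresses over time, the sketch below samples the Linux ARP cache in `/proc/net/arp` and reports any IP address seen with more than one MAC address. The helper names and sampling parameters are assumptions. It has to run on a different machine on the same subnet, since a host keeps no ARP entry for its own address, and because the cache only changes when traffic updates it, an empty result is inconclusive.

```python
import time
from collections import defaultdict

INCOMPLETE_MAC = "00:00:00:00:00:00"


def parse_proc_net_arp(text):
    """Yield (ip, mac) pairs from /proc/net/arp contents, skipping the
    header line and incomplete entries."""
    for line in text.splitlines()[1:]:
        fields = line.split()
        # Columns: IP address, HW type, Flags, HW address, Mask, Device
        if len(fields) >= 4 and fields[3] != INCOMPLETE_MAC:
            yield fields[0], fields[3].lower()


def find_duplicate_ips(observations):
    """Map every IP observed with more than one MAC to its sorted MACs."""
    macs = defaultdict(set)
    for ip, mac in observations:
        macs[ip].add(mac)
    return dict((ip, sorted(seen)) for ip, seen in macs.items() if len(seen) > 1)


def sample_arp(rounds=60, interval=10.0, path="/proc/net/arp"):
    """Read the ARP cache `rounds` times, `interval` seconds apart, and
    report IPs seen with more than one MAC across all samples."""
    observations = []
    for i in range(rounds):
        with open(path) as f:
            observations.extend(parse_proc_net_arp(f.read()))
        if i < rounds - 1:
            time.sleep(interval)
    return find_duplicate_ips(observations)


if __name__ == "__main__":
    print(sample_arp())
```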
Vanhorn, Mike wrote:
Glad you got it. I've been *really* busy all morning - a shutdown of the chilled water at 0-dark-30 meant shutting all the servers down yesterday, then bringing them back up once the chilled water came back on - but when I saw that you'd found it was a network issue, I was literally about to respond that it might not be your system's problem but something else on the network, when I saw SOLVED in your subject line.
Congrats. <snip> mark
On Fri, 27 Jul 2012, Vanhorn, Mike wrote:
To: CentOS mailing list centos@centos.org
From: "Vanhorn, Mike" michael.vanhorn@wright.edu
Subject: [CentOS] [SOLVED] Re: problem with machine "freezing" for short periods
Hi Mike.
I'm pleased you got this figured out now OK.
Since you mentioned earlier that it could be a network problem, I was going to suggest using Wireshark, which *could* have identified this problem for you pretty quickly.
http://en.wikipedia.org/wiki/Wireshark Wireshark is a free and open-source packet analyzer. It is used for network troubleshooting, analysis, software and communications protocol development, and education. Originally named Ethereal, in May 2006 the project was renamed Wireshark due to trademark issues.
Name       : wireshark
Arch       : i386
Version    : 1.0.15
Release    : 1.el5_6.4
Size       : 40 M
Repo       : installed (in updates repo)
Summary    : Network traffic analyzer
URL        : http://www.wireshark.org/
License    : GPL

Name       : wireshark-gnome
Arch       : i386
Version    : 1.0.15
Release    : 1.el5_6.4
Size       : 1.6 M
Repo       : installed
Summary    : Gnome desktop integration for wireshark and
           : wireshark-usermode
URL        : http://www.wireshark.org/
License    : GPL
Description: Contains wireshark for Gnome 2 and desktop
           : integration file
Maybe you could recreate this problem (two machines using the same IP address on the same network), then start the Wireshark GUI and see whether it spots this and complains with an informative error message?
Kind Regards,
Keith Roberts