I updated my home server with the 6.4 CR packages, and I've experienced 3 or 4 hard lockups since. The server is a fanless VIA C7 "CentaurHauls" system with a 1GHz CPU underclocked to 800MHz and 1GB of RAM. It has a dual-port Intel 82546GB NIC in its single PCI slot. (It also has an on-board Realtek RTL-8110SC/8169SC NIC that is plugged in, but doesn't currently have an IP address configured.)
This server provides a number of services -- DNS, DHCP, routing between VLANs, DLNA media server, CUPS, etc. Most importantly, it runs Asterisk and manages all of the phones in the house.
There's absolutely nothing in the logs related to the lockup. The system simply becomes totally unresponsive, to the point that the console cursor stops blinking. A hard reset is required to bring it back.
kernel-2.6.32-279.22.1.el6.i686 seems to be completely stable.
I don't really expect to be able to figure this out, but I thought I'd post here to see if anyone else is experiencing anything like this with this kernel.
Thanks!
On Sun, Mar 3, 2013 at 11:02 PM, Ian Pilcher arequipeno@gmail.com wrote:
I updated my home server with the 6.4 CR packages, and I've experienced 3 or 4 hard lockups since. The server is a fanless VIA C7 "CentaurHauls" system with a 1GHz CPU underclocked to 800MHz and 1GB of RAM. It has a dual-port Intel 82546GB NIC in its single PCI slot. (It also has an on-board Realtek RTL-8110SC/8169SC NIC that is plugged in, but doesn't currently have an IP address configured.)
This server provides a number of services -- DNS, DHCP, routing between VLANs, DLNA media server, CUPS, etc. Most importantly, it runs Asterisk and manages all of the phones in the house.
There's absolutely nothing in the logs related to the lockup. The system simply becomes totally unresponsive, to the point that the console cursor stops blinking. A hard reset is required to bring it back.
I'm running 2.6.32-358.0.1 on a KVM virtual machine and not seeing any issues. I haven't run that kernel on physical hardware yet, though.
kernel-2.6.32-279.22.1.el6.i686 seems to be completely stable.
I don't really expect to be able to figure this out, but I thought I'd post here to see if anyone else is experiencing anything like this with this kernel.
Thanks!
On Sun, Mar 3, 2013 at 11:02 PM, Ian Pilcher arequipeno@gmail.com wrote:
I updated my home server with the 6.4 CR packages, and I've experienced 3 or 4 hard lockups since. The server is a fanless VIA C7 "CentaurHauls" system with a 1GHz CPU underclocked to 800MHz and 1GB of RAM. It has a dual-port Intel 82546GB NIC in its single PCI slot. (It also has an on-board Realtek RTL-8110SC/8169SC NIC that is plugged in, but doesn't currently have an IP address configured.)
Wow. I'm trying to troubleshoot a very similar problem. I was convinced that it was hardware, but I'm beginning to exhaust my hardware troubleshooting skills.
I'm running an Asus M5A99X EVO 2.0, an Asus GeForce GTX 660, an AMD 8150 CPU, 32GB of RAM, and a Corsair 850W PSU. Randomly I get a complete lockup: mouse freezes, network dies, etc.
There's absolutely nothing in the logs related to the lockup. The system simply becomes totally unresponsive, to the point that the console cursor stops blinking. A hard reset is required to bring it back.
kernel-2.6.32-279.22.1.el6.i686 seems to be completely stable.
Same here. No log messages, just a complete freeze. At first I suspected some PulseAudio glitches because of thousands of messages in the log. Then I suspected the proprietary NVidia graphics, then thought it might be the power supply. I've since swapped out every component with no improvement. It can sometimes run for hours without a problem; sometimes it will lock up within a minute of a reboot.
Have you enabled your thermal sensors? Do you have any messages in the kernel log?
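For reference, one way to check that on an EL6 box is with the lm_sensors tools; a rough sketch, assuming the package is available (exact chip support depends on the board):

    yum install lm_sensors      # sensor detection and reporting tools
    sensors-detect              # probe for supported sensor chips (answer the prompts)
    service lm_sensors start    # load the modules sensors-detect recorded
    sensors                     # report temperatures, fan speeds, and voltages
    dmesg | grep -iE 'thermal|mce|temperature'   # look for throttling or MCE noise in the kernel log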
On 03/07/2013 12:52 PM, Kwan Lowe wrote:
On Sun, Mar 3, 2013 at 11:02 PM, Ian Pilcher arequipeno@gmail.com wrote:
I updated my home server with the 6.4 CR packages, and I've experienced 3 or 4 hard lockups since. The server is a fanless VIA C7 "CentaurHauls" system with a 1GHz CPU underclocked to 800MHz and 1GB of RAM. It has a dual-port Intel 82546GB NIC in its single PCI slot. (It also has an on-board Realtek RTL-8110SC/8169SC NIC that is plugged in, but doesn't currently have an IP address configured.)
Wow. I'm trying to troubleshoot a very similar problem. I was convinced that it was hardware, but I'm beginning to exhaust my hardware troubleshooting skills.
I'm running an Asus M5A99X EVO 2.0, an Asus GeForce GTX 660, an AMD 8150 CPU, 32GB of RAM, and a Corsair 850W PSU. Randomly I get a complete lockup: mouse freezes, network dies, etc.
I have 2 of these motherboards (ASUS M5A99X EVO R2.0) that I am using in CentOS development and testing. I am not seeing this issue .. I have "M5A99X EVO R2.0 BIOS 1503" dated "2013/01/31 update".
Do you have the latest BIOS?
There's absolutely nothing in the logs related to the lockup. The system simply becomes totally unresponsive, to the point that the console cursor stops blinking. A hard reset is required to bring it back.
kernel-2.6.32-279.22.1.el6.i686 seems to be completely stable.
Same here. No log messages, just a complete freeze. At first I suspected some PulseAudio glitches because of thousands of messages in the log. Then I suspected the proprietary NVidia graphics, then thought it might be the power supply. I've since swapped out every component with no improvement. It can sometimes run for hours without a problem; sometimes it will lock up within a minute of a reboot.
Have you enabled your thermal sensors? Do you have any messages in the kernel log?
On Thu, Mar 7, 2013 at 8:51 PM, Johnny Hughes johnny@centos.org wrote:
I have 2 of these motherboards (ASUS M5A99X EVO R2.0) that I am using in CentOS development and testing. I am not seeing this issue .. I have "M5A99X EVO R2.0 BIOS 1503" dated "2013/01/31 update".
Do you have the latest BIOS?
Thank you for your reply.
Yes, latest BIOS installed. I have 2 of these also with similar configurations except for the NIC. One works perfectly, the other has constant freezes. The working one has a slightly older BIOS, so I'm thinking of downgrading the glitchy one.
As far as logging goes, any idea what sort of failures could cause such a lockup? I.e., if memory was failing, would the system still be able to log? As the mouse is frozen and kernel sysrq has no effect, I'm still leaning towards hardware but literally everything except the case has been swapped out. (Well.. let me qualify that.. Everything but the 64GB SSD drive has been swapped but it seemed unlikely that a drive failure could cause such a lockup. Incorrect assumption?)
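For what it's worth, SysRq is only honored when it is enabled; a quick sketch for checking and enabling it on EL6 (the sysctl key is standard, the key combinations are the usual ones):

    cat /proc/sys/kernel/sysrq        # 0 means the magic SysRq key is disabled
    echo 1 > /proc/sys/kernel/sysrq   # enable it until the next reboot
    # add "kernel.sysrq = 1" to /etc/sysctl.conf to make it persistent
    # if the box still reacts during a lockup, Alt-SysRq-t / -w / -l dump task
    # and CPU state to the console and kernel log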
Kwan Lowe wrote:
On Thu, Mar 7, 2013 at 8:51 PM, Johnny Hughes johnny@centos.org wrote:
<snip>
As far as logging goes, any idea what sort of failures could cause such a lockup? I.e., if memory was failing, would the system still be able to log? As the mouse is frozen and kernel sysrq has no effect, I'm still leaning towards hardware but literally everything except the case has been swapped out. (Well.. let me qualify that.. Everything but the 64GB SSD drive has been swapped but it seemed unlikely that a drive failure could cause such a lockup. Incorrect assumption?)
No ideas... and I've had a number of systems do this, over the last couple years, where someone noted it had stopped responding; I go down, and it doesn't respond *at* *all* when I plug in a monitor & keyboard, and power cycling's the only answer.
Thinking about it, I believe it's mostly been on our Penguin servers, and that co. uses Supermicro m/b's, and we've had h/w problems with them, also, and have had several m/b's replaced under warranty.
mark
On 03/08/2013 10:43 AM, m.roth@5-cent.us wrote:
Kwan Lowe wrote:
On Thu, Mar 7, 2013 at 8:51 PM, Johnny Hughes johnny@centos.org wrote:
<snip>
> As far as logging goes, any idea what sort of failures could cause such a lockup? I.e., if memory was failing, would the system still be able to log? As the mouse is frozen and kernel sysrq has no effect, I'm still leaning towards hardware but literally everything except the case has been swapped out. (Well.. let me qualify that.. Everything but the 64GB SSD drive has been swapped but it seemed unlikely that a drive failure could cause such a lockup. Incorrect assumption?)

No ideas... and I've had a number of systems do this, over the last couple years, where someone noted it had stopped responding; I go down, and it doesn't respond *at* *all* when I plug in a monitor & keyboard, and power cycling's the only answer.
Thinking about it, I believe it's mostly been on our Penguin servers, and that co. uses Supermicro m/b's, and we've had h/w problems with them, also, and have had several m/b's replaced under warranty.
mark
Nearly every time we've had lockup problems it has come down to bad or failing memory.
I've even had memory cause problems where it would pass a quick memtest but ultimately would fail if you left it running the tests overnight.
Gerry Reno wrote:
On 03/08/2013 10:43 AM, m.roth@5-cent.us wrote:
Kwan Lowe wrote:
On Thu, Mar 7, 2013 at 8:51 PM, Johnny Hughes johnny@centos.org wrote:
<snip>
> As far as logging goes, any idea what sort of failures could cause such a lockup? I.e., if memory was failing, would the system still be able to log? As the mouse is frozen and kernel sysrq has no effect, I'm still leaning towards hardware but literally everything except the case has been swapped out. (Well.. let me qualify that.. Everything but the 64GB SSD drive has been swapped but it seemed unlikely that a drive failure could cause such a lockup. Incorrect assumption?)

No ideas... and I've had a number of systems do this, over the last couple years, where someone noted it had stopped responding; I go down, and it doesn't respond *at* *all* when I plug in a monitor & keyboard, and power cycling's the only answer.
Thinking about it, I believe it's mostly been on our Penguin servers, and that co. uses Supermicro m/b's, and we've had h/w problems with them, also, and have had several m/b's replaced under warranty.
Nearly every time we've had lockup problems it has come down to bad or failing memory.
I've even had memory cause problems where it would pass a quick memtest but ultimately would fail if you left it running the tests overnight.
Right, but I've always *seen* error messages in dmesg, and, if mcelogd is actually working (I can't figure out why it seems to run on some machines and not on others, or why it doesn't keep running), it's in there. The times we've had lockups, there's been nothing.
mark
On Fri, Mar 8, 2013 at 11:25 AM, m.roth@5-cent.us wrote:
Right, but I've always *seen* error messages in dmesg, and, if mcelogd is actually working (I can't figure out why it seems to run on some machines and not on others, or why it doesn't keep running), it's in there. The times we've had lockups, there's been nothing.
That's the frustrating thing.. Not a single error message. It also appears unrelated to system load, as I went through 4 hours of the Phoronix Test Suite pegging all 8 cores, several loops of the Unigine Valley benchmark, and memtest. All passed. But at night it locked up when there was no load.
Kwan Lowe wrote:
On Fri, Mar 8, 2013 at 11:25 AM, m.roth@5-cent.us wrote:
Right, but I've always *seen* error messages in dmesg, and, if mcelogd is actually working (I can't figure out why it seems to run on some machines and not on others, or why it doesn't keep running), it's in there. The times we've had lockups, there's been nothing.
That's the frustrating thing.. Not a single error message. It also appears unrelated to system load, as I went through 4 hours of the Phoronix Test Suite pegging all 8 cores, several loops of the Unigine Valley benchmark, and memtest. All passed. But at night it locked up when there was no load.
Ok, so there was nothing in /var/log/dmesg? Have you tried running mcelogd?
mark
On Fri, Mar 8, 2013 at 11:34 AM, m.roth@5-cent.us wrote:
Ok, so there was nothing in /var/log/dmesg? Have you tried running mcelogd?
Nothing in dmesg, but I have not run mcelogd. I will try that tonight. Thanks!
On 03/08/2013 11:46 AM, Kwan Lowe wrote:
On Fri, Mar 8, 2013 at 11:34 AM, m.roth@5-cent.us wrote:
Ok, so there was nothing in /var/log/dmesg? Have you tried running mcelogd?
Nothing in dmesg, but I have not run mcelogd. I will try that tonight. Thanks!
Run memtest on your memory and leave it running overnight.
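If rebooting the box into memtest86+ for the night is not convenient, a rough userspace alternative (not something mentioned in this thread) is the memtester utility, if it is available for the distribution; a sketch:

    # run as root so the region can be locked; test 24GB of the 32GB,
    # leaving room for the OS and running services, and loop 10 times
    memtester 24G 10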
On Fri, Mar 8, 2013 at 10:51 AM, Gerry Reno greno@verizon.net wrote:
Nearly every time we've had lockup problems it has come down to bad or failing memory.
I've even had memory cause problems where it would pass a quick memtest but ultimately would fail if you left it running the tests overnight.
Thank you for your reply.
I was leaning towards memory after swapping the power supply did not solve the problem. There are 4 8GB DDR3 sticks, so I took out two and ran with 16G. It still failed. I then swapped that out for the other 16GB. Still failed. What I haven't tried is to downclock the memory to a slower speed but will try that tonight if the BIOS supports it.
On Fri, Mar 8, 2013 at 11:27 AM, Kwan Lowe kwan.lowe@gmail.com wrote:
On Fri, Mar 8, 2013 at 10:51 AM, Gerry Reno greno@verizon.net wrote:
Nearly every time we've had lockup problems it has come down to bad or
failing memory.
I've even had memory cause problems where it would pass a quick memtest
but ultimately would fail if you left it running
the tests overnight.
Thank you for your reply.
I was leaning towards memory after swapping the power supply did not
Sure sounds like a memory related lock up since you've ruled out the power supply.
solve the problem. There are 4 8GB DDR3 sticks, so I took out two and ran with 16G. It still failed. I then swapped that out for the other 16GB. Still failed. What I haven't tried is to downclock the memory to
Are you able to boot the system with memory in the second pair of slots?
If it's not memory related (test this memory in another system) then it is probably a motherboard failure. I've seen weird symptoms where the system will boot fine, but once the Linux kernel begins to build its cache it triggers a lock up/throws an exception.
In that case the memory controller was probably going, so that ancient system got thrown out (it was not in production). That system previously ran a proprietary Linux 2.2 kernel, and a 2.4 or 2.6 kernel would cause it to wig out. Differences in how the 2.2 and 2.4/2.6 kernels allocate memory really brought out the problem in that system!
But to be sure, run a memtest overnight on the original 4x8GB RAM as has been recommended by others.
a slower speed but will try that tonight if the BIOS supports it.
SilverTip257 wrote:
On Fri, Mar 8, 2013 at 11:27 AM, Kwan Lowe kwan.lowe@gmail.com wrote:
On Fri, Mar 8, 2013 at 10:51 AM, Gerry Reno greno@verizon.net wrote:
Nearly every time we've had lockup problems it has come down to bad or
failing memory.
I've even had memory cause problems where it would pass a quick
memtest but ultimately would fail if you left it running
the tests overnight.
<snip>
I was leaning towards memory after swapping the power supply did not
<snip>
If it's not memory related (test this memory in another system) then it is probably a motherboard failure. I've seen weird symptoms where the system will boot fine, but once the Linux kernel begins to build its cache it triggers a lock up/throws an exception.
<snip>
I lean towards the m/b failing. Btw, on most of the Penguins I've mentioned that had m/b's replaced, we can run a *user* program (parallel processing using Torque, very heavy-duty scientific computing) and it will crash the system, right through to a reboot, repeatably. We've shipped them back, and they wind up replacing the m/b.
mark
On Fri, Mar 8, 2013 at 12:33 PM, SilverTip257 silvertip257@gmail.com wrote:
If it's not memory related (test this memory in another system) then it is probably a motherboard failure. I've seen weird symptoms where the system will boot fine, but once the Linux kernel begins to build its cache it triggers a lock up/throws an exception.
:) I've also swapped the motherboard. *Every* component except for the case and the SSD boot drive has been swapped. This has been going on now for almost two weeks.
I will try your suggestion of trying a separate set of banks in the off chance that those slots are faulty.
Kwan Lowe wrote:
On Fri, Mar 8, 2013 at 12:33 PM, SilverTip257 silvertip257@gmail.com wrote:
If it's not memory related (test this memory in another system) then it is probably a motherboard failure. I've seen weird symptoms where the system will boot fine, but once the Linux kernel begins to build its
cache it
triggers a lock up/throws an exception.
:) I've also swapped the motherboard. *Every* component except for the case and the SSD boot drive has been swapped. This has been going on now for almost two weeks.
<snip>
Oh, that's *bad*. We had a server like that, a Dell (fortunately): three replacement m/b's (one was DoA), and we *still* couldn't get it to boot, so they offered us a newer server as a replacement (all within two weeks, and that *includes* the three days it took for the FEs to show up - that's why I'm glad it was Dell).
mark
On Fri, Mar 8, 2013 at 12:28 PM, Kwan Lowe kwan.lowe@gmail.com wrote:
If it's not memory related (test this memory in another system) then it is probably a motherboard failure. I've seen weird symptoms where the system will boot fine, but once the Linux kernel begins to build its cache it triggers a lock up/throws an exception.
:) I've also swapped the motherboard. *Every* component except for the case and the SSD boot drive has been swapped. This has been going on now for almost two weeks.
I will try your suggestion of trying a separate set of banks in the off chance that those slots are faulty.
I had one a few years ago where it took about 3 days for memtest to catch the bad RAM but even after fixing that there were random crashes. Turned out that the bad RAM had caused some disk corruption which was partly hidden by raid1 mirroring. Once in a while a program block read would hit the bad copy, but when you look for it everything looks OK...
On Fri, Mar 8, 2013 at 2:12 PM, Les Mikesell lesmikesell@gmail.com wrote:
I will try your suggestion of trying a separate set of banks in the off chance that those slots are faulty.
I had one a few years ago where it took about 3 days for memtest to catch the bad RAM but even after fixing that there were random crashes. Turned out that the bad RAM had caused some disk corruption which was partly hidden by raid1 mirroring. Once in a while a program block read would hit the bad copy, but when you look for it everything looks OK...
I'm running on the second bank now. I ran into a snag running mcelogd, however (the processor might not be supported). It appears that the CPU is not supported even with CONFIG_EDAC_MCE and CONFIG_EDAC_AMD64 enabled in /boot/config-xxx. The error sometimes takes a few hours to occur, so I will use this system throughout the night to try to catch the failure.
Starting mcelog daemon [FAILED]
AMD Processor family 21: Please load edac_mce_amd module.
CPU is unsupported
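The init script is pointing at its own workaround; a sketch of what that would look like (the module and service names are taken from the error output above, the sysfs path is the standard EDAC location, and whether EDAC actually registers this CPU family depends on the kernel):

    modprobe edac_mce_amd       # module the mcelog init script asks for
    modprobe amd64_edac_mod     # AMD memory-controller EDAC driver, if the kernel supports this CPU family
    lsmod | grep edac           # confirm the modules actually loaded
    service mcelogd start       # retry the mcelog daemon
    cat /sys/devices/system/edac/mc/mc*/ce_count 2>/dev/null   # corrected-error counts, if EDAC registered a controller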
On 08.Mär.2013, at 19:28, Kwan Lowe wrote:
On Fri, Mar 8, 2013 at 12:33 PM, SilverTip257 silvertip257@gmail.com wrote:
If it's not memory related (test this memory in another system) then it is probably a motherboard failure. I've seen weird symptoms where the system will boot fine, but once the Linux kernel begins to build its cache it triggers a lock up/throws an exception.
:) I've also swapped the motherboard. *Every* component except for the case and the SSD boot drive has been swapped. This has been going on now for almost two weeks.
I'll tell you about one very stable system that was not stable the other day. It was locking up about every half hour after running stably for years. It turned out that the temperature was not monitored on this system; the CPU fan got angry about this fact, stopped working, and the machine was getting hot. After replacing the fan you might think *problem solved*, but no. It kept locking up. It turned out that an adapter for the power supply had a loose contact. Do you think that loose contact could have been introduced while replacing the fan?
On Mar.08.2013 11:27 AM, Kwan Lowe wrote:
On Fri, Mar 8, 2013 at 10:51 AM, Gerry Reno greno@verizon.net wrote:
Nearly every time we've had lockup problems it has come down to bad or failing memory.
I've even had memory cause problems where it would pass a quick memtest but ultimately would fail if you left it running the tests overnight.
Thank you for your reply.
I was leaning towards memory after swapping the power supply did not solve the problem. There are 4 8GB DDR3 sticks, so I took out two and ran with 16G. It still failed. I then swapped that out for the other 16GB. Still failed. What I haven't tried is to downclock the memory to a slower speed but will try that tonight if the BIOS supports it.
A diagnostic board should, at least, limit the search space. Characterizing the type(s)/point(s) of failure may make it possible to handle them more gracefully.
On 03/08/2013 09:38 AM, Kwan Lowe wrote:
On Thu, Mar 7, 2013 at 8:51 PM, Johnny Hughes johnny@centos.org wrote:
I have 2 of these motherboards (ASUS M5A99X EVO R2.0) that I am using in CentOS development and testing. I am not seeing this issue .. I have "M5A99X EVO R2.0 BIOS 1503" dated "2013/01/31 update".
Do you have the latest BIOS?
Thank you for your reply.
Yes, latest BIOS installed. I have 2 of these also with similar configurations except for the NIC. One works perfectly, the other has constant freezes. The working one has a slightly older BIOS, so I'm thinking of downgrading the glitchy one.
As far as logging goes, any idea what sort of failures could cause such a lockup? I.e., if memory was failing, would the system still be able to log? As the mouse is frozen and kernel sysrq has no effect, I'm still leaning towards hardware but literally everything except the case has been swapped out. (Well.. let me qualify that.. Everything but the 64GB SSD drive has been swapped but it seemed unlikely that a drive failure could cause such a lockup. Incorrect assumption?)
The board has a couple of buttons on it to find the best memory timings, etc.
The button is labeled MemOK! and is in the top right corner of the board.
On Fri, 8 Mar 2013, Kwan Lowe wrote:
Yes, latest BIOS installed. I have 2 of these also with similar configurations except for the NIC. One works perfectly, the other has constant freezes. The working one has a slightly older BIOS, so I'm thinking of downgrading the glitchy one.
Just a wild idea: is the NIC in the system that freezes a Broadcom and in the other system something else? If so, disable_msi=1 may help.
Steve
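For reference, a module option like that is usually set through a modprobe.d snippet on EL6; a sketch assuming the Broadcom bnx2 driver, which does take a disable_msi parameter (check modinfo for whatever driver the NIC actually uses):

    # /etc/modprobe.d/bnx2.conf  -- illustrative; the driver name depends on the NIC
    options bnx2 disable_msi=1
    # verify the parameter exists for the driver in question:
    #   modinfo bnx2 | grep -i msi
    # then reload the module (or reboot) for it to take effect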
On Fri, Mar 8, 2013 at 5:04 PM, Steve Thompson smt@vgersoft.com wrote:
Just a wild idea: is the NIC in the system that freezes a Broadcom and in the other system something else? If so, disable_msi=1 may help.
NICs are now both ThinkPenguin cards with an Atheros chipset. At this point, the systems are identical except that the failing one has an even bigger PSU than is needed (I calculated 650W required and had an 850W in there... Now it's a 1200W :D ).
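In case it helps anyone comparing NICs, the driver actually bound to an interface is easy to confirm (standard tools; the interface name is just an example):

    ethtool -i eth0                  # reports the driver name, version, and firmware
    lspci -nn | grep -i ethernet     # PCI vendor/device IDs of the installed NICs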
On Sun, Mar 3, 2013 at 11:02 PM, Ian Pilcher arequipeno@gmail.com wrote:
I updated my home server with the 6.4 CR packages, and I've experienced 3 or 4 hard lockups since. The server is a fanless VIA C7 "CentaurHauls" system with a 1GHz CPU underclocked to 800MHz and 1GB of RAM. It has a dual-port Intel 82546GB NIC in its single PCI slot. (It also has an on-board Realtek RTL-8110SC/8169SC NIC that is plugged in, but doesn't currently have an IP address configured.)
Well.. Looks like my hardware problems were only superficially the same as yours. After fighting it for two weeks, I got the second replacement motherboard in on Tuesday. I swapped it out and it has been rock-solid stable since then. At some point I may try bringing the BIOS up to the same version as on the failed board, in case someone else has a similar problem, but for now it's staying at the back-rev version.
I've been running kernel-2.6.32-358.2.1.el6.i686 for a couple of days now with no problem.
<knock on='wood'/>