Hello all
I am running CentOS 5 on a small server and I am having very strange memory malfunctions.
The computer runs perfectly with no problems whatsoever. From time to time, after a soft reboot, the computer emits beeps corresponding to a memory fault. It will not boot again until I find and remove a now-defective DIMM. That DIMM can never be used again because it no longer works.
This just happened for the *fourth* time and is costing me a lot of hassle, and perhaps expense if the manufacturer does not replace the DIMM free of charge.
The memory is DDR400 ECC from Kingston and of the type recommended by the motherboard's manufacturer. The board is a Tyan Tomcat i875p (S5102).
Would it be possible that the DIMMs are being destroyed by some software component from the OS, perhaps the I2C management? The DIMMs do have EPROMS... Are they being incorrectly accessed by some software component and their program modified?
Could this be the board's fault? And how?
I don't know what to think of this; I have never seen anything like it in my already long experience with computers...
I must say again that this memory error *never* happens while the computer is in service; it has always happened upon a soft reboot.
Any hints would be appreciated. Thank you
On Fri, 19 Oct 2007 22:13:13 +0100 Miguel Medalha <miguelmedalha@sapo.pt> wrote:
> Would it be possible that the DIMMs are being destroyed by some software component from the OS, perhaps the I2C management? The DIMMs do have EPROMS... Are they being incorrectly accessed by some software component and their program modified?
I very much doubt that.
> Could this be the board's fault? And how?
There are many things that could create the problem you describe, and they are all hardware problems. A faulty power supply, or a motherboard that is providing incorrect power to the RAM slots, could cause these failures. Dirt in the slots or poor ventilation could also cause this. So could crappy line voltage from your power company.
The only software-ish thing that I can think of that would cause problems is if you set the motherboard's BIOS to values that caused the board to overheat or overload in some way. Did you play with the BIOS, especially any "overclocking" or "high performance" features? Again, that's not CentOS at fault; set your BIOS to use normal values and see if the problem goes away (assuming that you haven't fried the board already, of course).
There are probably other things that could cause these failures as well, but again, it's pretty much all hardware-related, so I would look there and not at anything operating-system or software-related.
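Since the DIMMs are ECC, one way to look at the hardware while the machine is still up is to watch the kernel's ECC error counters. The following is only a minimal sketch: it assumes an EDAC driver for the chipset is loaded and exposes the usual `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc`. Whether that is the case on this particular board and kernel has not been verified here, and `read_edac_counts` is a made-up helper name.

```python
import glob
import os


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters.

    Assumption: an EDAC driver is loaded and provides mcN/ce_count and
    mcN/ue_count under `root`. Controllers with missing or unreadable
    counters are skipped instead of raising.
    """
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(root, "mc[0-9]*"))):
        try:
            with open(os.path.join(mc_dir, "ce_count")) as f:
                corrected = int(f.read().strip())
            with open(os.path.join(mc_dir, "ue_count")) as f:
                uncorrected = int(f.read().strip())
        except (OSError, ValueError):
            continue
        counts[os.path.basename(mc_dir)] = (corrected, uncorrected)
    return counts
```

A corrected-error count that keeps climbing between reboots would point at marginal memory, power or cooling before a DIMM dies outright. An empty result only means that no counters were found, not that the memory is healthy.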
CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos
I've had similar problems with a Tyan motherboard (8-Opteron S4881/S4882 combination) which had all memory slots filled. It turned out that the BIOS allowed the memory to run at its full DDR400 speed, whereas the AMD specs say that one has to go down to DDR333 speed. The result was destroyed DIMMs _and_ Opterons (the built-in memory controllers)! It was not easy to track this down; I'm still wondering why the BIOS programmers do not read and follow the specs! This computer is now happily running at DDR333 with uptime > 6 months.
The upshot: don't trust the Tyan BIOS; check the possible settings against the specs. Don't trust vendors either: they sell DDR400 memory even if the particular setup only allows DDR333 speed.
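Checking the settings against the specs can be partly scripted. This is a rough sketch, not a definitive tool: it assumes the text layout that `dmidecode --type 17` prints (one "Memory Device" block per slot, with `Locator:` and `Speed:` lines). The function names and the `SAMPLE` text are invented for illustration, and the speed limit has to come from the board or CPU documentation.

```python
import re

# Invented sample in the layout dmidecode prints for type 17 records.
SAMPLE = """\
Handle 0x0020, DMI type 17, 27 bytes
Memory Device
	Size: 1024 MB
	Locator: DIMM1
	Bank Locator: BANK0
	Speed: 400 MHz
Handle 0x0021, DMI type 17, 27 bytes
Memory Device
	Size: No Module Installed
	Locator: DIMM2
	Bank Locator: BANK1
	Speed: Unknown
"""


def parse_dimm_speeds(dmidecode_text):
    """Return [(locator, speed)] for every slot that reports a numeric speed."""
    dimms = []
    for block in dmidecode_text.split("Memory Device")[1:]:
        # Anchoring at line start keeps "Bank Locator:" from matching.
        locator = re.search(r"^\s*Locator:\s*(.+)$", block, re.MULTILINE)
        speed = re.search(r"^\s*Speed:\s*(\d+)\s*(?:MHz|MT/s)", block, re.MULTILINE)
        if locator and speed:  # empty slots report "Speed: Unknown"
            dimms.append((locator.group(1).strip(), int(speed.group(1))))
    return dimms


def over_limit(dimms, max_speed):
    """Return the DIMMs whose reported speed exceeds max_speed."""
    return [(loc, spd) for loc, spd in dimms if spd > max_speed]
```

On a live system the input text would come from running `dmidecode --type 17` as root. For a fully populated board like the one described above, `over_limit(parse_dimm_speeds(text), 333)` would list any slot still reporting DDR400.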
HTH, Kay