Hey folks,
I have one system (a Sun Fire X2250 running 5.7) that is having issues with RAM, but I'm not sure how to debug it, and unfortunately it is no longer under support.
I started this job about 4 months ago, and the guy who handed things over to me told me this issue was on his list of things he hadn't been able to get to yet. He said he'd seen errors in the Sun ILOM message log in the past, but unfortunately he didn't record exactly what the messages were.
Fast forward a bit and I've had problems with this machine. Sometimes when I reboot it, it just won't come up. The console and everything go completely dead no matter what I do. I unplug it for a while and try again, same thing. It seems to just randomly come back to life, and when it does I see something in the ILOM log like this:
ID = f74 : 03/07/2012 : 19:17:42 : System Firmware Error : ACPI : No usable system memory
So it seems to me that when it is having trouble, it is not seeing any RAM at all. And when it does come back up, Linux only sees half the RAM it is supposed to see.
"lshw" sees all the RAM
The only errors I see in the ILOM logs are above.
I don't see anything in dmesg or /var/log/messages on the Linux side.
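In case it helps, this is roughly what I'm looking at on the Linux side (this assumes ipmitool is installed and that there's an EDAC driver for this chipset -- I haven't confirmed either on this box, and <ilom-ip> is just a placeholder):

    # dump the service processor's event log over IPMI, locally or via the ILOM's address
    ipmitool sel elist
    ipmitool -I lanplus -H <ilom-ip> -U root sel elist

    # any corrected/uncorrected memory errors the kernel has logged
    dmesg | grep -iE 'edac|ecc|memory'
    grep -i edac /var/log/messages

    # per-controller corrected-error counters, if an EDAC module is loaded
    cat /sys/devices/system/edac/mc/mc*/ce_count 2>/dev/null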
Back about 3 months ago I took this system down and removed all the RAM, and stuck individual chips into it and booted it, testing each chip on its own. At that time every single one of them worked! But I'm about to try this again to see what happens. Back then I also ran memtest86 for some time and it seemed OK too.
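When I do the one-DIMM-at-a-time pass again I'm going to note what each boot actually detects, so I can spot the half-the-RAM case right away. Nothing exotic, just comparing the kernel's view with what the BIOS reports:

    # what the kernel ended up with
    grep MemTotal /proc/meminfo
    free -m

    # what SMBIOS says is populated
    dmidecode -t memory | grep -E 'Size|Speed'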
Other than that I'm a bit stumped on how to get to the bottom of this. Tips?
I googled the error and got precisely 1 hit at a university high performance computing center in Utah, so I dug up a contact there and emailed them hoping they could tell me something, but I have not yet heard back.
thanks, -Alan
On Tue, Mar 13, 2012 at 11:50 AM, Alan McKay alan.mckay@gmail.com wrote:
Back about 3 months ago I took this system down and removed all the RAM, and stuck individual chips into it and booted it, testing each chip on its own. At that time every single one of them worked! But I'm about to try this again to see what happens. Back then I also ran memtest86 for some time and it seemed OK too.
I've seen systems where it took 3 or 4 days for memtest86 to catch an error (i.e. just over a weekend wasn't long enough).
Alan McKay wrote:
Hey folks,
I have one system (a Sun Fire X2250 running 5.7) that is having issues with RAM, but I'm not sure how to debug it, and unfortunately it is no longer under support.
<snip> Oy, as they say, vey. You still *might* be able to email Sun, er, Oracle support without paying (though I don't know *how* you expect poor Larry to keep his fighter jet fueled....)
mark
On Mar 13, 2012, at 12:50 PM, Alan McKay alan.mckay@gmail.com wrote:
Back about 3 months ago I took this system down and removed all the RAM, and stuck individual chips into it and booted it, testing each chip on its own. At that time every single one of them worked! But I'm about to try this again to see what happens. Back then I also ran memtest86 for some time and it seemed OK too.
It could be a bad physical RAM slot on the motherboard. Try filling the slots one at a time (or two if paired) until you hit the problem slot.
-Ross
On Tue, Mar 13, 2012 at 2:07 PM, Ross Walker rswwalker@gmail.com wrote:
It could be a bad physical RAM slot on the motherboard.
Oh dang, why didn't I think of that! I'll try that next
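To keep track of which physical slot I'm testing, I'll probably go by the SMBIOS slot labels -- assuming the locator strings dmidecode reports actually match the silkscreen on this board, which isn't always the case:

    # list each slot with its label and whether anything is installed in it
    dmidecode -t memory | grep -E 'Locator:|Size:'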
on 3/13/2012 11:07 AM Ross Walker spake the following:
On Mar 13, 2012, at 12:50 PM, Alan McKay alan.mckay@gmail.com wrote:
Back about 3 months ago I took this system down and removed all the RAM, and stuck individual chips into it and booted it, testing each chip on its own. At that time every single one of them worked! But I'm about to try this again to see what happens. Back then I also ran memtest86 for some time and it seemed OK too.
It could be a bad physical RAM slot on the motherboard. Try filling the slots one at a time (or two if paired) until you hit the problem slot.
-Ross
It could also be a power supply problem... Add memory load, and a bit of heat, and voltage drops a bit...
On Tue, Mar 13, 2012 at 2:15 PM, Scott Silva ssilva@sgvwater.com wrote:
It could also be a power supply problem... Add memory load, and a bit of heat, and voltage drops a bit...
Problem is that I can hit this even if I leave it unplugged for a while. And I have the heat sensors all graphed; when this last reared its head last week, the mobo was relatively cool according to the heat graphs.
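Worth checking the rails too, though. If the ILOM exposes its sensors over IPMI (I'm assuming it does, same as the temperature readings), something like:

    # all sensor readings with their thresholds
    ipmitool sensor

    # just the voltage and temperature sensors
    ipmitool sdr type Voltage
    ipmitool sdr type Temperature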
Well I did exactly what I'd done 3 months ago and found a faulty RAM chip this time
My guess is that back then the chip was still functioning some of the time, and happened to be fine just when I was doing the tests.
This time I found it fairly easily with a systematic approach.
On Wed, Mar 14, 2012 at 1:43 PM, Alan McKay alan.mckay@gmail.com wrote:
Well I did exactly what I'd done 3 months ago and found a faulty RAM chip this time
My guess is that back then the chip was still functioning some of the time, and happened to be fine just when I was doing the tests.
This time I found it fairly easily with a systematic approach.
If you were running software RAID1 on that box, don't trust anything on the drives now. Maybe even if you weren't, but it is especially weird when alternate reads randomly revive bad data that you thought had been fixed already.
On 03/14/12 12:16 PM, Les Mikesell wrote:
If you were running software RAID1 on that box, don't trust anything on the drives now. Maybe even if you weren't, but it is especially weird when alternate reads randomly revive bad data that you thought had been fixed already.
and the worst part is, even if you found mismatching blocks on the mirrors, there's no way to know which one is the 'good' one, as there's no block checksumming or anything like that with conventional RAID.
this is a major reason I *insist* on ECC for any sort of server other than a lightweight home system. ECC memory will detect bit failures so you KNOW something is funky.
this is also a major reason why RAID is *not* a substitute for backup, it's ONLY about availability.
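if you do have md mirrors on a box you suspect, you can at least see whether the halves have drifted apart, even though you can't tell which half is right (md0 below is just an example device):

    # kick off a consistency check on the array
    echo check > /sys/block/md0/md/sync_action

    # watch progress
    cat /proc/mdstat

    # sectors that differed between the mirrors after the check completes
    cat /sys/block/md0/md/mismatch_cnt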
On Wed, Mar 14, 2012 at 2:35 PM, John R Pierce pierce@hogranch.com wrote:
On 03/14/12 12:16 PM, Les Mikesell wrote:
If you were running software RAID1 on that box, don't trust anything on the drives now. Maybe even if you weren't, but it is especially weird when alternate reads randomly revive bad data that you thought had been fixed already.
and the worst part is, even if you found mismatching blocks on the mirrors, there's no way to know which one is the 'good' one, as there's no block checksumming or anything like that with conventional RAID.
this is a major reason I *insist* on ECC for any sort of server other than a lightweight home system. ECC memory will detect bit failures so you KNOW something is funky.
I _thought_ the server where I had this problem was supposed to have single-bit error correction, and I also thought that if an error couldn't be corrected by ECC the machine was supposed to crash instead of continuing. But maybe it had the wrong kind of RAM installed, or something else that disabled the ECC.
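Something I should have checked at the time is whether ECC was actually active, not just whether the DIMMs were ECC parts. Assuming dmidecode and the EDAC sysfs tree are available on that machine, roughly:

    # what the memory controller claims (e.g. 'Multi-bit ECC' vs 'None')
    dmidecode -t 16 | grep -i 'error correction'

    # whether an EDAC memory-controller driver is loaded and counting errors
    lsmod | grep -i edac
    ls /sys/devices/system/edac/mc/ 2>/dev/null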
On Wed, Mar 14, 2012 at 3:16 PM, Les Mikesell lesmikesell@gmail.com wrote:
If you were running software RAID1 on that box, don't trust anything on the drives now. Maybe even if you weren't, but it is especially weird when alternate reads randomly revive bad data that you thought had been fixed already.
No worries, it is a disposable compute node and not even RAIDed. In a pinch I could reinstall from scratch and my network wouldn't skip a beat.
But thanks for the heads up - something to think about for my big iron server :-)