Hey folks,
I have one system (a Sun Fire X2250 running 5.7) that is having issues with RAM, and I'm not sure how to debug it. Unfortunately, it is no longer under support.
I started the job about 4 months ago, and when I came aboard the guy who handed things over told me this issue was on his list of things he had not gotten to yet. He said he'd seen errors in the Sun ILOM message log in the past, but unfortunately he did not record exactly what the messages were.
Fast forward a bit and I've had problems with this machine. Sometimes when I reboot it, it just won't come up. The console and everything else go completely dead no matter what I do. I unplug it for a while and try again, and get the same thing. It seems to come back to life at random, and when it does I see something like this in the ILOM log:
ID = f74 : 03/07/2012 : 19:17:42 : System Firmware Error : ACPI : No usable system memory
So it seems to me that when it is having trouble, it is not seeing any RAM at all. And when it does come back up, Linux only sees half the RAM it is supposed to see.
"lshw", however, reports all of the RAM.
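To put numbers on that mismatch, here is a rough sketch that compares what the kernel reports with what the DIMM inventory says is installed. It assumes Python 3 is available, that the kernel figure comes from the MemTotal line of /proc/meminfo, and that the per-DIMM sizes come from saved "dmidecode --type 17" output (run as root). The function names are made up for illustration, and MemTotal is always a little below the installed total, so only a large gap such as half the RAM is meaningful.

```python
import re


def meminfo_total_mib(meminfo_text):
    """Return the MemTotal value from /proc/meminfo text, in MiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    return int(match.group(1)) // 1024


def dimm_sizes_mib(dmidecode_text):
    """Return the size in MiB of each populated DIMM slot.

    Expects the text printed by `dmidecode --type 17`. Empty slots
    ("Size: No Module Installed") do not match and are skipped.
    """
    sizes = []
    pattern = r"^\s*Size:\s+(\d+)\s+(MB|GB)\s*$"
    for value, unit in re.findall(pattern, dmidecode_text, re.MULTILINE):
        sizes.append(int(value) * (1024 if unit == "GB" else 1))
    return sizes


def memory_gap_mib(meminfo_text, dmidecode_text):
    """Return installed MiB minus kernel-visible MiB."""
    return sum(dimm_sizes_mib(dmidecode_text)) - meminfo_total_mib(meminfo_text)
```

One way to use it: save the output of "cat /proc/meminfo" and "dmidecode --type 17" to two files after each boot, read them in, and pass the contents to memory_gap_mib(). Keeping the per-slot list from dimm_sizes_mib() alongside it may help show whether the same slots go missing each time.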
The only errors I see in the ILOM logs are above.
I don't see anything in dmesg or /var/log/messages on the Linux side.
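For completeness, this is roughly the kind of scan meant by "I don't see anything", written as a small Python 3 sketch. The keyword list (EDAC, MCE, ECC, "machine check", "hardware error") is only an assumption about what a memory fault might look like in kernel logs, not a definitive list, and the function name is hypothetical.

```python
import re

# Keywords that often accompany memory or machine-check reports in
# Linux kernel logs. This list is an assumption, not exhaustive.
MEMORY_ERROR_PATTERN = re.compile(
    r"\b(edac|mce|ecc)\b|machine check|hardware error",
    re.IGNORECASE,
)


def memory_error_lines(lines):
    """Return the log lines that mention a memory-related keyword."""
    return [
        line.rstrip("\n")
        for line in lines
        if MEMORY_ERROR_PATTERN.search(line)
    ]
```

Passing it an open file handle for /var/log/messages, or the saved output of dmesg, returns the matching lines. An empty list only means none of these keywords appeared, not that the memory is healthy.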
Back about 3 months ago I took this system down and removed all the RAM, and stuck individual chips into it and booted it, testing each chip on its own. At that time every single one of them worked! But I'm about to try this again to see what happens. Back then I also ran memtest86 for some time and it seemed OK too.
Other than that I'm a bit stumped on how to get to the bottom of this. Tips?
I googled the error and got precisely 1 hit at a university high performance computing center in Utah, so I dug up a contact there and emailed them hoping they could tell me something, but I have not yet heard back.
thanks, -Alan