debugging RAM issues - Discuss

13 Mar 2012


      Hey folks,
I have 1 system ( Sunfire x2250 running 5.7 ) that is having issues with
RAM, but I'm not sure how to debug it.   And unfortunately it is not under
support anymore.
I started the job about 4 months ago and when I came aboard the guy who
handed stuff over to me told me this issue was on his list of things he was
unable to get to yet.   He told me he'd seen errors in the past in the Sun
ILOM message log, but unfortuantely he did not record exactly what the
messages were.
Fast forward a bit and I've had problems with this machine.   Sometimes
when I reboot it, it just won't come up.  The console and everything just
go completely dead no matter what I do.  I unplug it for a while and try
again, same thing.  It seems to just randomly come back to life, and when
it does I see something in the ILOM log like this :
ID = f74 : 03/07/2012 : 19:17:42 : System Firmware Error : ACPI : No usable
system memory
So it seems to me that when it is having trouble, it is not seeing any RAM
at all.  And when it does come back up, Linux only sees half the RAM it is
supposed to see.
"lshw" sees all the RAM
The only errors I see in the ILOM logs are above.
I don't see anything in dmesg or /var/log/messages on the Linux side.
Back about 3 months ago I took this system down and removed all the RAM,
and stuck individual chips into it and booted it, testing each chip on its
own.   At that time every single one of them worked!   But I'm about to try
this again to see what happens.   Back then I also ran memtest86 for some
time and it seemed OK too.
Other than that I'm a bit stumped on how to get to the bottom of this.
 Tips?
I googled the error and got precisely 1 hit at a university high
performance computing center in Utah, so I dug up a contact there and
emailed them hoping they could tell me something, but I have not yet heard
back.
thanks,
-Alan
-- 
“Don't eat anything you've ever seen advertised on TV”
         - Michael Pollan, author of "In Defense of Food"