[CentOS] DIMM problem

Thu Apr 25 21:09:27 UTC 2013
m.roth at 5-cent.us <m.roth at 5-cent.us>

Nathan Duehr wrote:
> It won't help you on troubleshooting which RAM module is bad, but
> dmidecode may be helpful in figuring out how many slots/sticks you have
> and what's populated and not populated.
>
Heh. It's *fully* populated, the whole m/b, and all four optional risers.
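
For anyone who does need the slot inventory Nate mentions, here is a
minimal sketch in Python of tabulating it. It assumes the usual layout of
`dmidecode -t memory` output, with one "Memory Device" record per slot
carrying "Locator" and "Size" fields; the function name
parse_memory_devices is made up for this illustration.

```python
def parse_memory_devices(text):
    """Collect the Locator and Size fields of each 'Memory Device' record.

    `text` is assumed to be the output of `dmidecode -t memory`.
    Returns one dict per record, in the order the records appear.
    """
    devices = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Memory Device":
            # Start of a new per-slot record.
            current = {}
            devices.append(current)
        elif not stripped:
            # A blank line ends the current record.
            current = None
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            # Exact match, so "Bank Locator" is not mistaken for "Locator".
            if key in ("Locator", "Size"):
                current[key] = value.strip()
    return devices


# Usage (as root), not run here:
#   import subprocess
#   out = subprocess.run(["dmidecode", "-t", "memory"],
#                        capture_output=True, text=True, check=True).stdout
#   for dev in parse_memory_devices(out):
#       print(dev.get("Locator"), dev.get("Size"))
```

An empty slot normally reports a Size of "No Module Installed", so on a
fully populated box every record should show a real size.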

> Typically if the lights are not on on that display, the RAM is tossing ECC
> errors or similar, but not fully failing.  I have a bunch of G6 and G7
> machines, but no G5 to look at to assist you.

That's what it's doing: ECC correctable errors. BUT
/sys/devices/system/edac/mc/mc0/ce_count showed, as I noted, a ton of
errors, while under mc0 there were csrow[0-7], and the ce_count in each
was *0*. Not sure how that could be, but it was.
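
A minimal sketch, in Python, of dumping those counters side by side. It
assumes the legacy EDAC sysfs layout described above (mcN/ce_count and
mcN/csrowN/ce_count), plus mcN/ce_noinfo_count, which, if I'm reading the
kernel's EDAC documentation correctly, counts correctable errors the
driver could not attribute to a slot. That might be one way for the mc0
total to be large while every csrow reads 0, but treat that as a guess.
The names read_count and edac_summary are made up for this illustration.

```python
from pathlib import Path


def read_count(path):
    """Return the integer in a sysfs counter file, or None if unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def edac_summary(root="/sys/devices/system/edac/mc"):
    """Gather correctable-error counters per memory controller and csrow."""
    root = Path(root)
    summary = {}
    if not root.is_dir():
        # No EDAC driver loaded, or a different sysfs layout.
        return summary
    for mc in sorted(root.glob("mc[0-9]*")):
        rows = {}
        for row in sorted(mc.glob("csrow[0-9]*")):
            rows[row.name] = read_count(row / "ce_count")
        summary[mc.name] = {
            "ce_count": read_count(mc / "ce_count"),
            # Assumed attribute: errors with no slot information.
            "ce_noinfo_count": read_count(mc / "ce_noinfo_count"),
            "csrows": rows,
        }
    return summary


if __name__ == "__main__":
    for mc, info in edac_summary().items():
        print(mc, "ce_count:", info["ce_count"],
              "ce_noinfo_count:", info["ce_noinfo_count"])
        for row, count in info["csrows"].items():
            print("   ", row, "ce_count:", count)
```

If ce_noinfo_count accounts for the whole total, the driver simply isn't
telling the kernel which row the errors came from, and no amount of
staring at the csrow counters will identify the DIMM.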
<snip>
> Swap the RAM out completely.  If that doesn't fix it, swap the associated

Can't do that. I don't have 256G of FBDIMMs lying around, nor do I have
another identical box (well, maybe one, and I'm about to surplus that).

But it's worse than that, Jim.... The memory's *mirrored*, *and* it's
requiring the entire m/b to be populated before the optional risers... and
the optional risers are each paired.
<snip>
> If you can't swap it completely, swap sides and move it to the other side.
>  See if it follows the RAM or the slots.  Often it follows the slots, and
> the problem is the CPU which talks to that "half" of the motherboard, not
> the RAM.
<snip>
Thanks, Nate, I was just hoping someone could show me how to translate
what the kernel's throwing to be able to identify the explicit DIMM.

And a) it's technically not ours, it belongs to another Institute, but
they're doing intramural work, and we're running it, and b) it's long out
of warranty, so I can't even talk to HP.

What I've done so far, after scheduling downtime, was to pull DIMM 2c and
its mate 6c, then take two from riser 4 and put them on the m/b. After a
couple of reboots, I discovered that I couldn't put it all back without
those two DIMMs on riser 4, nor could I just leave riser 4 out: I had to
pull *both* riser 3 and 4.

It's been back up all day, I ran stress on it for a bit, and my user tried
some stuff, and no errors, so I now know that it's a DIMM on one of those
two risers, or it's one of the ones I pulled from the m/b. Only 1 of 8,
instead of 1 of 32....

In addition, after much googling, I finally found HP system management,
and the SIM, separately. Installed them... and SIM seems as though it's
missing something. I try to log on, via the SM homepage, and it takes
better than 5 min to get to the page. When I click on memory in system,
that takes a number of minutes... and tells me *nothing* at all, where the
SM web page at least used to show me what's occupied.

Annoyances, all the way around. I expect to bounce the system tomorrow
morning, and put the two I pulled from the m/b onto riser 4, then pull
riser 2 and replace it with 4; hopefully, we'll see errors, and I'll be
down to 1 of 4 bad.

        mark