Nathan Duehr wrote:
> It won't help you on troubleshooting which RAM module is bad, but
> dmidecode may be helpful in figuring out how many slots/sticks you have
> and what's populated and not populated.
>
Heh. It's *fully* populated, the whole m/b, and all four optional risers.

> Typically if the lights are not on on that display, the RAM is tossing ECC
> errors or similar, but not fully failing. I have a bunch of G6 and G7
> machines, but no G5 to look at to assist you.

That's what it's doing, ECC correctable. BUT
/sys/devices/system/edac/mc/mc0/ce_count showed, as I noted, a ton of
errors, but under mc0 was csrow[0-7], and the ce_count in each was *0* -
not sure how that could be, but it was.
<snip>
> Swap the RAM out completely. If that doesn't fix it, swap the associated

Can't do that. I don't have 256G of FBDIMMs laying around, nor do I have
another identical box (well, maybe one, and I'm about to surplus that).

But it's worse than that, Jim.... The memory's *mirrored*, *and* it's
requiring the entire m/b to be populated before the optional risers... and
the optional risers are each paired.
<snip>
> If you can't swap it completely, swap sides and move it to the other side.
> See if it follows the RAM or the slots. Often it follows the slots, and
> the problem is the CPU which talks to that "half" of the motherboard, not
> the RAM.
<snip>
Thanks, Nate, I was just hoping someone could show me how to translate
what the kernel's throwing to be able to identify the explicit DIMM. And
a) it's technically not ours, it belongs to another Institute, but they're
doing intramural work, and we're running it, and b) it's long out of
warranty, so I can't even talk to HP.

What I've done so far, after scheduling downtime, was to pull DIMM 2c, and
its mate 6c, then take two from riser 4, and put them on the m/b.
After a couple of reboots, I discovered that I couldn't put it all back
without those two DIMMs on riser 4, nor could I just leave riser 4 out; I
had to pull *both* risers 3 and 4. It's been back up all day, I ran stress
on it for a bit, and my user tried some stuff, and no errors, so I now know
that it's a DIMM on one of those two risers, or it's one of the ones I
pulled from the m/b. Only 1 of 8, instead of 1 of 32....

In addition, after much googling, I finally found HP system management,
and the SIM, separately. Installed them... and SIM seems as though it's
missing something. I try to log on, via the SM homepage, and it takes
better than 5 min to get to the page. When I click on memory in system,
that takes a number of minutes... and tells me *nothing* at all, where the
SM web page at least used to show me what's occupied.

Annoyances, all the way around.

I expect to bounce the system tomorrow morning, and put the two I pulled
from the m/b onto riser 4, then pull riser 2 and replace it with 4;
hopefully, we'll see errors, and I'll be down to 1 of 4 bad.

        mark
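
For anyone following along with the mc0-versus-csrow puzzle above, here is
a minimal sketch (not something from the thread) of walking the EDAC sysfs
tree and dumping every correctable-error counter in one go. It assumes the
legacy csrow layout under /sys/devices/system/edac/mc, and it also reads
ce_noinfo_count, chN_ce_count and chN_dimm_label, which the kernel EDAC
documentation describes; whether the driver on this particular box fills
them in is an assumption. The function name read_edac_counts is made up
for illustration. One possible (unconfirmed) reason for a large mc0
ce_count with all-zero csrows is that the errors are being tallied in
ce_noinfo_count, i.e. the driver could not attribute them to a slot.

```python
from pathlib import Path

# Legacy EDAC sysfs location mentioned in the post.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def _read_text(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def _read_int(path):
    """Return the integer stored in a sysfs file, or None if absent/garbled."""
    text = _read_text(path)
    try:
        return int(text) if text is not None else None
    except ValueError:
        return None


def read_edac_counts(root=EDAC_ROOT):
    """Collect correctable-error counts per controller, csrow and channel.

    Attributes that do not exist on a given system come back as None
    rather than raising, since drivers differ in what they expose.
    """
    report = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        csrows = {}
        for csrow in sorted(mc.glob("csrow[0-9]*")):
            channels = {}
            for ch_file in sorted(csrow.glob("ch*_ce_count")):
                ch = ch_file.name[: -len("_ce_count")]  # e.g. "ch0"
                channels[ch] = {
                    "ce_count": _read_int(ch_file),
                    "label": _read_text(csrow / f"{ch}_dimm_label"),
                }
            csrows[csrow.name] = {
                "ce_count": _read_int(csrow / "ce_count"),
                "channels": channels,
            }
        report[mc.name] = {
            "ce_count": _read_int(mc / "ce_count"),
            # Errors the driver could not pin to any slot, if exposed.
            "ce_noinfo_count": _read_int(mc / "ce_noinfo_count"),
            "csrows": csrows,
        }
    return report


if __name__ == "__main__":
    for mc_name, mc in read_edac_counts().items():
        print(f"{mc_name}: ce={mc['ce_count']} noinfo={mc['ce_noinfo_count']}")
        for row_name, row in mc["csrows"].items():
            print(f"  {row_name}: ce={row['ce_count']}")
            for ch_name, ch in row["channels"].items():
                print(f"    {ch_name}: ce={ch['ce_count']} label={ch['label']}")
```

If the per-channel labels come back empty or generic, the silkscreen
mapping still has to come from somewhere else (dmidecode output or the
vendor docs); the script only shows what the kernel already knows.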