Hey, folks,
I've got an HP Proliant DL580 G5 throwing ECC errors. This is annoying, since a) it's all new as of a few months ago, and b) it's *fully* populated. The two things I need to figure out are a) *which* DIMM it is, and b) is it mirrored; if so, which *other* DIMM needs to come out until we get replacements from the OEM.
Here's one of many, all identical, from dmesg: EDAC MC0: CE row 12, channel 1, label "": Corrected error (Branch=0, Channel 1), DRAM-Bank=2 RD RAS=8218 CAS=500, CE Err=0x10000, Syndrome=0x6cad8e02(Correctable Patrol Data ECC))
I see the Bank=2, so I assume that's the first riser board on the left; but I can't identify which of the four (?) DIMMs on it is the problem.
I've been googling, and skimming useless manuals, and have just been trying to look under /sys/devices/system/edac/mc/mc0/. I see ce_count there showing thousands; but all of the ce_count files under csrow[0-7] show zero.
Clues, anyone?
mark
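A small sketch for tallying which row/channel/branch combinations the kernel is reporting, assuming every message has the same shape as the line quoted above. The function name parse_edac_ce is made up for illustration. My understanding, which is worth checking against the EDAC driver source for this chipset, is that row, channel and branch are the fields that narrow down the slot, while DRAM-Bank, RAS and CAS are addresses inside the DRAM rather than a riser number.

```shell
#!/bin/sh
# Read kernel log lines on stdin and print the location fields of each
# EDAC "CE" message as: row=<n> channel=<n> branch=<n>
# Lines that do not match the quoted message format are silently skipped.
parse_edac_ce() {
    sed -n 's/.*CE row \([0-9][0-9]*\), channel \([0-9][0-9]*\),.*Branch=\([0-9][0-9]*\).*/row=\1 channel=\2 branch=\3/p'
}
```

Something like `dmesg | parse_edac_ce | sort | uniq -c` then shows whether all the errors really point at one location.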
On 4/24/2013 10:34 AM, m.roth@5-cent.us wrote:
> Hey, folks,
> I've got an HP Proliant DL580 G5 throwing ECC errors. This is annoying, since a) it's all new as of a few months ago, and b) it's *fully* populated. The two things I need to figure out are a) *which* DIMM it is, and b) is it mirrored; if so, which *other* DIMM needs to come out until we get replacements from the OEM.
> Here's one of many, all identical, from dmesg: EDAC MC0: CE row 12, channel 1, label "": Corrected error (Branch=0, Channel 1), DRAM-Bank=2 RD RAS=8218 CAS=500, CE Err=0x10000, Syndrome=0x6cad8e02(Correctable Patrol Data ECC))
call HP, that new server should be under support contract, no?
John R Pierce said the following on 24/04/2013 19:43:
> call HP, that new server should be under support contract, no?
A ProLiant G5 is anything but "new" :)
Better buy some compatible RAM, because original HP memory for old servers is very expensive.
Ciao, luigi
--
Luigi Rosa
"On the Day of the End, English will be of no use to you." (Franco Battiato, "Il re del mondo")
Luigi Rosa wrote:
> John R Pierce said the following on 24/04/2013 19:43:
>> call HP, that new server should be under support contract, no?
> A ProLiant G5 is anything but "new" :)
> Better buy some compatible RAM, because original HP memory for old servers is very expensive.
The *memory* was new - I replaced all, I think, of the original memory. The server's from '09. If they had a warranty, it's well past that, and HP won't chat or email without $$$.
mark
m.roth@5-cent.us said the following on 24/04/2013 19:51:
> The *memory* was new - I replaced all, I think, of the original memory. The server's from '09. If they had a warranty, it's well past that, and HP won't chat or email without $$$.
ProLiant DL 580 servers have an integrated log.
If you boot from the SmartStart CD you can run the "Integrated Management Log Viewer" application and see if the system has logged any event related to ECC memory.
If you find some errors about ECC memory, you have a faulty memory module (the entry in the integrated log SHOULD say which module is faulty).
If the memory module is new you should be able to get a replacement.
Ciao, luigi
--
Luigi Rosa
Any sufficiently advanced bug is indistinguishable from a feature.
Luigi Rosa wrote:
> m.roth@5-cent.us said the following on 24/04/2013 19:51:
>> The *memory* was new - I replaced all, I think, of the original memory. The server's from '09. If they had a warranty, it's well past that, and HP won't chat or email without $$$.
> ProLiant DL 580 servers have an integrated log.
> If you boot from the SmartStart CD you can run the "Integrated Management Log Viewer" application and see if the system has logged any event related to ECC memory.
> If you find some errors about ECC memory, you have a faulty memory module (the entry in the integrated log SHOULD say which module is faulty).
> If the memory module is new you should be able to get a replacement.
Oh, I know I can get a replacement. In the meantime, it's in *use*, and I need to arrange to be able to take it down. Then there's the issue of what comes out - it's got, I don't remember, 32 DIMMs maybe, including 3 or 4 riser boards. The bank=2 makes me *think* it's riser 2, but which of the four? And where's its mirror (I think it's mirrored memory)?
Good idea, though, and I just installed OpenIPMI and ipmitool... and the only thing that ipmitool sel list shows is a power supply failure yesterday. I did go into the datacenter and look at it, and it's got this cute pull-out little display... and it's not showing any of the DIMMs as failing, which goes with the results of cat /sys/devices/system/edac/mc/mc0/csrow*/*count *all* giving me zero, though /sys/devices/system/edac/mc/mc0/ce_count shows 20260 and rising.
mark
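A minimal sketch for seeing which EDAC counters are actually non-zero, rather than cat-ing them one directory at a time. The function name edac_nonzero_counts is made up, and the root directory is a parameter only so the function can be pointed at a copy of the tree. If I remember the EDAC sysfs layout correctly, each mcN directory also carries a ce_noinfo_count for corrected errors the driver could not attribute to a csrow; if that is where the 20260 is accumulating, it would explain the per-csrow files all reading zero, but that is a guess about this particular box.

```shell
#!/bin/sh
# Print "<value> <relative path>" for every *_count file under an EDAC
# root directory whose value is greater than zero.
edac_nonzero_counts() {
    root=${1:-/sys/devices/system/edac/mc}
    find "$root" -name '*_count' -type f 2>/dev/null | sort |
    while IFS= read -r f; do
        n=$(cat "$f" 2>/dev/null)
        case $n in
            ''|*[!0-9]*) continue ;;   # skip unreadable or non-numeric files
        esac
        [ "$n" -gt 0 ] && printf '%s %s\n' "$n" "${f#"$root"/}"
    done
    return 0
}
```

Run with no argument it walks /sys/devices/system/edac/mc; anything it prints under a csrowN directory is a candidate slot.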
From: "m.roth@5-cent.us" m.roth@5-cent.us
> Good idea, though, and I just installed OpenIPMI and ipmitool... and the only thing that ipmitool sel list shows is a power supply failure yesterday. I did go into the datacenter and look at it, and it's got this cute pull-out little display... and it's not showing any of the DIMMs as failing, which goes with the results of cat /sys/devices/system/edac/mc/mc0/csrow*/*count *all* giving me zero, though /sys/devices/system/edac/mc/mc0/ce_count shows 20260 and rising.
Install the hp-health tools, and use hplog to get more info.
You might need to install compat-libstdc++, and temporarily put back the default '/etc/redhat-release'.
And, if you do not have them already, there are also: hpdiags, hpacucli, hpadu, hponcfg.
For your DIMM problem, you could try the hp-health web interface (I never use that, but I think there are some test options there), or I would just boot from SmartStart and do a RAM check.
JD
It won't help you on troubleshooting which RAM module is bad, but dmidecode may be helpful in figuring out how many slots/sticks you have and what's populated and not populated.
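A rough sketch of the dmidecode idea, assuming the usual layout of `dmidecode -t memory` output, where each "Memory Device" record has a Size line followed by a Locator line. The function name dimm_slots is made up, and the slot names it prints are whatever the BIOS reports, which may or may not match the labels on the board.

```shell
#!/bin/sh
# Summarise "Memory Device" records from `dmidecode -t memory` output
# (read on stdin) as "<Locator>: <Size>", one line per slot.
dimm_slots() {
    awk -F': ' '
        /^Memory Device/            { in_dev = 1; size = ""; next }
        /^$/                        { in_dev = 0 }
        in_dev && /^[ \t]+Size:/    { size = $2 }
        in_dev && /^[ \t]+Locator:/ { print $2 ": " size }
    '
}
```

Typical use would be `dmidecode -t memory | dimm_slots` as root; empty slots show up with whatever text the BIOS puts in the Size field.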
Typically if the lights are not on on that display, the RAM is tossing ECC errors or similar, but not fully failing. I have a bunch of G6 and G7 machines, but no G5 to look at to assist you.
The G8 machines should have been given a different model number, they're completely different beasts (and there's a number of things I'm learning to dislike about them, honestly).
I had a G7 that kept setting the RAM lights as if it had a RAM problem, so the server support vendor visited more than once. The real problem was a failing CPU. I mention it, because I've seen RAM problems that really weren't and were misdiagnosed by the relatively crude monitoring built into those motherboards, more than once. I've also run the HP diagnostics for a full day, and had it find absolutely nothing, and have the lights come back on 5 minutes after firing the on-disk OS back up. Same thing with other tools like memtest86.
Swap the RAM out completely. If that doesn't fix it, swap the associated processor out. I've never seen any other hardware in those pizza box machines be the cause of the RAM problems you're seeing.
If you can't swap it completely, swap sides and move it to the other side. See if it follows the RAM or the slots. Often it follows the slots, and the problem is the CPU which talks to that "half" of the motherboard, not the RAM.
At least that's what I've seen... YMMV.
Nate
On 4/25/2013 1:49 PM, Nathan Duehr wrote:
> I had a G7 that kept setting the RAM lights as if it had a RAM problem, so the server support vendor visited more than once. The real problem was a failing CPU. I mention it, because I've seen RAM problems that really weren't and were misdiagnosed by the relatively crude monitoring built into those motherboards, more than once.
That's not altogether surprising: the newer Intel CPUs (and all the AMD Opterons) integrate the RAM controller into the CPU chip, and it is basically impossible to tell which is at fault.
Nathan Duehr wrote:
> It won't help you on troubleshooting which RAM module is bad, but dmidecode may be helpful in figuring out how many slots/sticks you have and what's populated and not populated.
Heh. It's *fully* populated, the whole m/b, and all four optional risers.
> Typically if the lights are not on on that display, the RAM is tossing ECC errors or similar, but not fully failing. I have a bunch of G6 and G7 machines, but no G5 to look at to assist you.
That's what it's doing, ECC correctable. BUT /sys/devices/system/edac/mc/mc0/ce_count showed, as I noted, a ton of errors, but under mc0 was csrow[0-7], and the ce_count in each was *0* - not sure how that could be, but it was.
<snip>
> Swap the RAM out completely. If that doesn't fix it, swap the associated
Can't do that. I don't have 256G of FBDIMMs laying around, nor do I have another identical box (well, maybe one, and I'm about to surplus that).
But it's worse than that, Jim.... The memory's *mirrored*, *and* it's requiring the entire m/b to be populated before the optional risers... and the optional risers are each paired.
<snip>
> If you can't swap it completely, swap sides and move it to the other side. See if it follows the RAM or the slots. Often it follows the slots, and the problem is the CPU which talks to that "half" of the motherboard, not the RAM.
<snip> Thanks, Nate, I was just hoping someone could show me how to translate what the kernel's throwing to be able to identify the explicit DIMM.
And a) it's technically not ours, it belongs to another Institute, but they're doing intramural work, and we're running it, and b) it's long out of warranty, so I can't even talk to HP.
What I've done so far, after scheduling downtime, was to pull DIMM 2c, and its mate 6c, then take two from riser 4, and put them on the m/b. After a couple of reboots, I discovered that I couldn't put it all back without those two DIMMs on riser 4, nor could I just leave riser 4 out; I had to pull *both* risers 3 and 4.
It's been back up all day, I ran stress on it for a bit, and my user tried some stuff, and no errors, so I now know that it's a DIMM on one of those two risers, or it's one of the ones I pulled from the m/b. Only 1 of 8, instead of 1 of 32....
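For each round of the pull-and-test narrowing, it can help to measure the counter rather than eyeball it. This is only a sketch: ce_delta is a made-up helper that runs a command and prints how much a counter file grew while it ran. The command's own output is sent to stderr so that the number is the only thing on stdout.

```shell
#!/bin/sh
# Run a command and print how much a numeric counter file grew meanwhile.
# Usage: ce_delta <counter-file> <command> [args...]
ce_delta() {
    counter=$1
    shift
    before=$(cat "$counter") || return 1
    "$@" >&2                      # keep the command's output off stdout
    after=$(cat "$counter") || return 1
    echo $((after - before))
}
```

For example, `ce_delta /sys/devices/system/edac/mc/mc0/ce_count` followed by whatever stress command you normally run; a result of 0 after a long enough run suggests the bad DIMM is among the ones currently out of the box.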
In addition, after much googling, I finally found HP system management, and the SIM, separately. Installed them... and SIM seems as though it's missing something. I try to log on, via the SM homepage, and it takes better than 5 min to get to the page. When I click on memory in system, that takes a number of minutes... and tells me *nothing* at all, where the SM web page at least used to show me what's occupied.
Annoyances, all the way around. I expect to bounce the system tomorrow morning, and put the two I pulled from the m/b onto riser 4, then pull riser 2 and replace it with 4; hopefully, we'll see errors, and I'll be down to 1 of four bad.
mark
On 4/25/2013 2:09 PM, m.roth@5-cent.us wrote:
> In addition, after much googling, I finally found HP system management, and the SIM, separately. Installed them... and SIM seems as though it's missing something. I try to log on, via the SM homepage, and it takes better than 5 min to get to the page. When I click on memory in system, that takes a number of minutes... and tells me *nothing* at all, where the SM web page at least used to show me what's occupied.
I never managed to get anything useful out of that HP system management stuff; I think I tried from scratch 3 times, on different systems. I dunno what bit I was missing. The instructions are all for separate parts, and obviously I was missing something, but I'd end up with a web framework that had nothing useful in it. At the time I was more interested in the RAID management, but there wasn't any info on the CPU hardware, either.
On 4/24/2013 10:46 AM, Luigi Rosa wrote:
>> call HP, that new server should be under support contract, no?
> A ProLiant G5 is anything but "new" :)
Ah, he said 'all new as of a few months ago'. Actually, that's a 2009-ish server, with quad Tigerton/Dunnington quadcore processors (roughly equivalent to the Core2 Q series), and it came with a 3 year onsite warranty. It uses PC2-5300 fully buffered DRAM (DDR2-667), so yeah, that RAM is going to be expensive since it's an older generation. It looks to take 4x4 1/2/4/8GB sticks, in pairs, but if you have all 4 CPUs, you probably need to populate all 4 banks. Oh, and there were memory expansion mezzanine cards, bringing it up to 32 DIMMs total.
I figured for laughs I'd try and find a RAM layout for it, but the best I've found says it's printed on the lid to the CPU/memory module. http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc...
John R Pierce wrote:
> On 4/24/2013 10:46 AM, Luigi Rosa wrote:
>>> call HP, that new server should be under support contract, no?
>> A ProLiant G5 is anything but "new" :)
> Ah, he said 'all new as of a few months ago'. Actually, that's a 2009-ish server, with quad Tigerton/Dunnington quadcore processors (roughly equivalent to the Core2 Q series), and it came with a 3 year onsite warranty. It uses PC2-5300 fully buffered DRAM (DDR2-667), so yeah, that RAM is going to be expensive since it's an older generation. It looks to take 4x4 1/2/4/8GB sticks, in pairs, but if you have all 4 CPUs, you probably need to populate all 4 banks. Oh, and there were memory expansion mezzanine cards, bringing it up to 32 DIMMs total.
> I figured for laughs I'd try and find a RAM layout for it, but the best I've found says it's printed on the lid to the CPU/memory module. http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc...
I need to consider that - it might help. Anyway, that's it... and *everything* is populated. Another HBS (you know, the technical term, Honkin' Big Server... and I forget how many U's)
On Wed, 24 Apr 2013, m.roth@5-cent.us wrote:
> Hey, folks,
> I've got an HP Proliant DL580 G5 throwing ECC errors. This is annoying, since a) it's all new as of a few months ago, and b) it's *fully* populated. The two things I need to figure out are a) *which* DIMM it is, and b) is it mirrored; if so, which *other* DIMM needs to come out until we get replacements from the OEM.
> Here's one of many, all identical, from dmesg: EDAC MC0: CE row 12, channel 1, label "": Corrected error (Branch=0, Channel 1), DRAM-Bank=2 RD RAS=8218 CAS=500, CE Err=0x10000, Syndrome=0x6cad8e02(Correctable Patrol Data ECC))
> I see the Bank=2, so I assume that's the first riser board on the left; but I can't identify which of the four (?) DIMMs on it is the problem.
> I've been googling, and skimming useless manuals, and have just been trying to look under /sys/devices/system/edac/mc/mc0/. I see ce_count there showing thousands; but all of the ce_count files under csrow[0-7] show zero.
> Clues, anyone?
Is there anything in the IML log on iLO? Also, did you try just re-seating the memory or moving it into other slots to see if you can track it down that way?
Regards,
I recently switched our workstations from gdm to kdm, because I was receiving some complaints with gdm about having to set the desktop halfway through the login (versus being able to set it from the start of the login process), as well as having the mile long list of previous logins (which, by the way, if anyone knows how to turn that off, lemme know, please!).
After switching the display manager to kdm, we started having problems with our vncserver sessions. Every single session was kde, without regard for the contents of the xstartup file for that particular user. I have users that want gnome and others that want kde, and some project accounts that use some other custom stuff, so obviously this was a problem.
The only "configuration" I did after yum group-installing KDE-Desktop was to set up the /etc/sysconfig/desktop file to contain:
DESKTOP="KDE"
DISPLAYMANAGER="KDE"
so I am not sure what is up with kdm to be causing this. I tried looking online but only found some ancient debian/ubuntu questions that are mostly unanswered or resolved in ways that won't work for us (hard coding something in xinitrc, for example).
So, does anyone have any ideas as to what is going on and what I can do to resolve it? I am currently in the process of rolling back the workstations to the gdm configuration as a "fix", but if possible I'd really like to be able to actually sort out what is going on. Our systems are not heavily customized, being "software development workstation" installs with a few extra group installs (including the centos base tigervnc-server package) and a handful of third-party rpmforge packages installed, and kdm seems to work properly on the console (allowing selection of other desktops, etc), it's only with the ~/.vnc/xstartup that we are having issues.
Thanks! Miranda
On 2013/04/24 12:02, Miranda Hawarden-Ogata wrote:
For anyone in the future trying to figure this out, thought I'd post. Turned out that the "default" xstartup that I thought was explicitly calling gnome was not doing so, and was calling the system default desktop instead. Once I explicitly called gnome via a gnome-session script call in xstartup, the problem was resolved.
Although a great many of the online instructions for setting gnome as your vnc desktop tell you to unset some variables and then call startx, all that does is call the system default desktop, which is fine and dandy if you're running gdm as your manager, but completely worthless if you are running kdm. You need to explicitly call gnome-session instead of startx, and you can also dump everything else in the "default" file as it is not necessary. A functional xstartup can be as barren as:
(gnome)
#!/bin/sh
gnome-session &

(kde)
#!/bin/sh
startkde &
Hope this helps someone in the future :D
Thanks! Miranda
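For sites with a mix of gnome and kde users, one possible variation on the above is a single shared xstartup that looks up each user's choice. This is only a sketch under stated assumptions: the function pick_session and the per-user file ~/.vnc/desktop are both invented for illustration and are not something vncserver itself knows about, and falling back to gnome-session is an arbitrary choice.

```shell
#!/bin/sh
# Print the session command for this user, based on a one-word file.
# The default path ~/.vnc/desktop is hypothetical, not a VNC convention.
pick_session() {
    choice_file=${1:-$HOME/.vnc/desktop}
    choice=$(cat "$choice_file" 2>/dev/null)
    case $choice in
        kde)   echo startkde ;;
        gnome) echo gnome-session ;;
        *)     echo gnome-session ;;   # arbitrary fallback for this sketch
    esac
}
```

An xstartup built on this would define the function and then end with `$(pick_session) &`, in the same spirit as the two minimal files above.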