I've got a production system running CentOS 4 that was rock solid until I upgraded from 2.6.9-55 to 2.6.9-78.0.13 (now running 2.6.9-89.0.11). The system now crashes intermittently after a few weeks. I finally caught the panic message:
EDAC MC0: INTERNAL ERROR: channel-b out of range (4 >= 4)
Kernel panic - not syncing: MC0: Uncorrected Error
Looking at the kernel changelog, I see that EDAC support for the Intel 5000 chipset, which this server uses, was added in 2.6.9-68.20.EL.
I'm trying to determine if this is a potential memory issue, or is this related to some other hardware item. Also considering disabling EDAC in the kernel (is "noedac" a valid option?) as a last resort. I will run memtest86+ on the server as soon as possible to check the memory, just formulating my game plan if it's something else.
Thoughts?
Chris
Chris Miller wrote:
Thoughts?
Check your BIOS/system event log for any indication that it is logging memory errors. Most modern server-class motherboards (past 5 years) do this, though not always reliably.
I've also had trouble with memtest86 myself; I prefer to run ctcs:
http://sourceforge.net/projects/va-ctcs/
The software is really old and is picky about what you build it on. If I recall correctly, I could only get it to build on RHEL/CentOS 4, not 5 (though the binaries work fine on 5).
It does a good torture test, which in my experience can find problems faster than memtest86 (which can take days).
nate
nate wrote:
Check your BIOS/system event log for any indication that it is logging memory errors. Most modern server-class motherboards (past 5 years) do this, though not always reliably.
Nothing in the logs, it's a Supermicro X7DVL-E (fyi).
I've also had trouble with memtest86 myself; I prefer to run ctcs:
README.FIRST scares me. Server is 70 miles away, not feeling really good about this. I ran memtest86+ last night for 6+ hours and it came back clean.
The software is really old and is picky about what you build it on. If I recall correctly, I could only get it to build on RHEL/CentOS 4, not 5 (though the binaries work fine on 5).
I just booted the binary from the memtest site under GRUB; it worked fine.
Regards, Chris
Chris,
I've got a production system running CentOS 4 that was rock solid until I upgraded from 2.6.9-55 to 2.6.9-78.0.13 (now running 2.6.9-89.0.11). The system now crashes intermittently after a few weeks. I finally caught the panic message:
EDAC MC0: INTERNAL ERROR: channel-b out of range (4 >= 4)
Kernel panic - not syncing: MC0: Uncorrected Error
Looking at the kernel changelog, I see that EDAC support for the Intel 5000 chipset, which this server uses, was added in 2.6.9-68.20.EL.
Same issue here with a machine running CentOS 5.3. The problem began with a kernel update that introduced EDAC support for the 5000 chipset. See the thread "RAM errors after kernel-update" for more details. I haven't solved the problem yet, but because the machine crashes every two days with this kernel, I had to boot an earlier kernel without the chipset support.
I'm trying to determine if this is a potential memory issue, or is this related to some other hardware item. Also considering disabling EDAC in the kernel (is "noedac" a valid option?) as a last resort. I will run memtest86+ on the server as soon as possible to check the memory, just formulating my game plan if it's something else.
Don't use the memtest86+ version that comes with the CentOS ISO. There is a much newer version available from the author's website. Only the new version identifies the chipset correctly.
On 20-Oct-2009 Michael Schumacher wrote:
I've got a production system running CentOS 4 that was rock solid until I upgraded from 2.6.9-55 to 2.6.9-78.0.13 (now running 2.6.9-89.0.11). The system now crashes intermittently after a few weeks. I finally caught the panic message:
EDAC MC0: INTERNAL ERROR: channel-b out of range (4 >= 4)
Kernel panic - not syncing: MC0: Uncorrected Error
I have also seen this message or something very close. The server is 200 km away and the person who read it to me over the phone wasn't very fluent in English.
That server has an ASUS DSBF-D12 motherboard. Kernel was 2.6.9-89.0.11.EL. The crash could happen within hours or even minutes.
I downgraded to 2.6.9-55.0.9.EL, which doesn't have the i5000_edac module. Now that I have a PDU and remote KVM set up, I'm going to try other kernels tomorrow.
-Philip
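To confirm which installed kernels actually ship an EDAC driver before choosing one to boot, something like the following could help. It is only a minimal sketch: the function name `edac_modules_for_kernel` and its optional second argument are made up for illustration, and it assumes modules live under /lib/modules/<version> as .ko files, which is the usual layout for 2.6 kernels.

```shell
# List EDAC driver modules shipped with one installed kernel version.
# Hypothetical helper: the name and the optional module-root argument
# are illustrative, not part of any CentOS tool.
edac_modules_for_kernel() {
    kver="$1"                      # kernel version string, e.g. from "uname -r"
    root="${2:-/lib/modules}"      # assumed default module tree location
    # Find any module file with "edac" in its name, print the file name only.
    find "$root/$kver" -name '*edac*.ko' 2>/dev/null | sed 's|.*/||' | sort
}
```

Calling `edac_modules_for_kernel "$(uname -r)"` checks the running kernel; empty output suggests that kernel has no EDAC modules on disk. `lsmod | grep edac` shows whether one is currently loaded.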
On 10/21/2009 10:21 PM Philip Gwyn wrote:
I downgraded to 2.6.9-55.0.9.EL, which doesn't have the i5000_edac module. Now that I have a PDU and remote KVM set up, I'm going to try other kernels tomorrow.
-Philip
When I've upgraded a kernel on CentOS, the previous kernels are not removed; they remain in the boot menu, just no longer as the default. E.g.,
cat /boot/grub/menu.lst
...
title CentOS (2.6.18-164.2.1.el5.plus)
        root (hd0,2)
        kernel /vmlinuz-2.6.18-164.2.1.el5.plus ro root=/dev/mapper/luks-3d723b4f-0184-438d-9cb9-9ebff16e683a rhgb quiet
        initrd /initrd-2.6.18-164.2.1.el5.plus.img
title CentOS (2.6.18-164.el5)
        root (hd0,2)
        kernel /vmlinuz-2.6.18-164.el5 ro root=/dev/mapper/luks-3d723b4f-0184-438d-9cb9-9ebff16e683a rhgb quiet
        initrd /initrd-2.6.18-164.el5.img
title CentOS (2.6.18-128.7.1.el5)
        root (hd0,2)
        kernel /vmlinuz-2.6.18-128.7.1.el5 ro root=/dev/mapper/luks-3d723b4f-0184-438d-9cb9-9ebff16e683a rhgb quiet
        initrd /initrd-2.6.18-128.7.1.el5.img
...
If your /boot/grub/menu.lst is similar, then you need only select a previously installed kernel at the boot menu. You can access this via your remote KVM setup, yes?
In the past I've edited menu.lst to change what's booted, i.e., I rearranged the order of the stanzas to make the first one, which is the default (the one booted if no action is taken at the boot menu), the working/desired kernel.
hth, ken
On Thu, 2009-10-22 at 04:20 -0400, ken wrote:
<snip>
In the past I've edited menu.lst to change what's booted, i.e., I rearranged the order of the stanzas to make the first one, which is the default (the one booted if no action is taken at the boot menu), the working/desired kernel.
Don't forget that you can use the default command in grub.conf, e.g. "default 1", rather than rearranging the stanzas every time.
Use "info grub", select the "index" entry and then look for "default"
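GRUB counts menu entries from 0 in the order their "title" lines appear, so the number given to "default" is the position of the stanza, not the kernel version. A rough sketch for printing those positions follows; the function name `list_grub_entries` is made up for illustration, and it assumes a GRUB legacy menu.lst/grub.conf where each stanza starts with a "title" line.

```shell
# Print each GRUB legacy menu entry with the 0-based index "default" expects.
# Hypothetical helper: the name and the optional path argument are illustrative.
list_grub_entries() {
    menu="${1:-/boot/grub/menu.lst}"   # assumed default location of the menu file
    # Every stanza begins with a "title" line; number them starting at 0
    # and print the title text without the leading keyword.
    awk '$1 == "title" {
        sub(/^[ \t]*title[ \t]+/, "")
        printf "%d: %s\n", n++, $0
    }' "$menu"
}
```

Run with no argument, `list_grub_entries` reads /boot/grub/menu.lst; for the three-stanza listing quoted earlier in the thread, the 2.6.18-164.el5 kernel would be entry 1.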