Hi,
I've installed CentOS 5.5 (plus updates) on a machine with an Intel DP43BF motherboard. To make Linux detect the PCI devices, I added pci=assign-busses to my GRUB configuration.
Everything runs fine, but within less than 2 days of uptime the machine simply freezes (black console, no connectivity). This has happened more than once, so I consider it a real problem. Memtest passed without a problem, and the machine uses a CompactFlash card (SanDisk Extreme III 4GB) as its disk.
The only error messages I could find are in /var/log/messages, but they appear hours before the actual lockup:
kernel: 0000:00:1a.7 EHCI: BIOS handoff failed (BIOS bug ?) 01010001
kernel: 0000:00:1d.7 EHCI: BIOS handoff failed (BIOS bug ?) 01010001
kernel: eth4: PCI Bus error a290.
kernel: eth4: PCI Bus error 0290.
kernel: eth3: PCI Bus error 2290.
kernel: eth3: PCI Bus error 0290.
Any tips?
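A rough way to read the four hex digits in the "PCI Bus error" lines above, sketched in Python. It assumes the value is the NIC's 16-bit PCI status register, which is what the Linux 8139too driver appears to log; nothing in the thread confirms which driver printed these lines, so treat that as an assumption. The function names are made up for illustration, and only the error-related status bits are listed.

```python
import re

# Error-related bits of the standard PCI status register.
# ASSUMPTION: the logged hex value is this register (not confirmed in the thread).
PCI_STATUS_ERROR_BITS = {
    0x8000: "detected parity error",
    0x4000: "signaled system error",
    0x2000: "received master abort",
    0x1000: "received target abort",
    0x0800: "signaled target abort",
    0x0100: "master data parity error",
}

# Matches lines such as "kernel: eth4: PCI Bus error a290."
LINE_RE = re.compile(r"(eth\d+): PCI Bus error ([0-9a-fA-F]{4})")


def decode_status(value):
    """Return the names of the error bits set in a 16-bit status value."""
    ordered = sorted(PCI_STATUS_ERROR_BITS.items(), reverse=True)
    return [name for bit, name in ordered if value & bit]


def decode_line(line):
    """Return (interface, [error names]) for a log line, or None if it does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return match.group(1), decode_status(int(match.group(2), 16))


if __name__ == "__main__":
    for sample in ("kernel: eth4: PCI Bus error a290.",
                   "kernel: eth3: PCI Bus error 2290."):
        print(decode_line(sample))
```

Under that assumption, a value like 0290 would carry no error bits at all, which would fit a second message logged after the error flags had been cleared.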
On Mon, Dec 27, 2010 at 11:04 AM, robert mena robert.mena@gmail.com wrote:
Hi, Everything runs fine, but within less than 2 days of uptime the machine simply freezes (black console, no connectivity). This has happened more than once, so I consider it a real problem.
What kind of CPU is in there? This sounds like what happens to some brands of CPUs when they overheat. Others just melt.
On 12/27/10 11:04 AM, robert mena wrote:
That's a desktop board, right? It probably doesn't have ECC or any of the other system integrity features of a server board, and desktop boards usually don't have the I/O bus bandwidth to handle substantial I/O workloads.
PCI bus errors are not a good thing at all, either. You have 5 Ethernet adapters in use? What sort of Ethernet controller? I believe those PCI bus errors are being reported by your Ethernet adapters, and they could be the result of excessive bus contention. A single GigE link can more than saturate a 32-bit, 33 MHz (parallel) PCI bus. All the PCI slots on a desktop board like yours are on the same bus and contend for the same bandwidth.
Also, as mentioned, thermal problems are a definite possibility. Intel CPUs tend to self-throttle if they get too hot, but the chipset might not be as good at it (so watch the chipset and memory temperatures as well as the CPU). Another possible cause would be silent memory corruption, although that would be more likely to cause a kernel fault ("Fatal kernel error - system halted"). However, if your display is in a GUI mode, you won't see this unless the console is directed to a serial port that is being monitored.
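The bus-contention argument can be put into numbers with a back-of-the-envelope sketch in Python. It uses nominal figures only (32-bit bus, 33.33 MHz clock, Ethernet line rate with no protocol overhead), and it assumes the worst case of every NIC running at line rate in both directions at once. The function names and the five-NIC example are illustrative, not taken from any tool.

```python
# Nominal figures for conventional 32-bit/33 MHz PCI. Real throughput is lower
# because of arbitration and protocol overhead.
PCI_WIDTH_BITS = 32
PCI_CLOCK_HZ = 33_333_333


def pci_peak_bytes_per_sec(width_bits=PCI_WIDTH_BITS, clock_hz=PCI_CLOCK_HZ):
    """Theoretical peak of a shared parallel PCI bus, in bytes per second."""
    return (width_bits // 8) * clock_hz


def ethernet_bytes_per_sec(link_mbps, full_duplex=True):
    """Line rate of one Ethernet link in bytes per second (both directions if full duplex)."""
    one_way = link_mbps * 1_000_000 // 8
    return one_way * 2 if full_duplex else one_way


def bus_utilization(link_speeds_mbps, full_duplex=True):
    """Fraction of the PCI peak used if every listed NIC runs at line rate at once."""
    demand = sum(ethernet_bytes_per_sec(s, full_duplex) for s in link_speeds_mbps)
    return demand / pci_peak_bytes_per_sec()


if __name__ == "__main__":
    print("one GigE NIC:", round(bus_utilization([1000]), 2))
    # ASSUMPTION: five Fast Ethernet NICs on one bus, all bursting simultaneously.
    print("five 100 Mbps NICs:", round(bus_utilization([100] * 5), 2))
```

On these nominal numbers a single full-duplex GigE link asks for nearly twice what the bus can deliver, and five Fast Ethernet NICs bursting together would need over 90% of it. Sustained traffic from 1 Mbps modems is nowhere near that, so contention would only matter during simultaneous bursts.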
Try turning off the green (power-saving) features on the board completely. Never allow the board to go to sleep; don't even let it put the monitor into power-saving mode.
John
On 12/27/2010 4:19 PM, John R Pierce wrote:
CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos
Hi John,
I'll have a look at that. This seems odd because, if I understand correctly, those settings would only matter if/when the system is idle, and the lockups occur during regular (busy) hours.
BUT... they should be off anyway.
On Mon, Dec 27, 2010 at 5:34 PM, John Plemons john@mavin.com wrote:
Hi John,
Regular Realtek Fast Ethernet. Each one is connected to a broadband modem (1 Mbps each), so I do not think this is bus saturation.
I do not think this is a thermal problem, given the lack of messages (I had this problem in the past with a different machine and got overheating messages, with throttling), but I'll investigate further.
I'll disable the GUI mode to try to catch those errors.
On Mon, Dec 27, 2010 at 5:19 PM, John R Pierce pierce@hogranch.com wrote:
On 12/27/10 9:09 PM, robert mena wrote:
Regular Realtek Fast Ethernet.
IMNSHO, Realtek NICs are pretty close to junk grade. There are far too many variations with far too many weird bugs when they are used for anything more than single-user desktop systems.
Each one is connected to a broadband modem (1 Mbps each), so I do not think this is bus saturation.
What speed is the local link to the modem? Even if your Internet connection is 1 Mbps, if your Ethernet is running at 100baseT, that can mean bursts of around 10 MB/s, and a few of those could potentially cause bus contention issues.
How fast is the LAN? Since the errors were on eth3 and eth4, I'm wondering what eth0, eth1, and eth2 are doing traffic-wise.
I do not think this is a thermal problem, given the lack of messages (I had this problem in the past with a different machine and got overheating messages, with throttling), but I'll investigate further.
I wouldn't rely on that assumption. Thermal monitoring might not be configured correctly for this board, etc.
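One way to check whether thermal monitoring is reporting anything at all on this board, sketched in Python. It assumes the kernel exposes the standard hwmon sysfs interface under /sys/class/hwmon, with temp*_input files holding millidegrees Celsius; whether a sensor driver exists for this board and kernel is not confirmed anywhere in the thread. The function name and the root parameter are hypothetical.

```python
import glob
import os


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {relative sysfs path: temperature in degrees C} for every readable sensor."""
    temps = {}
    # Older kernels place the files under hwmonN/device/, newer ones under hwmonN/.
    patterns = [
        os.path.join(root, "hwmon*", "temp*_input"),
        os.path.join(root, "hwmon*", "device", "temp*_input"),
    ]
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            try:
                with open(path) as handle:
                    millidegrees = int(handle.read().strip())
            except (OSError, ValueError):
                continue  # sensor not readable right now; skip it
            temps[os.path.relpath(path, root)] = millidegrees / 1000.0
    return temps


if __name__ == "__main__":
    readings = read_hwmon_temps()
    if not readings:
        print("no hwmon temperature sensors found")
    for name, celsius in sorted(readings.items()):
        print(name, celsius)
```

An empty result would suggest that no sensor driver is loaded, in which case the absence of overheating messages in the log says very little.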
Hi John,
I agree that Realtek is far from something we could call a good product. But I have a similar setup (with a different motherboard) working flawlessly.
All your arguments are valid and worth investigating.
The local LAN (eth0) is 100 Mbps. Both eth1 and eth2 are Realtek (same model/chipset).
I'll have a look at the BIOS settings, remove the VGA mode from GRUB, and make sure lm_sensors is installed.
Besides this, I am not sure what else I could do. The main question is: am I dealing with a faulty component (motherboard, CPU, memory, NIC) or some OS/software bug?
On Tue, Dec 28, 2010 at 1:59 AM, John R Pierce pierce@hogranch.com wrote:
On Tue, Dec 28, 2010 at 6:37 AM, robert mena robert.mena@gmail.com wrote:
Hi John, I agree that Realtek is far from something we could call a good product. But I have a similar setup (with a different motherboard) working flawlessly.
Similar isn't the same. I will not buy a board with a Realtek NIC anymore. I have had problems with them in the past with CentOS. It isn't worth the time and effort to debug, given the few bucks extra it costs for a board with an Intel NIC. It sounds like you have a bunch of Realtek PCI NICs in your system, though. If you only need 1 Mbps, why not grab some Intel 10/100 NICs? You can get these real cheap off eBay.
Ryan
----- Original Message ----
From: John R Pierce pierce@hogranch.com To: centos@centos.org Sent: Tue, December 28, 2010 2:59:09 AM Subject: Re: [CentOS] Problems with motherboard support? INTEL DP43BF
On 12/27/10 9:09 PM, robert mena wrote:
Regular realtek fast ethernet.
IMNSHO, Realtek NICs are pretty close to junk grade. There are far too many variations with far too many weird bugs when they are used for anything more than single-user desktop systems.
rl NICs are toy NICs. I wouldn't use them on production servers unless I had no choice.
For some of the reasons, see this comment, quoted verbatim from if_rl.c in FreeBSD 5.4:
/*
 * The RealTek 8139 PCI NIC redefines the meaning of 'low end.' This is
 * probably the worst PCI ethernet controller ever made, with the possible
 * exception of the FEAST chip made by SMC. The 8139 supports bus-master
 * DMA, but it has a terrible interface that nullifies any performance
 * gains that bus-master DMA usually offers.
 *
 * For transmission, the chip offers a series of four TX descriptor
 * registers. Each transmit frame must be in a contiguous buffer, aligned
 * on a longword (32-bit) boundary. This means we almost always have to
 * do mbuf copies in order to transmit a frame, except in the unlikely
 * case where a) the packet fits into a single mbuf, and b) the packet
 * is 32-bit aligned within the mbuf's data area. The presence of only
 * four descriptor registers means that we can never have more than four
 * packets queued for transmission at any one time.
 *
 * Reception is not much better. The driver has to allocate a single large
 * buffer area (up to 64K in size) into which the chip will DMA received
 * frames. Because we don't know where within this region received packets
 * will begin or end, we have no choice but to copy data from the buffer
 * area into mbufs in order to pass the packets up to the higher protocol
 * levels.
 *
Sadly, things haven't improved since then.
Fer
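The transmit constraints described in the driver comment quoted above can be illustrated with a toy model in Python. This is not driver code: the class and function names are invented, and the model only encodes the two rules the comment states, namely that a frame avoids a copy only if it sits in a single buffer starting on a 32-bit boundary, and that at most four frames can be queued at once.

```python
TX_SLOTS = 4  # the comment says the chip has four TX descriptor registers


def needs_copy(buffer_count, data_address):
    """True unless the frame is in one buffer that starts on a 32-bit boundary."""
    single_buffer = buffer_count == 1
    aligned = data_address % 4 == 0
    return not (single_buffer and aligned)


class ToyTxQueue:
    """Hypothetical model of a transmit queue limited to TX_SLOTS frames in flight."""

    def __init__(self):
        self.in_flight = 0

    def try_send(self):
        """Queue one frame; return False if all descriptors are busy."""
        if self.in_flight >= TX_SLOTS:
            return False
        self.in_flight += 1
        return True

    def complete(self):
        """Mark one queued frame as transmitted, freeing its descriptor."""
        if self.in_flight > 0:
            self.in_flight -= 1


if __name__ == "__main__":
    queue = ToyTxQueue()
    print([queue.try_send() for _ in range(5)])
    print(needs_copy(1, 0x1000), needs_copy(1, 0x1002), needs_copy(2, 0x1000))
```

In this model the fifth back-to-back send is refused until a descriptor frees up, and only the single-buffer, aligned frame avoids the extra copy.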