network connectivity lost after reboot/upgrade

Kai Schaetzl

4 Mar 2013

6:15 p.m.

I upgraded one of my old machines running 5.x to the latest kernel (from 308.24.1 to 348.1.1). After rebooting, network connectivity was gone. I rebooted with the old kernel and also tried the one before it (308.20.1), still no luck. So I assume it has nothing to do with the kernel or even CentOS. But a hardware failure also seems unlikely; see below.

ethtool shows the link as up, and as down if I remove the cable. I attached a laptop via crossover cable; it detects the link, but has the same problem. I disabled iptables and set SELinux to disabled. No change. There's a Xen VM running on that machine and I can ping it from the host, so internal networking seems to be OK. I'm using bridged networking for Xen connectivity, set up the normal Red Hat way, not via Xen. Never had a problem. There are no errors in the logs, except for dhcpd reporting that the network is down, and named is also giving some weird errors. This is my only dhcpd, so I would like to have it up ASAP :-(
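For reference, the kernel exposes the link state that ethtool reports under sysfs. A minimal Python 3 sketch, assuming the standard Linux `/sys/class/net/<iface>/` layout; the function name `link_state` is illustrative, not an existing tool:

```python
from pathlib import Path


def link_state(iface: str, sysfs: str = "/sys/class/net") -> dict:
    """Return the kernel's view of an interface's link (roughly what
    ethtool's 'Link detected' line reports). Assumes the standard sysfs layout."""
    base = Path(sysfs) / iface
    operstate = (base / "operstate").read_text().strip()
    try:
        # 'carrier' reads "1" when the PHY sees a link partner, "0" otherwise.
        carrier = (base / "carrier").read_text().strip() == "1"
    except OSError:
        # Reading 'carrier' fails (EINVAL) while the interface is admin-down.
        carrier = False
    return {"iface": iface, "operstate": operstate, "carrier": carrier}
```

On the host, `link_state("eth0")` would return something like `{"iface": "eth0", "operstate": "up", "carrier": True}`. Note that carrier only says the PHY sees a link; as this thread shows, that alone doesn't prove frames make it onto the wire.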

Is there anything else besides a weird hardware failure that I could check? I'm going to get a new card tomorrow and see if that changes the situation. This is onboard mobo networking based on the nForce MCP61.

Has anyone seen a hardware failure like this, where the link comes up but no packets go over the wire? It seems a bit unlikely that this hardware failure (and nothing else) should happen on a reboot after an upgrade.

Thanks.

Kai

zGreenfelder

4 Mar 2013

8:22 p.m.

On Mon, Mar 4, 2013 at 1:15 PM, Kai Schaetzl maillists@conactive.com wrote:

...

I've seen similarly weird things when running VMs behind some smart switches, where (and I'm not a networking guy here, so my terminology will get fuzzy) something was set to disable ports (port fast, maybe?) if multiple MACs were seen on a port. On machines other than my desktop, I can normally get that fixed by having a trunk port and a default VLAN assigned to my port(s). Not sure if that could be applied to your situation.

-- Even the Magic 8 ball has an opinion on email clients: Outlook not so good.

Kai Schaetzl

9:42 p.m.

Thanks for the tip, but unfortunately that can't be the case here. Networking of the host is also affected, even when Xen is shut off. I have no smart switches in this office, and I ruled out switches by using a direct connection to the laptop.

Kai

James Hogarth

9:54 p.m.

...

So it's something unrelated to Xen...

Is the host using a static address or dhcp?

If you tcpdump, do you see all the packets you'd expect for layer 2 connectivity (i.e. ARP requests and responses)?

Does ss or ifconfig show any transmit or receive errors? Do packet counts go up?
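One way to answer the counter questions without eyeballing ifconfig twice is to snapshot `/proc/net/dev` before and after a test (a ping, say) and diff the numbers. A Python 3 sketch, assuming the usual Linux `/proc/net/dev` layout (two header lines, then 8 receive and 8 transmit counters per interface); the helper names are hypothetical:

```python
RX_FIELDS = ("bytes", "packets", "errs", "drop", "fifo", "frame", "compressed", "multicast")
TX_FIELDS = ("bytes", "packets", "errs", "drop", "fifo", "colls", "carrier", "compressed")


def parse_net_dev(text: str) -> dict:
    """Parse /proc/net/dev into {iface: {"rx": {...}, "tx": {...}}}."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        values = [int(v) for v in data.split()]
        stats[name.strip()] = {
            "rx": dict(zip(RX_FIELDS, values[:8])),
            "tx": dict(zip(TX_FIELDS, values[8:16])),
        }
    return stats


def counter_delta(before: dict, after: dict, iface: str) -> dict:
    """How much each counter of `iface` moved between two snapshots."""
    b, a = before[iface], after[iface]
    return {d: {k: a[d][k] - b[d][k] for k in a[d]} for d in ("rx", "tx")}
```

Usage: take `before = parse_net_dev(open("/proc/net/dev").read())`, run the ping, take `after` the same way, then inspect `counter_delta(before, after, "eth0")` for moving packet counts and any growth in the `errs`, `drop` or `carrier` fields.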

Given that ethtool states the link is up, I'd statically configure an address and try to ping the gateway while running tcpdump... Then save the packet dump (-w filename) and take a look at it in Wireshark... You should see a 'who-has <gateway IP>' ARP request and the reply from the gateway, along with the ICMP echo-request and echo-reply packets...
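If Wireshark isn't at hand, the saved dump can also be summarized with a short script that counts exactly those four packet types. A Python 3 sketch, assuming tcpdump wrote a classic libpcap file (not pcapng) with Ethernet framing and no VLAN tags; the function names are illustrative:

```python
import struct
from collections import Counter


def read_pcap(data: bytes):
    """Yield raw frames from a classic libpcap file (pcapng is not handled)."""
    magic = data[:4]
    if magic in (b"\xd4\xc3\xb2\xa1", b"\x4d\x3c\xb2\xa1"):    # little-endian (usec / nsec)
        endian = "<"
    elif magic in (b"\xa1\xb2\xc3\xd4", b"\xa1\xb2\x3c\x4d"):  # big-endian (usec / nsec)
        endian = ">"
    else:
        raise ValueError("not a classic pcap file")
    (linktype,) = struct.unpack(endian + "I", data[20:24])
    if linktype != 1:  # 1 = LINKTYPE_ETHERNET
        raise ValueError(f"expected Ethernet link type 1, got {linktype}")
    offset = 24  # size of the global header
    while offset + 16 <= len(data):
        # Per-record header: ts_sec, ts_usec, captured length, original length.
        _sec, _usec, incl_len, _orig_len = struct.unpack(endian + "IIII", data[offset:offset + 16])
        offset += 16
        yield data[offset:offset + incl_len]
        offset += incl_len


def classify(frame: bytes) -> str:
    """Label an untagged Ethernet frame as ARP request/reply, ICMP echo, or other."""
    if len(frame) < 14:
        return "short"
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype == 0x0806 and len(frame) >= 22:
        # The ARP opcode sits 6 bytes into the ARP header.
        (op,) = struct.unpack("!H", frame[20:22])
        return {1: "arp-request", 2: "arp-reply"}.get(op, "arp-other")
    if ethertype == 0x0800 and len(frame) >= 34:
        ihl = (frame[14] & 0x0F) * 4  # IPv4 header length in bytes
        if frame[23] == 1 and len(frame) > 14 + ihl:  # IP protocol 1 = ICMP
            icmp_type = frame[14 + ihl]
            return {8: "icmp-echo-request", 0: "icmp-echo-reply"}.get(icmp_type, "icmp-other")
        return "ipv4-other"
    return "other"


def summarize(data: bytes) -> Counter:
    return Counter(classify(frame) for frame in read_pcap(data))
```

With a capture saved as, for example, `dump.pcap` (a hypothetical name), `summarize(open("dump.pcap", "rb").read())` gives the counts; plenty of requests with no replies points at the same symptom described in this thread.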

...

From there you can start diagnosis properly...

Gordon Messmer

11:29 p.m.

On 03/04/2013 01:54 PM, James Hogarth wrote:

...

If you tcpdump do you see all the packets you'd expect for layer 2 connectivity (ie ARP requests and responses?)

Specifically, use tcpdump on your bridged interface: tcpdump -nn -i br0

Check your bridge details and make sure that the ethernet device is listed: brctl show

If those look good, send the content of /etc/sysconfig/network-scripts/ifcfg-{br0,eth0} (or whatever eth device is a member of the bridge).
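The interface list that `brctl show` prints can also be read straight from sysfs, where each bridge port appears as an entry under `<bridge>/brif/`. A Python 3 sketch assuming that layout; the function name is illustrative:

```python
import os


def bridge_members(bridge: str, sysfs: str = "/sys/class/net") -> list:
    """Return the ports attached to a Linux bridge, i.e. the names brctl show
    lists under 'interfaces'. Each port is an entry in <bridge>/brif/."""
    brif = os.path.join(sysfs, bridge, "brif")
    if not os.path.isdir(brif):
        raise ValueError(f"{bridge} does not look like a bridge (no {brif})")
    return sorted(os.listdir(brif))
```

On a healthy setup like the one described, `"eth0" in bridge_members("br0")` would be expected to be True.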

Kai Schaetzl

5 Mar 2013

1:02 a.m.

Gordon Messmer wrote on Mon, 04 Mar 2013 15:29:58 -0800:

...

This is all fine; it's been this way for years and looks as it always has. No errors, collisions, or anything else anywhere. TX and RX are about the same. Just to prove that the config is fine, I removed the bridge and brought up a plain eth0. It has the same problem. I've never seen such a problem before.

The tcpdump shows a lot of ARP requests: who-has <IP> tell <IP>. As I understand it, these are requests for MAC addresses, and "tell" gives the asking IP number? In that case there is at least *some* outside connectivity. Most of the requests are from the local IP and the IP of the VM, but a few are from other machines on the network, including the outbound router. The VM runs a monitoring system, and these are the clients that want to call in. There are also a few UDP packets (port 1900 and NBT), and that's all. There are a few replies to the ARP requests, but it's mostly requests. That makes sense if the ARP cache doesn't hold much. arp -a lists two machines with missing MAC data, that's all.
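The entries that `arp -a` shows without a MAC address come from the kernel ARP table in `/proc/net/arp`, where the Flags column is a hex bitmask and 0x2 (ATF_COM) marks a completed entry. A Python 3 sketch that lists the unresolved ones, assuming that file's usual six-column layout; the function name is hypothetical:

```python
def incomplete_arp_entries(text: str) -> list:
    """Return (ip, device) pairs from /proc/net/arp whose MAC was never resolved."""
    entries = []
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue
        ip, _hw_type, flags, hw_addr, _mask, device = fields[:6]
        # Without ATF_COM (0x2) the entry is incomplete; such rows show an all-zero MAC.
        if not int(flags, 16) & 0x2 or hw_addr == "00:00:00:00:00:00":
            entries.append((ip, device))
    return entries
```

Feeding it `open("/proc/net/arp").read()` should list the same hosts that `arp -a` prints as incomplete.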

Kai

Robert

1:18 a.m.

On Tue, 05 Mar 2013 02:02:54 +0100 Kai Schaetzl maillists@conactive.com wrote:

...

Things I would look at:

1. route, to ensure that the routing table is correct.
2. ifcfg-<eth0>, to see if there are any MAC addresses listed; if so, ensure they match the MAC address in the ifconfig output.
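Both checks can be scripted against the kernel's own tables. A Python 3 sketch, assuming the Linux `/proc/net/route` format (hex addresses in host byte order; RTF_UP = 0x1, RTF_GATEWAY = 0x2) and a simple KEY=value ifcfg file; the live MAC would come from `/sys/class/net/eth0/address`. Helper names are hypothetical:

```python
import socket
import struct


def default_gateways(route_text: str) -> list:
    """Return (iface, gateway_ip) for each default route in /proc/net/route."""
    gateways = []
    for line in route_text.splitlines()[1:]:  # skip the header row
        f = line.split()
        if len(f) < 8:
            continue
        iface, dest, gw, flags, mask = f[0], f[1], f[2], int(f[3], 16), f[7]
        # Default route: destination and mask 0, flags RTF_UP | RTF_GATEWAY.
        if dest == "00000000" and mask == "00000000" and (flags & 0x3) == 0x3:
            # The kernel prints the address in host byte order, so pack natively.
            gateways.append((iface, socket.inet_ntoa(struct.pack("=I", int(gw, 16)))))
    return gateways


def parse_ifcfg(text: str) -> dict:
    """Parse simple KEY=value lines, stripping optional quotes."""
    cfg = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        cfg[key.strip()] = value.strip().strip("\"'")
    return cfg


def hwaddr_matches(ifcfg_text: str, actual_mac: str):
    """None if no HWADDR is pinned, else whether it matches the live MAC."""
    hwaddr = parse_ifcfg(ifcfg_text).get("HWADDR")
    if hwaddr is None:
        return None
    return hwaddr.lower() == actual_mac.strip().lower()
```

A False from `hwaddr_matches` would mean the config is pinned to a different card than the one the kernel sees.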

-- Regards Robert

Linux The adventure of a lifetime.

Linux User #296285 Get Counted http://linuxcounter.net/

Gordon Messmer

3:55 a.m.

On 03/04/2013 05:02 PM, Kai Schaetzl wrote:

...

The tcpdump shows a lot of arp requests who-has <IP> tell <IP> As I understand these are requests for MAC addresses? And tell is the asking IP number?

The arp request will have both the source IP address and the Ethernet address of the requesting host. tcpdump will only print the IP unless you use the -e flag.
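To make the field layout concrete: per RFC 826, an Ethernet/IPv4 ARP packet carries sender MAC, sender IP, target MAC and target IP, and tcpdump's "who-has X tell Y" prints target IP X and sender IP Y. A Python 3 sketch that decodes those fields from the bytes following the 14-byte Ethernet header (Ethernet/IPv4 ARP only; names are illustrative):

```python
import ipaddress
import struct


def _mac(raw: bytes) -> str:
    return ":".join(f"{b:02x}" for b in raw)


def decode_arp(payload: bytes) -> dict:
    """Decode an Ethernet/IPv4 ARP packet (the bytes after the Ethernet header)."""
    htype, ptype, hlen, plen, oper = struct.unpack("!HHBBH", payload[:8])
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        raise ValueError("this sketch only handles Ethernet/IPv4 ARP")
    sha, spa, tha, tpa = struct.unpack("!6s4s6s4s", payload[8:28])
    return {
        "op": {1: "request", 2: "reply"}.get(oper, oper),
        "sender_mac": _mac(sha),                       # hardware address of the asker
        "sender_ip": str(ipaddress.IPv4Address(spa)),  # the "tell" address
        "target_mac": _mac(tha),                       # usually zeros in a request
        "target_ip": str(ipaddress.IPv4Address(tpa)),  # the "who-has" address
    }
```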

If the layout of your network is such a closely guarded secret that you can't share the information that we need to help, you're mostly on your own here.

At this point, the problem could be almost anything. A bad switch port, a bad switch, or a bad cable seems very likely. Try a new cable to a new switch port, and reboot the switch if the problem continues. Try a full power-down (as in, remove the power cable) of the affected system and of the switch. It sounds like your system is receiving packets but unable to send them to other hosts.

From any other host on the network, you should be able to run tcpdump -nn -e ether host <mac>, where <mac> is the Ethernet address of the system with no connectivity. If you try to ping any address at all, the other system should see it broadcasting ARP requests for the local destination or the default gateway. If you don't see ARP requests on the other host, then you know that the affected system isn't able to send out traffic.
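The check on the other host can be automated by scanning the `tcpdump -nn -e` text output for ARP requests whose Ethernet source is the affected system. A Python 3 sketch; the regular expression assumes lines shaped like tcpdump 4.x output (shown in the code comment), so other tcpdump versions may need a tweak:

```python
import re

# Assumed "tcpdump -nn -e" line layout, e.g.:
# 12:34:56.789012 00:11:22:33:44:55 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.168.1.1 tell 192.168.1.10, length 28
MAC = r"[0-9a-f]{2}(?::[0-9a-f]{2}){5}"
ARP_LINE = re.compile(
    rf"(?P<src>{MAC}) > (?P<dst>{MAC}),.*?who-has (?P<target>\S+) tell (?P<sender>[^,\s]+)",
    re.IGNORECASE,
)


def arp_requests_from(lines, mac: str) -> list:
    """Return (sender_ip, target_ip) for every ARP request whose Ethernet source is `mac`."""
    mac = mac.lower()
    hits = []
    for line in lines:
        m = ARP_LINE.search(line)
        if m and m.group("src").lower() == mac:
            hits.append((m.group("sender"), m.group("target")))
    return hits
```

An empty result while the affected host is pinging is the "isn't able to send out traffic" case described above.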

Kai Schaetzl

2:22 p.m.

Kai Schaetzl wrote on Mon, 04 Mar 2013 19:15:46 +0100:

...

Has anyone seen such a hardware failure where the link goes up but no packets go over the wire? It seems a bit unlikely that this hardware failure (and nothing else) should happen on a reboot after an upgrade.

It was indeed a weird hardware failure. Everything works fine with the onboard LAN disabled and a cheap PCI network card.

Kai

SilverTip257

5:28 p.m.

On Tue, Mar 5, 2013 at 9:22 AM, Kai Schaetzl maillists@conactive.com wrote:

...

That's a suitable workaround for getting a system operational again, but in the end it's nothing more than a workaround, not a true solution. :-/

But it would have been helpful if you had shared more information (think NIC model, NIC chipset, kernel module in use for that chipset).
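For the record, the kernel module and PCI IDs asked about here can be read from sysfs. A Python 3 sketch, assuming the standard layout where `/sys/class/net/<iface>/device/driver` is a symlink to the bound driver and `device/vendor` and `device/device` hold the PCI IDs; function names are illustrative:

```python
import os


def nic_driver(iface: str, sysfs: str = "/sys/class/net"):
    """Name of the kernel driver bound to `iface`, or None for virtual devices."""
    link = os.path.join(sysfs, iface, "device", "driver")
    if not os.path.islink(link):
        return None  # bridges and other virtual interfaces have no backing device
    # The symlink points at .../bus/<bus>/drivers/<name>; the last part is the driver.
    return os.path.basename(os.readlink(link))


def pci_ids(iface: str, sysfs: str = "/sys/class/net") -> tuple:
    """(vendor, device) IDs as sysfs prints them, e.g. ('0x1234', '0x5678')."""
    dev = os.path.join(sysfs, iface, "device")
    ids = []
    for name in ("vendor", "device"):
        with open(os.path.join(dev, name)) as f:
            ids.append(f.read().strip())
    return tuple(ids)
```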

...

Kai

CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos

-- ---~~.~~--- Mike // SilverTip257 //

Kai Schaetzl

5:38 p.m.

SilverTip257 wrote on Tue, 5 Mar 2013 12:28:29 -0500:

...

But it would have been helpful if you had shared more information (think NIC model, NIC chipset, kernel module in use for that chipset).

Why? It's quite clear that this is a hardware failure. Before buying the new card I tested a live CD and PXE booting on it, with the same problem. I also tested the system disk in another machine, and it was fine. It has nothing to do with the system, even though it happened right after the update/reboot.

So, short of replacing the mobo, it *is* the solution. The mobo might go haywire next as well, but currently it's absolutely stable. And I have a backup now in case it wants to go ...

Kai

discuss@lists.centos.org

10 comments

6 participants

tags (0)

participants (6)

Gordon Messmer
James Hogarth
Kai Schaetzl
Robert
SilverTip257
zGreenfelder