[CentOS] Loss of Ethernet adaptor

On 06/09/2014 11:34 AM, James B. Byrne wrote:
> On Fri, June 6, 2014 09:58, Alexander Dalloz wrote:
>> Am 06.06.2014 14:50, schrieb James B. Byrne:
>>> At ~07:40 (UTC-4:00) this morning our gateway host lost its WAN Ethernet
>>> adaptor.  Subsequent to recovery, which required a reboot, the following
>> [ ... ]
>>
>>> lspci -tv                  # provides this device tree
>>>
>>> -[0000:00]-+-00.0  Intel Corporation Atom Processor D4xx/D5xx/N4xx/N5xx DMI Bridge
>>> . . .
>>>              +-1c.0-[01]--
>>>              +-1c.4-[02]----00.0  Intel Corporation 82574L Gigabit Network Connection
>>>              +-1c.5-[03]----00.0  Intel Corporation 82574L Gigabit Network Connection
>>> . . .
>>>
>>>
>>>
>>> lspci -v -nn -k -qq -D     # provides this information:
>>>
>>> . . .
>>> 0000:02:00.0 Ethernet controller [0200]: Intel Corporation 82574L Gigabit Network Connection [8086:10d3]
>>> 	Subsystem: Super Micro Computer Inc Device [15d9:10d3]
>>> 	Physical Slot: 0-1
>>> 	Flags: bus master, fast devsel, latency 0, IRQ 16
>>> 	Memory at fe9e0000 (32-bit, non-prefetchable) [size=128K]
>>> 	I/O ports at dc00 [size=32]
>>> 	Memory at fe9dc000 (32-bit, non-prefetchable) [size=16K]
>>> 	Capabilities: [c8] Power Management version 2
>>> 	Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
>>> 	Capabilities: [e0] Express Endpoint, MSI 00
>>> 	Capabilities: [a0] MSI-X: Enable+ Count=5 Masked-
>>> 	Capabilities: [100] Advanced Error Reporting
>>> 	Capabilities: [140] Device Serial Number 00-25-90-ff-ff-61-74-c0
>>> 	Kernel driver in use: e1000e
>>> 	Kernel modules: e1000e
>>> . . .
>>>
>>> I have never run into this before.  Can anyone cast any light on what might
>>> be going on?  Is this an incipient hardware failure with one of the on-board
>>> PCI Ethernet adaptors?  Is there any relationship with the syn flood that was
>>> blacklisted immediately before the failure?  I do not think so but I need to
>>> ask.
>>>
>>> Thanks,
>> https://isc.sans.edu/forums/diary/Intel+Network+Card+82574L+Packet+of+Death/15109
>>
>> http://www.itwalker3.com/2013/02/packet-of-death-attack-a-deadly-dos-against-intel-nics/
>>
>> Worth verifying in your case.
>>
>> Alexander
>>
>>
>>
>
> Re: Packet of Death attack: a deadly DoS against Intel NICs
>
> It appears that my problem is caused by something else, as the EEPROM
> fingerprint matches the 'good' version (mostly).
>
> ethtool -e eth0
> . . .
> 0x0010:01 01 ff ff 6b 02 d3 10 d9 15 d3 10 ff ff 58 a5
> . . .
> 0x0030:c9 6c 50 31 3e 07 0b 46 84 2d 40 01 00 f0 06 07
> . . .
>
>
> However, this matches neither the known 'bad' nor the reputed 'good' EEPROM image:
>
> 0x0060:00 01 ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>
> But it seems a lot closer to the 'bad':
>
> 0x0060:ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>
> than to the 'good':
>
> 0x0060:20 01 00 40 16 13 ff ff ff ff ff ff ff ff ff ff
>
>
> I cannot find the file pod-icmp-ping.pcap so I cannot try out the recommended
> test using tcpreplay.  The original Google code page reference is now gone.
>
> However, ping -p 32 -s 1110 192.168.99.1 against the on-board NICs
> does not shut them down. I infer (assuming there is no great delay between
> sending the packet of death and its effects becoming manifest) that the
> POD was not the cause of our recent difficulty.
>
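The manual comparison of the 0x0060 line above could be scripted roughly as follows. This is only a sketch: the function name classify_eeprom_0x60 is hypothetical, it assumes the dump has a line starting with "0x0060:" followed by sixteen hex bytes as in the excerpts quoted here, and the two patterns are just the 'bad' and 'good' ones cited in this thread, not an authoritative list.

```shell
# Classify the bytes at EEPROM offset 0x0060 from `ethtool -e` style text
# read on stdin. The patterns are the two quoted in this thread only.
classify_eeprom_0x60() {
    # Take the first 0x0060 line, drop the offset label, and collapse
    # whitespace so tabs or extra spaces do not affect the comparison.
    bytes=$(awk 'tolower($0) ~ /^0x0060/ {
        sub(/^[^:]*:/, ""); $1 = $1; print tolower($0); exit
    }')
    case "$bytes" in
        "ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff") echo bad ;;
        "20 01 00 40 16 13 ff ff ff ff ff ff ff ff ff ff") echo good ;;
        "") echo missing ;;   # no 0x0060 line found in the input
        *) echo unknown ;;    # matches neither quoted pattern
    esac
}

# Possible use on a live interface (needs root; eth0 is an example name):
#   ethtool -e eth0 | classify_eeprom_0x60
```

Fed the 0x0060 line reported above (00 01 ff ff ...), it prints "unknown", which is consistent with the observation that the image matches neither pattern.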
Hi,

I don't know if you saw my prior email, but we experienced this exact same problem; see the log excerpts below:
...
Jul 31 17:05:18 wolfpac kernel: pciehp 0000:00:1c.5:pcie04: Card not present on Slot(37)
Jul 31 17:05:18 wolfpac kernel: pciehp 0000:00:1c.5:pcie04: Card present on Slot(37)
Jul 31 17:05:18 wolfpac kernel: device eth5 left promiscuous mode
Jul 31 17:05:19 wolfpac kernel: e1000e 0000:07:00.0: PCI INT A disabled
Jul 31 17:05:20 wolfpac ntpd[2726]: Deleting interface #7 eth5, 192.168.198.95#123, interface stats: received=517, sent=522, dropped=0, active_time=108106 secs
Jul 31 17:05:20 wolfpac ntpd[2726]: Deleting interface #8 eth5, fe80::290:bff:fe2a:acf3#123, interface stats: received=0, sent=0, dropped=0, active_time=108039 secs
...

This would happen at random on systems that weren't connected directly to the Internet,
and we saw it on multiple systems. Since we upgraded to the latest ELRepo driver and added
pcie_aspm=off to our kernel command line, we have not experienced the issue again.
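
For anyone wanting to confirm that the parameter is actually in effect after a reboot, a check along these lines may help. It is a sketch only: the function name has_aspm_off is hypothetical, and it merely looks for the literal word pcie_aspm=off in the command line it is given (or in /proc/cmdline when called without an argument).

```shell
# Succeed (exit 0) if the kernel command line contains pcie_aspm=off.
# Pass a command line as the first argument; with no argument the
# running kernel's /proc/cmdline is read instead.
has_aspm_off() {
    cmdline=${1-$(cat /proc/cmdline)}
    # Pad with spaces so only a whole word matches.
    case " $cmdline " in
        *" pcie_aspm=off "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible use on the running system:
#   has_aspm_off && echo "ASPM disabled on the kernel command line"
```

On CentOS, one way to add the parameter to the installed kernels is grubby --update-kernel=ALL --args="pcie_aspm=off"; editing the kernel line in the GRUB configuration by hand works as well.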

-- 
Stephen Clark
*NetWolves Managed Services, LLC.*
Director of Technology
Phone: 813-579-3200
Fax: 813-882-0209
Email: steve.clark at netwolves.com
http://www.netwolves.com