[CentOS] Loss of Ethernet adaptor

Wed Oct 15 15:41:05 UTC 2014
James B. Byrne <byrnejb at harte-lyne.ca>

This is a return to an issue I first raised back in June. We had a similar
occurrence in September while I was away and so I am revisiting the entire
matter.

Steve Clark on 6 Jun 16:02 2014 wrote:
> Hi,
>
> We ran into this problem also - the interface would disappear.
> There is newer e1000e driver that fixes it or you could
> add pcie_aspm=off to your kernel command line.
>
> HTH,
> Steve

I have run into other reports of similar occurrences and some of these refer
to this bug report:  https://bugzilla.redhat.com/show_bug.cgi?id=632650

However, that report is closed as a duplicate of:
https://bugzilla.redhat.com/show_bug.cgi?id=562273

which is not available for viewing by the great unwashed.

Nonetheless, following the discussion thread in the bug report that I can view,
it appears that this issue was supposedly resolved sometime in late 2012.
From what I can gather, the fix was to disable ASPM L1 for this model of adaptor
in the e1000e driver module.

* Upstream commit d4a4206ebbaf48b55803a7eb34e330530d83a889 - e1000e: Disable
ASPM L1 on 82574
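
One way to check whether the running kernel carries that change, sketched here
under assumptions: search the installed kernel package's changelog for entries
that mention both e1000e and ASPM. The function name find_aspm_entries is made
up for illustration, and the wording of any backport entry in the Red Hat
changelog is not known in advance, so an empty result does not prove the fix
is absent.

```shell
# Hypothetical helper: print changelog lines that mention both e1000e and ASPM.
find_aspm_entries() {
    grep -i -E 'e1000e.*aspm|aspm.*e1000e'
}

# Intended use on the affected host (not run here):
#   rpm -q --changelog "kernel-$(uname -r)" | find_aspm_entries
#   modinfo -F version e1000e    # version string of the e1000e module on disk
```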

However, when I run lspci -vvv on the host that exhibited the problem I see this:

. . .
03:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
        Subsystem: Super Micro Computer Inc Device 10d3
        Physical Slot: 0-2
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 17
        Region 0: Memory at feae0000 (32-bit, non-prefetchable) [size=128K]
        Region 2: I/O ports at ec00 [size=32]
        Region 3: Memory at feadc000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [c8] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [e0] Express (v1) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <128ns, L1 <64us
                        ClockPM- Surprise- LLActRep- BwNot-


############
                LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+

############


                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        Capabilities: [a0] MSI-X: Enable+ Count=5 Masked-
                Vector table: BAR=3 offset=00000000
                PBA: BAR=3 offset=00002000
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr+ BadTLP+ BadDLLP+ Rollover- Timeout+ NonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Capabilities: [140 v1] Device Serial Number 00-25-90-ff-ff-61-74-c1
        Kernel driver in use: e1000e
        Kernel modules: e1000e
. . .
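
To pull just the ASPM setting out of output like the above, a small filter can
be applied to the LnkCtl line. This is only a sketch: the function name
aspm_state is an assumption for illustration, and it expects lspci's own
unwrapped output, where LnkCtl and its ASPM field sit on one line.

```shell
# Hypothetical helper: read lspci -vv output on stdin and print the ASPM
# field of each LnkCtl line, e.g. "L0s L1 Enabled" or "Disabled".
aspm_state() {
    sed -n 's/^[[:space:]]*LnkCtl:[[:space:]]*ASPM \([^;]*\);.*/\1/p'
}

# Intended use on the affected host, as root so the capabilities are shown:
#   lspci -vv -s 03:00.0 | aspm_state
```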


lsmod
. . .
e1000e                267701  0
. . .

The host is running CentOS-6.5 with all updates applied to date.  My question
is: has this issue been addressed in the official e1000e module or not?  If
not, does the recommendation to "add pcie_aspm=off to your kernel command
line" still hold?
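
If the boot parameter is tried, it helps to confirm that it actually reached
the running kernel. A minimal sketch, assuming the parameter is spelled exactly
pcie_aspm=off; the function name has_aspm_off is made up for illustration. On
CentOS 6 the parameter can be added with grubby or by editing the kernel line
in /boot/grub/grub.conf, followed by a reboot.

```shell
# Hypothetical helper: succeed if the given kernel command line contains
# pcie_aspm=off as a whole space-separated word.
has_aspm_off() {
    case " $1 " in
        *" pcie_aspm=off "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Intended use after rebooting (not run here):
#   has_aspm_off "$(cat /proc/cmdline)" && echo "pcie_aspm=off is active"
# One way to add the parameter to all installed kernels, as root:
#   grubby --update-kernel=ALL --args="pcie_aspm=off"
```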


-- 
***          E-Mail is NOT a SECURE channel          ***
James B. Byrne                mailto:ByrneJB at Harte-Lyne.ca
Harte & Lyne Limited          http://www.harte-lyne.ca
9 Brockley Drive              vox: +1 905 561 1241
Hamilton, Ontario             fax: +1 905 561 0757
Canada  L8E 3C3