  • Cause of an Intel 82599 NIC hang

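    Background:

    In production, a server's network link dropped suddenly and ssh connections failed.

    Initial diagnosis:

    Searching the kernel log turned up the following NIC error messages: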

    Jan 24 11:52:43 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 14 not cleared within the polling period

    Jan 24 11:52:43 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 15 not cleared within the polling period

    Jan 24 11:52:43 localhost kernel: bonding: bond5: link status definitely down for interface eth0, disabling it

    Jan 24 11:52:43 localhost kernel: ixgbe 0000:84:00.0: eth0: detected SFP+: 5

    Jan 24 11:52:43 localhost kernel: ixgbe 0000:84:00.0: eth0: NIC Link is Up 10 Gbps, Flow Control: RX/TX

    Jan 24 11:52:43 localhost kernel: bond5: link status definitely up for interface eth0, 10000 Mbps full duplex.

    Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: Detected Tx Unit Hang

    Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: tx_buffer_info[next_to_clean]

    Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: tx hang 448 detected on queue 6, resetting adapter

    Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: Reset adapter

    Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 0 not cleared within the polling period

    Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 1 not cleared within the polling period

    Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 2 not cleared within the polling period

    Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 3 not cleared within the polling period
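When the log is long, it can help to tally these messages instead of reading them one by one. The following is a minimal Python sketch, not part of the original investigation: the function name `summarize_ixgbe_events` and the category names are made up for illustration, and the regular expressions are derived only from the message texts shown above.

```python
import re
from collections import Counter

# Patterns derived from the messages in the log excerpt above.
# The category names are illustrative, not ixgbe terminology.
PATTERNS = {
    "rx_disable_timeout": re.compile(r"RXDCTL\.ENABLE on Rx queue (\d+) not cleared"),
    "tx_hang": re.compile(r"tx hang \d+ detected on queue (\d+)"),
    "adapter_reset": re.compile(r"Reset adapter"),
}


def summarize_ixgbe_events(lines):
    """Count ixgbe failure messages per category and collect affected queues."""
    counts = Counter()
    queues = {name: set() for name in PATTERNS}
    for line in lines:
        if "ixgbe" not in line:
            continue  # skip bonding and other non-driver messages
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[name] += 1
                if match.groups():  # only some patterns capture a queue number
                    queues[name].add(int(match.group(1)))
    return counts, queues


# A few lines copied from the excerpt above.
sample = [
    "Jan 24 11:52:43 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 14 not cleared within the polling period",
    "Jan 24 11:52:43 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 15 not cleared within the polling period",
    "Jan 24 11:52:43 localhost kernel: bonding: bond5: link status definitely down for interface eth0, disabling it",
    "Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: tx hang 448 detected on queue 6, resetting adapter",
    "Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: Reset adapter",
]

counts, queues = summarize_ixgbe_events(sample)
print(dict(counts))
print({name: sorted(q) for name, q in queues.items()})
```

In practice the lines would come from the kernel log file or from `dmesg` rather than from a hard-coded list.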

    NIC PCI information:

    # lspci -vvv -s 84:00.0
    84:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
            Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
            Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
            Interrupt: pin A routed to IRQ 16
            Region 0: Memory at f7e20000 (64-bit, non-prefetchable) [disabled] [size=128K]
            Region 2: I/O ports at f020 [disabled] [size=32]
            Region 4: Memory at f7e44000 (64-bit, non-prefetchable) [disabled] [size=16K]
            Capabilities: [40] Power Management version 3
                    Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                    Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
            Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
                    Address: 0000000000000000  Data: 0000
                    Masking: 00000000  Pending: 00000000
            Capabilities: [70] MSI-X: Enable- Count=64 Masked-
                    Vector table: BAR=4 offset=00000000
                    PBA: BAR=4 offset=00002000
            Capabilities: [a0] Express (v2) Endpoint, MSI 00
                    DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                            ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
                    DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                            RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
                            MaxPayload 128 bytes, MaxReadReq 512 bytes
                    DevSta: CorrErr+ UncorrErr+ FatalErr- UnsuppReq+ AuxPwr- TransPend-
                    LnkCap: Port #4, Speed 5GT/s, Width x8, ASPM L0s, Latency L0 <2us, L1 <32us
                            ClockPM- Surprise- LLActRep- BwNot-
                    LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
                            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                    LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                    DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
                    DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
                    LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
                             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                             Compliance De-emphasis: -6dB
                    LnkSta2: Current De-emphasis Level: -6dB
            Capabilities: [e0] Vital Product Data
                    Unknown small resource type 06, will not decode more.
            Capabilities: [100] Advanced Error Reporting
                    UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                    UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                    UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                    CESta:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
                    CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                    AERCap: First Error Pointer: 14, GenCap+ CGenEn- ChkCap+ ChkEn-
            Capabilities: [140] Device Serial Number 98-f5-37-ff-ff-e3-64-73
            Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
                    ARICap: MFVC- ACS-, Next Function: 1
                    ARICtl: MFVC- ACS-, Function Group: 0
            Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
                    IOVCap: Migration-, Interrupt Message Number: 000
                    IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
                    IOVSta: Migration-
                    Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 00
                    VF offset: 384, stride: 2, Device ID: 10ed
                    Supported Page Size: 00000553, System Page Size: 00000001
                    Region 0: Memory at 0000000000000000 (64-bit, prefetchable)
                    Region 3: Memory at 0000000000000000 (64-bit, prefetchable)
                    VF Migration: offset: 00000000, BIR: 0
            Kernel driver in use: ixgbe
            Kernel modules: ixgbe
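Two details in this output are worth pulling out: the `Control:` line shows `Mem-` and `BusMaster-`, and all three regions are marked `[disabled]`. With memory decoding off, the device does not answer memory-mapped register reads, which would be consistent with a register dump that reads all ones. The sketch below extracts those two facts from `lspci -vvv` text; the function name `pci_decode_state` is hypothetical, and the parsing assumes the text layout shown above.

```python
import re


def pci_decode_state(lspci_text):
    """Extract command-register flags and disabled BARs from `lspci -vvv` text."""
    flags = {}
    control = re.search(r"^\s*Control:\s*(.+)$", lspci_text, re.MULTILINE)
    if control:
        for token in control.group(1).split():
            # Each token looks like "Mem+" or "Mem-"; the suffix is the bit state.
            if token[-1] in "+-":
                flags[token[:-1]] = token.endswith("+")
    disabled = re.findall(
        r"^\s*Region (\d+):.*\[disabled\]", lspci_text, re.MULTILINE
    )
    return flags, [int(n) for n in disabled]


# Lines copied from the lspci output above.
sample = """\
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Region 0: Memory at f7e20000 (64-bit, non-prefetchable) [disabled] [size=128K]
        Region 2: I/O ports at f020 [disabled] [size=32]
        Region 4: Memory at f7e44000 (64-bit, non-prefetchable) [disabled] [size=16K]
"""

flags, disabled_regions = pci_decode_state(sample)
print("Mem:", flags["Mem"], "BusMaster:", flags["BusMaster"])
print("disabled regions:", disabled_regions)
```

On a NIC that is up and passing traffic one would normally expect `Mem+` and `BusMaster+`, so this check gives a quick way to tell a device with decoding turned off from one that is merely idle.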

    NIC register information:

    # ethtool -d  eth0
    0x042A4: LINKS (Link Status register)                 0xFFFFFFFF
           Link Status:                                   up
           Link Speed:                                    10G
    0x05080: FCTRL (Filter Control register)              0xFFFFFFFF
           Receive Flow Control Packets:                  enabled
           Receive Priority Flow Control Packets:         enabled
           Discard Pause Frames:                          enabled
           Pass MAC Control Frames:                       enabled
           Broadcast Accept:                              enabled
           Unicast Promiscuous:                           enabled
           Multicast Promiscuous:                         enabled
           Store Bad Packets:                             enabled
    0x05088: VLNCTRL (VLAN Control register)              0xFFFFFFFF
           VLAN Mode:                                     enabled
           VLAN Filter:                                   enabled
    0x02100: SRRCTL0 (Split and Replic Rx Control 0)      0xFFFFFFFF
           Receive Buffer Size:                           16KB
    0x03D00: RMCS (Receive Music Control register)        0xFFFFFFFF
           Transmit Flow Control:                         enabled
           Priority Flow Control:                         enabled
    0x04250: HLREG0 (Highlander Control 0 register)       0xFFFFFFFF
           Transmit CRC:                                  enabled
           Receive CRC Strip:                             enabled
           Jumbo Frames:                                  enabled
           Pad Short Frames:                              enabled
           Loopback:                                      enabled
    0x00000: CTRL        (Device Control)                 0xFFFFFFFF
    0x00008: STATUS      (Device Status)                  0xFFFFFFFF
    0x00018: CTRL_EXT    (Extended Device Control)        0xFFFFFFFF
    0x00020: ESDP        (Extended SDP Control)           0xFFFFFFFF
    0x00028: EODSDP      (Extended OD SDP Control)        0xFFFFFFFF
    0x00200: LEDCTL      (LED Control)                    0xFFFFFFFF

    ........

    0x01010: RDH00       (Receive Descriptor Head 00)     0xFFFFFFFF
    0x01050: RDH01       (Receive Descriptor Head 01)     0xFFFFFFFF
    0x01090: RDH02       (Receive Descriptor Head 02)     0xFFFFFFFF
    0x010D0: RDH03       (Receive Descriptor Head 03)     0xFFFFFFFF
    0x01110: RDH04       (Receive Descriptor Head 04)     0xFFFFFFFF

    ..........

    0x01028: RXDCTL00    (Receive Descriptor Control 00)  0xFFFFFFFF
    0x01068: RXDCTL01    (Receive Descriptor Control 01)  0xFFFFFFFF
    0x010A8: RXDCTL02    (Receive Descriptor Control 02)  0xFFFFFFFF

    ........

    0x06010: TDH00       (Transmit Descriptor Head 00)    0xFFFFFFFF
    0x06050: TDH01       (Transmit Descriptor Head 01)    0xFFFFFFFF
    0x06090: TDH02       (Transmit Descriptor Head 02)    0xFFFFFFFF
    0x060D0: TDH03       (Transmit Descriptor Head 03)    0xFFFFFFFF
    0x06110: TDH04       (Transmit Descriptor Head 04)    0xFFFFFFFF
    0x06150: TDH05       (Transmit Descriptor Head 05)    0xFFFFFFFF
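Every register in this dump reads 0xFFFFFFFF, so the decoded fields (link up, 10G, loopback enabled and so on) are just all-ones decoded bit by bit and carry no real information. A small Python sketch for checking this condition automatically follows; `parse_register_dump` and `all_ones` are illustrative names, and the regular expression assumes the `ethtool -d` line format shown above.

```python
import re

# Matches lines such as:
# "0x01010: RDH00       (Receive Descriptor Head 00)     0xFFFFFFFF"
REGISTER_LINE = re.compile(
    r"^\s*(0x[0-9A-Fa-f]+):\s+(\w+).*\s(0x[0-9A-Fa-f]{8})\s*$"
)


def parse_register_dump(text):
    """Map register name -> 32-bit value for each register line in the dump."""
    regs = {}
    for line in text.splitlines():
        match = REGISTER_LINE.match(line)
        if match:
            regs[match.group(2)] = int(match.group(3), 16)
    return regs


def all_ones(regs):
    """True if at least one register was parsed and every value is 0xFFFFFFFF."""
    return bool(regs) and all(value == 0xFFFFFFFF for value in regs.values())


# Lines copied from the dump above.
sample = """\
0x042A4: LINKS (Link Status register)                 0xFFFFFFFF
       Link Status:                                   up
0x01010: RDH00       (Receive Descriptor Head 00)     0xFFFFFFFF
0x01028: RXDCTL00    (Receive Descriptor Control 00)  0xFFFFFFFF
0x06010: TDH00       (Transmit Descriptor Head 00)    0xFFFFFFFF
"""

regs = parse_register_dump(sample)
print(sorted(regs))
print("all registers read 0xFFFFFFFF:", all_ones(regs))
```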

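    Possible cause:

    The BAR0 address itself looks fine, yet every register reads 0xFFFFFFFF. The 82599 registers were normal at first and turned into all Fs after the system had been running for a while (about 10 hours).

    A poor connection at the PCIe interface is a possible cause.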
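Since the failure only showed up after hours of running, one way to narrow down when it happens is to poll the device's PCI configuration header periodically. The sketch below decodes the vendor ID and the command register, which sit at offsets 0x00 and 0x04 of the standard PCI header, from the Linux sysfs `config` file. The function names and the default address `0000:84:00.0` (taken from the log above) are assumptions for illustration, and the polling loop itself is left out.

```python
import struct
from pathlib import Path


def decode_pci_header(config):
    """Decode vendor ID and command bits from the start of PCI config space."""
    # Offsets 0x00, 0x02, 0x04: vendor ID, device ID, command (little-endian).
    vendor_id, device_id, command = struct.unpack_from("<HHH", config, 0)
    return {
        "present": vendor_id != 0xFFFF,        # 0xFFFF means nothing answered
        "vendor_id": vendor_id,
        "device_id": device_id,
        "memory_enabled": bool(command & 0x2),  # command bit 1: Memory Space
        "bus_master": bool(command & 0x4),      # command bit 2: Bus Master
    }


def read_pci_header(address="0000:84:00.0"):
    """Read and decode the header of one device via sysfs (Linux only)."""
    path = Path("/sys/bus/pci/devices") / address / "config"
    return decode_pci_header(path.read_bytes()[:64])
```

A result with `present` true but `memory_enabled` false would match the `Mem-` state in the lspci output above, whereas `present` false would mean the device no longer responds to configuration reads at all.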

  • Original source: https://www.cnblogs.com/smith9527/p/10348953.html