  • Machine Check Exception (MCE) problems on x86 servers

    MCE symptoms

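    Intel implements the machine-check architecture in the Pentium 4, Xeon, and P6 family processors. It provides a mechanism for detecting and reporting hardware (machine) errors, such as system bus errors, ECC errors, parity errors, cache errors, and TLB errors. It consists of a set of MSRs (Model-Specific Registers) used to set up machine checking, plus additional banks of MSRs used to record the errors.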

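    When the processor detects an uncorrectable machine-check error, it raises a machine-check exception. The machine-check architecture generally does not allow the processor to be restarted reliably after an MCE, but the MCE handler can collect information about the error from the MSRs. A typical kernel log looks like this: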

    CPU 7: Machine Check Exception: 5 Bank 0: b200004010000400

    RIP !INEXACT! 10:<ffffffff8010f16e> {mwait_idle+0x5e/0x90}

    TSC 1952dbeebcc8

    Kernel panic: Machine check

    Reconfiguring memory bank information….

    This may take a while….

    done waiting: 3 cpus not responding

    Warning: Non-empty request queue

    I/O requests in flight at dump time

    CPU 7: Machine Check Exception: 4 Bank 0: f200004040000400

    RIP !INEXACT! 10:<ffffffff8011ef69>

     
     

    How to identify an MCE error

    Any kernel crash that prints "Machine Check Exception", or whose kernel stack trace contains the do_machine_check() function, is an MCE problem.
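
    The rule above can be sketched as a small log check. This is a minimal illustration only: the function name is_mce_crash and the sample strings are assumptions made for this example, not part of any existing tool.

```python
import re

# The two indicators named in the rule: the panic banner and the
# handler's name in the kernel stack trace.
MCE_MARKERS = re.compile(r"Machine Check Exception|do_machine_check")

def is_mce_crash(log_text: str) -> bool:
    """Return True if the crash log shows either MCE indicator."""
    return MCE_MARKERS.search(log_text) is not None

sample = (
    "CPU 7: Machine Check Exception: 5 Bank 0: b200004010000400\n"
    "Kernel panic: Machine check\n"
)
print(is_mce_crash(sample))                         # True
print(is_mce_crash("Kernel panic: Out of memory"))  # False
```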

     
     

    Sources of MCE errors

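    • PCI-E device signal quality or clock problems
    • Damaged CPU chip or CPU design bug

      Damaged CPU cache or other faults

    • Possible CPU defects

      For example, defects introduced during CPU manufacturing

    • Faulty or poorly seated memory
    • Improper BIOS configuration
    • Bugs in the OS or the MCE interrupt handler
    • Environmental factors, such as temperature or humidity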

     
     

    Decoding MCE error codes

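    Taking the MCE error above as an example, the value printed after "Machine Check Exception" is the content of the IA32_MCG_STATUS MSR, and the value printed after "Bank 0" (or "Bank 5") is the content of the corresponding IA32_MCi_STATUS MSR.

    The corresponding register values are then:

    IA32_MCG_STATUS MSR: 0000000000000004

    IA32_MC0_STATUS MSR: f200000410000800

    IA32_MC5_STATUS MSR: f200001044100e0f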

     
     

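    Given the MSR values, the cause of the MCE is fairly easy to find by checking them against the Intel programming manual and other Intel documentation.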
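
    As a rough sketch of that lookup, the flag bits of the two status registers can be extracted in a few lines. The bit positions below are my reading of the machine-check chapter of the Intel SDM and should be checked against the manual for the processor in question. The names set_flags and decode_mci_status are made up for this example.

```python
# Flag bits of IA32_MCG_STATUS (assumed positions, per the Intel SDM).
MCG_FLAGS = {"RIPV": 0, "EIPV": 1, "MCIP": 2}

# Flag bits of IA32_MCi_STATUS (assumed positions, per the Intel SDM).
MCI_FLAGS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
             "MISCV": 59, "ADDRV": 58, "PCC": 57}

def set_flags(value: int, layout: dict) -> list:
    """Return the names of all flags whose bit is set in value."""
    return [name for name, bit in layout.items() if (value >> bit) & 1]

def decode_mci_status(value: int) -> dict:
    """Split an IA32_MCi_STATUS value into flags and error codes."""
    return {
        "flags": set_flags(value, MCI_FLAGS),
        "mca_error_code": value & 0xFFFF,               # bits 15:0
        "model_specific_code": (value >> 16) & 0xFFFF,  # bits 31:16
    }

# Values taken from the first line of the panic log shown earlier.
print(set_flags(0x5, MCG_FLAGS))       # ['RIPV', 'MCIP']
status = decode_mci_status(0xB200004010000400)
print(status["flags"])                 # ['VAL', 'UC', 'EN', 'PCC']
print(hex(status["mca_error_code"]))   # 0x400
```

    With UC and PCC set, the error is uncorrected and the processor context may be corrupted, which is consistent with the kernel panic in that log.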

    dmesg output:

    
    ...
    
    sbridge: HANDLING MCE MEMORY ERROR
    CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010093
    TSC 0 ADDR 67081b300 MISC 2140040486 PROCESSOR 0:206d7 TIME 1441181676 SOCKET 0 APIC 0
    EDAC MC0: CE row 2, channel 0, label "CPU_SrcID#0_Channel#3_DIMM#0": 1 Unknown error(s): memory read on FATAL area : cpu=0 Err=0001:0093 (ch=3), addr= 0x67081b300 => socket=0, Channel=3(mask=8), rank=0
    
    ...
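
    The fields of the "Machine Check Exception" line in this output can be pulled apart with a regular expression. This is a sketch that assumes the line keeps the layout shown here. The pattern and the name parse_mce_line are illustrative, not taken from mcelog or the kernel.

```python
import re

# Matches e.g. "CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010093"
MCE_LINE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check Exception:\s*(?P<mcgstatus>[0-9a-fA-F]+)"
    r" Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]{16})"
)

def parse_mce_line(line: str):
    """Return cpu, mcg_status, bank and status as integers, or None."""
    m = MCE_LINE.search(line)
    if m is None:
        return None
    return {
        "cpu": int(m["cpu"]),
        "mcg_status": int(m["mcgstatus"], 16),  # IA32_MCG_STATUS
        "bank": int(m["bank"]),
        "status": int(m["status"], 16),         # IA32_MCi_STATUS
    }

info = parse_mce_line("CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010093")
print(info["bank"], hex(info["status"]))  # 5 0x8c00004000010093
```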

    Save these 4 log lines to a file named mlog (/tmp/mlog in the command below) and decode it with mcelog:

    
    # mcelog --ascii < /tmp/mlog
    WARNING: with --dmi mcelog --ascii must run on the same machine with the
    	 same BIOS/memory configuration as where the machine check occurred.
    sbridge: HANDLING MCE MEMORY ERROR
    CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010093
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    Wed Sep  2 16:14:36 2015
    CPU 0 BANK 5 MISC 2140040486 ADDR 67081b300
    STATUS 8c00004000010093 MCGSTATUS 0
    CPUID Vendor Intel Family 6 Model 45
    WARNING: SMBIOS data is often unreliable. Take with a grain of salt!
    <24> DIMM 1333 Mhz Res13 Width 72 Data Width 64 Size 16 GB
    Device Locator: Node0_Channel2_Dimm0
    Bank Locator: Node0_Bank0
    Manufacturer: Hynix Semiconducto
    Serial Number: 40743B5A
    Asset Tag: Dimm2_AssetTag
    Part Number: HMT42GR7BFR4A-PB
    TSC 0 ADDR 67081b300 MISC 2140040486 PROCESSOR 0:206d7 TIME 1441181676 SOCKET 0 APIC 0
    EDAC MC0: CE row 2, channel 0, label "CPU_SrcID#0_Channel#3_DIMM#0": 1 Unknown error(s): memory read on FATAL area : cpu=0 Err=0001:0093 (ch=3), addr = 0x67081b300 => socket=0, Channel=3(mask=8), rank=0
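
    The low 16 bits of STATUS 8c00004000010093 are the MCA error code 0x0093, which the EDAC line reports as Err=0001:0093. As a hedged sketch, assuming the memory-controller compound code layout 000F 0000 1MMM CCCC described in the Intel SDM, the code can be decoded as follows. The function name and the returned dictionary are invented for this example.

```python
# MMM field values for memory-controller errors (assumed, per the Intel SDM).
MEM_TRANSACTIONS = {
    0b000: "generic undefined request",
    0b001: "memory read error",
    0b010: "memory write error",
    0b011: "address/command error",
    0b100: "memory scrubbing error",
}

def decode_memory_controller_code(mca_code: int):
    """Decode a 000F 0000 1MMM CCCC code; return None for other code types."""
    # Ignore the F bit (bit 12); bits 15:13 and 11:8 must be 0, bit 7 must be 1.
    if (mca_code & 0xEF80) != 0x0080:
        return None
    mmm = (mca_code >> 4) & 0x7
    cccc = mca_code & 0xF
    return {
        "transaction": MEM_TRANSACTIONS.get(mmm, "reserved"),
        "channel": None if cccc == 0xF else cccc,  # 1111 = channel not specified
    }

print(decode_memory_controller_code(0x0093))
# {'transaction': 'memory read error', 'channel': 3}
```

    The result agrees with the EDAC message above, which reports a memory read error on channel 3.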

    Using these two fields from the output:
    Part Number: HMT42GR7BFR4A-PB
    Serial Number: 40743B5A

    find the matching hardware in the lshw output:

    
    ...
    
    	 *-memory:0
    	      description: System Memory
    	      physical id: 2d
    	      slot: System board or motherboard
    	    *-bank:0
    	         description: DIMM 1333 MHz (0.8 ns)
    	         product: HMT42GR7BFR4A-PB
    	         vendor: Hynix Semiconducto
    	         physical id: 0
    	         serial: 905D21AE
    	         slot: Node0_Channel1_Dimm0
    	         size: 16GiB
    	         width: 64 bits
    	         clock: 1333MHz (0.8ns)
    	    *-bank:1
    	         description: DIMM Synchronous [empty]
    	         product: A1_Dimm1_PartNumber
    	         vendor: Dimm1_Manufacturer
    	         physical id: 1
    	         serial: Dimm1_SerNum
    	         slot: Node0_Channel1_Dimm1
    	         width: 64 bits
    	    *-bank:2
    	         description: DIMM 1333 MHz (0.8 ns)
    	         product: HMT42GR7BFR4A-PB
    	         vendor: Hynix Semiconducto
    	         physical id: 2
    	         serial: 40743B5A
    	         slot: Node0_Channel2_Dimm0
    	         size: 16GiB
    	         width: 64 bits
    	         clock: 1333MHz (0.8ns)
    
    		...
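
    The manual search through the lshw output can also be scripted. The sketch below assumes the plain-text layout shown above, with "*-bank" headers followed by "key: value" lines. The function find_dimm_slot and the shortened sample text are illustrative only, and a bank section that is followed by a non-bank section would need extra handling.

```python
import re

def find_dimm_slot(lshw_text: str, serial: str):
    """Return the 'slot' of the *-bank section whose 'serial' matches."""
    # Split at each "*-bank" header so every chunk holds one bank's fields;
    # the text before the first header is dropped.
    for chunk in re.split(r"^\s*\*-bank", lshw_text, flags=re.MULTILINE)[1:]:
        fields = {}
        for line in chunk.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields.get("serial") == serial:
            return fields.get("slot")
    return None

sample = """\
*-bank:0
     serial: 905D21AE
     slot: Node0_Channel1_Dimm0
*-bank:2
     serial: 40743B5A
     slot: Node0_Channel2_Dimm0
"""
print(find_dimm_slot(sample, "40743B5A"))  # Node0_Channel2_Dimm0
```

    For the serial number reported by mcelog, this returns the same slot as the Device Locator field, Node0_Channel2_Dimm0.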


  • Original article: https://www.cnblogs.com/DataArt/p/10374028.html