  • EDAC DIMM CE errors causing server reboots

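     Symptoms:

    Over the past few days a Huawei RH2285 server has been rebooting on its own at irregular intervals, typically once or twice a day. The system log shows the errors below, with one entry logged per second.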

    OS:OEL 6.5

    $ more /var/log/messages

    Jul 21 08:54:32 customer kernel: EDAC MC1: 5486 CE error on CPU#1Channel#2_DIMM#1 (channel:2 slot:1 page:0x0 offset:0x0 grain:8 syndrome:0x0)

    Jul 21 08:54:33 customer kernel: EDAC MC1: 11480 CE error on CPU#1Channel#2_DIMM#1 (channel:2 slot:1 page:0x0 offset:0x0 grain:8 syndrome:0x0)

    Jul 21 08:54:34 customer kernel: EDAC MC1: 11330 CE error on CPU#1Channel#2_DIMM#1 (channel:2 slot:1 page:0x0 offset:0x0 grain:8 syndrome:0x0)

    Jul 21 08:54:35 customer kernel: EDAC MC1: 6584 CE error on CPU#1Channel#2_DIMM#1 (channel:2 slot:1 page:0x0 offset:0x0 grain:8 syndrome:0x0)

    Jul 21 08:54:36 customer kernel: EDAC MC1: 27428 CE error on CPU#1Channel#2_DIMM#1 (channel:2 slot:1 page:0x0 offset:0x0 grain:8 syndrome:0x0)

    Jul 21 08:54:37 customer kernel: EDAC MC1: 30113 CE error on CPU#1Channel#2_DIMM#1 (channel:2 slot:1 page:0x0 offset:0x0 grain:8 syndrome:0x0)

    Jul 21 08:54:38 customer kernel: EDAC MC1: 4453 CE error on CPU#1Channel#2_DIMM#1 (channel:2 slot:1 page:0x0 offset:0x0 grain:8 syndrome:0x0)

    Jul 21 08:54:39 customer kernel: EDAC MC1: 6269 CE error on CPU#1Channel#2_DIMM#1 (channel:2 slot:1 page:0x0 offset:0x0 grain:8 syndrome:0x0)

    Jul 21 08:54:40 customer kernel: EDAC MC1: 15720 CE error on CPU#1Channel#2_DIMM#1 (channel:2 slot:1 page:0x0 offset:0x0 grain:8 syndrome:0x0)

    Jul 21 08:54:41 customer kernel: EDAC MC1: 16107 CE error on CPU#1Channel#2_DIMM#1 (channel:2 slot:1 page:0x0 offset:0x0 grain:8 syndrome:0x0)
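
With one entry per second, the raw log is hard to read. The following is a minimal sketch, not part of the original troubleshooting, that totals the entries per memory controller and DIMM label. The function name summarize_edac_ce is hypothetical, and the sketch assumes the message format shown above, where the number before "CE error" is the count reported in that entry.

```shell
# Sum the CE counts per memory controller and DIMM label in a syslog file.
# Assumes entries of the form:
#   ... kernel: EDAC MC1: 5486 CE error on CPU#1Channel#2_DIMM#1 (channel:2 ...
summarize_edac_ce() {
    awk '
        /EDAC MC[0-9]+: [0-9]+ CE error on / {
            for (i = 3; i <= NF - 3; i++) {
                if ($i == "CE" && $(i + 1) == "error" && $(i + 2) == "on") {
                    mc = $(i - 2)
                    sub(/:$/, "", mc)          # "MC1:" becomes "MC1"
                    key = mc " " $(i + 3)      # controller plus DIMM label
                    total[key] += $(i - 1)     # number printed before "CE"
                    lines[key]++
                    break
                }
            }
        }
        END {
            for (key in total)
                printf "%s lines=%d ce=%d\n", key, lines[key], total[key]
        }
    ' "$1"
}
```

It would be called as summarize_edac_ce /var/log/messages. If every entry carries the same label, as in the excerpt above, the output is a single line for that DIMM.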

    Analysis:

    These entries come from [EDAC (Error Detection And Correction)](https://www.kernel.org/doc/Documentation/edac.txt).

    CE is short for Correctable Error. There is also UE (Uncorrectable Error).

    Following the document above, locate the faulty DIMM:

    [root@customer log]# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count

    /sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0

    /sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0

    /sys/devices/system/edac/mc/mc0/csrow0/ch2_ce_count:0

    /sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:0

    /sys/devices/system/edac/mc/mc0/csrow1/ch1_ce_count:0

    /sys/devices/system/edac/mc/mc0/csrow1/ch2_ce_count:0

    /sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:0

    /sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count:0

    /sys/devices/system/edac/mc/mc1/csrow0/ch2_ce_count:0

    /sys/devices/system/edac/mc/mc1/csrow1/ch0_ce_count:0

    /sys/devices/system/edac/mc/mc1/csrow1/ch1_ce_count:0

    /sys/devices/system/edac/mc/mc1/csrow1/ch2_ce_count:554836518
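
The grep above prints every counter, including the ones that are zero. As a rough sketch, the helper below (a hypothetical name, nonzero_ce_counts) prints only the counters with a non-zero value. It assumes the same mc*/csrow*/ch*_ce_count sysfs layout shown in the output, and takes the sysfs root as an optional argument so it can be pointed at a copy of the tree.

```shell
# Print only the csrow/channel CE counters that are non-zero.
# The optional argument is the EDAC sysfs root (defaults to the path used above).
nonzero_ce_counts() {
    root=${1:-/sys/devices/system/edac/mc}
    for f in "$root"/mc*/csrow*/ch*_ce_count; do
        [ -r "$f" ] || continue            # glob did not match, or file unreadable
        count=$(cat "$f")
        # Non-numeric content makes the test fail and the file is skipped.
        if [ "$count" -gt 0 ] 2>/dev/null; then
            printf '%s %s\n' "${f#"$root"/}" "$count"
        fi
    done
}
```

On the machine described here, this would be expected to print only the mc1/csrow1/ch2_ce_count entry.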

    The errors are at /mc1/csrow1/ch2. According to the layout diagram:

           Channel 0   Channel 1

    ===================================

    csrow0 | DIMM_A0   | DIMM_B0 |

    csrow1 | DIMM_A0   | DIMM_B0 |

    ===================================

    ===================================

    csrow2 | DIMM_A1   | DIMM_B1 |

    csrow3 | DIMM_A1   | DIMM_B1 |

    ===================================

    Then check with dmidecode:

    [root@customer log]# dmidecode -t memory |grep 'Locator: DIMM'

           Locator: DIMM_D0

           Locator: DIMM_D1

           Locator: DIMM_E0

           Locator: DIMM_E1

           Locator: DIMM_F0

           Locator: DIMM_F1

           Locator: DIMM_A0

           Locator: DIMM_A1

           Locator: DIMM_B0

           Locator: DIMM_B1

           Locator: DIMM_C0

           Locator: DIMM_C1
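
To see which of these slots are actually populated, the locator can be paired with the module size. The sketch below is an illustration only: dimm_sizes is a hypothetical helper, and it assumes that each Memory Device section of the dmidecode output lists its Size: line before its Locator: line.

```shell
# Read `dmidecode -t memory` output on stdin and print "locator<TAB>size"
# for each memory device. "Bank Locator:" lines are ignored because the
# pattern is anchored to the start of the (indented) line.
dimm_sizes() {
    awk '
        /^[ \t]*Size:/ {
            sub(/^[ \t]*Size:[ \t]*/, "")
            size = $0
        }
        /^[ \t]*Locator:/ {
            sub(/^[ \t]*Locator:[ \t]*/, "")
            printf "%s\t%s\n", $0, size
        }
    '
}
```

Usage would be dmidecode -t memory | dimm_sizes, run as root like the command above.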

           

    [Figure: memory view in the server management console]

    [Figure: layout of the memory slots on the motherboard]


    Combined with the error log entry: kernel: EDAC MC1: 16107 CE error on CPU#1Channel#2_DIMM#1 (channel:2 slot:1

    the problem should be in memory slot DIMM_F1.

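    Solution:

    Finally, remove the memory module from the problematic F1 slot, or move it to another slot. After that, the system no longer reported the error after booting.

    References: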

    http://blog.tankywoo.com/2014/12/02/edac-dimm-ce-error.html

    http://serverfault.com/questions/648240/how-can-i-find-which-memory-have-ce-error


     

  • Original article: https://www.cnblogs.com/wgwyanfs/p/7060766.html