zoukankan      html  css  js  c++  java
  • (原)一次内存故障导致ora7445

    昨天下午1点,收到公司在客户现场的电话,数据库访问不了了,因为我也不在现场,当时那个操作系统是windows 2003 server,oracle版本是oracle10g win32版本(10.2.0.1.0),觉得是操作系统不稳定的问题(先天不良),也没有多想,告诉他重启一下机器看看,然后就没事了,快到下班的时候,再次出现了这个问题。

    通过teamViewer等上这个系统,取下了alert_xx.log日志的,简单的see了一下,trace文件在bdump下,顺便把bdump文件夹取下来,一起分析。

    alert_xx.log文件报错误如下:

    Mon May 27 07:21:54 2013
    Thread 1 advanced to log sequence 194
      Current log# 1 seq# 194 mem# 0: F:\ORACLE\PRODUCT\10.2.0\ORADATA\KXONE\REDO01.LOG
    Mon May 27 13:43:44 2013
    Errors in file f:\oracle\product\10.2.0\admin\kxone\bdump\kxone_j000_7236.trc:
    ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C959F40] [ADDR:0x52004D] [UNABLE_TO_WRITE] []
    
    Mon May 27 13:43:45 2013
    Errors in file f:\oracle\product\10.2.0\admin\kxone\bdump\kxone_j000_7236.trc:
    ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C95A308] [ADDR:0xC0C56E3] [UNABLE_TO_WRITE] []
    ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C959F40] [ADDR:0x52004D] [UNABLE_TO_WRITE] []
    
    Mon May 27 13:43:46 2013
    Errors in file f:\oracle\product\10.2.0\admin\kxone\bdump\kxone_j000_7236.trc:
    ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C94BB52] [ADDR:0xFF016AFF] [UNABLE_TO_READ] []
    ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C95A308] [ADDR:0xC0C56E3] [UNABLE_TO_WRITE] []
    ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C959F40] [ADDR:0x52004D] [UNABLE_TO_WRITE] []

    大概说明是:内存地址不可读取,同时ora-7445和ora-600的一样,是oracle两个严重的错误。

    我们看到这是是5月27下午1点的报错,,也现场打电话过来的时间也一致,j000进程是oracle 的job调用进程,于是打看这个trace:

    Dump file f:\oracle\product\10.2.0\admin\kxone\bdump\kxone_j000_7236.trc
    Mon May 27 13:43:44 2013
    ORACLE V10.2.0.1.0 - Production vsnsta=0
    vsnsql=14 vsnxtr=3
    Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
    With the Partitioning, OLAP and Data Mining options
    Windows Server 2003 Version V5.2 Service Pack 2
    CPU                 : 32 - type 586, 2 Physical Cores
    Process Affinity    : 0x00000000
    Memory (Avail/Total): Ph:20800M/24553M, Ph+PgF:22741M/26282M, VA:1209M/2047M
    Instance name: kxone
    
    Redo thread mounted by this instance: 1
    
    Oracle process number: 42
    
    Windows thread id: 7236, image: ORACLE.EXE
    
    
    *** 2013-05-27 13:43:44.810
    ksedmp: internal or fatal error
    ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C959F40] [ADDR:0x52004D] [UNABLE_TO_WRITE] []
    Current SQL information unavailable - no session.
    ----- Call Stack Trace -----
    calling              call     entry                argument values in hex      
    location             type     point                (? means dubious value)     
    -------------------- -------- -------------------- ----------------------------
    7C959F40                      00000000             
    77C4E78F             CALL???  00000000             
    77C4E761             CALLrel  77C4E76D             
    77C468FF             CALLrel  77C4E759             
    77C46216             CALLrel  77C46883             
    77C46196             CALLrel  77C461EA             
    77C4F860             CALLrel  77C4F87D             
    77C4F78C             CALLrel  77C4F7C9             
    77C4F3BE             CALLreg  00000000             
    77C4EECF             CALLrel  77C4EEDD             
    77C4F61F             CALLrel  77C4EEC5             
    77CB40DF             CALLrel  77C4F600             
    76E350DC             CALLrel  76E34365             76E34370 76E342B4 AA3E118
    76E34FFC             CALLrel  76E350C5             
    76E35102             CALLrel  76E33FD2             
    71A83F89             CALLrel  71A83E82             
    71A83F16             CALLrel  71A83F5F             
    71A83473             CALLrel  71A83E8D             
    71B66DBD             CALLreg  00000000             
    71B66D9B             CALLrel  71B66DA9             
    71B66D65             CALLrel  71B66D84             
    71B66D03             CALLrel  71B66D1D             
    71B678E9             CALLrel  71B66C91             
    71B67843             CALLrel  71B6788B             
    71B67D29             CALLrel  71B677DA             
    _sldmGetHostInfo+27  CALL???  00000000             
    _sldmInit+97         CALLrel  _sldmGetHostInfo+0   93D93F4 93D94F4 100 93D93F4
                                                       100
    _ldmInit+318         CALLrel  _sldmInit+0          93D93F0 8AF6618 93D93F0 0 23C
                                                       8E0F9A8 0 658
    _keltnfy+148         CALLrel  _ldmInit+0           AA3EA30 0 6BBE922C E73D908

     当时没有sql在运行(current sql),这个是一个十分重要的信息,一般oracle的在各平台下的bug,大多可能是应用的sql或是dba的某些操作触发的,而函数调用为:_sldmGetHostInfo,_sldmInit,同时这个trace上面还有一个os的的极重要的内存信息:Memory (Avail/Total): Ph:20800M/24553M,即操作系统内存为24G。

    把ora-7445 [ACCESS_VIOLATION]和ora-7445 _sldmGetHostInfo _sldmInit no session在mos上找了一下,没有相同的症状。之前也看了baidu了一下这方面的问题,在老杨博客上找到一篇文章(http://yangtingkun.net/?p=401):

    查询MOS后,确认是Windows平台上的bug,详情参考MS-Windows: ORA-7445 On functions: ssbev_env, slzgetevar + no SGA [ID 1305096.1]。这个bug影响10.2.0.5以前的Windows平台下的10g,导致问题的原因是一个未公布的Bug:8592848,Oracle在10.2.0.5和11.2.0.1中fixed了这个bug。

    但和我的还不一样,继续找alert.log发现了另一个trace文件:

    Tue May 28 17:02:49 2013
    Errors in file f:\oracle\product\10.2.0\db_1\rdbms\trace\kxone_ora_4176.trc:
    ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C959F40] [ADDR:0x52004D] [UNABLE_TO_WRITE] []
    
    Tue May 28 17:02:49 2013
    Errors in file f:\oracle\product\10.2.0\db_1\rdbms\trace\kxone_ora_4176.trc:
    ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C94BB52] [ADDR:0xFF016AFF] [UNABLE_TO_READ] []
    ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C959F40] [ADDR:0x52004D] [UNABLE_TO_WRITE] []
    
    Tue May 28 17:02:50 2013
    Errors in file f:\oracle\product\10.2.0\db_1\rdbms\trace\kxone_ora_4176.trc:
    ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C94BB52] [ADDR:0xFF016AFF] [UNABLE_TO_READ] []
    ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C94BB52] [ADDR:0xFF016AFF] [UNABLE_TO_READ] []
    ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C959F40] [ADDR:0x52004D] [UNABLE_TO_WRITE] []

    这是5月28日的,打开 f:\oracle\product\10.2.0\db_1\rdbms\trace\kxone_ora_4176.trc

    Dump file f:\oracle\product\10.2.0\db_1\rdbms\trace\kxone_ora_4176.trc
    Tue May 28 17:02:49 2013
    ORACLE V10.2.0.1.0 - Production vsnsta=0
    vsnsql=14 vsnxtr=3
    Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
    With the Partitioning, OLAP and Data Mining options
    Windows Server 2003 Version V5.2 Service Pack 2
    CPU                 : 32 - type 586, 2 Physical Cores
    Process Affinity    : 0x00000000
    Memory (Avail/Total): Ph:22687M/24553M, Ph+PgF:24657M/26282M, VA:1243M/2047M
    Instance name: kxone
    
    Redo thread mounted by this instance: 0 <none>
    
    Oracle process number: 0
    
    Windows thread id: 4176, image: ORACLE.EXE
    
    
    *** 2013-05-28 17:02:49.437
    ksedmp: internal or fatal error
    ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C959F40] [ADDR:0x52004D] [UNABLE_TO_WRITE] []
    Current SQL information unavailable - no SGA.
    ----- Call Stack Trace -----
    calling              call     entry                argument values in hex      
    location             type     point                (? means dubious value)     
    -------------------- -------- -------------------- ----------------------------
    7C959F40                      00000000             
    7C95A913             CALLrel  7C959E46             
    7C963BBF             CALL???  00000000             
    7C830BC2             CALL???  00000000             
    _ssbev_env+40        CALL???  00000000             
    _slzgetevar+278      CALLrel  _ssbev_env+0         
    _kpummpin+745        CALLrel  _slzgetevar+0        894F3A8 607B5174 1A 894F388

    这个和上面 ID 1305096.1 描述的一致,大致判断为:10.2.0.1.0在windows 32 bit上的一个bug,根本解决办法是升级oracle到10.2.0.5.0。

    这时,我打开了一开始为确认现场环境相关情况时要求对方给我发一个操作信息截图:

    这和上面的trace文件中的描述基本一致,但通常情况下,服务器的内存为8G,16G,32G的为多,一个24G的内存,这时又得到一个信息,就是两台相同的机器,那台是32G的,会不会是内存故障,此时 ,在我远程修改sga_max_size,sga_target的时候(oracle给出的解决方案中也有增大和减小sga的方案),修改完之后,进行了一个startup force,接关oracle就hang住了 ,但在这之前,有个操作系统的报错:

    同样的内存不可读,会不会是内存硬件故障?

    打电话到现场让他们查看,相关的指示灯,发现ibm的这台server上,mem的灯已经开始告警,同时确认了这台机器上原来有32G的内存,可操作系统和oracle只识别出了24G,基本确认是内存物理故障。

    同时在alert.log里也有一点东西证实了这一点:

    就是数据库从open到故障的时间间隔很短,当然这只是一个佐证。

    后记:这台机器的操作系统是windows server 2003 32位,raid可能是raid5,没有热备盘,数据库oracle 10.2.0.1单机环境,无集群,做为一个准备上线的重要生产环境,两个重要问题:

    操作系统先天不良,windows定位问题困难,oracle的版本至少要patch到10.2.0.5.0。

    最好有rac环境,独立存储。

    最后感谢:dba群里笑笑提供的mos相关支持。

  • 相关阅读:
    这之后的事。。。
    POJ
    POJ
    博客园的装饰
    高斯消元
    逆序数技巧
    各种小的 dp (精)
    最大区间和变形
    树dp 统计异或值
    dp
  • 原文地址:https://www.cnblogs.com/taowang2016/p/3109989.html
Copyright © 2011-2022 走看看