zoukankan      html  css  js  c++  java
  • NDMCDB数据库hang住故障分析

    问题描写叙述:

    上午刚刚到办公室,就有监控人员邮件反馈,昨晚NDMCDB407数据库被重新启动过,让我分析一下数据库重新启动的原因。因为昨晚业务有版本号上线,所以短信警告关闭了,所以没有短信下发到我手机上,并且故障时相关人员也没有通知到我。

    1     检查alert日志

    从alert日志中,能够看到,先是在03:29时有一个job执行失败了:
    Fri Aug 22 03:29:29 2014
    Errors in file/opt/oracle/diag/rdbms/ndmcdb/NDMCDB/trace/NDMCDB_j000_28856.trc:
    ORA-12012: error on auto execute of job 31
    ORA-04023: ObjectNDMC.DELETE_ANONY_RSHARE_INFO could not be validated or authorized
    ORA-06512: at "NDMC.PROC_NDMC_CANCEL_OPEN",line 5
    ORA-06512: at line 1
    然后在03:49时,出现了连接超时失败,并且一直持续到05:00:08:
    Fri Aug 22 03:49:43 2014
    ***********************************************************************
     
    Fatal NI connect error 12170.
     
     VERSION INFORMATION:
           TNS for Linux: Version 11.1.0.7.0 - Production
           Oracle Bequeath NT Protocol Adapter for Linux: Version 11.1.0.7.0 -Production
           TCP/IP NT Protocol Adapter for Linux: Version 11.1.0.7.0 - Production
     Time: 22-AUG-2014 03:49:43
     Tracing not turned on.
      Tnserror struct:
       ns main err code: 12535
       
    TNS-12535: TNS:operation timed out
       ns secondary err code: 12606
       nt main err code: 0
       nt secondary err code: 0
       nt OS err code: 0
     Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.130.87)(PORT=36628))
    WARNING: inbound connection timed out(ORA-3136)
    Fri Aug 22 03:49:44 2014
    ……
    并且出现了连接数耗尽了:
    Fri Aug 22 03:49:50 2014
    ORA-00020: maximum number of processes 0exceeded
       ns secondary err code: 12560
       ns secondary err code: 12560
       ns main err code: 12537
    Fri Aug 22 03:49:50 2014
    ……
    Fri Aug 22 03:51:48 2014
     
    ***********************************************************************
     
    Fatal NI connect error 12537, connectingto:
     (LOCAL=NO)
     
     VERSION INFORMATION:
           TNS for Linux: Version 11.1.0.7.0 - Production
           Oracle Bequeath NT Protocol Adapter for Linux: Version 11.1.0.7.0 -Production
           TCP/IP NT Protocol Adapter for Linux: Version 11.1.0.7.0 - Production
     Time: 22-AUG-2014 03:51:48
     Tracing not turned on.
      Tnserror struct:
       ns main err code: 12537
       
    TNS-12537: TNS:connection closed
    ns secondaryerr code: 12560
       nt main err code: 0
       nt secondary err code: 0
       nt OS err code: 0
    ORA-609 : opiodr aborting process unknownospid (30476_47044991385184)
    Fri Aug 22 04:14:15 2014
    ORA-28 : opiodr aborting process unknownospid (24925_46986315964000)
    Fri Aug 22 04:16:27 2014
    ORA-28 : opiodr aborting process unknownospid (22475_47013891882592)
    Fri Aug 22 04:16:28 2014
    ORA-28 : opiodr aborting process unknownospid (21356_47116835528288)
    Fri Aug 22 04:16:29 2014
    ORA-28 : opiodr aborting process unknownospid (24947_47774766210656)
    ORA-28 : opiodr aborting process unknownospid (14958_47053435166304)
    ……
    Fri Aug 22 05:00:05 2014
    ORA-28 : opiodr aborting process unknownospid (25765_46941307182688)
    Fri Aug 22 05:00:08 2014
    ORA-28 : opiodr aborting process unknownospid (4949_47396524895840)
    于是在05:04数据库被关闭,从日志来看,这是正常关闭的,初步怀疑是人为关闭或是VCS双机自己主动将数据库关闭了:
    Fri Aug 22 05:04:10 2014
    Stopping background process SMCO
    Stopping background process FBDA
    Shutting down instance: further logonsdisabled
    Fri Aug 22 05:04:12 2014
    Stopping background process CJQ0
    Stopping background process QMNC
    Stopping background process MMNL
    Stopping background process MMON
    Shutting down instance (immediate)
    License high water mark = 1220
    Stopping Job queue slave processes, flags =7
    Fri Aug 22 05:04:20 2014
    Waiting for Job queue slaves to complete
    Job queue slave processes stopped
    Fri Aug 22 05:09:11 2014
    License high water mark = 1220
    USER (ospid: 25110): terminating theinstance
    Termination issued to instance processes.Waiting for the processes to exit
    Fri Aug 22 05:09:21 2014
    Instance termination failed to kill one ormore processes
    Instance terminated by USER, pid = 25110

    2     检查messages日志

    大概在05:03:51时,人为的想将双机切换到备机中:

    Aug 22 05:03:51 NDMCDB11 user_cmd:2014-08-22 05:03:51 hagrp -switch RCS_DB_SG -to system by root from [oraclepts/9 Aug 22 04:29 (192.168.128.142)]
    Aug 22 05:04:01 NDMCDB11/usr/sbin/cron[15348]: (root) CMD (su - root -c'/opt/watchdog/watchdog_schedule -n OS,oracle' >/dev/null 2>&1)
    Aug 22 05:04:01 NDMCDB11 su: (to root) rooton none
    Aug 22 05:04:03 NDMCDB11 su: (to oracle)root on none
    Aug 22 05:04:09 NDMCDB11 user_cmd:2014-08-22 05:04:09 hagrp -switch RCS_DB_SG -to NDMCDB12 by root from [oraclepts/9 Aug 22 04:29 (192.168.128.142)]
    Aug 22 05:04:09 NDMCDB11 su: (to oracle)root on none

    但双机切换失败,最后是直接将双机停止,重新启动VCS:

    Aug 22 05:06:18 NDMCDB11 user_cmd:2014-08-22 05:06:18 hastop -all by root from [oracle pts/9 Aug 22 04:29(192.168.128.142)]
    ……
    Aug 22 05:07:02 NDMCDB11 user_cmd:2014-08-22 05:07:02 hastat by root from [oracle pts/9 Aug 22 04:29(192.168.128.142)]

    所以,到这里就已经确定,数据库这所以重新启动了,全然是因为人为将VCS集群重新启动引起的。那么为什么要VCS群集重新启动呢?数据库究竟有没有问题呢?再来看看。

    最后,经向升级人员操作确认,在升级时,有一个存储过程须要跑,但执行后,数据库基本响应就很慢了,一直执行到3:29左右,人为cancel掉了,所以这也就是为什么会出现这种报错了:

    Fri Aug 22 03:29:29 2014
    Errors in file/opt/oracle/diag/rdbms/ndmcdb/NDMCDB/trace/NDMCDB_j000_28856.trc:
    ORA-12012: error on auto execute of job 31
    ORA-04023: ObjectNDMC.DELETE_ANONY_RSHARE_INFO could not be validated or authorized
    ORA-06512: at"NDMC.PROC_NDMC_CANCEL_OPEN", line 5
    ORA-06512: at line 1

    3     查看系统负载

    CPU负载:


    内存负载:


    可见,系统在3:49左右,出现了CPU及内存均被耗尽的情况,这个时间段,刚好数据库出现了大量连接超时失败,甚至是出现了连接数超过阀值:

    Fri Aug 22 03:49:50 2014
    ORA-00020: maximum number of processes 0exceeded
       ns secondary err code: 12560
       ns secondary err code: 12560
       ns main err code: 12537
    Fri Aug 22 03:49:50 2014

    4     分析AWR


    从这里看,数据库在2点到3点时,已经非常的繁忙,但从之前有系统负载来看,2点到3点时,CPU及内存使用率都不算非常高的。接着看:


    指标都没有什么特别高的。


    从top 5 event中,看到了有大量的cursor: pin S wait on X等待,可见出现mutex争用,但通常这仅仅是表象而已,并不是根因。


    绝大部分时间都在做SQL的解析,并且解析还失败了,这就是数据库hang住的根因。正常来说,一个数据库的绝大部分时间应该是用于SQL的运行,所以这个是占用最多时间的:sql execute elapsedtime等。


    不存在较高的versioncount。

    那么数据库什么时候出现的不停解析SQL,而且解析失败了呢?

    查了DBA_HIST_ACTIVE_SESS_HISTORY,分析了下历史会话信息,发如今02:57:00至03:00:00出现的问题:


    经过确认,恰巧就是运行存储过程的时间点左右。

    至此,数据库从3:00開始,已经是不正常的,数据库不停的在解析SQL,SQL都还没有到运行这一步,数据库已经处于无响应的状态,连接会话都被堵塞住了,直到连接数达到了最大连接数,最后被升级操作人员重新启动了VCS集群。

    5     分析结论

    (1)数据库down机主要还是人为进行了VCS切换失败后,进行了VCS重新启动操作引起。

    (2)这套数据库故障的根因,还是为什么数据库在2:58左右时出现解析SQL失败上。从眼下的日志分析来看,看不出是什么原因。


    -- Bosco
    ---- END ----

    -------------------------------------------------------------------------------------------------------

    版权全部,文章同意转载,但必须以链接方式注明源地址,否则追究法律责任!


  • 相关阅读:
    CentOS查看CPU信息、位数、多核信息
    Linux常用命令大全
    chmod命令详细用法
    tar命令的详细解释
    yum和rpm命令详解
    LeetCode 241. Different Ways to Add Parentheses
    LeetCode 139. Word Break
    LeetCode 201. Bitwise AND of Numbers Range
    LeetCode 486. Predict the Winner
    LeetCode 17. Letter Combinations of a Phone Number
  • 原文地址:https://www.cnblogs.com/mfrbuaa/p/4488822.html
Copyright © 2011-2022 走看看