zoukankan      html  css  js  c++  java
  • 客户数据库出现大量cache buffer chains latch

    http://yangtingkun.itpub.net/post/468/526943

    客户产品数据库上午出现了严重的性能问题,简单记录一下问题的诊断和解决过程。

    可以看到,数据库的DB TIME已经涨到了非常高的地步,这说明系统正经受着非常严重的性能问题:

     

    Snap Id

    Snap Time

    Sessions

    Cursors/Session

    Begin Snap:

    7170

    31-3 -12 08:00:50

    395

    10.5

    End Snap:

    7172

    31-3 -12 09:00:27

    689

    11.7

    Elapsed:

     

    59.61 (mins)

       

    DB Time:

     

    17,270.93 (mins)

       

    TOP 5看,等待事件中最明显的是latch: cache buffers chains,从当前系统的状态也可以看到这一点:

    SQL> SELECT EVENT, COUNT(*) FROM GV$SESSION GROUP BY EVENT HAVING COUNT(*) > 5 ORDER BY 2 DESC;
    
    EVENT COUNT(*)
    ---------------------------------------------------------------- ----------
    latch: cache buffers chains 1043
    SQL*Net message from client 500
    SQL*Net message to client 35
    rdbms ipc message 29
    gc cr request 25
    latch free 13
    gcs remote message 10
    gc buffer busy 7
    
    8 rows selected

    系统中部分SQL的执行时间已经长达几个小时:

     

    SQL ordered by Elapsed Time

     

    Elapsed Time (s)

    CPU Time (s)

    Executions

    Elap per Exec (s)

    % Total DB Time

    SQL Id

    SQL Module

    SQL Text

    341,821

    11,917

    112

    3051.97

    32.99

    8v9mhuxam279p

    JDBC Thin Client

    SELECT count(*) total FROM ( s...

    244,752

    3,380

    147

    1664.98

    23.62

    44zutxspd664c

    JDBC Thin Client

    SELECT count(*) total FROM ( s...

    52,419

    11,331

    3

    17472.91

    5.06

    a6aqkm30u7p90

    JDBC Thin Client

    select cf.cardid, (CASE WHEN c...

    38,767

    532

    11

    3524.24

    3.74

    8b8kbrqmg7qf9

    JDBC Thin Client

    SELECT count(*) total FROM ( s...

    37,146

    2,524

    2

    18573.15

    3.58

    8fpf0vtjzb583

    JDBC Thin Client

    SELECT count(*) total FROM ( s...

    30,796

    2,331

    5

    6159.20

    2.97

    9wxzf70vub4wg

    JDBC Thin Client

    SELECT count(*) total FROM ( s...

    29,991

    2,506

    1

    29991.41

    2.89

    5vdncfn06sxz8

    JDBC Thin Client

    SELECT count(*) total FROM ( s...

    24,762

    875

    3

    8254.06

    2.39

    8vyda1jxu2nsc

    JDBC Thin Client

    SELECT count(*) total FROM (SE...

    17,845

    699

    2

    8922.50

    1.72

    99g0x7u3jv28v

    JDBC Thin Client

    SELECT count(*) total FROM (SE...

    17,470

    1,295

    0

     

    1.69

    08qmyqnjkpgru

    JDBC Thin Client

    select * from (select aa.*, ro...

    其实根据这些现象,基本上可以判断问题了。肯定是SQL执行计划的改变导致了当前的性能问题。而类似绑定变量窥探之类的问题只会影响个别的SQL,而这么大面积的执行计划的改变,几乎可以断定是统计信息造成的。

     

    询问了一下客户最近的操作,原来昨天夜里通过数据库链的方式导入了一部分数据。而今天一早问题就出现了。

     

    其实问题已经很明显了,在通过数据库链加载数据后,并没有重新收集统计信息,且由于加载时间是在半夜,这批数据也没有赶上每天22:00的统计信息自动收集的工作。这就使得Oracle在生成执行计划时,会依赖数据加载前的统计信息,从而造成了错误的执行计划。

    首先解决问题的根源问题,对加载过数据的表重新收集统计:
    
    SQL> SELECT 'EXEC DBMS_STATS.GATHER_TABLE_STATS(''USER1'', ''' || TABLE_NAME || ''', CASCADE => TRUE)' FROM DBA_TABLES WHERE OWNER = 'USER1' AND LAST_ANALYZED < TRUNC(SYSDATE);
    
    'EXECDBMS_STATS.GATHER_TABLE_STATS(''USER1'','''||TABLE_NAME||''',CASCADE=>TRUE)'
    -----------------------------------------------------------------------------------------
    EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_1', CASCADE => TRUE)
    EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_2', CASCADE => TRUE)
    .
    .
    .
    EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_3', CASCADE => TRUE)
    EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_4', CASCADE => TRUE)
    
    12 rows selected.
    
    SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', ' TABLE_1', CASCADE => TRUE)
    
    PL/SQL procedure successfully completed.
    
    SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', ' TABLE_2', CASCADE => TRUE)
    
    PL/SQL procedure successfully completed.
    
    .
    .
    .
    
    SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', ' TABLE_3', CASCADE => TRUE)
    
    PL/SQL procedure successfully completed.
    
    SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', ' TABLE_4', CASCADE => TRUE)
    
    PL/SQL procedure successfully completed.
    虽然问题的根源已经被解决,但是当前运行的SQL并不会改变执行计划,因此还需要找到问题的SQL,从后台将其结束:
    
    SQL> SELECT 'kill -9 ' || spid from v$session s, v$process p where s.username = 'USER2' and s.paddr = p.addr and event = 'latch: cache buffers chains';
    
    'KILL-9'||SPID
    --------------------
    kill -9 28321
    kill -9 25384
    kill -9 23697
    kill -9 7239
    .
    .
    .
    kill -9 9331
    kill -9 13759
    
    740 rows selected.
    
    SQL> SELECT 'kill -9 ' || spid from gv$session s, gv$process p where s.username = 'USER2' and s.paddr = p.addr and event = 'latch: cache buffers chains' and s.inst_id = p.inst_id and s.inst_id = 2;
    
    'KILL-9'||SPID
    --------------------
    kill -9 23992
    kill -9 5289
    kill -9 21067
    kill -9 16816
    kill -9 16820
    kill -9 26767
    .
    .
    .
    kill -9 14981
    kill -9 26678
    kill -9 26682
    
    258 rows selected.
    
    分别在两个节点杀掉这些执行计划存在问题的会话,释放被大量占用的系统资源。
    由于Oracle的执行计划并非在收集统计信息后马上生效,因此还有个别的SQL仍然沿用错误的执行计划:
    
    SQL> select distinct inst_id, sql_id from gv$session where event = 'latch: cache buffers chains';
    
    INST_ID SQL_ID
    ---------- -------------
    1 39gvg7vbcm8jx
    1 a6aqkm30u7p90
    
    SQL> select address, hash_value from v$sqlarea where sql_id = 'a6aqkm30u7p90';
    
    ADDRESS HASH_VALUE
    ---------------- ----------
    C000000EB7ED3420 3248739616
    
    SQL> exec dbms_shared_pool.purge('C000000EB7ED3420,3248739616','C')
    
    PL/SQL procedure successfully completed.
    
    SQL> select address, hash_value from v$sqlarea where sql_id ='39gvg7vbcm8jx';
    
    ADDRESS HASH_VALUE
    ---------------- ----------
    C000001037B8E308 3603538493
    
    SQL> exec dbms_shared_pool.purge('C000001037B8E308’,‘3603538493', 'C')
    
    PL/SQL procedure successfully completed.
    
    由于当前的数据库版本是10.2.0.5,因此可以很方便的使用dbms_shared_pool将执行计划错误的SQL清除出共享池,强制其重新生成执行计划。
    
    SQL> select event, count(*) from gv$session where username like != user GROUP BY EVENT order by 2 desc;
    
    EVENT COUNT(*)
    ---------------------------------------------------------------- ----------
    SQL*Net message from client 370
    SQL*Net message to client 15
    gc cr request 10
    latch free 4
    Streams AQ: waiting for messages in the queue 1
    
    5 rows selected.

    数据库中SQL执行计划错误除了导致大量的latch: cache buffers chains等待以外,还存在gc cr requestlatch free等这些等待事件,经分析同样是由于错误的执行计划所致。将这些会话采用相同的方法清除后,系统负载恢复到正常范围:

     

     

    Snap Id

    Snap Time

    Sessions

    Cursors/Session

    Begin Snap:

    7188

    31-3 -12 14:55:17

    257

    20.3

    End Snap:

    7189

    31-3 -12 15:03:07

    256

    19.9

    Elapsed:

     

    7.84 (mins)

       

    DB Time:

     

    70.46 (mins)

       
  • 相关阅读:
    Top 10 Product Manager Skills To Boost Your Resume In 2021
    大数据知识梳理
    B端产品如何设计权限系统?
    华三盒式交换机MAC、ARP、Route性能表项参数查询
    中了传说中的挖矿病毒
    SqlServer 2019 事务日志传送
    docker中生成的pdf中文是方框的解决方案
    The Live Editor is unable to run in the current system configuration
    2021 面试题大纲
    五分钟搞定Docker安装ElasticSearch
  • 原文地址:https://www.cnblogs.com/taowang2016/p/3030829.html
Copyright © 2011-2022 走看看