zoukankan      html  css  js  c++  java
  • oracle慎用基于on commit刷新物化视图的方式(一)

     

    oracle慎用基于on commit刷新物化视图的方式(一)

    背景

    上周五,也就是2020-08-07那天快下班的时候,突然发生一件事。客户反馈xxxx平台登录不上去,当时一开始没意识到是数据库导致的问题。

    业务人员在那怀疑这里那里,还是网络ip哪里地区又被封了。后来都排除没发生什么问题,还想着重启Nginx服务器(我也不懂)。

    但是维护Nginx的人员刚好请假了。

    刚好当时大概是这个事情发生的半个小时之前,监控邮件报出如下错误,

     

    当时以为只是归档目录快满了,简单的备份了下归档,清理掉多余的归档。

    排查

    问题出现后,虽然没怀疑到数据库的问题,不过我也还是看了一眼数据库。

    发现警告日志闪回恢复区有在报使用率超标的问题,有问题,绝对有问题。

    除了归档量暴增的问题和刚搭建起来没多久的备库归档也爆掉的问题,没发现其他的报错。

    先看看日志切换的频率,和平时同一时段相比,这归档多了100多倍。

    让我一度怀疑是应用出现了什么问题导致dml操作剧增导致的。

    SELECT    TO_CHAR(FIRST_TIME,'YYYY-MM-DD') DAY,
                    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'00',1,0)),'999') "00",
                    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'01',1,0)),'999') "01",
                    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'02',1,0)),'999') "02",
                    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'03',1,0)),'999') "03",
                    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'04',1,0)),'999') "04",
                    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'05',1,0)),'999') "05",
                    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'06',1,0)),'999') "06",
                    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'07',1,0)),'999') "07",
                    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'08',1,0)),'999') "08",
                    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'09',1,0)),'999') "09",
                    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'10',1,0)),'999') "10",
                    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'11',1,0)),'999') "11",
                    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'12',1,0)),'999') "12",
                    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'13',1,0)),'999') "13",
                    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'14',1,0)),'999') "14",
                    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'15',1,0)),'999') "15",
                    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'16',1,0)),'999') "16",
                    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'17',1,0)),'999') "17",
                    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'18',1,0)),'999') "18",
                    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'19',1,0)),'999') "19",
                    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'20',1,0)),'999') "20",
                    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'21',1,0)),'999') "21",
                    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'22',1,0)),'999') "22",
                    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'23',1,0)),'999') "23"
    FROM V$LOG_HISTORY
    GROUP BY TO_CHAR(FIRST_TIME,'YYYY-MM-DD') 
    ORDER BY 1 DESC;
    View Code

     这个时候我需要定位找出是哪条SQL或者哪些SQL引起的问题。

    通过如下语句,可以找出哪些对象有大量数据块变化情况,

    SELECT TO_CHAR(BEGIN_INTERVAL_TIME, 'YYYY-MM-DD HH24') SNAP_TIME, 
           DHSO.OBJECT_NAME, 
           SUM(DB_BLOCK_CHANGES_DELTA)                     BLOCK_CHANGED 
    FROM   DBA_HIST_SEG_STAT DHSS, 
           DBA_HIST_SEG_STAT_OBJ DHSO, 
           DBA_HIST_SNAPSHOT DHS 
    WHERE  DHS.SNAP_ID = DHSS.SNAP_ID 
           AND DHS.INSTANCE_NUMBER = DHSS.INSTANCE_NUMBER 
           AND DHSS.OBJ# = DHSO.OBJ# 
           AND DHSS.DATAOBJ# = DHSO.DATAOBJ# 
           AND BEGIN_INTERVAL_TIME BETWEEN TO_DATE('2020-08-07 16:00', 
                                           'YYYY-MM-DD HH24:MI') 
                                           AND 
               TO_DATE('2020-08-07 17:30', 'YYYY-MM-DD HH24:MI') 
    GROUP  BY TO_CHAR(BEGIN_INTERVAL_TIME, 'YYYY-MM-DD HH24'), 
              DHSO.OBJECT_NAME 
    HAVING SUM(DB_BLOCK_CHANGES_DELTA) > 0 
    ORDER  BY SUM(DB_BLOCK_CHANGES_DELTA) asc;
    View Code

     部分关键结果如下,

    SNAP_TIME                              |OBJECT_NAME                   |BLOCK_CHANGED
    ---------------------------------------|------------------------------|-------------
    .............                          |..............                |       ......
    2020-08-07 16                          |..............                |       100480
    2020-08-07 16                          |..............                |       129184
    2020-08-07 16                          |..............                |       137024
    2020-08-07 17                          |..............                |       143520
    2020-08-07 17                          |..............                |       145440
    2020-08-07 16                          |..............                |       228624
    2020-08-07 16                          |..............                |       256704
    2020-08-07 17                          |..............                |       548336
    2020-08-07 17                          |..............                |     55019536
    2020-08-07 17                          |..............                |    103429408
    2020-08-07 16                          |T_XXXXXXXXXX_INFO             |    103997488
    2020-08-07 16                          |PK_X_XXXXXX_XXXX              |    195532752
    
    210 rows selected.
    
    Elapsed: 00:00:02.29

    在2020-08-07 16点内,有两个对象的块变化是最多的,分别是

    T_XXXXXXXXXX_INFO
    PK_X_XXXXXX_XXXX

    PK_X_XXXXXX_XXXX很容易就可以查出,这是一个主键索引,对应的表名为X_XXXXXX_XXXX。

    赶紧取问题时间段的awr报告,看看是否存在相关的X_XXXXXX_XXXX以及T_XXXXXXXXXX_INFO的sql,顺便也看看等待事件。

    结果一眼就看出等待事件存在问题,

    EventWaitsTotal Wait Time (sec)Wait Avg(ms)% DB timeWait Class
    enq: TX - row lock contention 683 223.1K 326667 39.0 Application
    gc buffer busy acquire 10,048 163.8K 16302 28.6 Cluster
    enq: JI - contention 23,034 112.7K 4892 19.7 Other
    buffer busy waits 1,770 20.9K 11784 3.6 Concurrency
    DB CPU   20K   3.5  
    enq: TX - index contention 591 6630.2 11219 1.2 Concurrency
    log file switch (checkpoint incomplete) 481 6268.1 13031 1.1 Configuration
    db file scattered read 482,552 4536 9 .8 User I/O
    gc current block busy 10,845 3853.8 355 .7 Cluster
    direct path read 832,639 3170 4 .6 User I/O

    这个系统平时没有存在那么高的TX锁等待,赶紧看看哪个段的行锁等待比较严重,

    Segments by Row Lock Waits

    OwnerTablespace NameObject NameSubobject NameObj. TypeRow Lock Waits% of Capture
    ABCD_EFG ABCD X_XXXXXX_XXXX   TABLE 681 53.71
    ABCD_XXXXXXXXS DYCK_FILETRANS PK_XXXX_XXXXXXXX_TMP   INDEX 321 25.32

    刚好和前边块变化最多的对象是同一个,也可以看看块变化,

    Segments by DB Blocks Changes

    OwnerTablespace NameObject NameSubobject NameObj. TypeDB Block Changes% of Capture
    XXXXX_DATA USERS PK_X_XXXXXX_XXXX   INDEX 161,129,664 64.72
    XXXXX_DATA USERS T_XXXXXXXXXX_INFO   TABLE 85,726,272 34.43

    都对的上!!!

    这个时候在从awr中找找含有X_XXXXXX_XXXX表的SQL,

    Elapsed Time (s)ExecutionsElapsed Time per Exec (s)%Total%CPU%IOSQL IdSQL ModuleSQL Text
    226,451.96 1,075 210.65 39.58 0.01 0.00 95nn49n57dqza JDBC Thin Client update X_XXXXXX_XXXX set ID=:1...
    166,564.13 1,368 121.76 29.12 0.14 0.00 08x3dymqbdbd4 JDBC Thin Client ......................
    20,730.38 294 70.51 3.62 0.01 0.03 4vs91dcv7u1p6 JDBC Thin Client ......................
    5,395.58 0   0.94 5.25 94.88 c7sa42jj690jw   ......................
    5,395.54 7 770.79 0.94 5.25 94.88 1hsz4n5g5dz1j   ......................
    5,205.90 38 137.00 0.91 40.58 58.45 66u44fa687q74 xxxxxxxxx.exe ......................
    4,619.25 17,384 0.27 0.81 0.05 0.08 atwcm5hwwbtj3 xxxxxxxxxxxxxxxxxx.exe ......................
    4,231.15 1 4,231.15 0.74 99.72 0.09 2d0r4v0y38vm1 xxxxxxxxxxx.exe ......................
    4,049.48 47,215 0.09 0.71 0.14 0.20 cqfs0h217uuc3 xxxxxxxxxxxxxxxxxx.exe ......................
    2,707.50 1 2,707.50 0.47 99.87 0.10 gjj8zbpgta9rv xxxx.exe ......................

    查看95nn49n57dqza的历史执行情况,

    set lines 150 pages 150                                                                                        
    col BEGIN_INTERVAL_TIME for a23                                                                                
    col PLAN_HASH_VALUE for 9999999999                                                                             
    col date_time for a30                                                                                          
    col snap_id heading 'SnapId'                                                                                   
    col executions_delta heading "No. of exec"                                                                     
    col sql_profile heading "SQL|Profile" for a7                                                                   
    col date_time heading 'Date time'                                                                              
    col avg_lio heading 'LIO/exec' for 99999999999.99                                                              
    col avg_cputime heading 'CPUTIM/exec' for 9999999.99                                                           
    col avg_etime heading 'ETIME/exec' for 9999999.99                                                              
    col avg_pio heading 'PIO/exec' for 9999999.99                                                                  
    col avg_row heading 'ROWs/exec' for 9999999.99                                                                 
    SELECT distinct                                                                                                
    s.snap_id ,                                                                                                    
    PLAN_HASH_VALUE,                                                                                               
    to_char(s.BEGIN_INTERVAL_TIME,'yyyy-mm-dd hh24:mi:ss')|| to_char(s.END_INTERVAL_TIME,'-hh24:mi:ss') Date_Time, 
    SQL.executions_delta,                                                                                          
    SQL.buffer_gets_delta/decode(nvl(SQL.executions_delta,0),0,1,SQL.executions_delta) avg_lio,                    
    --SQL.ccwait_delta,                                                                                            
    (SQL.cpu_time_delta/1000000)/decode(nvl(SQL.executions_delta,0),0,1,SQL.executions_delta) avg_cputime ,        
    (SQL.elapsed_time_delta/1000000)/decode(nvl(SQL.executions_delta,0),0,1,SQL.executions_delta) avg_etime,       
    SQL.DISK_READS_DELTA/decode(nvl(SQL.executions_delta,0),0,1,SQL.executions_delta) avg_pio,                     
    SQL.rows_processed_total/decode(nvl(SQL.executions_delta,0),0,1,SQL.executions_delta) avg_row                  
    --,SQL.sql_profile                                                                                             
    FROM                                                                                                           
    dba_hist_sqlstat SQL,                                                                                          
    dba_hist_snapshot s                                                                                            
    WHERE                                                                                                          
    SQL.instance_number =(select instance_number from v$instance)                                                  
    and SQL.dbid =(select dbid from v$database)                                                                    
    and s.snap_id = SQL.snap_id                                                                                    
    AND sql_id in                                                                                                  
    ('&SQLID') order by s.snap_id                                                                                  
    /  
    View Code

     结果如下,

        SnapId|PLAN_HASH_VALUE|Date time                     |No. of exec|       LIO/exec|CPUTIM/exec| ETIME/exec|   PIO/exec|  ROWs/exec
    ----------|---------------|------------------------------|-----------|---------------|-----------|-----------|-----------|-----------
         16581|     2297901584|2020-08-07 13:30:02-14:00:09  |        370|           4.04|        .00|        .00|        .00|    5547.76
         16583|     2297901584|2020-08-07 14:30:16-15:00:24  |        563|           4.02|        .00|        .00|        .00|    3648.02
         16584|     2297901584|2020-08-07 15:00:24-15:30:31  |        567|           4.03|        .00|        .00|        .00|    3623.28
         16585|     2297901584|2020-08-07 15:30:31-16:00:39  |        442|           4.05|        .00|        .00|        .00|    4648.97
         16586|     2297901584|2020-08-07 16:00:39-16:30:47  |        429|           4.28|        .00|        .59|        .01|    4790.84
         16587|     2297901584|2020-08-07 16:30:47-17:01:31  |        366|           5.72|        .06|     169.81|        .07|    5616.50
         16587|     2297901584|2020-08-07 16:30:47-17:07:50  |        366|           5.72|        .06|     169.81|        .07|    5616.50
         16588|     2297901584|2020-08-07 17:01:31-17:30:32  |        278|          15.70|        .03|     423.38|        .15|    7395.38
         16588|     2297901584|2020-08-07 17:07:50-17:30:32  |        278|          15.70|        .03|     423.38|        .15|    7395.38
         16589|     2297901584|2020-08-07 17:30:32-18:00:41  |        369|          31.79|        .07|    1180.75|        .00|    5572.59
         16590|     2297901584|2020-08-07 18:00:41-18:30:51  |        535|          44.85|        .06|     909.64|        .00|    3844.53
         16591|     2297901584|2020-08-07 18:30:51-19:00:00  |        201|          55.29|        .01|     192.97|        .00|   10233.94
         16592|     2297901584|2020-08-07 19:00:00-19:30:08  |         26|           4.69|        .00|       1.26|        .00|   79117.23

    从8月7日下午1点开始可以查询到记录,一直在下午4点半后开始,sql执行时间急剧增加,但是执行计划确是并没有变化过。

    并且从执行次数看,并没有多太过于剧烈的波动。

    到这里目前似乎没有什么好办法可以自己追查下去,能确定的是出问题的SQL语句就是95nn49n57dqza,由于update引起大量的TX锁等待,导致xxxx平台登录不上去或者响应很久才出现页面。

    而TX锁严重的原因就是一次update的时间从不到1s变到几百s甚至上千s。

    这里存在一个问题,便是在看了awr发现tx锁严重,再到确定SQL的过程,由于重心太过明确导致忽略掉了一开始找出哪些对象有大量数据块变化情况那里,除了PK_X_XXXXXX_XXXX外其实还有另外一个对象T_XXXXXXXXXX_INFO。

    暂且忽略T_XXXXXXXXXX_INFO顺着一直下去的思路的话,单纯从视图间的关联貌似已经无法继续下去,可以选择在数据库级别做一次ssd,或者10046。

    推荐10046,从结果反推回去,10046是可以看出会话除了有执行95nn49n57dqza之外,也是有对T_XXXXXXXXXX_INFO表做insert的操作的。

    因为第一时间发现95nn49n57dqza的问题,立马问了业务人员之后,他们有反馈在4点左右有创建一张物化视图T_XXXXXXXXXX_INFO,基表就是X_XXXXXX_XXXX。因此没做10046。

    看了创建语句,用了on commit。大致找了下资料,基本确定是这个引起的。当即决定删除这个物化视图。

    18:54:45 SYS@dyckdb1(2194)> drop MATERIALIZED VIEW MACRO_DATA.T_ENTERPRISE_INFO;
    drop MATERIALIZED VIEW MACRO_DATA.T_ENTERPRISE_INFO
    *
    ERROR at line 1:
    ORA-04021: timeout occurred while waiting to lock object
    
    
    Elapsed: 00:00:49.05
    
    --撤销授权操作,不然物化视图删不掉 18:57:45 SYS@dyckdb2(1269)> revoke select on dyck_pub.t_client_info from MACRO_DATA; ^Crevoke select on dyck_pub.t_client_info from MACRO_DATA * ERROR at line 1: ORA-01013: user requested cancel of current operation Elapsed: 00:02:49.57 19:02:04 SYS@dyckdb2(1269)> / Revoke succeeded. Elapsed: 00:00:09.58 19:02:46 SYS@dyckdb1(2194)> drop MATERIALIZED VIEW MACRO_DATA.T_ENTERPRISE_INFO; Materialized view dropped. Elapsed: 00:00:03.82

    删除视图,问题解决。

    这里提供另外一个查找产生指定时间段内产生redo最多的SQL做参考:

    SELECT TO_CHAR(BEGIN_INTERVAL_TIME,'YYYY_MM_DD HH24') WHEN,
                 DBMS_LOB.SUBSTR(SQL_TEXT,4000,1) SQL,
                 DHSS.INSTANCE_NUMBER INST_ID,
                 DHSS.SQL_ID,
                 EXECUTIONS_DELTA EXEC_DELTA,
                 ROWS_PROCESSED_DELTA ROWS_PROC_DELTA
    FROM DBA_HIST_SQLSTAT DHSS,
             DBA_HIST_SNAPSHOT DHS,
             DBA_HIST_SQLTEXT DHST
    WHERE UPPER(DHST.SQL_TEXT) LIKE '%T_CLIENT_INFO%'
      AND LTRIM(UPPER(DHST.SQL_TEXT)) NOT LIKE 'SELECT%'
      AND DHSS.SNAP_ID=DHS.SNAP_ID
      AND DHSS.INSTANCE_NUMBER=DHS.INSTANCE_NUMBER
      AND DHSS.SQL_ID=DHST.SQL_ID
      AND BEGIN_INTERVAL_TIME BETWEEN TO_DATE('2020-08-07 16:00','YYYY-MM-DD HH24:MI')
      AND TO_DATE('2020-08-07 17:00','YYYY-MM-DD HH24:MI')
      order by exec_delta;
    View Code

    通过上边的语句查找是可以发现update基表和insert物化视图的SQL语句的。

    接下来可以通过模拟on commit对update影响有多大,顺便看看是否由于物化视图关系是否导致redo量激增。

    oracle慎用基于on commit刷新物化视图的方式(二)。

  • 相关阅读:
    C# 创建Excel或需不安装Office
    Spring学习之旅(八)Spring 基于AspectJ注解配置的AOP编程工作原理初探
    Spring学习之旅(七)基于XML配置与基于AspectJ注解配置的AOP编程比较
    开发Spring过程中几个常见异常(三):java.lang.ClassCastException: com.sun.proxy.$Proxy4 cannot be cast to com.edu.aop.ArithmeticCalculatorImpl at com.edu.aop.Main.main(Main.java:11)
    Spring学习之旅(六)Spring AOP工作原理初探
    开发Spring过程中几个常见异常(二):Exception encountered during context initialization
    Spring学习之旅(五)极速创建Spring AOP java工程项目
    Spring学习之旅(四)Spring工作原理再探
    Spring学习之旅(三)Spring工作原理初探
    Spring学习之旅(二)极速创建Spring框架java Web工程项目
  • 原文地址:https://www.cnblogs.com/PiscesCanon/p/13496692.html
Copyright © 2011-2022 走看看