zoukankan      html  css  js  c++  java
  • log file switch (checkpoint incomplete)

    本文转自 https://blogs.oracle.com/database4cn/log-file-switch-checkpoint-incomplete-%e5%ae%b9%e6%98%93%e8%a2%ab%e8%af%af%e8%af%8a%e7%9a%84event
     
    首先来看如下的一份AWR,12分钟的采样,DB Time 105分钟。
    DB Name DB Id Instance Inst num Startup Time Release RAC
    R11204 2114874159 R11204 1 23-Oct-17 10:10 11.2.0.4.0 NO
     
    Host Name Platform CPUs Cores Sockets Memory (GB)
    nascds18 Linux x86 64-bit 2 2 1 11.64
     
    Snap Id Snap Time Sessions Cursors/Session
    Begin Snap: 3 23-Oct-17 10:55:46 37 2.5
    End Snap: 4 23-Oct-17 11:08:27 53 2.3
    Elapsed:   12.67 (mins)    
    DB Time:   105.90 (mins)
     
    Top event 发现 buffer busy waits 和 log file switch (checkpoint incomplete) 几乎占用了所有的DB Time
    Top 10 Foreground Events by Total Wait Time
     
    Event Waits Total Wait Time (sec) Wait Avg(ms) % DB time Wait Class
    buffer busy waits 11 3310.6 300960 52.1 Concurrency
    log file switch (checkpoint incomplete) 10 3034.8 303479 47.8 Configuration
    DB CPU   5.5   .1  
    log file sync 28 2.3 82 .0 Commit
    log buffer space 24 .8 32 .0 Configuration
    
    通过ASH不难发现 buffer busy waits 被 log file switch (checkpoint incomplete) 阻塞,而 log file switch (checkpoint incomplete) 又被 LGWR 的 control file sequential read 阻塞。
    2017-10-23 11:05:31.563 1 37 perl@nascds18 (TNS V1-V3) buffer busy waits WAITING 1 150
    2017-10-23 11:05:31.563 1 150 sqlplus@nascds18 (TNS V1-V3) buffer busy waits WAITING 1 141
    2017-10-23 11:05:31.563 1 141 OMS log file switch (checkpoint incomplete) WAITING 1 130
    2017-10-23 11:05:31.563 1 130 oracle@nascds18 (LGWR) control file sequential read WAITING NO HOLDER
    
    接下来再看一眼AWR的iostat,很快发现LGWR有大量的读取,并且都是在读取控制控制文件,12分钟内的读取高达1.5G。
    IOStat by Function summary
     
    Function Name Reads: Data
    LGWR 1.5G
    Others 210M
    DBWR 0M
    Buffer Cache Reads 10M
    Direct Writes 0M
    TOTAL: 1.7G
     
    IOStat by Filetype summary
     
    Filetype Name Reads: Data
    Control File 1.5G
    Log File 185M
    Archive Log 0M
    Data File 10M
    Temp File 0M
    TOTAL: 1.7G
    
    结合ASH中的 LGWR 是最终holder,并且 LGWR 等待 control file sequential read 的事实,我们可能很快将 LGWR 列为重点怀疑对象,如果真是这样,那就跑偏了。
     
    此时不妨停下来思考一下,什么是 log file switch (checkpoint incomplete)?
     
    log file switch (checkpoint incomplete) 指的是当redo需要向下一组 redo group 切换的时候,发现下组日志是 active 的,也就是说下组日志中对应的一些 buffer cache 中的脏块尚未写入到数据文件中,因此必须等待这些脏块被完毕后,才可以复用下一组 redo group。
     
    接下来再思考一下,哪个进程负责将脏块写入到数据文件?答案是DBWn,因此我们更需要关注的是DBWn。分析一下OSWatcher不难发现DBW0的状态是D,man一下ps命令: D指的是Uninterruptible sleep (usually IO),也就是说DBW0 hang在I/O上了,这才是问题的根本原因。
     
    那么为什么LGWR会执行如此多的 control file sequential read ?答案是DBWn出现问题导致 LGWR不断查询control 文件获取redo 状态(看它有没有切换成功)导致大量的 control file sequential read
     
    下面通过实验来演示这个现象:
     
    Test on 12.1.0.2
     
    Session 1:  创建一个表,并且update一下让其产生redo
    Connected to:
    Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
    With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
     
    SQL> select sid from v$mystat where rownum=1;
     
          SID
    ----------
           11   <<<< session 1的 sid 为11
     
    SQL> create table t as select * from dba_objects;
     
    Table created.
     
    SQL> update t set object_name=object_name;
     
    93841 rows updated.
    
    Session 2:  用于查看redo的切换状态
    SQL> set line 200 pages 1000
    SQL> select group#,thread#,status from v$log;
     
       GROUP#    THREAD# STATUS
    ---------- ---------- ----------------
            1          1 INACTIVE
            2          1 INACTIVE
            3          1 CURRENT
     
    SQL> alter system switch logfile;
     
    System altered.
     
    SQL> alter system switch logfile;
     
    System altered.
     
    SQL> select group#,thread#,status from v$log;
     
       GROUP#    THREAD# STATUS
    ---------- ---------- ----------------
            1          1 ACTIVE     <<<<<< 让下一组是active的
            2          1 ACTIVE
            3          1 CURRENT
    
    Session 3: 确定DBWn以及LGWR的 spid,并且通过 oradebug 将 DBW0 suspend住让其不工作
    SQL> set line 200 pages 1000
    SQL> select spid,program from v$process where program like '%DBW%' or program like '%LG%';
     
    SPID                     PROGRAM
    ------------------------ ----------------------------------------------------------------
    5768                     ORACLE.EXE (DBW0)
    7248                     ORACLE.EXE (LGWR)
    6384                     ORACLE.EXE (LG00)
    6308                     ORACLE.EXE (LG01)
     
    SQL> oradebug setospid 5768
    Oracle pid: 11, Windows thread id: 5768, image: ORACLE.EXE (DBW0)
    SQL> oradebug suspend
    Statement processed.
    
    接下来在Session 1中再执行几次update让其自动切换日志:
    Session 1:
    SQL> update t set object_name=object_name;
     
    93841 rows updated.
     
    SQL> /
     
    93841 rows updated.
     
    SQL> /
     
    93841 rows updated.
     
    SQL> /   ---- 在这里hang住了
    
    Session 4: 观察等待事件
    SQL> set line 200 pages 1000
    SQL> col program for a15
    SQL> col event for a40
    SQL> select sid,serial#,program,event,state from v$session where program like '%sqlplus%';
     
          SID    SERIAL# PROGRAM         EVENT                                    STATE
    ---------- ---------- --------------- ---------------------------------------- -------------------
           10      33682 sqlplus.exe     SQL*Net message from client              WAITING
           11      48189 sqlplus.exe     log file switch (checkpoint incomplete)  WAITING
          129      25471 sqlplus.exe     SQL*Net message to client                WAITED SHORT TIME
          130      64963 sqlplus.exe     SQL*Net message from client              WAITING
    
    接下来我们对LGWR做10046 trace,观察其等待: 可见反复的 control file sequential read
    SQL> oradebug setospid 7248
    Oracle pid: 12, Windows thread id: 7248, image: ORACLE.EXE (LGWR)
    SQL> oradebug event 10046 trace name context forever, level 8
    Statement processed.
    SQL> oradebug tracefile_name
    D:ORACLEdiag
    dbms
    12102
    12102	race
    12102_lgwr_7248.trc
     
    tail -f D:ORACLEdiag
    dbms
    12102
    12102	race
    12102_lgwr_7248.trc
     
    WAIT #0: nam='LGWR all worker groups' ela= 72 p1=0 p2=0 p3=0 obj#=-1 tim=25178622234
    WAIT #0: nam='control file sequential read' ela= 407 file#=0 block#=1 blocks=1 obj#=-1 tim=25178622880
    WAIT #0: nam='control file sequential read' ela= 262 file#=1 block#=1 blocks=1 obj#=-1 tim=25178623344
    WAIT #0: nam='control file sequential read' ela= 717 file#=0 block#=15 blocks=1 obj#=-1 tim=25178624315
    WAIT #0: nam='control file sequential read' ela= 1774 file#=0 block#=17 blocks=1 obj#=-1 tim=25178626427
    WAIT #0: nam='control file sequential read' ela= 311 file#=0 block#=19 blocks=1 obj#=-1 tim=25178627527
    WAIT #0: nam='control file sequential read' ela= 269 file#=0 block#=284 blocks=1 obj#=-1 tim=25178627983
    WAIT #0: nam='control file sequential read' ela= 238 file#=0 block#=22 blocks=1 obj#=-1 tim=25178628363
    WAIT #0: nam='LGWR all worker groups' ela= 51 p1=0 p2=0 p3=0 obj#=-1 tim=25178628590
    WAIT #0: nam='control file sequential read' ela= 503 file#=0 block#=1 blocks=1 obj#=-1 tim=25178629320
    WAIT #0: nam='control file sequential read' ela= 322 file#=1 block#=1 blocks=1 obj#=-1 tim=25178630389
    WAIT #0: nam='control file sequential read' ela= 276 file#=0 block#=15 blocks=1 obj#=-1 tim=25178630864
    WAIT #0: nam='control file sequential read' ela= 253 file#=0 block#=17 blocks=1 obj#=-1 tim=25178631286
    WAIT #0: nam='control file sequential read' ela= 250 file#=0 block#=19 blocks=1 obj#=-1 tim=25178631696
    WAIT #0: nam='control file sequential read' ela= 658 file#=0 block#=284 blocks=1 obj#=-1 tim=25178632935
    WAIT #0: nam='control file sequential read' ela= 303 file#=0 block#=22 blocks=1 obj#=-1 tim=25178633812
    ......
    
    最后我们将DBWn恢复,结束这个无限的 control file sequential read 等待。
    回到Session 3,resume DBW0:
    SQL> oradebug resume
    Statement processed. 
    只要理解了原理,这个问题本来不难诊断,但是ASH和AWR的信息会让DBA错误的认为是 LGWR 的等待control file sequential read导致的问题。
  • 相关阅读:
    ios中静态库的创建和使用、制作通用静态库(Cocoa Touch Static Library)
    ios-字符串替换-正则表达式-例子
    Android 中自定义仪表盘
    Android中控件AutoCompleteTextView的使用方法和一些属性
    用代码如何检测一个android程序是否在运行
    android 报错: java.lang.NoClassDefFoundError: Failed resolution of: Landroid/support/v4/animation/AnimatorCompatHelper;
    android中Zing二维码扫描,二维码生成
    gps数据上传防止android系统休眠
    Android中判断service是否在运行
    BottomSheetDialogFragment 如何设置高度和禁止滑动
  • 原文地址:https://www.cnblogs.com/zylong-sys/p/12015975.html
Copyright © 2011-2022 走看看