  • [Repost] [Sharing a SQL] Query session blocking relationships and their hierarchy.

     ###without session_id

    with ash as (select /*+ materialize */ *
                   from DBA_HIST_ACTIVE_SESS_HISTORY
                  where sample_time between timestamp '2015-03-18 15:00:00'
                                        and timestamp '2015-03-18 15:01:00'),
       chains as
       (select session_id, level lvl,
              sys_connect_by_path(sql_id, ' -> ') path,
              connect_by_isleaf isleaf
         from ash
        start with event = 'row cache lock'
        connect by nocycle (prior blocking_session = session_id and prior blocking_session_serial# = session_serial# and prior sample_id = sample_id))
     select lpad(round(ratio_to_report(count(*)) over () * 100)||'%',5,' ') "%This",
            count(*) samples,
            path
       from chains
      where isleaf = 1
      group by path
      order by samples;
      

     ##with session_id


    with ash as
    (select /*+ materialize*/
    *
    from v$active_session_history t
    where sample_time >=
    to_date('2019-07-10 20:30:00', 'yyyy-mm-dd hh24:mi:ss')
    and sample_time <
    to_date('2019-07-10 21:11:00', 'yyyy-mm-dd hh24:mi:ss')),

    chains as
    (
    select session_id,
    level lvl,
    sys_connect_by_path(session_id || ' ' || sql_id || ' ' || event,
    ' -> ') path,
    connect_by_isleaf isleaf
    from ash
    start with event in ('enq: TX - row lock contention')
    connect by nocycle(prior blocking_session = session_id
    and prior blocking_session_serial# = session_serial#
    and prior sample_id = sample_id))

    select lpad(round(ratio_to_report(count(*)) over() * 100) || '%', 5, ' ') "%This",
    count(*) samples,
    path

    from chains

    where isleaf = 1

    group by path

    order by samples desc;
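To make the logic of the two queries above easier to follow, here is a minimal Python sketch of the same idea on in-memory rows. It is not part of the original article. The row layout (dicts with `sample_id`, `session_id`, `serial`, `sql_id`, `event`, `blocking_session`, `blocking_serial`) and the function names are assumptions made for illustration. Unlike `sys_connect_by_path`, the sketch does not prefix the path with the separator.

```python
from collections import Counter

ROOT_EVENT = "enq: TX - row lock contention"


def blocking_paths(rows, root_event=ROOT_EVENT):
    """Count blocking-chain paths, one path per waiter on root_event."""
    # Index rows the way the CONNECT BY condition joins them:
    # same sample_id, then the blocker's session_id and serial#.
    by_key = {(r["sample_id"], r["session_id"], r["serial"]): r for r in rows}
    paths = Counter()
    for row in rows:
        if row["event"] != root_event:  # START WITH event = ...
            continue
        steps, seen, cur = [], set(), row
        while cur is not None:
            key = (cur["sample_id"], cur["session_id"], cur["serial"])
            if key in seen:  # NOCYCLE: stop instead of looping forever
                break
            seen.add(key)
            steps.append(f'{cur["session_id"]} {cur["sql_id"]} {cur["event"]}')
            # Follow the recorded blocker within the same sample.
            cur = by_key.get(
                (cur["sample_id"], cur["blocking_session"], cur["blocking_serial"])
            )
        paths[" -> ".join(steps)] += 1
    return paths


def with_share(paths):
    """Return (percent, samples, path) tuples, most frequent path first."""
    total = sum(paths.values())
    return [(round(100 * n / total), n, p) for p, n in paths.most_common()]
```

A chain ends either at a session that has no recorded blocker or at a blocker that was not sampled (usually because it was idle); in both cases the last sampled session is the leaf, as in the SQL.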

    #######sample 1 :

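    We ran into a problem: during one day's batch window, a database hung briefly for about 10 minutes. The wait events were log file switch (checkpoint incomplete), enq: US - contention, and enq: TA - contention.
    After 10 minutes it recovered on its own.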


    step 1:##view the session count at each sample point around the problem time
    select /*+ parallel 8 */
    dbid, instance_number, sample_id, sample_time, count(*) session_count
    from m_ash_2020 t
    group by dbid, instance_number, sample_id, sample_time
    order by dbid, instance_number, sample_time;


    The number of waiting sessions appears to start increasing at 12:13 AM
    52 883975308 1 258303587 21-MAY-20 12.13.40.357 AM 103
    53 883975308 1 258303597 21-MAY-20 12.13.50.394 AM 121


    174 883975308 2 257867989 21-MAY-20 12.14.08.789 AM 110
    175 883975308 2 257867999 21-MAY-20 12.14.18.841 AM 124

    step 2: view the wait events. In the data for 00:13:00 ~ 00:15:00 (the range used in the query below), the wait events are log file switch (checkpoint incomplete), enq: US - contention, and enq: TA - contention

    24 883975308 258303647 21-MAY-20 12.14.40.705 AM 1 enq: US - contention WAITING 51


    select t.dbid,
    t.sample_id,
    t.sample_time,
    t.instance_number,
    t.event,
    t.session_state,
    t.c session_count
    from (select t.*,
    rank() over(partition by dbid, instance_number, sample_time order by c desc) r
    from (select /*+ parallel 8 */
    t.*,
    count(*) over(partition by dbid, instance_number, sample_time, event) c,
    row_number() over(partition by dbid, instance_number, sample_time, event order by 1) r1
    from m_ash_2020 t
    where sample_time >
    to_timestamp('2020-05-21 00:13:00',
    'yyyy-mm-dd hh24:mi:ss')
    and sample_time <
    to_timestamp('2020-05-21 00:15:00',
    'yyyy-mm-dd hh24:mi:ss')
    ) t
    where r1 = 1) t
    where r < 3
    order by dbid, instance_number, sample_time, r;


    The output looks like this:
    156 883975308 258303647 21-MAY-20 12.14.40.705 AM 1 enq: US - contention WAITING 51
    157 883975308 258303657 21-MAY-20 12.14.50.761 AM 1 log file switch (checkpoint incomplete) WAITING 93

    ####

    It looks like a long-running transaction triggered this problem, but we do not know which long transaction triggered this internal wait.
    How to correct performance issues with enq: US - contention related to undo segments (Doc ID 1332738.1)

    The wait event "enq: US Contention" is associated with contention on the latch in the row cache (dc_rollback_seg). Enqueue US - Contention can become a bottle-neck for performance if workload dictates that a lot of offlined undo segments must be onlined over a short period of time. The latch on the row cache can be unable to keep up with the workload.

    This can happen for a number of reasons and some scenarios are legitimate workload demands.

    Solution: ensure that peaks in onlined undo segments do not happen (see workaround #2). That is not always feasible.

    Workarounds:

    Bounce the instance.
    Setting _ROLLBACK_SEGMENT_COUNT to a high number to keep undo segments online:
    ALTER SYSTEM SET "_rollback_segment_count"=<n>;

    Note: In databases with high query activity, particularly parallel query and a high setting for _ROLLBACK_SEGMENT_COUNT, you can expect to see wait contention on the row cache for DC_ROLLBACK_SEGS. It is highly recommended in these environments where setting _ROLLBACK_SEGMENT_COUNT to a high value (10s of thousands and higher) apply the patch for Bug:14226599. This will increase the hash buckets on the DC_ROLLBACK_SEGS row cache to help alleviate latch contention.
    Set _UNDO_AUTOTUNE to FALSE:
    ALTER SYSTEM SET "_undo_autotune" = false;

    Note: Simply using _SMU_DEBUG_MODE=33554432 may not be enough to stop the problem, but valid fix for Bug:5387030.
    A fix to Bug:7291739 is to set a new hidden parameter, _HIGHTHRESHOLD_UNDORETENTION to set a high threshold for undo retention completely distinct from maxquerylen:
    ALTER SYSTEM SET "_highthreshold_undoretention"=<n>;


    #####step3:


    The result was that INSERT statements kept waiting.

    INSERT INTO CBSD_TRAN_LOG (SEQ_NO, TRAN_DATE, SOURCE_TYPE, STATUS, CHANNEL_DATE, MSG_CODE, MSG_TYPE, TRAN_TYPE, BRANCH, USER_ID, PROGRAM_ID, ORG_SYS_ID, BUSS_SEQ_NO, CONSUMER_ID, IN_DATE_TIME, OUT_DATE_TIME, RET_CODE, RET_MSG, LOG_FLAG, IN_OUT_FLAG, HOST_NAME, HOST_IP, FINANCIAL_TYPE, PTA_CODE ) VALUES (:B19 , :B18 , :B17 , 'P', :B16 , :B15 , :B14 , :B13 , :B12 , :B11 , :B10 , :B9 , :B8 , :B7 , SYSTIMESTAMP, NULL, NULL, NULL, :B6 , :B5 , :B4 , :B3 , :B2 , :B1 ) RETURNING ROWID INTO :O0


    ##continue looking for the long transaction


    Use AWR to find the problem period:
    Begin Snap: 21-May-20 00:00:12 (135418),
    End Snap: 21-May-20 00:30:08 (135420)

    Check with the following SQL:
    select ss.snap_id, ss.instance_number node, begin_interval_time, sql_id, plan_hash_value,
    nvl(executions_delta,0) execs,
    (elapsed_time_delta/decode(nvl(executions_delta,0),0,1,executions_delta))/1000000 avg_etime,
    (buffer_gets_delta/decode(nvl(buffer_gets_delta,0),0,1,executions_delta)) avg_lio
    from DBA_HIST_SQLSTAT S, DBA_HIST_SNAPSHOT SS
    where ss.snap_id = S.snap_id
    and ss.snap_id between 135418 and 135420
    and ss.instance_number = S.instance_number
    and executions_delta > 0
    order by 1, 2, 3
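The per-execution averages in the query above rely on a `decode(nvl(...), 0, 1, ...)` guard against dividing by zero. The following Python sketch, which is not from the original article, spells out that arithmetic; the function names are hypothetical, and the example elapsed-time value used below is back-computed from the output row for a6xpm6yf5v853 rather than taken from the database.

```python
def per_execution(delta_total, executions_delta):
    """Divide a *_DELTA total by executions, treating NULL/0 executions as 1."""
    execs = executions_delta or 0                    # nvl(executions_delta, 0)
    return delta_total / (execs if execs else 1)     # decode(..., 0, 1, executions_delta)


def avg_etime_seconds(elapsed_time_delta, executions_delta):
    """ELAPSED_TIME_DELTA is in microseconds; return the average in seconds."""
    return per_execution(elapsed_time_delta, executions_delta) / 1_000_000
```

For example, `avg_etime_seconds(634_598_866, 1)` gives about 634.6 seconds. Note that the `avg_lio` expression in the query tests `buffer_gets_delta` instead of `executions_delta` inside the `decode`; because the query also filters on `executions_delta > 0`, both forms should return the same value.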


    260 135419 1 21-MAY-20 12.00.12.307 AM 8vyjutx6hg3wh 508202633 70 185.805728857143 5.52857142857143

    244 135419 1 21-MAY-20 12.00.12.307 AM a6xpm6yf5v853 0 1 634.598866 5756056
    307 135419 1 21-MAY-20 12.00.12.307 AM 1c86ws6hvf851 0 1 637.564491 4807707

    229 135419 1 21-MAY-20 12.00.12.307 AM buc2qbhrq5cww 0 13 644.167239307692 9371341.84615385
    271 135419 1 21-MAY-20 12.00.12.307 AM 7cpchd19ucx24 0 1 696.898091 8470013
    277 135419 1 21-MAY-20 12.00.12.307 AM 5vm94ytq5m1s6 0 1 703.738812 8250391

    -> The following SQL all have an average elapsed time above 600 s; their plan hash values are all 0.
    buc2qbhrq5cww:
    begin RB_INT_ACCR.PROCESS_ACCOUNTS('ACCR');end;

    7cpchd19ucx24:
    DECLARE job BINARY_INTEGER := :job; next_date DATE := :mydate; broken BOOLEAN := FALSE; BEGIN BATCH_SPLIT.SPLITTED_PROC(2310964); :mydate := next_date; IF broken THEN :b := 1; ELSE :b := 0; END IF; END;


    5vm94ytq5m1s6:
    DECLARE job BINARY_INTEGER := :job; next_date DATE := :mydate; broken BOOLEAN := FALSE; BEGIN BATCH_SPLIT.SPLITTED_PROC(2310958); :mydate := next_date; IF broken THEN :b := 1; ELSE :b := 0; END IF; END;

    a6xpm6yf5v853:
    DECLARE job BINARY_INTEGER := :job; next_date DATE := :mydate; broken BOOLEAN := FALSE; BEGIN BATCH_SPLIT.SPLITTED_PROC(2310980); :mydate := next_date; IF broken THEN :b := 1; ELSE :b := 0; END IF; END;

    1c86ws6hvf851:
    DECLARE job BINARY_INTEGER := :job; next_date DATE := :mydate; broken BOOLEAN := FALSE; BEGIN cd_endofday_pkg.cd_eodrun; :mydate := next_date; IF broken THEN :b := 1; ELSE :b := 0; END IF; END;


    -> The following SQL is highly suspicious; its PLAN HASH VALUE is not 0
    260 135419 1 21-MAY-20 12.00.12.307 AM 8vyjutx6hg3wh 508202633 70 185.805728857143 5.52857142857143

    sql_id:8vyjutx6hg3wh
    update /*+ rule */ undo$ set name=:2, file#=:3, block#=:4, status$=:5, user#=:6, undosqn=:7, xactsqn=:8, scnbas=:9, scnwrp=:10, inst#=:11, ts#=:12, spare1=:13 where us#=:1


    -> The following SQL compares this against the efficiency during a normal period in AWR.

    select ss.snap_id, ss.instance_number node, begin_interval_time, sql_id, plan_hash_value,
    nvl(executions_delta,0) execs,
    (elapsed_time_delta/decode(nvl(executions_delta,0),0,1,executions_delta))/1000000 avg_etime,
    (buffer_gets_delta/decode(nvl(buffer_gets_delta,0),0,1,executions_delta)) avg_lio
    from DBA_HIST_SQLSTAT S, DBA_HIST_SNAPSHOT SS
    where ss.snap_id = S.snap_id
    and ss.snap_id between 134746 and 134748
    and ss.instance_number = S.instance_number
    and executions_delta > 0
    order by 1, 2, 3

    ->
    -Overall:

    The problem triggered a bug that was unknown to us.

    Re-visited the logs once again and found that there is known document for this issue. Please follow the solution suggested in the below document -

    Alert Log Shows 'ORA-12751: cpu time or run time policy violation' and Associated MMON Trace Shows 'KEBM: MMON action policy violation. 'Block Cleanout Optim, Undo Segment Scan' viol=1; err=12751' ( Doc ID 1671412.1 )

    //
    ----- START DDE Action: 'ORA_12751_DUMP' (Sync) -----
    Runtime exceeded 300 seconds
    Time limit violation detected at:
    ksedsts()+240<-kspol_12751_dump()+140<-dbgdaExecuteAction()+808<-dbgerRunAction()+108<-dbgerRunActions()+2976<-dbgexPhaseII()+1468<-dbgexProcessError()+1556<-dbgeExecuteForError()+72<-dbgePostErrorKGE()+2044<-dbkePostKGE_kgsf()+68<-kgeade()+364<-kgeselv()+96
    <-ksesecl0()+80<-kqrigt()+3156<-kqrLockAndPinPo()+572<-kqrpre1()+944<-kqrpre()+28<-ktucloUsegScan()+3580<-ksb_run_managed_action()+2872<-ksbcti()+4044<-ksbabs()+796<-kebm_mmon_main()+428<-ksbrdp()+2216<-opirip()+1620<-opidrv()+608<-sou2o()+136<-opimai_real()+188
    <-ssthrdmain()+276<-main()+204<-__start()+112Current Wait Stack:

    .........................................

    ----- END DDE Action: 'ORA_12751_DUMP' (SUCCESS, 0 csec) -----
    ----- END DDE Actions Dump (total 0 csec) -----
    KEBM: MMON action policy violation. 'Block Cleanout Optim, Undo Segment Scan' viol=1; err=12751
    //


    The workaround is as follows:
    SQL> alter system set "_smu_debug_mode"=134217728 scope=both;

    ###

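    Thanks to the following article:

    How to analyze historical database performance problems with dba_hist_active_sess_history

    Background
    In many cases, when a database performance problem occurs, we have no chance to collect enough diagnostic information, such as a system state dump or hang analyze; the DBA may not even be present when the problem happens. This makes diagnosis very difficult. In that situation, can we collect some information afterwards to analyze the cause? On Oracle 10G or later, the answer is yes. This article introduces a method for analyzing such problems using the data in dba_hist_active_sess_history.

    Applies to
    Oracle 10G or later. This article applies to any platform.

    Details
    Oracle 10G introduced the AWR and ASH sampling mechanisms. The view gv$active_session_history samples the active sessions on all nodes of the database once per second, and dba_hist_active_sess_history samples the data in gv$active_session_history once every 10 seconds and persists it. Based on this, we can analyze the session samples in dba_hist_active_sess_history to pin down the exact time range of a problem and to see the top event and top holder at each sample point. The example below shows this in detail.

    1. Dump the ASH data for the problem period:
    To avoid affecting the production system, we can export the ASH data for the approximate problem period and analyze it on a test machine.
    Create a new table m_ash from dba_hist_active_sess_history, then move it to the test machine with exp/imp. Run exp on the database where the problem occurred: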
    SQL> conn user/passwd
    SQL> create table m_ash as select * from dba_hist_active_sess_history where SAMPLE_TIME between TO_TIMESTAMP ('<time_begin>', 'YYYY-MM-DD HH24:MI:SS') and TO_TIMESTAMP ('<time_end>', 'YYYY-MM-DD HH24:MI:SS'); 

    $ exp user/passwd file=m_ash.dmp tables=(m_ash) log=m_ash.exp.log

    Then import it on the test machine:
    $ imp user/passwd file=m_ash.dmp log=m_ash.imp.log

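    2. Verify the time range of the exported ASH data:
    To speed things up we use parallel query. We also recommend running the queries in Oracle SQL Developer so that the output does not wrap and is easier to read.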

    set line 200 pages 1000
    col sample_time for a25
    col event for a40
    alter session set nls_timestamp_format='yyyy-mm-dd hh24:mi:ss.ff';

    select /*+ parallel 8 */
     t.dbid, t.instance_number, min(sample_time), max(sample_time), count(*) session_count
      from m_ash t
     group by t.dbid, t.instance_number
     order by dbid, instance_number;

    INSTANCE_NUMBER    MIN(SAMPLE_TIME)    MAX(SAMPLE_TIME)    SESSION_COUNT
    1    2015-03-26 21:00:04.278    2015-03-26 22:59:48.387    2171
    2    2015-03-26 21:02:12.047    2015-03-26 22:59:42.584    36

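    From the output above, the database has 2 nodes and the samples span 2 hours. Node 1 has far more samples than node 2, so the problem probably occurred on node 1.

    3. Determine the exact time range of the problem:
    Use the following script: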

    select /*+ parallel 8 */
     dbid, instance_number, sample_id, sample_time, count(*) session_count
      from m_ash t
     group by dbid, instance_number, sample_id, sample_time
     order by dbid, instance_number, sample_time;

    INSTANCE_NUMBER    SAMPLE_ID    SAMPLE_TIME    SESSION_COUNT
    1    36402900    2015-03-26 22:02:50.985    4
    1    36402910    2015-03-26 22:03:01.095    1
    1    36402920    2015-03-26 22:03:11.195    1
    1    36402930    2015-03-26 22:03:21.966    21
    1    36402940    2015-03-26 22:03:32.116    102
    1    36402950    2015-03-26 22:03:42.226    181
    1    36402960    2015-03-26 22:03:52.326    200
    1    36402970    2015-03-26 22:04:02.446    227
    1    36402980    2015-03-26 22:04:12.566    242
    1    36402990    2015-03-26 22:04:22.666    259
    1    36403000    2015-03-26 22:04:32.846    289
    1    36403010    2015-03-26 22:04:42.966    147
    1    36403020    2015-03-26 22:04:53.076    2
    1    36403030    2015-03-26 22:05:03.186    4
    1    36403040    2015-03-26 22:05:13.296    1
    1    36403050    2015-03-26 22:05:23.398    1

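    Watch the number of active sessions at each sample point in the output above; a sudden increase usually means the problem has started. From this output, the problem occurred between 2015-03-26 22:03:21 and 22:04:42 and lasted about 1.5 minutes.
    Note: check the output above for gaps, for example periods with no samples.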
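The two checks described above (finding the window where the session count jumps, and looking for missing samples) can be sketched in Python as follows. This is only an illustration, not part of the original article: the threshold of 20 sessions, the 5-second slack around the 10-second sampling interval, and the function names are assumptions. The sample data is copied from the output above.

```python
from datetime import datetime, timedelta

# (sample_time, session_count) rows from the step 3 output, all on 2015-03-26.
SAMPLES = [
    ("22:02:50.985", 4), ("22:03:01.095", 1), ("22:03:11.195", 1),
    ("22:03:21.966", 21), ("22:03:32.116", 102), ("22:03:42.226", 181),
    ("22:03:52.326", 200), ("22:04:02.446", 227), ("22:04:12.566", 242),
    ("22:04:22.666", 259), ("22:04:32.846", 289), ("22:04:42.966", 147),
    ("22:04:53.076", 2), ("22:05:03.186", 4), ("22:05:13.296", 1),
    ("22:05:23.398", 1),
]


def parse(time_text):
    """Turn an HH:MI:SS.FF string from the output into a datetime."""
    return datetime.strptime("2015-03-26 " + time_text, "%Y-%m-%d %H:%M:%S.%f")


def load(samples):
    """Return (datetime, session_count) pairs sorted by time."""
    return sorted((parse(t), n) for t, n in samples)


def problem_window(counts, threshold):
    """First and last sample time whose session count reaches the threshold."""
    busy = [t for t, n in counts if n >= threshold]
    return (busy[0], busy[-1]) if busy else None


def find_gaps(sample_times, interval=timedelta(seconds=10),
              slack=timedelta(seconds=5)):
    """Pairs of consecutive samples that are further apart than expected."""
    times = sorted(sample_times)
    return [(a, b) for a, b in zip(times, times[1:]) if b - a > interval + slack]
```

With a threshold of 20, `problem_window(load(SAMPLES), 20)` returns the 22:03:21.966 and 22:04:42.966 samples, matching the window read off the output by eye, and `find_gaps` reports no missing samples for this data.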

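    4. Determine the top n events at each sample point:
    Here we ask for the top 2 events and comment out the sample-time filter so that all sample points are shown. If there is a lot of data, you can uncomment the sample_time filter to look at a specific period. Note that the last column, session_count, is the number of sessions waiting on that event at that sample point.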

    select t.dbid,
           t.sample_id,
           t.sample_time,
           t.instance_number,
           t.event,
           t.session_state,
           t.c session_count
      from (select t.*,
                   rank() over(partition by dbid, instance_number, sample_time order by c desc) r
              from (select /*+ parallel 8 */
                     t.*,
                     count(*) over(partition by dbid, instance_number, sample_time, event) c,
                     row_number() over(partition by dbid, instance_number, sample_time, event order by 1) r1
                      from m_ash t
                    /*where sample_time >
                        to_timestamp('2013-11-17 13:59:00',
                                     'yyyy-mm-dd hh24:mi:ss')
                    and sample_time <
                        to_timestamp('2013-11-17 14:10:00',
                                     'yyyy-mm-dd hh24:mi:ss')*/
                    ) t
             where r1 = 1) t
     where r < 3
     order by dbid, instance_number, sample_time, r;

    SAMPLE_ID    SAMPLE_TIME    INSTANCE_NUMBER    EVENT    SESSION_STATE    SESSION_COUNT
    36402900    22:02:50.985    1        ON CPU    3
    36402900    22:02:50.985    1    db file sequential read    WAITING    1
    36402910    22:03:01.095    1        ON CPU    1
    36402920    22:03:11.195    1    db file parallel read    WAITING    1
    36402930    22:03:21.966    1    cursor: pin S wait on X    WAITING    11
    36402930    22:03:21.966    1    latch: shared pool    WAITING    4
    36402940    22:03:32.116    1    cursor: pin S wait on X    WAITING    83
    36402940    22:03:32.116    1    SGA: allocation forcing component growth    WAITING    16
    36402950    22:03:42.226    1    cursor: pin S wait on X    WAITING    161
    36402950    22:03:42.226    1    SGA: allocation forcing component growth    WAITING    17
    36402960    22:03:52.326    1    cursor: pin S wait on X    WAITING    177
    36402960    22:03:52.326    1    SGA: allocation forcing component growth    WAITING    20
    36402970    22:04:02.446    1    cursor: pin S wait on X    WAITING    204
    36402970    22:04:02.446    1    SGA: allocation forcing component growth    WAITING    20
    36402980    22:04:12.566    1    cursor: pin S wait on X    WAITING    219
    36402980    22:04:12.566    1    SGA: allocation forcing component growth    WAITING    20
    36402990    22:04:22.666    1    cursor: pin S wait on X    WAITING    236
    36402990    22:04:22.666    1    SGA: allocation forcing component growth    WAITING    20
    36403000    22:04:32.846    1    cursor: pin S wait on X    WAITING    265
    36403000    22:04:32.846    1    SGA: allocation forcing component growth    WAITING    20
    36403010    22:04:42.966    1    enq: US - contention    WAITING    69
    36403010    22:04:42.966    1    latch: row cache objects    WAITING    56
    36403020    22:04:53.076    1    db file scattered read    WAITING    1
    36403020    22:04:53.076    1    db file sequential read    WAITING    1

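    From the output above, the most severe wait during the problem was cursor: pin S wait on X, with 265 sessions waiting on it at the peak, followed by SGA: allocation forcing component growth, with 20 sessions at the peak.

    Note:
    1) Check again whether the output above has gaps, that is, periods with no samples.
    2) Pay attention to rows whose session_state is ON CPU. Compare the number of ON CPU processes with the number of physical CPUs on your OS. If it is close to or exceeds the number of physical CPUs, you also need to check the OS CPU usage at that time with tools such as OSWatcher or NMON. A high CPU run queue may cause the problem, but it may also be a result of it; correlate the timelines of OSWatcher and ASH to verify.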
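As a rough illustration of what the top-n query in step 4 computes, the Python sketch below groups in-memory rows by sample point and keeps the most frequent (event, session_state) pairs. It is not from the original article; the tuple layout and the function name are assumptions. One difference to be aware of: the SQL uses `rank()` with `r < 3`, which can return more than two rows when counts tie, while `most_common(n)` always returns at most n.

```python
from collections import Counter, defaultdict


def top_events(rows, n=2):
    """rows: iterable of (sample_id, event, session_state), one per ASH row.

    Returns {sample_id: [((event, session_state), session_count), ...]}
    with at most n entries per sample, highest count first.
    """
    per_sample = defaultdict(Counter)
    for sample_id, event, state in rows:
        # event is None for ON CPU rows, just as EVENT is empty in ASH.
        per_sample[sample_id][(event, state)] += 1
    return {sid: counts.most_common(n) for sid, counts in per_sample.items()}
```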

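    5. Examine the wait chain at each sample point:
    The principle is to take the holder recorded in dba_hist_active_sess_history.blocking_session and follow it with a connect by hierarchical query until the final holder is found. In a RAC environment the ASH sample times of the nodes often do not line up, so you can slightly modify the commented-out sample_time condition in the second part of this SQL so that samples from different nodes taken within 1 second of each other can be compared (preferably also adjust sample_time in partition by accordingly). In the output, rows with isleaf=1 are final holders, and iscycle=1 indicates a deadlock (that is, within the same sample point a waits for b, b waits for c, and c waits for a; if this keeps happening it deserves particular attention). The following query shows the blocking chains.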

    select /*+ parallel 8 */
     level                     lv,
     connect_by_isleaf         isleaf,
     connect_by_iscycle        iscycle,
     t.dbid,
     t.sample_id,
     t.sample_time,
     t.instance_number,
     t.session_id,
     t.sql_id,
     t.session_type,
     t.event,
     t.session_state,
     t.blocking_inst_id,
     t.blocking_session,
     t.blocking_session_status
      from m_ash t
    /*where sample_time >
        to_timestamp('2013-11-17 13:55:00',
                     'yyyy-mm-dd hh24:mi:ss')
    and sample_time <
        to_timestamp('2013-11-17 14:10:00',
                     'yyyy-mm-dd hh24:mi:ss')*/
     start with blocking_session is not null
    connect by nocycle
     prior dbid = dbid
           and prior sample_time = sample_time
              /*and ((prior sample_time) - sample_time between interval '-1'
              second and interval '1' second)*/
           and prior blocking_inst_id = instance_number
           and prior blocking_session = session_id
           and prior blocking_session_serial# = session_serial#;

    LV    ISLEAF    ISCYCLE    SAMPLE_TIME    INSTANCE_NUMBER    SESSION_ID    SQL_ID    EVENT    SESSION_STATE    BLOCKING_INST_ID    BLOCKING_SESSION    BLOCKING_SESSION_STATUS
    1    0    0    22:04:32.846    1    1259    3ajt2htrmb83y    cursor:    WAITING    1    537    VALID
    2    1    0    22:04:32.846    1    537    3ajt2htrmb83y    SGA:    WAITING            UNKNOWN

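    Note that we abbreviated the wait events to keep the output readable; the same applies below. From the output above, at the same sample point (22:04:32.846), session 1259 on node 1 is waiting on cursor: pin S wait on X and is blocked by session 537 on node 1, while session 537 on node 1 is itself waiting on SGA: allocation forcing component growth, and ASH did not capture its holder. So cursor: pin S wait on X is only a symptom here; the cause of the problem is SGA: allocation forcing component growth.

    6. Using the principle from step 5, find the final top holders at each sample point:
    For example, the following SQL lists the top 2 blocker sessions at each sample point and counts the sessions each one ultimately blocks (see blocking_session_count).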

    select t.lv,
           t.iscycle,
           t.dbid,
           t.sample_id,
           t.sample_time,
           t.instance_number,
           t.session_id,
           t.sql_id,
           t.session_type,
           t.event,
           t.seq#,
           t.session_state,
           t.blocking_inst_id,
           t.blocking_session,
           t.blocking_session_status,
           t.c blocking_session_count
      from (select t.*,
                   row_number() over(partition by dbid, instance_number, sample_time order by c desc) r
              from (select t.*,
                           count(*) over(partition by dbid, instance_number, sample_time, session_id) c,
                           row_number() over(partition by dbid, instance_number, sample_time, session_id order by 1) r1
                      from (select /*+ parallel 8 */
                             level              lv,
                             connect_by_isleaf  isleaf,
                             connect_by_iscycle iscycle,
                             t.*
                              from m_ash t
                            /*where sample_time >
                                to_timestamp('2013-11-17 13:55:00',
                                             'yyyy-mm-dd hh24:mi:ss')
                            and sample_time <
                                to_timestamp('2013-11-17 14:10:00',
                                             'yyyy-mm-dd hh24:mi:ss')*/
                             start with blocking_session is not null
                            connect by nocycle
                             prior dbid = dbid
                                   and prior sample_time = sample_time
                                      /*and ((prior sample_time) - sample_time between interval '-1'
                                      second and interval '1' second)*/
                                   and prior blocking_inst_id = instance_number
                                   and prior blocking_session = session_id
                                   and prior
                                        blocking_session_serial# = session_serial#) t
                     where t.isleaf = 1) t
             where r1 = 1) t
     where r < 3
     order by dbid, sample_time, r;

    SAMPLE_TIME    INSTANCE_NUMBER    SESSION_ID    SQL_ID    EVENT    SEQ#    SESSION_STATE    BLOCKING_SESSION_STATUS    BLOCKING_SESSION_COUNT
    22:03:32.116    1    1136    1p4vyw2jan43d    SGA:    1140    WAITING    UNKNOWN    82
    22:03:32.116    1    413    9g51p4bt1n7kz    SGA:    7646    WAITING    UNKNOWN    2
    22:03:42.226    1    1136    1p4vyw2jan43d    SGA:    1645    WAITING    UNKNOWN    154
    22:03:42.226    1    537    3ajt2htrmb83y    SGA:    48412    WAITING    UNKNOWN    4
    22:03:52.326    1    1136    1p4vyw2jan43d    SGA:    2150    WAITING    UNKNOWN    165
    22:03:52.326    1    537    3ajt2htrmb83y    SGA:    48917    WAITING    UNKNOWN    8
    22:04:02.446    1    1136    1p4vyw2jan43d    SGA:    2656    WAITING    UNKNOWN    184
    22:04:02.446    1    537    3ajt2htrmb83y    SGA:    49423    WAITING    UNKNOWN    10
    22:04:12.566    1    1136    1p4vyw2jan43d    SGA:    3162    WAITING    UNKNOWN    187
    22:04:12.566    1    2472        SGA:    1421    WAITING    UNKNOWN    15
    22:04:22.666    1    1136    1p4vyw2jan43d    SGA:    3667    WAITING    UNKNOWN    193
    22:04:22.666    1    2472        SGA:    1926    WAITING    UNKNOWN    25
    22:04:32.846    1    1136    1p4vyw2jan43d    SGA:    4176    WAITING    UNKNOWN    196
    22:04:32.846    1    2472        SGA:    2434    WAITING    UNKNOWN    48

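    Looking at the output above, the first row, for example, means that at 22:03:32.116 session 1136 on node 1 ultimately blocked 82 sessions. Following the timeline down, session 1136 on node 1 is the most severe holder during the problem: at almost every sample point it blocked more than 100 sessions, and it kept waiting on SGA: allocation forcing component growth. If you look at its seq#, you will see that the seq# of the event keeps changing, which shows that the session was not completely hung. Because this happened at around 22:00 at night, it was evidently caused by a shared memory resize triggered by the automatic statistics gathering job, so you can confirm the problem further with the Scheduler/MMAN/MMNL traces and the output of dba_hist_memory_resize_ops.

    Note:
    1) blocking_session_count is the number of sessions that a holder ultimately blocks. For example, with a <- b <- c (a is blocked by b, and b is blocked by c), only c is counted as blocking 1 session, because the intermediate b may appear repeatedly in different blocking chains.
    2) If the final holder was not sampled by ASH (usually because the holder was idle), for example a <- c and b <- c (a is blocked by c, and b is also blocked by c) but c was not sampled, the script above cannot count c as a final holder, so some holders may be missed.
    3) Compare blocking_session_count with the total session_count at each sample point from the query in step 3. If blocking_session_count is far smaller than the total session_count at each sample point, most sessions have no recorded holder, and the result of this query is not representative.
    4) In Oracle 10g, ASH has no blocking_inst_id column; simply remove that column from all the scripts above. For that reason, 10g ASH can generally only be used to diagnose single-node problems.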
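The counting done by the step 6 query can be sketched in Python as below. This is an illustrative approximation, not the article's method: the row layout (dicts with `sample_id`, `inst`, `sid`, `serial`, `blocking_inst`, `blocking_sid`, `blocking_serial`) and the function names are assumptions, and rows are matched on `sample_id` where the SQL matches on `sample_time`. Like the `count(*) over(...)` in the SQL, it counts how many chains end at each session within a sample.

```python
from collections import Counter


def _key(row):
    return (row["sample_id"], row["inst"], row["sid"], row["serial"])


def final_holders(rows):
    """Count, per (sample_id, inst, sid), the chains that end at that session."""
    by_key = {_key(r): r for r in rows}
    ends = Counter()
    for row in rows:
        if row["blocking_sid"] is None:  # START WITH blocking_session is not null
            continue
        cur, seen = row, {_key(row)}
        while True:
            nxt = by_key.get((cur["sample_id"], cur["blocking_inst"],
                              cur["blocking_sid"], cur["blocking_serial"]))
            # Stop when the holder was not sampled, or on a cycle (NOCYCLE).
            if nxt is None or _key(nxt) in seen:
                break
            seen.add(_key(nxt))
            cur = nxt
        # cur is the leaf of this chain; it is the starting row itself
        # when its blocker was not sampled.
        ends[(cur["sample_id"], cur["inst"], cur["sid"])] += 1
    return ends


def top_holders(ends, n=2):
    """Keep the n sessions with the most chain ends for each sample_id."""
    per_sample = {}
    for (sample_id, inst, sid), count in ends.most_common():
        bucket = per_sample.setdefault(sample_id, [])
        if len(bucket) < n:
            bucket.append((inst, sid, count))
    return per_sample
```

The last case in the loop is the one described in note 2: when the real holder was idle and not sampled, the waiter itself shows up as the end of the chain.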

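    Other uses of ASH
    Besides finding holders, ASH data can give us much more information (depending on the database version):
    For example, use the PGA_ALLOCATED column to compute the maximum and total PGA at each sample point to analyze ORA-4030 or memory swap problems;
    use the TEMP_SPACE_ALLOCATED column to analyze temporary tablespace usage;
    use the IN_PARSE/IN_HARD_PARSE/IN_SQL_EXECUTION columns to tell whether a SQL statement is in the parse or execution phase;
    use CURRENT_OBJ#/CURRENT_FILE#/CURRENT_BLOCK# to identify the object on which an I/O-related wait occurred.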
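As an example of the first item in the list above, the following Python sketch totals PGA_ALLOCATED per sample point and records the largest single-session value. It is an illustration only, not from the original article: the function name and the tuple layout are assumptions, and the values are taken to be in bytes.

```python
from collections import defaultdict


def pga_by_sample(rows):
    """rows: iterable of (sample_id, pga_allocated); pga_allocated may be None.

    Returns {sample_id: {"total": ..., "max": ...}}.
    """
    total = defaultdict(int)
    peak = defaultdict(int)
    for sample_id, pga in rows:
        pga = pga or 0  # treat a NULL column value as 0
        total[sample_id] += pga
        peak[sample_id] = max(peak[sample_id], pga)
    return {sid: {"total": total[sid], "max": peak[sid]} for sid in total}
```

Sample points where the total rises sharply are the ones worth comparing against OS memory statistics when investigating ORA-4030 or swapping.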

  • Original article: https://www.cnblogs.com/feiyun8616/p/6138333.html