zoukankan      html  css  js  c++  java
  • RAC某节点v$asm_disk查询hang分析处理

    主题:RAC某节点v$asm_disk查询hang分析处理
    环境:Oracle 11.2.0.3 RAC
    故障描述:RAC环境2个节点,节点1查询v$asm_disk正常返回结果,节点2查询v$asm_disk就会一直hang,查询会话对应event是ASM file metadata operation.

    1.初步判断

    首先连接节点1,查询v$asm_disk没有问题:
    --节点1查询v$asm_disk都没有问题
    SQL> show parameter name  
    SQL> select path from v$asm_disk;
    
    PATH
    ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    /dev/oracleasm/disks/DATADISK
    /dev/oracleasm/disks/FRADISK
    /dev/oracleasm/disks/CRSDISK1
    /dev/oracleasm/disks/CRSDISK3
    /dev/oracleasm/disks/CRSDISK2
    

    然后连接节点2,查询v$asm_disk就一直hang

    SQL> select path from v$asm_disk;
    hang住一直无结果返回..
    

    故障重现,确认客户描述故障现场属实。

    既然是hang住,自然去查该会话的等待事件是什么?是否有阻塞会话?

    SQL> select sid, serial#, inst_id, sql_id, event, p1,p2,p3, machine, username, blocking_session from gv$session where wait_class# <> 6;
    
           SID    SERIAL#    INST_ID SQL_ID        EVENT                                       P1         P2         P3 MACHINE              USERNAME                       BLOCKING_SESSION
    ---------- ---------- ---------- ------------- ----------------------------------- ---------- ---------- ---------- -------------------- ------------------------------ ----------------
    
          289      33911          2 b5tshv0auyqm0 ASM file metadata operation               6229          9          0 newdb2               SYS
    

    发现等待事件是“ASM file metadata operation”,BLOCKING_SESSION为空,表明并没有阻塞会话。

    此时查了下MOS,初步怀疑是asm_diskstring配置有问题,或是对应的磁盘权限有什么异常。但实际排查并未发现异常。
    两个节点asm实例的参数设置都一样;对应目录下的磁盘等权限等也一致。

    1. 两个节点asm实例的参数设置都一样
    --节点1:
    SQL> show parameter asm
    
    NAME                                 TYPE        VALUE
    ------------------------------------ ----------- ------------------------------
    asm_diskgroups                       string      DATADG, FRADG, CRSDG_NEW
    asm_diskstring                       string      /dev/oracleasm/disks/*
    asm_power_limit                      integer     1
    asm_preferred_read_failure_groups    string
    SQL> 
    
    --节点2 一样的:
    SQL> show parameter asm
    
    NAME                                 TYPE        VALUE
    ------------------------------------ ----------- ------------------------------
    asm_diskgroups                       string      DATADG, FRADG, CRSDG_NEW
    asm_diskstring                       string      /dev/oracleasm/disks/*
    asm_power_limit                      integer     1
    asm_preferred_read_failure_groups    string
    
    1. 对应目录下的磁盘等权限等也一致
    [grid@newdb1 ~]$ ls -l /dev/oracleasm/disks/
    total 0
    brw-rw---- 1 grid asmadmin 253, 11 Jan 29 15:05 CRSDISK1
    brw-rw---- 1 grid asmadmin 253,  9 Jan 29 15:05 CRSDISK2
    brw-rw---- 1 grid asmadmin 253, 10 Jan 29 15:05 CRSDISK3
    brw-rw---- 1 grid asmadmin 253, 14 Jan 29 15:05 DATADISK
    brw-rw---- 1 grid asmadmin 253, 13 Jan 29 15:05 FRADISK
    [grid@newdb1 ~]$ 
    
    [grid@newdb2 ~]$ ls -l /dev/oracleasm/disks/
    total 0
    brw-rw---- 1 grid asmadmin 253, 21 Jan 29 15:05 CRSDISK1
    brw-rw---- 1 grid asmadmin 253, 18 Jan 29 15:05 CRSDISK2
    brw-rw---- 1 grid asmadmin 253, 19 Jan 29 15:05 CRSDISK3
    brw-rw---- 1 grid asmadmin 253, 22 Jan 29 15:05 DATADISK
    brw-rw---- 1 grid asmadmin 253, 20 Jan 29 15:05 FRADISK
    

    2.深入分析

    在故障节点2上使用oradebug获取short_stack以及SSD 266,然后进行深入分析。

    首先获取上面289,33911会话的spid:

    select p.spid
    from v$process p, v$session s
    where p.addr = s.paddr
    and s.sid = &sid
    and s.serial# = &serial;

    SQL> select p.spid
      2    from v$process p, v$session s
      3   where p.addr = s.paddr
      4     and s.sid = &sid
      5     and s.serial# = &serial;
    Enter value for sid: 289
    old   4:    and s.sid = &sid
    new   4:    and s.sid = 289
    Enter value for serial: 33911
    old   5:    and s.serial# = &serial
    new   5:    and s.serial# = 33911
    
    SPID
    ------------------------
    8763
    
    SQL> !ps -ef|grep 8763
    oracle    8763  8762  0 16:31 ?        00:00:00 oraclenewdb2 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
    grid      9639     1  0 16:35 ?        00:00:00 oracle+ASM2_user8763_newdb2 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
    oracle   12022  9750  0 16:43 pts/2    00:00:00 /bin/bash -c ps -ef|grep 8763
    oracle   12024 12022  0 16:43 pts/2    00:00:00 grep 8763
    

    然后使用oradebug 跟踪8763进程,获取short_stack和SSD 266:

    SQL> oradebug setospid 8763
    Oracle pid: 68, Unix process pid: 8763, image: oracle@newdb2 (TNS V1-V3)
    SQL> oradebug unlimit
    Statement processed.
    SQL> oradebug short_stack
    ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1876<-sspuser()+112<-__sighandler()<-read()+14<-ntpfprd()+115<-nsbasic_brc()+376<-nsbrecv()+69<-nioqrc()+485<-ttcdrv()+1461<-nioqwa()+61<-upirtrc()+1385<-kpurcsc()+98<-kpuexec()+10807<-OCIStmtExecute()+39<-kfdDskTableCb4Db()+4492<-kfdDskTableCbInternal()+301<-kfdDskTableCb()+56<-qerfxFetch()+2210<-rwsfcd()+103<-qerhjFetch()+3187<-opifch2()+2995<-kpoal8()+2939<-opiodr()+916<-ttcpip()+2242<-opitsk()+1673<-opiino()+966<-opiodr()+916<-opidrv()+570<-sou2o()+103<-opimai_real()+133<-ssthrdmain()+252<-main()+201<-__libc_start_main()+253<-_start()+36
    SQL> oradebug dump systemstate 266
    Statement processed.
    SQL> oradebug tracefile_name
    /home/app/oracle/diag/rdbms/newdg/newdb2/trace/newdb2_ora_8763.trc
    SQL> 
    

    从获取到的trc文件中,并没有找到有用的线索。至此问题陷入僵局。

    3.峰回路转

    回过头来想之前的解决过程,发现虽然查询v$asm_disk这类操作是在db层,但是本质确实asm实例的管理资源,去查asm实例是否有阻塞呢?
    --ASM INSTANCE
    set lines 350 trimspool on pages 300
    
    select sid, state, event, seconds_in_wait, blocking_session
    from   v$session
    where  blocking_session is not null
    or sid in (select blocking_session 
               from   v$session 
               where  blocking_session is not null)
    order by sid;
    

    在故障节点2的asm实例上查询:

    SQL> set lines 350 trimspool on pages 300
    SQL> 
    SQL> select sid, state, event, seconds_in_wait, blocking_session
      2  from   v$session
      3  where  blocking_session is not null
      4  or sid in (select blocking_session 
      5             from   v$session 
      6             where  blocking_session is not null)
      7  order by sid;
    
           SID STATE               EVENT                                                            SECONDS_IN_WAIT BLOCKING_SESSION
    ---------- ------------------- ---------------------------------------------------------------- --------------- ----------------
           128 WAITING             enq: DD - contention                                                      542311             1702
           380 WAITING             enq: DD - contention                                                     2159938             1702
           442 WAITING             enq: DD - contention                                                     2159098             1702
           506 WAITING             enq: DD - contention                                                     2158258             1702
           569 WAITING             enq: DD - contention                                                      713509             1702
           632 WAITING             enq: DD - contention                                                     2156578             1702
           695 WAITING             enq: DD - contention                                                     2155738             1702
           758 WAITING             enq: DD - contention                                                     1108924             1702
           821 WAITING             enq: DD - contention                                                     2142990             1702
           884 WAITING             enq: DD - contention                                                     2139901             1702
           947 WAITING             enq: DD - contention                                                     2136286             1702
          1010 WAITING             enq: DD - contention                                                     2078621             1702
          1073 WAITING             enq: DD - contention                                                     1845020             1702
          1135 WAITING             GPnP Get Item                                                            2161258
          1136 WAITING             enq: DD - contention                                                     1844671             1702
          1199 WAITING             enq: DD - contention                                                     1844394             1702
          1263 WAITING             enq: DD - contention                                                     1844065             1702
          1325 WAITING             enq: DD - contention                                                     1843549             1702
          1388 WAITING             enq: DD - contention                                                     1471932             1702
          1451 WAITING             enq: DD - contention                                                     1641694             1702
          1514 WAITING             enq: DD - contention                                                     1641426             1702
          1577 WAITING             enq: DD - contention                                                     1572189             1702
          1640 WAITING             enq: DD - contention                                                     1501765             1702
          1702 WAITING             rdbms ipc reply                                                                0             1135
          1703 WAITING             enq: DD - contention                                                      896552             1702
          1765 WAITING             enq: DD - contention                                                     2161018             1702
          1766 WAITING             enq: DD - contention                                                      893474             1702
          1829 WAITING             enq: DD - contention                                                       11782             1702
          1892 WAITING             enq: DD - contention                                                        5041             1702
          1955 WAITING             enq: DD - contention                                                        3357             1702
    
    30 rows selected.
    

    果然看到很多阻塞,EVENT为"enq: DD - contention"的都是被1702会话阻塞,而1702会话又是被1135会话阻塞,1135会话的event是"GPnP Get Item"。
    看两个会话的详细信息:

    select sid,
           SERIAL#,
           sql_id,
           event,
           MACHINE,
           PROGRAM,
           username,
           blocking_session from   v$session where sid = 1702; 
    
    
    select sid,
           SERIAL#,
           sql_id,
           event,
           MACHINE,
           PROGRAM,
           username,
           blocking_session from   v$session where sid = 1135; 
    

    结果如下:

    SQL> select sid,
      2         SERIAL#,
      3         sql_id,
      4         event,
      5         MACHINE,
      6         PROGRAM,
      7         username,
      8         blocking_session from   v$session where sid = 1702; 
    
           SID    SERIAL# SQL_ID        EVENT                                                            MACHINE                                                          PROGRAM                                       USERNAME                BLOCKING_SESSION
    ---------- ---------- ------------- ---------------------------------------------------------------- ---------------------------------------------------------------- ------------------------------------------------ ------------------------------ ----------------
          1702      15127 b8a2pdbq3p2mj rdbms ipc reply                                                  newdb2                                                           oracle@newdb2 (TNS V1-V3)                     SYS                                         1135
    
    SQL>  select sid,
      2         SERIAL#,
      3         sql_id,
      4         event,
      5         MACHINE,
      6         PROGRAM,
      7         username,
      8         blocking_session from   v$session where sid = 1135; 
    
           SID    SERIAL# SQL_ID        EVENT                                                            MACHINE                                                          PROGRAM                                       USERNAME                BLOCKING_SESSION
    ---------- ---------- ------------- ---------------------------------------------------------------- ---------------------------------------------------------------- ------------------------------------------------ ------------------------------ ----------------
          1135          1               GPnP Get Item                                                    newdb2                                                           oracle@newdb2 (RBAL)
    

    找到最终的原因:GPnP Get Item .

    4.解决问题

    有了上面的信息,再次查询MOS就可以匹配到 - Diskgroup Mount Hangs with RBAL Waiting on 'GPnP Get Item' and 'enq: DD - contention' (文档 ID 1375505.1) - "crsctl check cluster -all" command gives CRS-4404, CRS-4405 errors (文档 ID 1392934.1)

    参照MOS解决方案:

    1. Use the following command as root on Compute Node 3 to identify the current gpnpd.bin process:
      ps -ef | grep gpnpd.bin

    This will give you an output similar to:
    ps -ef | grep pnp
    oracle 31422 1 0 2011 ? 00:05:40 /u01/app/11.2.0.2/grid/bin/gpnpd.bin

    Please note the process ID number of the current gpnpd.bin process. In this example, = 31422

    1. Use the following command on Compute Node 3 to stop the current gpnpd.bin process:
      kill -HUP
      for example, kill -HUP 31422

    2. After a few moments, gpnpd.bin should be automatically restarted by GI, run the command from step # 1 to verify that a new gpnpd.bin process exists

    实际操作如下:

    [root@newdb2 dev]# ps -ef | grep gpnpd.bin
    grid      9835     1  0  2017 ?        00:23:19 /home/app/11.2.0/grid/bin/gpnpd.bin
    root     26300 28772  0 17:43 pts/0    00:00:00 grep gpnpd.bin
    [root@newdb2 dev]# kill -HUP  9835
    You have mail in /var/spool/mail/root
    [root@newdb2 dev]# 
    [root@newdb2 dev]# ps -ef | grep gpnpd.bin
    grid     26740     1  1 17:45 ?        00:00:00 /home/app/11.2.0/grid/bin/gpnpd.bin
    root     26762 28772  0 17:45 pts/0    00:00:00 grep gpnpd.bin
    [root@newdb2 dev]# su - grid
    [grid@newdb2 ~]$ sqlplus / as sysasm
    
    SQL*Plus: Release 11.2.0.3.0 Production on Mon Jan 29 17:46:02 2018
    
    Copyright (c) 1982, 2011, Oracle.  All rights reserved.
    
    
    Connected to:
    Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
    With the Real Application Clusters and Automatic Storage Management options
    
    SQL> set lines 350 trimspool on pages 300
    SQL> 
    SQL> select sid, state, event, seconds_in_wait, blocking_session
      2  from   v$session
      3  where  blocking_session is not null
      4  or sid in (select blocking_session 
      5             from   v$session 
      6             where  blocking_session is not null)
      7  order by sid;
    
    no rows selected
    
    SQL> select path from v$asm_disk;
    
    PATH
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    /dev/oracleasm/disks/FRADISK
    /dev/oracleasm/disks/DATADISK
    /dev/oracleasm/disks/CRSDISK3
    /dev/oracleasm/disks/CRSDISK2
    /dev/oracleasm/disks/CRSDISK1
    
    SQL>                    
    

    至此,故障节点的ASM实例的阻塞消失,再次验证查询v$asm_disk已经恢复正常,故障算是完美解决。

    后记:

    • 1.实际上起初的解决思路出现偏差,ASM的问题就应该直接从ASM实例找原因,而之前都是在DB层分析,自然没有得到好的结果。这一点很值得自己反思。
    • 2.最开始接到这个case是建议找停机时间直接重启节点2,这个方案简单粗暴且不需要深入分析问题,从最终解决方案看到,这种方法也是可以解决问题的。但恰恰由于客户坚持不能重启,反倒因此而学到了更多的技能点。
  • 相关阅读:
    Ansible概述
    iptables端口转发
    iptables配置实例
    iptables常用操作
    iptables常用命令
    每日总结3.15
    每日总结3.12
    每日总结3.11
    每日总结3.10
    每日总结3.9
  • 原文地址:https://www.cnblogs.com/jyzhao/p/8379720.html
Copyright © 2011-2022 走看看