zoukankan      html  css  js  c++  java
  • 经纪xx系统节点VIP案例介绍和深入分析异常

    系统环境    硬件平台 &  操作

    IBM 570

    操作系统版本号  AIX 5.3

    物理内存  32G

    Oracle 产品及版本号  10.2.0.5 RAC  

    业务类型  OLTP

     

     

    背景概述

    交易系统在xxxx 日。节点二VIP异常下线导致节点二数据库服务失

    效。接到请求后。第一时间进行连线处理。故障发生在凌晨 3点,并且

    AIXerrpt)、Oracle DBalert.log )、CRS crsd.log ocssd.logvip.log

    coredump )等均没有留下太多有效信息,情况很复杂。

     

    问题具体诊断及分析

    一、检查errpt日志:

    节点二VIP异常下线时,节点二上无报错产生。

     

    二、检查CRS日志:

    节点一: 

    2013-04-29 03:42:33.180: [  CRSRES][11376]32startRunnable: setting CLIvalues

     

    —  说明:节点一上仅有一行信息反映出故障期间节点一上CRS以前运行过命令。

    节点二: 

    2013-04-29 03:41:12.308: [  CRSAPP][11263]32CheckResource error forora.xxxxdb02.vip error code =1

    2013-04-29 03:41:12.335: [  CRSRES][11263]32In stateChanged,ora.xxxxdb02.vip target is ONLINE

    2013-04-29 03:41:12.335: [  CRSRES][11263]32ora.xxxxdb02.vip on xxxxdb02went OFFLINE unexpectedly

    2013-04-29 03:41:12.335: [  CRSRES][11263]32StopResource: setting CLIvalues

    2013-04-29 03:41:12.340: [  CRSRES][11263]32Attempting to stop`ora.xxxxdb02.vip` on member

    `xxxxdb02`

    2013-04-29 03:41:12.893: [  CRSRES][11269]32In stateChanged,ora.xxxxdb.xxxxdb1.xxxxdb2.srv target

    is ONLINE

    2013-04-29 03:41:12.894: [  CRSRES][11269]32ora.xxxxdb.xxxxdb1.xxxxdb2.srvon xxxxdb02 went OFFLINE

    unexpectedly

    2013-04-29 03:41:12.894: [  CRSRES][11269]32StopResource: setting CLIvalues

    2013-04-29 03:41:12.899: [  CRSRES][11269]32Attempting to stop`ora.xxxxdb.xxxxdb1.xxxxdb2.srv` on

    member `xxxxdb02`

    2013-04-29 03:41:12.958: [  CRSRES][11263]32Stop of `ora.xxxxdb02.vip` onmember `xxxxdb02`

    succeeded.

    2013-04-29 03:41:12.971: [  CRSRES][11263]32ora.xxxxdb02.vipRESTART_COUNT=0 RESTART_ATTEMPTS=0

    2013-04-29 03:41:12.976: [  CRSRES][11263]32ora.xxxxdb02.vip failed onxxxxdb02 relocating.

    2013-04-29 03:41:13.025: [  CRSRES][11263]32StopResource: setting CLIvalues

    2013-04-29 03:41:13.029: [  CRSRES][11263]32Attempting to stop

    `ora.xxxxdb02.LISTENER_XXXXDB02.lsnr` onmember `xxxxdb02`

    2013-04-29 03:41:13.146: [  CRSRES][11269]32Stop of`ora.xxxxdb.xxxxdb1.xxxxdb2.srv` on member

    `xxxxdb02` succeeded.

    2013-04-29 03:41:13.146: [ CRSRES][11269]32ora.xxxxdb.xxxxdb1.xxxxdb2.srv RESTART_COUNT=0

    RESTART_ATTEMPTS=1

    2013-04-29 03:41:13.159: [  CRSRES][11269]32Restartingora.xxxxdb.xxxxdb1.xxxxdb2.srv on xxxxdb02

    2013-04-29 03:41:13.164: [  CRSRES][11269]32startRunnable: setting CLIvalues

    2013-04-29 03:41:13.164: [  CRSRES][11269]32Attempting to start  `ora.xxxxdb.xxxxdb1.xxxxdb2.srv` on

    member `xxxxdb02`

    2013-04-29 03:41:45.618: [  CRSAPP][11269]32StartResource error forora.xxxxdb.xxxxdb1.xxxxdb2.srv

    error code = 1

    2013-04-29 03:41:45.799: [  CRSRES][11269]32Start of`ora.xxxxdb.xxxxdb1.xxxxdb2.srv` on member`xxxxdb02` failed.

    2013-04-29 03:41:45.820: [ CRSRES][11269]32ora.xxxxdb.xxxxdb1.xxxxdb2.srv failed on xxxxdb02 relocating.

    2013-04-29 03:41:45.885: [  CRSRES][11269]32Cannot relocate ora.xxxxdb.xxxxdb1.xxxxdb2.srvStopping dependents

    2013-04-29 03:41:45.897: [  CRSRES][11269]32StopResource: setting CLIvalues

    2013-04-29 03:42:29.483: [  CRSRES][11263]32Stop of `ora.xxxxdb02.LISTENER_XXXXDB02.lsnr` on member`xxxxdb02` succeeded.

    2013-04-29 03:42:29.496: [  CRSRES][11263]32Attempting to start`ora.xxxxdb02.vip` on member`xxdb01np5`

    2013-04-29 03:42:32.036: [  CRSRES][11263]32Start of `ora.xxxxdb02.vip`on member `xxdb01np5`succeeded.

     

    —  说明:节点二是故障节点,上面信息反映出,因为VIP检測异常导致节点二上VIP被强

    OFFLINE。同一时候CRS将基于VIP 资源的监听和数据库服务也进行了OFFLINE操作。

     

    三、分析CRS状态:

    因为故障期间客户非常好的保留了现场环境。因此我们得以现场分析到CRS

    状态。

     

    Name           Type           Target    State    Host        

    ------------------------------------------------------------

    ora.xxxxdb.db  application   ONLINE    ONLINE    xxdb01np5   

    ora....sdb1.cs application    ONLINE   OFFLINE               

    ora....db1.srv application    ONLINE   ONLINE    xxdb01np5   

    ora....db2.srv application    ONLINE   ONLINE    xxxxdb02   

    ora....b1.inst application    ONLINE   ONLINE    xxdb01np5   

    ora....b2.inst application    ONLINE   ONLINE    xxxxdb02   

    ora....P5.lsnr application    ONLINE   ONLINE    xxdb01np5   

    ora....np5.gsd application    ONLINE   ONLINE    xxdb01np5   

    ora....np5.ons application    ONLINE   ONLINE    xxdb01np5   

    ora....np5.vip application    ONLINE   ONLINE    xxdb01np5   

    ora....P5.lsnr application    ONLINE   OFFLINE               

    ora....np5.gsd application    ONLINE   ONLINE    xxxxdb02   

    ora....np5.ons application    ONLINE   ONLINE    xxxxdb02   

    ora....np5.vip application    ONLINE   ONLINE    xxdb01np5   

     

    —  说明:故障期间。节点二上的数据库(实例)资源仍然在执行,Nodeapp资源正常,

    唯一被OFFLINE的是一个Servicelistener 资源。

    可是我们注意到节点二上VIP 资源

    已经被节点一接管过去。这说明。节点二的VIPxxx.xxx.xxx.4 )在排查问题的时候是

    没问题的。

     

    四、分析数据库告警日志

    Sun Apr 29 03:41:13 BEIST 2012

    ALTER SYSTEM SET service_names='xxxxdb'SCOPE=MEMORY SID='xxxxdb2';

    Sun Apr 29 03:41:55 BEIST 2012

    ALTER SYSTEM SET service_names=''SCOPE=MEMORY SID='xxxxdb2';

    Sun Apr 29 03:41:55 BEIST 2012

    Immediate Kill Session#: 1659, Serial#: 2

    Immediate Kill Session: sess:70000038f773bb8  OS pid: 1180972

    Sun Apr 29 03:41:55 BEIST 2012

    Process OS id : 1180972 alive after kill

    Errors in file 

    Immediate Kill Session#: 1660, Serial#: 2

    Immediate Kill Session: sess:7000003847623c0  OS pid: 1179690

    Sun Apr 29 03:41:56 BEIST 2012

    Process OS id : 1179690 alive after kill

    Errors in file/oracle/product/10.2.0/admin/xxxxdb/udump/xxxxdb2_ora_110596.trc

    Immediate Kill Session#: 1661, Serial#: 2

    Immediate Kill Session: sess:7000003837620c0  OS pid: 1300264

    Sun Apr 29 03:41:56 BEIST 2012

    Process OS id : 1300264 alive after kill

    Errors in file/oracle/product/10.2.0/admin/xxxxdb/udump/xxxxdb2_ora_110596.trc

    Immediate Kill Session#: 1662, Serial#: 2

    Immediate Kill Session: sess:70000038275bc80  OS pid: 1155624

    Sun Apr 29 03:41:57 BEIST 2012

    Process OS id : 1155624 alive after kill

    Errors in file/oracle/product/10.2.0/admin/xxxxdb/udump/xxxxdb2_ora_110596.trc

    Immediate Kill Session#: 1663, Serial#: 2

    Immediate Kill Session: sess:700000387762ed0  OS pid: 1000484

    Sun Apr 29 03:41:57 BEIST 2012

    Process OS id : 1000484 alive after kill

    Errors in file/oracle/product/10.2.0/admin/xxxxdb/udump/xxxxdb2_ora_110596.trc

    Immediate Kill Session#: 1664, Serial#: 2

    Immediate Kill Session: sess:7000003867923d8  OS pid: 1175586

    Sun Apr 29 03:41:58 BEIST 2012

    Process OS id : 1175586 alive after kill

    Errors in file/oracle/product/10.2.0/admin/xxxxdb/udump/xxxxdb2_ora_110596.trc

    Immediate Kill Session#: 1665, Serial#: 2

    Immediate Kill Session: sess:7000003857a9d30  OS pid: 1296160

    Sun Apr 29 03:41:58 BEIST 2012

    Process OS id : 1296160 alive after kill

    Errors in file/oracle/product/10.2.0/admin/xxxxdb/udump/xxxxdb2_ora_110596.trc

    Immediate Kill Session#: 1666, Serial#: 3

    Immediate Kill Session: sess:70000038f775130  OS pid: 1176862

    Sun Apr 29 03:41:59 BEIST 2012

    Process OS id : 1176862 alive after kill

    Errors in file/oracle/product/10.2.0/admin/xxxxdb/udump/xxxxdb2_ora_110596.trc

    Immediate Kill Session#: 1667, Serial#: 2

    Immediate Kill Session: sess:700000384763938  OS pid: 1151516

    Sun Apr 29 03:42:00 BEIST 2012

    Process OS id : 1151516 alive after kill

    Errors in file/oracle/product/10.2.0/admin/xxxxdb/udump/xxxxdb2_ora_110596.trc

    Immediate Kill Session#: 1668, Serial#: 2

    Immediate Kill Session: sess:700000383763638  OS pid: 693786

    Sun Apr 29 03:42:01 BEIST 2012

    Process OS id : 693786 alive after kill

    Errors in file/oracle/product/10.2.0/admin/xxxxdb/udump/xxxxdb2_ora_110596.trc

    Immediate Kill Session#: 1669, Serial#: 2

    Immediate Kill Session: sess:70000038275d1f8  OS pid: 1292056 6

    —  说明:故障期间,节点二上数据库的ServiceOFFLINE,这一点正常。

    但随后CRS

    将连接到节点二上的全部client连接都强行Kill掉了。该问题和Oracle Bug 6955040

    关。

     

    五、分析RACGVIP

    因为问题原因定位到 VIP资源异常上,而VIP切换到节点一上是正常的,

    因此极有可能是因为VIP检查的时候出现异常导致CRS VIP OFFLINE。因为

    Oracle CRS在进行VIP检測的时候是通过RACGVIP这个 SHELL脚本实现功能

    的,因此我们分析和測试了RACGVIP脚本。

    我们注意到当中一段脚本,例如以下: 

     

      #Check the status of the interface thro' pinging gateway 

      if[ -n "$DEFAULTGW" ]

     then

       _RET=1

        #get base IP address of the interface

       tmpIP=`$LSATTR -El ${_IF} -a netaddr | $AWK '{print $2}'`

        #get RX packets numbers

       _O1=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print \$5;exit}}"`

       x=$CHECK_TIMES

       while [ $x -gt 0 ]

       do

         if [ -n "$tmpIP" ]

         then

           $PING -S $tmpIP $PING_TIMEOUT $DEFAULTGW > /dev/null 2>&1

         else

           $PING $PING_TIMEOUT $DEFAULTGW > /dev/null 2>&1

         fi

         _O2=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print \$5;exit}}"`

         if [ "$_O1" != "$_O2" ]

         then

           # RX packets numbers changed

           _RET=0

           break

         fi

         $SLEEP 1

         x=`$EXPR $x - 1`

       done

       if [ $_RET -ne 0 ]

       then

         logx "checkIf: RX packets checked if=$_IF failed"

       else

         logx "checkIf: RX packets checked if=$_IF OK"

       fi

     else

      

       logx "checkIf: Default gateway is not defined(host=$HOSTNAME)"

       if [ $FAIL_WHEN_DEFAULTGW_NO_FOUND -eq 1 ]

       then

         _RET=1

       else

         _RET=0

       fi

     fi 

     

    —  说明:这段脚本说明。CRS在做VIP check的时候机制例如以下,定期向缺省网关(Default

    Gateway)发包,假设发包前后检測到网卡上有包流量(output值前后不一样)则说

    明检測网络正常。VIP check通过。发包的操作通过AIX 命令运行,还原全部变量后的

    命令例如以下

    —  ping –S xxx.xxx.xxx.2 –c 1–w 1 xxx.xxx.xxx.254

    —  參数含义例如以下:

    -S hostname/IP addr

    Uses the IP address as the source addressin outgoing ping packets.

     

    -c Count

    Specifies the number of echo requests, asindicated by the Count

    variable, to be sent (and received).

     

    -w timeout

    This option works only with the -c option.It causes ping to wait

    for a maximum of 'timeout' seconds for areply (after sending the

    last packet).

     

    —  PING_TIMEOUT參数控制了-c -w的值,RACGVIP 脚本中设置例如以下:

    # timeout of ping in number of loops

    PING_TIMEOUT=" -c 1 -w 1"

     

    —  我们检查了10.2.0.3版本号下的设置,发如今10.2.0.3版本号下,PING_TIMEOUT參数设置

    为例如以下值:

    # timeout of ping in number of loops

    PING_TIMEOUT=" -c 1 -w 3"

     

     

    —  当前PING_TIMEOUT的设置意味着假设在pingdefault gateway 1秒内没有成功向

    外发送出一个包,那么CRS会经过一次SLEEP后再次尝试ping操作。假设还是没有成

    功。CRS即宣布网络异常,VIP OFFLINE

     

     

    结论

    Oracle 10.2.0.5RACGVIP脚本上的细微改变说明,Oracle觉得10.2.0.5版本号

    下对网络性能及稳定性更高了。

    可是将PING_TIMEOUT 1秒更改为3秒,会大

    大添加CRS 的敏感度,使得细微的网络异常即会造成CRS 中的 VIP检測失败。

     

    依据上述分析,我们觉得节点二故障原因例如以下:

    1 升级到10.2.0.5后,CRS 对网络延时更加敏感 

    2 429 日凌晨3点故障期间,网络出现了及其短暂的延时导致VIP检測

    失败,CRS VIP以及依赖于VIP的资源OFFLINE

    3 CRS OFFLINE数据库服务的时候命中Bug6955040造成client连接被

    所有杀光。 

     

    故障重现

     

    我方在生产环境模拟了这一段异常情况:

      步骤流程

    打开CRS VIPdebug

    2

    节点二上找出VIPxxx.xxx.xxx.4 )使用的网络设备(en10

    以及相应的ServiceIPxxx.xxx.xxx.2

    交换机上找出相应的port

    4 Block该port一小段时间

    观察CRS 和数据库日志

      

     

    我方我们观察到当中 CRS 日志和数据库日志信息和故障期间全然一致,这

    充分证明了前面的推断。 

    同一时候。VIPdebug 信息显演示样例如以下内容,我们能够从中清晰的看到,因为

    网络问题导致发包失败后,CRS 马上将VIP置于 OFFLINE状态: 

    2013-05-01 20:36:22.560: [    RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:Tue May  1

    20:36:18 BEIST 2012 [ 1139710 ] Checkinginterface existance

    Tue May 1 20:36:18 BEIST 2012 [ 1139710 ] Calling getifbyip

    Tue May 1 20:36:18 BEIST 2012 [ 1139710 ] getifbyip:  started for xxx.xxx.xxx.4 

    2013-05-01 20:36:22.560: [    RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:Tue May  1

    20:36:18 BEIST 2012 [ 1139710 ] getifbyip:checking if failover is happening (en10)

    Tue May 1 20:36:18 BEIST 2012 [ 1139710 ] getifbyip: failover is not happening

    (en10)

    Tue May 1 20:36:18 BEIST 2012 [ 1139710 ] getifbyip:  returning IP en10

    2013-05-01 20:36:22.560: [    RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:Tue May  1

    20:36:18 BEIST 2012 [ 1139710 ] Completedgetifbyip en10

    Tue May 1 20:36:18 BEIST 2012 [ 1139710 ] Completed with initial interface test

    Tue May 1 20:36:18 BEIST 2012 [ 1139710 ] Broadcast = 130.0.255.255

    2013-05-01 20:36:22.560: [    RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:Tue May  1

    20:36:18 BEIST 2012 [ 1139710 ] checkIf:start for if=en10

    Tue May 1 20:36:18 BEIST 2012 [ 1139710 ] IsIfAlive: start for if=en10

    Tue May 1 20:36:18 BEIST 2012 [ 1139710 ] defaultgw:  started  

    2013-05-01 20:36:22.560: [    RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:Tue May  1

    20:36:18 BEIST 2012 [ 1139710 ]defaultgw:  completed withxxx.xxx.xxx.254

    Tue May 1 20:36:18 BEIST 2012 [ 1139710 ] About to execute com mand:/usr/sbin/ping

    -S xxx.xxx.xxx.2  -c 1 -w 1 xxx.xxx.xxx.254  

    2013-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:Tue May  1

    20:36:20 BEIST 2012 [ 1139710 ] About toexecute command: /usr/sbin/ping -S

    xxx.xxx.xxx.2  -c 1 -w 1 xxx.xxx.xxx.254

    Tue May 1 20:36:22 BEIST 2012 [ 1139710 ] IsIfAlive: RX packets checked if=en10 failed

     

    2013-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:Tue May  1

    20:36:22 BEIST 2012 [ 1139710 ] Interfaceen10 checked failed (host=xxxxdb02)

    Tue May 1 20:36:22 BEIST 2012 [ 1139710 ] IsIfAlive: end for if=en10

    Tue May 1 20:36:22 BEIST 2012 [ 1139710 ] checkIf: end for if=en10

    2013-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:Tue May  1

    20:36:22 BEIST 2012 [ 1139710 ] PerformingCRS_STAT testing

    Tue May 1 20:36:22 BEIST 2012 [ 1139710 ] Completed CRS_STAT testing

    Tue May 1 20:36:22 BEIST 2012 [ 1139710 ] Completed second gateway test  

    2013-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:Tue May  1

    20:36:22 BEIST 2012 [ 1139710 ] Interfacetests  Invalid parameters, or failed to bring upVIP (host=xxxxdb02)

    2013-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb02.vip]:

    clsrcexecut: envORACLE_CONFIG_HOME=/oracle/product/10.2.0/crs_1

    2013-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb02.vip]:   

    clsrcexecut: cmd =/oracle/product/10.2.0/crs_1/bin/racgeut -e _USR_ORA_DEBUG=5 54

    /oracle/product/10.2.0/crs_1/bin/racgvipcheck xxxxdb02

    2013-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb02.vip]:

    clsrcexecut: rc = 1, time = 4.304s

    2013-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb02.vip]:

    clsrcaction: restyp=0 act_typ=2 stat=1

    2013-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb02.vip]:

    clsrcaction:  init 0.000s

    2013-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:

    clsrcaction:  action failed, 4.438s

    2013-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb02.vip]:

    clsrcaction:  post 0.000s

    2013-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb02.vip]:

    clsrcaction: all 4.438s

    2013-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb02.vip]:end for

    resource = ora.xxxxdb02.vip, action =check, status = 1, time = 4.442s

    2013-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb02.vip]:clsrccln:

    exiting ora.xxxxdb02.vip refcount=1  

    2013-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb02.vip]:

    clsrcprsrgter:gctx->prsrcfgref_clsrcgctx = 0

    2013-05-01 20:36:22.664: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

    clsrcprsrgini:gctx->prsrcfgref_clsrcgctx = 0

    2013-05-01 20:36:22.664: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

    clsrcini_ext: starting ora.xxxxdb02.viprefcount=1 global

    2013-05-01 20:36:22.665: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:begin for

    resource = ora.xxxxdb02.vip, action = stop

    2013-05-01 20:36:22.902: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

    clsrccssgetnodename: all 0.238s  

    2013-05-01 20:36:22.902: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

    clsrcnodeapps: calling FAILSRVSA

    2013-05-01 20:36:22.907: [    RACG][1] [1115450][1][ora.xxxxdb0 2.vip]:

    clsrcrundetach: cmd =/oracle/product/10.2.0/crs_1/bin/racgmain ora.xxxxdb02.vip

    rundetach 1 failsrvsa xxxxdb02, rc = 0, time = 0.005s

    2013-05-01 20:36:22.908: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

    clsrcnodeapps: Posting PNWDOWN_EVENT

    2013-05-01 20:36:22.908: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

    clsrccssgetnodename: all 0.000s

    2013-05-01 20:36:22.908: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

    clsrccssgetnodename: all 0.000s

    2013-05-01 20:36:22.908: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

    clsrcssgetrhost: using cached local hostname 

    2013-05-01 20:36:22.909: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

    clsrccssgetlhost: all 0.001s

    2013-05-01 20:36:22.935: [    RACG][1] [889364][1][ora.xxxxdb02.vip]:

    clsrcprsrgini:gctx->prsrcfgref_clsrcgctx = 0

    2013-05-01 20:36:22.935: [    RACG][1] [889364][1][ora.xxxxdb02.vip]:

    clsrcini_ext: starting ora.xxxxdb02.viprefcount=1 global

    2013-05-01 20:36:22.935: [    RACG][1] [889364][1][ora.xxxxdb02.vip]:begin for

    resource = ora.xxxxdb02.vip, action =rundetach

    2013-05-01 20:36:22.964: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = RESTART_COUNT=0  

    2013-05-01 20:36:22.982: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = STATE=ONLINE on xxxxdb02

    2013-05-01 20:36:22.982: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = 

    2013-05-01 20:36:22.982: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = NAME=ora.xxxxdb.xxxxdb1.xxxxdb2.srv

    2013-05-01 20:36:22.982: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = FAILURE_THRESHOLD=0  

    2013-05-01 20:36:22.982: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = TARGET=ONLINE

    2013-05-01 20:36:22.982: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = STATE=ONLINE on xxxxdb02  

    2013-05-01 20:36:22.982: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf =  

    2013-05-01 20:36:22.982: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = RESTART_COUNT=0

    2013-05-01 20:36:22.982: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = FAILURE_THRESHOLD=0

    2013-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = TARGET=ONLINE  

    2013-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = STATE=ONLINE on xxxxdb02  

    2013-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = RESTART_ATTEMPTS=5

    2013-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = NAME=ora.xxxxdb02.ons

    2013-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = TYPE=application

    2013-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = RESTART_ATTEMPTS=3

    2013-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = RESTART_COUNT=0

    2013-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = FAILURE_THRESHOLD=0  

    2013-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = TARGET=ONLINE  

    2013-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = STATE=ONLINE on xxxxdb02 15

    2013-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = 

    2013-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = NAME=ora.xxxxdb02.vip

    2013-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = TYPE=application

    2013-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = RESTART_ATTEMPTS=0  

    2013-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = RESTART_COUNT=0

    2013-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = FAILURE_THRESHOLD=0

    2013-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = TARGET=ONLINE

    2013-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = STATE=ONLINE on xxxxdb02

    2013-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = 

    2013-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcqryapi:

    resname = NULL, host = xxxxdb02, time =0.007s

    2013-05-01 20:36:23.075: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

    clsrcpostevt: EvmEventPost 1 0.000s

    2013-05-01 20:36:23.075: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

    clsrcposthaevt: forward to EVM  0.166s

    2013-05-01 20:36:23.080: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

     clsrcnodeapp: prsr num_env = 0

    2013-05-01 20:36:23.080: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

    clsrcnodeapp: settingORACLE_CONFIG_HOME=/oracle/product/10.2.0/crs_1

    2013-05-01 20:36:23.111: [    RACG][1] [889364][1][ora.xxxxdb02.vip]:

    clsrcpostevt: EvmEventPost 1 0.008s

    2013-05-01 20:36:23.114: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]: clsrcstatcb:

    buf = NAME=ora.xxxxdb02.vip

    2013-05-01 20:36:23.114: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = TYPE=application  

    2013-05-01 20:36:23.114: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = RESTART_ATTEMPTS=0

    2013-05-01 20:36:23.114: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = RESTART_COUNT=0

    2013-05-01 20:36:23.114: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = FAILURE_THRESHOLD=0

    2013-05-01 20:36:23.114: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = TARGET=ONLINE

    2013-05-01 20:36:23.114: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = STATE=ONLINE on xxxxdb02

    2013-05-01 20:36:23.115: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

    buf = 

    2013-05-01 20:36:23.115: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcqryapi:

    resname = ora.xxxxdb02.vip, host = NULL,time = 0.004s

    2013-05-01 20:36:23.284: [    RACG][1] [1115450][1][ora.xxxxdb0 2.vip]:Tue May  1

    20:36:23 BEIST 2012 [ 921812 ] Checkinginterface existance

    Tue May 1 20:36:23 BEIST 2012 [ 921812 ] Calling getifbyip

    Tue May 1 20:36:23 BEIST 2012 [ 921812 ] getifbyip:  started for xxx.xxx.xxx.4

    2013-05-01 20:36:23.284: [    RACG][1] [1115450][1][ora.xxxxdb0 2.vip]:Tue May  1

    20:36:23 BEIST 2012 [ 921812 ] getifbyip:checking if failover is happening ()

    Tue May 1 20:36:23 BEIST 2012 [ 921812 ] getifbyip: failover i s not happening()

    Tue May 1 20:36:23 BEIST 2012 [ 921812 ] Completed getifbyip

    2013-05-01 20:36:23.284: [    RACG][1] [1115450][1][ora.xxxxdb0 2.vip]:Tue May  1

    20:36:23 BEIST 2012 [ 921812 ] Completedwith initial interface test

    2013-05-01 20:36:23.284: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

    clsrcexecut: envORACLE_CONFIG_HOME=/oracle/product/10.2.0/crs_1

    2013-05-01 20:36:23.284: [    RACG][1] [1115450][1][ora.xxxxdb0 2.vip]:

    clsrcexecut: cmd =/oracle/product/10.2.0/crs_1/bin/racgeut -e _USR_ORA_DEBUG=5 54

    /oracle/product/10.2.0/crs_1/bin/racgvipstop xxxxdb02

    2013-05-01 20:36:23.284: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

    clsrcexecut: rc = 0, time = 0.204s

    2013-05-01 20:36:23.284: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

    clsrcposthaevt: reason = failure

    2013-05-01 20:36:23.285: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:clsrccln:

    exiting ora.xxxxdb02.vip refcount=1

    2013-05-01 20:36:23.286: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

    clsrcprsrgter:gctx->prsrcfgref_clsrcgctx = 0

     

    解决方式

    依据分析结果,我们觉得,10.2.0.5CRS 对网络过于敏感,出现网络延时

    的时候会对数据库集群产生较大影响。针对眼下的情况。我们建议例如以下: 

     

    一、  详查网络问题,极偶然的丢包或延时在网络层面也属于常见情况。

    有可能是线缆问题。也可能是交换机、server网卡、网络配置等

    问题,须要具体检查网络情况。

    二、  改动过于敏感的CRS 配置,将发包超时设置为3秒( 10.2.0.5之前的

    值):

    改动$ORA_CRS_HOME/bin/racgvip 脚本例如以下部分

    # timeout of ping in number of loops

    PING_TIMEOUT=" -c 1 -w 1"

     

    改动成例如以下内容:

    # timeout of ping in number of loops

    PING_TIMEOUT=" -c 1 -w 3"

     

    三、  因为Bug 6955040VIP异常后被触发。眼下优先解决VIP异常问

    题。该Bug 能够忽略。 

     

    版权声明:本文博主原创文章,博客,未经同意不得转载。

  • 相关阅读:
    js深入研究之自定义混合Mixin函数
    js深入研究之克隆,属性,数组,对象,函数
    PHP交易详情有感
    PHP统计排行,分页
    php获取当前月月初至月末的时间戳,上个月月初至月末的时间戳
    php编程规范
    js深入研究之无法理解的js类代码,extend扩展
    js深入研究之神奇的匿名函数类生成方式
    【编程之美】双线程高效下载
    转: 连续数打乱判断出少了哪些数?
  • 原文地址:https://www.cnblogs.com/yxwkf/p/4802741.html
Copyright © 2011-2022 走看看