  • What is Split Brain in Oracle Clusterware and Real Application Cluster (Doc ID 1425586.1)

    In this Document

      Purpose
      Scope
      Details
      1. Clusterware layer
      2. Real Application Cluster (database) layer
      Known Issues
      References

    APPLIES TO:

    Oracle Database - Enterprise Edition - Version 10.1.0.2 and later
    Information in this document applies to any platform.

    PURPOSE

    This note explains what split brain is in an Oracle Real Application Clusters environment and what errors and consequences are associated with it.

    SCOPE

    For DBAs and support engineers.

    DETAILS

    In generic terms, split brain indicates data inconsistencies originating from the maintenance of two separate data sets with overlapping scope, either by design (separate servers in a network) or because of a failure condition in which servers stop communicating and synchronizing their data with each other.

    Two components of an Oracle Real Application Clusters implementation can experience split brain.

    1. Clusterware layer

    Cluster nodes maintain their heartbeat via the private network and the voting disk. When a private network disruption prevents cluster nodes from communicating with each other for longer than the misscount setting, split brain occurs. In that case, the voting disk is used to determine which node(s) survive and which node(s) are evicted. The common voting result is:

    a. The group with more cluster nodes survives.
    b. If each group has the same number of nodes, the group containing the lower node number survives.
    c. Improvements have been made so that the node(s) with the lower load survive when the eviction is caused by high system load.
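    The survivor-selection heuristic above can be sketched as a small simulation. This is illustrative code only, not Oracle's actual algorithm; the function name and cohort representation are hypothetical:

    ```python
    def surviving_cohort(cohorts):
        """Pick which cohort of nodes survives a split brain.

        cohorts: list of lists of node numbers, one list per
        disconnected sub-cluster. Mirrors the documented heuristic.
        """
        # a. The cohort with more nodes survives.
        largest = max(len(c) for c in cohorts)
        candidates = [c for c in cohorts if len(c) == largest]
        # b. On a tie, the cohort containing the lowest node number survives.
        return min(candidates, key=min)

    # Node 1 alone vs. nodes 2 and 3: the larger cohort survives.
    print(surviving_cohort([[1], [2, 3]]))      # [2, 3]
    # Even split: the cohort holding the lowest node number survives.
    print(surviving_cohort([[1, 4], [2, 3]]))   # [1, 4]
    ```
    
    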

    Commonly, messages similar to the following appear in ocssd.log when split brain happens:

    [ CSSD]2011-01-12 23:23:08.090 [1262557536] >TRACE: clssnmCheckDskInfo: Checking disk info...
    [ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: clssnmCheckDskInfo: Aborting local node to avoid splitbrain.
    [ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: : my node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(2)
    [ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: 
    ###################################
    [ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: clssscExit: CSSD aborting
    ###################################

    The messages above indicate that communication from node 2 to node 1 is not working: node 2 sees only one node, while node 1 is working fine and can see both nodes in the cluster. To avoid split brain, node 2 aborted itself.
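    The key line to read is the "my node(...) VS Node(...)" comparison. A small helper like the following (illustrative only, not part of any Oracle tooling) shows how to pull the cohort sizes out of that message:

    ```python
    import re

    def parse_split_brain_line(line):
        """Extract (my_node, my_size, other_node, other_size) from an
        ocssd.log clssnmCheckDskInfo comparison line."""
        m = re.search(
            r"my node\((\d+)\), Leader\(\d+\), Size\((\d+)\) "
            r"VS Node\((\d+)\), Leader\(\d+\), Size\((\d+)\)",
            line)
        if not m:
            return None
        return tuple(int(g) for g in m.groups())

    line = ("[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: : "
            "my node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(2)")
    my_node, my_size, other_node, other_size = parse_split_brain_line(line)
    # Node 2 sees a cohort of size 1; node 1's cohort is size 2,
    # so node 2 is on the losing side and aborts.
    print(my_size < other_size)  # True
    ```
    
    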

    Solution: Engage the network administrator to check the private network layer and eliminate any network faults.

    2. Real Application Cluster (database) layer

    To ensure data consistency, each instance of a RAC database needs to maintain a heartbeat with the other instances. The heartbeat is maintained by background processes such as LMON, LMD, LMS and LCK. When any of these processes experiences an IPC send timeout, a communications reconfiguration and instance eviction take place to avoid split brain. The controlfile plays a role similar to that of the voting disk at the clusterware layer: it is used to determine which instance(s) survive and which instance(s) are evicted. The voting result is similar to the clusterware voting result; as a result, one or more instances are evicted.

    Common messages in instance alert log are similar to:

    alert log of instance 1:
    ---------
    Mon Dec 07 19:43:05 2011
    IPC Send timeout detected.Sender: ospid 26318
    Receiver: inst 2 binc 554466600 ospid 29940
    IPC Send timeout to 2.0 inc 8 for msg type 65521 from opid 20
    Mon Dec 07 19:43:07 2011
    Communications reconfiguration: instance_number 2
    Mon Dec 07 19:43:07 2011
    Trace dumping is performing id=[cdmp_20091207194307]
    Waiting for clusterware split-brain resolution
    Mon Dec 07 19:53:07 2011
    Evicting instance 2 from cluster
    Waiting for instances to leave: 

    ...

    alert log of instance 2:
    ---------
    Mon Dec 07 19:42:18 2011
    IPC Send timeout detected. Receiver ospid 29940
    Mon Dec 07 19:42:18 2011
    Errors in file 
    /u01/app/oracle/diag/rdbms/bd/BD2/trace/BD2_lmd0_29940.trc:
    Trace dumping is performing id=[cdmp_20091207194307]
    Mon Dec 07 19:42:20 2011
    Waiting for clusterware split-brain resolution
    Mon Dec 07 19:44:45 2011
    ERROR: LMS0 (ospid: 29942) detects an idle connection to instance 1
    Mon Dec 07 19:44:51 2011
    ERROR: LMD0 (ospid: 29940) detects an idle connection to instance 1
    Mon Dec 07 19:45:38 2011
    ERROR: LMS1 (ospid: 29954) detects an idle connection to instance 1
    Mon Dec 07 19:52:27 2011
    Errors in file 
    /u01/app/oracle/diag/rdbms/bd/BD2/trace/PVBD2_lmon_29938.trc  
    (incident=90153):
    ORA-29740: evicted by member 0, group incarnation 10
    Incident details in: 
    /u01/app/oracle/diag/rdbms/bd/BD2/incident/incdir_90153/BD2_lmon_29938_i90153.trc

    In the example above, instance 2's LMD0 (pid 29940) is the receiver in the IPC send timeout. Various problems can cause an IPC send timeout, for example:

    a. A network problem
    b. A process hang
    c. A software bug

    Please see Top 5 Issues for Instance Eviction, Document 1374110.1, for more information.

    In case of instance eviction, the alert log and all background process trace files need to be checked to determine the root cause.
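    When triaging, it helps to scan the alert logs of all instances for the eviction-related markers shown in the examples above. A minimal sketch (the marker strings come from the example logs in this note; the function is hypothetical helper code, not an Oracle utility):

    ```python
    MARKERS = (
        "IPC Send timeout",
        "Communications reconfiguration",
        "Waiting for clusterware split-brain resolution",
        "detects an idle connection",
        "ORA-29740",
        "Evicting instance",
    )

    def eviction_events(lines):
        """Return (line_number, text) pairs matching any eviction marker."""
        return [(n, ln.strip()) for n, ln in enumerate(lines, 1)
                if any(m in ln for m in MARKERS)]

    sample = [
        "Mon Dec 07 19:42:18 2011",
        "IPC Send timeout detected. Receiver ospid 29940",
        "Waiting for clusterware split-brain resolution",
        "ORA-29740: evicted by member 0, group incarnation 10",
    ]
    for n, text in eviction_events(sample):
        print(n, text)
    ```

    In practice the same scan would be run over each instance's alert log, then the timestamps correlated across instances to see which side detected the timeout first.
    
    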

    Known Issues

    1. Bug 7653579 - IPC send timeout in RAC after only short period Document 7653579.8
        Refer: ORA-29740 Instance (ASM/DB) eviction on Solaris SPARC Document 761717.1
        Fixed in: 11.2.0.1, 11.1.0.7.2 PSU and 11.1.0.7 Patch 22 on Windows

    2. Unpublished Bug 8267580: Wrong Instance Evicted Under High CPU Load
        Refer: Wrong Instance Evicted Under High CPU Load in 11.1.0.7 Document 1373749.1
        Fixed in: 11.2.0.1

    3. Bug 8365141 - DRM quiesce step hang causes instance eviction Document 8365141.8
        Fixed in: 10.2.0.5, 11.1.0.7.3, 11.1.0.7 patch 25 for Windows and 11.2.0.1

    4. Bug 7587008 - Hung RAC instance not evicted from cluster Document 7587008.8
        Fixed in: 10.2.0.4.4, 10.2.0.5 and 11.2.0.1, one-off patch available for various 11.1.0.7 releases

    5. Bug 11890804 - LMHB crashes instance with ORA-29770 after long "control file sequential read" waits Document 11890804.8
        Fixed in: 11.2.0.2.5, 11.2.0.3 and 11.2.0.2 Patch 10 on Windows

    6. BUG:13732226 - NODE GETS EVICTED WITH REASON CODE 0X2
        BUG:13399435 - KJFCDRMRCFG WAITED 249 SECS FOR LMD TO RECEIVE ALL FTDONES, REQUESTING KILL
        BUG:13503204 - INSTANCE EVICTION DUE TO REASON 0X200000
        Refer: 11gR2: LMON received an instance eviction notification from instance n Document 1440892.1
        Fixed in: 11.2.0.4 and some merge patch available for 11.2.0.2 and 11.2.0.3

  • Original source: https://www.cnblogs.com/future2012lg/p/4317970.html