  • CRS-2674: Start of 'ora.cssd' on 'rac2' failed: an Oracle RAC cluster service that would not start

    Background: the customer reported that a node in their Oracle RAC cluster had gone down.

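    1. First, find the cause of the outage: the archive log destination filled up, which caused the service to restart. The archive destination was USE_DB_RECOVERY_FILE_DEST (the default) and had not been changed at install time; archived logs should go to a dedicated location. Clean up the archived logs first, then change the archive destination.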
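The cleanup and relocation described in step 1 could look roughly like the sketch below. It only prints the RMAN and SQL*Plus statements so they can be reviewed before being piped into `rman target /` or `sqlplus / as sysdba`. The 7-day retention window and the `+ARCH` destination are assumptions for illustration, not values from this incident.

```shell
# Sketch only: emit the statements instead of running them, so they can be reviewed first.
# Assumptions (not from the original incident): 7-day retention, '+ARCH' destination.

# Print RMAN commands that purge archived logs completed more than N days ago.
rman_cleanup_script() {
    days="${1:-7}"
    cat <<EOF
CROSSCHECK ARCHIVELOG ALL;
DELETE NOPROMPT EXPIRED ARCHIVELOG ALL;
DELETE NOPROMPT ARCHIVELOG ALL COMPLETED BEFORE 'SYSDATE-${days}';
EOF
}

# Print the SQL that points archiving at a dedicated location on all RAC instances.
archive_dest_sql() {
    dest="$1"
    cat <<EOF
ALTER SYSTEM SET log_archive_dest_1='LOCATION=${dest}' SCOPE=BOTH SID='*';
ARCHIVE LOG LIST
EOF
}

rman_cleanup_script 7
archive_dest_sql '+ARCH'
```

Printing rather than executing keeps the destructive `DELETE` under manual control; adjust the retention to whatever the backup policy allows.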

    2. Node 1 started normally, but node 2 would not come up: it had no cluster services.
    Checking the cluster service status on rac2 returned an error:

    [grid@rac2 ~]$ /u01/app/11.2.0/grid/bin/crs_stat -t
    CRS-0184: Cannot communicate with the CRS daemon.


    This error shows that CRS itself is not healthy.
    Trying to start the CRS service also fails (note that this must be run as root):

    [root@ora102 ~]# /u01/app/11.2.0/grid/bin/crsctl start crs
    CRS-4640: Oracle High Availability Services is already active
    CRS-4000: Command Start failed, or completed with errors.

    A normal start looks like this:

    [root@rac2 bin]# /u01/app/11.2.0/grid/bin/crsctl start crs
    CRS-4123: Oracle High Availability Services has been started.

    Checking the CRS stack shows the problem:

    [grid@rac2 ~]$ crsctl check crs
    CRS-4638: Oracle High Availability Services is online
    CRS-4535: Cannot communicate with Cluster Ready Services
    CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
    CRS-4534: Cannot communicate with Event Manager
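When this check has to be repeated across nodes, the output can be condensed into one line. The function below is a minimal sketch that keys on the message codes shown in the transcript above; the function name is made up for illustration, and it assumes those codes are the only ones of interest.

```shell
# Sketch: summarise `crsctl check crs` output read from stdin.
# CRS-4638 = HAS online; CRS-4535 / CRS-4530 / CRS-4534 = CRS / CSS / EVM unreachable
# (codes taken from the transcript above).
summarise_crs_check() {
    awk '
        /CRS-4638/ { has = 1 }
        /CRS-4535/ { down = down " CRS" }
        /CRS-4530/ { down = down " CSS" }
        /CRS-4534/ { down = down " EVM" }
        END {
            if (!has)            print "HAS not online"
            else if (down != "") print "HAS online, unreachable:" down
            else                 print "HAS online, no failures reported"
        }'
}

# Demo on the failing output from rac2.
summarise_crs_check <<'EOF'
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
EOF
```

On a live node the same function would be fed with `crsctl check crs | summarise_crs_check`.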

    Next, check the IP addresses on rac2 with ifconfig -a. Neither the VIP nor the SCAN IP is present any more, so node rac2 has dropped out of the cluster.


    3. Try to re-register node 2 with the cluster

    [root@rac2 ~]# sh /u01/app/11.2.0/grid/root.sh
    Performing root user operation for Oracle 11g

    The following environment variables are set as:
        ORACLE_OWNER= grid
        ORACLE_HOME=  /u01/app/11.2.0/grid
    Enter the full pathname of the local bin directory: [/usr/local/bin]:
    The contents of "dbhome" have not changed. No need to overwrite.
    The contents of "oraenv" have not changed. No need to overwrite.
    The contents of "coraenv" have not changed. No need to overwrite.
    Entries will be added to the /etc/oratab file as needed by
    Database Configuration Assistant when a database is created
    Finished running generic part of root script.
    Now product-specific root actions will be performed.
    Using configuration parameter file: /u01/app/11.2.0/grid/crs/install/crsconfig_params
    User ignored Prerequisites during installation
    Installing Trace File Analyzer
    Configure Oracle Grid Infrastructure for a Cluster ... succeeded

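    4. Still broken. Deconfigure node 2, then run root.sh again (rootcrs.pl is the script for a cluster node; roothas.pl is its Oracle Restart counterpart)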

    [root@rac2 trace]# /u01/app/11.2.0/grid/crs/install/rootcrs.pl -verbose -deconfig -force
    [root@rac2 ~]# /u01/app/11.2.0/grid/crs/install/roothas.pl -verbose -deconfig -force
    [root@rac2 bin]# /u01/app/11.2.0/grid/root.sh

    The deconfig fails with:

    [root@rac2 install]# /u01/app/11.2.0/grid/crs/install/roothas.pl -verbose -deconfig -force
    Can't locate Env.pm in @INC (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 . /u01/app/11.2.0/grid/crs/install) at crsconfig_lib.pm line 703.
    BEGIN failed--compilation aborted at crsconfig_lib.pm line 703.
    Compilation failed in require at /u01/app/11.2.0/grid/crs/install/roothas.pl line 166.
    BEGIN failed--compilation aborted at /u01/app/11.2.0/grid/crs/install/roothas.pl line 166.

    A dependency is missing. Install it with: yum install perl-Env

    Installed:
      perl-Env.noarch 0:1.04-2.el7
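A quick pre-flight check can catch this kind of missing Perl module before the deconfig script is launched. The sketch below assumes the scripts run under the system `perl` (the @INC list in the error shows system paths such as /usr/share/perl5); the function name is made up for illustration.

```shell
# Sketch: verify that a Perl module can be loaded before running rootcrs.pl / roothas.pl.
# $1 = module name, $2 = optional path to the perl binary (defaults to `perl` from PATH).
perl_has_module() {
    "${2:-perl}" -M"$1" -e 1 >/dev/null 2>&1
}

if perl_has_module Env; then
    echo "Env.pm found"
else
    echo "Env.pm missing: install it, e.g. yum install perl-Env"
fi
```

`perl -MModule -e 1` simply tries to load the module and exits non-zero if it cannot, which is the same failure the script hit at crsconfig_lib.pm line 703.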

    5. Deconfigure node 2

    [root@rac2 install]# /u01/app/11.2.0/grid/crs/install/roothas.pl -verbose -deconfig -force
    Using configuration parameter file: /u01/app/11.2.0/grid/crs/install/crsconfig_params
    CRS-4535: Cannot communicate with Cluster Ready Services
    CRS-4000: Command Stop failed, or completed with errors.
    CRS-4535: Cannot communicate with Cluster Ready Services
    CRS-4000: Command Delete failed, or completed with errors.
    CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'rac2'
    CRS-2673: Attempting to stop 'ora.mdnsd' on 'rac2'
    CRS-2677: Stop of 'ora.mdnsd' on 'rac2' succeeded
    CRS-2673: Attempting to stop 'ora.crf' on 'rac2'
    CRS-2677: Stop of 'ora.crf' on 'rac2' succeeded
    CRS-2673: Attempting to stop 'ora.gipcd' on 'rac2'
    CRS-2677: Stop of 'ora.gipcd' on 'rac2' succeeded
    CRS-2673: Attempting to stop 'ora.gpnpd' on 'rac2'
    CRS-2677: Stop of 'ora.gpnpd' on 'rac2' succeeded
    CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'rac2' has completed
    CRS-4133: Oracle High Availability Services has been stopped.
    Successfully deconfigured Oracle Restart stack


    6. Re-register the node with the cluster

    [root@rac2 install]# /u01/app/11.2.0/grid/root.sh
    Performing root user operation for Oracle 11g
    The following environment variables are set as:
        ORACLE_OWNER= grid
        ORACLE_HOME=  /u01/app/11.2.0/grid
    Enter the full pathname of the local bin directory: [/usr/local/bin]:
    The contents of "dbhome" have not changed. No need to overwrite.
    The contents of "oraenv" have not changed. No need to overwrite.
    The contents of "coraenv" have not changed. No need to overwrite.

    Entries will be added to the /etc/oratab file as needed by
    Database Configuration Assistant when a database is created
    Finished running generic part of root script.
    Now product-specific root actions will be performed.
    Using configuration parameter file: /u01/app/11.2.0/grid/crs/install/crsconfig_params
    User ignored Prerequisites during installation
    Installing Trace File Analyzer
    OLR initialization - successful
    Adding Clusterware entries to inittab
    CRS-4402: The CSS daemon was started in exclusive mode but found an active CSS daemon on node rac1, number 1, and is terminating
    An active cluster was found during exclusive startup, restarting to join the cluster
    Start of resource "ora.cssd" failed
    CRS-2672: Attempting to start 'ora.cssdmonitor' on 'rac2'
    CRS-2672: Attempting to start 'ora.gipcd' on 'rac2'
    CRS-2676: Start of 'ora.cssdmonitor' on 'rac2' succeeded
    CRS-2676: Start of 'ora.gipcd' on 'rac2' succeeded
    CRS-2672: Attempting to start 'ora.cssd' on 'rac2'
    CRS-2672: Attempting to start 'ora.diskmon' on 'rac2'
    CRS-2676: Start of 'ora.diskmon' on 'rac2' succeeded
    CRS-2674: Start of 'ora.cssd' on 'rac2' failed
    CRS-2679: Attempting to clean 'ora.cssd' on 'rac2'
    CRS-2681: Clean of 'ora.cssd' on 'rac2' succeeded
    CRS-2673: Attempting to stop 'ora.gipcd' on 'rac2'
    CRS-2677: Stop of 'ora.gipcd' on 'rac2' succeeded
    CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'rac2'
    CRS-2677: Stop of 'ora.cssdmonitor' on 'rac2' succeeded
    CRS-5804: Communication error with agent process
    CRS-4000: Command Start failed, or completed with errors.
    Failed to start Oracle Grid Infrastructure stack
    Failed to start Cluster Synchorinisation Service in clustered mode at /u01/app/11.2.0/grid/crs/install/crsconfig_lib.pm line 1278.
    /u01/app/11.2.0/grid/perl/bin/perl -I/u01/app/11.2.0/grid/perl/lib -I/u01/app/11.2.0/grid/crs/install /u01/app/11.2.0/grid/crs/install/rootcrs.pl execution failed

    Still failing.


    7. CSSD did not start on the second node. Find the cssd log files under the $GRID_HOME/log/rac2 subdirectory and read them.

    /u01/app/11.2.0/grid/log/rac2/cssd
    2019-10-12 15:41:19.013: [    CSSD][3199571712]clssgmDiscEndpcl: gipcDestroy 0x8a28
    2019-10-12 15:41:19.064: [    CSSD][3181754112]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
    2019-10-12 15:41:19.844: [    CSSD][3186484992]clssnmvDHBValidateNcopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 464729747, wrtcnt, 8055111, LATS 336904, lastSeqNo 8055110, uniqueness 1569234927, timestamp 1570866136/3845241248
    2019-10-12 15:41:20.064: [    CSSD][3181754112]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
    2019-10-12 15:41:20.845: [    CSSD][3186484992]clssnmvDHBValidateNcopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 464729747, wrtcnt, 8055112, LATS 337904, lastSeqNo 8055111, uniqueness 1569234927, timestamp 1570866137/3845242248
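The telling line in this log is "has a disk HB, but no network HB": rac2 can see rac1's heartbeat on the voting disk but receives nothing over the private interconnect. A small filter like the sketch below pulls the affected node names out of a long log. The function name is made up, and the log file name `ocssd.log` used in the usage comment is an assumption, since the post only gives the directory.

```shell
# Sketch: list nodes that CSSD sees on the voting disk but not on the interconnect.
# Reads cssd log text on stdin and matches the clssnmvDHBValidateNcopy message shown above.
nodes_without_network_hb() {
    sed -n 's/.*clssnmvDHBValidateNcopy: node [0-9]*, \([^,]*\), has a disk HB, but no network HB.*/\1/p' |
        sort -u
}

# Usage (file name assumed):
#   nodes_without_network_hb < /u01/app/11.2.0/grid/log/rac2/cssd/ocssd.log
```

Any node name it prints points at a network problem between that node and the local one rather than at storage or the Clusterware configuration.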

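    8. The log reports that node 1 has a disk heartbeat but no network heartbeat, so check the private (heartbeat) network from node 2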

    [grid@rac2 /]$ ping 20.20.20.201    # node 1's private IP
    PING 20.20.20.201 (20.20.20.201) 56(84) bytes of data.
    From 20.20.20.202 icmp_seq=1 Destination Host Unreachable
    From 20.20.20.202 icmp_seq=2 Destination Host Unreachable
    From 20.20.20.202 icmp_seq=3 Destination Host Unreachable
    From 20.20.20.202 icmp_seq=4 Destination Host Unreachable
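Since the interconnect turned out to be the culprit, it is worth probing it early in any similar investigation. The sketch below wraps the same ping in a function that accepts any number of private addresses; the function name is made up, and the `-c` and `-W` options assume the Linux iputils `ping`.

```shell
# Sketch: probe each private (interconnect) address and report the result.
# -c 3: send three probes; -W 2: wait up to 2 seconds per reply (Linux iputils ping).
check_interconnect() {
    for addr in "$@"; do
        if ping -c 3 -W 2 "$addr" >/dev/null 2>&1; then
            echo "$addr reachable"
        else
            echo "$addr UNREACHABLE"
        fi
    done
}
```

Calling `check_interconnect 20.20.20.201` from rac2 would have flagged node 1's private IP as unreachable before any deconfig or root.sh run was attempted.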


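     The heartbeat link is down, which is exhausting after everything above. According to the customer, node 1's heartbeat link has failed several times before, so its NIC is probably faulty.

    With the customer's approval, we restarted the NIC on node 1 and then restarted the cluster services; the services on both node 1 and node 2 came up normally.
    We advised the customer to replace the NIC to remove the underlying risk.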

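     9. The long detour came down to a heartbeat problem. Troubleshooting calls for bold hypotheses and careful verification: rule out the possible causes one by one until the trail leads to the root cause.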

  • Original article: https://www.cnblogs.com/shujuyr/p/13131461.html