  • A Slightly Embarrassing Cluster Startup Troubleshooting Session

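    Because the nature of my work has changed, I have not done hands-on troubleshooting for quite a while. Today's case was not in production either: I needed an 11g RAC environment for some tests, and to save time I simply copied a VirtualBox environment I had backed up earlier. After booting the machines, though, the cluster would not start: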

    [root@jystdrac1 ~]# su - grid
    [grid@jystdrac1 ~]$ crsctl stat res -t
    CRS-4535: Cannot communicate with Cluster Ready Services
    CRS-4000: Command Status failed, or completed with errors.
    [grid@jystdrac1 ~]$ crsctl stat res -t -init
    CRS-4639: Could not contact Oracle High Availability Services
    CRS-4000: Command Status failed, or completed with errors.
    

    The cluster alert log shows these errors:

    [grid@jystdrac1 jystdrac1]$ pwd
    /opt/app/11.2.0/grid/log/jystdrac1
    [grid@jystdrac1 jystdrac1]$ tail -20f alertjystdrac1.log
    2021-07-01 00:26:27.379:
    [/opt/app/11.2.0/grid/bin/oraagent.bin(4526)]CRS-5818:Aborted command 'start' for resource 'ora.mdnsd'. Details at (:CRSAGF00113:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/agent/ohasd/oraagent_grid/oraagent_grid.log.
    2021-07-01 00:26:31.384:
    [ohasd(4160)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.mdnsd'. Details at (:CRSPE00111:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/ohasd/ohasd.log.
    2021-07-01 00:28:32.889:
    [/opt/app/11.2.0/grid/bin/oraagent.bin(4568)]CRS-5818:Aborted command 'start' for resource 'ora.gpnpd'. Details at (:CRSAGF00113:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/agent/ohasd/oraagent_grid/oraagent_grid.log.
    2021-07-01 00:28:36.895:
    [ohasd(4160)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.gpnpd'. Details at (:CRSPE00111:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/ohasd/ohasd.log.
    2021-07-01 00:28:38.424:
    [mdnsd(4644)]CRS-5602:mDNS service stopping by request.
    2021-07-01 00:30:38.407:
    [/opt/app/11.2.0/grid/bin/oraagent.bin(4633)]CRS-5818:Aborted command 'start' for resource 'ora.mdnsd'. Details at (:CRSAGF00113:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/agent/ohasd/oraagent_grid/oraagent_grid.log.
    2021-07-01 00:30:42.412:
    [ohasd(4160)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.mdnsd'. Details at (:CRSPE00111:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/ohasd/ohasd.log.
    2021-07-01 00:32:43.923:
    [/opt/app/11.2.0/grid/bin/oraagent.bin(4676)]CRS-5818:Aborted command 'start' for resource 'ora.gpnpd'. Details at (:CRSAGF00113:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/agent/ohasd/oraagent_grid/oraagent_grid.log.
    2021-07-01 00:32:47.928:
    [ohasd(4160)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.gpnpd'. Details at (:CRSPE00111:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/ohasd/ohasd.log.
    2021-07-01 00:32:49.455:
    [mdnsd(4822)]CRS-5602:mDNS service stopping by request.
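
    The alert log keeps repeating the same start timeouts, so it can help to condense it into a per-resource count. The following is only a sketch: count_start_timeouts is a hypothetical helper (not an Oracle tool) that reads an alert log on stdin and relies on the CRS-2757 message format shown above.

```shell
#!/bin/bash
# Hypothetical helper: count CRS-2757 start timeouts per resource.
# Reads a Clusterware alert log on stdin and prints "<resource> <count>" lines.
count_start_timeouts() {
    grep 'CRS-2757' \
        | sed -n "s/.*from the resource '\([^']*\)'.*/\1/p" \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}

# Example (path taken from this post):
# count_start_timeouts < /opt/app/11.2.0/grid/log/jystdrac1/alertjystdrac1.log
```

    In this case only ora.mdnsd and ora.gpnpd show up, which narrows the search to those two daemons' own logs.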
    

    Looking further at the latest errors in mdnsd.log (gpnpd.log shows similar errors and is omitted to save space):

    [grid@jystdrac1 mdnsd]$ pwd
    /opt/app/11.2.0/grid/log/jystdrac1/mdnsd
    [grid@jystdrac1 mdnsd]$ tail -20 mdnsd.log
    2021-06-30 22:50:59.275: [    MDNS][1534236416] mdnsd exit
    2021-06-30 22:53:03.989: [ default][1342412544]
    
    ================================================================================
    2021-06-30 22:53:03.989: [ default][1342412544]mdnsd START pid=2201
    [  clsdmt][1335961344]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=jystdrac1DBG_MDNSD))
    2021-06-30 22:53:03.991: [  clsdmt][1335961344]PID for the Process [2201], connkey 9
    2021-06-30 22:53:03.991: [  clsdmt][1335961344]Creating PID [2201] file for home /opt/app/11.2.0/grid host jystdrac1 bin mdns to /opt/app/11.2.0/grid/mdns/init/
    2021-06-30 22:53:03.992: [  clsdmt][1335961344]Writing PID [2201] to the file [/opt/app/11.2.0/grid/mdns/init/jystdrac1.pid]
    2021-06-30 22:53:03.992: [  clsdmt][1335961344]Failed to record pid for MDNSD
    2021-06-30 22:53:03.992: [  clsdmt][1335961344]Terminating process
    2021-06-30 22:53:03.992: [    MDNS][1335961344] clsdm requested mdnsd exit
    2021-06-30 22:53:03.992: [    MDNS][1335961344] mdnsd exit
    2021-06-30 22:57:14.236: [ default][747345664]
    
    ================================================================================
    2021-06-30 22:57:14.236: [ default][747345664]mdnsd START pid=2375
    [  clsdmt][740894464]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=jystdrac1DBG_MDNSD))
    2021-06-30 22:57:14.239: [  clsdmt][740894464]PID for the Process [2375], connkey 9
    2021-06-30 22:57:14.239: [  clsdmt][740894464]Cr[grid@jystdrac1 mdnsd]$
    

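    MOS also has an article describing the top five issues that prevent RAC from starting:

    • Top 5 Grid Infrastructure Startup Issues (Doc ID 1526147.1)

    Issue 4 in that article, "Agent or mdnsd.bin, gpnpd.bin, gipcd.bin not running", closely matches the current symptoms.

    The document lists the possible causes and the corresponding solutions: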

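    Possible causes:
    
    1. orarootagent is missing execute permission
    2. The process's <node>.pid file is missing, or its owner/permissions are wrong
    3. The GRID_HOME owner/permissions are wrong
    
    Solutions:
    
    1. Compare the owner/permissions with a known-good GRID_HOME and correct them accordingly, or run the following as root:
       # cd <GRID_HOME>/crs/install
       # ./rootcrs.pl -unlock
       # ./rootcrs.pl -patch
    This stops the clusterware, sets the owner/permissions of the required files to root, and restarts the clusterware.
    2. If the corresponding <node>.pid file does not exist, create it with touch and give it the appropriate owner/permissions; otherwise correct the owner/permissions of <node>.pid as required. Then restart the clusterware.
    The following <node>.pid files under <GRID_HOME> are owned by root:root with permissions 644:
      ./ologgerd/init/<node>.pid
      ./osysmond/init/<node>.pid
      ./ctss/init/<node>.pid
      ./ohasd/init/<node>.pid
      ./crs/init/<node>.pid
    The following are owned by <grid>:oinstall with permissions 644:
      ./mdns/init/<node>.pid
      ./evm/init/<node>.pid
      ./gipc/init/<node>.pid
      ./gpnp/init/<node>.pid
    
    3. For cause 3, see solution 1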
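
    The owner/permission checks on the <node>.pid files listed above can be scripted rather than inspected one by one. This is a minimal sketch, assuming GNU stat is available; check_pid_file is a hypothetical helper, and the GRID_HOME path and node name in the usage comment are simply the ones from this environment.

```shell
#!/bin/bash
# Hypothetical helper: compare a pid file's owner:group and octal mode
# against the expected values. Prints OK / MISSING / MISMATCH and returns
# non-zero when the file is missing or does not match.
check_pid_file() {
    local file=$1 want_owner=$2 want_mode=$3 got
    if [ ! -e "$file" ]; then
        echo "MISSING $file"
        return 1
    fi
    got=$(stat -c '%U:%G %a' "$file")   # GNU stat: "owner:group mode"
    if [ "$got" = "$want_owner $want_mode" ]; then
        echo "OK $file"
    else
        echo "MISMATCH $file (found $got, expected $want_owner $want_mode)"
        return 1
    fi
}

# Example for the grid-owned files (adjust the values to your environment):
# GRID_HOME=/opt/app/11.2.0/grid; NODE=jystdrac1
# for d in mdns evm gipc gpnp; do
#     check_pid_file "$GRID_HOME/$d/init/$NODE.pid" grid:oinstall 644
# done
```

    Note that a check like this only covers ownership and mode; it says nothing about whether the file can actually be written.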
    

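    But after checking each item in turn, I found nothing wrong. Strange: if the permissions are all correct, why can't the pid file be written?

    How about trying to write it by hand with vi?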

    [grid@jystdrac1 jystdrac1]$ vi /opt/app/11.2.0/grid/mdns/init/jystdrac1.pid
    2201
    

    Saving the file produces an error:

    "/opt/app/11.2.0/grid/mdns/init/jystdrac1.pid"
    "/opt/app/11.2.0/grid/mdns/init/jystdrac1.pid" E514: write error (file system full?)
    Press ENTER or type command to continue
    

    What? The file system is full???

    [grid@jystdrac1 jystdrac1]$ df -h
    Filesystem                        Size  Used Avail Use% Mounted on
    /dev/mapper/vg_linuxbase-lv_root   28G   27G     0 100% /
    tmpfs                             1.5G     0  1.5G   0% /dev/shm
    /dev/sda1                         485M   39M  421M   9% /boot
    

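    Well, so it is... How embarrassing: it turned out to be the most basic problem of all, running out of disk space.
    After freeing up some space, I restarted the cluster to see whether it would start normally.
    It's OK!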
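
    Since the root cause was a full file system, a quick space check before starting the cluster would have caught it right away. The following is a rough sketch only: fs_use_pct and require_free_space are hypothetical helpers built on df -P, and the 95% threshold in the usage comment is an arbitrary assumption.

```shell
#!/bin/bash
# Hypothetical helper: print the Use% (without the % sign) of the file
# system that holds the given path. df -P keeps each entry on one line,
# even for long device names such as /dev/mapper/vg_linuxbase-lv_root.
fs_use_pct() {
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: return non-zero when usage has reached the limit (percent).
require_free_space() {
    local path=$1 limit=$2 pct
    pct=$(fs_use_pct "$path")
    if [ "$pct" -ge "$limit" ]; then
        echo "FULL $path ${pct}%"
        return 1
    fi
    echo "OK $path ${pct}%"
}

# Example (as root; the threshold is an assumption, adjust as needed):
# require_free_space /opt/app/11.2.0/grid 95 && crsctl start crs
# To see what is using the space on the root file system:
# du -xh --max-depth=1 / | sort -rh | head
```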

    © AlfredZhao. All rights reserved. "Set sail from Oracle and explore the wonders of IT."
  • Original article: https://www.cnblogs.com/jyzhao/p/14957091.html