  • 11g RAC: node 2 host down (two-node RAC)

    --Node 2 database alert log

    Mon Jul 01 06:38:22 2019
    SUCCESS: diskgroup SAS_ARCH was dismounted
    Mon Jul 01 06:38:22 2019
    Shutting down instance (abort)
    License high water mark = 1923
    USER (ospid: 82381): terminating the instance
    Mon Jul 01 06:38:22 2019
    opiodr aborting process unknown ospid (12589) as a result of ORA-1092
    Mon Jul 01 06:38:22 2019
    opiodr aborting process unknown ospid (45276) as a result of ORA-1092
    Mon Jul 01 06:38:22 2019
    opiodr aborting process unknown ospid (107399) as a result of ORA-1092
    Instance terminated by USER, pid = 82381
    Mon Jul 01 06:38:24 2019
    Instance shutdown complete

    --Host system log

    Jul 1 06:35:01 test2 auditd[16253]: Audit daemon rotating log files
    Jul 1 06:38:19 test2 init: oracle-ohasd main process (15639) killed by TERM signal
    Jul 1 06:38:19 test2 init: oracle-tfa main process (15638) killed by TERM signal
    Jul 1 06:38:19 test2 init: tty (/dev/tty2) main process (16997) killed by TERM signal
    Jul 1 06:38:19 test2 init: tty (/dev/tty3) main process (16999) killed by TERM signal
    Jul 1 06:38:19 test2 init: tty (/dev/tty4) main process (17004) killed by TERM signal
    Jul 1 06:38:19 test2 init: tty (/dev/tty5) main process (17006) killed by TERM signal
    Jul 1 06:38:19 test2 init: tty (/dev/tty6) main process (17008) killed by TERM signal
    Jul 1 06:38:19 test2 gnome-session[17110]: WARNING: Failed to send buffer
    Jul 1 06:38:19 test2 gnome-session[17110]: WARNING: Failed to send buffer
    Jul 1 06:38:23 test2 ntpd[90741]: Deleting interface #15 bond0:1, 10.1.11.103#123, interface stats: received=1410, sent=0, dropped=0, active_time=56169415 secs
    Jul 1 06:38:39 test2 pulseaudio[17164]: pid.c: Failed to open PID file '/var/lib/gdm/.pulse/45593399e441b14e2757581a00000028-runtime/pid': No such file or directory
    Jul 1 06:38:39 test2 pulseaudio[17164]: pid.c: Failed to open PID file '/var/lib/gdm/.pulse/45593399e441b14e2757581a00000028-runtime/pid': No such file or directory
    Jul 1 06:38:46 test2 ntpd[90741]: Deleting interface #14 bond1:1, 169.254.7.117#123, interface stats: received=0, sent=0, dropped=0, active_time=56169467 secs
    Jul 1 06:38:51 test2 abrtd: Got signal 15, exiting
    Jul 1 06:38:51 test2 xinetd[45495]: Exiting...
    Jul 1 06:38:51 test2 acpid: exiting
    Jul 1 06:38:51 test2 ntpd[90741]: ntpd exiting on signal 15
    Jul 1 06:38:53 test2 init: Disconnected from system bus
    Jul 1 06:38:53 test2 rtkit-daemon[17166]: Demoting known real-time threads.
    Jul 1 06:38:53 test2 rtkit-daemon[17166]: Demoted 0 threads.
    Jul 1 06:38:53 test2 auditd[16253]: The audit daemon is exiting.
    Jul 1 06:38:53 test2 kernel: type=1305 audit(1561934333.370:37053744): audit_pid=0 old=16253 auid=4294967295 ses=4294967295 res=1
    Jul 1 06:38:53 test2 kernel: type=1305 audit(1561934333.475:37053745): audit_enabled=0 old=1 auid=4294967295 ses=4294967295 res=1
    Jul 1 06:38:53 test2 kernel: Kernel logging (proc) stopped.
    Jul 1 06:38:53 test2 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="16275" x-info="http://www.rsyslog.com"] exiting on signal 15.

    ---Node 2 Grid log: alertbapdb2.log under /u01/11.2.0/grid/log/test2
    2019-07-01 06:34:49.606:
    [client(75150)]CRS-0009:log file "/u01/11.2.0/grid/log/test2/client/olsnodes.log" reopened
    2019-07-01 06:34:49.606:
    [client(75150)]CRS-0019:file rotation terminated. log file: "/u01/11.2.0/grid/log/test2/client/olsnodes.log"
    2019-07-01 06:38:33.151:
    [/u01/11.2.0/grid/bin/orarootagent.bin(106660)]CRS-5822:Agent '/u01/11.2.0/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:5:52057} in /u01/11.2.0/grid/log/test2/agent/crsd/orarootagent_root//orarootagent_root.log.
    LFI-01523: rename() failed.

    2019-07-01 06:38:33.887:
    [ctssd(104917)]CRS-2405:The Cluster Time Synchronization Service on host test2 is shutdown by user
    2019-07-01 06:38:33.892:
    [mdnsd(103640)]CRS-5602:mDNS service stopping by request.
    2019-07-01 06:38:45.860:
    [cssd(103758)]CRS-1603:CSSD on node test2 shutdown by user.
    2019-07-01 06:38:45.970:
    [ohasd(103446)]CRS-2767:Resource state recovery not attempted for 'ora.cssdmonitor' as its target state is OFFLINE
    2019-07-01 06:38:46.064:
    [cssd(103758)]CRS-1660:The CSS daemon shutdown has completed
    2019-07-01 06:38:49.592:
    [gpnpd(103651)]CRS-2329:GPNPD on node test2 shutdown.
    2019-07-01 09:28:04.022:
    [ohasd(17090)]CRS-2112:The OLR service started on node test2.
    2019-07-01 09:28:04.069:
    [ohasd(17090)]CRS-1301:Oracle High Availability Service started on node test2.
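
The Grid alert log above writes each event as a timestamp line followed by a message line, which makes the shutdown sequence hard to scan. Below is a minimal sketch, assuming that two-line layout, that joins each pair and keeps only selected CRS codes. The function name `crs_timeline` and the default code list are illustrative choices, not part of any Oracle tooling.

```shell
#!/bin/sh
# crs_timeline: hypothetical helper. Reads a Grid alert log on stdin and prints
# "timestamp message" on one line for the selected CRS codes.
# Assumption: every event is a timestamp line ("YYYY-MM-DD HH:MM:SS.mmm:")
# followed by its message line, as in the excerpt above.
crs_timeline() {
    awk -v codes="${1:-CRS-2405|CRS-1603|CRS-1660|CRS-2329|CRS-1301}" '
        /^[ \t]*[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9:.]+:[ \t]*$/ {
            ts = $1 " " $2          # remember the most recent timestamp
            sub(/:$/, "", ts)       # drop the trailing colon
            next
        }
        $0 ~ codes {
            line = $0
            sub(/^[ \t]+/, "", line) # strip leading indentation
            print ts " " line
        }
    '
}
```

It could be run as `crs_timeline < /u01/11.2.0/grid/log/test2/alertbapdb2.log`, or with a different code list as the first argument, for example `crs_timeline 'CRS-5822|CRS-2405'`.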

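    RAC nodes depend on several prerequisites to communicate: time synchronization, the disk heartbeat, and the network (interconnect) heartbeat. None of them can be missing.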

    ---Node 1 database alert log

    Mon Jul 01 06:38:24 2019
    Reconfiguration started (old inc 16, new inc 18)
    List of instances:
    1 (myinst: 1)
    Global Resource Directory frozen
    * dead instance detected - domain 0 invalid = TRUE
    Communication channels reestablished
    Master broadcasted resource hash value bitmaps
    Non-local Process blocks cleaned out
    Mon Jul 01 06:38:25 2019
    LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
    Mon Jul 01 06:38:25 2019
    LMS 3: 2 GCS shadows cancelled, 1 closed, 0 Xw survived
    Mon Jul 01 06:38:25 2019
    Mon Jul 01 06:38:25 2019
    LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
    LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
    Mon Jul 01 06:38:36 2019
    Set master node info
    Submitted all remote-enqueue requests
    Dwn-cvts replayed, VALBLKs dubious
    All grantable enqueues granted
    Post SMON to start 1st pass IR
    Mon Jul 01 06:38:36 2019
    Instance recovery: looking for dead threads
    Beginning instance recovery of 1 threads
    Mon Jul 01 06:38:52 2019
    parallel recovery started with 32 processes
    Started redo scan
    Completed redo scan
    read 12123 KB redo, 6138 data blocks need recovery
    Mon Jul 01 06:38:55 2019
    Submitted all GCS remote-cache requests
    Post SMON to start 1st pass IR
    Fix write in gcs resources
    Mon Jul 01 06:39:07 2019
    Reconfiguration complete
    Mon Jul 01 06:39:32 2019
    Started redo application at
    Thread 2: logseq 218275, block 1708335
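
To see how long the surviving instance took to reconfigure, the two markers in the node 1 log can be subtracted. Below is a minimal sketch; the function name `reconfig_seconds` is hypothetical, and it assumes timestamp lines of the form `Mon Jul 01 06:38:24 2019` with both events on the same calendar day.

```shell
#!/bin/sh
# reconfig_seconds: hypothetical helper. Reads a database alert log on stdin and
# prints the seconds between "Reconfiguration started" and
# "Reconfiguration complete".
# Assumptions: timestamp lines look like "Mon Jul 01 06:38:24 2019" and both
# events fall on the same day (only HH:MM:SS is compared).
reconfig_seconds() {
    awk '
        # A timestamp line has exactly 5 fields, the 4th being HH:MM:SS.
        NF == 5 && $4 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ {
            split($4, t, ":")
            now = t[1] * 3600 + t[2] * 60 + t[3]
            next
        }
        /Reconfiguration started/ { start = now; seen = 1 }
        /Reconfiguration complete/ && seen { print now - start; seen = 0 }
    '
}
```

For the excerpt above (started 06:38:24, complete 06:39:07) this prints 43.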

    ---Cause:
    2019-07-01 06:38:33.887:
    [ctssd(104917)]CRS-2405:The Cluster Time Synchronization Service on host test2 is shutdown by user

    In other words, the Cluster Time Synchronization Service on host test2 was shut down by the user.

    The BIOS (hardware clock) times of the hosts are inconsistent:

    [oracle@test1 ~]$ su - root
    Password:
    [root@test1 ~]# hwclock
    Mon 01 Jul 2019 11:27:27 AM CST -0.485777 seconds
    [root@test1 ~]# date
    Mon Jul 1 10:44:03 CST 2019

    [root@test2 ~]# hwclock
    Mon 01 Jul 2019 10:42:33 AM CST -0.219479 seconds
    [root@test2 ~]# date
    Mon Jul 1 10:42:36 CST 2019
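
The gap between the hardware clock and the system clock can be put into numbers. Below is a minimal sketch; `clock_gap_seconds` is a hypothetical helper, it assumes GNU `date` (for the `-d` option), and it expects both values rewritten as `YYYY-MM-DD HH:MM:SS` in the same time zone.

```shell
#!/bin/sh
# clock_gap_seconds: hypothetical helper. Prints (hardware clock - system clock)
# in seconds, given two "YYYY-MM-DD HH:MM:SS" strings.
# Assumptions: GNU date is available; both values are in the same time zone,
# so parsing them as UTC does not change the difference.
clock_gap_seconds() {
    hw=$(date -u -d "$1" +%s) || return 1
    sys=$(date -u -d "$2" +%s) || return 1
    echo $((hw - sys))
}

# Values taken from the output above:
# clock_gap_seconds '2019-07-01 11:27:27' '2019-07-01 10:44:03'   # test1
# clock_gap_seconds '2019-07-01 10:42:33' '2019-07-01 10:42:36'   # test2
```

With the values shown above, test1's hardware clock is 2604 seconds (about 43 minutes) ahead of its system clock, while on test2 the two readings differ by only 3 seconds.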


    --Time synchronization configuration

    --Node 1: cat /etc/ntp.conf

    server pbsntp01.sx.com iburst
    server pbsntp02.sx.com iburst


    --Node 2, after modification: cat /etc/ntp.conf
    server 10.0.10.2 iburst
    #server pbsntp02.sx.com iburst
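
Because the two nodes list different time sources here, it can help to print only the active entries on each node and compare them. Below is a minimal sketch; `ntp_servers` is a hypothetical helper that ignores commented-out lines.

```shell
#!/bin/sh
# ntp_servers: hypothetical helper. Reads an ntp.conf on stdin and prints the
# hosts from active (uncommented) "server" lines, one per line, sorted.
ntp_servers() {
    # "#server ..." has "#server" as its first field, so it is skipped.
    awk '$1 == "server" { print $2 }' | sort
}
```

Running `ntp_servers < /etc/ntp.conf` on each node and comparing the two lists shows at a glance whether both nodes point at the same servers.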

    --Fix:

    hwclock  -w

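    If it is not convenient to fix the time right away, the change can be made from a scheduled job using a script like the following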

    --As the root user
    vi hwclock.sh

    #!/bin/bash
    # Refresh the BIOS (hardware) clock from the system clock and log each step.
    exec >> /var/log/hwclock$(date +%y%m%d%H).log 2>&1
    echo "System date:"
    date
    sleep 3
    echo "Hardware clock before the update:"
    hwclock
    sleep 3
    echo "Writing the system time to the hardware clock:"
    hwclock -w
    sleep 3
    # Check again
    echo "Hardware clock after the update:"
    hwclock
    sleep 3
    date

    [oracle@test1 ~]$ ntpq -p
         remote           refid      st t when poll reach   delay   offset  jitter
    ==============================================================================
    *1.0.10.250      19.19.24.22      3 u   19  256  377    0.436   -3.817   6.576
    [oracle@test1 ~]$ ntpq -p
         remote           refid      st t when poll reach   delay   offset  jitter
    ==============================================================================
    *1.0.10.250      19.19.24.22      3 u   32  256  377    0.436   -3.817   6.576


    [oracle@test2 ~]$ ntpq -p
         remote           refid      st t when poll reach   delay   offset  jitter
    ==============================================================================
    *1.0.10.250      12.2.15.2        3 u   61  256  377    0.386    2.638   7.938
    [oracle@test2 ~]$ ntpq -p
         remote           refid      st t when poll reach   delay   offset  jitter
    ==============================================================================
    *1.0.10.250      12.2.15.2        3 u   62  256  377    0.386    2.638   7.938

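    remote: the name (or address) of the NTP server that answered the request. A leading * marks the server currently selected for synchronization.
    refid: the upstream time source that the NTP server itself is synchronized to.
    st: the stratum of the remote server. NTP is hierarchical: top-level servers, then layers of relay servers, then clients. Stratum runs from 1 (directly attached to a reference clock) to 15, and 16 means unsynchronized. To limit load and network congestion, clients should generally avoid connecting directly to stratum 1 servers.
    when: the number of seconds since the last response was received from the server.
    poll: the interval, in seconds, between polls of the remote server. When NTP first starts, the poll value is small, so the host polls frequently and converges on the correct time quickly; the value then grows and polling becomes less frequent.
    reach: an octal value showing whether the server can be reached. It is an 8-bit shift register covering the last eight polls, with a bit set for each reply received, so 377 means all of the last eight polls succeeded.
    delay: the round trip time, in milliseconds, of a request from the local host to the NTP server.
    offset: the time difference, in milliseconds (ms), between the host and the time source it synchronizes with. The closer offset is to 0, the closer the host's time is to the NTP server's.
    jitter: a statistical value describing how much offset varies over a number of consecutive samples. Put simply, the smaller this value, the more stable the host's time.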
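
Since reach is printed in octal, it is easier to read once converted into a count of successful polls. Below is a minimal sketch; `reach_successes` is a hypothetical helper that counts the set bits in the value.

```shell
#!/bin/sh
# reach_successes: hypothetical helper. Takes the octal "reach" value from
# ntpq -p and prints how many of the last 8 polls received a reply.
reach_successes() {
    value=$((0$1))      # a leading 0 makes the shell read the digits as octal
    count=0
    while [ "$value" -gt 0 ]; do
        count=$((count + (value & 1)))   # add the lowest bit
        value=$((value >> 1))            # move on to the next poll
    done
    echo "$count"
}
```

For the value 377 shown on both nodes, `reach_successes 377` prints 8, meaning every one of the last eight polls was answered.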


    ----The key thing to watch is whether offset keeps growing on the local host; a value within 100 (ms) means there is no problem
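
The offset check can be automated rather than read by eye. Below is a minimal sketch; `offset_ok` is a hypothetical helper that assumes the standard ten-column `ntpq -p` layout (offset in the ninth column) and uses the 100 ms limit mentioned above as its default.

```shell
#!/bin/sh
# offset_ok: hypothetical helper. Reads `ntpq -p` output on stdin, finds the
# currently selected peer (the line starting with "*"), and prints
# "OK <offset>" or "WARN <offset>" depending on a limit in milliseconds.
# Assumption: offset is the 9th whitespace-separated column.
offset_ok() {
    awk -v limit="${1:-100}" '
        $1 ~ /^\*/ {
            offset = $9 + 0
            abs = (offset < 0) ? -offset : offset
            status = (abs < limit + 0) ? "OK" : "WARN"
            print status, offset
            found = 1
        }
        END { if (!found) { print "NOPEER"; exit 1 } }
    '
}
```

A typical call would be `ntpq -p | offset_ok`, or `ntpq -p | offset_ok 50` for a stricter limit. If no peer is currently selected, it prints `NOPEER` and returns a non-zero status.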

    ---A scheduled job can be added

    #ntpd: record ntpq output once an hour
    0 */1 * * * /bin/sh /home/oracle/shell/ntpq.sh > /dev/null 2>&1

    vi ntpq.sh

    #!/bin/bash
    source /home/oracle/.bash_profile
    exec >> /home/oracle/shell/ntpq_`date +%y%m%d%H`.log
    ntpq -p
    sleep 3
    ntpq -p

    chmod +x ntpq.sh

    ---Actual cause of this outage: a power supply module failure on the host (an OS/hardware-level problem).

  • Original article: https://www.cnblogs.com/ss-33/p/11113335.html