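This test simulates the private NIC going down and analyzes the resulting RAC node eviction.
For background, see the My Oracle Support note "导致实例逐出的五大问题" (Top 5 Issues That Cause Instance Eviction, Doc ID 1526186.1).
Check the cluster resources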
[qdtais1]@ht01[/home/oracle]$crsctl status res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       ht01
               ONLINE  ONLINE       ht02
ora.LISTENER.lsnr
               ONLINE  ONLINE       ht01
               ONLINE  ONLINE       ht02
ora.OCR.dg
               ONLINE  ONLINE       ht01
               ONLINE  ONLINE       ht02
ora.asm
               ONLINE  ONLINE       ht01                     Started
               ONLINE  ONLINE       ht02                     Started
ora.gsd
               OFFLINE OFFLINE      ht01
               OFFLINE OFFLINE      ht02
ora.net1.network
               ONLINE  ONLINE       ht01
               ONLINE  ONLINE       ht02
ora.ons
               ONLINE  ONLINE       ht01
               ONLINE  ONLINE       ht02
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       ht01
ora.cvu
      1        ONLINE  ONLINE       ht01
ora.ht01.vip
      1        ONLINE  ONLINE       ht01
ora.ht02.vip
      1        ONLINE  ONLINE       ht02
ora.oc4j
      1        ONLINE  ONLINE       ht01
ora.qdtais.db
      1        ONLINE  ONLINE       ht01                     Open
      2        ONLINE  ONLINE       ht02                     Open
ora.scan1.vip
      1        ONLINE  ONLINE       ht01
ora.yz.db
      1        OFFLINE OFFLINE                               Instance Shutdown
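It helps to record which resources are already offline before the test, so they are not mistaken for effects of the eviction later. The following is a minimal sketch, assuming the usual `crsctl status res -t` layout (a resource name on its own line, followed by indented state lines); the function name `offline_resources` and the abbreviated sample are illustrative only.

```python
# Abbreviated sample in the layout printed by "crsctl status res -t".
SAMPLE = """\
ora.asm
               ONLINE  ONLINE       ht01                     Started
ora.gsd
               OFFLINE OFFLINE      ht01
               OFFLINE OFFLINE      ht02
ora.qdtais.db
      1        ONLINE  ONLINE       ht01                     Open
      2        ONLINE  ONLINE       ht02                     Open
ora.yz.db
      1        OFFLINE OFFLINE                               Instance Shutdown
"""


def offline_resources(text):
    """Return the sorted names of resources with any state line that is not ONLINE."""
    offline = set()
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if line.startswith("ora."):
            current = fields[0]          # resource name line
        elif current and line[0].isspace():
            if fields[0].isdigit():      # cluster resources carry a cardinality id
                fields = fields[1:]
            # fields[0] is TARGET, fields[1] is STATE
            if len(fields) >= 2 and fields[1] != "ONLINE":
                offline.add(current)
    return sorted(offline)


print(offline_resources(SAMPLE))
```

On the sample this reports ora.gsd and ora.yz.db, which matches the two OFFLINE entries in the output above.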
Check the hosts file and NIC information
[qdtais1]@ht01[/home/oracle]$ifconfig
eth0      Link encap:Ethernet  HWaddr 08:00:27:D0:2C:DC
          inet addr:10.0.2.15  Bcast:10.0.2.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fed0:2cdc/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:60 errors:0 dropped:0 overruns:0 frame:0
          TX packets:154 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:7317 (7.1 KiB)  TX bytes:20671 (20.1 KiB)

eth1      Link encap:Ethernet  HWaddr 08:00:27:D7:4E:75
          inet addr:192.168.20.200  Bcast:192.168.20.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fed7:4e75/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:7909 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6555 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:912161 (890.7 KiB)  TX bytes:712119 (695.4 KiB)

eth1:1    Link encap:Ethernet  HWaddr 08:00:27:D7:4E:75
          inet addr:192.168.20.204  Bcast:192.168.20.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

eth1:3    Link encap:Ethernet  HWaddr 08:00:27:D7:4E:75
          inet addr:192.168.20.202  Bcast:192.168.20.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

eth2      Link encap:Ethernet  HWaddr 08:00:27:BB:03:40
          inet addr:192.168.0.10  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:febb:340/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1407822 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1092372 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1046688365 (998.1 MiB)  TX bytes:606254225 (578.1 MiB)

eth2:1    Link encap:Ethernet  HWaddr 08:00:27:BB:03:40
          inet addr:169.254.67.75  Bcast:169.254.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:265652 errors:0 dropped:0 overruns:0 frame:0
          TX packets:265652 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:143272867 (136.6 MiB)  TX bytes:143272867 (136.6 MiB)

[qdtais1]@ht01[/home/oracle]$cat /etc/hosts
127.0.0.1       localhost localhost.localdomain localhost4 localhost4.localdomain4
::1             localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.20.200  ht01
192.168.20.201  ht02
192.168.0.10    ht01-priv1
192.168.0.20    ht02-priv1
192.168.20.202  ht01-vip
192.168.20.203  ht02-vip
192.168.20.204  ht-scanip
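The eth2:1 address 169.254.67.75 does not appear in /etc/hosts. It falls in the link-local range 169.254.0.0/16, which is consistent with an address assigned by Grid Infrastructure's HAIP feature on top of the private interconnect (an interpretation, since the output itself does not say so). A small sketch using Python's standard `ipaddress` module makes the distinction explicit; the labels in `ADDRESSES` and the helper name `classify` are illustrative.

```python
import ipaddress

# Addresses copied from the ifconfig and /etc/hosts output above.
ADDRESSES = {
    "eth1 (ht01)": "192.168.20.200",
    "eth1:1 (ht-scanip)": "192.168.20.204",
    "eth1:3 (ht01-vip)": "192.168.20.202",
    "eth2 (ht01-priv1)": "192.168.0.10",
    "eth2:1": "169.254.67.75",
}


def classify(addr):
    """Classify an IPv4 address as link-local, private, or public."""
    ip = ipaddress.ip_address(addr)
    # Check link-local first: is_private is also true for 169.254.0.0/16.
    if ip.is_link_local:
        return "link-local"
    if ip.is_private:
        return "private"
    return "public"


for label, addr in ADDRESSES.items():
    print(f"{label:20s} {addr:16s} {classify(addr)}")
```

Only eth2:1 is classified as link-local; this is the same address that the database alert log later reports as DOWN.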
Shut down the private heartbeat NIC eth2 on node 1
[root@ht01 ~]# ifconfig eth2 down
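Check the NIC information again. eth2 and eth2:1 no longer show the UP and RUNNING flags, and eth1:2 now carries 192.168.20.203 (ht02-vip), which has failed over to node 1.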
[root@ht01 ~]# ifconfig -a
eth0      Link encap:Ethernet  HWaddr 08:00:27:D0:2C:DC
          inet addr:10.0.2.15  Bcast:10.0.2.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fed0:2cdc/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:884 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1410 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:65160 (63.6 KiB)  TX bytes:869140 (848.7 KiB)

eth1      Link encap:Ethernet  HWaddr 08:00:27:D7:4E:75
          inet addr:192.168.20.200  Bcast:192.168.20.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fed7:4e75/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:8729 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7292 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:983762 (960.7 KiB)  TX bytes:817872 (798.7 KiB)

eth1:1    Link encap:Ethernet  HWaddr 08:00:27:D7:4E:75
          inet addr:192.168.20.204  Bcast:192.168.20.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

eth1:2    Link encap:Ethernet  HWaddr 08:00:27:D7:4E:75
          inet addr:192.168.20.203  Bcast:192.168.20.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

eth1:3    Link encap:Ethernet  HWaddr 08:00:27:D7:4E:75
          inet addr:192.168.20.202  Bcast:192.168.20.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

eth2      Link encap:Ethernet  HWaddr 08:00:27:BB:03:40
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:1414086 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1097177 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1051368691 (1002.6 MiB)  TX bytes:608947879 (580.7 MiB)

eth2:1    Link encap:Ethernet  HWaddr 08:00:27:BB:03:40
          inet addr:169.254.67.75  Bcast:169.254.255.255  Mask:255.255.0.0
          BROADCAST MULTICAST  MTU:1500  Metric:1

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:267864 errors:0 dropped:0 overruns:0 frame:0
          TX packets:267864 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:144385365 (137.6 MiB)  TX bytes:144385365 (137.6 MiB)
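The difference between the two ifconfig listings is in the flags line: an interface that is administratively down loses UP and RUNNING. The sketch below parses the classic net-tools layout and lists such interfaces. It assumes each interface block starts at column 0 and that the flags share a line with `MTU:`; the function names and the abbreviated sample are illustrative.

```python
# Abbreviated sample taken from the "ifconfig -a" output after eth2 was shut down.
SAMPLE = """\
eth1      Link encap:Ethernet  HWaddr 08:00:27:D7:4E:75
          inet addr:192.168.20.200  Bcast:192.168.20.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
eth2      Link encap:Ethernet  HWaddr 08:00:27:BB:03:40
          BROADCAST MULTICAST  MTU:1500  Metric:1
eth2:1    Link encap:Ethernet  HWaddr 08:00:27:BB:03:40
          inet addr:169.254.67.75  Bcast:169.254.255.255  Mask:255.255.0.0
          BROADCAST MULTICAST  MTU:1500  Metric:1
"""


def interface_flags(ifconfig_text):
    """Map each interface name to the set of flags on its MTU line."""
    flags = {}
    current = None
    for line in ifconfig_text.splitlines():
        if line and not line[0].isspace():
            current = line.split()[0]            # new interface block
            flags[current] = set()
        elif current and "MTU:" in line:
            flags[current] = set(line.split("MTU:")[0].split())
    return flags


def down_interfaces(ifconfig_text):
    """Return interfaces that lack the UP or RUNNING flag."""
    return sorted(
        name
        for name, f in interface_flags(ifconfig_text).items()
        if not {"UP", "RUNNING"} <= f
    )


print(down_interfaces(SAMPLE))
```

For the sample this lists eth2 and its alias eth2:1, the two entries that lost their flags in the output above.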
Log analysis
Check the Oracle alert log on node 1
Thu Mar 26 10:55:29 2020
SKGXP: ospid 4149: network interface with IP address 169.254.67.75 no longer running (check cable)   --- the private interconnect address is no longer running
SKGXP: ospid 4149: network interface with IP address 169.254.67.75 is DOWN
Thu Mar 26 10:55:47 2020
Reconfiguration started (old inc 4, new inc 6)   --- reconfiguration starts and resources are remastered
List of instances:
 1 (myinst: 1)
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Thu Mar 26 10:55:47 2020
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
Thu Mar 26 10:55:47 2020
minact-scn: Inst 1 is now the master inc#:6 mmon proc-id:4198 status:0x7   -- instance 1 is now the master
minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x0000.0014ed0a gcalc-scn:0x0000.0014ed15
minact-scn: master found reconf/inst-rec before recscn scan old-inc#:6 new-inc#:6
Thu Mar 26 10:55:47 2020
Instance recovery: looking for dead threads
 Submitted all GCS remote-cache requests
 Post SMON to start 1st pass IR
 Fix write in gcs resources
Reconfiguration complete
Beginning instance recovery of 1 threads   -- instance 1 starts recovering the redo of node 2 (thread 2)
Started redo scan
Completed redo scan
 read 0 KB redo, 0 data blocks need recovery
Started redo application at
 Thread 2: logseq 13, block 47971, scn 1371433
Recovery of Online Redo Log: Thread 2 Group 3 Seq 13 Reading mem 0
  Mem# 0: +DATA/qdtais/onlinelog/group_3.268.1023987437
  Mem# 1: +DATA/qdtais/onlinelog/group_3.269.1023987441
Completed redo application of 0.00MB
Completed instance recovery at   -- redo recovery is complete
 Thread 2: logseq 13, block 47971, scn 1391434
 0 data blocks read, 0 data blocks written, 0 redo k-bytes read
Thread 2 advanced to log sequence 14 (thread recovery)
minact-scn: master continuing after IR
Thu Mar 26 10:56:47 2020
Decreasing number of real time LMS from 1 to 0
Thu Mar 26 11:01:51 2020
db_recovery_file_dest_size of 4407 MB is 5.08% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Check the grid log on node 1
2020-03-26 10:55:29.454: [cssd(3278)]CRS-1612:Network communication with node ht02 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.180 seconds   --- network communication with node 2 is timing out
2020-03-26 10:55:36.456: [cssd(3278)]CRS-1611:Network communication with node ht02 (2) missing for 75% of timeout interval. Removal of this node from cluster in 7.180 seconds
2020-03-26 10:55:41.458: [cssd(3278)]CRS-1610:Network communication with node ht02 (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.170 seconds
2020-03-26 10:55:43.636: [cssd(3278)]CRS-1607:Node ht02 is being evicted in cluster incarnation 480633263; details at (:CSSNM00007:) in /u01/app/grid/log/ht01/cssd/ocssd.log.   --- node 2 is evicted from the cluster
2020-03-26 10:55:45.815: [cssd(3278)]CRS-1625:Node ht02, number 2, was manually shut down   -- the cluster stack on node 2 has been shut down
2020-03-26 10:55:45.821: [cssd(3278)]CRS-1601:CSSD Reconfiguration complete. Active nodes are ht01 .   -- CSSD finishes reconfiguring cluster membership; only ht01 remains active
2020-03-26 10:55:45.834: [ctssd(3421)]CRS-2407:The new Cluster Time Synchronization Service reference node is host ht01.
2020-03-26 10:55:57.079: [crsd(3564)]CRS-5504:Node down event reported for node 'ht02'.
2020-03-26 10:56:00.027: [crsd(3564)]CRS-2773:Server 'ht02' has been removed from pool 'Generic'.
2020-03-26 10:56:00.033: [crsd(3564)]CRS-2773:Server 'ht02' has been removed from pool 'ora.qdtais'.
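The three warnings (CRS-1612, CRS-1611, CRS-1610) form a countdown: adding the "Removal ... in N seconds" value to each message's timestamp should give roughly the same eviction time. The sketch below checks this against the CRS-1607 timestamp. `MISSCOUNT = 30` is an assumption (the usual default CSS misscount); the configured value is not shown in these logs, and the function names are illustrative.

```python
import re
from datetime import datetime, timedelta

MISSCOUNT = 30.0  # assumption: default CSS misscount in seconds, not shown in the logs

# Warning lines copied from the node 1 grid log above.
WARNINGS = [
    "2020-03-26 10:55:29.454: [cssd(3278)]CRS-1612:Network communication with node ht02 (2) "
    "missing for 50% of timeout interval. Removal of this node from cluster in 14.180 seconds",
    "2020-03-26 10:55:36.456: [cssd(3278)]CRS-1611:Network communication with node ht02 (2) "
    "missing for 75% of timeout interval. Removal of this node from cluster in 7.180 seconds",
    "2020-03-26 10:55:41.458: [cssd(3278)]CRS-1610:Network communication with node ht02 (2) "
    "missing for 90% of timeout interval. Removal of this node from cluster in 2.170 seconds",
]
# Timestamp of the CRS-1607 eviction message in the same log.
EVICTED_AT = datetime(2020, 3, 26, 10, 55, 43, 636000)

PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): .*missing for (\d+)% of timeout "
    r"interval\.\s+Removal of this node from cluster in ([\d.]+) seconds"
)


def parse_warning(line):
    """Return (timestamp, percent elapsed, seconds remaining) for a warning line."""
    m = PATTERN.match(line)
    ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
    return ts, int(m.group(2)), float(m.group(3))


def predicted_eviction(line):
    """Timestamp of the warning plus the remaining seconds it reports."""
    ts, _, remaining = parse_warning(line)
    return ts + timedelta(seconds=remaining)


for line in WARNINGS:
    ts, pct, remaining = parse_warning(line)
    nominal = MISSCOUNT * (100 - pct) / 100   # seconds left if misscount were exactly 30
    drift = (predicted_eviction(line) - EVICTED_AT).total_seconds()
    print(f"{pct}%: logged {remaining:.3f}s, nominal {nominal:.1f}s, "
          f"prediction differs from CRS-1607 by {drift:+.3f}s")
```

All three predictions land within about 10 ms of the logged eviction time, and the remaining-time values sit slightly below the nominal 15, 7.5 and 3 seconds, which is consistent with (but does not prove) a 30-second misscount.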
Check the grid log on node 2
2020-03-26 10:55:28.379: [cssd(3208)]CRS-1612:Network communication with node ht01 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.800 seconds   --- network communication with node 1 is timing out
2020-03-26 10:55:36.384: [cssd(3208)]CRS-1611:Network communication with node ht01 (1) missing for 75% of timeout interval. Removal of this node from cluster in 6.790 seconds
2020-03-26 10:55:40.385: [cssd(3208)]CRS-1610:Network communication with node ht01 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.790 seconds
2020-03-26 10:55:43.180: [cssd(3208)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /u01/app/grid/log/ht02/cssd/ocssd.log.
2020-03-26 10:55:43.180: [cssd(3208)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/grid/log/ht02/cssd/ocssd.log   -- the CSS daemon is forced to terminate
2020-03-26 10:55:43.222: [cssd(3208)]CRS-1652:Starting clean up of CRSD resources.   -- CRSD resources are cleaned up
2020-03-26 10:55:44.259: [cssd(3208)]CRS-1608:This node was evicted by node 1, ht01; details at (:CSSNM00005:) in /u01/app/grid/log/ht02/cssd/ocssd.log.
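Both nodes lose the heartbeat at the same moment, yet it is ht02 that goes down (CRS-1608: evicted by node 1), even though the NIC was shut down on ht01. A commonly described split-brain rule for this release family is that the larger subcluster survives and, when the subclusters are the same size, the one containing the lowest node number survives. The sketch below models only that rule. It is a simplification under that assumption, not Oracle's implementation: it ignores voting-disk access and any weighting used by later releases, and the function name is illustrative.

```python
def surviving_subcluster(subclusters):
    """Pick the survivor: largest subcluster, ties broken by lowest node number."""
    return max(subclusters, key=lambda nodes: (len(nodes), -min(nodes)))


# After eth2 goes down, node 1 and node 2 can no longer see each other.
split = [frozenset({1}), frozenset({2})]
survivor = surviving_subcluster(split)
evicted = sorted(n for sub in split if sub != survivor for n in sub)

print("survivor:", sorted(survivor))
print("evicted: ", evicted)
```

With two single-node subclusters the tie-break selects node 1, so node 2 is the one evicted, which matches the CRS-1607 and CRS-1608 messages in the two grid logs.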
Check the Oracle alert log on node 2
Thu Mar 26 10:55:45 2020
NOTE: ASMB terminating   -- the ASMB process terminates, which crashes the database instance
Errors in file /u01/app/db/diag/rdbms/qdtais/qdtais2/trace/qdtais2_asmb_3974.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 32 Serial number: 3
Errors in file /u01/app/db/diag/rdbms/qdtais/qdtais2/trace/qdtais2_asmb_3974.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 32 Serial number: 3
ASMB (ospid: 3974): terminating the instance due to error 15064
Instance terminated by ASMB, pid = 3974
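The four logs use two timestamp formats, which makes the order of events hard to follow by eye. As a rough aid, the sketch below merges a handful of key events (copied from the logs above, with shortened messages) into one timeline. The source labels and function names are illustrative, and the database alert log only has one-second resolution, so ordering within the same second is approximate.

```python
from datetime import datetime

# Grid logs use the first format, database alert logs the second.
FORMATS = ("%Y-%m-%d %H:%M:%S.%f", "%a %b %d %H:%M:%S %Y")


def parse_ts(text):
    """Parse a timestamp in either of the two log formats."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognised timestamp: {text!r}")


# (timestamp, source, shortened message) taken from the logs above.
EVENTS = [
    ("Thu Mar 26 10:55:29 2020", "ht01 db", "SKGXP: interface 169.254.67.75 is DOWN"),
    ("2020-03-26 10:55:29.454", "ht01 grid", "CRS-1612 heartbeat to ht02 missing (50%)"),
    ("2020-03-26 10:55:28.379", "ht02 grid", "CRS-1612 heartbeat to ht01 missing (50%)"),
    ("2020-03-26 10:55:43.180", "ht02 grid", "CRS-1609 node going down"),
    ("2020-03-26 10:55:43.636", "ht01 grid", "CRS-1607 ht02 is being evicted"),
    ("2020-03-26 10:55:44.259", "ht02 grid", "CRS-1608 evicted by node 1"),
    ("Thu Mar 26 10:55:45 2020", "ht02 db", "ASMB terminating, instance terminated"),
    ("2020-03-26 10:55:45.821", "ht01 grid", "CRS-1601 reconfiguration complete"),
    ("Thu Mar 26 10:55:47 2020", "ht01 db", "Reconfiguration started, instance recovery"),
]


def timeline(events):
    """Return the events sorted by parsed timestamp."""
    return sorted((parse_ts(ts), src, msg) for ts, src, msg in events)


for when, src, msg in timeline(EVENTS):
    print(when.strftime("%H:%M:%S.%f")[:-3], f"{src:10s}", msg)
```

Read in order, the timeline shows the heartbeat warnings on both nodes, the eviction of ht02 about 14 seconds later, the crash of instance qdtais2 through ASMB, and finally the reconfiguration and instance recovery on ht01.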