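After two rounds of testing, RAC feels quite fragile.
1. Unplugged the public network cable on RAC1 and watched from beside RAC2: the VIP failed over to RAC2 quickly, and users could keep working.
2. One minute later, RAC2 rebooted by itself. The cause appeared to be that the shared disk could not be mounted. A colleague was configuring the SAN at the time, so I could not tell whether the shared disk itself was really at fault.
3. Went for a harsher test: pulled the power on both DB servers, plugged it back in, and booted them again. CRS would not start.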
[root@racdb02 install]# crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.
4. Tried many fixes found online; none worked, and Metalink had nothing suitable either.
5. Ran root102.sh on each of the two nodes
[oracle@racdb01 ~]$ /u01/oracle/product/10.2/crs1/install/root102.sh
6. Rebooted both nodes
7. [root@racdb01 oracle]# crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora.rac.db application ONLINE UNKNOWN racdb02
ora....s1.inst application ONLINE UNKNOWN racdb01
ora....s2.inst application ONLINE UNKNOWN racdb02
ora....esdb.cs application ONLINE OFFLINE
ora....es1.srv application ONLINE OFFLINE
ora....es2.srv application ONLINE OFFLINE
ora....01.lsnr application ONLINE ONLINE racdb01
ora....b01.gsd application ONLINE ONLINE racdb01
ora....b01.ons application ONLINE ONLINE racdb01
ora....b01.vip application ONLINE ONLINE racdb01
ora....02.lsnr application ONLINE ONLINE racdb02
ora....b02.gsd application ONLINE UNKNOWN racdb02
ora....b02.ons application ONLINE UNKNOWN racdb02
ora....b02.vip application ONLINE ONLINE racdb02
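With this many resources it is easy to miss one that is not healthy. A minimal sketch for filtering the listing, assuming the five-column `crs_stat -t` layout shown above (the function name `crs_not_online` is made up for illustration):

```shell
# Print the name and state of every resource whose State column is not ONLINE.
# Reads `crs_stat -t` output on stdin; the header and separator lines are
# skipped because their second field is not "application".
crs_not_online() {
  awk '$2 == "application" && $4 != "ONLINE" { print $1, $4 }'
}
```

Usage would be `crs_stat -t | crs_not_online`. Note that `crs_stat -t` abbreviates resource names, so the output shows the same truncated names as the listing.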
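8. Tried to remove the instances, the racdb service, and the listeners; all three attempts failed
9. crs_start -all
The services that could not start before still would not start.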
10. Uninstalled CRS, reinstalled it, and upgraded to 1204 (presumably patchset 10.2.0.4)
11. All CRS services started normally
12. srvctl add database -d rac -o /u01/oracle/product/10.2/db1/
13. srvctl add instance -d rac -i rac1 -n racdb01
srvctl add instance -d rac -i rac2 -n racdb02
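Re-registering the database and its instances by hand is repetitive. As a sketch only, the helper below prints the same `srvctl` commands for a list of instance:node pairs without running them, so they can be reviewed first (the function name and the `instance:node` argument format are my own assumptions, not part of srvctl):

```shell
# Print (do not execute) the srvctl commands that register a database and
# one instance per node.
# Arguments: database name, Oracle home, then any number of instance:node pairs.
print_srvctl_registration() {
  db=$1
  home=$2
  shift 2
  echo "srvctl add database -d $db -o $home"
  for pair in "$@"; do
    inst=${pair%%:*}   # text before the first colon
    node=${pair#*:}    # text after the first colon
    echo "srvctl add instance -d $db -i $inst -n $node"
  done
}
```

For the cluster in this post the call would be `print_srvctl_registration rac /u01/oracle/product/10.2/db1/ rac1:racdb01 rac2:racdb02`.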
14. Rebooted both nodes
15. rac2 started normally, but the rac1 instance would not start
[oracle@racdb01 ~]$ crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora.rac.db application ONLINE ONLINE racdb02
ora....s1.inst application OFFLINE OFFLINE
ora....s2.inst application ONLINE ONLINE racdb02
ora....01.lsnr application ONLINE ONLINE racdb01
ora....b01.gsd application ONLINE ONLINE racdb01
ora....b01.ons application ONLINE ONLINE racdb01
ora....b01.vip application ONLINE ONLINE racdb01
ora....02.lsnr application ONLINE ONLINE racdb02
ora....b02.gsd application ONLINE ONLINE racdb02
ora....b02.ons application ONLINE ONLINE racdb02
ora....b02.vip application ONLINE ONLINE racdb02
16. srvctl remove instance -d rac -i rac1
17. srvctl add instance -d rac -i rac1 -n racdb01
18. Tried to start the rac1 instance
[oracle@racdb01 ~]$ srvctl start instance -d rac -i rac1 -o mount;
PRKP-1001 : Error starting instance rac1 on node racdb01
CRS-1028: Dependency analysis failed because of:
CRS-0223: Resource 'ora.rac.rac1.inst' has placement error.
[oracle@racdb01 ~]$ crs_start ora.rac.rac1.inst
Attempting to start `ora.rac.rac1.inst` on member `racdb01`
`ora.rac.rac1.inst` on member `racdb01` has experienced an unrecoverable failure.
Human intervention required to resume its availability.
CRS-0215: Could not start resource 'ora.rac.rac1.inst'.
[oracle@racdb01 admin]$ crs_start ora.rac.rac1.inst
CRS-1028: Dependency analysis failed because of:
'Resource in UNKNOWN state: ora.rac.rac1.inst'
CRS-0223: Resource 'ora.rac.rac1.inst' has placement error.
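19. Kept searching online and fiddled with it for a long while; still no luck.
20. Found a Metalink note saying the problem was in tnsnames.ora
21. Checked my tnsnames.ora: the entry for the racdb service I had set up earlier (for transparent application failover) was still there, but that service had not been configured again after CRS was reinstalled. Deleted the entry.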
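A leftover entry like this is easy to overlook. A rough sketch of a check for whether a given alias is still defined in tnsnames.ora, assuming each entry starts at the beginning of a line as `ALIAS =` (the function name `tns_has_alias` is hypothetical, and an alias written with a domain suffix would not match):

```shell
# Exit status 0 if the tnsnames.ora file in $1 defines the alias in $2.
# The match is case-insensitive and anchored to the start of a line, so
# "RAC" does not match an entry named "RACDB".
tns_has_alias() {
  file=$1
  alias_name=$2
  grep -qiE "^${alias_name}[[:space:]]*=" "$file"
}
```

For example: `tns_has_alias "$ORACLE_HOME/network/admin/tnsnames.ora" racdb && echo "racdb entry still present"`. The path is the default location of the file; adjust it if TNS_ADMIN points elsewhere.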
22. [oracle@racdb01 admin]$ crs_start ora.rac.rac1.inst
Attempting to start `ora.rac.rac1.inst` on member `racdb01`
Start of `ora.rac.rac1.inst` on member `racdb01` succeeded.
23. [oracle@racdb01 admin]$ crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora.rac.db application ONLINE ONLINE racdb02
ora....s1.inst application ONLINE ONLINE racdb01
ora....s2.inst application ONLINE ONLINE racdb02
ora....01.lsnr application ONLINE ONLINE racdb01
ora....b01.gsd application ONLINE ONLINE racdb01
ora....b01.ons application ONLINE ONLINE racdb01
ora....b01.vip application ONLINE ONLINE racdb01
ora....02.lsnr application ONLINE ONLINE racdb02
ora....b02.gsd application ONLINE ONLINE racdb02
ora....b02.ons application ONLINE ONLINE racdb02
ora....b02.vip application ONLINE ONLINE racdb02
It finally started. Applause! This took two days of struggling.