During a round of system maintenance I tried to start the RAC environment, but the RAC services did not come up, and I found this error under /tmp:
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [No such device or address] [6]
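A couple of days earlier, while checking the backup logs, I had noticed an error when a channel was released. Closer inspection showed that one drive in the tape library had gone down, so the backup could run on only one channel, which is why the backup log contained errors. The log reads as follows: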
bash-3.00$ more /data/backup/backup_tradedb_081101.out
Script: /data/backup/backup_tradedb.sh
==== started on Sat Nov 1 23:00:00 CST 2008 ====
RMAN: /opt/oracle/product/10.2/database/bin/rman
ORACLE_SID: tradedb1
ORACLE_HOME: /opt/oracle/product/10.2/database
RMAN> 2> 3> 4> 5> 6> 7> 8> RMAN> 2> 3> 4> 5> 6> 7> 8> 9> RMAN> 2> 3> 4> RMAN>
Copyright (c) 1982, 2005, Oracle. All rights reserved.
connected to target database: TRADEDB (DBID=4181457554)
using target database control file instead of recovery catalog
RMAN> 2> 3> 4> 5> 6> 7> 8>
allocated channel: C1
channel C1: sid=112 instance=tradedb1 devtype=SBT_TAPE
channel C1: VERITAS NetBackup for Oracle - Release 6.0 (2006110304)
allocated channel: C2
channel C2: sid=146 instance=tradedb1 devtype=SBT_TAPE
channel C2: VERITAS NetBackup for Oracle - Release 6.0 (2006110304)
Starting backup at 01-NOV-08
input backupset count=842 stamp=669081253 creation_time=25-OCT-08
channel C2: starting piece 1 at 01-NOV-08
channel C2: backup piece /data/backup/tradedb/qaju2nl5_1_1
input backupset count=840 stamp=669080836 creation_time=25-OCT-08
channel C1: starting piece 1 at 01-NOV-08
channel C1: backup piece /data/backup/tradedb/q8ju2n84_1_1
piece handle=qaju2nl5_1_2 comment=API Version 2.0,MMS Version 5.0.0.0
channel C2: finished piece 1 at 01-NOV-08
channel C2: backup set complete, elapsed time: 00:03:35
deleted backup piece
backup piece handle=/data/backup/tradedb/qaju2nl5_1_1 recid=1446 stamp=669081254
input backupset count=841 stamp=669080836 creation_time=25-OCT-08
channel C2: starting piece 1 at 01-NOV-08
channel C2: backup piece /data/backup/tradedb/q9ju2n84_1_1
piece handle=q9ju2n84_1_2 comment=API Version 2.0,MMS Version 5.0.0.0
channel C2: finished piece 1 at 01-NOV-08
channel C2: backup set complete, elapsed time: 00:03:15
deleted backup piece
backup piece handle=/data/backup/tradedb/q9ju2n84_1_1 recid=1447 stamp=669080837
input backupset count=843 stamp=669081317 creation_time=25-OCT-08
channel C2: starting piece 1 at 01-NOV-08
channel C2: backup piece /data/backup/tradedb/qbju2nn5_1_1
piece handle=qbju2nn5_1_2 comment=API Version 2.0,MMS Version 5.0.0.0
channel C2: finished piece 1 at 01-NOV-08
channel C2: backup set complete, elapsed time: 00:11:46
deleted backup piece
backup piece handle=/data/backup/tradedb/qbju2nn5_1_1 recid=1448 stamp=669081317
input backupset count=844 stamp=669081317 creation_time=25-OCT-08
channel C2: starting piece 1 at 01-NOV-08
channel C2: backup piece /data/backup/tradedb/qcju2nn5_1_1
RMAN-03009: failure of backup command on C1 channel at 11/01/2008 23:27:19
ORA-19506: failed to create sequential file, name="q8ju2n84_1_2", parms=""
ORA-27028: skgfqcre: sbtbackup returned error
ORA-19511: Error received from media manager layer, error text:
VxBSACreateObject: Failed with error:
Server Status: network connection timed out
ORA-19600: input file is backup piece (/data/backup/tradedb/q8ju2n84_1_1)
ORA-19601: output file is backup piece (q8ju2n84_1_2)
channel C1 disabled, job failed on it will be run on another channel
piece handle=qcju2nn5_1_2 comment=API Version 2.0,MMS Version 5.0.0.0
channel C2: finished piece 1 at 01-NOV-08
channel C2: backup set complete, elapsed time: 00:21:41
deleted backup piece
backup piece handle=/data/backup/tradedb/qcju2nn5_1_1 recid=1449 stamp=669081322
input backupset count=840 stamp=669080836 creation_time=25-OCT-08
channel C2: starting piece 1 at 01-NOV-08
channel C2: backup piece /data/backup/tradedb/q8ju2n84_1_1
piece handle=q8ju2n84_1_2 comment=API Version 2.0,MMS Version 5.0.0.0
channel C2: finished piece 1 at 01-NOV-08
channel C2: backup set complete, elapsed time: 00:12:26
deleted backup piece
backup piece handle=/data/backup/tradedb/q8ju2n84_1_1 recid=1445 stamp=669080837
input backupset count=846 stamp=669083380 creation_time=26-OCT-08
.
.
.
channel C2: starting piece 1 at 02-NOV-08
channel C2: backup piece /data/backup/tradedb/qhju2q9f_1_1
piece handle=qhju2q9f_1_2 comment=API Version 2.0,MMS Version 5.0.0.0
channel C2: finished piece 1 at 02-NOV-08
channel C2: backup set complete, elapsed time: 00:08:56
deleted backup piece
backup piece handle=/data/backup/tradedb/qhju2q9f_1_1 recid=1454 stamp=669083952
Finished backup at 02-NOV-08
released channel: C1
released channel: C2
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of release command at 11/02/2008 00:44:39
RMAN-06012: channel: C1 not allocated
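A log this long is easier to triage if the error lines and the disabled channels are pulled out first. The following is a minimal sketch, assuming the log is a plain text file in the format shown above; the function names `rman_errors` and `disabled_channels` are illustrative and are not part of RMAN or NetBackup.

```shell
#!/usr/bin/env bash
# Sketch only: summarize failures in an RMAN log file.
# Function names are hypothetical helpers, not Oracle tooling.

# Print every line that starts with an RMAN-nnnnn or ORA-nnnnn code,
# skipping the RMAN-00571 / RMAN-00569 banner lines of the error stack.
rman_errors() {
    grep -E '^(RMAN|ORA)-[0-9]{5}:' "$1" | grep -Ev '^RMAN-005(69|71):'
}

# Print the name of each channel that RMAN disabled during the run,
# based on lines of the form "channel C1 disabled, ...".
disabled_channels() {
    sed -n 's/^channel \([^ ]*\) disabled.*/\1/p' "$1"
}
```

Run against the log above, `disabled_channels /data/backup/backup_tradedb_081101.out` would report C1, which explains the later RMAN-06012 on release: the channel had already been dropped when the media manager call timed out.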
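Bringing the drive up manually showed nothing abnormal, but as soon as a backup ran, the drive went down again. When I tried to change the drive's configuration, I found that NetBackup could not load the drive's original path at all, so a hardware fault looked like the probable cause.
The system maintenance staff then went on site and found that a fibre channel switch had failed, so they restarted it. The RAC environment also depends on this switch, but because RAC was configured with dual fibre channel switches, the RAC services were not stopped before the restart.
As it turned out, restarting the switch left one RAC node server temporarily unable to boot, and the other node server rebooted as well.
With the RAC environment completely down, I tried to start the RAC services on the node that could currently boot: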
# /etc/init.d/init.crs start
Startup will be queued to init within 30 seconds.
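Long after the start command there was still no response, and a check of the background processes showed that no Oracle instance had started. Something was clearly wrong, and a look in the /tmp directory turned up the error message quoted above: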
bash-3.00# cd /tmp
bash-3.00# ls
crsctl.4483 crsctl.4492 crsctl.4493 hsperfdata_noaccess hsperfdata_root ssh-sIvv2068
bash-3.00# ls -l
total 96
-rw-r--r-- 1 oracle oinstall 155 Nov 5 20:46 crsctl.4483
-rw-r--r-- 1 oracle oinstall 155 Nov 5 20:46 crsctl.4492
-rw-r--r-- 1 oracle oinstall 155 Nov 5 20:46 crsctl.4493
drwxr-xr-x 2 noaccess noaccess 178 Nov 5 19:53 hsperfdata_noaccess
drwxr-xr-x 2 root root 117 Nov 5 19:54 hsperfdata_root
drwx------ 2 root root 184 Nov 5 19:57 ssh-sIvv2068
bash-3.00# more crsctl.4483
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [No such device or address]
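The manual check above (list /tmp, then read each crsctl.* file) can be condensed into one step. This is a minimal sketch, assuming the failed startup leaves its messages in crsctl.* files as seen here; the helper name `find_ocr_errors` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: list the crsctl.* files in a directory (default /tmp)
# that contain a PROC-nn error code. The helper name is hypothetical.
find_ocr_errors() {
    # -l prints only the names of matching files; errors such as
    # "no such file" are suppressed when no crsctl.* file exists.
    grep -lE 'PROC-[0-9]+' "${1:-/tmp}"/crsctl.* 2>/dev/null
}
```

Calling `find_ocr_errors` with no argument scans /tmp; a non-empty result means at least one startup attempt failed while accessing the OCR.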
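Oracle's shared storage is managed by Veritas Cluster Volume Manager (CVM), and the node that was down was the CVM master node. On the current node, however, the OCR raw device, the voting disk raw device, and the raw devices for all control files, log files, data files, and the parameter file were visible, and their access paths were all normal. So why did this error still occur?
A search of MetaLink suggested that this might be the problem described in Bug No. 3613622: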
The problem here is that no node cannot rely on its perception of the network, since the network may be broken in an undetectable manner, so the node must have access to the voting disk. When access to the voting disk is lost, or the I/O takes 'too long', the node must fail.
When Veritas CVM runs with Vendor Clusterware, then the Vendor Clusterware is the primary driver of node reconfiguration, not the miss count setting of CSS. As John mentioned above, on Sun Cluster by default CSS tolerates up to almost 10 minutes of Veritas CVM I/O suspension. It is Veritas's problem to fix.
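So the problem was very likely caused by Veritas CVM. After a while, RAC on this node did in fact start. However, node 1 happened to boot correctly at about the same time, so it is hard to say whether the problem disappeared because the master node came back or because the wait had exceeded 10 minutes.
I am recording the problem for now and will verify the cause if the opportunity arises.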
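One way to separate the two explanations next time would be to record how long the storage takes to become usable after the start command. The sketch below polls an arbitrary check command and reports the elapsed time; the function name `wait_until`, its arguments, and the choice of check command are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: rerun a check command until it succeeds or a timeout
# (in seconds) expires, and report how long that took.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command> [args...]
wait_until() {
    local timeout=$1 interval=$2 start
    shift 2
    start=$(date +%s)
    while :; do
        # The check counts as passed when the command exits with status 0.
        if "$@" >/dev/null 2>&1; then
            echo "succeeded after $(( $(date +%s) - start ))s"
            return 0
        fi
        if [ $(( $(date +%s) - start )) -ge "$timeout" ]; then
            echo "gave up after ${timeout}s"
            return 1
        fi
        sleep "$interval"
    done
}
```

For example, `wait_until 900 10 dd if=<ocr raw device> of=/dev/null bs=8k count=1` would keep trying to read one block from the OCR device for up to 15 minutes, where the device path is a placeholder to be filled in for the actual system. A success time close to 10 minutes while the master node is still down would point to the I/O suspension described in the bug rather than to the master node's return.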