估计使用Red Hat或者CentOS做HA集群的朋友多数都会选择RedHat Cluster Suite(RHCS)这个套件来做吧。本篇主要记录构建及测试时候的情况。
poweroff 和 reboot 这种常规操作的服务切换取决于 recovery="relocate" 这个参数,在图形化界面中队应的位置是:
主节点断电(或者硬件故障)的故障切换取决于fence设备,我用的是最不被推荐的手动fence(manual fence),此时在 /var/log/messages 中将会有提示:
tail -f /var/log/messages Apr 29 22:30:41 govmail01 fence_manual: Node govmail02 needs to be reset before recovery can procede. Waiting for govmail02 to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n govmail02)
此时的状态是:服务停止(主节点都挂了,服务不能切过来)
发生这种情况,可以按照这个提示,手工输入命令,告诉RHCS "已经将问题节点隔离"
[root@govmail01 ~]# fence_ack_manual -n govmail02 Warning: If the node "govmail02" has not been manually fenced (i.e. power cycled or disconnected from shared storage devices) the GFS file system may become corrupted and all its data unrecoverable! Please verify that the node shown above has been reset or disconnected from storage. Are you certain you want to continue? [yN] y done
完成后服务就会在备节点上启动了。