作者简介:刘亚辉,现就职于北京海天起点新疆办事处,OGG项目组成员。具有5年以上电信行业系统运维经验,熟悉Oracle数据库、Linux操作系统日常维护工作,擅长于OGG搭建方案的制定、运维优化及故障诊断。
OGG数据实时同步的特点决定了它将被经常使用在对数据实时性要求较高的系统中,所以对于其稳定性以及故障处理时间的要求是很高的。我所在的工作环境中就有几套OGG是在生产环境之间同步业务数据,如果源端进程出现异常,那么将影响所有的目标端数据。有一段时间,由于源端系统OGG版本较低,以及业务数据的复杂性,EXTRACT进程经常ABEND或者宕住,直接影响目标端系统的业务受理。所以,如果源端进程运行不稳定,人工维护工作量还是很高的,需要尽快找出问题并解决。
除了保障自身的稳定性,还要考虑到,其他故障也会影响OGG。简单的说,OGG是通过连接数据库读取事务日志工作的。如果数据库不可用,或者更严重些,OGG所在的主机崩溃了,那么OGG还是会受到影响。那么问题来了,如何实现OGG的高可用?
在Oracle RAC集群中,是通过集群软件统一管理多实例及其他资源的方式实现Oracle数据库的高可用。OGG的高可用原理也与此类似,只不过同时运行的OGG进程只有一套。在当前运行的节点发生故障时,可以自动切换到其他节点
OGG高可用原理
上面已经说到,OGG可以在发生故障时实现自动切换,那么具体的实现方式是什么?
- 在RAC数据库的每个节点的相同目录下都部署OGG,并且配置相同的进程及参数;
- 为OGG在集群中配置VIP地址,保证PUMP投递进程与目标主机的正常通信;
- 将OGG作为资源注册到Oracle集群中,使用集群软件管理OGG;
- 在发生故障时,可以通过集群的relocate resource功能自动切换到其他节点。
那么问题又来了,OGG在切换后是如何保证数据一致性的呢?假设在发生故障后,进程已经ABEND,如何让其他节点的OGG进程获取到故障节点OGG读取/写入的位置?这就涉及OGG的检查点文件。
大家都知道,通过ggsci命令行的 info <进程名> showch 命令可以查看进程的读检查点及写检查点的位置(以EXTRACT进程为例):
GGSCI (rac2) 3> info EXTRACT showch
EXTRACT EXTRACT Last Started 2017-04-09 01:08 Status RUNNING
…
Current Checkpoint Detail:
Read Checkpoint #1
Oracle Threaded Redo Log
Startup Checkpoint (starting position in the data source):
Thread #: 1
Sequence #: 55762
RBA: 3686956048
Timestamp: 2017-04-08 23:20:27.000000
SCN: 3460.1956090641 (14862542934801)
Redo File: +RECOC1/rac/onlinelog/group_10.285.864850395
Recovery Checkpoint (position of oldest unprocessed transaction in the data source):
Thread #: 1
Sequence #: 59439
RBA: 1029172752
Timestamp: 2017-05-17 11:15:10.000000
SCN: 3465.1843266486 (14883904947126)
Redo File: Not Available
Current Checkpoint (position of last record read in the data source):
Thread #: 1
Sequence #: 59440
RBA: 627996012
Timestamp: 2017-05-17 11:31:11.000000
SCN: 3465.1854259346 (14883915939986)
Redo File: Not Available
…
Write Checkpoint #1
GGS Log Trail
Current Checkpoint (current write position):
Sequence #: 28785
RBA: 105063747
Timestamp: 2017-05-17 11:31:14.734831
Extract Trail: ./dirdat/LA
Trail Type: EXTTRAIL
读检查点记录了上次启动的位置、下次恢复的位置以及当前读的位置;写检查点记录了当前在TRAIL文件中写的位置。这些信息都记录在dirchk目录的检查点文件中。所以只要让其他进程能够读取到dirchk目录下的检查点文件,即可从下次恢复的位置开始读取日志。
(可以做个小实验,将当前OGG的dirchk目录拷贝到其他测试OGG的目录下,不拷贝参数文件,查看相同进程的检查点信息与原进程是否一致。)
为了让TRAIL文件能够继续被读取/写入,TRAIL文件也要能够被其他节点进程访问到。所以如果划分出一块共享存储,将检查点文件及TRIAL文件放在上面,将OGG软件分别部署在每个节点上,即可实现OGG的高可用性及数据一致性。
OGG HA架构图
图中描述了两端均为RAC双节点的情况,其中:
源端和目标端都通过共享存储访问检查点文件及TRAIL文件;
源端和目标端的VIP地址均可漂移到其他节点;
源端通过目标端的VIP,向目标端传送TRAIL文件,而无需关心实际主机名或地址。
配置OGG HA
1.准备共享存储
Oracle推荐使用ACFS作为OGG的共享存储。
注:ACFS支持的数据库及OS版本可参考Metalink[ID 1369107.1]
2.准备集群及数据库
安装Oracle集群及数据库,并在每个节点的相同目录下,使用同一用户安装OGG。
3.配置检查点路径及TRAIL文件路径
使用Linux的符号链接命令将OGG安装目录下的dirchk及dirdat目录替换成指向共享存储目录的链接。
[oracle@rac1 ~]$ cd /acfs
[oracle@rac1 ~]$ mkdir dirchk dirdat
[oracle@rac1 ~]cdcdOGG_HOME
[oracle@rac1 ~]$ rm –rf dirchk
[oracle@rac1 ~]$ ln –s /acfs/dirchk dirchk
[oracle@rac1 ~]$ rm –rf dirdat
[oracle@rac1 ~]$ ln –s /acfs/dirdat dirdat
4.创建OGG VIP
注:配置VIP需要11gR2及以上。
1)使用root用户创建VIP
[root@rac1 ~]# /u01/app/11.2.0/grid/bin/appvipcfg create -network=1
> -ip=10.0.0.8
> -vipname=oggvip
> -user=root
其中,network为网络号,查看网络号命令:
[root@rac1 ~]# crsctl stat res -p |grep -ie .network -ie subnet |grep -ie name -ie subnet
NAME=ora.net1.network
USR_ORA_SUBNET=10.0.0.0
net1:表示网络号为1,USR_ORA_SUBNET表示VIP所在的子网范围。
ip:VIP的地址,必须在上面确定的子网范围内。
vipname:指定VIP名称为oggvip。
可以使用以下命令修改VIP的属性:
crsctl modify resource ggs-vip -attr "RESTART_ATTEMPTS=3,START_TIMEOUT=300,STOP_TIMEOUT=300, CHECK_INTERVAL=10"
2)授予管理用户(grid)启动该VIP的权限
$GRID_HOME/bin/crsctl setperm resource oggvip -u user:grid:r-x
[root@rac1 ~]# /u01/app/11.2.0/grid/bin/crsctl setperm resource oggvip -u user:grid:r-x
3)以grid用户的身份,启动该 VIP 资源
$GRID_HOME/bin/crsctl start resource oggvip
[grid@rac1 ~]$ /u01/app/11.2.0/grid/bin/crsctl start resource oggvip
CRS-2672: Attempting to start 'oggvip' on 'rac2'
CRS-2676: Start of 'oggvip' on 'rac2' succeeded
4)检查VIP状态
$GRID_HOME/bin/crsctl status resource oggvip
[grid@rac1 ~]$ /u01/app/11.2.0/grid/bin/crsctl status resource oggvip
NAME=oggvip
TYPE=app.appvip.type
TARGET=ONLINE
STATE=ONLINE on rac2
5.创建代理ACTION脚本
Oracle集群通过agent代理运行与资源相关的命令。通过创建代理ACTION脚本,用于后续集群的调用。
该代理脚本接受以下参数值:
start、stop、check、clean 和 abort。
必须将脚本保存在所有节点的相同路径,且以grid用户为属主,并具有执行权限(例如可以放在$GRID_HOME/crs/script/)。
[root@rac1 script]# chown grid:oinstall 11gr2_ogg_action.scr
[root@rac1 script]# chmod 755 11gr2_ogg_action.scr
具体脚本内容:
#!/bin/sh
# When you change the environment(hosts,databases ...etc),change the environment variables below too.
#set the Oracle Goldengate installation directory
export OGG_HOME=/home/ogg/ogg
#set the oracle home to the database to ensure GoldenGate will get the
#right environment settings to be able to connect to the database
export ORACLE_HOME=/u01/app/oracle/product/11.2.0.4/db_1/
#specify delay after start before checking for successful start
start_delay_secs=5
#Include the GoldenGate home in the library path to start GGSCI
export LD_LIBRARY_PATH=$ORACLE_HOME/lib:/lib:/usr/lib
#check_process validates that a manager process is running at the PID
#that GoldenGate specifies.
check_process () {
if ( [ -f "${OGG_HOME}/dirpcs/MGR.pcm" ] )
then
pid=`cut -f8 "${OGG_HOME}/dirpcs/MGR.pcm"`
if [ {pid} = `ps -e |greppid=‘ps−e∣grep{pid} |grep mgr |cut -d " " -f2` ]
then
#manager process is running on the PID exit success
exit 0
else
if [ {pid} = `ps -e |greppid=‘ps−e∣grep{pid} |grep mgr |cut -d " " -f1` ]
then
#manager process is running on the PID exit success
exit 0
else
#manager process is not running on the PID
exit 1
fi
fi
else
#manager is not running because there is no PID file
exit 1
fi
}
#call_ggsci is a generic routine that executes a ggsci command
call_ggsci () {
ggsci_command=$1
ggsci_output=`${OGG_HOME}/ggsci<<EOF
${ggsci_command}
exit
EOF`
}
case $1 in
'start')
#start manager
call_ggsci 'start manager'
#there is a small delay between issuing the start manager command
#and the process being spawned on the OS. wait before checking
sleep ${start_delay_secs}
#check whether manager is running and exit accordingly
check_process
;;
'stop')
#attempt a clean stop for all non-manager processes
#call_ggsci 'stop er *'
#ensure everything is stopped
call_ggsci 'stop er *!'
#call_ggsci 'kill er *'
#stop manager without (y/n) confirmation
call_ggsci 'stop manager!'
#exit success
exit 0
;;
'check')
check_process
;;
'clean')
#attempt a clean stop for all non-manager processes
#call_ggsci 'stop er *'
#ensure everything is stopped
#call_ggsci 'stop er *!'
#in case there are lingering processes
call_ggsci 'kill er *'
#stop manager without (y/n) confirmation
call_ggsci 'stop manager!'
#exit success
exit 0
;;
'abort')
#ensure everything is stopped
call_ggsci 'stop er *!'
#in case there are lingering processes
call_ggsci 'kill er *'
#stop manager without (y/n) confirmation
call_ggsci 'stop manager!'
#exit success
exit 0
;;
esac
注:从脚本内容可以看出,脚本只启动manager进程,所以务必在manager进程中配置AUTOSTART参数。
6.在集群中注册OGG
以grid用户执行以下命令:
$GRID_HOME/bin/crsctl add resource oggapp -type cluster_resource
-attr "ACTION_SCRIPT=/u01/app/11.2.0/grid/crs/script/11gr2_ogg_action.scr,
CHECK_INTERVAL=30, START_DEPENDENCIES='hard(oggvip,ora.asm)
pullup(oggvip)', STOP_DEPENDENCIES='hard(oggvip)' SCRIPT_TIMEOUT=300"
[grid@rac1 script]# /u01/app/11.2.0/grid/bin/crsctl add resource oggapp -type cluster_resource
> -attr "ACTION_SCRIPT=/u01/app/11.2.0/grid/crs/script/11gr2_ogg_action.scr,
> CHECK_INTERVAL=30, START_DEPENDENCIES='hard(oggvip,ora.asm)
> pullup(oggvip)', STOP_DEPENDENCIES='hard(oggvip)' SCRIPT_TIMEOUT=300"
注:START_DEPENDENCIES:启动oggapp时与oggvip、ora.asm设置为强关联关系;
STOP_DEPENDENCIES:停止oggapp时与oggvip设置为强关联关系。
如果不使用VIP(源端与目标端在同一主机),则以以下方式注册:
$GRID_HOME/bin/crsctl add resource oggapp
-type cluster_resource
-attr "ACTION_SCRIPT=/u01/app/11.2.0/grid/crs/script/11gr2_ogg_action.scr, CHECK_INTERVAL=30, START_DEPENDENCIES='hard(ora.asm)' SCRIPT_TIMEOUT=300"
如果需要以其他用户管理,需要授权:
管理权限:
crsctl setperm resource oggapp -o <username>
启动权限:
crsctl setperm resource oggapp -u user: <username>:r-x
7.启动OGG资源
添加该资源后,使用 crsctl集群命令启动 Oracle GoldenGate。以 grid用户登录,启动命令:
$GRID_HOME/bin/crsctl start resource oggapp
1)启动资源前,先确认ogg进程均已stop:
[oracle@rac2 ogg]$ ./ggsci
Oracle GoldenGate Command Interpreter for Oracle
Version 12.2.0.1.1 OGGCORE_12.2.0.1.0_PLATFORMS_151211.1401_FBO
Linux, x64, 64bit (optimized), Oracle 11g on Dec 12 2015 00:54:38
Operating system character set identified as UTF-8.
Copyright (C) 1995, 2015, Oracle and/or its affiliates. All rights reserved.
GGSCI (rac2) 1> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER STOPPED
EXTRACT STOPPED EXT01 00:00:00 00:01:01
REPLICAT STOPPED REP01 00:00:00 00:01:37
2)启动资源:
[grid@rac2 ~]$ crsctl start resource oggapp
CRS-2672: Attempting to start 'oggapp' on 'rac2'
CRS-2676: Start of 'oggapp' on 'rac2' succeeded
3)再次查看ogg进程状态:
GGSCI (rac2) 1> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT RUNNING EXT01 00:01:03 00:00:06
REPLICAT RUNNING REP01 00:01:45 00:00:03
4)检查app状态,输入以下命令:
$GRID_HOME/bin/crsctl status resource oggapp
[grid@rac2 ~]$ crsctl status resource oggapp
NAME=oggapp
TYPE=cluster_resource
TARGET=ONLINE
STATE=ONLINE on rac2
8.管理OGG资源
要手工切换OGG到其他节点上,可使用带 force 选项的relocate resource命令。可以在集群中任何节点上以grid用户身份运行此命令。
1)执行前查看资源目前运行的节点:
[grid@rac2 ~]$ crsctl status resource oggapp
NAME=oggapp
TYPE=cluster_resource
TARGET=ONLINE
STATE=ONLINE on rac2
2)执行relocate操作:
[root@rac2 ~]# crsctl relocate resource oggapp -f
CRS-2673: Attempting to stop 'oggapp' on 'rac2'
CRS-2677: Stop of 'oggapp' on 'rac2' succeeded
CRS-2673: Attempting to stop 'oggvip' on 'rac2'
CRS-2677: Stop of 'oggvip' on 'rac2' succeeded
CRS-2672: Attempting to start 'oggvip' on 'rac1'
CRS-2676: Start of 'oggvip' on 'rac1' succeeded
CRS-2672: Attempting to start 'oggapp' on 'rac1'
CRS-2676: Start of 'oggapp' on 'rac1' succeeded
3)在节点1查看ogg状态:
GGSCI (rac1) 2> !
info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT RUNNING EXT01 00:01:15 00:00:09
REPLICAT RUNNING REP01 00:02:11 00:00:02
4)通过crsctl查看集群资源也可以看到:
[grid@rac1 ~]$ crsctl stat res -t
-----------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
-----------------------------------------------------------------------
Cluster Resources
-----------------------------------------------------------------------
oggapp
1 ONLINE ONLINE rac1
oggvip
1 ONLINE ONLINE rac1
故障模拟测试案例
数据库主机宕机
查看OGG运行状态:
[grid@rac1 ~]$ crsctl status resource oggapp
NAME=oggapp
TYPE=cluster_resource
TARGET=ONLINE
STATE=ONLINE on rac1
OGG在1节点运行,模拟1节点宕机(关闭1节点主机),查看oggapp状态:
[root@rac2 ~]# crsctl status resource oggapp
NAME=oggapp
TYPE=cluster_resource
TARGET=ONLINE
STATE=ONLINE on rac2
发现已经自动relocate至2节点。进入ggsci命令行查看进程状态:
GGSCI (rac2) 1> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT RUNNING EXT01 00:00:15 00:00:04
REPLICAT RUNNING REP01 00:00:21 00:00:01
进程已经正常启动。