MySQL MHA架构介绍:
MHA(Master High Availability)目前在MySQL高可用方面是一个相对成熟的解决方案,它由日本DeNA公司youshimaton(现就职于Facebook公司)开发,是一套优秀的作为MySQL高可用性环境下故障切换和主从提升的高可用软件。在MySQL故障切换过程中,MHA能做到在0~30秒之内自动完成数据库的故障切换操作,并且在进行故障切换的过程中,MHA能在最大程度上保证数据的一致性,以达到真正意义上的高可用。
该软件由两部分组成:MHA Manager(管理节点)和MHA Node(数据节点)。MHA Manager可以单独部署在一台独立的机器上管理多个master-slave集群,也可以部署在一台slave节点上。MHA Node运行在每台MySQL服务器上,MHA Manager会定时探测集群中的master节点,当master出现故障时,它可以自动将最新数据的slave提升为新的master,然后将所有其他的slave重新指向新的master。整个故障转移过程对应用程序完全透明。
在MHA自动故障切换过程中,MHA试图从宕机的主服务器上保存二进制日志,最大程度的保证数据的不丢失,但这并不总是可行的。例如,如果主服务器硬件故障或无法通过ssh访问,MHA没法保存二进制日志,只进行故障转移而丢失了最新的数据。使用MySQL 5.5的半同步复制,可以大大降低数据丢失的风险。MHA可以与半同步复制结合起来。如果只有一个slave已经收到了最新的二进制日志,MHA可以将最新的二进制日志应用于其他所有的slave服务器上,因此可以保证所有节点的数据一致性。
目前MHA主要支持一主多从的架构,要搭建MHA,要求一个复制集群中必须最少有三台数据库服务器,一主二从,即一台充当master,一台充当备用master,另外一台充当从库,因为至少需要三台服务器,出于机器成本的考虑,淘宝也在该基础上进行了改造,目前淘宝TMHA已经支持一主一从。(出自:《深入浅出MySQL(第二版)》)
架构图:
MHA工作原理总结为以下几条:
(1)从宕机崩溃的master保存二进制日志事件(binlog events);
(2)识别含有最新更新的slave;
(3)应用差异的中继日志(relay log) 到其他slave;
(4)应用从master保存的二进制日志事件(binlog events);
(5)提升一个slave为新master;
(6)使用其他的slave连接新的master进行复制。
官方介绍:https://code.google.com/p/mysql-master-ha/
实验环境:(centos6.2 MySQL版本5.5)
角色 ip地址 主机名 server_id 类型
Monitor host 192.168.2.131 server01 - 监控复制组
Master 192.168.2.128 server02 1 写入
Candicate master 192.168.2.129 server03 2 读
Slave 192.168.2.130 server04 3 读
server03和server04是server02的slave,复制环境搭建后面会简单演示,其中master对外提供写服务,备选master(实际的slave,主机名server03)提供读服务,slave也提供相关的读服务,一旦master宕机,将会把备选master提升为新的master,slave指向新的master
1、部署MHA过程:
方法一
在所有节点都要安装MHA node所需的perl模块(DBD:mysql),可以通过yum安装,如果没epel源,先安装epel源,在如下:(温馨提示:系统时间一定要是最新的,否则安装时会出各种奇葩问题)
在server02(192.168.2.128)操作:
192.168.2.128 [root ~]$ rpm -ivh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
Retrieving http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
warning: /var/tmp/rpm-tmp.SAbcKl: Header V3 RSA/SHA256 Signature, key ID 0608b895: NOKEY
Preparing... ########################################### [100%]
1:epel-release ########################################### [100%]
192.168.2.128 [root ~]$ yum install perl-DBD-MySQL -y
在server03(192.168.2.129)操作:
192.168.2.129 [root ~]$ rpm -ivh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
Retrieving http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
warning: /var/tmp/rpm-tmp.gsdYwg: Header V3 RSA/SHA256 Signature, key ID 0608b895: NOKEY
Preparing... ########################################### [100%]
1:epel-release ########################################### [100%]
192.168.2.129 [root ~]$ yum install perl-DBD-MySQL -y
在server04(192.168.2.130)操作:
192.168.2.130 [root ~]$ rpm -ivh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
Retrieving http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
warning: /var/tmp/rpm-tmp.TUeiym: Header V3 RSA/SHA256 Signature, key ID 0608b895: NOKEY
Preparing... ########################################### [100%]
1:epel-release ########################################### [100%]
192.168.2.130 [root ~]$ yum install perl-DBD-MySQL -y
(2)在所有的节点安装MHA node:(下面以server02为例,记得server03和server04也一样的操作),MHA node和MHA Manager都在要官网下载,
下载地址:https://code.google.com/p/mysql-master-ha/wiki/Downloads?tm=2(自备梯子)
192.168.2.128 [root ~]$ tar xf mha4mysql-node-0.56.tar.gz
192.168.2.128 [root ~]$ cd mha4mysql-node-0.56
192.168.2.128 [root mha4mysql-node-0.56]$ perl Makefile.PL
Can't locate ExtUtils/MakeMaker.pm in @INC (@INC contains: inc /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at inc/Module/Install/Makefile.pm line 4.
192.168.2.128 [root mha4mysql-node-0.56]$ yum install -y perl-devel
192.168.2.128 [root mha4mysql-node-0.56]$ perl Makefile.PL
*** Module::AutoInstall version 1.03
*** Checking for Perl dependencies...
Can't locate CPAN.pm in @INC (@INC contains: inc /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at inc/Module/AutoInstall.pm line 279.
192.168.2.128 [root mha4mysql-node-0.56]$ yum install -y perl-CPAN
192.168.2.128 [root mha4mysql-node-0.56]$ perl Makefile.PL
*** Module::AutoInstall version 1.03
*** Checking for Perl dependencies...
[Core Features]
- DBI ...loaded. (1.609)
- DBD::mysql ...loaded. (4.013)
*** Module::AutoInstall configuration finished.
Checking if your kit is complete...
Looks good
Writing Makefile for mha4mysql::node
192.168.2.128 [root mha4mysql-node-0.56]$ make && make install
方法二、
安装MHA node所需的perl模块(DBD:mysql)也可以通过脚本安装,安装脚本如下:(个人不建议用这样的方法安装,安装时间比较长,本人试过,蛋碎一地,太多问题了,要花时间去找资料,而且我们不能确定这些依赖带来的问题是否影响后面的使用)
192.168.2.128 [root ~]$ cat install.sh
#!/bin/bash
wget http://xrl.us/cpanm --no-check-certificate
mv cpanm /usr/bin
chmod 755 /usr/bin/cpanm
cat > /root/list << EOF
install DBD::mysql
EOF
for package in `cat /root/list`
do
cpanm $package
done
192.168.2.128 [root ~]$
再安装MHA node节点:
192.168.2.128 [root ~]$ tar xf mha4mysql-node-0.56.tar.gz
192.168.2.128 [root ~]$ cd mha4mysql-node-0.56
192.168.2.128 [root mha4mysql-node-0.56]$ perl Makefile.PL
192.168.2.128 [root mha4mysql-node-0.56]$ make && make install
安装完成后会在/usr/local/bin目录下生成以下脚本文件:
192.168.2.128 [root mha4mysql-node-0.56]$ cd /usr/local/bin/
192.168.2.128 [root bin]$ pwd
/usr/local/bin
192.168.2.128 [root bin]$ ll
总用量 40576
-r-xr-xr-x 1 root root 15498 1月 18 11:02 apply_diff_relay_logs
-r-xr-xr-x 1 root root 4807 1月 18 11:02 filter_mysqlbinlog
-r-xr-xr-x 1 root root 7401 1月 18 11:02 purge_relay_logs
-r-xr-xr-x 1 root root 7263 1月 18 11:02 save_binary_logs
192.168.2.128 [root bin]$
Node脚本说明:(这些工具通常由MHA Manager的脚本触发,无需人为操作)
save_binary_logs //保存和复制master的二进制日志
apply_diff_relay_logs //识别差异的中继日志事件并将其差异的事件应用于其他的slave
filter_mysqlbinlog //去除不必要的ROLLBACK事件(MHA已不再使用这个工具)
purge_relay_logs //清除中继日志(不会阻塞SQL线程)
2.安装MHA Manager,在MHA Manager的主机也是需要安装MHA Node,MHA Manger也依赖于perl模块
(1)在MHA Manager的主机也是需要安装MHA Node,所以以下的步骤和上面的操作一样,如下(在server01 192.168.2.131操作):
192.168.2.131 [root ~]$ rpm -ivh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
Retrieving http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
Preparing... ########################################### [100%]
package epel-release-6-8.noarch is already installed
192.168.2.131 [root ~]$ yum install perl-DBD-MySQL -y
192.168.2.131 [root ~]$ yum install -y perl-devel perl-CPAN
192.168.2.131 [root ~]$ tar xf mha4mysql-node-0.56.tar.gz
192.168.2.131 [root ~]$ cd mha4mysql-node-0.56
192.168.2.131 [root mha4mysql-node-0.56]$ perl Makefile.PL
192.168.2.131 [root mha4mysql-node-0.56]$ make && make install
(2)安装MHA Manager。首先安装MHA Manger依赖的perl模块(我这里使用yum安装):
192.168.2.131 [root mha4mysql-node-0.56]$ yum install perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager perl-Time-HiRes -y
安装MHA Manager软件包:
192.168.2.131 [root mha4mysql-node-0.56]$ cd
192.168.2.131 [root ~]$ tar xf mha4mysql-manager-0.56.tar.gz
192.168.2.131 [root ~]$ cd mha4mysql-manager-0.56
192.168.2.131 [root mha4mysql-manager-0.56]$ perl Makefile.PL
*** Module::AutoInstall version 1.03
*** Checking for Perl dependencies...
[Core Features]
- DBI ...loaded. (1.609)
- DBD::mysql ...loaded. (4.013)
- Time::HiRes ...loaded. (1.9721)
- Config::Tiny ...loaded. (2.12)
- Log::Dispatch ...loaded. (2.26)
- Parallel::ForkManager ...loaded. (0.7.9)
- MHA::NodeConst ...loaded. (0.56)
*** Module::AutoInstall configuration finished.
Writing Makefile for mha4mysql::manager
192.168.2.131 [root mha4mysql-manager-0.56]$ make && make install
安装完成后,在/usr/local/bin会产生相关的脚本:
192.168.2.131 [root bin]$ pwd
/usr/local/bin
192.168.2.131 [root bin]$ ll
总用量 37364
-r-xr-xr-x. 1 root root 15498 1月 11 22:55 apply_diff_relay_logs
-r-xr-xr-x. 1 root root 4807 1月 11 22:55 filter_mysqlbinlog
-r-xr-xr-x. 1 root root 1995 1月 11 22:55 masterha_check_repl
-r-xr-xr-x. 1 root root 1779 1月 11 22:55 masterha_check_ssh
-r-xr-xr-x. 1 root root 1865 1月 11 22:55 masterha_check_status
-r-xr-xr-x. 1 root root 3201 1月 11 22:55 masterha_conf_host
-r-xr-xr-x. 1 root root 2517 1月 11 22:55 masterha_manager
-r-xr-xr-x. 1 root root 2165 1月 11 22:55 masterha_master_monitor
-r-xr-xr-x. 1 root root 2373 1月 11 22:55 masterha_master_switch
-r-xr-xr-x. 1 root root 3749 1月 11 22:55 masterha_secondary_check
-r-xr-xr-x. 1 root root 1739 1月 11 22:55 masterha_stop
-r-xr-xr-x. 1 root root 7401 1月 11 22:55 purge_relay_logs
-r-xr-xr-x. 1 root root 7263 1月 11 22:55 save_binary_logd
复制相关脚本到/usr/local/bin目录(软件包解压缩后就有了,不是必须,因为这些脚本不完整,需要自己修改,这是软件开发着留给我们自己发挥的,如果开启下面的任何一个脚本对应的参数,而对应这里的脚本又没有修改,则会抛错,自己被坑的很惨)
192.168.2.131 [root scripts]$ pwd
/root/mha4mysql-manager-0.56/samples/scripts
192.168.2.131 [root scripts]$ ll
总用量 32
-rwxr-xr-x. 1 root root 3443 1月 8 2012 master_ip_failover //自动切换时vip管理的脚本,不是必须,如果我们使用keepalived的,我们可以自己编写脚本完成对vip的管理,比如监控mysql,如果mysql异常, 我们停止keepalived就行,这样vip就会自动漂移
-rwxr-xr-x. 1 root root 9186 1月 8 2012 master_ip_online_change //在线切换时vip的管理,不是必须,同样可以可以自行编写简单的shell完成
-rwxr-xr-x. 1 root root 11867 1月 8 2012 power_manager //故障发生后关闭主机的脚本,不是必须
-rwxr-xr-x. 1 root root 1360 1月 8 2012 send_report //因故障切换后发送报警的脚本,不是必须,可自行编写简单的shell完成
192.168.2.131 [root scripts]$ cp * /usr/local/bin/
3.配置SSH登录无密码验证(使用key登录,工作中常用,最好不要禁掉密码登录,如果禁了,可能会有问题)
在server02 192.168.2.131操作(Monitor):
192.168.2.131 [root ~]$ ssh-keygen -t rsa
192.168.2.131 [root ~]$ ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.2.128
192.168.2.131 [root ~]$ ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.2.129
192.168.2.131 [root ~]$ ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.2.130
在server02 192.168.2.128操作(Master):
192.168.2.128 [root ~]$ ssh-keygen -t rsa
192.168.2.128 [root ~]$ ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.2.129
192.168.2.128 [root ~]$ ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.2.130
在server03 192.168.2.129操作(slave):
192.168.2.129 [root ~]$ ssh-keygen -t rsa
192.168.2.129 [root ~]$ ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.2.128
192.168.2.129 [root ~]$ ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.2.130
在server04 192.168.2.130操作(slave):
192.168.2.130 [root ~]$ ssh-keygen -t rsa
192.168.2.130 [root ~]$ ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.2.128
192.168.2.130 [root ~]$ ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.2.129
4.搭建主从复制环境
注意:binlog-do-db 和 replicate-ignore-db 设置必须相同。 MHA 在启动时候会检测过滤规则,如果过滤规则不同,MHA 不启动监控和故障转移。
(1)在Master 192.168.2.128(server02)上备份一份完整的数据:
192.168.2.128 [root ~]$ mysqldump -uroot -p123456 --master-data=2 --single-transaction -R --triggers -A > all.sql
其中--master-data=2代表备份时刻记录master的Binlog位置和Position,--single-transaction意思是获取一致性快照,-R意思是备份存储过程和函数,--triggres的意思是备份触发器,-A代表备份所有的库。更多信息请自行mysqldump --help查看。
(2)在Master 192.168.2.128(server02)上创建复制用户:
mysql> grant replication slave on *.* to 'repl'@'192.168.2.%' identified by '123456';
Query OK, 0 rows affected (0.00 sec)
mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)
(3)查看主库备份时的binlog名称和位置,MASTER_LOG_FILE和MASTER_LOG_POS:
192.168.2.128 [root ~]$ head -n 30 all.sql | grep 'CHANGE MASTER TO'
-- CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000004', MASTER_LOG_POS=245;
(4)把备份复制到192.168.2.129和192.168.2.130
192.168.2.128 [root ~]$ scp all.sql 192.168.2.129:/root/
all.sql 100% 500KB 500.5KB/s 00:00
192.168.2.128 [root ~]$ scp all.sql 192.168.2.130:/root/
all.sql
(5)分别在两台服务器上导入备份,执行复制相关命令
在slave主机server03 192.168.2.129上操作:
192.168.2.129 [root ~]$ mysql -uroot -p123456 < ./all.sql
mysql> stop slave;
Query OK, 0 rows affected (0.01 sec)
mysql> CHANGE MASTER TO MASTER_HOST='192.168.2.128',MASTER_USER='repl', MASTER_PASSWORD='123456',MASTER_LOG_FILE='mysql-bin.000004',MASTER_LOG_POS=245;
Query OK, 0 rows affected (0.01 sec)
mysql> start slave;
Query OK, 0 rows affected (0.00 sec)
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.2.128
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000004
Read_Master_Log_Pos: 472
Relay_Log_File: localhost-relay-bin.000002
Relay_Log_Pos: 480
Relay_Master_Log_File: mysql-bin.000004
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
在slave master04 192.168.2.130上操作,导入备份,执行同步操作,如下:
192.168.2.130 [root ~]$ mysql -uroot -p123456 < ./all.sql
mysql> slave stop;
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> CHANGE MASTER TO MASTER_HOST='192.168.2.128',MASTER_USER='repl', MASTER_PASSWORD='123456',MASTER_LOG_FILE='mysql-bin.000004',MASTER_LOG_POS=245;
Query OK, 0 rows affected (0.04 sec)
mysql> slave start;
Query OK, 0 rows affected (0.00 sec)
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.2.128
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000004
Read_Master_Log_Pos: 472
Relay_Log_File: localhost-relay-bin.000002
Relay_Log_Pos: 480
Relay_Master_Log_File: mysql-bin.000004
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
(6)两台slave服务器设置read_only(从库对外提供读服务,之所以没有写进配置文件,是因为随时slave会提升为master)
192.168.2.129 [root ~]$ mysql -uroot -p123456 -e "set global read_only=1"
192.168.2.130 [root ~]$ mysql -uroot -p123456 -e "set global read_only=1"
(7)创建监控用户(在master上执行,也就是server02 192.168.2.128):
mysql> grant all privileges on *.* to 'root'@'192.168.2.%' identified by '123456';
Query OK, 0 rows affected (0.00 sec)
mysql> flush privileges;
Query OK, 0 rows affected (0.01 sec)
mysql>
到这里整个集群环境已经搭建完毕,剩下的就是配置MHA软件了。
5.配置MHA
(1)创建MHA的工作目录,并且创建相关配置文件(在软件包解压后的目录里面有样例配置文件)。
192.168.2.131 [root ~]$ mkdir -p /etc/masterha
192.168.2.131 [root ~]$ cp mha4mysql-manager-0.56/samples/conf/app1.cnf /etc/masterha/
修改app1.cnf配置文件,修改后的文件内容如下(注意,配置文件中的注释需要去掉,我这里是为了解释清楚):
[root@192.168.2.131 ~]# cat /etc/masterha/app1.cnf
[server default]
manager_workdir=/var/log/masterha/app1.log //设置manager的工作目录
manager_log=/var/log/masterha/app1/manager.log //设置manager的日志
master_binlog_dir=/data/mysql //设置master 保存binlog的位置,以便MHA可以找到master的日志,我这里的也就是mysql的数据目录
master_ip_failover_script= /usr/local/bin/master_ip_failover //设置自动failover时候的切换脚本
master_ip_online_change_script= /usr/local/bin/master_ip_online_change //设置手动切换时候的切换脚本
password=123456 //设置mysql中root用户的密码,这个密码是前文中创建监控用户的那个密码
user=root 设置监控用户root
ping_interval=1 //设置监控主库,发送ping包的时间间隔,默认是3秒,尝试三次没有回应的时候自动进行railover
remote_workdir=/tmp //设置远端mysql在发生切换时binlog的保存位置
repl_password=123456 //设置复制用户的密码
repl_user=repl //设置复制环境中的复制用户名
report_script=/usr/local/bin/send_report //设置发生切换后发送的报警的脚本
secondary_check_script= /usr/local/bin/masterha_secondary_check -s server03 -s server02 --user=root --master_host=server02 --master_ip=192.168.2.128 --master_port=3306 //一旦MHA到server02的监控之间出现问题,MHA Manager将会尝试从server03登录到server02
shutdown_script="" //设置故障发生后关闭故障主机脚本(该脚本的主要作用是关闭主机放在发生脑裂,这里没有使用)
ssh_user=root //设置ssh的登录用户名
[server1]
hostname=192.168.2.128
port=3306
[server2]
hostname=192.168.2.129
port=3306
candidate_master=1
//设置为候选master,如果设置该参数以后,发生主从切换以后将会将此从库提升为主库,即使这个主库不是集群中事件最新的slave
check_repl_delay=0 //默认情况下如果一个slave落后master 100M的relay logs的话,MHA将不会选择该slave作为一个新的master,因为对于这个slave的恢复需要花费很长时间,通过设置check_repl_delay=0,MHA触发切换在选择一个新的master的时候将会忽略复制延时,这个参数对于设置了candidate_master=1的主机非常有用,因为这个候选主在切换的过程中一定是新的master
[server3]
hostname=192.168.2.130
port=3306
(2)设置relay log的清除方式(在每个slave节点上):
在slave master03 192.168.2.129操作:
192.168.2.129 [root ~]$ mysql -uroot -p123456 -e "set global relay_log_purge=0"
在slave master04 192.168.2.130操作:
192.168.2.130 [root ~]$ mysql -uroot -p123456 -e "set global relay_log_purge=0"
注意:
MHA在发生切换的过程中,从库的恢复过程中依赖于relay log的相关信息,所以这里要将relay log的自动清除设置为OFF,采用手动清除relay log的方式。在默认情况下,从服务器上的中继日志会在SQL线程执行完毕后被自动删除。但是在MHA环境中,这些中继日志在恢复其他从服务器时可能会被用到,因此需要禁用中继日志的自动删除功能。定期清除中继日志需要考虑到复制延时的问题。在ext3的文件系统下,删除大的文件需要一定的时间,会导致严重的复制延时。为了避免复制延时,需要暂时为中继日志创建硬链接,因为在linux系统中通过硬链接删除大文件速度会很快。(在mysql数据库中,删除大表时,通常也采用建立硬链接的方式)
设置定期清理relay脚本(两台slave服务器):
在slave master03 192.168.2.129操作:
192.168.2.129 [root ~]$ cat purge_relay_log.sh
#!/bin/bash
user=root
passwd=123456
port=3306
log_dir='/data/masterha/log'
work_dir='/data'
purge='/usr/local/bin/purge_relay_logs'
if [ ! -d $log_dir ]
then
mkdir $log_dir -p
fi
$purge --user=$user --password=$passwd --disable_relay_log_purge --port=$port --workdir=$work_dir >> $log_dir/purge_relay_logs.log 2>&1
192.168.2.129 [root ~]$ crontab -l
0 4 * * * /bin/bash /root/purge_relay_log.sh
在slave master03 192.168.2.130操作跟上面是一样的,这里不演示了。
参数说明:
--user mysql //用户名
--password mysql //密码
--port //端口号
--workdir //指定创建relay log的硬链接的位置,默认是/var/tmp,由于系统不同分区创建硬链接文件会失败,故需要执行硬链接具体位置,成功执行脚本后,硬链接的中继日志文件被删除
--disable_relay_log_purge //默认情况下,如果relay_log_purge=1,脚本会什么都不清理,自动退出,通过设定这个参数,当relay_log_purge=1的情况下会将relay_log_purge设置为0。清理relay log之后,最后将参数设置为OFF。
purge_relay_logs脚本删除中继日志不会阻塞SQL线程。下面我们手动执行看看什么情况:
192.168.2.129 [root ~]$ purge_relay_logs --user=root --password=123456 --port=3306 -disable_relay_log_purge --workdir=/data/
2015-01-18 12:30:51: purge_relay_logs script started.
Found relay_log.info: /data/mysql/relay-log.info
Removing hard linked relay log files localhost-relay-bin* under /data/.. done.
Current relay log file: /data/mysql/localhost-relay-bin.000002
Archiving unused relay log files (up to /data/mysql/localhost-relay-bin.000001) ...
Creating hard link for /data/mysql/localhost-relay-bin.000001 under /data//localhost-relay-bin.000001 .. ok.
Creating hard links for unused relay log files completed.
Executing SET GLOBAL relay_log_purge=1; FLUSH LOGS; sleeping a few seconds so that SQL thread can delete older relay log files (if it keeps up); SET GLOBAL relay_log_purge=0; .. ok.
Removing hard linked relay log files localhost-relay-bin* under /data/.. done.
2015-01-18 12:30:54: All relay log purging operations succeeded.
6.检查SSH配置(server01 192.168.2.131 Monitor 监控节点上操作),如下:
192.168.2.131 [root ~]$ masterha_check_ssh --conf=/etc/masterha/app1.cnf
Sun Jan 18 12:31:48 2015 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sun Jan 18 12:31:48 2015 - [info] Reading application default configurations from /etc/masterha/app1.cnf..
Sun Jan 18 12:31:48 2015 - [info] Reading server configurations from /etc/masterha/app1.cnf..
Sun Jan 18 12:31:48 2015 - [info] Starting SSH connection tests..
Sun Jan 18 12:31:49 2015 - [debug]
Sun Jan 18 12:31:48 2015 - [debug] Connecting via SSH from root@192.168.2.128(192.168.2.128:22) to root@192.168.2.129(192.168.2.129:22)..
Sun Jan 18 12:31:49 2015 - [debug] ok.
Sun Jan 18 12:31:49 2015 - [debug] Connecting via SSH from root@192.168.2.128(192.168.2.128:22) to root@192.168.2.130(192.168.2.130:22)..
Sun Jan 18 12:31:49 2015 - [debug] ok.
Sun Jan 18 12:31:50 2015 - [debug]
Sun Jan 18 12:31:49 2015 - [debug] Connecting via SSH from root@192.168.2.129(192.168.2.129:22) to root@192.168.2.128(192.168.2.128:22)..
Sun Jan 18 12:31:49 2015 - [debug] ok.
Sun Jan 18 12:31:49 2015 - [debug] Connecting via SSH from root@192.168.2.129(192.168.2.129:22) to root@192.168.2.130(192.168.2.130:22)..
Sun Jan 18 12:31:50 2015 - [debug] ok.
Sun Jan 18 12:31:50 2015 - [debug]
Sun Jan 18 12:31:49 2015 - [debug] Connecting via SSH from root@192.168.2.130(192.168.2.130:22) to root@192.168.2.128(192.168.2.128:22)..
Sun Jan 18 12:31:50 2015 - [debug] ok.
Sun Jan 18 12:31:50 2015 - [debug] Connecting via SSH from root@192.168.2.130(192.168.2.130:22) to root@192.168.2.129(192.168.2.129:22)..
Sun Jan 18 12:31:50 2015 - [debug] ok.
Sun Jan 18 12:31:50 2015 - [info] All SSH connection tests passed successfully.
可以看见各个节点ssh验证都是ok的。
7.检查整个复制环境状况(server01 192.168.2.131 Monitor 监控节点上操作),如下:
192.168.2.131 [root ~]$ masterha_check_repl --conf=/etc/masterha/app1.cnf
Sun Jan 18 13:08:11 2015 - [info] Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql --output_file=/tmp/save_binary_logs_test --manager_version=0.56 --start_file=mysql-bin.000004
Sun Jan 18 13:08:11 2015 - [info] Connecting to root@192.168.2.128(192.168.2.128)..
Creating /tmp if not exists.. ok.
Checking output directory is accessible or not..
ok.
Binlog found at /data/mysql, up to mysql-bin.000004
Sun Jan 18 13:08:11 2015 - [info] Master setting check done.
Sun Jan 18 13:08:11 2015 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..
Sun Jan 18 13:08:11 2015 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user=root --slave_host=192.168.2.129 --slave_ip=192.168.2.129 --slave_port=3306 --workdir=/tmp --target_version=5.5.60-log --manager_version=0.56 --relay_log_info=/data/mysql/relay-log.info --relay_dir=/data/mysql/ --slave_pass=xxx
Sun Jan 18 13:08:11 2015 - [info] Connecting to root@192.168.2.129(192.168.2.129:22)..
Can't exec "mysqlbinlog": 没有那个文件或目录 at /usr/local/share/perl5/MHA/BinlogManager.pm line 99.
mysqlbinlog version not found!
at /usr/local/bin/apply_diff_relay_logs line 463
Sun Jan 18 13:08:12 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln193] Slaves settings check failed!
Sun Jan 18 13:08:12 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln372] Slave configuration failed.
Sun Jan 18 13:08:12 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln383] Error happend on checking configurations. at /usr/local/bin/masterha_check_repl line 48
Sun Jan 18 13:08:12 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln478] Error happened on monitoring servers.
Sun Jan 18 13:08:12 2015 - [info] Got exit code 1 (Not master dead).
MySQL Replication Health is NOT OK!
192.168.2.131 [root ~]$
如果发现如下错误:
Can't exec "mysqlbinlog": No such file or directory at /usr/local/share/perl5/MHA/BinlogManager.pm line 99.
mysqlbinlog version not found!
Testing mysql connection and privileges..sh: mysql: command not found
mysql command failed with rc 127:0!
可以通过以下方法解决(在所有节点上执行):
192.168.2.128 [root ~]$ ln -s /usr/local/mysql/bin/mysqlbinlog /usr/local/bin/mysqlbinlog
192.168.2.128 [root ~]$ ln -s /usr/local/mysql/bin/mysql /usr/local/bin/mysql
192.168.2.129 [root ~]$ ln -s /usr/local/mysql/bin/mysqlbinlog /usr/local/bin/mysqlbinlog
192.168.2.129 [root ~]$ ln -s /usr/local/mysql/bin/mysql /usr/local/bin/mysql
192.168.2.130 [root ~]$ ln -s /usr/local/mysql/bin/mysqlbinlog /usr/local/bin/mysqlbinlog
192.168.2.130 [root ~]$ ln -s /usr/local/mysql/bin/mysql /usr/local/bin/mysql
再进行检查
192.168.2.131 [root ~]$ masterha_check_repl --conf=/etc/masterha/app1.cnf
Sun Jan 18 13:19:41 2015 - [info] Checking replication health on 192.168.2.129..
Sun Jan 18 13:19:41 2015 - [info] ok.
Sun Jan 18 13:19:41 2015 - [info] Checking replication health on 192.168.2.130..
Sun Jan 18 13:19:41 2015 - [info] ok.
Sun Jan 18 13:19:41 2015 - [info] Checking master_ip_failover_script status:
Sun Jan 18 13:19:41 2015 - [info] /usr/local/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=192.168.2.128 --orig_master_ip=192.168.2.128 --orig_master_port=3306
Bareword "FIXME_xxx" not allowed while "strict subs" in use at /usr/local/bin/master_ip_failover line 88.
Execution of /usr/local/bin/master_ip_failover aborted due to compilation errors.
Sun Jan 18 13:19:41 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln214] Failed to get master_ip_failover_script status with return code 255:0.
Sun Jan 18 13:19:41 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln383] Error happend on checking configurations. at /usr/local/bin/masterha_check_repl line 48
Sun Jan 18 13:19:41 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln478] Error happened on monitoring servers.
Sun Jan 18 13:19:41 2015 - [info] Got exit code 1 (Not master dead).
MySQL Replication Health is NOT OK!
192.168.2.131 [root ~]$
还是报错,纠结N久,才发现原因是:原来Failover两种方式:一种是虚拟IP地址,一种是全局配置文件。MHA并没有限定使用哪一种方式,而是让用户自己选择,虚拟IP地址的方式会牵扯到其它的软件,比如keepalive软件,而且还要修改脚本master_ip_failover。
所以先暂时注释master_ip_failover_script= /usr/local/bin/master_ip_failover这个选项。后面引入keepalived后和修改该脚本以后再开启该选项
192.168.2.131 [root ~]$ grep master_ip_failover /etc/masterha/app1.cnf
#master_ip_failover_script= /usr/local/bin/master_ip_failover
再次进行状态查看:
192.168.2.131 [root ~]$ masterha_check_repl --conf=/etc/masterha/app1.cnf
Sun Jan 18 13:23:57 2015 - [info] Slaves settings check done.
Sun Jan 18 13:23:57 2015 - [info]
192.168.2.128 (current master)
+--192.168.2.129
+--192.168.2.130
Sun Jan 18 13:23:57 2015 - [info] Checking replication health on 192.168.2.129..
Sun Jan 18 13:23:57 2015 - [info] ok.
Sun Jan 18 13:23:57 2015 - [info] Checking replication health on 192.168.2.130..
Sun Jan 18 13:23:57 2015 - [info] ok.
Sun Jan 18 13:23:57 2015 - [warning] master_ip_failover_script is not defined.
Sun Jan 18 13:23:57 2015 - [warning] shutdown_script is not defined.
Sun Jan 18 13:23:57 2015 - [info] Got exit code 0 (Not master dead).
MySQL Replication Health is OK.
192.168.2.131 [root ~]$
已经没有明显报错,只有两个警告而已,复制也显示正常了,哈哈,没报错了,先乐一会^0^
8.检查MHA Manager的状态 通过master_check_status脚本查看Manager的状态:
192.168.2.131 [root ~]$ masterha_check_status --conf=/etc/masterha/app1.cnf
app1 is stopped(2:NOT_RUNNING).
注意:如果正常,会显示"PING_OK",否则会显示"NOT_RUNNING",这代表MHA监控没有开启。
9.开启MHA Manager监控(server01 192.168.2.131操作)如下:
192.168.2.131 [root ~]$ mkdir -p /var/log/masterha/app1/
192.168.2.131 [root ~]$ nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1 &
[1] 13014
192.168.2.131 [root ~]$
启动参数说明:
--remove_dead_master_conf //该参数代表当发生主从切换后,老的主库的ip将会从配置文件中移除。
--manger_log //日志存放位置
--ignore_last_failover //在缺省情况下,如果MHA检测到连续发生宕机,且两次宕机间隔不足8小时的话,则不会进行Failover,之所以这样限制是为了避免ping-pong效应。该参数代表忽略上次MHA触发切换产生的文件,默认情况下,MHA发生切换后会在日志目录,也就是上面我设置的/data产生app1.failover.complete文件,下次再次切换的时候如果发现该目录下存在该文件将不允许触发切换,除非在第一次切换后收到删除该文件,为了方便,这里设置为--ignore_last_failover。
查看MHA Manager监控是否正常:
192.168.2.131 [root ~]$ masterha_check_status --conf=/etc/masterha/app1.cnf
app1 (pid:13014) is running(0:PING_OK), master:192.168.2.128
可以看见已经在监控了,而且master的主机为192.168.2.128
10.查看启动日志(server01 192.168.2.131操作)如下:
192.168.2.131 [root ~]$ tail -n20 /var/log/masterha/app1/manager.log
Sun Jan 18 13:27:22 2015 - [info] Connecting to root@192.168.2.130(192.168.2.130:22)..
Checking slave recovery environment settings..
Opening /data/mysql/relay-log.info ... ok.
Relay log found at /data/mysql, up to localhost-relay-bin.000002
Temporary relay log file is /data/mysql/localhost-relay-bin.000002
Testing mysql connection and privileges.. done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Sun Jan 18 13:27:22 2015 - [info] Slaves settings check done.
Sun Jan 18 13:27:22 2015 - [info]
192.168.2.128 (current master)
+--192.168.2.129
+--192.168.2.130
Sun Jan 18 13:27:22 2015 - [warning] master_ip_failover_script is not defined.
Sun Jan 18 13:27:22 2015 - [warning] shutdown_script is not defined.
Sun Jan 18 13:27:22 2015 - [info] Set master ping interval 1 seconds.
Sun Jan 18 13:27:22 2015 - [info] Set secondary check script: /usr/local/bin/masterha_secondary_check -s server03 -s server02 --user=root --master_host=server02 --master_ip=192.168.2.128 --master_port=3306
Sun Jan 18 13:27:22 2015 - [info] Starting ping health check on 192.168.2.128(192.168.2.128:3306)..
Sun Jan 18 13:27:22 2015 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
192.168.2.131 [root ~]$
其中"Ping(SELECT) succeeded, waiting until MySQL doesn't respond.."说明整个系统已经开始监控了。
11.关闭MHA Manage监控(server01 192.168.2.131操作)如下:
关闭很简单,使用masterha_stop命令完成。(只是演示关闭,在测试中,必须是开启的状态,如果关了,在测试的时候务必记得开启)
192.168.2.131 [root ~]$ masterha_stop --conf=/etc/masterha/app1.cnf
Stopped app1 successfully.
[1]+ Exit 1 nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1
192.168.2.131 [root ~]$
12.配置VIP vip配置可以采用两种方式,一种通过keepalived的方式管理虚拟ip的浮动;另外一种通过脚本方式启动虚拟ip的方式(即不需要keepalived或者heartbeat类似的软件)。 下面先介绍通过安装keepalived来管理虚拟IP的浮动:
(1)下载软件进行并进行安装(两台master,准确的说一台是master,另外一台是备选master,在没有切换以前是slave)server02 192.168.2.128操作:
192.168.2.128 [root ~]$ wget http://www.keepalived.org/software/keepalived-1.2.12.tar.gz
192.168.2.128 [root ~]$ tar xf keepalived-1.2.12.tar.gz
192.168.2.128 [root ~]$ cd keepalived-1.2.12
192.168.2.128 [root keepalived-1.2.12]$ ./configure --prefix=/usr/local/keepalived
192.168.2.128 [root keepalived-1.2.12]$ make && make install
192.168.2.128 [root keepalived-1.2.12]$ cp /usr/local/keepalived/etc/rc.d/init.d/keepalived /etc/init.d/
192.168.2.128 [root keepalived-1.2.12]$ cp /usr/local/keepalived/etc/sysconfig/keepalived /etc/sysconfig/
192.168.2.128 [root keepalived-1.2.12]$ mkdir /etc/keepalived
192.168.2.128 [root keepalived-1.2.12]$ cp /usr/local/keepalived/etc/keepalived/keepalived.conf /etc/keepalived/
192.168.2.128 [root keepalived-1.2.12]$ cp /usr/local/keepalived/sbin/keepalived /usr/sbin/
server03 192.168.2.129也要执行上面的操作,安装是一样的,配置文件不一样,这里不演示,自已安装哈
(2)配置keepalived的配置文件,在master上配置(server02 192.168.2.128)操作如下:
192.168.2.128 [root keepalived-1.2.12]$ cat /etc/keepalived/keepalived.conf
! Configuration File for keepalived
global_defs {
notification_email {
saltstack@163.com
}
notification_email_from dba@dbserver.com
smtp_server 127.0.0.1
smtp_connect_timeout 30
router_id MySQL-HA
}
vrrp_instance VI_1 {
state BACKUP
interface eth0
virtual_router_id 51
priority 150
advert_int 1
nopreempt
authentication {
auth_type PASS
auth_pass 1111
}
virtual_ipaddress {
192.168.2.88
}
}
192.168.2.128 [root keepalived-1.2.12]$
其中router_id MySQL HA表示设定keepalived组的名称,将192.168.2.88这个虚拟ip绑定到该主机的eth0网卡上,并且设置了状态为backup模式,将keepalived的模式设置为非抢占模式(nopreempt),priority 150表示设置的优先级为150。下面的配置略有不同,但是都是一个意思。(还有一个细节要注意的,要看清楚自己的网卡是eth0做模拟VIP,还是eth1)
在候选master上配置(server03 192.168.2.129)操作如下:
192.168.2.129 [root keepalived-1.2.12]$ cat /etc/keepalived/keepalived.conf
! Configuration File for keepalived
global_defs {
notification_email {
saltstack@163.com
}
notification_email_from dba@dbserver.com
smtp_server 127.0.0.1
smtp_connect_timeout 30
router_id MySQL-HA
}
vrrp_instance VI_1 {
state BACKUP
interface eth0
virtual_router_id 51
priority 120
advert_int 1
nopreempt
authentication {
auth_type PASS
auth_pass 1111
}
virtual_ipaddress {
192.168.2.88
}
}
192.168.2.129 [root keepalived-1.2.12]$
(3)启动keepalived服务,在master上启动并查看日志(server02 192.168.2.128)操作如下:
192.168.2.128 [root keepalived-1.2.12]$ /etc/init.d/keepalived start
正在启动 keepalived: [确定]
192.168.2.128 [root keepalived-1.2.12]$ tail -f /var/log/messages
Jan 18 13:47:20 localhost Keepalived_healthcheckers[4638]: Registering Kernel netlink reflector
Jan 18 13:47:20 localhost Keepalived_healthcheckers[4638]: Registering Kernel netlink command channel
Jan 18 13:47:20 localhost Keepalived_healthcheckers[4638]: Opening file '/etc/keepalived/keepalived.conf'.
Jan 18 13:47:20 localhost Keepalived_healthcheckers[4638]: Configuration is using : 7105 Bytes
Jan 18 13:47:20 localhost Keepalived_healthcheckers[4638]: Using LinkWatch kernel netlink reflector...
Jan 18 13:47:23 localhost Keepalived_vrrp[4639]: VRRP_Instance(VI_1) Transition to MASTER STATE
Jan 18 13:47:24 localhost Keepalived_vrrp[4639]: VRRP_Instance(VI_1) Entering MASTER STATE
Jan 18 13:47:24 localhost Keepalived_vrrp[4639]: VRRP_Instance(VI_1) setting protocol VIPs.
Jan 18 13:47:24 localhost Keepalived_vrrp[4639]: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth0 for 192.168.2.88
Jan 18 13:47:24 localhost Keepalived_healthcheckers[4638]: Netlink reflector reports IP 192.168.2.88 added
Jan 18 13:47:29 localhost Keepalived_vrrp[4639]: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth0 for 192.168.2.88
启动候选master的keepalived(server03 192.168.2.129)操作如下:
192.168.2.129 [root keepalived-1.2.12]$ /etc/init.d/keepalived start
正在启动 keepalived: [确定]
192.168.2.129 [root keepalived-1.2.12]$
192.168.2.129 [root keepalived-1.2.12]$ tail -f /var/log/messages
Jan 18 13:52:31 localhost Keepalived_vrrp[4990]: Registering gratuitous ARP shared channel
Jan 18 13:52:31 localhost Keepalived_healthcheckers[4989]: Registering Kernel netlink command channel
Jan 18 13:52:31 localhost Keepalived_vrrp[4990]: Opening file '/etc/keepalived/keepalived.conf'.
Jan 18 13:52:31 localhost Keepalived_healthcheckers[4989]: Opening file '/etc/keepalived/keepalived.conf'.
Jan 18 13:52:31 localhost Keepalived_healthcheckers[4989]: Configuration is using : 7105 Bytes
Jan 18 13:52:31 localhost Keepalived_healthcheckers[4989]: Using LinkWatch kernel netlink reflector...
Jan 18 13:52:31 localhost Keepalived_vrrp[4990]: Configuration is using : 62850 Bytes
Jan 18 13:52:31 localhost Keepalived_vrrp[4990]: Using LinkWatch kernel netlink reflector...
Jan 18 13:52:31 localhost Keepalived_vrrp[4990]: VRRP_Instance(VI_1) Entering BACKUP STATE
Jan 18 13:52:31 localhost Keepalived_vrrp[4990]: VRRP sockpool: [ifindex(2), proto(112), unicast(0), fd(10,11)]
Jan 18 13:52:34 localhost Keepalived_vrrp[4990]: VRRP_Instance(VI_1) Transition to MASTER STATE
Jan 18 13:52:35 localhost Keepalived_vrrp[4990]: VRRP_Instance(VI_1) Entering MASTER STATE
Jan 18 13:52:35 localhost Keepalived_vrrp[4990]: VRRP_Instance(VI_1) setting protocol VIPs.
Jan 18 13:52:35 localhost Keepalived_vrrp[4990]: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth0 for 192.168.2.88
Jan 18 13:52:35 localhost Keepalived_healthcheckers[4989]: Netlink reflector reports IP 192.168.2.88 added
Jan 18 13:52:40 localhost Keepalived_vrrp[4990]: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth0 for 192.168.2.88
(4)查看绑定情况
192.168.2.128 [root keepalived-1.2.12]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 00:0c:29:86:dc:2a brd ff:ff:ff:ff:ff:ff
inet 192.168.2.128/24 brd 192.168.2.255 scope global eth0
inet 192.168.2.88/32 scope global eth0
inet6 fe80::20c:29ff:fe86:dc2a/64 scope link
valid_lft forever preferred_lft forever
192.168.2.128 [root keepalived-1.2.12]$
发现已经将虚拟IP 192.168.2.88绑定了master02 192.168.2.128的网卡eth0上了
从上面的信息可以看到keepalived已经配置成功。
注意: 上面两台服务器的keepalived都设置为了BACKUP模式,在keepalived中2种模式,分别是master->backup模式和backup->backup模式。这两种模式有很大区别。在master->backup模式下,一旦主库宕机,虚拟ip会自动漂移到从库,当主库修复后,keepalived启动后,还会把虚拟ip抢占过来,即使设置了非抢占模式(nopreempt)抢占ip的动作也会发生。在backup->backup模式下,当主库宕机后虚拟ip会自动漂移到从库上,当原主库恢复和keepalived服务启动后,并不会抢占新主的虚拟ip,即使是优先级高于从库的优先级别,也不会发生抢占。为了减少ip漂移次数,通常是把修复好的主库当做新的备库。
(5)MHA引入keepalived(MySQL服务进程挂掉时通过MHA 停止keepalived):
要想把keepalived服务引入MHA,我们只需要修改切换是触发的脚本文件master_ip_failover即可,在该脚本中添加在master发生宕机时对keepalived的处理。
1、编辑脚本/usr/local/bin/master_ip_failover,修改后如下(server01 192.168.2.131)操作:
192.168.2.131 [root ~]$ cat /usr/local/bin/master_ip_failover
#!/usr/bin/env perl
use strict;
use warnings FATAL => 'all';
use Getopt::Long;
my (
$command, $ssh_user, $orig_master_host, $orig_master_ip,
$orig_master_port, $new_master_host, $new_master_ip, $new_master_port
);
my $vip = '192.168.2.88';
my $ssh_start_vip = "/etc/init.d/keepalived start";
my $ssh_stop_vip = "/etc/init.d/keepalived stop";
GetOptions(
'command=s' => \$command,
'ssh_user=s' => \$ssh_user,
'orig_master_host=s' => \$orig_master_host,
'orig_master_ip=s' => \$orig_master_ip,
'orig_master_port=i' => \$orig_master_port,
'new_master_host=s' => \$new_master_host,
'new_master_ip=s' => \$new_master_ip,
'new_master_port=i' => \$new_master_port,
);
exit &main();
sub main {
print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n";
if ( $command eq "stop" || $command eq "stopssh" ) {
my $exit_code = 1;
eval {
print "Disabling the VIP on old master: $orig_master_host \n";
&stop_vip();
$exit_code = 0;
};
if ($@) {
warn "Got Error: $@\n";
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "start" ) {
my $exit_code = 10;
eval {
print "Enabling the VIP - $vip on the new master - $new_master_host \n";
&start_vip();
$exit_code = 0;
};
if ($@) {
warn $@;
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "status" ) {
print "Checking the Status of the script.. OK \n";
exit 0;
}
else {
&usage();
exit 1;
}
}
sub start_vip() {
`ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}
# A simple system call that disable the VIP on the old_master
sub stop_vip() {
`ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}
sub usage {
print
"Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
}
192.168.2.131 [root ~]$
把#master_ip_failover_script= /usr/local/bin/master_ip_failover打开
192.168.2.131 [root ~]$ grep 'master_ip_failover_script' /etc/masterha/app1.cnf
master_ip_failover_script= /usr/local/bin/master_ip_failover
执行检测:
192.168.2.131 [root ~]$ masterha_check_repl --conf=/etc/masterha/app1.cnf
Sun Jan 18 14:00:43 2015 - [info] Slaves settings check done.
Sun Jan 18 14:00:43 2015 - [info]
192.168.2.128 (current master)
+--192.168.2.129
+--192.168.2.130
Sun Jan 18 14:00:43 2015 - [info] Checking replication health on 192.168.2.129..
Sun Jan 18 14:00:43 2015 - [info] ok.
Sun Jan 18 14:00:43 2015 - [info] Checking replication health on 192.168.2.130..
Sun Jan 18 14:00:43 2015 - [info] ok.
Sun Jan 18 14:00:43 2015 - [info] Checking master_ip_failover_script status:
Sun Jan 18 14:00:43 2015 - [info] /usr/local/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=192.168.2.128 --orig_master_ip=192.168.2.128 --orig_master_port=3306
Unmatched right curly bracket at /usr/local/bin/master_ip_failover line 76, at end of line
syntax error at /usr/local/bin/master_ip_failover line 76, near "}"
Execution of /usr/local/bin/master_ip_failover aborted due to compilation errors.
Sun Jan 18 14:00:43 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln214] Failed to get master_ip_failover_script status with return code 255:0.
Sun Jan 18 14:00:43 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln383] Error happend on checking configurations. at /usr/local/bin/masterha_check_repl line 48
Sun Jan 18 14:00:43 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln478] Error happened on monitoring servers.
Sun Jan 18 14:00:43 2015 - [info] Got exit code 1 (Not master dead).
MySQL Replication Health is NOT OK!
报以上的错,折腾了N多人,因为好多人不懂perl,看到模板就复制别人的代码,就是在复制的进去的时候,弄乱了,又手动调一下,导致各种各样的问题,我上面就是不小心导致的报错,手动修改了(cp的时候有一行多了一个#号),报错的大部份原因是master_ip_failover脚本导致的,而不要过多花时间纠结自己是否安装时安装少了东西,怀疑自己搭建的环境问题
再次执行检查:
192.168.2.131 [root ~]$ masterha_check_repl --conf=/etc/masterha/app1.cnf
Sun Jan 18 14:02:21 2015 - [info] Slaves settings check done.
Sun Jan 18 14:02:21 2015 - [info]
192.168.2.128 (current master)
+--192.168.2.129
+--192.168.2.130
Sun Jan 18 14:02:21 2015 - [info] Checking replication health on 192.168.2.129..
Sun Jan 18 14:02:21 2015 - [info] ok.
Sun Jan 18 14:02:21 2015 - [info] Checking replication health on 192.168.2.130..
Sun Jan 18 14:02:21 2015 - [info] ok.
Sun Jan 18 14:02:21 2015 - [info] Checking master_ip_failover_script status:
Sun Jan 18 14:02:21 2015 - [info] /usr/local/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=192.168.2.128 --orig_master_ip=192.168.2.128 --orig_master_port=3306
IN SCRIPT TEST====/etc/init.d/keepalived stop==/etc/init.d/keepalived start===
Checking the Status of the script.. OK
Sun Jan 18 14:02:21 2015 - [info] OK.
Sun Jan 18 14:02:21 2015 - [warning] shutdown_script is not defined.
Sun Jan 18 14:02:21 2015 - [info] Got exit code 0 (Not master dead).
MySQL Replication Health is OK.
192.168.2.131 [root ~]$
可以看见已经没有报错了,再乐一会吧,哈哈……
/usr/local/bin/master_ip_failover添加或者修改的内容意思是当主库数据库发生故障时,会触发MHA切换,MHA Manager会停掉主库上的keepalived服务,触发虚拟ip漂移到备选从库,从而完成切换。当然可以在keepalived里面引入脚本,这个脚本监控mysql是否正常运行,如果不正常,则调用该脚本杀掉keepalived进程。
2、以下进行模拟主Master(192.168.2.128)down了:
192.168.2.128 [root keepalived-1.2.12]$ /etc/init.d/mysqld stop
Shutting down MySQL... SUCCESS!
192.168.2.128 [root keepalived-1.2.12]$
在管理节点(server01 192.168.2.131)查看日志:(报错)
192.168.2.131 [root ~]$ tail -f /var/log/masterha/app1/manager.log
192.168.2.128 (current master)
+--192.168.2.129
+--192.168.2.130
Sun Jan 18 13:32:37 2015 - [warning] master_ip_failover_script is not defined.
Sun Jan 18 13:32:37 2015 - [warning] shutdown_script is not defined.
Sun Jan 18 13:32:37 2015 - [info] Set master ping interval 1 seconds.
Sun Jan 18 13:32:37 2015 - [info] Set secondary check script: /usr/local/bin/masterha_secondary_check -s server03 -s server02 --user=root --master_host=server02 --master_ip=192.168.2.128 --master_port=3306
Sun Jan 18 13:32:37 2015 - [info] Starting ping health check on 192.168.2.128(192.168.2.128:3306)..
Sun Jan 18 13:32:37 2015 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
Sun Jan 18 14:32:03 2015 - [warning] Got error on MySQL select ping: 2006 (MySQL server has gone away)
Sun Jan 18 14:32:03 2015 - [info] Executing seconary network check script: /usr/local/bin/masterha_secondary_check -s server03 -s server02 --user=root --master_host=server02 --master_ip=192.168.2.128 --master_port=3306 --user=root --master_host=192.168.2.128 --master_ip=192.168.2.128 --master_port=3306
Sun Jan 18 14:32:03 2015 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql --output_file=/tmp/save_binary_logs_test --manager_version=0.56 --binlog_prefix=mysql-bin
Sun Jan 18 14:32:03 2015 - [info] HealthCheck: SSH to 192.168.2.128 is reachable.
Sun Jan 18 14:32:04 2015 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)
Sun Jan 18 14:32:04 2015 - [warning] Connection failed 1 time(s)..
Sun Jan 18 14:32:05 2015 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)
Sun Jan 18 14:32:05 2015 - [warning] Connection failed 2 time(s)..
Sun Jan 18 14:32:06 2015 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)
Sun Jan 18 14:32:06 2015 - [warning] Connection failed 3 time(s)..
ssh: Could not resolve hostname server03: Name or service not known
Monitoring server server03 is NOT reachable!
在管理节服务器192.168.2.131上添加hosts:
192.168.2.131 [root ~]$ cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.2.128 server01
192.168.2.129 server02
192.168.2.130 server03
再查看日志(点下面加号可以查看日志):
192.168.2.131 [root ~]$ tail -f /var/log/masterha/app1/manager.log
IN SCRIPT TEST====/etc/init.d/keepalived stop==/etc/init.d/keepalived start===
Checking the Status of the script.. OK
Sun Jan 18 17:11:19 2015 - [info] OK.
Sun Jan 18 17:11:19 2015 - [warning] shutdown_script is not defined.
Sun Jan 18 17:11:19 2015 - [info] Set master ping interval 1 seconds.
Sun Jan 18 17:11:19 2015 - [info] Set secondary check script: /usr/local/bin/masterha_secondary_check -s server03 -s server02 --user=root --master_host=server02 --master_ip=192.168.2.128 --master_port=3306
Sun Jan 18 17:11:19 2015 - [info] Starting ping health check on 192.168.2.128(192.168.2.128:3306)..
Sun Jan 18 17:11:19 2015 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
Sun Jan 18 17:11:48 2015 - [warning] Got error on MySQL select ping: 2006 (MySQL server has gone away)
Sun Jan 18 17:11:48 2015 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql --output_file=/tmp/save_binary_logs_test --manager_version=0.53 --binlog_prefix=mysql-bin
Sun Jan 18 17:11:48 2015 - [info] Executing seconary network check script: /usr/local/bin/masterha_secondary_check -s server03 -s server02 --user=root --master_host=server02 --master_ip=192.168.2.128 --master_port=3306 --user=root --master_host=192.168.2.128 --master_ip=192.168.2.128 --master_port=3306
Sun Jan 18 17:11:48 2015 - [info] HealthCheck: SSH to 192.168.2.128 is reachable.
Monitoring server server03 is reachable, Master is not reachable from server03. OK.
Monitoring server server02 is reachable, Master is not reachable from server02. OK.
Sun Jan 18 17:11:48 2015 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Sun Jan 18 17:11:49 2015 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)
Sun Jan 18 17:11:49 2015 - [warning] Connection failed 1 time(s)..
Sun Jan 18 17:11:50 2015 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)
Sun Jan 18 17:11:50 2015 - [warning] Connection failed 2 time(s)..
Sun Jan 18 17:11:51 2015 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)
Sun Jan 18 17:11:51 2015 - [warning] Connection failed 3 time(s)..
Sun Jan 18 17:11:51 2015 - [warning] Master is not reachable from health checker!
Sun Jan 18 17:11:51 2015 - [warning] Master 192.168.2.128(192.168.2.128:3306) is not reachable!
Sun Jan 18 17:11:51 2015 - [warning] SSH is reachable.
Sun Jan 18 17:11:51 2015 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/masterha/app1.cnf again, and trying to connect to all servers to check server status..
Sun Jan 18 17:11:51 2015 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sun Jan 18 17:11:51 2015 - [info] Reading application default configurations from /etc/masterha/app1.cnf..
Sun Jan 18 17:11:51 2015 - [info] Reading server configurations from /etc/masterha/app1.cnf..
Sun Jan 18 17:11:51 2015 - [info] Dead Servers:
Sun Jan 18 17:11:51 2015 - [info] 192.168.2.128(192.168.2.128:3306)
Sun Jan 18 17:11:51 2015 - [info] Alive Servers:
Sun Jan 18 17:11:51