Mysql 主主复制失败
故障描述
原因描述 | 因为机柜PDU老化, 导致整个机柜掉电. |
---|---|
故障时间 | 20160923-10:09 |
发现时间 | 20160929-13:56 |
架构信息
Tomcat | Memcache | Keepalive | Mysql主主复制 |
---|
节点信息
序号 | 节点名称 | IP地址 | 报错信息 |
---|---|---|---|
1 | aipprd1 | 10.66.1.52 | Got fatal error 1236 from master when reading data from binary log: ‘binlog truncated in the middle of event; consider out of disk space on master; the first event ‘mysql-bin.000084’ at 91941417, the last event read from ‘/aip/mysql/data/log/mysql-bin.000084’ at 91941783, the last byte read from ‘/aip/mysql/data/log/mysql-bin.000084’ at 91942912.’ |
2 | aipprd2 | 10.66.1.51 | Got fatal error 1236 from master when reading data from binary log: ‘binlog truncated in the middle of event; consider out of disk space on master; the first event ‘mysql-bin.000082’ at 6369026, the last event read from ‘/aip/mysql/data/log/mysql-bin.000082’ at 6369026, the last byte read from ‘/aip/mysql/data/log/mysql-bin.000082’ at 6369280.’ |
故障分析
- 由于Zabbix的Mysql监控脚本的缘故, 没有触发事件, 所以直至20160929检查Zabbix日志的时候, 才发现该故障, 当时
Keepalive
的VIP
在aipprd1
上, 这个节点上, 数据库是对外服务的; - 那么首先以
aipprd1
为主, 先将aipprd2
的从环境同步起来; - 待
aipprd2
从环境同步完成后, 再将aipprd1
的从环境同步起来.
同步AIPPRD2
的从环境
检查aipprd2
的Slave状态
根据代码区的描述, Last_Errno
的错误代码为1062
, 需要手动修改Position
.
[root@aipprd2 ~]# mysql -uzabbixmoniter -ppassw0rd -hlocalhost -e "show slave statusG;" *************************** 1. row *************************** Slave_IO_State: Master_Host: 10.66.1.52 Master_User: root Master_Port: 3388 Connect_Retry: 60 Master_Log_File: mysql-bin.000082 Read_Master_Log_Pos: 6369026 Relay_Log_File: mysql-relay-bin.000257 Relay_Log_Pos: 8999 Relay_Master_Log_File: mysql-bin.000082 Slave_IO_Running: No Slave_SQL_Running: No Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 1062 Last_Error: Error 'Duplicate entry '93FF91EF92866D23E80E4A57D55ED538-n1.tomcat604' for key 'PRIMARY'' on query. Default database: 'aipprd'. Query: 'INSERT INTO eahttpsession ( sessionid, username, account, createtime, loginip,userid,explorer,userDomain,computerName,computerUserName) VALUES ('93FF91EF92866D23E80E4A57D55ED538-n1.tomcat604', '李花', 'XS003_4200', '2016-09-23 12:10:33', , '10752920','MSIE 7.0','','','')' Skip_Counter: 0 Exec_Master_Log_Pos: 6367510 Relay_Log_Space: 11099 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: NULL Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 1236 Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'binlog truncated in the middle of event; consider out of disk space on master; the first event 'mysql-bin.000082' at 6369026, the last event read from '/aip/mysql/data/log/mysql-bin.000082' at 6369026, the last byte read from '/aip/mysql/data/log/mysql-bin.000082' at 6369280.' Last_SQL_Errno: 1062 Last_SQL_Error: Error 'Duplicate entry '93FF91EF92866D23E80E4A57D55ED538-n1.tomcat604' for key 'PRIMARY'' on query. Default database: 'aipprd'. Query: 'INSERT INTO eahttpsession ( sessionid, username, account, createtime, loginip,userid,explorer,userDomain,computerName,computerUserName) VALUES ('93FF91EF92866D23E80E4A57D55ED538-n1.tomcat604', '李花', 'XS003_4200', '2016-09-23 12:10:33', , '10752920','MSIE 7.0','','','')' Replicate_Ignore_Server_Ids: Master_Server_Id: 1
在aipprd2
上按照Last_IO_Error
中的Posfile
和Pos
修改.
按照报错给出的提示Posfile
和Pos
修改后, 报错依旧.
mysql> slave stop; mysql> CHANGE MASTER TO master_host='10.66.1.52', master_port=3388, master_user='root', master_password='passw0rd', master_log_file='mysql-bin.000082', master_log_pos=6369026; mysql> slave start;
在aipprd2
检查Mysql
的日志 Mysql
的日志中记录了Crash
开始的时间, 并给出了建议,Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000082' position 6367510
160923 12:35:22 [Note] Starting crash recovery... 160923 12:35:22 [Note] Crash recovery finished. 160923 12:36:27 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000082' position 6367510
在aipprd2
上按照日志中建议的Posfile
和Pos
修改.
按照日志中建议的Posfile
和Pos
修改后, 报错依旧.
mysql> slave stop; mysql> CHANGE MASTER TO master_host='10.66.1.52', master_port=3388, master_user='root', master_password='passw0rd', master_log_file='mysql-bin.000082', master_log_pos=6367510; mysql> slave start;
在aipprd1
上检查posfile
首先检查show slave statusG;
中给出的pos, 发现日志中根本不存在.
[root@aipprd1 log]# mysqlbinlog --no-defaults --start-position=6369026 mysql-bin.000083 /*!40019 SET @@session.max_insert_delayed_threads=0*/; /*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/; DELIMITER /*!*/; # at 4 #160923 12:38:59 server id 1 end_log_pos 107 Start: binlog v 4, server v 5.5.24-log created 160923 12:38:59 at startup ROLLBACK/*!*/; BINLOG ' 47HkVw8BAAAAZwAAAGsAAAAAAAQANS41LjI0LWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAADjseRXEzgNAAgAEgAEBAQEEgAAVAAEGggAAAAICAgCAA== '/*!*/; ERROR: Error in Log_event::read_log_event(): 'read error', data_len: 912208946, event_type: 71 ERROR: Could not read entry at offset 6369026: Error in log format or read error. DELIMITER # End of log file ROLLBACK /* added by mysqlbinlog */; /*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
此后检查Mysql日志
中建议的pos, 发现posfile
中是存在此记录的, 而此posfile
的最后一个pos
是6368660
, 而show slave statusG;
的pos
是6369026
, 显然不存在日志文件中.
[root@aipprd1 log]# mysqlbinlog --no-defaults --start-position=6367510 mysql-bin.000082 /*!40019 SET @@session.max_insert_delayed_threads=0*/; /*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/; DELIMITER /*!*/; # at 4 #160923 10:39:15 server id 1 end_log_pos 107 Start: binlog v 4, server v 5.5.24-log created 160923 10:39:15 at startup # Warning: this binlog is either in use or was not closed properly. ROLLBACK/*!*/; BINLOG ' 05XkVw8BAAAAZwAAAGsAAAABAAQANS41LjI0LWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAADTleRXEzgNAAgAEgAEBAQEEgAAVAAEGggAAAAICAgCAA== '/*!*/; # at 6367510 # at 6367585 # at 6367963 # at 6367990 # at 6368073 # at 6368163 # at 6368275 # at 6368358 # at 6368558 # at 6368585 # at 6368660 # End of log file ROLLBACK /* added by mysqlbinlog */; /*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
最后在mysql-bin.000083
中检查show slave statusG;
提示pos:6369026
, 也不存在.
[root@aipprd1 log]# mysqlbinlog --no-defaults --start-position=6369026 mysql-bin.000083
在aipprd2
上重新发起修改posfile
和pos
操作
检查aipprd1
上的日志, 既然mysql-bin.000082
日志末不存在pos:6369026
, 且mysql-bin.000083
为下一个日志, 那么重新发起修改posfile
和pos
的操作.
[root@aipprd1 log]# ll total 526908 -rw-rw---- 1 mysql mysql 104858346 Sep 19 23:04 mysql-bin.000079 -rw-rw---- 1 mysql mysql 104859112 Sep 22 13:01 mysql-bin.000080 -rw-rw---- 1 mysql mysql 42558345 Sep 23 10:23 mysql-bin.000081 -rw-rw---- 1 mysql mysql 6369280 Sep 23 12:13 mysql-bin.000082 -rw-rw---- 1 mysql mysql 104857953 Sep 26 15:02 mysql-bin.000083 -rw-rw---- 1 mysql mysql 70844542 Sep 27 17:42 mysql-bin.000084 -rw-rw---- 1 mysql mysql 104858026 Sep 29 15:48 mysql-bin.000085 -rw-rw---- 1 mysql mysql 297019 Sep 29 15:54 mysql-bin.000086 -rw-rw---- 1 mysql mysql 296 Sep 29 15:48 mysql-bin.log.index -rw-rw---- 1 mysql mysql 107 Sep 27 18:05 mysql-relay-bin.000182 -rw-rw---- 1 mysql mysql 43 Sep 27 18:05 mysql-relay-bin.index
重新修改posfile
为mysql-bin.000083
, pos
为0
, 启动Slave
后, 现在同步正常.
mysql> slave stop; mysql> CHANGE MASTER TO master_host='10.66.1.52', master_port=3388, master_user='root', master_password='passw0rd', master_log_file='mysql-bin.000083', master_log_pos=0; mysql> slave start; mysql> show slave statusG; *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: 10.66.1.52 Master_User: root Master_Port: 3388 Connect_Retry: 60 Master_Log_File: mysql-bin.000086 Read_Master_Log_Pos: 18585 Relay_Log_File: mysql-relay-bin.000002 Relay_Log_Pos: 8119286 Relay_Master_Log_File: mysql-bin.000083 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 8119140 Relay_Log_Space: 280580804 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: 521370 Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: Replicate_Ignore_Server_Ids: Master_Server_Id: 1 1 row in set (0.00 sec)
-
在同步的过程中, 发现有几个
Last_SQL_Error: Error ‘Duplicate entry 1026
的SQL Error
, 这个是因为重复主键导致Slave
停止工作, 执行以下操作解决(如果有多条重复的主键, 需要执行多次):mysql> slave stop; mysql> set GLOBAL SQL_SLAVE_SKIP_COUNTER=1; mysql> slave start;
-
还有另一种办法就是修改mysql配置文件
/etc/my.cnf
在[mysqld]
下加一行slave_skip_errors = 1062
,保存后重启mysql
,mysql slave
可以正常同步了.
同步AIPPRD1
的从环境
-
检查
aipprd1
的Slave状态mysql> show slave statusG; *************************** 1. row *************************** Slave_IO_State: Master_Host: 10.66.1.51 Master_User: root Master_Port: 3388 Connect_Retry: 60 Master_Log_File: mysql-bin.000084 Read_Master_Log_Pos: 91941783 Relay_Log_File: mysql-relay-bin.000002 Relay_Log_Pos: 296 Relay_Master_Log_File: mysql-bin.000084 Slave_IO_Running: No Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 91941783 Relay_Log_Space: 452 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: NULL Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 1236 Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'binlog truncated in the middle of event; consider out of disk space on master; the first event 'mysql-bin.000084' at 91941417, the last event read from '/aip/mysql/data/log/mysql-bin.000084' at 91941783, the last byte read from '/aip/mysql/data/log/mysql-bin.000084' at 91942912.' Last_SQL_Errno: 0 Last_SQL_Error: Replicate_Ignore_Server_Ids: Master_Server_Id: 2 1 row in set (0.00 sec)
-
在
aipprd2
上检查日志文件
在aipprd1
上检查show slave statusG;
后, 提示需要修改posfile为mysql-bin.000084
和pos91942912
, 因为在aipprd2
同步完成后, 实际同步的数据是从aipprd1
过来的, 这些数据在aipprd1
上本身就存在的.[root@aipprd2 log]# ll total 653832 -rw-rw---- 1 mysql mysql 104857774 Sep 21 13:12 mysql-bin.000083 -rw-rw---- 1 mysql mysql 91942912 Sep 23 10:21 mysql-bin.000084 -rw-rw---- 1 mysql mysql 6295552 Sep 23 12:10 mysql-bin.000085 -rw-rw---- 1 mysql mysql 8250 Sep 27 17:39 mysql-bin.000086 -rw-rw---- 1 mysql mysql 104858014 Sep 29 17:20 mysql-bin.000087 -rw-rw---- 1 mysql mysql 104857971 Sep 29 19:18 mysql-bin.000088 -rw-rw---- 1 mysql mysql 55744926 Sep 29 20:23 mysql-bin.000089 -rw-rw---- 1 mysql mysql 259 Sep 29 19:18 mysql-bin.log.index -rw-rw---- 1 mysql mysql 70844718 Sep 29 15:35 mysql-relay-bin.000005 -rw-rw---- 1 mysql mysql 104857729 Sep 29 15:44 mysql-relay-bin.000006 -rw-rw---- 1 mysql mysql 708 Sep 29 15:44 mysql-relay-bin.000007 -rw-rw---- 1 mysql mysql 199 Sep 29 15:44 mysql-relay-bin.000008 -rw-rw---- 1 mysql mysql 25188420 Sep 30 08:40 mysql-relay-bin.000009 -rw-rw---- 1 mysql mysql 215 Sep 29 18:42 mysql-relay-bin.index
-
在
aipprd2
上检查show master status;
后, 记录Posfile
为mysql-bin.000089
, 既然aipprd1
的数据为最新的, 且aipprd2
已经从aipprd1
后同步完成了(通过检查show slave statusG;
中的Seconds_Behind_Master:
, 如果此项值很小, 应该是同步完成了.), 那么两边的数据应该差不多的.mysql> show master status; +------------------+-----------+--------------+------------------+ | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | +------------------+-----------+--------------+------------------+ | mysql-bin.000089 | 104341982 | | | +------------------+-----------+--------------+------------------+ 1 row in set (0.00 sec)
-
在
aipprd1
上发起修改posfile
和pos
操作
所以在此用posfile
的mysql-bin.000089
和pos
的0
来修改, 启动Slave
后, 开始同步.mysql> slave stop; Query OK, 0 rows affected (0.11 sec) mysql> CHANGE MASTER TO master_host='10.66.1.51', master_port=3388, master_user='root', master_password='passw0rd', master_log_file='mysql-bin.000089', master_log_pos=0; Query OK, 0 rows affected (0.06 sec) mysql> slave start; Query OK, 0 rows affected (0.00 sec) mysql> show slave statusG; *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: 10.66.1.51 Master_User: root Master_Port: 3388 Connect_Retry: 60 Master_Log_File: mysql-bin.000089 Read_Master_Log_Pos: 55744926 Relay_Log_File: mysql-relay-bin.000002 Relay_Log_Pos: 253 Relay_Master_Log_File: mysql-bin.000089 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 55744926 Relay_Log_Space: 409 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: 0 Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: Replicate_Ignore_Server_Ids: Master_Server_Id: 2 1 row in set (0.00 sec)
转自
Mysql 主主复制失败 - bluetom520的博客 - CSDN博客
http://blog.csdn.net/bluetom520/article/details/54893183