zoukankan      html  css  js  c++  java
  • MHA failover GTID 专题

    https://yq.aliyun.com/articles/238882?spm=5176.8067842.tagmain.18.73PjU3

    摘要: MHA failover GTID 专题 这里以masterha_master_switch为背景详解各种可能遇到的场景 假定环境(经典三节点) host_1(host_1:3306) (current master) +--host_2(host_2:3306 slave[candidat...

    MHA failover GTID 专题

    这里以masterha_master_switch为背景详解各种可能遇到的场景

    假定环境(经典三节点)

    host_1(host_1:3306) (current master)
     +--host_2(host_2:3306 slave[candidate master])
     +--host_3(host_3:3306 etl)
    
    
    

    一、Master : MySQL down

    1.1 etl 延迟8小时

    配置文件中加上no_check_delay=0 即可忽略报错
    
    

    1.2 slave(候选master)比etl还要落后更多

    • 1.2.1 当master的部分日志还没传递两个slave,这时候master 上的MySQL挂了
    ### 模拟现场,现场的3台DB gtid状态
    
    * master host_2
    
    dba:lc> show master status;
    +---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
    | File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |
    +---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
    | host_1.000002 |     2885 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446362 |
    +---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
    1 row in set (0.00 sec)
    
    
    * slave (candidate master) host_1
    
               Retrieved_Gtid_Set: ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446353
                Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446353
                    Auto_Position: 1
    
    * etl (other slave) host_3
    
               Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:4-16,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446353-446356
                Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446356
                    Auto_Position: 1
    
    
    
    ### 切换日志
    masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host=host_2  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
    
    Thu Nov  9 10:43:49 2017 - [info] MHA::MasterFailover version 0.56.
    Thu Nov  9 10:43:49 2017 - [info] Starting master failover.
    Thu Nov  9 10:43:49 2017 - [info]
    Thu Nov  9 10:43:49 2017 - [info] * Phase 1: Configuration Check Phase..
    Thu Nov  9 10:43:49 2017 - [info]
    Thu Nov  9 10:43:50 2017 - [info] HealthCheck: SSH to host_2 is reachable.
    Thu Nov  9 10:43:50 2017 - [info] Binlog server host_2 is reachable.
    Thu Nov  9 10:43:50 2017 - [info] HealthCheck: SSH to host_1 is reachable.
    Thu Nov  9 10:43:50 2017 - [info] Binlog server host_1 is reachable.
    Thu Nov  9 10:43:50 2017 - [info] HealthCheck: SSH to host_3 is reachable.
    Thu Nov  9 10:43:50 2017 - [info] Binlog server host_3 is reachable.
    Thu Nov  9 10:43:51 2017 - [warning] SQL Thread is stopped(no error) on host_1(host_1:3306)
    Thu Nov  9 10:43:51 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)
    Thu Nov  9 10:43:51 2017 - [info] GTID failover mode = 1
    Thu Nov  9 10:43:51 2017 - [info] Dead Servers:
    Thu Nov  9 10:43:51 2017 - [info]   host_2(host_2:3306)
    Thu Nov  9 10:43:51 2017 - [info] Checking master reachability via MySQL(double check)...
    Thu Nov  9 10:43:51 2017 - [info]  ok.
    Thu Nov  9 10:43:51 2017 - [info] Alive Servers:
    Thu Nov  9 10:43:51 2017 - [info]   host_1(host_1:3306)
    Thu Nov  9 10:43:51 2017 - [info]   host_3(host_3:3306)
    Thu Nov  9 10:43:51 2017 - [info] Alive Slaves:
    Thu Nov  9 10:43:51 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Thu Nov  9 10:43:51 2017 - [info]     GTID ON
    Thu Nov  9 10:43:51 2017 - [info]     Replicating from host_2(host_2:3306)
    Thu Nov  9 10:43:51 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
    Thu Nov  9 10:43:51 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Thu Nov  9 10:43:51 2017 - [info]     GTID ON
    Thu Nov  9 10:43:51 2017 - [info]     Replicating from host_2(host_2:3306)
    Thu Nov  9 10:43:51 2017 - [info]     Not candidate for the new Master (no_master is set)
    Thu Nov  9 10:43:51 2017 - [info]  Starting SQL thread on host_1(host_1:3306) ..
    Thu Nov  9 10:43:51 2017 - [info]   done.
    Thu Nov  9 10:43:51 2017 - [info]  Starting SQL thread on host_3(host_3:3306) ..
    Thu Nov  9 10:43:51 2017 - [info]   done.
    Thu Nov  9 10:43:51 2017 - [info] Starting GTID based failover.
    Thu Nov  9 10:43:51 2017 - [info]
    Thu Nov  9 10:43:51 2017 - [info] ** Phase 1: Configuration Check Phase completed.
    Thu Nov  9 10:43:51 2017 - [info]
    Thu Nov  9 10:43:51 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
    Thu Nov  9 10:43:51 2017 - [info]
    Thu Nov  9 10:43:51 2017 - [info] HealthCheck: SSH to host_2 is reachable.
    Thu Nov  9 10:43:51 2017 - [info] Forcing shutdown so that applications never connect to the current master..
    Thu Nov  9 10:43:51 2017 - [info] Executing master IP deactivation script:
    Thu Nov  9 10:43:51 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --command=stopssh --ssh_user=root
    Thu Nov  9 10:43:53 2017 - [info]  done.
    Thu Nov  9 10:43:53 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
    Thu Nov  9 10:43:53 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
    Thu Nov  9 10:43:53 2017 - [info]
    Thu Nov  9 10:43:53 2017 - [info] * Phase 3: Master Recovery Phase..
    Thu Nov  9 10:43:53 2017 - [info]
    Thu Nov  9 10:43:53 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
    Thu Nov  9 10:43:53 2017 - [info]
    Thu Nov  9 10:43:53 2017 - [info] The latest binary log file/position on all slaves is host_1.000002:1115
    Thu Nov  9 10:43:53 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:4-16,
    Thu Nov  9 10:43:53 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
    Thu Nov  9 10:43:53 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Thu Nov  9 10:43:53 2017 - [info]     GTID ON
    Thu Nov  9 10:43:53 2017 - [info]     Replicating from host_2(host_2:3306)
    Thu Nov  9 10:43:53 2017 - [info]     Not candidate for the new Master (no_master is set)
    Thu Nov  9 10:43:53 2017 - [info] The oldest binary log file/position on all slaves is host_1.000002:230
    Thu Nov  9 10:43:53 2017 - [info] Retrieved Gtid Set: ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446353
    Thu Nov  9 10:43:53 2017 - [info] Oldest slaves:
    Thu Nov  9 10:43:53 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Thu Nov  9 10:43:53 2017 - [info]     GTID ON
    Thu Nov  9 10:43:53 2017 - [info]     Replicating from host_2(host_2:3306)
    Thu Nov  9 10:43:53 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
    Thu Nov  9 10:43:53 2017 - [info]
    Thu Nov  9 10:43:53 2017 - [info] * Phase 3.3: Determining New Master Phase..
    Thu Nov  9 10:43:53 2017 - [info]
    Thu Nov  9 10:43:53 2017 - [info] Searching new master from slaves..
    Thu Nov  9 10:43:53 2017 - [info]  Candidate masters from the configuration file:
    Thu Nov  9 10:43:53 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Thu Nov  9 10:43:53 2017 - [info]     GTID ON
    Thu Nov  9 10:43:53 2017 - [info]     Replicating from host_2(host_2:3306)
    Thu Nov  9 10:43:53 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
    Thu Nov  9 10:43:53 2017 - [info]  Non-candidate masters:
    Thu Nov  9 10:43:53 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Thu Nov  9 10:43:53 2017 - [info]     GTID ON
    Thu Nov  9 10:43:53 2017 - [info]     Replicating from host_2(host_2:3306)
    Thu Nov  9 10:43:53 2017 - [info]     Not candidate for the new Master (no_master is set)
    Thu Nov  9 10:43:53 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
    Thu Nov  9 10:43:53 2017 - [info]   Not found.
    Thu Nov  9 10:43:53 2017 - [info]  Searching from all candidate_master slaves..
    Thu Nov  9 10:43:53 2017 - [info] New master is host_1(host_1:3306)
    Thu Nov  9 10:43:53 2017 - [info] Starting master failover..
    Thu Nov  9 10:43:53 2017 - [info]
    Thu Nov  9 10:43:53 2017 - [info]
    Thu Nov  9 10:43:53 2017 - [info] * Phase 3.3: New Master Recovery Phase..
    Thu Nov  9 10:43:53 2017 - [info]
    Thu Nov  9 10:43:53 2017 - [info]  Waiting all logs to be applied..
    Thu Nov  9 10:43:53 2017 - [info]   done.
    Thu Nov  9 10:43:53 2017 - [info]  Replicating from the latest slave host_3(host_3:3306) and waiting to apply..
    Thu Nov  9 10:43:53 2017 - [info]  Waiting all logs to be applied on the latest slave..
    Thu Nov  9 10:43:53 2017 - [info]  Resetting slave host_1(host_1:3306) and starting replication from the new master host_3(host_3:3306)..
    Thu Nov  9 10:43:53 2017 - [info]  Executed CHANGE MASTER.
    Thu Nov  9 10:43:54 2017 - [info]  Slave started.
    Thu Nov  9 10:43:54 2017 - [info]  Waiting to execute all relay logs on host_1(host_1:3306)..
    Thu Nov  9 10:43:54 2017 - [info]  master_pos_wait(host_3.000049:18041) completed on host_1(host_1:3306). Executed 0 events.
    Thu Nov  9 10:43:54 2017 - [info]   done.
    Thu Nov  9 10:43:54 2017 - [info]   done.
    Thu Nov  9 10:43:54 2017 - [info] -- Saving binlog from host host_2 started, pid: 150294
    Thu Nov  9 10:43:54 2017 - [info] -- Saving binlog from host host_1 started, pid: 150295
    Thu Nov  9 10:43:54 2017 - [info] -- Saving binlog from host host_3 started, pid: 150297
    Thu Nov  9 10:43:54 2017 - [info]
    Thu Nov  9 10:43:54 2017 - [info] Log messages from host_1 ...
    Thu Nov  9 10:43:54 2017 - [info]
    Thu Nov  9 10:43:54 2017 - [info] Fetching binary logs from binlog server host_1..
    Thu Nov  9 10:43:54 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=host_1.000002  --start_pos=1115 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171109104349.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
    Thu Nov  9 10:43:54 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
    Thu Nov  9 10:43:54 2017 - [info] End of log messages from host_1.
    Thu Nov  9 10:43:54 2017 - [warning] Got error from host_1.
    Thu Nov  9 10:43:54 2017 - [info]
    Thu Nov  9 10:43:54 2017 - [info] Log messages from host_3 ...
    Thu Nov  9 10:43:54 2017 - [info]
    Thu Nov  9 10:43:54 2017 - [info] Fetching binary logs from binlog server host_3..
    Thu Nov  9 10:43:54 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=host_1.000002  --start_pos=1115 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171109104349.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
    Thu Nov  9 10:43:54 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
    Thu Nov  9 10:43:54 2017 - [info] End of log messages from host_3.
    Thu Nov  9 10:43:54 2017 - [warning] Got error from host_3.
    Thu Nov  9 10:43:55 2017 - [info]
    Thu Nov  9 10:43:55 2017 - [info] Log messages from host_2 ...
    Thu Nov  9 10:43:55 2017 - [info]
    Thu Nov  9 10:43:54 2017 - [info] Fetching binary logs from binlog server host_2..
    Thu Nov  9 10:43:54 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=host_1.000002  --start_pos=1115 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171109104349.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
    Thu Nov  9 10:43:55 2017 - [info] scp from root@host_2:/var/log/masterha/mha_test/saved_binlog_binlog1_20171109104349.binlog to local:/var/log/masterha/mha_test/saved_binlog_host_2_binlog1_20171109104349.binlog succeeded.
    Thu Nov  9 10:43:55 2017 - [info] End of log messages from host_2.
    Thu Nov  9 10:43:55 2017 - [info] Saved mysqlbinlog size from host_2 is 6047 bytes.
    Thu Nov  9 10:43:55 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_host_2_binlog1_20171109104349.binlog ..
    Thu Nov  9 10:43:55 2017 - [info] Differential log apply from binlog server succeeded.
    Thu Nov  9 10:43:55 2017 - [info] Getting new master's binlog name and position..
    Thu Nov  9 10:43:55 2017 - [info]  tjtx-126-164.000053:3624
    Thu Nov  9 10:43:55 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_1', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
    Thu Nov  9 10:43:55 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: tjtx-126-164.000053, 3624, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,
    Thu Nov  9 10:43:55 2017 - [info] Executing master IP activate script:
    Thu Nov  9 10:43:55 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --new_master_host=host_1 --new_master_ip=host_1 --new_master_port=3306 --new_master_user='xxx' --new_master_password='xxx'
    Thu Nov  9 10:43:57 2017 - [info]  OK.
    Thu Nov  9 10:43:57 2017 - [info] Setting read_only=0 on host_1(host_1:3306)..
    Thu Nov  9 10:43:57 2017 - [info]  ok.
    Thu Nov  9 10:43:57 2017 - [info] ** Finished master recovery successfully.
    Thu Nov  9 10:43:57 2017 - [info] * Phase 3: Master Recovery Phase completed.
    Thu Nov  9 10:43:57 2017 - [info]
    Thu Nov  9 10:43:57 2017 - [info] * Phase 4: Slaves Recovery Phase..
    Thu Nov  9 10:43:57 2017 - [info]
    Thu Nov  9 10:43:57 2017 - [info]
    Thu Nov  9 10:43:57 2017 - [info] * Phase 4.1: Starting Slaves in parallel..
    Thu Nov  9 10:43:57 2017 - [info]
    Thu Nov  9 10:43:57 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 155162. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171109104349.log if it takes time..
    Thu Nov  9 10:43:58 2017 - [info]
    Thu Nov  9 10:43:58 2017 - [info] Log messages from host_3 ...
    Thu Nov  9 10:43:58 2017 - [info]
    Thu Nov  9 10:43:57 2017 - [info]  Resetting slave host_3(host_3:3306) and starting replication from the new master host_1(host_1:3306)..
    Thu Nov  9 10:43:57 2017 - [info]  Executed CHANGE MASTER.
    Thu Nov  9 10:43:58 2017 - [info]  Slave started.
    Thu Nov  9 10:43:58 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,
    Thu Nov  9 10:43:58 2017 - [info] End of log messages from host_3.
    Thu Nov  9 10:43:58 2017 - [info] -- Slave on host host_3(host_3:3306) started.
    Thu Nov  9 10:43:58 2017 - [info] All new slave servers recovered successfully.
    Thu Nov  9 10:43:58 2017 - [info]
    Thu Nov  9 10:43:58 2017 - [info] * Phase 5: New master cleanup phase..
    Thu Nov  9 10:43:58 2017 - [info]
    Thu Nov  9 10:43:58 2017 - [info] Resetting slave info on the new master..
    Thu Nov  9 10:43:58 2017 - [info]  host_1: Resetting slave info succeeded.
    Thu Nov  9 10:43:58 2017 - [info] Master failover to host_1(host_1:3306) completed successfully.
    Thu Nov  9 10:43:58 2017 - [info]
    Thu Nov  9 10:43:58 2017 - [info] Sending mail..
    
    
    
    • 1.2.2 当master的所有日志已经传递到1个etl,这时候master 上的MySQL挂了
    
    ### 模拟现场,现场的3台DB gtid状态
    
    * master host_1
    
    dba:lc> show master status;
    +---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
    | File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |
    +---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
    | tjtx-126-164.000053 |     5229 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-21,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446362 |
    +---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
    1 row in set (0.00 sec)
    
    
    
    * slave (candidate master) host_2
    
               Retrieved_Gtid_Set:
                Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446362
                    Auto_Position: 1
    
    * etl (other slave) host_3
    
               Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:17-21,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446357-446362
                Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-21,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446362
                    Auto_Position: 1
    
    
    ### 切换日志
    
    masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host=host_1  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
    
    Thu Nov  9 10:59:14 2017 - [info] MHA::MasterFailover version 0.56.
    Thu Nov  9 10:59:14 2017 - [info] Starting master failover.
    Thu Nov  9 10:59:14 2017 - [info]
    Thu Nov  9 10:59:14 2017 - [info] * Phase 1: Configuration Check Phase..
    Thu Nov  9 10:59:14 2017 - [info]
    Thu Nov  9 10:59:15 2017 - [info] HealthCheck: SSH to host_2 is reachable.
    Thu Nov  9 10:59:15 2017 - [info] Binlog server host_2 is reachable.
    Thu Nov  9 10:59:15 2017 - [info] HealthCheck: SSH to host_1 is reachable.
    Thu Nov  9 10:59:15 2017 - [info] Binlog server host_1 is reachable.
    Thu Nov  9 10:59:15 2017 - [info] HealthCheck: SSH to host_3 is reachable.
    Thu Nov  9 10:59:16 2017 - [info] Binlog server host_3 is reachable.
    Thu Nov  9 10:59:16 2017 - [warning] SQL Thread is stopped(no error) on host_2(host_2:3306)
    Thu Nov  9 10:59:16 2017 - [info] GTID failover mode = 1
    Thu Nov  9 10:59:16 2017 - [info] Dead Servers:
    Thu Nov  9 10:59:16 2017 - [info]   host_1(host_1:3306)
    Thu Nov  9 10:59:16 2017 - [info] Checking master reachability via MySQL(double check)...
    Thu Nov  9 10:59:16 2017 - [info]  ok.
    Thu Nov  9 10:59:16 2017 - [info] Alive Servers:
    Thu Nov  9 10:59:16 2017 - [info]   host_2(host_2:3306)
    Thu Nov  9 10:59:16 2017 - [info]   host_3(host_3:3306)
    Thu Nov  9 10:59:16 2017 - [info] Alive Slaves:
    Thu Nov  9 10:59:16 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Thu Nov  9 10:59:16 2017 - [info]     GTID ON
    Thu Nov  9 10:59:16 2017 - [info]     Replicating from host_1(host_1:3306)
    Thu Nov  9 10:59:16 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
    Thu Nov  9 10:59:16 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Thu Nov  9 10:59:16 2017 - [info]     GTID ON
    Thu Nov  9 10:59:16 2017 - [info]     Replicating from host_1(host_1:3306)
    Thu Nov  9 10:59:16 2017 - [info]     Not candidate for the new Master (no_master is set)
    Thu Nov  9 10:59:16 2017 - [info]  Starting SQL thread on host_2(host_2:3306) ..
    Thu Nov  9 10:59:16 2017 - [info]   done.
    Thu Nov  9 10:59:16 2017 - [info] Starting GTID based failover.
    Thu Nov  9 10:59:16 2017 - [info]
    Thu Nov  9 10:59:16 2017 - [info] ** Phase 1: Configuration Check Phase completed.
    Thu Nov  9 10:59:16 2017 - [info]
    Thu Nov  9 10:59:16 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
    Thu Nov  9 10:59:16 2017 - [info]
    Thu Nov  9 10:59:16 2017 - [info] HealthCheck: SSH to host_1 is reachable.
    Thu Nov  9 10:59:16 2017 - [info] Forcing shutdown so that applications never connect to the current master..
    Thu Nov  9 10:59:16 2017 - [info] Executing master IP deactivation script:
    Thu Nov  9 10:59:16 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --command=stopssh --ssh_user=root
    Thu Nov  9 10:59:20 2017 - [info]  done.
    Thu Nov  9 10:59:20 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
    Thu Nov  9 10:59:20 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
    Thu Nov  9 10:59:20 2017 - [info]
    Thu Nov  9 10:59:20 2017 - [info] * Phase 3: Master Recovery Phase..
    Thu Nov  9 10:59:20 2017 - [info]
    Thu Nov  9 10:59:20 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
    Thu Nov  9 10:59:20 2017 - [info]
    Thu Nov  9 10:59:20 2017 - [info] The latest binary log file/position on all slaves is tjtx-126-164.000053:5229
    Thu Nov  9 10:59:20 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:17-21,
    Thu Nov  9 10:59:20 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
    Thu Nov  9 10:59:20 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Thu Nov  9 10:59:20 2017 - [info]     GTID ON
    Thu Nov  9 10:59:20 2017 - [info]     Replicating from host_1(host_1:3306)
    Thu Nov  9 10:59:20 2017 - [info]     Not candidate for the new Master (no_master is set)
    Thu Nov  9 10:59:20 2017 - [info] The oldest binary log file/position on all slaves is tjtx-126-164.000053:3624
    Thu Nov  9 10:59:20 2017 - [info] Oldest slaves:
    Thu Nov  9 10:59:20 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Thu Nov  9 10:59:20 2017 - [info]     GTID ON
    Thu Nov  9 10:59:20 2017 - [info]     Replicating from host_1(host_1:3306)
    Thu Nov  9 10:59:20 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
    Thu Nov  9 10:59:20 2017 - [info]
    Thu Nov  9 10:59:20 2017 - [info] * Phase 3.3: Determining New Master Phase..
    Thu Nov  9 10:59:20 2017 - [info]
    Thu Nov  9 10:59:20 2017 - [info] Searching new master from slaves..
    Thu Nov  9 10:59:20 2017 - [info]  Candidate masters from the configuration file:
    Thu Nov  9 10:59:20 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Thu Nov  9 10:59:20 2017 - [info]     GTID ON
    Thu Nov  9 10:59:20 2017 - [info]     Replicating from host_1(host_1:3306)
    Thu Nov  9 10:59:20 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
    Thu Nov  9 10:59:20 2017 - [info]  Non-candidate masters:
    Thu Nov  9 10:59:20 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Thu Nov  9 10:59:20 2017 - [info]     GTID ON
    Thu Nov  9 10:59:20 2017 - [info]     Replicating from host_1(host_1:3306)
    Thu Nov  9 10:59:20 2017 - [info]     Not candidate for the new Master (no_master is set)
    Thu Nov  9 10:59:20 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
    Thu Nov  9 10:59:20 2017 - [info]   Not found.
    Thu Nov  9 10:59:20 2017 - [info]  Searching from all candidate_master slaves..
    Thu Nov  9 10:59:20 2017 - [info] New master is host_2(host_2:3306)
    Thu Nov  9 10:59:20 2017 - [info] Starting master failover..
    Thu Nov  9 10:59:20 2017 - [info]
    Thu Nov  9 10:59:20 2017 - [info]
    Thu Nov  9 10:59:20 2017 - [info] * Phase 3.3: New Master Recovery Phase..
    Thu Nov  9 10:59:20 2017 - [info]
    Thu Nov  9 10:59:20 2017 - [info]  Waiting all logs to be applied..
    Thu Nov  9 10:59:20 2017 - [info]   done.
    Thu Nov  9 10:59:20 2017 - [info]  Replicating from the latest slave host_3(host_3:3306) and waiting to apply..
    Thu Nov  9 10:59:20 2017 - [info]  Waiting all logs to be applied on the latest slave..
    Thu Nov  9 10:59:20 2017 - [info]  Resetting slave host_2(host_2:3306) and starting replication from the new master host_3(host_3:3306)..
    Thu Nov  9 10:59:20 2017 - [info]  Executed CHANGE MASTER.
    Thu Nov  9 10:59:21 2017 - [info]  Slave started.
    Thu Nov  9 10:59:21 2017 - [info]  Waiting to execute all relay logs on host_2(host_2:3306)..
    Thu Nov  9 10:59:21 2017 - [info]  master_pos_wait(host_3.000049:22035) completed on host_2(host_2:3306). Executed 0 events.
    Thu Nov  9 10:59:21 2017 - [info]   done.
    Thu Nov  9 10:59:21 2017 - [info]   done.
    Thu Nov  9 10:59:21 2017 - [info] -- Saving binlog from host host_2 started, pid: 184482
    Thu Nov  9 10:59:21 2017 - [info] -- Saving binlog from host host_1 started, pid: 184483
    Thu Nov  9 10:59:21 2017 - [info] -- Saving binlog from host host_3 started, pid: 184487
    Thu Nov  9 10:59:21 2017 - [info]
    Thu Nov  9 10:59:21 2017 - [info] Log messages from host_2 ...
    Thu Nov  9 10:59:21 2017 - [info]
    Thu Nov  9 10:59:21 2017 - [info] Fetching binary logs from binlog server host_2..
    Thu Nov  9 10:59:21 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=tjtx-126-164.000053  --start_pos=5229 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171109105914.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
    Thu Nov  9 10:59:21 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
    Thu Nov  9 10:59:21 2017 - [info] End of log messages from host_2.
    Thu Nov  9 10:59:21 2017 - [warning] Got error from host_2.
    Thu Nov  9 10:59:21 2017 - [info]
    Thu Nov  9 10:59:21 2017 - [info] Log messages from host_3 ...
    Thu Nov  9 10:59:21 2017 - [info]
    Thu Nov  9 10:59:21 2017 - [info] Fetching binary logs from binlog server host_3..
    Thu Nov  9 10:59:21 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=tjtx-126-164.000053  --start_pos=5229 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171109105914.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
    Thu Nov  9 10:59:21 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
    Thu Nov  9 10:59:21 2017 - [info] End of log messages from host_3.
    Thu Nov  9 10:59:21 2017 - [warning] Got error from host_3.
    Thu Nov  9 10:59:22 2017 - [info]
    Thu Nov  9 10:59:22 2017 - [info] Log messages from host_1 ...
    Thu Nov  9 10:59:22 2017 - [info]
    Thu Nov  9 10:59:21 2017 - [info] Fetching binary logs from binlog server host_1..
    Thu Nov  9 10:59:21 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=tjtx-126-164.000053  --start_pos=5229 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171109105914.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
    Thu Nov  9 10:59:22 2017 - [info] scp from root@host_1:/var/log/masterha/mha_test/saved_binlog_binlog2_20171109105914.binlog to local:/var/log/masterha/mha_test/saved_binlog_host_1_binlog2_20171109105914.binlog succeeded.
    Thu Nov  9 10:59:22 2017 - [info] End of log messages from host_1.
    Thu Nov  9 10:59:22 2017 - [info] Saved mysqlbinlog size from host_1 is 800 bytes.
    Thu Nov  9 10:59:22 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_host_1_binlog2_20171109105914.binlog ..
    Thu Nov  9 10:59:22 2017 - [info] Differential log apply from binlog server succeeded.
    Thu Nov  9 10:59:22 2017 - [info] Getting new master's binlog name and position..
    Thu Nov  9 10:59:22 2017 - [info]  host_1.000003:1680
    Thu Nov  9 10:59:22 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_2', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
    Thu Nov  9 10:59:22 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: host_1.000003, 1680, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-21,
    Thu Nov  9 10:59:22 2017 - [info] Executing master IP activate script:
    Thu Nov  9 10:59:22 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --new_master_host=host_2 --new_master_ip=host_2 --new_master_port=3306 --new_master_user='xxx' --new_master_password='xxx'
    Thu Nov  9 10:59:24 2017 - [info]  OK.
    Thu Nov  9 10:59:24 2017 - [info] Setting read_only=0 on host_2(host_2:3306)..
    Thu Nov  9 10:59:24 2017 - [info]  ok.
    Thu Nov  9 10:59:24 2017 - [info] ** Finished master recovery successfully.
    Thu Nov  9 10:59:24 2017 - [info] * Phase 3: Master Recovery Phase completed.
    Thu Nov  9 10:59:24 2017 - [info]
    Thu Nov  9 10:59:24 2017 - [info] * Phase 4: Slaves Recovery Phase..
    Thu Nov  9 10:59:24 2017 - [info]
    Thu Nov  9 10:59:24 2017 - [info]
    Thu Nov  9 10:59:24 2017 - [info] * Phase 4.1: Starting Slaves in parallel..
    Thu Nov  9 10:59:24 2017 - [info]
    Thu Nov  9 10:59:24 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 189393. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171109105914.log if it takes time..
    Thu Nov  9 10:59:25 2017 - [info]
    Thu Nov  9 10:59:25 2017 - [info] Log messages from host_3 ...
    Thu Nov  9 10:59:25 2017 - [info]
    Thu Nov  9 10:59:24 2017 - [info]  Resetting slave host_3(host_3:3306) and starting replication from the new master host_2(host_2:3306)..
    Thu Nov  9 10:59:24 2017 - [info]  Executed CHANGE MASTER.
    Thu Nov  9 10:59:25 2017 - [info]  Slave started.
    Thu Nov  9 10:59:25 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-21,
    Thu Nov  9 10:59:25 2017 - [info] End of log messages from host_3.
    Thu Nov  9 10:59:25 2017 - [info] -- Slave on host host_3(host_3:3306) started.
    Thu Nov  9 10:59:25 2017 - [info] All new slave servers recovered successfully.
    Thu Nov  9 10:59:25 2017 - [info]
    Thu Nov  9 10:59:25 2017 - [info] * Phase 5: New master cleanup phase..
    Thu Nov  9 10:59:25 2017 - [info]
    Thu Nov  9 10:59:25 2017 - [info] Resetting slave info on the new master..
    Thu Nov  9 10:59:25 2017 - [info]  host_2: Resetting slave info succeeded.
    Thu Nov  9 10:59:25 2017 - [info] Master failover to host_2(host_2:3306) completed successfully.
    Thu Nov  9 10:59:25 2017 - [info]
    Thu Nov  9 10:59:25 2017 - [info] Sending mail..
    
    
    

    1.3 slave(候选master)的日志是最新的,比etl要多

    • 1.3.1 当master的部分日志还没传递两个slave,这时候master 上的MySQL挂了
    masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host=host_1  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
    
    
    Tue Nov  7 17:11:29 2017 - [info] MHA::MasterFailover version 0.56.
    Tue Nov  7 17:11:29 2017 - [info] Starting master failover.
    Tue Nov  7 17:11:29 2017 - [info]
    Tue Nov  7 17:11:29 2017 - [info] * Phase 1: Configuration Check Phase..
    Tue Nov  7 17:11:29 2017 - [info]
    Tue Nov  7 17:11:29 2017 - [info] HealthCheck: SSH to host_2 is reachable.
    Tue Nov  7 17:11:29 2017 - [info] Binlog server host_2 is reachable.
    Tue Nov  7 17:11:29 2017 - [info] HealthCheck: SSH to host_1 is reachable.
    Tue Nov  7 17:11:30 2017 - [info] Binlog server host_1 is reachable.
    Tue Nov  7 17:11:30 2017 - [info] HealthCheck: SSH to host_3 is reachable.
    Tue Nov  7 17:11:30 2017 - [info] Binlog server host_3 is reachable.
    Tue Nov  7 17:11:30 2017 - [warning] SQL Thread is stopped(no error) on host_2(host_2:3306)
    Tue Nov  7 17:11:30 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)
    Tue Nov  7 17:11:30 2017 - [info] GTID failover mode = 1
    Tue Nov  7 17:11:30 2017 - [info] Dead Servers:
    Tue Nov  7 17:11:30 2017 - [info]   host_1(host_1:3306)
    Tue Nov  7 17:11:30 2017 - [info] Checking master reachability via MySQL(double check)...
    Tue Nov  7 17:11:30 2017 - [info]  ok.
    Tue Nov  7 17:11:30 2017 - [info] Alive Servers:
    Tue Nov  7 17:11:30 2017 - [info]   host_2(host_2:3306)
    Tue Nov  7 17:11:30 2017 - [info]   host_3(host_3:3306)
    Tue Nov  7 17:11:30 2017 - [info] Alive Slaves:
    Tue Nov  7 17:11:30 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Tue Nov  7 17:11:30 2017 - [info]     GTID ON
    Tue Nov  7 17:11:30 2017 - [info]     Replicating from host_1(host_1:3306)
    Tue Nov  7 17:11:30 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
    Tue Nov  7 17:11:30 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Tue Nov  7 17:11:30 2017 - [info]     GTID ON
    Tue Nov  7 17:11:30 2017 - [info]     Replicating from host_1(host_1:3306)
    Tue Nov  7 17:11:30 2017 - [info]     Not candidate for the new Master (no_master is set)
    Tue Nov  7 17:11:30 2017 - [info]  Starting SQL thread on host_2(host_2:3306) ..
    Tue Nov  7 17:11:30 2017 - [info]   done.
    Tue Nov  7 17:11:30 2017 - [info]  Starting SQL thread on host_3(host_3:3306) ..
    Tue Nov  7 17:11:30 2017 - [info]   done.
    Tue Nov  7 17:11:30 2017 - [info] Starting GTID based failover.
    Tue Nov  7 17:11:30 2017 - [info]
    Tue Nov  7 17:11:30 2017 - [info] ** Phase 1: Configuration Check Phase completed.
    Tue Nov  7 17:11:30 2017 - [info]
    Tue Nov  7 17:11:30 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
    Tue Nov  7 17:11:30 2017 - [info]
    Tue Nov  7 17:11:30 2017 - [info] HealthCheck: SSH to host_1 is reachable.
    Tue Nov  7 17:11:31 2017 - [info] Forcing shutdown so that applications never connect to the current master..
    Tue Nov  7 17:11:31 2017 - [info] Executing master IP deactivation script:
    Tue Nov  7 17:11:31 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --command=stopssh --ssh_user=root
    Tue Nov  7 17:11:33 2017 - [info]  done.
    Tue Nov  7 17:11:33 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
    Tue Nov  7 17:11:33 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
    Tue Nov  7 17:11:33 2017 - [info]
    Tue Nov  7 17:11:33 2017 - [info] * Phase 3: Master Recovery Phase..
    Tue Nov  7 17:11:33 2017 - [info]
    Tue Nov  7 17:11:33 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
    Tue Nov  7 17:11:33 2017 - [info]
    Tue Nov  7 17:11:33 2017 - [info] The latest binary log file/position on all slaves is tjtx-126-164.000051:13508
    Tue Nov  7 17:11:33 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:3-8
    Tue Nov  7 17:11:33 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
    Tue Nov  7 17:11:33 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Tue Nov  7 17:11:33 2017 - [info]     GTID ON
    Tue Nov  7 17:11:33 2017 - [info]     Replicating from host_1(host_1:3306)
    Tue Nov  7 17:11:33 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
    Tue Nov  7 17:11:33 2017 - [info] The oldest binary log file/position on all slaves is tjtx-126-164.000051:11918
    Tue Nov  7 17:11:33 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:2-3,
    Tue Nov  7 17:11:33 2017 - [info] Oldest slaves:
    Tue Nov  7 17:11:33 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Tue Nov  7 17:11:33 2017 - [info]     GTID ON
    Tue Nov  7 17:11:33 2017 - [info]     Replicating from host_1(host_1:3306)
    Tue Nov  7 17:11:33 2017 - [info]     Not candidate for the new Master (no_master is set)
    Tue Nov  7 17:11:33 2017 - [info]
    Tue Nov  7 17:11:33 2017 - [info] * Phase 3.3: Determining New Master Phase..
    Tue Nov  7 17:11:33 2017 - [info]
    Tue Nov  7 17:11:33 2017 - [info] Searching new master from slaves..
    Tue Nov  7 17:11:33 2017 - [info]  Candidate masters from the configuration file:
    Tue Nov  7 17:11:33 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Tue Nov  7 17:11:33 2017 - [info]     GTID ON
    Tue Nov  7 17:11:33 2017 - [info]     Replicating from host_1(host_1:3306)
    Tue Nov  7 17:11:33 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
    Tue Nov  7 17:11:33 2017 - [info]  Non-candidate masters:
    Tue Nov  7 17:11:33 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Tue Nov  7 17:11:33 2017 - [info]     GTID ON
    Tue Nov  7 17:11:33 2017 - [info]     Replicating from host_1(host_1:3306)
    Tue Nov  7 17:11:33 2017 - [info]     Not candidate for the new Master (no_master is set)
    Tue Nov  7 17:11:33 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
    Tue Nov  7 17:11:33 2017 - [info] New master is host_2(host_2:3306)
    Tue Nov  7 17:11:33 2017 - [info] Starting master failover..
    Tue Nov  7 17:11:33 2017 - [info]
    Tue Nov  7 17:11:33 2017 - [info]
    Tue Nov  7 17:11:33 2017 - [info] * Phase 3.3: New Master Recovery Phase..
    Tue Nov  7 17:11:33 2017 - [info]
    Tue Nov  7 17:11:33 2017 - [info]  Waiting all logs to be applied..
    Tue Nov  7 17:11:33 2017 - [info]   done.
    Tue Nov  7 17:11:33 2017 - [info] -- Saving binlog from host host_2 started, pid: 54677
    Tue Nov  7 17:11:33 2017 - [info] -- Saving binlog from host host_1 started, pid: 54681
    Tue Nov  7 17:11:33 2017 - [info] -- Saving binlog from host host_3 started, pid: 54683
    Tue Nov  7 17:11:33 2017 - [info]
    Tue Nov  7 17:11:33 2017 - [info] Log messages from host_3 ...
    Tue Nov  7 17:11:33 2017 - [info]
    Tue Nov  7 17:11:33 2017 - [info] Fetching binary logs from binlog server host_3..
    Tue Nov  7 17:11:33 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=tjtx-126-164.000051  --start_pos=13508 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171107171129.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
    Tue Nov  7 17:11:33 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
    Tue Nov  7 17:11:33 2017 - [info] End of log messages from host_3.
    Tue Nov  7 17:11:33 2017 - [warning] Got error from host_3.
    Tue Nov  7 17:11:33 2017 - [info]
    Tue Nov  7 17:11:33 2017 - [info] Log messages from host_2 ...
    Tue Nov  7 17:11:33 2017 - [info]
    Tue Nov  7 17:11:33 2017 - [info] Fetching binary logs from binlog server host_2..
    Tue Nov  7 17:11:33 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=tjtx-126-164.000051  --start_pos=13508 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171107171129.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
    Tue Nov  7 17:11:33 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
    Tue Nov  7 17:11:33 2017 - [info] End of log messages from host_2.
    Tue Nov  7 17:11:33 2017 - [warning] Got error from host_2.
    Tue Nov  7 17:11:33 2017 - [info]
    Tue Nov  7 17:11:33 2017 - [info] Log messages from host_1 ...
    Tue Nov  7 17:11:33 2017 - [info]
    Tue Nov  7 17:11:33 2017 - [info] Fetching binary logs from binlog server host_1..
    Tue Nov  7 17:11:33 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=tjtx-126-164.000051  --start_pos=13508 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171107171129.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
    Tue Nov  7 17:11:33 2017 - [info] scp from root@host_1:/var/log/masterha/mha_test/saved_binlog_binlog2_20171107171129.binlog to local:/var/log/masterha/mha_test/saved_binlog_host_1_binlog2_20171107171129.binlog succeeded.
    Tue Nov  7 17:11:33 2017 - [info] End of log messages from host_1.
    Tue Nov  7 17:11:33 2017 - [info] Saved mysqlbinlog size from host_1 is 8578 bytes.
    Tue Nov  7 17:11:33 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_host_1_binlog2_20171107171129.binlog ..
    Tue Nov  7 17:11:33 2017 - [info] Differential log apply from binlog server succeeded.
    Tue Nov  7 17:11:33 2017 - [info] Getting new master's binlog name and position..
    Tue Nov  7 17:11:33 2017 - [info]  host_1.000001:5048
    Tue Nov  7 17:11:33 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_2', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
    Tue Nov  7 17:11:33 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: host_1.000001, 5048, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,
    Tue Nov  7 17:11:33 2017 - [info] Executing master IP activate script:
    Tue Nov  7 17:11:33 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --new_master_host=host_2 --new_master_ip=host_2 --new_master_port=3306 --new_master_user='xxx' --new_master_password='xxx'
    Tue Nov  7 17:11:36 2017 - [info]  OK.
    Tue Nov  7 17:11:36 2017 - [info] Setting read_only=0 on host_2(host_2:3306)..
    Tue Nov  7 17:11:36 2017 - [info]  ok.
    Tue Nov  7 17:11:36 2017 - [info] ** Finished master recovery successfully.
    Tue Nov  7 17:11:36 2017 - [info] * Phase 3: Master Recovery Phase completed.
    Tue Nov  7 17:11:36 2017 - [info]
    Tue Nov  7 17:11:36 2017 - [info] * Phase 4: Slaves Recovery Phase..
    Tue Nov  7 17:11:36 2017 - [info]
    Tue Nov  7 17:11:36 2017 - [info]
    Tue Nov  7 17:11:36 2017 - [info] * Phase 4.1: Starting Slaves in parallel..
    Tue Nov  7 17:11:36 2017 - [info]
    Tue Nov  7 17:11:36 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 58422. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171107171129.log if it takes time..
    Tue Nov  7 17:11:37 2017 - [info]
    Tue Nov  7 17:11:37 2017 - [info] Log messages from host_3 ...
    Tue Nov  7 17:11:37 2017 - [info]
    Tue Nov  7 17:11:36 2017 - [info]  Resetting slave host_3(host_3:3306) and starting replication from the new master host_2(host_2:3306)..
    Tue Nov  7 17:11:36 2017 - [info]  Executed CHANGE MASTER.
    Tue Nov  7 17:11:37 2017 - [info]  Slave started.
    Tue Nov  7 17:11:37 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,
    Tue Nov  7 17:11:37 2017 - [info] End of log messages from host_3.
    Tue Nov  7 17:11:37 2017 - [info] -- Slave on host host_3(host_3:3306) started.
    Tue Nov  7 17:11:37 2017 - [info] All new slave servers recovered successfully.
    Tue Nov  7 17:11:37 2017 - [info]
    Tue Nov  7 17:11:37 2017 - [info] * Phase 5: New master cleanup phase..
    Tue Nov  7 17:11:37 2017 - [info]
    Tue Nov  7 17:11:37 2017 - [info] Resetting slave info on the new master..
    Tue Nov  7 17:11:37 2017 - [info]  host_2: Resetting slave info succeeded.
    Tue Nov  7 17:11:37 2017 - [info] Master failover to host_2(host_2:3306) completed successfully.
    Tue Nov  7 17:11:37 2017 - [info]
    Tue Nov  7 17:11:37 2017 - [info] Sending mail..
    
    
    • 1.3.2 当master的所有日志已经传递slave,这时候master 上的MySQL挂了
    masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host=host_2  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
    
    
    Tue Nov  7 15:56:11 2017 - [info] MHA::MasterFailover version 0.56.
    Tue Nov  7 15:56:11 2017 - [info] Starting master failover.
    Tue Nov  7 15:56:11 2017 - [info]
    Tue Nov  7 15:56:11 2017 - [info] * Phase 1: Configuration Check Phase..
    Tue Nov  7 15:56:11 2017 - [info]
    Tue Nov  7 15:56:11 2017 - [info] HealthCheck: SSH to host_2 is reachable.
    Tue Nov  7 15:56:12 2017 - [info] Binlog server host_2 is reachable.
    Tue Nov  7 15:56:12 2017 - [info] HealthCheck: SSH to host_1 is reachable.
    Tue Nov  7 15:56:12 2017 - [info] Binlog server host_1 is reachable.
    Tue Nov  7 15:56:12 2017 - [info] HealthCheck: SSH to host_3 is reachable.
    Tue Nov  7 15:56:13 2017 - [info] Binlog server host_3 is reachable.
    Tue Nov  7 15:56:13 2017 - [warning] SQL Thread is stopped(no error) on host_1(host_1:3306)
    Tue Nov  7 15:56:13 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)
    Tue Nov  7 15:56:13 2017 - [info] GTID failover mode = 1
    Tue Nov  7 15:56:13 2017 - [info] Dead Servers:
    Tue Nov  7 15:56:13 2017 - [info]   host_2(host_2:3306)
    Tue Nov  7 15:56:13 2017 - [info] Checking master reachability via MySQL(double check)...
    Tue Nov  7 15:56:13 2017 - [info]  ok.
    Tue Nov  7 15:56:13 2017 - [info] Alive Servers:
    Tue Nov  7 15:56:13 2017 - [info]   host_1(host_1:3306)
    Tue Nov  7 15:56:13 2017 - [info]   host_3(host_3:3306)
    Tue Nov  7 15:56:13 2017 - [info] Alive Slaves:
    Tue Nov  7 15:56:13 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Tue Nov  7 15:56:13 2017 - [info]     GTID ON
    Tue Nov  7 15:56:13 2017 - [info]     Replicating from host_2(host_2:3306)
    Tue Nov  7 15:56:13 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
    Tue Nov  7 15:56:13 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Tue Nov  7 15:56:13 2017 - [info]     GTID ON
    Tue Nov  7 15:56:13 2017 - [info]     Replicating from host_2(host_2:3306)
    Tue Nov  7 15:56:13 2017 - [info]     Not candidate for the new Master (no_master is set)
    Tue Nov  7 15:56:13 2017 - [info]  Starting SQL thread on host_1(host_1:3306) ..
    Tue Nov  7 15:56:13 2017 - [info]   done.
    Tue Nov  7 15:56:13 2017 - [info]  Starting SQL thread on host_3(host_3:3306) ..
    Tue Nov  7 15:56:13 2017 - [info]   done.
    Tue Nov  7 15:56:13 2017 - [info] Starting GTID based failover.
    Tue Nov  7 15:56:13 2017 - [info]
    Tue Nov  7 15:56:13 2017 - [info] ** Phase 1: Configuration Check Phase completed.
    Tue Nov  7 15:56:13 2017 - [info]
    Tue Nov  7 15:56:13 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
    Tue Nov  7 15:56:13 2017 - [info]
    Tue Nov  7 15:56:13 2017 - [info] HealthCheck: SSH to host_2 is reachable.
    Tue Nov  7 15:56:13 2017 - [info] Forcing shutdown so that applications never connect to the current master..
    Tue Nov  7 15:56:13 2017 - [info] Executing master IP deactivation script:
    Tue Nov  7 15:56:13 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --command=stopssh --ssh_user=root
    Tue Nov  7 15:56:16 2017 - [info]  done.
    Tue Nov  7 15:56:16 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
    Tue Nov  7 15:56:16 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
    Tue Nov  7 15:56:16 2017 - [info]
    Tue Nov  7 15:56:16 2017 - [info] * Phase 3: Master Recovery Phase..
    Tue Nov  7 15:56:16 2017 - [info]
    Tue Nov  7 15:56:16 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
    Tue Nov  7 15:56:16 2017 - [info]
    Tue Nov  7 15:56:16 2017 - [info] The latest binary log file/position on all slaves is host_1.000049:11291
    Tue Nov  7 15:56:16 2017 - [info] Retrieved Gtid Set: ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:3-446352
    Tue Nov  7 15:56:16 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
    Tue Nov  7 15:56:16 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Tue Nov  7 15:56:16 2017 - [info]     GTID ON
    Tue Nov  7 15:56:16 2017 - [info]     Replicating from host_2(host_2:3306)
    Tue Nov  7 15:56:16 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
    Tue Nov  7 15:56:16 2017 - [info] The oldest binary log file/position on all slaves is host_1.000049:10703
    Tue Nov  7 15:56:16 2017 - [info] Retrieved Gtid Set: ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:3-446350
    Tue Nov  7 15:56:16 2017 - [info] Oldest slaves:
    Tue Nov  7 15:56:16 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Tue Nov  7 15:56:16 2017 - [info]     GTID ON
    Tue Nov  7 15:56:16 2017 - [info]     Replicating from host_2(host_2:3306)
    Tue Nov  7 15:56:16 2017 - [info]     Not candidate for the new Master (no_master is set)
    Tue Nov  7 15:56:16 2017 - [info]
    Tue Nov  7 15:56:16 2017 - [info] * Phase 3.3: Determining New Master Phase..
    Tue Nov  7 15:56:16 2017 - [info]
    Tue Nov  7 15:56:16 2017 - [info] Searching new master from slaves..
    Tue Nov  7 15:56:16 2017 - [info]  Candidate masters from the configuration file:
    Tue Nov  7 15:56:16 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Tue Nov  7 15:56:16 2017 - [info]     GTID ON
    Tue Nov  7 15:56:16 2017 - [info]     Replicating from host_2(host_2:3306)
    Tue Nov  7 15:56:16 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
    Tue Nov  7 15:56:16 2017 - [info]  Non-candidate masters:
    Tue Nov  7 15:56:16 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Tue Nov  7 15:56:16 2017 - [info]     GTID ON
    Tue Nov  7 15:56:16 2017 - [info]     Replicating from host_2(host_2:3306)
    Tue Nov  7 15:56:16 2017 - [info]     Not candidate for the new Master (no_master is set)
    Tue Nov  7 15:56:16 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
    Tue Nov  7 15:56:16 2017 - [info] New master is host_1(host_1:3306)
    Tue Nov  7 15:56:16 2017 - [info] Starting master failover..
    Tue Nov  7 15:56:16 2017 - [info]
    Tue Nov  7 15:56:16 2017 - [info]
    Tue Nov  7 15:56:16 2017 - [info] * Phase 3.3: New Master Recovery Phase..
    Tue Nov  7 15:56:16 2017 - [info]
    Tue Nov  7 15:56:16 2017 - [info]  Waiting all logs to be applied..
    Tue Nov  7 15:56:16 2017 - [info]   done.
    Tue Nov  7 15:56:16 2017 - [info] -- Saving binlog from host host_2 started, pid: 79759
    Tue Nov  7 15:56:16 2017 - [info] -- Saving binlog from host host_1 started, pid: 79768
    Tue Nov  7 15:56:16 2017 - [info] -- Saving binlog from host host_3 started, pid: 79770
    Tue Nov  7 15:56:17 2017 - [info]
    Tue Nov  7 15:56:17 2017 - [info] Log messages from host_1 ...
    Tue Nov  7 15:56:17 2017 - [info]
    Tue Nov  7 15:56:16 2017 - [info] Fetching binary logs from binlog server host_1..
    Tue Nov  7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=host_1.000049  --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
    Tue Nov  7 15:56:17 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
    Tue Nov  7 15:56:17 2017 - [info] End of log messages from host_1.
    Tue Nov  7 15:56:17 2017 - [warning] Got error from host_1.
    Tue Nov  7 15:56:17 2017 - [info]
    Tue Nov  7 15:56:17 2017 - [info] Log messages from host_3 ...
    Tue Nov  7 15:56:17 2017 - [info]
    Tue Nov  7 15:56:16 2017 - [info] Fetching binary logs from binlog server host_3..
    Tue Nov  7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=host_1.000049  --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
    Tue Nov  7 15:56:17 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
    Tue Nov  7 15:56:17 2017 - [info] End of log messages from host_3.
    Tue Nov  7 15:56:17 2017 - [warning] Got error from host_3.
    Tue Nov  7 15:56:17 2017 - [info]
    Tue Nov  7 15:56:17 2017 - [info] Log messages from host_2 ...
    Tue Nov  7 15:56:17 2017 - [info]
    Tue Nov  7 15:56:16 2017 - [info] Fetching binary logs from binlog server host_2..
    Tue Nov  7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=host_1.000049  --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
    Tue Nov  7 15:56:17 2017 - [info] scp from root@host_2:/var/log/masterha/mha_test/saved_binlog_binlog1_20171107155611.binlog to local:/var/log/masterha/mha_test/saved_binlog_host_2_binlog1_20171107155611.binlog succeeded.
    Tue Nov  7 15:56:17 2017 - [info] End of log messages from host_2.
    Tue Nov  7 15:56:17 2017 - [info] Saved mysqlbinlog size from host_2 is 768 bytes.
    Tue Nov  7 15:56:17 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_host_2_binlog1_20171107155611.binlog ..
    Tue Nov  7 15:56:17 2017 - [info] Differential log apply from binlog server succeeded.
    Tue Nov  7 15:56:17 2017 - [info] Getting new master's binlog name and position..
    Tue Nov  7 15:56:17 2017 - [info]  tjtx-126-164.000051:11449
    Tue Nov  7 15:56:17 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_1', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
    Tue Nov  7 15:56:17 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: tjtx-126-164.000051, 11449, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1,
    Tue Nov  7 15:56:17 2017 - [info] Executing master IP activate script:
    Tue Nov  7 15:56:17 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --new_master_host=host_1 --new_master_ip=host_1 --new_master_port=3306 --new_master_user='xxx' --new_master_password='xxx'
    Tue Nov  7 15:56:20 2017 - [info]  OK.
    Tue Nov  7 15:56:20 2017 - [info] Setting read_only=0 on host_1(host_1:3306)..
    Tue Nov  7 15:56:20 2017 - [info]  ok.
    Tue Nov  7 15:56:20 2017 - [info] ** Finished master recovery successfully.
    Tue Nov  7 15:56:20 2017 - [info] * Phase 3: Master Recovery Phase completed.
    Tue Nov  7 15:56:20 2017 - [info]
    Tue Nov  7 15:56:20 2017 - [info] * Phase 4: Slaves Recovery Phase..
    Tue Nov  7 15:56:20 2017 - [info]
    Tue Nov  7 15:56:20 2017 - [info]
    Tue Nov  7 15:56:20 2017 - [info] * Phase 4.1: Starting Slaves in parallel..
    Tue Nov  7 15:56:20 2017 - [info]
    Tue Nov  7 15:56:20 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 85941. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171107155611.log if it takes time..
    Tue Nov  7 15:56:21 2017 - [info]
    Tue Nov  7 15:56:21 2017 - [info] Log messages from host_3 ...
    Tue Nov  7 15:56:21 2017 - [info]
    Tue Nov  7 15:56:20 2017 - [info]  Resetting slave host_3(host_3:3306) and starting replication from the new master host_1(host_1:3306)..
    Tue Nov  7 15:56:20 2017 - [info]  Executed CHANGE MASTER.
    Tue Nov  7 15:56:21 2017 - [info]  Slave started.
    Tue Nov  7 15:56:21 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1,
    Tue Nov  7 15:56:21 2017 - [info] End of log messages from host_3.
    Tue Nov  7 15:56:21 2017 - [info] -- Slave on host host_3(host_3:3306) started.
    Tue Nov  7 15:56:21 2017 - [info] All new slave servers recovered successfully.
    Tue Nov  7 15:56:21 2017 - [info]
    Tue Nov  7 15:56:21 2017 - [info] * Phase 5: New master cleanup phase..
    Tue Nov  7 15:56:21 2017 - [info]
    Tue Nov  7 15:56:21 2017 - [info] Resetting slave info on the new master..
    Tue Nov  7 15:56:21 2017 - [info]  host_1: Resetting slave info succeeded.
    Tue Nov  7 15:56:21 2017 - [info] Master failover to host_1(host_1:3306) completed successfully.
    Tue Nov  7 15:56:21 2017 - [info]
    Tue Nov  7 15:56:21 2017 - [info] Sending mail..
    
    

    1.4 slave(候选master)上面有大事务在跑

    • 1000s的大查询
    无影响,正常切换
    
    
    • flush tables with readlock
    
    无影响,正常切换
    
    

    1.5 binlog server 不同场景的测试

    dead_master上的最后部分日志没有传递到slave和etl的情况, 然而slave的日志也落后etl (这是最严苛的情况)

    • binlog server 写3台
    masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host=host_1  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
    
    
    Tue Nov  7 15:56:17 2017 - [info] Log messages from host_1 ...
    Tue Nov  7 15:56:17 2017 - [info]
    Tue Nov  7 15:56:16 2017 - [info] Fetching binary logs from binlog server host_1..
    Tue Nov  7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=host_1.000049  --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
    Tue Nov  7 15:56:17 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
    Tue Nov  7 15:56:17 2017 - [info] End of log messages from host_1.
    Tue Nov  7 15:56:17 2017 - [warning] Got error from host_1.
    Tue Nov  7 15:56:17 2017 - [info]
    Tue Nov  7 15:56:17 2017 - [info] Log messages from host_3 ...
    Tue Nov  7 15:56:17 2017 - [info]
    Tue Nov  7 15:56:16 2017 - [info] Fetching binary logs from binlog server host_3..
    Tue Nov  7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=host_1.000049  --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
    Tue Nov  7 15:56:17 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
    Tue Nov  7 15:56:17 2017 - [info] End of log messages from host_3.
    Tue Nov  7 15:56:17 2017 - [warning] Got error from host_3.
    Tue Nov  7 15:56:17 2017 - [info]
    Tue Nov  7 15:56:17 2017 - [info] Log messages from host_2 ...
    Tue Nov  7 15:56:17 2017 - [info]
    Tue Nov  7 15:56:16 2017 - [info] Fetching binary logs from binlog server host_2..
    Tue Nov  7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=host_1.000049  --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
    Tue Nov  7 15:56:17 2017 - [info] scp from root@host_2:/var/log/masterha/mha_test/saved_binlog_binlog1_20171107155611.binlog to local:/var/log/masterha/mha_test/saved_binlog_host_2_binlog1_20171107155611.binlog succeeded.
    Tue Nov  7 15:56:17 2017 - [info] End of log messages from host_2.
    Tue Nov  7 15:56:17 2017 - [info] Saved mysqlbinlog size from host_2 is 768 bytes.
    Tue Nov  7 15:56:17 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_host_2_binlog1_20171107155611.binlog ..
    Tue Nov  7 15:56:17 2017 - [info] Differential log apply from binlog server succeeded.
    
    
    • binlog server 只写master
    masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host=host_2  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
    
    
    Thu Nov  9 11:20:04 2017 - [info] -- Saving binlog from host host_2 started, pid: 117389
    Thu Nov  9 11:20:05 2017 - [info]
    Thu Nov  9 11:20:05 2017 - [info] Log messages from host_2 ...
    Thu Nov  9 11:20:05 2017 - [info]
    Thu Nov  9 11:20:04 2017 - [info] Fetching binary logs from binlog server host_2..
    Thu Nov  9 11:20:04 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=host_1.000004  --start_pos=1115 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171109111957.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
    Thu Nov  9 11:20:05 2017 - [info] scp from root@host_2:/var/log/masterha/mha_test/saved_binlog_binlog1_20171109111957.binlog to local:/var/log/masterha/mha_test/saved_binlog_host_2_binlog1_20171109111957.binlog succeeded.
    Thu Nov  9 11:20:05 2017 - [info] End of log messages from host_2.
    Thu Nov  9 11:20:05 2017 - [info] Saved mysqlbinlog size from host_2 is 4444 bytes.
    Thu Nov  9 11:20:05 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_host_2_binlog1_20171109111957.binlog ..
    Thu Nov  9 11:20:05 2017 - [info] Differential log apply from binlog server succeeded.
    
    • binlog server 只写slave
    ### 3台服务器的GTID状态
    
    * master  host_1
    
    dba:lc> show master status;
    +---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
    | File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |
    +---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
    | tjtx-126-164.000055 |     6016 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446369 |
    +---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
    1 row in set (0.00 sec)
    
    * slave host_2
    
    
                Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-21,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446369
                    Auto_Position: 1
    
    
    * etl host_3
    
               Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:22-25,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446366-446369
                Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-25,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446369
                    Auto_Position: 1
    
    
    ### 切换日志
    
    masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host=host_1  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
    
    Thu Nov  9 15:00:09 2017 - [info] MHA::MasterFailover version 0.56.
    Thu Nov  9 15:00:09 2017 - [info] Starting master failover.
    Thu Nov  9 15:00:09 2017 - [info]
    Thu Nov  9 15:00:09 2017 - [info] * Phase 1: Configuration Check Phase..
    Thu Nov  9 15:00:09 2017 - [info]
    Thu Nov  9 15:00:09 2017 - [info] HealthCheck: SSH to host_2 is reachable.
    Thu Nov  9 15:00:09 2017 - [info] Binlog server host_2 is reachable.
    Thu Nov  9 15:00:10 2017 - [warning] SQL Thread is stopped(no error) on host_2(host_2:3306)
    Thu Nov  9 15:00:10 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)
    Thu Nov  9 15:00:10 2017 - [info] GTID failover mode = 1
    Thu Nov  9 15:00:10 2017 - [info] Dead Servers:
    Thu Nov  9 15:00:10 2017 - [info]   host_1(host_1:3306)
    Thu Nov  9 15:00:10 2017 - [info] Checking master reachability via MySQL(double check)...
    Thu Nov  9 15:00:10 2017 - [info]  ok.
    Thu Nov  9 15:00:10 2017 - [info] Alive Servers:
    Thu Nov  9 15:00:10 2017 - [info]   host_2(host_2:3306)
    Thu Nov  9 15:00:10 2017 - [info]   host_3(host_3:3306)
    Thu Nov  9 15:00:10 2017 - [info] Alive Slaves:
    Thu Nov  9 15:00:10 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Thu Nov  9 15:00:10 2017 - [info]     GTID ON
    Thu Nov  9 15:00:10 2017 - [info]     Replicating from host_1(host_1:3306)
    Thu Nov  9 15:00:10 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
    Thu Nov  9 15:00:10 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Thu Nov  9 15:00:10 2017 - [info]     GTID ON
    Thu Nov  9 15:00:10 2017 - [info]     Replicating from host_1(host_1:3306)
    Thu Nov  9 15:00:10 2017 - [info]     Not candidate for the new Master (no_master is set)
    Thu Nov  9 15:00:10 2017 - [info]  Starting SQL thread on host_2(host_2:3306) ..
    Thu Nov  9 15:00:10 2017 - [info]   done.
    Thu Nov  9 15:00:10 2017 - [info]  Starting SQL thread on host_3(host_3:3306) ..
    Thu Nov  9 15:00:10 2017 - [info]   done.
    Thu Nov  9 15:00:10 2017 - [info] Starting GTID based failover.
    Thu Nov  9 15:00:10 2017 - [info]
    Thu Nov  9 15:00:10 2017 - [info] ** Phase 1: Configuration Check Phase completed.
    Thu Nov  9 15:00:10 2017 - [info]
    Thu Nov  9 15:00:10 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
    Thu Nov  9 15:00:10 2017 - [info]
    Thu Nov  9 15:00:10 2017 - [info] HealthCheck: SSH to host_1 is reachable.
    Thu Nov  9 15:00:10 2017 - [info] Forcing shutdown so that applications never connect to the current master..
    Thu Nov  9 15:00:10 2017 - [info] Executing master IP deactivation script:
    Thu Nov  9 15:00:10 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --command=stopssh --ssh_user=root
    Thu Nov  9 15:00:17 2017 - [info]  done.
    Thu Nov  9 15:00:17 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
    Thu Nov  9 15:00:17 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
    Thu Nov  9 15:00:17 2017 - [info]
    Thu Nov  9 15:00:17 2017 - [info] * Phase 3: Master Recovery Phase..
    Thu Nov  9 15:00:17 2017 - [info]
    Thu Nov  9 15:00:17 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
    Thu Nov  9 15:00:17 2017 - [info]
    Thu Nov  9 15:00:17 2017 - [info] The latest binary log file/position on all slaves is tjtx-126-164.000055:4090
    Thu Nov  9 15:00:17 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:22-25,
    Thu Nov  9 15:00:17 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
    Thu Nov  9 15:00:17 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Thu Nov  9 15:00:17 2017 - [info]     GTID ON
    Thu Nov  9 15:00:17 2017 - [info]     Replicating from host_1(host_1:3306)
    Thu Nov  9 15:00:17 2017 - [info]     Not candidate for the new Master (no_master is set)
    Thu Nov  9 15:00:17 2017 - [info] The oldest binary log file/position on all slaves is tjtx-126-164.000055:2806
    Thu Nov  9 15:00:17 2017 - [info] Oldest slaves:
    Thu Nov  9 15:00:17 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Thu Nov  9 15:00:17 2017 - [info]     GTID ON
    Thu Nov  9 15:00:17 2017 - [info]     Replicating from host_1(host_1:3306)
    Thu Nov  9 15:00:17 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
    Thu Nov  9 15:00:17 2017 - [info]
    Thu Nov  9 15:00:17 2017 - [info] * Phase 3.3: Determining New Master Phase..
    Thu Nov  9 15:00:17 2017 - [info]
    Thu Nov  9 15:00:17 2017 - [info] Searching new master from slaves..
    Thu Nov  9 15:00:17 2017 - [info]  Candidate masters from the configuration file:
    Thu Nov  9 15:00:17 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Thu Nov  9 15:00:17 2017 - [info]     GTID ON
    Thu Nov  9 15:00:17 2017 - [info]     Replicating from host_1(host_1:3306)
    Thu Nov  9 15:00:17 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
    Thu Nov  9 15:00:17 2017 - [info]  Non-candidate masters:
    Thu Nov  9 15:00:17 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Thu Nov  9 15:00:17 2017 - [info]     GTID ON
    Thu Nov  9 15:00:17 2017 - [info]     Replicating from host_1(host_1:3306)
    Thu Nov  9 15:00:17 2017 - [info]     Not candidate for the new Master (no_master is set)
    Thu Nov  9 15:00:17 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
    Thu Nov  9 15:00:17 2017 - [info]   Not found.
    Thu Nov  9 15:00:17 2017 - [info]  Searching from all candidate_master slaves..
    Thu Nov  9 15:00:17 2017 - [info] New master is host_2(host_2:3306)
    Thu Nov  9 15:00:17 2017 - [info] Starting master failover..
    Thu Nov  9 15:00:17 2017 - [info]
    Thu Nov  9 15:00:17 2017 - [info]
    Thu Nov  9 15:00:17 2017 - [info] * Phase 3.3: New Master Recovery Phase..
    Thu Nov  9 15:00:17 2017 - [info]
    Thu Nov  9 15:00:17 2017 - [info]  Waiting all logs to be applied..
    Thu Nov  9 15:00:17 2017 - [info]   done.
    Thu Nov  9 15:00:17 2017 - [info]  Replicating from the latest slave host_3(host_3:3306) and waiting to apply..
    Thu Nov  9 15:00:17 2017 - [info]  Waiting all logs to be applied on the latest slave..
    Thu Nov  9 15:00:17 2017 - [info]  Resetting slave host_2(host_2:3306) and starting replication from the new master host_3(host_3:3306)..
    Thu Nov  9 15:00:17 2017 - [info]  Executed CHANGE MASTER.
    Thu Nov  9 15:00:18 2017 - [info]  Slave started.
    Thu Nov  9 15:00:18 2017 - [info]  Waiting to execute all relay logs on host_2(host_2:3306)..
    Thu Nov  9 15:00:18 2017 - [info]  master_pos_wait(host_3.000049:25843) completed on host_2(host_2:3306). Executed 0 events.
    Thu Nov  9 15:00:18 2017 - [info]   done.
    Thu Nov  9 15:00:18 2017 - [info]   done.
    Thu Nov  9 15:00:18 2017 - [info] -- Saving binlog from host host_2 started, pid: 175683
    Thu Nov  9 15:00:18 2017 - [info]
    Thu Nov  9 15:00:18 2017 - [info] Log messages from host_2 ...
    Thu Nov  9 15:00:18 2017 - [info]
    Thu Nov  9 15:00:18 2017 - [info] Fetching binary logs from binlog server host_2..
    Thu Nov  9 15:00:18 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=tjtx-126-164.000055  --start_pos=4090 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171109150009.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
    Thu Nov  9 15:00:18 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
    Thu Nov  9 15:00:18 2017 - [info] End of log messages from host_2.
    Thu Nov  9 15:00:18 2017 - [warning] Got error from host_2.
    Thu Nov  9 15:00:18 2017 - [info] Getting new master's binlog name and position..
    Thu Nov  9 15:00:18 2017 - [info]  host_1.000005:1390
    Thu Nov  9 15:00:18 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_2', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
    Thu Nov  9 15:00:18 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: host_1.000005, 1390, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-25,
    Thu Nov  9 15:00:18 2017 - [info] Executing master IP activate script:
    Thu Nov  9 15:00:18 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --new_master_host=host_2 --new_master_ip=host_2 --new_master_port=3306 --new_master_user='xxx' --new_master_password='xxx'
    Thu Nov  9 15:00:22 2017 - [info]  OK.
    Thu Nov  9 15:00:22 2017 - [info] Setting read_only=0 on host_2(host_2:3306)..
    Thu Nov  9 15:00:22 2017 - [info]  ok.
    Thu Nov  9 15:00:22 2017 - [info] ** Finished master recovery successfully.
    Thu Nov  9 15:00:22 2017 - [info] * Phase 3: Master Recovery Phase completed.
    Thu Nov  9 15:00:22 2017 - [info]
    Thu Nov  9 15:00:22 2017 - [info] * Phase 4: Slaves Recovery Phase..
    Thu Nov  9 15:00:22 2017 - [info]
    Thu Nov  9 15:00:22 2017 - [info]
    Thu Nov  9 15:00:22 2017 - [info] * Phase 4.1: Starting Slaves in parallel..
    Thu Nov  9 15:00:22 2017 - [info]
    Thu Nov  9 15:00:22 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 180681. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171109150009.log if it takes time..
    Thu Nov  9 15:00:23 2017 - [info]
    Thu Nov  9 15:00:23 2017 - [info] Log messages from host_3 ...
    Thu Nov  9 15:00:23 2017 - [info]
    Thu Nov  9 15:00:22 2017 - [info]  Resetting slave host_3(host_3:3306) and starting replication from the new master host_2(host_2:3306)..
    Thu Nov  9 15:00:22 2017 - [info]  Executed CHANGE MASTER.
    Thu Nov  9 15:00:23 2017 - [info]  Slave started.
    Thu Nov  9 15:00:23 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-25,
    Thu Nov  9 15:00:23 2017 - [info] End of log messages from host_3.
    Thu Nov  9 15:00:23 2017 - [info] -- Slave on host host_3(host_3:3306) started.
    Thu Nov  9 15:00:23 2017 - [info] All new slave servers recovered successfully.
    Thu Nov  9 15:00:23 2017 - [info]
    Thu Nov  9 15:00:23 2017 - [info] * Phase 5: New master cleanup phase..
    Thu Nov  9 15:00:23 2017 - [info]
    Thu Nov  9 15:00:23 2017 - [info] Resetting slave info on the new master..
    Thu Nov  9 15:00:23 2017 - [info]  host_2: Resetting slave info succeeded.
    Thu Nov  9 15:00:23 2017 - [info] Master failover to host_2(host_2:3306) completed successfully.
    Thu Nov  9 15:00:23 2017 - [info]
    Thu Nov  9 15:00:23 2017 - [info] Sending mail..
    
    
    
    

    结论: 由于binlog server没有配置master,所以会丢失master没有传递过来的事务日志
    好在,slave和etl之间会互相change master,所以尽管slave(candidate master)的日志落后,最终也还是用etl的日志补齐了slave缺失的日志。

    • binlog server 啥都不写
    ### 3台DB的GTID状态
    
    * master host_2
    
    dba:lc> show master status;
    +---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
    | File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |
    +---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
    | host_1.000005 |     5785 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446378 |
    +---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
    1 row in set (0.00 sec)
    
    
    * slave host_1
    
               Retrieved_Gtid_Set:
                Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446369
                    Auto_Position: 1
    
    * etl host_3
    
    
               Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:26-31,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446370-446372
                Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446372
                    Auto_Position: 1
    
    
    ### 切换日志
    
    Thu Nov  9 16:22:41 2017 - [info] MHA::MasterFailover version 0.56.
    Thu Nov  9 16:22:41 2017 - [info] Starting master failover.
    Thu Nov  9 16:22:41 2017 - [info]
    Thu Nov  9 16:22:41 2017 - [info] * Phase 1: Configuration Check Phase..
    Thu Nov  9 16:22:41 2017 - [info]
    Thu Nov  9 16:22:41 2017 - [warning] SQL Thread is stopped(no error) on host_1(host_1:3306)
    Thu Nov  9 16:22:41 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)
    Thu Nov  9 16:22:41 2017 - [info] GTID failover mode = 1
    Thu Nov  9 16:22:41 2017 - [info] Dead Servers:
    Thu Nov  9 16:22:41 2017 - [info]   host_2(host_2:3306)
    Thu Nov  9 16:22:41 2017 - [info] Checking master reachability via MySQL(double check)...
    Thu Nov  9 16:22:41 2017 - [info]  ok.
    Thu Nov  9 16:22:41 2017 - [info] Alive Servers:
    Thu Nov  9 16:22:41 2017 - [info]   host_1(host_1:3306)
    Thu Nov  9 16:22:41 2017 - [info]   host_3(host_3:3306)
    Thu Nov  9 16:22:41 2017 - [info] Alive Slaves:
    Thu Nov  9 16:22:41 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Thu Nov  9 16:22:41 2017 - [info]     GTID ON
    Thu Nov  9 16:22:41 2017 - [info]     Replicating from host_2(host_2:3306)
    Thu Nov  9 16:22:41 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
    Thu Nov  9 16:22:41 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Thu Nov  9 16:22:41 2017 - [info]     GTID ON
    Thu Nov  9 16:22:41 2017 - [info]     Replicating from host_2(host_2:3306)
    Thu Nov  9 16:22:41 2017 - [info]     Not candidate for the new Master (no_master is set)
    Thu Nov  9 16:22:41 2017 - [info]  Starting SQL thread on host_1(host_1:3306) ..
    Thu Nov  9 16:22:41 2017 - [info]   done.
    Thu Nov  9 16:22:41 2017 - [info]  Starting SQL thread on host_3(host_3:3306) ..
    Thu Nov  9 16:22:41 2017 - [info]   done.
    Thu Nov  9 16:22:41 2017 - [info] Starting GTID based failover.
    Thu Nov  9 16:22:41 2017 - [info]
    Thu Nov  9 16:22:41 2017 - [info] ** Phase 1: Configuration Check Phase completed.
    Thu Nov  9 16:22:41 2017 - [info]
    Thu Nov  9 16:22:41 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
    Thu Nov  9 16:22:41 2017 - [info]
    Thu Nov  9 16:22:42 2017 - [info] HealthCheck: SSH to host_2 is reachable.
    Thu Nov  9 16:22:42 2017 - [info] Forcing shutdown so that applications never connect to the current master..
    Thu Nov  9 16:22:42 2017 - [info] Executing master IP deactivation script:
    Thu Nov  9 16:22:42 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --command=stopssh --ssh_user=root
    ===================    swift vip :  vip from host_2 is deleted  ==============================
    
    
    --2017-11-09 16:22:42--  http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi
    正在连接 tgw_server:80... 已连接。
    已发出 HTTP 请求,正在等待回应... 200 OK
    长度:未指定 [text/html]
    正在保存至: “STDOUT”
    
         0K                                                        9.79M=0s
    
    2017-11-09 16:22:44 (9.79 MB/s) - 已写入标准输出 [38]
    
    Thu Nov  9 16:22:44 2017 - [info]  done.
    Thu Nov  9 16:22:44 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
    Thu Nov  9 16:22:44 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
    Thu Nov  9 16:22:44 2017 - [info]
    Thu Nov  9 16:22:44 2017 - [info] * Phase 3: Master Recovery Phase..
    Thu Nov  9 16:22:44 2017 - [info]
    Thu Nov  9 16:22:44 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
    Thu Nov  9 16:22:44 2017 - [info]
    Thu Nov  9 16:22:44 2017 - [info] The latest binary log file/position on all slaves is host_1.000005:4015
    Thu Nov  9 16:22:44 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:26-31,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446370-446372
    Thu Nov  9 16:22:44 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
    Thu Nov  9 16:22:44 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Thu Nov  9 16:22:44 2017 - [info]     GTID ON
    Thu Nov  9 16:22:44 2017 - [info]     Replicating from host_2(host_2:3306)
    Thu Nov  9 16:22:44 2017 - [info]     Not candidate for the new Master (no_master is set)
    Thu Nov  9 16:22:44 2017 - [info] The oldest binary log file/position on all slaves is host_1.000005:3130
    Thu Nov  9 16:22:44 2017 - [info] Oldest slaves:
    Thu Nov  9 16:22:44 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Thu Nov  9 16:22:44 2017 - [info]     GTID ON
    Thu Nov  9 16:22:44 2017 - [info]     Replicating from host_2(host_2:3306)
    Thu Nov  9 16:22:44 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
    Thu Nov  9 16:22:44 2017 - [info]
    Thu Nov  9 16:22:44 2017 - [info] * Phase 3.3: Determining New Master Phase..
    Thu Nov  9 16:22:44 2017 - [info]
    Thu Nov  9 16:22:44 2017 - [info] Searching new master from slaves..
    Thu Nov  9 16:22:44 2017 - [info]  Candidate masters from the configuration file:
    Thu Nov  9 16:22:44 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Thu Nov  9 16:22:44 2017 - [info]     GTID ON
    Thu Nov  9 16:22:44 2017 - [info]     Replicating from host_2(host_2:3306)
    Thu Nov  9 16:22:44 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
    Thu Nov  9 16:22:44 2017 - [info]  Non-candidate masters:
    Thu Nov  9 16:22:44 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Thu Nov  9 16:22:44 2017 - [info]     GTID ON
    Thu Nov  9 16:22:44 2017 - [info]     Replicating from host_2(host_2:3306)
    Thu Nov  9 16:22:44 2017 - [info]     Not candidate for the new Master (no_master is set)
    Thu Nov  9 16:22:44 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
    Thu Nov  9 16:22:44 2017 - [info]   Not found.
    Thu Nov  9 16:22:44 2017 - [info]  Searching from all candidate_master slaves..
    Thu Nov  9 16:22:44 2017 - [info] New master is host_1(host_1:3306)
    Thu Nov  9 16:22:44 2017 - [info] Starting master failover..
    Thu Nov  9 16:22:44 2017 - [info]
    From:
    host_2(host_2:3306) (current master)
     +--host_1(host_1:3306)
     +--host_3(host_3:3306)
    
    To:
    host_1(host_1:3306) (new master)
     +--host_3(host_3:3306)
    Thu Nov  9 16:22:44 2017 - [info]
    Thu Nov  9 16:22:44 2017 - [info] * Phase 3.3: New Master Recovery Phase..
    Thu Nov  9 16:22:44 2017 - [info]
    Thu Nov  9 16:22:44 2017 - [info]  Waiting all logs to be applied..
    Thu Nov  9 16:22:44 2017 - [info]   done.
    Thu Nov  9 16:22:44 2017 - [info]  Replicating from the latest slave host_3(host_3:3306) and waiting to apply..
    Thu Nov  9 16:22:44 2017 - [info]  Waiting all logs to be applied on the latest slave..
    Thu Nov  9 16:22:44 2017 - [info]  Resetting slave host_1(host_1:3306) and starting replication from the new master host_3(host_3:3306)..
    Thu Nov  9 16:22:44 2017 - [info]  Executed CHANGE MASTER.
    Thu Nov  9 16:22:45 2017 - [info]  Slave started.
    Thu Nov  9 16:22:45 2017 - [info]  Waiting to execute all relay logs on host_1(host_1:3306)..
    Thu Nov  9 16:22:45 2017 - [info]  master_pos_wait(host_3.000049:28663) completed on host_1(host_1:3306). Executed 0 events.
    Thu Nov  9 16:22:45 2017 - [info]   done.
    Thu Nov  9 16:22:45 2017 - [info]   done.
    Thu Nov  9 16:22:45 2017 - [info] Getting new master's binlog name and position..
    Thu Nov  9 16:22:45 2017 - [info]  tjtx-126-164.000056:1170
    Thu Nov  9 16:22:45 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_1', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
    Thu Nov  9 16:22:45 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: tjtx-126-164.000056, 1170, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446372
    Thu Nov  9 16:22:45 2017 - [info] Executing master IP activate script:
    Thu Nov  9 16:22:45 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --new_master_host=host_1 --new_master_ip=host_1 --new_master_port=3306 --new_master_user='xxx' --new_master_password='xxx'
    Unknown option: new_master_user
    Unknown option: new_master_password
    ===================    swift vip :  vip to host_1  is added  ==============================
    
    
    Thu Nov  9 16:22:47 2017 - [info]  OK.
    Thu Nov  9 16:22:47 2017 - [info] Setting read_only=0 on host_1(host_1:3306)..
    Thu Nov  9 16:22:47 2017 - [info]  ok.
    Thu Nov  9 16:22:47 2017 - [info] ** Finished master recovery successfully.
    Thu Nov  9 16:22:47 2017 - [info] * Phase 3: Master Recovery Phase completed.
    Thu Nov  9 16:22:47 2017 - [info]
    Thu Nov  9 16:22:47 2017 - [info] * Phase 4: Slaves Recovery Phase..
    Thu Nov  9 16:22:47 2017 - [info]
    Thu Nov  9 16:22:47 2017 - [info]
    Thu Nov  9 16:22:47 2017 - [info] * Phase 4.1: Starting Slaves in parallel..
    Thu Nov  9 16:22:47 2017 - [info]
    Thu Nov  9 16:22:47 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 112317. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171109162241.log if it takes time..
    Thu Nov  9 16:22:48 2017 - [info]
    Thu Nov  9 16:22:48 2017 - [info] Log messages from host_3 ...
    Thu Nov  9 16:22:48 2017 - [info]
    Thu Nov  9 16:22:47 2017 - [info]  Resetting slave host_3(host_3:3306) and starting replication from the new master host_1(host_1:3306)..
    Thu Nov  9 16:22:47 2017 - [info]  Executed CHANGE MASTER.
    Thu Nov  9 16:22:48 2017 - [info]  Slave started.
    Thu Nov  9 16:22:48 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446372) completed on host_3(host_3:3306). Executed 0 events.
    Thu Nov  9 16:22:48 2017 - [info] End of log messages from host_3.
    Thu Nov  9 16:22:48 2017 - [info] -- Slave on host host_3(host_3:3306) started.
    Thu Nov  9 16:22:48 2017 - [info] All new slave servers recovered successfully.
    Thu Nov  9 16:22:48 2017 - [info]
    Thu Nov  9 16:22:48 2017 - [info] * Phase 5: New master cleanup phase..
    Thu Nov  9 16:22:48 2017 - [info]
    Thu Nov  9 16:22:48 2017 - [info] Resetting slave info on the new master..
    Thu Nov  9 16:22:49 2017 - [info]  host_1: Resetting slave info succeeded.
    Thu Nov  9 16:22:49 2017 - [info] Master failover to host_1(host_1:3306) completed successfully.
    Thu Nov  9 16:22:49 2017 - [info]
    
    ----- Failover Report -----
    
    bak_mha_test: MySQL Master failover host_2(host_2:3306) to host_1(host_1:3306) succeeded
    
    Master host_2(host_2:3306) is down!
    
    Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.
    
    Started automated(non-interactive) failover.
    Invalidated master IP address on host_2(host_2:3306)
    Selected host_1(host_1:3306) as a new master.
    host_1(host_1:3306): OK: Applying all logs succeeded.
    host_1(host_1:3306): OK: Activated master IP address.
    host_3(host_3:3306): OK: Slave started, replicating from host_1(host_1:3306)
    host_1(host_1:3306): Resetting slave info succeeded.
    Master failover to host_1(host_1:3306) completed successfully.
    
    
    

    结论: 由于binlog server没有配置master,所以会丢失master没有传递过来的事务日志
    好在,slave和etl之间会互相change master,所以尽管slave(candidate master)的日志落后,最终也还是用etl的日志补齐了slave缺失的日志。

    1.6 如果MHA过程中失败,是否可以重新执行MHA的failover呢?

    • 99%的场景都是可以重新执行的

    • 1%的场景不能再次执行,执行会报错

    一般这种场景就是:已经failover到最后的change master阶段,这样主从结构已经变更,MHA无法重新走一遍。
    不过,即便到这步骤失败了,表示master的日志已经补完,由于是gtid模式,自己再让slave change master到最新的master即可,最后ACTIVE new ip和readonly=1就好了

    Thu Nov  9 16:49:39 2017 - [info] MHA::MasterFailover version 0.56.
    Thu Nov  9 16:49:39 2017 - [info] Starting master failover.
    Thu Nov  9 16:49:39 2017 - [info]
    Thu Nov  9 16:49:39 2017 - [info] * Phase 1: Configuration Check Phase..
    Thu Nov  9 16:49:39 2017 - [info]
    Thu Nov  9 16:49:39 2017 - [info] GTID failover mode = 1
    Thu Nov  9 16:49:39 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln169] Detected dead master host_1(host_1:3306) does not match with specified dead master host_2(host_2:3306)!
    Thu Nov  9 16:49:39 2017 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR:  at /usr/bin/masterha_master_switch line 53
    
    

    1.7 Master:MySQL down小结

    1. failover最终命令
    
    masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host=host_2  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
    
    
    2. binlog server建议
    
    配置master就可以了
    [binlog1]
    $master_ip
    
    只配置slave,或者没有配置,会导致丢失部分没有从master传递过来的日志事务
    

    二、Master : Server down

    2.1 etl 延迟8小时

     同1.1 结论
    

    2.2 slave(候选master)比etl还要落后更多

    • 2.2.1 当master的部分日志还没传递两个slave,这时候master server挂了
    ### 3台DB的GTID状态
    
    * master host_2
    
    dba:lc> show master status;
    +---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
    | File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |
    +---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
    | host_1.000008 |     5445 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392 |
    +---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
    1 row in set (0.00 sec)
    
    * slave host_1
    
               Retrieved_Gtid_Set:
                Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446385
                    Auto_Position: 1
    
    
    
    * etl host_3
    
               Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:46-50,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446386-446388
                Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446388
                    Auto_Position: 1
    
    
    
    
    ### 模拟故障场景
    
    * 隔离master的网络,让其等同于down机
    
    master> iptables -A INPUT -p tcp -s monitor_ip --dport 22 -j ACCEPT
    master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
    
    ### 切换日志
    
    
    masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host=host_2  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
    
    
    Fri Nov 10 11:12:38 2017 - [info] MHA::MasterFailover version 0.56.
    Fri Nov 10 11:12:38 2017 - [info] Starting master failover.
    Fri Nov 10 11:12:38 2017 - [info]
    Fri Nov 10 11:12:38 2017 - [info] * Phase 1: Configuration Check Phase..
    Fri Nov 10 11:12:38 2017 - [info]
    Fri Nov 10 11:13:28 2017 - [warning] HealthCheck: Got timeout on checking SSH connection to host_2! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
    Fri Nov 10 11:13:28 2017 - [warning] Failed to SSH to binlog server host_2
    Fri Nov 10 11:13:29 2017 - [info] HealthCheck: SSH to host_1 is reachable.
    Fri Nov 10 11:13:29 2017 - [info] Binlog server host_1 is reachable.
    Fri Nov 10 11:13:29 2017 - [info] HealthCheck: SSH to host_3 is reachable.
    Fri Nov 10 11:13:29 2017 - [info] Binlog server host_3 is reachable.
    Fri Nov 10 11:13:29 2017 - [warning] SQL Thread is stopped(no error) on host_1(host_1:3306)
    Fri Nov 10 11:13:29 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)
    Fri Nov 10 11:13:29 2017 - [info] GTID failover mode = 1
    Fri Nov 10 11:13:29 2017 - [info] Dead Servers:
    Fri Nov 10 11:13:29 2017 - [info]   host_2(host_2:3306)
    Fri Nov 10 11:13:29 2017 - [info] Checking master reachability via MySQL(double check)...
    Fri Nov 10 11:13:30 2017 - [info]  ok.
    Fri Nov 10 11:13:30 2017 - [info] Alive Servers:
    Fri Nov 10 11:13:30 2017 - [info]   host_1(host_1:3306)
    Fri Nov 10 11:13:30 2017 - [info]   host_3(host_3:3306)
    Fri Nov 10 11:13:30 2017 - [info] Alive Slaves:
    Fri Nov 10 11:13:30 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Fri Nov 10 11:13:30 2017 - [info]     GTID ON
    Fri Nov 10 11:13:30 2017 - [info]     Replicating from host_2(host_2:3306)
    Fri Nov 10 11:13:30 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
    Fri Nov 10 11:13:30 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Fri Nov 10 11:13:30 2017 - [info]     GTID ON
    Fri Nov 10 11:13:30 2017 - [info]     Replicating from host_2(host_2:3306)
    Fri Nov 10 11:13:30 2017 - [info]     Not candidate for the new Master (no_master is set)
    Fri Nov 10 11:13:30 2017 - [info]  Starting SQL thread on host_1(host_1:3306) ..
    Fri Nov 10 11:13:30 2017 - [info]   done.
    Fri Nov 10 11:13:30 2017 - [info]  Starting SQL thread on host_3(host_3:3306) ..
    Fri Nov 10 11:13:30 2017 - [info]   done.
    Fri Nov 10 11:13:30 2017 - [info] Starting GTID based failover.
    Fri Nov 10 11:13:30 2017 - [info]
    Fri Nov 10 11:13:30 2017 - [info] ** Phase 1: Configuration Check Phase completed.
    Fri Nov 10 11:13:30 2017 - [info]
    Fri Nov 10 11:13:30 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
    Fri Nov 10 11:13:30 2017 - [info]
    Fri Nov 10 11:14:20 2017 - [warning] HealthCheck: Got timeout on checking SSH connection to host_2! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
    Fri Nov 10 11:14:20 2017 - [info] Forcing shutdown so that applications never connect to the current master..
    Fri Nov 10 11:14:20 2017 - [info] Executing master IP deactivation script:
    Fri Nov 10 11:14:20 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --command=stop
    ssh: connect to host host_2 port 22: Connection timed out
    ===================    swift vip :  vip from host_2 is deleted  ==============================
    
    
    --2017-11-10 11:14:27--  http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi
    正在连接 tgw_server:80... 已连接。
    已发出 HTTP 请求,正在等待回应... 200 OK
    长度:未指定 [text/html]
    正在保存至: “STDOUT”
    
         0K                                                        11.4M=0s
    
    2017-11-10 11:16:27 (11.4 MB/s) - 已写入标准输出 [38]
    
    Fri Nov 10 11:16:27 2017 - [info]  done.
    Fri Nov 10 11:16:27 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
    Fri Nov 10 11:16:27 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
    Fri Nov 10 11:16:27 2017 - [info]
    Fri Nov 10 11:16:27 2017 - [info] * Phase 3: Master Recovery Phase..
    Fri Nov 10 11:16:27 2017 - [info]
    Fri Nov 10 11:16:27 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
    Fri Nov 10 11:16:27 2017 - [info]
    Fri Nov 10 11:16:27 2017 - [info] The latest binary log file/position on all slaves is host_1.000008:4265
    Fri Nov 10 11:16:27 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:46-50,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446386-446388
    Fri Nov 10 11:16:27 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
    Fri Nov 10 11:16:27 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Fri Nov 10 11:16:27 2017 - [info]     GTID ON
    Fri Nov 10 11:16:27 2017 - [info]     Replicating from host_2(host_2:3306)
    Fri Nov 10 11:16:27 2017 - [info]     Not candidate for the new Master (no_master is set)
    Fri Nov 10 11:16:27 2017 - [info] The oldest binary log file/position on all slaves is host_1.000008:3380
    Fri Nov 10 11:16:27 2017 - [info] Oldest slaves:
    Fri Nov 10 11:16:27 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Fri Nov 10 11:16:27 2017 - [info]     GTID ON
    Fri Nov 10 11:16:27 2017 - [info]     Replicating from host_2(host_2:3306)
    Fri Nov 10 11:16:27 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
    Fri Nov 10 11:16:27 2017 - [info]
    Fri Nov 10 11:16:27 2017 - [info] * Phase 3.3: Determining New Master Phase..
    Fri Nov 10 11:16:27 2017 - [info]
    Fri Nov 10 11:16:27 2017 - [info] Searching new master from slaves..
    Fri Nov 10 11:16:27 2017 - [info]  Candidate masters from the configuration file:
    Fri Nov 10 11:16:27 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Fri Nov 10 11:16:27 2017 - [info]     GTID ON
    Fri Nov 10 11:16:27 2017 - [info]     Replicating from host_2(host_2:3306)
    Fri Nov 10 11:16:27 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
    Fri Nov 10 11:16:27 2017 - [info]  Non-candidate masters:
    Fri Nov 10 11:16:27 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Fri Nov 10 11:16:27 2017 - [info]     GTID ON
    Fri Nov 10 11:16:27 2017 - [info]     Replicating from host_2(host_2:3306)
    Fri Nov 10 11:16:27 2017 - [info]     Not candidate for the new Master (no_master is set)
    Fri Nov 10 11:16:27 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
    Fri Nov 10 11:16:27 2017 - [info]   Not found.
    Fri Nov 10 11:16:27 2017 - [info]  Searching from all candidate_master slaves..
    Fri Nov 10 11:16:27 2017 - [info] New master is host_1(host_1:3306)
    Fri Nov 10 11:16:27 2017 - [info] Starting master failover..
    Fri Nov 10 11:16:27 2017 - [info]
    From:
    host_2(host_2:3306) (current master)
     +--host_1(host_1:3306)
     +--host_3(host_3:3306)
    
    To:
    host_1(host_1:3306) (new master)
     +--host_3(host_3:3306)
    Fri Nov 10 11:16:27 2017 - [info]
    Fri Nov 10 11:16:27 2017 - [info] * Phase 3.3: New Master Recovery Phase..
    Fri Nov 10 11:16:27 2017 - [info]
    Fri Nov 10 11:16:27 2017 - [info]  Waiting all logs to be applied..
    Fri Nov 10 11:16:27 2017 - [info]   done.
    Fri Nov 10 11:16:27 2017 - [info]  Replicating from the latest slave host_3(host_3:3306) and waiting to apply..
    Fri Nov 10 11:16:27 2017 - [info]  Waiting all logs to be applied on the latest slave..
    Fri Nov 10 11:16:27 2017 - [info]  Resetting slave host_1(host_1:3306) and starting replication from the new master host_3(host_3:3306)..
    Fri Nov 10 11:16:27 2017 - [info]  Executed CHANGE MASTER.
    Fri Nov 10 11:16:28 2017 - [info]  Slave started.
    Fri Nov 10 11:16:28 2017 - [info]  Waiting to execute all relay logs on host_1(host_1:3306)..
    Fri Nov 10 11:16:28 2017 - [info]  master_pos_wait(host_3.000049:40136) completed on host_1(host_1:3306). Executed 0 events.
    Fri Nov 10 11:16:28 2017 - [info]   done.
    Fri Nov 10 11:16:28 2017 - [info]   done.
    Fri Nov 10 11:16:28 2017 - [info] -- Saving binlog from host host_2 started, pid: 43038
    Fri Nov 10 11:16:28 2017 - [info] -- Saving binlog from host host_1 started, pid: 43039
    Fri Nov 10 11:16:28 2017 - [info] -- Saving binlog from host host_3 started, pid: 43041
    Fri Nov 10 11:16:28 2017 - [info]
    Fri Nov 10 11:16:28 2017 - [info] Log messages from host_2 ...
    Fri Nov 10 11:16:28 2017 - [info] End of log messages from host_2.
    Fri Nov 10 11:16:28 2017 - [warning] SSH is not reachable on host_2. Skipping
    Fri Nov 10 11:16:28 2017 - [info]
    Fri Nov 10 11:16:28 2017 - [info] Log messages from host_1 ...
    Fri Nov 10 11:16:28 2017 - [info]
    Fri Nov 10 11:16:28 2017 - [info] Fetching binary logs from binlog server host_1..
    Fri Nov 10 11:16:28 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=host_1.000008  --start_pos=4265 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171110111238.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
    Failed to save binary log: Binlog not found from /data/mysql.bin! If you got this error at MHA Manager, please set "master_binlog_dir=/path/to/binlog_directory_of_the_master" correctly in the MHA Manager's configuration file and try again.
     at /usr/bin/save_binary_logs line 123
        eval {...} called at /usr/bin/save_binary_logs line 70
        main::main() called at /usr/bin/save_binary_logs line 66
    Fri Nov 10 11:16:28 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
    Fri Nov 10 11:16:28 2017 - [info] End of log messages from host_1.
    Fri Nov 10 11:16:28 2017 - [warning] Got error from host_1.
    Fri Nov 10 11:16:28 2017 - [info]
    Fri Nov 10 11:16:28 2017 - [info] Log messages from host_3 ...
    Fri Nov 10 11:16:28 2017 - [info]
    Fri Nov 10 11:16:28 2017 - [info] Fetching binary logs from binlog server host_3..
    Fri Nov 10 11:16:28 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=host_1.000008  --start_pos=4265 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171110111238.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
    Failed to save binary log: Binlog not found from /data/mysql.bin! If you got this error at MHA Manager, please set "master_binlog_dir=/path/to/binlog_directory_of_the_master" correctly in the MHA Manager's configuration file and try again.
     at /usr/bin/save_binary_logs line 123
        eval {...} called at /usr/bin/save_binary_logs line 70
        main::main() called at /usr/bin/save_binary_logs line 66
    Fri Nov 10 11:16:28 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
    Fri Nov 10 11:16:28 2017 - [info] End of log messages from host_3.
    Fri Nov 10 11:16:28 2017 - [warning] Got error from host_3.
    Fri Nov 10 11:16:28 2017 - [info] Getting new master's binlog name and position..
    Fri Nov 10 11:16:28 2017 - [info]  tjtx-126-164.000058:4059
    Fri Nov 10 11:16:28 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_1', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
    Fri Nov 10 11:16:28 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: tjtx-126-164.000058, 4059, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446388
    Fri Nov 10 11:16:28 2017 - [info] Executing master IP activate script:
    Fri Nov 10 11:16:28 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --new_master_host=host_1 --new_master_ip=host_1 --new_master_port=3306 --new_master_user='xxx' --new_master_password='xxx'
    Unknown option: new_master_user
    Unknown option: new_master_password
    ===================    swift vip :  vip to host_1  is added  ==============================
    
    
    Fri Nov 10 11:16:30 2017 - [info]  OK.
    Fri Nov 10 11:16:30 2017 - [info] ** Finished master recovery successfully.
    Fri Nov 10 11:16:30 2017 - [info] * Phase 3: Master Recovery Phase completed.
    Fri Nov 10 11:16:30 2017 - [info]
    Fri Nov 10 11:16:30 2017 - [info] * Phase 4: Slaves Recovery Phase..
    Fri Nov 10 11:16:30 2017 - [info]
    Fri Nov 10 11:16:30 2017 - [info]
    Fri Nov 10 11:16:30 2017 - [info] * Phase 4.1: Starting Slaves in parallel..
    Fri Nov 10 11:16:30 2017 - [info]
    Fri Nov 10 11:16:30 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 46878. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171110111238.log if it takes time..
    Fri Nov 10 11:16:31 2017 - [info]
    Fri Nov 10 11:16:31 2017 - [info] Log messages from host_3 ...
    Fri Nov 10 11:16:31 2017 - [info]
    Fri Nov 10 11:16:30 2017 - [info]  Resetting slave host_3(host_3:3306) and starting replication from the new master host_1(host_1:3306)..
    Fri Nov 10 11:16:30 2017 - [info]  Executed CHANGE MASTER.
    Fri Nov 10 11:16:31 2017 - [info]  Slave started.
    Fri Nov 10 11:16:31 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446388) completed on host_3(host_3:3306). Executed 0 events.
    Fri Nov 10 11:16:31 2017 - [info] End of log messages from host_3.
    Fri Nov 10 11:16:31 2017 - [info] -- Slave on host host_3(host_3:3306) started.
    Fri Nov 10 11:16:31 2017 - [info] All new slave servers recovered successfully.
    Fri Nov 10 11:16:31 2017 - [info]
    Fri Nov 10 11:16:31 2017 - [info] * Phase 5: New master cleanup phase..
    Fri Nov 10 11:16:31 2017 - [info]
    Fri Nov 10 11:16:31 2017 - [info] Resetting slave info on the new master..
    Fri Nov 10 11:16:31 2017 - [info]  host_1: Resetting slave info succeeded.
    Fri Nov 10 11:16:31 2017 - [info] Master failover to host_1(host_1:3306) completed successfully.
    Fri Nov 10 11:16:31 2017 - [info]
    
    ----- Failover Report -----
    
    bak_mha_test: MySQL Master failover host_2(host_2:3306) to host_1(host_1:3306) succeeded
    
    Master host_2(host_2:3306) is down!
    
    Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.
    
    Started automated(non-interactive) failover.
    Invalidated master IP address on host_2(host_2:3306)
    Selected host_1(host_1:3306) as a new master.
    host_1(host_1:3306): OK: Applying all logs succeeded.
    host_1(host_1:3306): OK: Activated master IP address.
    host_3(host_3:3306): OK: Slave started, replicating from host_1(host_1:3306)
    host_1(host_1:3306): Resetting slave info succeeded.
    Master failover to host_1(host_1:3306) completed successfully.
    Fri Nov 10 11:16:31 2017 - [info] Sending mail..
    
    
    ### 最后一步很重要
    
    如果dead master之后又活过来了,那么这一步要做
    
    dead_master> /usr/local/realserver/RS_TUNL0/etc/setup_rs.sh -c
    
    http://gitlab.corp.anjuke.com/_dba/architecture/blob/master/personal/Keithlan/other/share/tools/always_used_command.md  ==》 tgw章节详细描述
    
    
    

    结论: 由于master 已挂,然而最后的日志没有传递到其他服务器,所以会丢失master没有传递过来的事务日志
    好在,slave和etl之间会互相change master,所以尽管slave(candidate master)的日志落后,最终也还是用etl的日志补齐了slave缺失的日志。

    • 2.2.2 当master的所有日志已经传递到1个etl,这时候master server挂了
    测试省略,和2.2.1基本一样
    
    

    结论:由于master上的所有日志全部传递到etl,所以最后是不会丢失master上任何数据的。

    2.3 slave(候选master)的日志是最新的,比etl要多

    • 2.3.1 当master的部分日志还没传递两个slave,这时候master server挂了
    测试省略,和2.2.1基本一样
    
    

    结论: 由于master 已挂,然而最后的日志没有传递到其他服务器,所以会丢失master没有传递过来的事务日志
    好在,slave和etl之间会互相change master,所以尽管slave(candidate master)的日志落后,最终也还是用etl的日志补齐了slave缺失的日志。

    • 2.3.2 当master的所有日志已经传递slave,这时候master server挂了
    测试省略,和2.2.1基本一样
    
    

    结论:由于master上的所有日志全部传递到slave,所以最后是不会丢失master上任何数据的。

    2.4 slave(候选master)上面有大事务在跑

    • 1000s的大查询
    同1.4结论
    
    
    • flush tables with readlock
    
    同1.4结论
    

    2.5 binlog server 不同场景的测试

    dead_master上的最后部分日志没有传递到slave和etl的情况, 然而slave的日志也落后etl (这是最严苛的情况)

    • binlog server 写3台
    2.2.1 测试的就是这种情况,详细日志切换请看2.2.1
    
    

    结论: 由于binlog server配置了3台,但是由于master server已经挂掉,无法从master的binlog server上获取日志,所以会丢失master上没有传递的日志事务

    • binlog server 只写master
    
    ### 3台DB的gtid 状态
    
    * master host_1
    
    dba:lc> show master status;
    +---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
    | File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |
    +---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
    | tjtx-126-164.000058 |     8517 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-60,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392 |
    +---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
    1 row in set (0.00 sec)
    
    
    * slave host_2
    
               Retrieved_Gtid_Set:
                Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392
                    Auto_Position: 1
    
    
    * etl host_3
    
               Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:51-55,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446389-446392
                Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-55,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392
                    Auto_Position: 1
    
    
    
    ### 模拟故障
    
    master> iptables -A INPUT -p tcp -s monitor_ip --dport 22 -j ACCEPT
    master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
    
    
    ### 故障切换
    
    masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host=host_1  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
    
    Fri Nov 10 14:15:51 2017 - [info] MHA::MasterFailover version 0.56.
    Fri Nov 10 14:15:51 2017 - [info] Starting master failover.
    Fri Nov 10 14:15:51 2017 - [info]
    Fri Nov 10 14:15:51 2017 - [info] * Phase 1: Configuration Check Phase..
    Fri Nov 10 14:15:51 2017 - [info]
    Fri Nov 10 14:16:41 2017 - [warning] HealthCheck: Got timeout on checking SSH connection to host_1! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
    Fri Nov 10 14:16:41 2017 - [warning] Failed to SSH to binlog server host_1
    Fri Nov 10 14:16:41 2017 - [error][/usr/share/perl5/vendor_perl/MHA/ServerManager.pm, ln239] Binlog Server is defined but there is no alive server.
    Fri Nov 10 14:16:41 2017 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR:  at /usr/share/perl5/vendor_perl/MHA/MasterFailover.pm line 2082
    
    
    
    
    
    

    结论: binlog server 必须要配置一个活的 server,如果只配置master,如果master挂了,那么就等于一个都没有,MHA不会切换

    • binlog server 只写slave
    
    ### 3台DB的gtid 状态
    
    * master host_1
    
    dba:lc> show master status;
    +---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
    | File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |
    +---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
    | tjtx-126-164.000058 |     8517 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-60,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392 |
    +---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
    1 row in set (0.00 sec)
    
    
    * slave host_2
    
               Retrieved_Gtid_Set:
                Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392
                    Auto_Position: 1
    
    
    * etl host_3
    
               Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:51-55,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446389-446392
                Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-55,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392
                    Auto_Position: 1
    
    
    
    ### 模拟故障
    
    master> iptables -A INPUT -p tcp -s monitor_ip --dport 22 -j ACCEPT
    master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
    
    
    ### 故障切换
    
    Fri Nov 10 14:29:50 2017 - [info] MHA::MasterFailover version 0.56.
    Fri Nov 10 14:29:50 2017 - [info] Starting master failover.
    Fri Nov 10 14:29:50 2017 - [info]
    Fri Nov 10 14:29:50 2017 - [info] * Phase 1: Configuration Check Phase..
    Fri Nov 10 14:29:50 2017 - [info]
    Fri Nov 10 14:29:50 2017 - [info] HealthCheck: SSH to host_2 is reachable.
    Fri Nov 10 14:29:50 2017 - [info] Binlog server host_2 is reachable.
    Fri Nov 10 14:29:50 2017 - [info] HealthCheck: SSH to host_3 is reachable.
    Fri Nov 10 14:29:50 2017 - [info] Binlog server host_3 is reachable.
    Fri Nov 10 14:29:50 2017 - [warning] SQL Thread is stopped(no error) on host_2(host_2:3306)
    Fri Nov 10 14:29:50 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)
    Fri Nov 10 14:29:50 2017 - [info] GTID failover mode = 1
    Fri Nov 10 14:29:50 2017 - [info] Dead Servers:
    Fri Nov 10 14:29:50 2017 - [info]   host_1(host_1:3306)
    Fri Nov 10 14:29:50 2017 - [info] Checking master reachability via MySQL(double check)...
    Fri Nov 10 14:29:51 2017 - [info]  ok.
    Fri Nov 10 14:29:51 2017 - [info] Alive Servers:
    Fri Nov 10 14:29:51 2017 - [info]   host_2(host_2:3306)
    Fri Nov 10 14:29:51 2017 - [info]   host_3(host_3:3306)
    Fri Nov 10 14:29:51 2017 - [info] Alive Slaves:
    Fri Nov 10 14:29:51 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Fri Nov 10 14:29:51 2017 - [info]     GTID ON
    Fri Nov 10 14:29:51 2017 - [info]     Replicating from host_1(host_1:3306)
    Fri Nov 10 14:29:51 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
    Fri Nov 10 14:29:51 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Fri Nov 10 14:29:51 2017 - [info]     GTID ON
    Fri Nov 10 14:29:51 2017 - [info]     Replicating from host_1(host_1:3306)
    Fri Nov 10 14:29:51 2017 - [info]     Not candidate for the new Master (no_master is set)
    Fri Nov 10 14:29:51 2017 - [info]  Starting SQL thread on host_2(host_2:3306) ..
    Fri Nov 10 14:29:51 2017 - [info]   done.
    Fri Nov 10 14:29:51 2017 - [info]  Starting SQL thread on host_3(host_3:3306) ..
    Fri Nov 10 14:29:52 2017 - [info]   done.
    Fri Nov 10 14:29:52 2017 - [info] Starting GTID based failover.
    Fri Nov 10 14:29:52 2017 - [info]
    Fri Nov 10 14:29:52 2017 - [info] ** Phase 1: Configuration Check Phase completed.
    Fri Nov 10 14:29:52 2017 - [info]
    Fri Nov 10 14:29:52 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
    Fri Nov 10 14:29:52 2017 - [info]
    Fri Nov 10 14:30:42 2017 - [warning] HealthCheck: Got timeout on checking SSH connection to host_1! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
    Fri Nov 10 14:30:42 2017 - [info] Forcing shutdown so that applications never connect to the current master..
    Fri Nov 10 14:30:42 2017 - [info] Executing master IP deactivation script:
    Fri Nov 10 14:30:42 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --command=stop
    ssh: connect to host host_1 port 22: Connection timed out
    ===================    swift vip :  vip from host_1 is deleted  ==============================
    
    
    --2017-11-10 14:30:49--  http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi
    正在连接 tgw_server:80... 已连接。
    已发出 HTTP 请求,正在等待回应... 200 OK
    长度:未指定 [text/html]
    正在保存至: “STDOUT”
    
         0K                                                        12.1M=0s
    
    2017-11-10 14:32:47 (12.1 MB/s) - 已写入标准输出 [38]
    
    Fri Nov 10 14:32:47 2017 - [info]  done.
    Fri Nov 10 14:32:47 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
    Fri Nov 10 14:32:47 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
    Fri Nov 10 14:32:47 2017 - [info]
    Fri Nov 10 14:32:47 2017 - [info] * Phase 3: Master Recovery Phase..
    Fri Nov 10 14:32:47 2017 - [info]
    Fri Nov 10 14:32:47 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
    Fri Nov 10 14:32:47 2017 - [info]
    Fri Nov 10 14:32:47 2017 - [info] The latest binary log file/position on all slaves is tjtx-126-164.000058:6912
    Fri Nov 10 14:32:47 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:51-55,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446389-446392
    Fri Nov 10 14:32:47 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
    Fri Nov 10 14:32:47 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Fri Nov 10 14:32:47 2017 - [info]     GTID ON
    Fri Nov 10 14:32:47 2017 - [info]     Replicating from host_1(host_1:3306)
    Fri Nov 10 14:32:47 2017 - [info]     Not candidate for the new Master (no_master is set)
    Fri Nov 10 14:32:47 2017 - [info] The oldest binary log file/position on all slaves is tjtx-126-164.000058:5307
    Fri Nov 10 14:32:47 2017 - [info] Oldest slaves:
    Fri Nov 10 14:32:47 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Fri Nov 10 14:32:47 2017 - [info]     GTID ON
    Fri Nov 10 14:32:47 2017 - [info]     Replicating from host_1(host_1:3306)
    Fri Nov 10 14:32:47 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
    Fri Nov 10 14:32:47 2017 - [info]
    Fri Nov 10 14:32:47 2017 - [info] * Phase 3.3: Determining New Master Phase..
    Fri Nov 10 14:32:47 2017 - [info]
    Fri Nov 10 14:32:47 2017 - [info] Searching new master from slaves..
    Fri Nov 10 14:32:47 2017 - [info]  Candidate masters from the configuration file:
    Fri Nov 10 14:32:47 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Fri Nov 10 14:32:47 2017 - [info]     GTID ON
    Fri Nov 10 14:32:47 2017 - [info]     Replicating from host_1(host_1:3306)
    Fri Nov 10 14:32:47 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
    Fri Nov 10 14:32:47 2017 - [info]  Non-candidate masters:
    Fri Nov 10 14:32:47 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
    Fri Nov 10 14:32:47 2017 - [info]     GTID ON
    Fri Nov 10 14:32:47 2017 - [info]     Replicating from host_1(host_1:3306)
    Fri Nov 10 14:32:47 2017 - [info]     Not candidate for the new Master (no_master is set)
    Fri Nov 10 14:32:47 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
    Fri Nov 10 14:32:47 2017 - [info]   Not found.
    Fri Nov 10 14:32:47 2017 - [info]  Searching from all candidate_master slaves..
    Fri Nov 10 14:32:47 2017 - [info] New master is host_2(host_2:3306)
    Fri Nov 10 14:32:47 2017 - [info] Starting master failover..
    Fri Nov 10 14:32:47 2017 - [info]
    From:
    host_1(host_1:3306) (current master)
     +--host_2(host_2:3306)
     +--host_3(host_3:3306)
    
    To:
    host_2(host_2:3306) (new master)
     +--host_3(host_3:3306)
    Fri Nov 10 14:32:47 2017 - [info]
    Fri Nov 10 14:32:47 2017 - [info] * Phase 3.3: New Master Recovery Phase..
    Fri Nov 10 14:32:47 2017 - [info]
    Fri Nov 10 14:32:47 2017 - [info]  Waiting all logs to be applied..
    Fri Nov 10 14:32:47 2017 - [info]   done.
    Fri Nov 10 14:32:47 2017 - [info]  Replicating from the latest slave host_3(host_3:3306) and waiting to apply..
    Fri Nov 10 14:32:47 2017 - [info]  Waiting all logs to be applied on the latest slave..
    Fri Nov 10 14:32:47 2017 - [info]  Resetting slave host_2(host_2:3306) and starting replication from the new master host_3(host_3:3306)..
    Fri Nov 10 14:32:47 2017 - [info]  Executed CHANGE MASTER.
    Fri Nov 10 14:32:48 2017 - [info]  Slave started.
    Fri Nov 10 14:32:48 2017 - [info]  Waiting to execute all relay logs on host_2(host_2:3306)..
    Fri Nov 10 14:32:48 2017 - [info]  master_pos_wait(host_3.000049:42954) completed on host_2(host_2:3306). Executed 0 events.
    Fri Nov 10 14:32:48 2017 - [info]   done.
    Fri Nov 10 14:32:48 2017 - [info]   done.
    Fri Nov 10 14:32:48 2017 - [info] -- Saving binlog from host host_2 started, pid: 76664
    Fri Nov 10 14:32:48 2017 - [info] -- Saving binlog from host host_3 started, pid: 76665
    Fri Nov 10 14:32:48 2017 - [info]
    Fri Nov 10 14:32:48 2017 - [info] Log messages from host_2 ...
    Fri Nov 10 14:32:48 2017 - [info]
    Fri Nov 10 14:32:48 2017 - [info] Fetching binary logs from binlog server host_2..
    Fri Nov 10 14:32:48 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=tjtx-126-164.000058  --start_pos=6912 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171110142950.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
    Failed to save binary log: Binlog not found from /data/mysql.bin! If you got this error at MHA Manager, please set "master_binlog_dir=/path/to/binlog_directory_of_the_master" correctly in the MHA Manager's configuration file and try again.
     at /usr/bin/save_binary_logs line 123
        eval {...} called at /usr/bin/save_binary_logs line 70
        main::main() called at /usr/bin/save_binary_logs line 66
    Fri Nov 10 14:32:48 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
    Fri Nov 10 14:32:48 2017 - [info] End of log messages from host_2.
    Fri Nov 10 14:32:48 2017 - [warning] Got error from host_2.
    Fri Nov 10 14:32:48 2017 - [info]
    Fri Nov 10 14:32:48 2017 - [info] Log messages from host_3 ...
    Fri Nov 10 14:32:48 2017 - [info]
    Fri Nov 10 14:32:48 2017 - [info] Fetching binary logs from binlog server host_3..
    Fri Nov 10 14:32:48 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=tjtx-126-164.000058  --start_pos=6912 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171110142950.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin
    Failed to save binary log: Binlog not found from /data/mysql.bin! If you got this error at MHA Manager, please set "master_binlog_dir=/path/to/binlog_directory_of_the_master" correctly in the MHA Manager's configuration file and try again.
     at /usr/bin/save_binary_logs line 123
        eval {...} called at /usr/bin/save_binary_logs line 70
        main::main() called at /usr/bin/save_binary_logs line 66
    Fri Nov 10 14:32:48 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
    Fri Nov 10 14:32:48 2017 - [info] End of log messages from host_3.
    Fri Nov 10 14:32:48 2017 - [warning] Got error from host_3.
    Fri Nov 10 14:32:48 2017 - [info] Getting new master's binlog name and position..
    Fri Nov 10 14:32:48 2017 - [info]  host_1.000008:6895
    Fri Nov 10 14:32:48 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_2', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
    Fri Nov 10 14:32:48 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: host_1.000008, 6895, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-55,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392
    Fri Nov 10 14:32:48 2017 - [info] Executing master IP activate script:
    Fri Nov 10 14:32:48 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --new_master_host=host_2 --new_master_ip=host_2 --new_master_port=3306 --new_master_user='xxx' --new_master_password='xxx'
    Unknown option: new_master_user
    Unknown option: new_master_password
    ===================    swift vip :  vip to host_2  is added  ==============================
    
    
    Fri Nov 10 14:32:51 2017 - [info]  OK.
    Fri Nov 10 14:32:51 2017 - [info] ** Finished master recovery successfully.
    Fri Nov 10 14:32:51 2017 - [info] * Phase 3: Master Recovery Phase completed.
    Fri Nov 10 14:32:51 2017 - [info]
    Fri Nov 10 14:32:51 2017 - [info] * Phase 4: Slaves Recovery Phase..
    Fri Nov 10 14:32:51 2017 - [info]
    Fri Nov 10 14:32:51 2017 - [info]
    Fri Nov 10 14:32:51 2017 - [info] * Phase 4.1: Starting Slaves in parallel..
    Fri Nov 10 14:32:51 2017 - [info]
    Fri Nov 10 14:32:51 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 80398. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171110142950.log if it takes time..
    Fri Nov 10 14:32:52 2017 - [info]
    Fri Nov 10 14:32:52 2017 - [info] Log messages from host_3 ...
    Fri Nov 10 14:32:52 2017 - [info]
    Fri Nov 10 14:32:51 2017 - [info]  Resetting slave host_3(host_3:3306) and starting replication from the new master host_2(host_2:3306)..
    Fri Nov 10 14:32:51 2017 - [info]  Executed CHANGE MASTER.
    Fri Nov 10 14:32:52 2017 - [info]  Slave started.
    Fri Nov 10 14:32:52 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-55,
    ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392) completed on host_3(host_3:3306). Executed 0 events.
    Fri Nov 10 14:32:52 2017 - [info] End of log messages from host_3.
    Fri Nov 10 14:32:52 2017 - [info] -- Slave on host host_3(host_3:3306) started.
    Fri Nov 10 14:32:52 2017 - [info] All new slave servers recovered successfully.
    Fri Nov 10 14:32:52 2017 - [info]
    Fri Nov 10 14:32:52 2017 - [info] * Phase 5: New master cleanup phase..
    Fri Nov 10 14:32:52 2017 - [info]
    Fri Nov 10 14:32:52 2017 - [info] Resetting slave info on the new master..
    Fri Nov 10 14:32:52 2017 - [info]  host_2: Resetting slave info succeeded.
    Fri Nov 10 14:32:52 2017 - [info] Master failover to host_2(host_2:3306) completed successfully.
    Fri Nov 10 14:32:52 2017 - [info]
    
    ----- Failover Report -----
    
    bak_mha_test: MySQL Master failover host_1(host_1:3306) to host_2(host_2:3306) succeeded
    
    Master host_1(host_1:3306) is down!
    
    Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.
    
    Started automated(non-interactive) failover.
    Invalidated master IP address on host_1(host_1:3306)
    Selected host_2(host_2:3306) as a new master.
    host_2(host_2:3306): OK: Applying all logs succeeded.
    host_2(host_2:3306): OK: Activated master IP address.
    host_3(host_3:3306): OK: Slave started, replicating from host_2(host_2:3306)
    host_2(host_2:3306): Resetting slave info succeeded.
    Master failover to host_2(host_2:3306) completed successfully.
    Fri Nov 10 14:32:52 2017 - [info] Sending mail..
    
    
    
    
    

    结论: binlog server 配置成多台slave,这是正确的方案。 由于master 挂了,master没有传递过来的binlog会丢失,这是没办法的. 好在,其余slave自动补齐现有日志

    • binlog server 啥都不写
    
    会切换成功,由于master 挂了,master没有传递过来的binlog会丢失
    
    好在,其余slave自动补齐现有日志
    
    

    2.6 如果MHA过程中失败,是否可以重新执行MHA的failover呢?

    同1.6结论
    
    

    三、遇到的坑

    3.1 交互模式下,如果没有及时敲'YES',则终止切换

    四、总结

    MHA + GTID 模式,重点配置和用法如下:

    1. command
    masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host=host_1  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
    
    
    2. binlog server
    
    在配置文件中对 master,slave,etl 都写在binlog server中。对MySQL down 和 DB server down  综合考虑下,建议这样配置。
    
    
    3. tgw 清理
    
    dead master 如果还可以起来,那么必须在上面执行: /usr/local/realserver/RS_TUNL0/etc/setup_rs.sh -c
    
    原因可参看:http://gitlab.corp.anjuke.com/_dba/architecture/blob/master/personal/Keithlan/other/share/tools/always_used_command.md ==> TGW 章节
    
    本文为云栖社区原创内容,未经允许不得转载,如需转载请发送邮件至yqeditor@list.alibaba-inc.com;如果您发现本社区中有涉嫌抄袭的内容,欢迎发送邮件至:yqgroup@service.aliyun.com 进行举报,并提供相关证据,一经查实,本社区将立刻删除涉嫌侵权内容
  • 相关阅读:
    jquery 监听input的value值改变
    Win10家庭版共享打印机启用Guest账户
    js数组操作大全(pop,push,unshift,splice,shift方法)
    EF Core索引
    工作经验(C++篇)
    FFmpeg编译i386 OSX 脚本
    xcrun -sdk 选择
    Unity GL 画圆
    OpenGL ES无法获取贴图数据原因
    Unity在Android和iOS中如何调用Native API
  • 原文地址:https://www.cnblogs.com/moss_tan_jun/p/7847050.html
Copyright © 2011-2022 走看看