  • my31_MGR single-primary mode stress test, with notes comparing a regular replica

    Scenario
    Three-node MGR in single-primary mode: db46 is the write (primary) node, db47/db48 are read nodes.
    Tool: sysbench. The stress test ran for 15 hours: 18 write-only threads plus 12 mixed OLTP threads against db46, and 12 SELECT threads against db48.
    About 2 hours into the test, both read nodes stalled and stopped applying writes. The scripts had been started the previous evening and the problem was only discovered the next morning, at which point troubleshooting began.

    Log messages
    2019-03-01T12:35:23.921815+08:00 2763 [Note] Multi-threaded slave statistics for channel 'group_replication_applier': seconds elapsed = 568; events assigned = 185124865; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 2030997800 waited (count) when Workers occupied = 75103 waited when Workers occupied = 22073734900
    2019-03-01T12:37:29.742741+08:00 2763 [Note] Multi-threaded slave statistics for channel 'group_replication_applier': seconds elapsed = 126; events assigned = 185125889; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 2032822300 waited (count) when Workers occupied = 75114 waited when Workers occupied = 22075589000
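    While the read nodes were stalling, the per-worker applier state can be inspected through performance_schema (a minimal sketch using standard MySQL 5.7 tables; it shows whether any applier worker of the group replication channel is idle, stuck, or failing with an error):

    -- one row per multi-threaded applier worker of the MGR channel
    SELECT WORKER_ID, THREAD_ID, SERVICE_STATE,
           LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE
      FROM performance_schema.replication_applier_status_by_worker
     WHERE CHANNEL_NAME = 'group_replication_applier';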

    On the read nodes, group replication could not be stopped; the statement just hung:
    mysql> stop group_replication;

    While it was hanging, the following was logged:

    2019-03-01T12:54:22.192891+08:00 16 [ERROR] Plugin group_replication reported: 'Timeout on wait for view after joining group'
    2019-03-01T12:54:22.192959+08:00 16 [Note] Plugin group_replication reported: 'Requesting to leave the group despite of not being a member'
    2019-03-01T12:54:22.193012+08:00 16 [ERROR] Plugin group_replication reported: '[GCS] The member is leaving a group without being on one.'
    2019-03-01T12:54:22.193109+08:00 0 [Warning] Plugin group_replication reported: 'read failed'
    2019-03-01T12:54:22.193211+08:00 0 [Warning] Plugin group_replication reported: '[GCS] The member has left the group but the new view will not be installed, probably because it has not been delivered yet.'
    2019-03-01T12:54:27.193393+08:00 16 [Note] Plugin group_replication reported: 'auto_increment_increment is reset to 1'
    2019-03-01T12:54:27.193414+08:00 16 [Note] Plugin group_replication reported: 'auto_increment_offset is reset to 1'
    2019-03-01T12:54:27.193598+08:00 164 [Note] Error reading relay log event for channel 'group_replication_applier': slave SQL thread was killed
    2019-03-01T12:54:27.194129+08:00 164 [Note] Slave SQL thread for channel 'group_replication_applier' exiting, replication stopped in log 'FIRST' at position 374
    2019-03-01T12:54:27.194564+08:00 161 [Note] Plugin group_replication reported: 'The group replication applier thread was killed'

    After force-killing the mysqld process on read node db48, it could not rejoin the group:
    mysql> start group_replication;
    ERROR 3092 (HY000): The server is not configured properly to be an active member of the group. Please see more details on error log.
    mysql> set global group_replication_allow_local_disjoint_gtids_join=ON;
    Query OK, 0 rows affected, 1 warning (0.00 sec)

    mysql> start group_replication;
    ERROR 3092 (HY000): The server is not configured properly to be an active member of the group. Please see more details on error log.
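    Error 3092 persisting even with this workaround usually points to the joiner's GTID set conflicting with the group's. Before forcing anything further, the local GTID sets can be compared against an ONLINE member (a sketch; the quoted string is a hypothetical placeholder for the gtid_executed value copied from that member):

    -- on db48: what this node has executed and purged locally
    SELECT @@GLOBAL.gtid_executed, @@GLOBAL.gtid_purged\G
    -- transactions only db48 has (paste the ONLINE member's gtid_executed)
    SELECT GTID_SUBTRACT(@@GLOBAL.gtid_executed,
                         '<gtid_executed copied from an ONLINE member>') AS local_only\G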

    Running stop group_replication; on the then-primary db46 also hung. After a while the primary role switched over to db47.
    At that point db48 was shut down cleanly and restarted; start group_replication; then succeeded and it joined the group, with db47 shown as the primary.
    The original primary db46 was no longer in the group:
    mysql> SELECT MEMBER_ID,MEMBER_HOST,MEMBER_PORT,MEMBER_STATE FROM performance_schema.replication_group_members;
    +--------------------------------------+-------------+-------------+--------------+
    | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
    +--------------------------------------+-------------+-------------+--------------+
    | 3b6d8d58-3b31-11e9-b581-6c92bfb23298 | db47        |        3301 | ONLINE       |
    | d29a16c7-3b2f-11e9-8c10-6c92bfb234c0 | db48        |        3301 | RECOVERING   |
    +--------------------------------------+-------------+-------------+--------------+

    The key question: with the applier threads stalled, would the data written to db46 still be replicated to db47, given that db46 no longer appeared as a group member?
    Checking db47's data files showed their size was no longer growing and was smaller than on db46, i.e. db47 was no longer receiving data from db46.
    db48, however, was syncing from db47. Both db47 and db48 lagged behind db46, with db48 lagging more, so once db47 became the primary, db48 started catching up from db47.

    Data loss was unacceptable to the business: we would rather discard db47 and db48 entirely than discard db46, because db46 held the complete data.
    Two takeaways at this point:
    1. Attach additional replicas directly under the write node, not under a read node or another replica; if that read node fails, every node downstream of it ends up with incomplete data.
    2. A normal MGR switchover assumes the nodes' data is consistent and that inter-node replication and communication are healthy. If normal commands cannot be executed, e.g. stop group_replication hangs, the group is already in trouble.
    Before shutting anything down, first record where things went wrong and review the error logs (see the sketch after this list);
    do not jump straight to force-killing instances, and never reset a node before having a recovery plan worked out.
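    A minimal sketch of what is worth capturing on each node before any forceful action (standard MySQL 5.7 diagnostic statements):

    -- group membership as this node currently sees it
    SELECT MEMBER_ID, MEMBER_HOST, MEMBER_PORT, MEMBER_STATE
      FROM performance_schema.replication_group_members;
    -- binlog file/position and executed GTID set, for later recovery
    SHOW MASTER STATUS\G
    -- GTID bookkeeping
    SELECT @@GLOBAL.gtid_executed, @@GLOBAL.gtid_purged\G
    -- state of the group replication applier channel
    SELECT * FROM performance_schema.replication_applier_status\G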

    The priority was to bring the write node back first. On the original primary db46, execute:
    set global group_replication_bootstrap_group=on;
    start group_replication;
    set global group_replication_bootstrap_group=off;
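    Before touching the other nodes, it is worth confirming that db46 came up as the only ONLINE member and as the primary (a quick check; on MySQL 5.7 the primary is exposed as a status variable):

    SELECT MEMBER_HOST, MEMBER_STATE
      FROM performance_schema.replication_group_members;
    -- UUID of the current single-primary member
    SELECT VARIABLE_VALUE AS primary_uuid
      FROM performance_schema.global_status
     WHERE VARIABLE_NAME = 'group_replication_primary_member';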


    Resetting node db47. These are the steps actually taken at the time: the node was reset without its executed position being recorded first, which was a mistake.
    Since that is what actually happened, it is recorded here, and the data-recovery implications of doing it this way are analyzed further below.
    mysql> show master status\G;
    *************************** 1. row ***************************
    File: mysql-bin.000047
    Position: 270
    Binlog_Do_DB:
    Binlog_Ignore_DB:
    Executed_Gtid_Set: 3b6d8d58-3b31-11e9-b581-6c92bfb23298:1-583,
    aaaaaaaa-bbba-ccca-ddda-aaaaaaaaa104:1-52150701,
    c7a444f8-3b30-11e9-8336-6c92bfb2329c:1-49
    1 row in set (0.00 sec)

    ERROR:
    No query specified

    mysql> reset master;
    Query OK, 0 rows affected (5.23 sec)

    mysql> show master status\G;
    *************************** 1. row ***************************
    File: mysql-bin.000001
    Position: 150
    Binlog_Do_DB:
    Binlog_Ignore_DB:
    Executed_Gtid_Set:
    1 row in set (0.00 sec)

    ERROR:
    No query specified

    mysql> start group_replication;
    Query OK, 0 rows affected (7.58 sec)

    mysql> show master status\G;
    *************************** 1. row ***************************
    File: mysql-bin.000001
    Position: 1946
    Binlog_Do_DB:
    Binlog_Ignore_DB:
    Executed_Gtid_Set: aaaaaaaa-bbba-ccca-ddda-aaaaaaaaa104:1-4,
    c7a444f8-3b30-11e9-8336-6c92bfb2329c:1-2
    1 row in set (0.00 sec)

    ERROR:
    No query specified

    mysql> SELECT MEMBER_ID,MEMBER_HOST,MEMBER_PORT,MEMBER_STATE FROM performance_schema.replication_group_members;
    +--------------------------------------+-------------+-------------+--------------+
    | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
    +--------------------------------------+-------------+-------------+--------------+
    | 3b6d8d58-3b31-11e9-b581-6c92bfb23298 | db47        |        3301 | RECOVERING   |
    | c7a444f8-3b30-11e9-8336-6c92bfb2329c | db46        |        3301 | ONLINE       |
    +--------------------------------------+-------------+-------------+--------------+

    When no recovery starting point is specified (gtid_purged left empty), MGR tries to recover everything from the very beginning. The log shows distributed recovery being retried with other donors but never getting going, and eventually it gives up on syncing from the start:
    2019-03-01T14:14:10.430561+08:00 928 [Note] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='db46', master_port= 3301, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
    2019-03-01T14:14:10.432469+08:00 928 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 2/10'

    db48 was recovered by explicitly specifying its executed position (GTID set):

    mysql> show master status\G;
    *************************** 1. row ***************************
    File: mysql-bin.000045
    Position: 59480248
    Binlog_Do_DB:
    Binlog_Ignore_DB:
    Executed_Gtid_Set: aaaaaaaa-bbba-ccca-ddda-aaaaaaaaa104:1-49285172,
    c7a444f8-3b30-11e9-8336-6c92bfb2329c:1-49
    1 row in set (0.00 sec)


    reset master;

    SET @@GLOBAL.GTID_PURGED='aaaaaaaa-bbba-ccca-ddda-aaaaaaaaa104:1-49285172,c7a444f8-3b30-11e9-8336-6c92bfb2329c:1-49';
    start group_replication;
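    A quick check after this step (a sketch; the expectation is that gtid_executed now starts from the purged set just recorded, and that db48 moves from RECOVERING to ONLINE once it has caught up):

    SELECT @@GLOBAL.gtid_executed\G
    SELECT MEMBER_HOST, MEMBER_STATE
      FROM performance_schema.replication_group_members;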

    As for db47: reset master had already been run on it, wiping its binlogs and GTID state, and its previous position had not been recorded. db48 had been lagging behind db47,
    so if db47 is made to resume from db48's recorded position, a large number of already-applied transactions will inevitably be replayed. Case by case:
    1. INSERT: since replication is log-based, re-inserting the same rows raises a duplicate-key (primary key conflict) error 1062; this can be skipped via the config file (see the sketch after this list).
    2. UPDATE: the same row is updated again with the same value, e.g. SET a = 1 when column a is already 1; the value stays 1, so data correctness is not affected.
    3. DELETE: the delete is replayed against a row that no longer exists, raising error 1032 (record not found); skipping it does not affect correctness.
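    A minimal config sketch for skipping those two error codes during the replay (assuming they are safe to skip only in this controlled catch-up scenario; slave_skip_errors is not dynamic, so it goes in the config file and should be removed once db47 has caught up):

    # skip duplicate-key (1062) and row-not-found (1032) errors while db47
    # replays transactions it has already applied
    slave_skip_errors = 1062,1032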

    No DDL was run during the test, so once db47 comes back ONLINE its data should match db46; this was to be verified after recovery finished.
    After the sync completed, the test table on the primary db46 had 61272706 rows, and db47 also had 61272706 rows, matching expectations.
    Of course, matching row counts do not truly prove that every row is identical; time was limited, so this was not verified in detail.
    The working assumption was that, after replaying roughly 13 hours of lagging data containing both inserts and deletes, still ending up with identical counts on roughly 60 million rows means the data is consistent.
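    For a stronger check than row counts, a table checksum could be compared on both nodes (a sketch; sbtest1 is a hypothetical sysbench table name, and CHECKSUM TABLE scans the whole table, so it is expensive on ~60M rows):

    -- run on db46 and on db47, then compare the Checksum column
    CHECKSUM TABLE sbtest1;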

    At this point the data on this cluster was recovered. Going back to the log message quoted at the beginning, it contained:
    waited when Workers occupied = 22075589000
    i.e. waits caused by the applier Workers being occupied; the remedy applied here was to increase the slave_pending_jobs_size_max parameter.

    mysql> show variables like 'slave_pending_jobs_size_max';
    +-----------------------------+----------+
    | Variable_name               | Value    |
    +-----------------------------+----------+
    | slave_pending_jobs_size_max | 16777216 |
    +-----------------------------+----------+
    1 row in set (0.00 sec)

    The server is fairly well provisioned, so I set a fairly large value:
    set global slave_pending_jobs_size_max=64424509440;
    This only changes the setting on the running instance and is lost after a restart, so also add it to the config file:
    slave_pending_jobs_size_max=64424509440
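    As a related check, the 5.7 documentation recommends keeping slave_pending_jobs_size_max no smaller than max_allowed_packet, so it is worth confirming both values on the running instance (a sketch):

    SHOW VARIABLES LIKE 'max_allowed_packet';
    SHOW VARIABLES LIKE 'slave_pending_jobs_size_max';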

    Next, another test: take db48 out of the MGR group and attach it to db46 as a regular replica, then rerun the stress test
    to compare the performance of a regular replica against an MGR read node.

    Converting MGR read node db48 into a regular replica:
    stop group_replication;
    CHANGE MASTER TO MASTER_HOST='db46',MASTER_PORT=3301,MASTER_USER='rpl_user',MASTER_PASSWORD='11111111',MASTER_AUTO_POSITION=1;

    start slave;
    show slave status\G;


    Conclusions:
    Observation 1: under heavy load, both MGR read nodes failed while the write node stayed up.
    Conclusion 1: MGR read nodes do not tolerate load as well as the write node; under heavy load a read node is more likely to fail than the write node.
    Observation 2: on identical hardware, with one MGR write node, one MGR read node and one regular replica, the MGR read node performs worse than the regular replica.
    Conclusion 2: under high load an MGR read node carries a performance penalty compared to a regular replica; how large the penalty is likely depends on hardware and on the load itself, and should be verified with concrete tests.
    Measures:
    1. Attach some regular replicas under the MGR cluster to serve as disaster-recovery copies of the production database.
    2. The hardware and performance budget of MGR read nodes can be set somewhat higher than that of the write node.
    3. For systems that can tolerate replication lag: one MGR write node + one MGR read node + two regular replicas, with the two replicas serving external traffic and the MGR read node kept as a disaster-recovery copy that takes over automatically on failure; if machines are plentiful, a 3 (MGR) + 2 (replica) layout can be used.
    4. If high availability is the priority, use 3 (MGR) + 1 (replica), with the replica as the disaster-recovery copy and the three MGR nodes providing high availability.
