zoukankan      html  css  js  c++  java
  • 一次线上MySQL主从延迟排查

    今天早上来上班,发现zabbix一直告警主从延迟,mysql slave Seconds_Behind_Master (mysql.slave_status[Seconds_Behind_Master]): 69
    登录MySQL从库查看 slave状态,Seconds_Behind_Master: 10562,确实存在延迟
    mysql> show slave statusG
    *************************** 1. row ***************************
    Slave_IO_State: Waiting for master to send event
    ......
    Master_Log_File: mysql-bin.002582
    Read_Master_Log_Pos: 60231375
    Relay_Log_File: relay-log.000998
    Relay_Log_Pos: 283
    Relay_Master_Log_File: mysql-bin.002580
    Slave_IO_Running: Yes
    Slave_SQL_Running: Yes
    Last_Errno: 0
    Last_Error:
    Skip_Counter: 0
    Exec_Master_Log_Pos: 1070638020
    Relay_Log_Space: 29799422300
    Until_Condition: None
    Until_Log_File:
    Until_Log_Pos: 0
    Master_SSL_Allowed: No
    Master_SSL_Key:
    Seconds_Behind_Master: 10562
    Master_SSL_Verify_Server_Cert: No

    1 row in set (0.00 sec)


    于是登录MySQL主库排查可能出现的原因
    1、可能的原因如下
    (1)主从服务器处于不同的网络之中,由于网络延迟导致;
    (2)主从服务器的硬件配置不同,从服务器的硬件配置(包括内存,CPU,网卡等)远低于主服务器;
    (3)主库上有大量的写入操作,导致从库无法实时重放主库上的binlog;
    (4)主库上存在着大事务操作或者慢SQL,导致从库在应用主库binlog的过程过慢,形成延迟;
    (5)数据库实例的参数配置问题导致,如:从库开启了binlog,或者配置了每次事务都去做刷盘操作;

    经过排查,确定不是1、2、5的原因,查看慢日志文件,也没发现主库慢日志,所以判断主库上可能有大量的写入操作,或者主库上存在着大事务操作

    2、查看主库的binlog文件,看到在1号有一个28G的mysql-bin文件
    # ls -rtlsh
    1.1G -rw-rw---- 1 mysql mysql 1.1G Mar 31 19:06 mysql-bin.002579
    28G -rw-rw---- 1 mysql mysql 28G Apr 1 08:02 mysql-bin.002580
    1.1G -rw-rw---- 1 mysql mysql 1.1G Apr 1 10:22 mysql-bin.002581

    3、结合上述条件,用binlog_rollback分析MySQL binlog,判断是主库都执行了什么大的dml语句

    binlog_rollback简介
    binlog_rollback实现了基于row格式binlog的回滚闪回功能,让误删除或者误更新数据,可以不停机不使用备份而快速回滚误操作。
    binlog_rollback也可以解释binlog(支持非row格式binlog)生成易读的SQL,让查找问题如什么时个某个表的某个字段的值被更新成了1,或者找出某个时间内某个表的所有删除操作等问题变得简单。
    binlog_rollback可以按配置输出各个表的update/insert/delete统计报表, 也会输出大事务与长事务的分析, 应用是否干了坏事一目了然, 也会输出所有DDL。
    binlog_rollback通过解释mysql/mariadb binlog/relaylog实现以下三大功能:
    1)flashback/闪回/回滚, DML回滚到任意时间或者位置。
    生成的文件名为rollback.xxx.sql或者db.tb.rollback.xxx.sql
    生成的SQL形式如下
    ```sql
    begin
    DELETE FROM `danny`.`emp` WHERE `id`=1
    # datetime=2017-10-23_00:14:28 database=danny table=emp binlog=mysql-bin.000012 startpos=417 stoppos=575
    commit
    ```
    2)前滚,把binlog/relaylog的DML解释成易读的SQL语句。
    *支持非row格式的binlog, 默认不解释非row格式的DML, 需要指定参数-stsql
    生成的文件名为forward.xxx.sql或者db.tb.forward.xxx.sql
    生成的SQL形式如下
    ```sql
    begin
    # datetime=2017-10-23_00:14:28 database=danny table=emp binlog=mysql-bin.000012 startpos=417 stoppos=575
    INSERT INTO `danny`.`emp` (`id`,`name`,`sr`,`icon`,`points`,`sa`,`sex`) VALUES (1,'张三1','华南理工大学&SCUT',X'89504e47',1.1,1.1,1)
    commit
    ```
    3)统计分析, 输出各个表的DML统计, 输出大事务与长事务, 输出所有的DDL
    DML统计结果文件:binlog_stats.txt
    大事务与长事务结果文件:big_long_trx.txt
    DDL结果文件:ddl_info.txt

    具体可以产考 https://github.com/GoDannyLai/binlog_rollback

    # /root/dba/mysql_dml/bin/binlog_rollback -m file -w stats -M mysql -i 20 -b 100 -l 10 -o /data/dml_binlog/log /data/mysql/var/mysql-bin.002580

    # cat binlog_stats.txt |awk '{print $8}' |sort -nr|head -10
    417936881
    9718
    7183
    7183
    7165
    7162
    7152
    7150
    7137
    7119


    mysql-bin.002580 2020-03-31_18:33:00 2020-03-31_18:33:00 1070638106 3966168092 0 0 417936881 content testing

    可以看到在 2020-03-31_18:33:00的时候content.testing删除了 417936881条数据,是由于delete造成数据库主从延迟的主要原因。

    最后和相关的开发建讨论群追踪问题,发现是一个研发直接delete  form  testing;删了4亿多条数据的一个大事务,从而造成主从延迟。

    解决办法

    (一)

    由于这组实例不是核心业务,qps和tps都不是很高,零时把读写都切到master上,等主从延迟追平了,再把读切换到从库上。

    (二)

    也可以在主库上备份一个全备,再从库上恢复2个新的slave,恢复完了重新做主从。

     
    为了下次方便,写了一个脚本。
     
    建表语句
    CREATE TABLE `t_dml_count` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `insert_totals` int(11) DEFAULT NULL,
      `update_totals` int(11) DEFAULT NULL,
      `delete_totals` int(11) DEFAULT NULL,
      `db_name` varchar(30) DEFAULT NULL,
      `table_name` varchar(50) DEFAULT NULL,
      `binlog_name` varchar(30) DEFAULT NULL,
      `starttime` varchar(50) DEFAULT NULL,
      `stoptime` varchar(50) DEFAULT NULL,
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB  DEFAULT CHARSET=utf8;
     
     
    同步主库binlog脚本
    cat binlog_server.sh
    #!/bin/sh
    case "$1" in
        start)
            /usr/local/mysql/bin/mysqlbinlog -h$2 -root -pxxx -R --raw --stop-never $3  --stop-never-slave-server-id=20003306  -r /data/dml_binlog/binlog/  &  
            ;;
        stop)
            ps -ef|grep  mysqlbinlog|grep -v grep|awk '{print $2}'|xargs kill -9
            ;;
            *)
            echo "Usage: ls {start|stop}"
            exit 1
            ;;
    esac
     
    dml启动和停止脚本
    cat mysql_dml_count.sh
    #!/bin/sh
    case "$1" in
        start)
            /root/dba/mysql_dml/mysql_dml_count.py   &>/dev/null &
            ;;
        stop)
            ps -ef|grep  mysql_dml_count.py|grep -v grep|awk '{print $2}'|xargs kill -9
            /bin/rm -rf /data/dml_binlog/log/*
            ;;
            *)
            echo "Usage: ls {start|stop}"
            exit 1
            ;;
    esac
     
    分析和统计dml脚本
    cat mysql_dml_count.py
     
    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    import sys, os, MySQLdb
     
    dirpath = "/data/dml_binlog/binlog/"
     
    while True:
        file_name = os.listdir(dirpath)
        for file in file_name:
            print(file)
            os.system(
                "/root/dba/mysql_dml/bin/binlog_rollback -m file -w stats -M mysql -i 20 -b 100 -l 10 -o /data/dml_binlog/log   /data/dml_binlog/binlog/%s" % file)
            os.remove("/data/dml_binlog/binlog/%s" % file)
     
            log_info = open("/data/dml_binlog/log/binlog_stats.txt", mode="r")
            for info in log_info.readlines():
                info_array = info.split()
                arg1 = info_array[0]  # binlog_name
                arg2 = info_array[1]  # starttime
                arg3 = info_array[2]  # stoptime
                arg4 = info_array[5]  # insert_totals
                arg5 = info_array[6]  # update_totals
                arg6 = info_array[7]  # delete_totals
                arg7 = info_array[8]  # db_name
                arg8 = info_array[9]  # table_name
     
                # 打开数据库连接
                conn = MySQLdb.connect(host='10.20.30.21', port=3306, user='root', passwd='xxx', db='test')
                # 使用cursor()方法获取操作游标
                cursor = conn.cursor()
                # SQL 插入语句
                sql = "INSERT INTO  t_dml_count(`insert_totals`,`update_totals`,`delete_totals`,`db_name`,`table_name`,`binlog_name`,`starttime`,`stoptime`) values('%s','%s','%s','%s','%s','%s','%s','%s')" % (
                    arg4, arg5, arg6, arg7, arg8, arg1, arg2, arg3)
                try:
                    # 执行sql语句
                    cursor.execute(sql)
                    # 提交到数据库执行
                    conn.commit()
                except Exception as error:
                    print(error)
                    # 发生错误时回滚
                    conn.rollback()
                # 关闭数据库连接
                conn.close()
     
            log_info.close()
     

    查看哪些MySQL数据库有哪些大的dml操作
    mysql> select delete_totals,db_name,table_name,starttime,stoptime from test.t_dml_count order by delete_totals desc limit 10;
    +---------------+--------------------+------------------+---------------------+---------------------+
    | delete_totals | db_name | table_name | starttime | stoptime |
    +---------------+--------------------+------------------+---------------------+---------------------+
    | 417936881 | content | testing | 2020-03-31_18:33:00 | 2020-03-31_18:33:00 |
    | 9718 | content | data_count | 2020-04-01_00:15:01 | 2020-04-01_00:15:01 |
    | 4831 | Log | soLog | 2020-04-01_05:10:40 | 2020-04-01_05:10:59 |
    | 4813 | Log | soLog | 2020-04-01_05:21:20 | 2020-04-01_05:21:39 |
    | 4811 | Log | soLog | 2020-04-01_05:34:00 | 2020-04-01_05:34:19 |
    | 4806 | Log | soLog | 2020-04-01_05:11:40 | 2020-04-01_05:11:59 |
    | 4802 | Log | soLog | 2020-04-01_05:33:40 | 2020-04-01_05:33:59 |
    | 4800 | Log | soLog | 2020-04-01_05:34:40 | 2020-04-01_05:34:59 |
    | 4779 | Log | soLog | 2020-04-01_05:35:20 | 2020-04-01_05:35:39 |
    | 4776 | Log | soLog | 2020-04-01_05:19:40 | 2020-04-01_05:19:59 |
    +---------------+--------------------+------------------+---------------------+---------------------+
    10 rows in set (0.01 sec)

    mysql> select insert_totals,db_name,table_name,starttime,stoptime from test.t_dml_count order by insert_totals desc limit 10; 

    +---------------+--------------------+-----------------------------+---------------------+---------------------+
    | insert_totals | db_name | table_name | starttime | stoptime |
    +---------------+--------------------+-----------------------------+---------------------+---------------------+
    | 9722 | content | data_count | 2020-04-01_00:15:01 | 2020-04-01_00:15:01 |
    | 4000 | Log | LogBackUp | 2020-04-01_05:05:40 | 2020-04-01_05:05:59 |
    | 3500 | Log | LogBackUp | 2020-04-01_05:03:01 | 2020-04-01_05:03:19 |
    | 3500 | Log | LogBackUp | 2020-04-01_05:02:40 | 2020-04-01_05:02:58 |
    | 3500 | Log | LogBackUp | 2020-04-01_05:03:40 | 2020-04-01_05:03:57 |
    | 3500 | Log | LogBackUp | 2020-04-01_05:01:40 | 2020-04-01_05:01:58 |
    | 3500 | Log | LogBackUp | 2020-04-01_05:00:40 | 2020-04-01_05:00:58 |
    | 3500 | Log | LogBackUp | 2020-04-01_05:04:20 | 2020-04-01_05:04:37 |
    | 3500 | Log | LogBackUp | 2020-04-01_05:01:01 | 2020-04-01_05:01:19 |
    | 3500 | Log | LogBackUp | 2020-04-01_05:02:01 | 2020-04-01_05:02:19 |
    +---------------+--------------------+-----------------------------+---------------------+---------------------+
    10 rows in set (0.01 sec)

    mysql> select update_totals ,db_name,table_name,starttime,stoptime from test.t_dml_count order by update_totals desc limit 10;
    +---------------+--------------------+--------------------+---------------------+---------------------+
    | update_totals | db_name | table_name | starttime | stoptime |
    +---------------+--------------------+--------------------+---------------------+---------------------+
    | 2629 | dataanalytics | Code | 2020-04-01_04:10:56 | 2020-04-01_04:11:15 |
    | 2493 | dataanalytics | Code | 2020-04-01_04:10:16 | 2020-04-01_04:10:35 |
    | 2430 | dataanalytics | Code | 2020-04-01_04:11:16 | 2020-04-01_04:11:35 |
    | 2377 | dataanalytics | Code | 2020-04-01_04:10:36 | 2020-04-01_04:10:55 |
    | 1778 | dataanalytics | Code | 2020-04-01_04:10:01 | 2020-04-01_04:10:15 |
    | 959 | dataanalytics | Code | 2020-04-01_04:11:36 | 2020-04-01_04:11:43 |
    | 503 | Log | PointLog | 2020-04-01_08:03:09 | 2020-04-01_08:03:28 |
    | 431 | content | push | 2020-04-01_15:15:13 | 2020-04-01_15:15:13 |
    | 351 | Log | PointLog | 2020-04-01_11:43:09 | 2020-04-01_11:43:28 |
    | 349 | Log | PointLog | 2020-04-01_11:51:09 | 2020-04-01_11:51:28 |
    +---------------+--------------------+--------------------+---------------------+---------------------+
    10 rows in set (0.02 sec)

  • 相关阅读:
    Linux 创建sftp用户并限制目录权限
    idea操作maven时控制台中文显示乱码/maven项目启动方式
    Docker 容器镜像删除
    Golang 在 Mac、Linux、Windows 下如何交叉编译
    Linux后台运行Jar方法
    MAC安装JDK及环境变量配置
    Docker 容器镜像删除
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
    《设计模式之禅》之状态模式
    《设计模式之禅》之访问者模式
  • 原文地址:https://www.cnblogs.com/manger/p/12613515.html
Copyright © 2011-2022 走看看