zoukankan      html  css  js  c++  java
  • Replication Lag 的mysql 如何分析

    https://blogs.oracle.com/jsmyth/what-causes-replication-lag

    What Causes Replication Lag?

     
    Mark Lewin
    MYSQL CURRICULUM DEVELOPER

    Replication lag occurs when the slaves (or secondaries) cannot keep up with the updates occuring on the master (or primary). Unapplied changes accumulate in the slaves' relay logs and the version of the database on the slaves becomes increasingly different from that of the master.

     

    To work out what's causing the lag, you must determine which replication thread is getting backed up. Replication relies on three threads per master/slave connection: one is created on the master and two are created on the slave.

    • The Slave I/O Thread. When you issue START SLAVE on a slave server, the slave creates this thread which connects to the master and requests a copy of the master's binary log.
    • The Binlog Dump Thread. When the slave connects to the master, the master uses this thread to send the slave the contents of its binary log.
    • The Slave SQL Thread. The slaves creates this SQL (or applier) thread to read the contents of the retrieved binary log and apply its contents.

    Replication lag is caused when either the I/O Thread or SQL Thread cannot cope with the demands placed upon it.

    If the I/O Thread is suffering, this means that the network connection between the master and its slaves is slow. You might want to consider enabling the slave_compressed_protocol to compress network traffic or speaking to your network administrator.

    If it's the SQL thread then your problem is probably due to poorly-optimized queries that are taking the slave too long to apply. There may be long-running transactions or too much I/O activity. Having no primary key on the slave tables when using the ROW or MIXED replication format is also a common cause of lag on this thread: check that both your master and slave versions of tables have a primary key.

    But how do you know which thread is affected? This depends on whether you are using binary log file names and positions or GTIDs.

    If you are using binary log file names and positions, then executing SHOW MASTER STATUS and SHOW SLAVE STATUS gives you all you need to work out which thread is causing the problems:

    master> SHOW MASTER STATUS;
    +---------------+-----------+
    | File          | Position  |...
    +---------------+-----------+
    | binlog.000002 | 121752008 |...
    +---------------+-----------+
    
    slave> SHOW SLAVE STATUSG
    ********************* 1. row *********************
             Slave_IO_State: Waiting for master to send event
                Master_Host: 127.0.0.1
                Master_User: replication
                Master_Port: 22808
              Connect_Retry: 60
            Master_Log_File: binlog.000002
        Read_Master_Log_Pos: 121409852
             Relay_Log_File: relaylog.000002
              Relay_Log_Pos: 119819329
      Relay_Master_Log_File: binlog.000002
           Slave_IO_Running: Yes
          Slave_SQL_Running: Yes
    ...
        Exec_Master_Log_Pos: 120003004
            Relay_Log_Space: 121226377
    

    You can calculate any I/O thread lag by subtracting the value of Read_Master_Log_Pos from Position.

    Bytes behind master = Position - Read_Master_Log_Pos
    = 121752008 - 121409852
    = 342156 bytes
    

    Note that this assumes that the binary log file name is the same on both the slave and the master, otherwise the calculation becomes more complex because you must take into account the size of the binary log files.

    The following calculation tells you if the slave SQL thread is lagging behind the replication master:

    Bytes behind master = Position - Exec_Master_Log_Pos
    = 121752008 - 120003004
    = 1749004 bytes
    

    Or, if the slave's SQL thread is lagging behind its I/O thread, then this will be visible in the results of the following calculation:

    Bytes behind master = Read_Master_Log_Pos - Exec_Master_Log_Pos
    = 121409852 - 120003004
    = 1406848 bytes
    

    The Seconds_Behind_Master column in the output of SHOW SLAVE STATUS also gives you a rough idea of the extent of the delay. Don't rely on it too much though, because it doesn't take into account long-running transactions or any operations that it performs on the master that is not required on the slave. Its value is best monitored over a period.

    If your replication setup uses GTIDs, then you'll use the following information to diagnose replication lag:

    • Executed GTIDs. Shows which transactions have been executed on the instance. Available on both master and slave. You can also get the GTID of the latest transaction executed from the global gtid_executed status variable.
    • Retrieved GTIDs. Shows which transactions the I/O thread has received. Available only on the slave.
    • Purged GTIDs. Shows which transactions have been purged from the binary log after completion. You're only interested in the transactions purged by the slave in this context.

    Once you have the GTIDs you can use the GTID_SUBTRACT() function to calculate the difference in the slave's view of the world from that of the master. For example, the following query on the slave shows the GTIDs read from the binary log that have not yet been applied (SQL thread lag):

    slave> SELECT GTID_SUBTRACT('96985d6f-2ebc-11e7-84df-08002715584a:5-133', 
            '96985d6f-2ebc-11e7-84df-08002715584a:26-132') AS MissingGTIDs;
    +-----------------------------------------------+
    | MissingGTIDs                                  |
    +-----------------------------------------------+
    | 96985d6f-2ebc-11e7-84df-08002715584a:5-25:133 |
    +-----------------------------------------------+
    1 row in set (0.00 sec)


    ###从官方文档来看:
    是因为一个大的查询事务在slave 库上跑,引起的其他事务的阻塞,因为会在seconds behind master 上出现大规避的滞后现象
    同时,他也解释了 mk-heartbeat 这个工具更为有效的。
    from https://serverfault.com/questions/287164/erratic-seconds-behind-master

    It's not a reporting error, that's just how MySQL replication works. It transfers over the query log, then runs the queries. If one of the queries take a long time to run, then all the other queries clog up until that one completes. That's why you see the spikes. The fact that mk-heartbeat shows a low average just means that it's not a general overload problem, just a few big queries (or, less likely, an occasional monster load spike on the slave).

    Instantaneous "seconds behind master" is a fairly useless figure (except when you want to know Right Now how far behind you are). The mk-heartbeat stats are far better for getting an idea of how overloaded your replication is -- anything higher than about 2-3 seconds on average over a dozen or so pings and you're pooched.

  • 相关阅读:
    android 休眠唤醒机制分析(三) — suspend
    android 休眠唤醒机制分析(一) — wake_lock
    开机音乐不发声的问题
    Linux的时钟管理
    Android4.2增加新键值
    _IO, _IOR, _IOW, _IOWR 宏的用法与解析
    Mifare 0简介
    Mifare 1卡的存储结构
    Maven 介绍
    DAL 层引用 System.Net.Http ,引发的一阵心慌
  • 原文地址:https://www.cnblogs.com/feiyun8616/p/12163938.html
Copyright © 2011-2022 走看看