Replication Lag 的mysql 如何分析

zoukankan html css js c++ java

Replication Lag 的mysql 如何分析
转https://blogs.oracle.com/jsmyth/what-causes-replication-lag

What Causes Replication Lag?

Mark Lewin
MYSQL CURRICULUM DEVELOPER
Replication lag occurs when the slaves (or secondaries) cannot keep up with the updates occuring on the master (or primary). Unapplied changes accumulate in the slaves' relay logs and the version of the database on the slaves becomes increasingly different from that of the master.

To work out what's causing the lag, you must determine which replication thread is getting backed up. Replication relies on three threads per master/slave connection: one is created on the master and two are created on the slave.

The Slave I/O Thread. When you issue START SLAVE on a slave server, the slave creates this thread which connects to the master and requests a copy of the master's binary log.

The Binlog Dump Thread. When the slave connects to the master, the master uses this thread to send the slave the contents of its binary log.

The Slave SQL Thread. The slaves creates this SQL (or applier) thread to read the contents of the retrieved binary log and apply its contents.

Replication lag is caused when either the I/O Thread or SQL Thread cannot cope with the demands placed upon it.

If the I/O Thread is suffering, this means that the network connection between the master and its slaves is slow. You might want to consider enabling the slave_compressed_protocol to compress network traffic or speaking to your network administrator.

If it's the SQL thread then your problem is probably due to poorly-optimized queries that are taking the slave too long to apply. There may be long-running transactions or too much I/O activity. Having no primary key on the slave tables when using the ROW or MIXED replication format is also a common cause of lag on this thread: check that both your master and slave versions of tables have a primary key.

But how do you know which thread is affected? This depends on whether you are using binary log file names and positions or GTIDs.

If you are using binary log file names and positions, then executing SHOW MASTER STATUS and SHOW SLAVE STATUS gives you all you need to work out which thread is causing the problems:

master> SHOW MASTER STATUS; +---------------+-----------+ | File | Position |... +---------------+-----------+ | binlog.000002 | 121752008 |... +---------------+-----------+ slave> SHOW SLAVE STATUSG ********************* 1. row ********************* Slave_IO_State: Waiting for master to send event Master_Host: 127.0.0.1 Master_User: replication Master_Port: 22808 Connect_Retry: 60 Master_Log_File: binlog.000002 Read_Master_Log_Pos: 121409852 Relay_Log_File: relaylog.000002 Relay_Log_Pos: 119819329 Relay_Master_Log_File: binlog.000002 Slave_IO_Running: Yes Slave_SQL_Running: Yes ... Exec_Master_Log_Pos: 120003004 Relay_Log_Space: 121226377

You can calculate any I/O thread lag by subtracting the value of Read_Master_Log_Pos from Position.

Bytes behind master = Position - Read_Master_Log_Pos = 121752008 - 121409852 = 342156 bytes

Note that this assumes that the binary log file name is the same on both the slave and the master, otherwise the calculation becomes more complex because you must take into account the size of the binary log files.

The following calculation tells you if the slave SQL thread is lagging behind the replication master:

Bytes behind master = Position - Exec_Master_Log_Pos = 121752008 - 120003004 = 1749004 bytes

Or, if the slave's SQL thread is lagging behind its I/O thread, then this will be visible in the results of the following calculation:

Bytes behind master = Read_Master_Log_Pos - Exec_Master_Log_Pos = 121409852 - 120003004 = 1406848 bytes

The Seconds_Behind_Master column in the output of SHOW SLAVE STATUS also gives you a rough idea of the extent of the delay. Don't rely on it too much though, because it doesn't take into account long-running transactions or any operations that it performs on the master that is not required on the slave. Its value is best monitored over a period.

If your replication setup uses GTIDs, then you'll use the following information to diagnose replication lag:

Executed GTIDs. Shows which transactions have been executed on the instance. Available on both master and slave. You can also get the GTID of the latest transaction executed from the global gtid_executed status variable.

Retrieved GTIDs. Shows which transactions the I/O thread has received. Available only on the slave.

Purged GTIDs. Shows which transactions have been purged from the binary log after completion. You're only interested in the transactions purged by the slave in this context.

Once you have the GTIDs you can use the GTID_SUBTRACT() function to calculate the difference in the slave's view of the world from that of the master. For example, the following query on the slave shows the GTIDs read from the binary log that have not yet been applied (SQL thread lag):

slave> SELECT GTID_SUBTRACT('96985d6f-2ebc-11e7-84df-08002715584a:5-133', '96985d6f-2ebc-11e7-84df-08002715584a:26-132') AS MissingGTIDs; +-----------------------------------------------+ | MissingGTIDs | +-----------------------------------------------+ | 96985d6f-2ebc-11e7-84df-08002715584a:5-25:133 | +-----------------------------------------------+ 1 row in set (0.00 sec)

###从官方文档来看：
是因为一个大的查询事务在slave 库上跑，引起的其他事务的阻塞，因为会在seconds behind master 上出现大规避的滞后现象
同时，他也解释了 mk-heartbeat 这个工具更为有效的。
from https://serverfault.com/questions/287164/erratic-seconds-behind-master

It's not a reporting error, that's just how MySQL replication works. It transfers over the query log, then runs the queries. If one of the queries take a long time to run, then all the other queries clog up until that one completes. That's why you see the spikes. The fact that mk-heartbeat shows a low average just means that it's not a general overload problem, just a few big queries (or, less likely, an occasional monster load spike on the slave).

Instantaneous "seconds behind master" is a fairly useless figure (except when you want to know Right Now how far behind you are). The mk-heartbeat stats are far better for getting an idea of how overloaded your replication is -- anything higher than about 2-3 seconds on average over a dozen or so pings and you're pooched.
查看全文

相关阅读:
【集训第二天·翻水的老师】--ac自动机+splay树
 【集训第一天·特来卖萌】树链剖分之水水的例题
 1.ARM寄存器简解
 2.汇编代码的简单实现
 Android之Activity 生命周期
 Android之SQLite
魅族 -- 禁止D及以下级别LOG的输出
 常用的SQL语句
 Android开发工具
 Android介绍

原文地址：https://www.cnblogs.com/feiyun8616/p/12163938.html

Replication Lag 的mysql 如何分析

转https://blogs.oracle.com/jsmyth/what-causes-replication-lag

What Causes Replication Lag?