  • A Study of Master-Slave Failover in Redis Cluster


    1. Preface






    vars currentEpoch 27 lastVoteEpoch 27

    2. The slave starts an election


    500 milliseconds + random delay between 0 and 500 milliseconds + SLAVE_RANK * 1000 milliseconds
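
    The delay formula above can be checked with a few lines of C. This is only a sketch: election_delay_ms is a hypothetical helper, and random_part stands in for the random component (0 to 499 milliseconds).

```c
#include <assert.h>

/* Hypothetical helper: delay in milliseconds before a slave starts an
 * election. random_part stands in for the random component (0..499). */
long election_delay_ms(int slave_rank, long random_part) {
    assert(slave_rank >= 0);
    assert(random_part >= 0 && random_part < 500);
    /* fixed 500 ms + random part + 1000 ms per rank position */
    return 500 + random_part + (long)slave_rank * 1000;
}
```

    For a slave with rank 0 and a random part of 79, election_delay_ms(0, 79) yields 579, which matches the "Start of election delayed for 579 milliseconds" line in the slave log of section 6.5.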


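    1) The slave's master is in the fail state (not merely pfail);

    2) The slave's master serves at least one slot;

    3) The replication link between the slave and its master has been down for no longer than a given limit. The limit is configurable and exists to make sure the slave's data is reasonably complete. In operations, therefore, a slave must not be left unavailable for a long time; monitoring should detect an abnormal slave and restore it promptly.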
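
    Condition 3 can be sketched as a standalone function. This is an illustration modelled on the check in the Redis source quoted in this article, not the Redis code itself; the function name and parameter names are assumptions.

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the data-age check. All names are illustrative.
 * disconnected_ms:    how long the replication link has been down (ms)
 * node_timeout_ms:    cluster-node-timeout (ms)
 * repl_ping_period_s: replication ping period (seconds)
 * validity_factor:    cluster-replica-validity-factor (0 disables the check) */
bool data_too_old(long long disconnected_ms, long long node_timeout_ms,
                  long long repl_ping_period_s, long long validity_factor) {
    long long data_age = disconnected_ms;
    assert(disconnected_ms >= 0);
    if (validity_factor == 0) return false;
    /* The time the master needed to be flagged as FAIL is not counted. */
    if (data_age > node_timeout_ms) data_age -= node_timeout_ms;
    return data_age >
           repl_ping_period_s * 1000 + node_timeout_ms * validity_factor;
}
```

    With, for example, a node timeout of 15000 ms, a ping period of 10 s and a validity factor of 10, the limit is 10000 + 150000 = 160000 ms: a slave disconnected for 100 seconds may still fail over, while one disconnected for 200 seconds may not.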




    12961:S 06 Jan 2019 19:00:21.969 # Currently unable to failover: Disconnected from master for longer than allowed. Please check the 'cluster-replica-validity-factor' configuration option.


    /* This function is called if we are a slave node and our master serving
     * a non-zero amount of hash slots is in FAIL state.
     *
     * The goal of this function is:
     * 1) To check if we are able to perform a failover, is our data updated?
     * 2) Try to get elected by masters.
     * 3) Perform the failover informing all the other nodes.
     */
    void clusterHandleSlaveFailover(void) {
        mstime_t data_age; // time disconnected from the master, in milliseconds
        mstime_t auth_age = mstime() - server.cluster->failover_auth_time;
        int needed_quorum = (server.cluster->size / 2) + 1;
        int manual_failover = server.cluster->mf_end != 0 &&
                              server.cluster->mf_can_start;
        mstime_t auth_timeout, auth_retry_time;

        auth_timeout = server.cluster_node_timeout*2;
        if (auth_timeout < 2000) auth_timeout = 2000;
        auth_retry_time = auth_timeout*2;

        /* ... */

        /* Set data_age to the number of seconds we are disconnected from
         * the master. */
        if (server.repl_state == REPL_STATE_CONNECTED) {
            data_age = (mstime_t)(server.unixtime - server.master->lastinteraction) * 1000;
        } else {
            data_age = (mstime_t)(server.unixtime - server.repl_down_since) * 1000;
        }

        /* Remove the node timeout from the data age as it is fine that we are
         * disconnected from our master at least for the time it was down to be
         * flagged as FAIL, that's the baseline. */
        if (data_age > server.cluster_node_timeout)
            data_age -= server.cluster_node_timeout;

        /* Check if our data is recent enough according to the slave validity
         * factor configured by the user.
         *
         * Check bypassed for manual failovers. */
        if (server.cluster_slave_validity_factor &&
            data_age >
            (((mstime_t)server.repl_ping_slave_period * 1000) +
             (server.cluster_node_timeout * server.cluster_slave_validity_factor)))
        {
            // The slave has been disconnected for too long, so it cannot be
            // promoted to master automatically
            if (!manual_failover) { // manual failover is exempt
                clusterLogCantFailover(CLUSTER_CANT_FAILOVER_DATA_AGE);
                return;
            }
        }

        /* ... */

        /* Ask for votes if needed. */
        // failover_auth_sent records whether the vote request has already been sent
        if (server.cluster->failover_auth_sent == 0) {
            server.cluster->currentEpoch++;
            server.cluster->failover_auth_epoch = server.cluster->currentEpoch;
            serverLog(LL_WARNING,"Starting a failover election for epoch %llu.",
                (unsigned long long) server.cluster->currentEpoch);
            // Send FAILOVER_AUTH_REQUEST (a request to be voted master) to all
            // nodes, slaves included; note that only masters reply to it
            clusterRequestFailoverAuth();
            server.cluster->failover_auth_sent = 1;
            clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG|
                                 CLUSTER_TODO_UPDATE_STATE|
                                 CLUSTER_TODO_FSYNC_CONFIG);
            return; /* Wait for replies. */
        }

        /* Check if we reached the quorum. */
        if (server.cluster->failover_auth_count >= needed_quorum) {
            /* We have the quorum, we can finally failover the master. */
            serverLog(LL_WARNING,
                "Failover election won: I'm the new master.");

            /* Update my configEpoch to the epoch of the election. */
            if (myself->configEpoch < server.cluster->failover_auth_epoch) {
                myself->configEpoch = server.cluster->failover_auth_epoch;
                serverLog(LL_WARNING,
                    "configEpoch set to %llu after successful failover",
                    (unsigned long long) myself->configEpoch);
            }

            /* Take responsibility for the cluster slots. */
            clusterFailoverReplaceYourMaster();
        } else {
            clusterLogCantFailover(CLUSTER_CANT_FAILOVER_WAITING_VOTES);
        }
    }

    3. Masters respond to the election


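    1) A master votes only once per epoch;

    2) It rejects every vote request that carries a smaller epoch;

    3) It never votes for an epoch lower than lastVoteEpoch;

    4) A master votes only for a slave whose master is in the fail state;

    5) If the currentEpoch in a slave's request is lower than the master's currentEpoch, the master ignores the request. Without this rule, the following problem could occur:

    ① Suppose the master's currentEpoch is 5 and its lastVoteEpoch is 1 (this happens after failed elections: currentEpoch has increased, but because the elections failed, lastVoteEpoch has not changed);

    ② The slave's currentEpoch is 3;

    ③ The slave increments it and starts an election with epoch 4; the master replies with epoch 5, but the reply is delayed;

    ④ The slave starts another election, this time with epoch 5 (the epoch is incremented for every election). The delayed reply happens to arrive at this moment, and the slave accepts it as valid.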


    4. Election example


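    1) Suppose slave A wins the election and becomes master;

    2) Slave A becomes unavailable because of a network partition;

    3) Slave B wins the election;

    4) Slave B becomes unavailable because of a network partition;

    5) The network partition is repaired and slave A is available again.

    B is now down and A is available again. At the same moment, slave C starts an election to replace B as master. Because slave C's master is unavailable, C can be elected master, and it increments configEpoch by one. A cannot become master again, because C is already the master and C's epoch is greater.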

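    5. How hash slot information propagates

    Hash slot information propagates in two ways:

    1) Heartbeat messages. Whenever a node sends a ping or pong message, it includes the hash slots it serves (or the ones its master serves);

    2) UPDATE messages. Because heartbeat packets also carry epoch information, a receiver that finds the information in a heartbeat stale replies with the newer information, which forces the sender to update its hash slots.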

    6. Failover record 1


    6.1. Relevant parameters





    6.2. Timeline


    Master A marks the node as fail: 20:12:55.467

    Master B marks the node as fail: 20:12:55.467

    Master A votes: 20:12:56.164

    Master B votes: 20:12:56.164
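
    The interval between the fail mark and the vote can be computed from the two timestamps. The helper below is a minimal sketch, assuming every timestamp has the form HH:MM:SS.mmm with exactly three millisecond digits; parse_log_time_ms is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>

/* Parse "HH:MM:SS.mmm" into milliseconds since midnight.
 * Returns -1 if the string does not have that shape. */
long parse_log_time_ms(const char *s) {
    int h, m, sec, ms;
    assert(s != NULL);
    if (sscanf(s, "%d:%d:%d.%d", &h, &m, &sec, &ms) != 4) return -1;
    return ((h * 60L + m) * 60L + sec) * 1000L + ms;
}
```

    For this record, parse_log_time_ms("20:12:56.164") - parse_log_time_ms("20:12:55.467") gives 697, so the votes were granted about 0.7 seconds after the node was marked as fail.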








    6.3. Log of another master

    master ID: c67dc9e02e25f2e6321df8ac2eb4d99789917783

    30613:M 04 Jan 2019 20:12:55.467 * FAIL message received from bfad383775421b1090eaa7e0b2dcfb3b38455079 about 44eb43e50c101c5f44f48295c42dda878b6cb3e9 // receives from another master the message that 44eb43e50c101c5f44f48295c42dda878b6cb3e9 has failed

    30613:M 04 Jan 2019 20:12:55.467 # Cluster state changed: fail

    30613:M 04 Jan 2019 20:12:56.164 # Failover auth granted to 0ae8b5400d566907a3d8b425d983ac3b7cbd8412 for epoch 30 // votes in the election

    30613:M 04 Jan 2019 20:12:56.204 # Cluster state changed: ok

    30613:M 04 Jan 2019 20:12:56.708 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9

    6.4. Log of another master

    master ID: bfad383775421b1090eaa7e0b2dcfb3b38455079

    30614:M 04 Jan 2019 20:12:55.467 * Marking node 44eb43e50c101c5f44f48295c42dda878b6cb3e9 as failing (quorum reached). // marks 44eb43e50c101c5f44f48295c42dda878b6cb3e9 as fail

    30614:M 04 Jan 2019 20:12:56.164 # Failover auth granted to 0ae8b5400d566907a3d8b425d983ac3b7cbd8412 for epoch 30 // votes in the election

    30614:M 04 Jan 2019 20:12:56.709 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9

    6.5. Slave log

    The slave's master ID is 44eb43e50c101c5f44f48295c42dda878b6cb3e9, and the slave's own ID is 0ae8b5400d566907a3d8b425d983ac3b7cbd8412.

    30651:S 04 Jan 2019 20:12:32.810 # MASTER timeout: no data nor PING received... // detects the master timeout; the failure is noticed 10 seconds after the master goes down because repl-timeout is set to 10

    30651:S 04 Jan 2019 20:12:32.810 # Connection with master lost.

    30651:S 04 Jan 2019 20:12:32.810 * Caching the disconnected master state.

    30651:S 04 Jan 2019 20:12:32.810 * Connecting to MASTER

    30651:S 04 Jan 2019 20:12:32.810 * MASTER <-> REPLICA sync started

    30651:S 04 Jan 2019 20:12:32.810 * Non blocking connect for SYNC fired the event.

    30651:S 04 Jan 2019 20:12:43.834 # Timeout connecting to the MASTER...

    30651:S 04 Jan 2019 20:12:43.834 * Connecting to MASTER

    30651:S 04 Jan 2019 20:12:43.834 * MASTER <-> REPLICA sync started

    30651:S 04 Jan 2019 20:12:43.834 * Non blocking connect for SYNC fired the event.

    30651:S 04 Jan 2019 20:12:54.856 # Timeout connecting to the MASTER...

    30651:S 04 Jan 2019 20:12:54.856 * Connecting to MASTER

    30651:S 04 Jan 2019 20:12:54.856 * MASTER <-> REPLICA sync started

    30651:S 04 Jan 2019 20:12:54.856 * Non blocking connect for SYNC fired the event.

    30651:S 04 Jan 2019 20:12:55.467 * FAIL message received from bfad383775421b1090eaa7e0b2dcfb3b38455079 about 44eb43e50c101c5f44f48295c42dda878b6cb3e9 // receives from another master the FAIL message about its own master

    30651:S 04 Jan 2019 20:12:55.467 # Cluster state changed: fail

    30651:S 04 Jan 2019 20:12:55.558 # Start of election delayed for 579 milliseconds (rank #0, offset 227360). // prepares to start the election after a 579 ms delay: 500 ms fixed delay plus 79 ms random delay; the rank is 0, so the rank delay is 0 ms

    30651:S 04 Jan 2019 20:12:56.160 # Starting a failover election for epoch 30. // starts the election

    30651:S 04 Jan 2019 20:12:56.180 # Failover election won: I'm the new master. // wins the election

    30651:S 04 Jan 2019 20:12:56.180 # configEpoch set to 30 after successful failover

    30651:M 04 Jan 2019 20:12:56.180 # Setting secondary replication ID to 154a9c2319403d610808477dcda3d4bede0f374c, valid up to offset: 227361. New replication ID is 927fb64a420236ee46d39389611ab2d8f6530b6a

    30651:M 04 Jan 2019 20:12:56.181 * Discarding previously cached master state.

    30651:M 04 Jan 2019 20:12:56.181 # Cluster state changed: ok

    30651:M 04 Jan 2019 20:12:56.708 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9 // ignores the message from 1.9.16.9:4077, which is not a cluster member

    7. Failover record 2


    7.1. Relevant parameters





    7.2. Timeline


    Master A marks the node as fail: 20:37:10.398

    Master B marks the node as fail: 20:37:10.398

    Master A votes: 20:37:11.084

    Master B votes: 20:37:11.085








    7.3. Log of another master

    master ID: c67dc9e02e25f2e6321df8ac2eb4d99789917783

    30613:M 04 Jan 2019 20:37:10.398 * Marking node 44eb43e50c101c5f44f48295c42dda878b6cb3e9 as failing (quorum reached).

    30613:M 04 Jan 2019 20:37:10.398 # Cluster state changed: fail

    30613:M 04 Jan 2019 20:37:11.084 # Failover auth granted to 0ae8b5400d566907a3d8b425d983ac3b7cbd8412 for epoch 32

    30613:M 04 Jan 2019 20:37:11.124 # Cluster state changed: ok

    30613:M 04 Jan 2019 20:37:17.560 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9

    7.4. Log of another master

    master ID: bfad383775421b1090eaa7e0b2dcfb3b38455079

    30614:M 04 Jan 2019 20:37:10.398 * Marking node 44eb43e50c101c5f44f48295c42dda878b6cb3e9 as failing (quorum reached).

    30614:M 04 Jan 2019 20:37:11.085 # Failover auth granted to 0ae8b5400d566907a3d8b425d983ac3b7cbd8412 for epoch 32

    30614:M 04 Jan 2019 20:37:17.560 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9

    7.5. Slave log

    The slave's master ID is 44eb43e50c101c5f44f48295c42dda878b6cb3e9, and the slave's own ID is 0ae8b5400d566907a3d8b425d983ac3b7cbd8412.

    30651:S 04 Jan 2019 20:37:10.398 * FAIL message received from c67dc9e02e25f2e6321df8ac2eb4d99789917783 about 44eb43e50c101c5f44f48295c42dda878b6cb3e9

    30651:S 04 Jan 2019 20:37:10.398 # Cluster state changed: fail

    30651:S 04 Jan 2019 20:37:10.475 # Start of election delayed for 539 milliseconds (rank #0, offset 228620).

    30651:S 04 Jan 2019 20:37:11.077 # Starting a failover election for epoch 32.

    30651:S 04 Jan 2019 20:37:11.100 # Failover election won: I'm the new master.

    30651:S 04 Jan 2019 20:37:11.100 # configEpoch set to 32 after successful failover

    30651:M 04 Jan 2019 20:37:11.100 # Setting secondary replication ID to 0cf19d01597610c7933b7ed67c999a631655eafc, valid up to offset: 228621. New replication ID is 53daa7fa265d982aebd3c18c07ed5f178fc3f70b

    30651:M 04 Jan 2019 20:37:11.101 # Connection with master lost.

    30651:M 04 Jan 2019 20:37:11.101 * Caching the disconnected master state.

    30651:M 04 Jan 2019 20:37:11.101 * Discarding previously cached master state.

    30651:M 04 Jan 2019 20:37:11.101 # Cluster state changed: ok

    30651:M 04 Jan 2019 20:37:17.560 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9

    8. Code for the slave's delayed election start

    // Excerpted from Redis-5.0.3

    // cluster.c

    /* This function is called if we are a slave node and our master serving
     * a non-zero amount of hash slots is in FAIL state.
     *
     * The goal of this function is:
     * 1) To check if we are able to perform a failover, is our data updated?
     * 2) Try to get elected by masters.
     * 3) Perform the failover informing all the other nodes.
     */
    void clusterHandleSlaveFailover(void) {
        /* ... */

        /* Check if our data is recent enough according to the slave validity
         * factor configured by the user.
         *
         * Check bypassed for manual failovers. */
        if (server.cluster_slave_validity_factor &&
            data_age >
            (((mstime_t)server.repl_ping_slave_period * 1000) +
             (server.cluster_node_timeout * server.cluster_slave_validity_factor)))
        {
            if (!manual_failover) {
                clusterLogCantFailover(CLUSTER_CANT_FAILOVER_DATA_AGE);
                return;
            }
        }

        /* If the previous failover attempt timedout and the retry time has
         * elapsed, we can setup a new one. */
        if (auth_age > auth_retry_time) {
            server.cluster->failover_auth_time = mstime() +
                500 + /* Fixed delay of 500 milliseconds, let FAIL msg propagate. */
                random() % 500; /* Random delay between 0 and 500 milliseconds. */
            server.cluster->failover_auth_count = 0;
            server.cluster->failover_auth_sent = 0;
            server.cluster->failover_auth_rank = clusterGetSlaveRank();
            /* We add another delay that is proportional to the slave rank.
             * Specifically 1 second * rank. This way slaves that have a probably
             * less updated replication offset, are penalized. */
            server.cluster->failover_auth_time +=
                server.cluster->failover_auth_rank * 1000;
            /* However if this is a manual failover, no delay is needed. */
            if (server.cluster->mf_end) {
                server.cluster->failover_auth_time = mstime();
                server.cluster->failover_auth_rank = 0;
            }
            serverLog(LL_WARNING,
                "Start of election delayed for %lld milliseconds "
                "(rank #%d, offset %lld).",
                server.cluster->failover_auth_time - mstime(),
                server.cluster->failover_auth_rank,
                replicationGetSlaveOffset());
            /* Now that we have a scheduled election, broadcast our offset
             * to all the other slaves so that they'll updated their offsets
             * if our offset is better. */
            clusterBroadcastPong(CLUSTER_BROADCAST_LOCAL_SLAVES);
            return;
        }

        /* ... */
    }