zoukankan      html  css  js  c++  java
  • Raft

    Typical Properties of Consensus Algorithms

      1. safety: produce correct results under all non-Byzantine conditions
                   include network delays/partitions, packet loss/duplication/reordering
      2. availability/liveness:
         if majority of peers are operational and can communicate with each other and with clients assume failures are all fail stopping
      3. safety should not depend on timing
        do not require timing to ensure consistency
        timing: faulty clocks or message delays
      4. performance can not be affected by a minority of slow servers


    Strengths and Weaknesses of Paxos
    Strengths
      1. It ensures safety and liveness
      2. Its correctness is formally proved
      3. It is efficient in the normal case (?)

    Weaknesses
      1. Difficult to understand
        thus difficult to implement / reason about / debug / optimize ...
      2. Does not provide a good foundation for building practical implementations
        no widely agreed-upon algo. for multi-paxos
        single-decree paxos -> multi-paxos is complex
        leader-based protocols is simpler and may be faster for making a series of decisions

    Brief Introduction to Raft
      A consensus protocol with strong leader
      Why leader?
        simplifies the management of log replication
        makes raft easier to understand
      Raft log
        orders commands
        stores tentative commands until committed
        stores commands in case leader must re-send to followers
        acts as history of commands, can be replayed after server reboot


    Leader Election
    State of a server:
      follower, candidate, leader

    State transision
     

    at most one leader can be elected in a given term

    Timing requirement for steady leaders
      broadcast time << election timeout << MTBF
      broadcast time: 10s of ms, may depend upon log persistence performance
      election timeout: 100s of ms (or 1s, 3s ...)
      MTBF: months

    How to choose election timeout
      * at least a few (10s of) heartbeat intervals (in case network drops a heartbeat)
        to avoid needless elections, which waste time and introduce availability issues
      * random part long enough to let one candidate succeed before next starts
      * short enough to react quickly to failure, avoid long pauses (availability issues)


    Log Replication

    Leader Append-only: leader never deletes or overwrites its log entries; it only append new entries
    Log Matching: if two logs contain an entry with same index and same term, all log entries up through the given index are identical
    Leader Completeness: : if a log entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms.


    Leader Election Resitriction

    1. new leader must contain all log entries committed

    example 1
      index: 1 2 3
          S1: 1 2 2 (leader)
          S2: 1 2 2
          S3: 1 1
      S3 is leader of term 1, is partitioned away from S1 and S2
      S1 broadcasts log entry with index 3, consider the following 3 cases:
        #1: S2 receives the log entry but the reply to S1 is lost, S1 crashes
        #2: S2 receives the log entry and replies, S1 commits the log entry and crashes
        #3: S2 receives the log entry and replies, S1 and S2 commit the log entry, S1 crashes
      After the partition heals, only S2 can be the leader.
      If S3 becomes leader, it is possible that commited log entries will be overwritten
        violates the Leader Completeness property

    example 2
      index: 1 2 3
          S1: 1
         S2: 1                            ->
         S3: 1 1 (leader)

      index: 1 2 3
        S1: 1 2
          S2: 1 2 2 (leader)          ->
              S3: 1 1

      index: 1 2 3
          S1: 1 2 3 (leader)
         S2: 1 2 2
         S3: 1 1

      S1 crashes and restarts, S1 and S2 can become leader of term 4, S3 can not.


    2. new leader's log must be at least as up-to-date as majority of servers
      up-to-date
        compare last entries. log has last entry with higher term is more up-to-date
        if term of last entries are same. log which is longer is more up-to-date


    Q: why not elect the server with the longest log as new leader?
    A:
      example:
        S1: 5 6 7
        S2: 5 8
        S3: 5 8
      first, could this scenario happen? how?
        S1 leader in term 6; crash+reboot; leader in term 7; crash and stay down
          both times it crashed after only appending to its own log
        Q: after S1 crashes in term 7, why won't S2/S3 choose 6 as next term?
        A: at least one of them votes for S1, and becomes follower in term 7
        next term will be 8, since at least one of S2/S3 learned of 7 while voting
        S2 leader in term 8, only S2+S3 alive, then crash
      all peers reboot
      who should be next leader?
        S1 has longest log, but entry 8 could have committed !!!
        so new leader can only be one of S2 or S3
        i.e. the rule cannot be simply "longest log"


    Raft Persistence
      why log?
        if a server was in leader's majority for committing an entry,
        must remember entry despite reboot, so any future leader is
        guaranteed to see the committed log entry
      why votedFor?
        to prevent a client from voting for one candidate, then reboot,
        then vote for a different candidate in the same (or older!) term
        could lead to two leaders for the same term
      why currentTerm?
        to ensure terms only increase, so each term has at most one leader
        to detect RPCs from stale leaders and candidates

    Q: What if we do not persist currentTerm and initialize currTerm of a server to term of its last log entry when it starts, give an example to show what could go wrong ?

    A: Assume there is a raft group of three servers S1, S2 and S3. S1's last log entry has term 10. S1 receives a VoteRequest for term 11 from S2, and answers "yes". Then S1 crashes, and restarts. S1 initializes currentTerm from the term in its last log entry, which is 10. Now S1 receives a VoteRequest from S3 for term 11. S1 will vote for S3 for term 11 even though it previously voted for S2. There are two leaders S2 and S3 in term 11, which would lead to different peers committing different commands at the same index.


    Q: What happens after all servers crash and restart at about the same time ?
    A: An example:
      S1, S2, S3
      1. S1, S2, S3 all start as follower in currentTerm of their own, say 3, 2, 1. they voteFor S1, S1, S3 respectively
      2. election timer elapses. one of them become candidate, say S2 becomes candidate of 2 + 1 = 3
      3. S2 request votes from S1 and S3, S1 rejects to vote because it has voted for itself in term 3, S3 grants the vote and becomes follower of term 3
      4. S2 becomes leader of term 3 and starts to broadcast heartbeats to S1 and S3

      We can induce the states of S1, S2, S3 before the crash
        S1 is candidate of term 3 and leader of term 2
        S2 is follower of term 2
        S3 is leader of term 1 and then is partitioned away from S1 and S2


    A special case needs to take care (Figure 8 in the Raft paper)

    index: 1 2 3
        S1: 1 2
        S2: 1                                          ->
        S3: 1 3 (leader of term 3)


    index: 1 2 3
        S1: 1 2 4 (leader of term 4)
        S2: 1 2
        S3: 1 3

    Although log entry at 2 is successflly replicated from S1 to S2, we can not simply commit the index. what if we do so, and S1 crashes, S3 will be leader of term 5, and will commit an entry with term 3 at index 2.

    Commtting different entries at same index violates safty guarantee of Raft !!!

    A leader should not commit log entries from older terms and it should only commits log entries of its own term. After S1 commits log entry at index 3 with term 4, log entry at index 2 is commited indirectly because of the Log Matching property.


    Protocol for Cluster Membership Changes
    [TBD]


    TiKV Case Study

    Solutions proposed in TiKV blog for fast linearizable reads over Raft (inspired by approaches mentioned in the Raft paper)

    ReadIndex Read
      when a read request comes in, the leader performs the following steps to serve the read:
        1. ReadIndex = commitIndex
        2. Ensure the leader is still leader by broadcasting heartbeats to peers
        3. Wait apply of the log entry at ReadIndex
        4. Handle the read request and reply to the client

    Lease Read
      Lease:
        Assume that start is the time when the leader starts to broadcast AppendEntries to followers
        after the leader get positive responses from majority, the leader holds a valid
        lease which lasts till start + minimum_election_timeout / clock_drift_bound.

      When the leader holds a vaild lease, it can perform read requests without writing anythong into the log

      clock skew has no effect on the correctness of this approach
      but clock that may jump back and forth does

      See https://pingcap.com/blog/lease-read/ for more details


    More on Replicated State Machine Protocols
      Paxos / Viewstamped Replcation / SMART / ZAB / Raft ...
      Single-decree paxos v.s. multi-decree paxos
      [TBD]  

      Master-based Paxos has similiar advantages as Raft
        1. The master is up-to-date and can serve reads of the current consensus state w/o network communication
        2. Writes can be reduced to a single round of communication by piggybacking prepares to accept messages
        3. The master can batch writes to improve throughput

      Downside of master-based/leader-based prototol
        1. Reading/writing must be done near the master to avoid accumulating latency from sequential requests.
        2. Any potential master must have adequate resources for the system’s full workload slave replicas waste resources until the moment they become master.
        3. Master failover can require a complicated state machine, and a series of timers must elapse before service is restored. It is difficult to avoid user-visible outages.

      What will go wrong if Zookeeper uses Raft instead of ZAB as its consensus protocol ?
        Zookeeper client submits requests in FIFO order and these requests must be committed in the same order. It can submit next request before preious request gets acknowledged.
        [TBD]


    References
      The Raft Paper: https://raft.github.io/raft.pdf
      A guide for implementing Raft: https://thesquareplanet.com/blog/students-guide-to-raft/
      TiKV Blog: https://pingcap.com/blog/lease-read/
      The Megastore Paper: http://cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf

  • 相关阅读:
    20210304
    20210303
    20210302
    20210210
    20210209
    20210208
    20210207
    例4-6
    例4-5
    例4-4
  • 原文地址:https://www.cnblogs.com/william-cheung/p/12682702.html
Copyright © 2011-2022 走看看