Typical Properties of Consensus Algorithms
1. safety: produce correct results under all non-Byzantine conditions
include network delays/partitions, packet loss/duplication/reordering
2. availability/liveness:
if majority of peers are operational and can communicate with each other and with clients assume failures are all fail stopping
3. safety should not depend on timing
do not require timing to ensure consistency
timing: faulty clocks or message delays
4. performance can not be affected by a minority of slow servers
Strengths and Weaknesses of Paxos
Strengths
1. It ensures safety and liveness
2. Its correctness is formally proved
3. It is efficient in the normal case (?)
Weaknesses
1. Difficult to understand
thus difficult to implement / reason about / debug / optimize ...
2. Does not provide a good foundation for building practical implementations
no widely agreed-upon algo. for multi-paxos
single-decree paxos -> multi-paxos is complex
leader-based protocols is simpler and may be faster for making a series of decisions
Brief Introduction to Raft
A consensus protocol with strong leader
Why leader?
simplifies the management of log replication
makes raft easier to understand
Raft log
orders commands
stores tentative commands until committed
stores commands in case leader must re-send to followers
acts as history of commands, can be replayed after server reboot
Leader Election
State of a server:
follower, candidate, leader
State transision
at most one leader can be elected in a given term
Timing requirement for steady leaders
broadcast time << election timeout << MTBF
broadcast time: 10s of ms, may depend upon log persistence performance
election timeout: 100s of ms (or 1s, 3s ...)
MTBF: months
How to choose election timeout
* at least a few (10s of) heartbeat intervals (in case network drops a heartbeat)
to avoid needless elections, which waste time and introduce availability issues
* random part long enough to let one candidate succeed before next starts
* short enough to react quickly to failure, avoid long pauses (availability issues)
Log Replication
Leader Append-only: leader never deletes or overwrites its log entries; it only append new entries
Log Matching: if two logs contain an entry with same index and same term, all log entries up through the given index are identical
Leader Completeness: : if a log entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms.
Leader Election Resitriction
1. new leader must contain all log entries committed
example 1
index: 1 2 3
S1: 1 2 2 (leader)
S2: 1 2 2
S3: 1 1
S3 is leader of term 1, is partitioned away from S1 and S2
S1 broadcasts log entry with index 3, consider the following 3 cases:
#1: S2 receives the log entry but the reply to S1 is lost, S1 crashes
#2: S2 receives the log entry and replies, S1 commits the log entry and crashes
#3: S2 receives the log entry and replies, S1 and S2 commit the log entry, S1 crashes
After the partition heals, only S2 can be the leader.
If S3 becomes leader, it is possible that commited log entries will be overwritten
violates the Leader Completeness property
example 2
index: 1 2 3
S1: 1
S2: 1 ->
S3: 1 1 (leader)
index: 1 2 3
S1: 1 2
S2: 1 2 2 (leader) ->
S3: 1 1
index: 1 2 3
S1: 1 2 3 (leader)
S2: 1 2 2
S3: 1 1
S1 crashes and restarts, S1 and S2 can become leader of term 4, S3 can not.
2. new leader's log must be at least as up-to-date as majority of servers
up-to-date
compare last entries. log has last entry with higher term is more up-to-date
if term of last entries are same. log which is longer is more up-to-date
Q: why not elect the server with the longest log as new leader?
A:
example:
S1: 5 6 7
S2: 5 8
S3: 5 8
first, could this scenario happen? how?
S1 leader in term 6; crash+reboot; leader in term 7; crash and stay down
both times it crashed after only appending to its own log
Q: after S1 crashes in term 7, why won't S2/S3 choose 6 as next term?
A: at least one of them votes for S1, and becomes follower in term 7
next term will be 8, since at least one of S2/S3 learned of 7 while voting
S2 leader in term 8, only S2+S3 alive, then crash
all peers reboot
who should be next leader?
S1 has longest log, but entry 8 could have committed !!!
so new leader can only be one of S2 or S3
i.e. the rule cannot be simply "longest log"
Raft Persistence
why log?
if a server was in leader's majority for committing an entry,
must remember entry despite reboot, so any future leader is
guaranteed to see the committed log entry
why votedFor?
to prevent a client from voting for one candidate, then reboot,
then vote for a different candidate in the same (or older!) term
could lead to two leaders for the same term
why currentTerm?
to ensure terms only increase, so each term has at most one leader
to detect RPCs from stale leaders and candidates
Q: What if we do not persist currentTerm and initialize currTerm of a server to term of its last log entry when it starts, give an example to show what could go wrong ?
A: Assume there is a raft group of three servers S1, S2 and S3. S1's last log entry has term 10. S1 receives a VoteRequest for term 11 from S2, and answers "yes". Then S1 crashes, and restarts. S1 initializes currentTerm from the term in its last log entry, which is 10. Now S1 receives a VoteRequest from S3 for term 11. S1 will vote for S3 for term 11 even though it previously voted for S2. There are two leaders S2 and S3 in term 11, which would lead to different peers committing different commands at the same index.
Q: What happens after all servers crash and restart at about the same time ?
A: An example:
S1, S2, S3
1. S1, S2, S3 all start as follower in currentTerm of their own, say 3, 2, 1. they voteFor S1, S1, S3 respectively
2. election timer elapses. one of them become candidate, say S2 becomes candidate of 2 + 1 = 3
3. S2 request votes from S1 and S3, S1 rejects to vote because it has voted for itself in term 3, S3 grants the vote and becomes follower of term 3
4. S2 becomes leader of term 3 and starts to broadcast heartbeats to S1 and S3
We can induce the states of S1, S2, S3 before the crash
S1 is candidate of term 3 and leader of term 2
S2 is follower of term 2
S3 is leader of term 1 and then is partitioned away from S1 and S2
A special case needs to take care (Figure 8 in the Raft paper)
index: 1 2 3
S1: 1 2
S2: 1 ->
S3: 1 3 (leader of term 3)
index: 1 2 3
S1: 1 2 4 (leader of term 4)
S2: 1 2
S3: 1 3
Although log entry at 2 is successflly replicated from S1 to S2, we can not simply commit the index. what if we do so, and S1 crashes, S3 will be leader of term 5, and will commit an entry with term 3 at index 2.
Commtting different entries at same index violates safty guarantee of Raft !!!
A leader should not commit log entries from older terms and it should only commits log entries of its own term. After S1 commits log entry at index 3 with term 4, log entry at index 2 is commited indirectly because of the Log Matching property.
Protocol for Cluster Membership Changes
[TBD]
TiKV Case Study
Solutions proposed in TiKV blog for fast linearizable reads over Raft (inspired by approaches mentioned in the Raft paper)
ReadIndex Read
when a read request comes in, the leader performs the following steps to serve the read:
1. ReadIndex = commitIndex
2. Ensure the leader is still leader by broadcasting heartbeats to peers
3. Wait apply of the log entry at ReadIndex
4. Handle the read request and reply to the client
Lease Read
Lease:
Assume that start is the time when the leader starts to broadcast AppendEntries to followers
after the leader get positive responses from majority, the leader holds a valid
lease which lasts till start + minimum_election_timeout / clock_drift_bound.
When the leader holds a vaild lease, it can perform read requests without writing anythong into the log
clock skew has no effect on the correctness of this approach
but clock that may jump back and forth does
See https://pingcap.com/blog/lease-read/ for more details
More on Replicated State Machine Protocols
Paxos / Viewstamped Replcation / SMART / ZAB / Raft ...
Single-decree paxos v.s. multi-decree paxos
[TBD]
Master-based Paxos has similiar advantages as Raft
1. The master is up-to-date and can serve reads of the current consensus state w/o network communication
2. Writes can be reduced to a single round of communication by piggybacking prepares to accept messages
3. The master can batch writes to improve throughput
Downside of master-based/leader-based prototol
1. Reading/writing must be done near the master to avoid accumulating latency from sequential requests.
2. Any potential master must have adequate resources for the system’s full workload slave replicas waste resources until the moment they become master.
3. Master failover can require a complicated state machine, and a series of timers must elapse before service is restored. It is difficult to avoid user-visible outages.
What will go wrong if Zookeeper uses Raft instead of ZAB as its consensus protocol ?
Zookeeper client submits requests in FIFO order and these requests must be committed in the same order. It can submit next request before preious request gets acknowledged.
[TBD]
References
The Raft Paper: https://raft.github.io/raft.pdf
A guide for implementing Raft: https://thesquareplanet.com/blog/students-guide-to-raft/
TiKV Blog: https://pingcap.com/blog/lease-read/
The Megastore Paper: http://cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf