zoukankan      html  css  js  c++  java
  • 8. SOFAJRaft源码分析— JRaft是如何实现日志复制的?

    前言

    前几天和腾讯的大佬一起吃饭聊天,说起我对SOFAJRaft的理解,我自然以为我是很懂了的,但是大佬问起了我那SOFAJRaft集群之间的日志是怎么复制的?
    我当时哑口无言,说不出是怎么实现的,所以这次来分析一下SOFAJRaft中日志复制是怎么做的。

    Leader发送探针获取Follower的LastLogIndex

    Leader 节点在通过 Replicator 和 Follower 建立连接之后,要发送一个 Probe 类型的探针请求,目的是知道 Follower 已经拥有的的日志位置,以便于向 Follower 发送后续的日志。

    大致的流程如下:

    NodeImpl#becomeLeader->replicatorGroup#addReplicator->Replicator#start->Replicator#sendEmptyEntries
    

    最后会通过调用Replicator的sendEmptyEntries方法来发送探针来获取Follower的LastLogIndex

    Replicator#sendEmptyEntries

    private void sendEmptyEntries(final boolean isHeartbeat,
                                  final RpcResponseClosure<AppendEntriesResponse> heartBeatClosure) {
        final AppendEntriesRequest.Builder rb = AppendEntriesRequest.newBuilder();
        //将集群配置设置到rb中,例如Term,GroupId,ServerId等
        if (!fillCommonFields(rb, this.nextIndex - 1, isHeartbeat)) {
            // id is unlock in installSnapshot
            installSnapshot();
            if (isHeartbeat && heartBeatClosure != null) {
                Utils.runClosureInThread(heartBeatClosure, new Status(RaftError.EAGAIN,
                    "Fail to send heartbeat to peer %s", this.options.getPeerId()));
            }
            return;
        }
        try {
            final long monotonicSendTimeMs = Utils.monotonicMs();
            final AppendEntriesRequest request = rb.build();
    
            if (isHeartbeat) {
                ....//省略心跳代码
            } else {
                //statInfo这个类没看到哪里有用到,
                // Sending a probe request.
                //leader发送探针获取Follower的LastLogIndex
                this.statInfo.runningState = RunningState.APPENDING_ENTRIES;
                //将lastLogIndex设置为比firstLogIndex小1
                this.statInfo.firstLogIndex = this.nextIndex;
                this.statInfo.lastLogIndex = this.nextIndex - 1;
                this.appendEntriesCounter++;
                //设置当前Replicator为发送探针
                this.state = State.Probe;
                final int stateVersion = this.version;
                //返回reqSeq,并将reqSeq加一
                final int seq = getAndIncrementReqSeq();
                final Future<Message> rpcFuture = this.rpcService.appendEntries(this.options.getPeerId().getEndpoint(),
                    request, -1, new RpcResponseClosureAdapter<AppendEntriesResponse>() {
    
                        @Override
                        public void run(final Status status) {
                            onRpcReturned(Replicator.this.id, RequestType.AppendEntries, status, request,
                                getResponse(), seq, stateVersion, monotonicSendTimeMs);
                        }
    
                    });
                //Inflight 是对批量发送出去的 logEntry 的一种抽象,他表示哪些 logEntry 已经被封装成日志复制 request 发送出去了
                //这里是将logEntry封装到Inflight中
                addInflight(RequestType.AppendEntries, this.nextIndex, 0, 0, seq, rpcFuture);
            }
            LOG.debug("Node {} send HeartbeatRequest to {} term {} lastCommittedIndex {}", this.options.getNode()
                .getNodeId(), this.options.getPeerId(), this.options.getTerm(), request.getCommittedIndex());
        } finally {
            this.id.unlock();
        }
    }
    

    在调用sendEmptyEntries方法的时候,会传入isHeartbeat为false和heartBeatClosure为null,因为我们这个方法主要是发送探针获取Follower的位移。
    首先调用fillCommonFields方法,将任期,groupId,ServerId,PeerIdLogIndex等设置到rb中,如:

    private boolean fillCommonFields(final AppendEntriesRequest.Builder rb, long prevLogIndex, final boolean isHeartbeat) {
        final long prevLogTerm = this.options.getLogManager().getTerm(prevLogIndex);
        ....
        rb.setTerm(this.options.getTerm());
        rb.setGroupId(this.options.getGroupId());
        rb.setServerId(this.options.getServerId().toString());
        rb.setPeerId(this.options.getPeerId().toString());
        rb.setPrevLogIndex(prevLogIndex);
        rb.setPrevLogTerm(prevLogTerm);
        rb.setCommittedIndex(this.options.getBallotBox().getLastCommittedIndex());
        return true;
    }
    

    注意prevLogIndex是nextIndex-1,表示当前的index
    继续往下走,会设置statInfo实例里面的属性,但是statInfo这个对象我没看到哪里有用到过。
    然后向该Follower发送一个AppendEntriesRequest请求,onRpcReturned负责响应请求。
    发送完请求后调用addInflight初始化一个Inflight实例,加入到inflights集合中,如下:

    private void addInflight(final RequestType reqType, final long startIndex, final int count, final int size,
                             final int seq, final Future<Message> rpcInfly) {
        this.rpcInFly = new Inflight(reqType, startIndex, count, size, seq, rpcInfly);
        this.inflights.add(this.rpcInFly);
        this.nodeMetrics.recordSize("replicate-inflights-count", this.inflights.size());
    }
    

    Inflight 是对批量发送出去的 logEntry 的一种抽象,他表示哪些 logEntry 已经被封装成日志复制 request 发送出去了,这里是将logEntry封装到Inflight中。

    Leader批量的发送日志给Follower

    Replicator#sendEntries

    private boolean sendEntries(final long nextSendingIndex) {
        final AppendEntriesRequest.Builder rb = AppendEntriesRequest.newBuilder();
        //填写当前Replicator的配置信息到rb中
        if (!fillCommonFields(rb, nextSendingIndex - 1, false)) {
            // unlock id in installSnapshot
            installSnapshot();
            return false;
        }
    
        ByteBufferCollector dataBuf = null;
        //获取最大的size为1024
        final int maxEntriesSize = this.raftOptions.getMaxEntriesSize();
    
        //这里使用了类似对象池的技术,避免重复创建对象
        final RecyclableByteBufferList byteBufList = RecyclableByteBufferList.newInstance();
        try {
            //循环遍历出所有的logEntry封装到byteBufList和emb中
            for (int i = 0; i < maxEntriesSize; i++) {
                final RaftOutter.EntryMeta.Builder emb = RaftOutter.EntryMeta.newBuilder();
                //nextSendingIndex代表下一个要发送的index,i代表偏移量
                if (!prepareEntry(nextSendingIndex, i, emb, byteBufList)) {
                    break;
                }
                rb.addEntries(emb.build());
            }
            //如果EntriesCount为0的话,说明LogManager里暂时没有新数据
            if (rb.getEntriesCount() == 0) {
                if (nextSendingIndex < this.options.getLogManager().getFirstLogIndex()) {
                    installSnapshot();
                    return false;
                }
                // _id is unlock in _wait_more
                waitMoreEntries(nextSendingIndex);
                return false;
            }
            //将byteBufList里面的数据放入到rb中
            if (byteBufList.getCapacity() > 0) {
                dataBuf = ByteBufferCollector.allocateByRecyclers(byteBufList.getCapacity());
                for (final ByteBuffer b : byteBufList) {
                    dataBuf.put(b);
                }
                final ByteBuffer buf = dataBuf.getBuffer();
                buf.flip();
                rb.setData(ZeroByteStringHelper.wrap(buf));
            }
        } finally {
            //回收一下byteBufList
            RecycleUtil.recycle(byteBufList);
        }
    
        final AppendEntriesRequest request = rb.build();
        if (LOG.isDebugEnabled()) {
            LOG.debug(
                "Node {} send AppendEntriesRequest to {} term {} lastCommittedIndex {} prevLogIndex {} prevLogTerm {} logIndex {} count {}",
                this.options.getNode().getNodeId(), this.options.getPeerId(), this.options.getTerm(),
                request.getCommittedIndex(), request.getPrevLogIndex(), request.getPrevLogTerm(), nextSendingIndex,
                request.getEntriesCount());
        }
        //statInfo没找到哪里有用到过
        this.statInfo.runningState = RunningState.APPENDING_ENTRIES;
        this.statInfo.firstLogIndex = rb.getPrevLogIndex() + 1;
        this.statInfo.lastLogIndex = rb.getPrevLogIndex() + rb.getEntriesCount();
    
        final Recyclable recyclable = dataBuf;
        final int v = this.version;
        final long monotonicSendTimeMs = Utils.monotonicMs();
        final int seq = getAndIncrementReqSeq();
        final Future<Message> rpcFuture = this.rpcService.appendEntries(this.options.getPeerId().getEndpoint(),
            request, -1, new RpcResponseClosureAdapter<AppendEntriesResponse>() {
    
                @Override
                public void run(final Status status) {
                    //回收资源
                    RecycleUtil.recycle(recyclable);
                    onRpcReturned(Replicator.this.id, RequestType.AppendEntries, status, request, getResponse(), seq,
                        v, monotonicSendTimeMs);
                }
    
            });
        //添加Inflight
        addInflight(RequestType.AppendEntries, nextSendingIndex, request.getEntriesCount(), request.getData().size(),
            seq, rpcFuture);
        return true;
    
    }
    
    1. 首先会调用fillCommonFields方法,填写当前Replicator的配置信息到rb中;
    2. 调用prepareEntry,根据当前的I和nextSendingIndex计算出当前的偏移量,然后去LogManager找到对应的LogEntry,再把LogEntry里面的属性设置到emb中,并把LogEntry里面的数据加入到RecyclableByteBufferList中;
    3. 如果LogEntry里面没有新的数据,那么EntriesCount会为0,那么就返回;
    4. 遍历byteBufList里面的数据,将数据添加到rb中,这样rb里面的数据就是前面是任期、类型、数据长度等信息,rb后面就是真正的数据;
    5. 新建AppendEntriesRequest实例发送请求;
    6. 添加 Inflight 到队列中。Leader 维护一个 queue,每发出一批 logEntry 就向 queue 中 添加一个代表这一批 logEntry 的 Inflight,这样当它知道某一批 logEntry 复制失败之后,就可以依赖 queue 中的 Inflight 把该批次 logEntry 以及后续的所有日志重新复制给 follower。既保证日志复制能够完成,又保证了复制日志的顺序不变

    其中RecyclableByteBufferList采用对象池进行实例化,对象池的相关信息可以看我这篇:7. SOFAJRaft源码分析—如何实现一个轻量级的对象池?

    下面我们详解一下sendEntries里面的具体方法。

    prepareEntry填充emb属性

    Replicator#prepareEntry

    boolean prepareEntry(final long nextSendingIndex, final int offset, final RaftOutter.EntryMeta.Builder emb,
                         final RecyclableByteBufferList dateBuffer) {
        if (dateBuffer.getCapacity() >= this.raftOptions.getMaxBodySize()) {
            return false;
        }
        //设置当前要发送的index
        final long logIndex = nextSendingIndex + offset;
        //如果这个index已经在LogManager中找不到了,那么直接返回
        final LogEntry entry = this.options.getLogManager().getEntry(logIndex);
        if (entry == null) {
            return false;
        }
        //下面就是把LogEntry里面的属性设置到emb中
        emb.setTerm(entry.getId().getTerm());
        if (entry.hasChecksum()) {
            emb.setChecksum(entry.getChecksum()); //since 1.2.6
        }
        emb.setType(entry.getType());
        if (entry.getPeers() != null) {
            Requires.requireTrue(!entry.getPeers().isEmpty(), "Empty peers at logIndex=%d", logIndex);
            for (final PeerId peer : entry.getPeers()) {
                emb.addPeers(peer.toString());
            }
            if (entry.getOldPeers() != null) {
                for (final PeerId peer : entry.getOldPeers()) {
                    emb.addOldPeers(peer.toString());
                }
            }
        } else {
            Requires.requireTrue(entry.getType() != EnumOutter.EntryType.ENTRY_TYPE_CONFIGURATION,
                "Empty peers but is ENTRY_TYPE_CONFIGURATION type at logIndex=%d", logIndex);
        }
        final int remaining = entry.getData() != null ? entry.getData().remaining() : 0;
        emb.setDataLen(remaining);
        //把LogEntry里面的数据放入到dateBuffer中
        if (entry.getData() != null) {
            // should slice entry data
            dateBuffer.add(entry.getData().slice());
        }
        return true;
    }
    
    1. 对比一下传入的dateBuffer的容量是否已经超过了系统设置的容量(512 * 1024),如果超过了则返回false
    2. 根据给定的起始的index和偏移量offset计算logIndex,然后去LogManager里面根据index获取LogEntry,如果返回的为则说明找不到了,那么就直接返回false,外层的if判断会执行break跳出循环
    3. 然后将LogEntry里面的属性设置到emb对象中,最后将LogEntry里面的数据添加到dateBuffer,这里要做到数据和属性分离

    Follower处理Leader发送的日志复制请求

    在leader发送完AppendEntriesRequest请求之后,请求的数据会在Follower中被AppendEntriesRequestProcessor所处理

    具体的处理方法是processRequest0

    public Message processRequest0(final RaftServerService service, final AppendEntriesRequest request,
                                   final RpcRequestClosure done) {
    
        final Node node = (Node) service;
    
        //默认使用pipeline
        if (node.getRaftOptions().isReplicatorPipeline()) {
            final String groupId = request.getGroupId();
            final String peerId = request.getPeerId();
            //获取请求的次数,以groupId+peerId为一个维度
            final int reqSequence = getAndIncrementSequence(groupId, peerId, done.getBizContext().getConnection());
            //Follower处理leader发过来的日志请求
            final Message response = service.handleAppendEntriesRequest(request, new SequenceRpcRequestClosure(done,
                reqSequence, groupId, peerId));
            //正常的数据只返回null,异常的数据会返回response
            if (response != null) {
                sendSequenceResponse(groupId, peerId, reqSequence, done.getAsyncContext(), done.getBizContext(),
                    response);
            }
            return null;
        } else {
            return service.handleAppendEntriesRequest(request, done);
        }
    }
    

    调用service的handleAppendEntriesRequest会调用到NodeIml的handleAppendEntriesRequest方法中,handleAppendEntriesRequest方法只是异常情况和leader没有发送数据时才会返回,正常情况是返回null

    处理响应日志复制请求

    NodeIml#handleAppendEntriesRequest

    public Message handleAppendEntriesRequest(final AppendEntriesRequest request, final RpcRequestClosure done) {
        boolean doUnlock = true;
        final long startMs = Utils.monotonicMs();
        this.writeLock.lock();
        //获取entryLog个数
        final int entriesCount = request.getEntriesCount();
        try {
            //校验当前节点是否活跃
            if (!this.state.isActive()) {
                LOG.warn("Node {} is not in active state, currTerm={}.", getNodeId(), this.currTerm);
                return RpcResponseFactory.newResponse(RaftError.EINVAL, "Node %s is not in active state, state %s.",
                    getNodeId(), this.state.name());
            }
            //校验传入的serverId是否能被正常解析
            final PeerId serverId = new PeerId();
            if (!serverId.parse(request.getServerId())) {
                LOG.warn("Node {} received AppendEntriesRequest from {} serverId bad format.", getNodeId(),
                    request.getServerId());
                return RpcResponseFactory.newResponse(RaftError.EINVAL, "Parse serverId failed: %s.",
                    request.getServerId());
            }
            //校验任期
            // Check stale term
            if (request.getTerm() < this.currTerm) {
                LOG.warn("Node {} ignore stale AppendEntriesRequest from {}, term={}, currTerm={}.", getNodeId(),
                    request.getServerId(), request.getTerm(), this.currTerm);
                return AppendEntriesResponse.newBuilder() //
                    .setSuccess(false) //
                    .setTerm(this.currTerm) //
                    .build();
            }
    
            // Check term and state to step down
            //当前节点如果不是Follower节点的话要执行StepDown操作
            checkStepDown(request.getTerm(), serverId);
            //这说明请求的节点不是当前节点的leader
            if (!serverId.equals(this.leaderId)) {
                LOG.error("Another peer {} declares that it is the leader at term {} which was occupied by leader {}.",
                    serverId, this.currTerm, this.leaderId);
                // Increase the term by 1 and make both leaders step down to minimize the
                // loss of split brain
                stepDown(request.getTerm() + 1, false, new Status(RaftError.ELEADERCONFLICT,
                    "More than one leader in the same term."));
                return AppendEntriesResponse.newBuilder() //
                    .setSuccess(false) //
                    .setTerm(request.getTerm() + 1) //
                    .build();
            }
    
            updateLastLeaderTimestamp(Utils.monotonicMs());
    
            //校验是否正在生成快照
            if (entriesCount > 0 && this.snapshotExecutor != null && this.snapshotExecutor.isInstallingSnapshot()) {
                LOG.warn("Node {} received AppendEntriesRequest while installing snapshot.", getNodeId());
                return RpcResponseFactory.newResponse(RaftError.EBUSY, "Node %s:%s is installing snapshot.",
                    this.groupId, this.serverId);
            }
            //传入的是发起请求节点的nextIndex-1
            final long prevLogIndex = request.getPrevLogIndex();
            final long prevLogTerm = request.getPrevLogTerm();
            final long localPrevLogTerm = this.logManager.getTerm(prevLogIndex);
            //发起请求的节点prevLogIndex对应的任期和当前节点的index所对应的任期不匹配
            if (localPrevLogTerm != prevLogTerm) {
                final long lastLogIndex = this.logManager.getLastLogIndex();
    
                LOG.warn(
                    "Node {} reject term_unmatched AppendEntriesRequest from {}, term={}, prevLogIndex={}, prevLogTerm={}, localPrevLogTerm={}, lastLogIndex={}, entriesSize={}.",
                    getNodeId(), request.getServerId(), request.getTerm(), prevLogIndex, prevLogTerm, localPrevLogTerm,
                    lastLogIndex, entriesCount);
    
                return AppendEntriesResponse.newBuilder() //
                    .setSuccess(false) //
                    .setTerm(this.currTerm) //
                    .setLastLogIndex(lastLogIndex) //
                    .build();
            }
            //响应心跳或者发送的是sendEmptyEntry
            if (entriesCount == 0) {
                // heartbeat
                final AppendEntriesResponse.Builder respBuilder = AppendEntriesResponse.newBuilder() //
                    .setSuccess(true) //
                    .setTerm(this.currTerm)
                    //  返回当前节点的最新的index
                    .setLastLogIndex(this.logManager.getLastLogIndex());
                doUnlock = false;
                this.writeLock.unlock();
                // see the comments at FollowerStableClosure#run()
                this.ballotBox.setLastCommittedIndex(Math.min(request.getCommittedIndex(), prevLogIndex));
                return respBuilder.build();
            }
    
            // Parse request
            long index = prevLogIndex;
            final List<LogEntry> entries = new ArrayList<>(entriesCount);
            ByteBuffer allData = null;
            if (request.hasData()) {
                allData = request.getData().asReadOnlyByteBuffer();
            }
            //获取所有数据
            final List<RaftOutter.EntryMeta> entriesList = request.getEntriesList();
            for (int i = 0; i < entriesCount; i++) {
                final RaftOutter.EntryMeta entry = entriesList.get(i);
                index++;
                if (entry.getType() != EnumOutter.EntryType.ENTRY_TYPE_UNKNOWN) {
                    //给logEntry属性设值
                    final LogEntry logEntry = new LogEntry();
                    logEntry.setId(new LogId(index, entry.getTerm()));
                    logEntry.setType(entry.getType());
                    if (entry.hasChecksum()) {
                        logEntry.setChecksum(entry.getChecksum()); // since 1.2.6
                    }
                    //将数据填充到logEntry
                    final long dataLen = entry.getDataLen();
                    if (dataLen > 0) {
                        final byte[] bs = new byte[(int) dataLen];
                        assert allData != null;
                        allData.get(bs, 0, bs.length);
                        logEntry.setData(ByteBuffer.wrap(bs));
                    }
    
                    if (entry.getPeersCount() > 0) {
                        //只有配置类型的entry才有多个Peer
                        if (entry.getType() != EnumOutter.EntryType.ENTRY_TYPE_CONFIGURATION) {
                            throw new IllegalStateException(
                                    "Invalid log entry that contains peers but is not ENTRY_TYPE_CONFIGURATION type: "
                                            + entry.getType());
                        }
    
                        final List<PeerId> peers = new ArrayList<>(entry.getPeersCount());
                        for (final String peerStr : entry.getPeersList()) {
                            final PeerId peer = new PeerId();
                            peer.parse(peerStr);
                            peers.add(peer);
                        }
                        logEntry.setPeers(peers);
    
                        if (entry.getOldPeersCount() > 0) {
                            final List<PeerId> oldPeers = new ArrayList<>(entry.getOldPeersCount());
                            for (final String peerStr : entry.getOldPeersList()) {
                                final PeerId peer = new PeerId();
                                peer.parse(peerStr);
                                oldPeers.add(peer);
                            }
                            logEntry.setOldPeers(oldPeers);
                        }
                    } else if (entry.getType() == EnumOutter.EntryType.ENTRY_TYPE_CONFIGURATION) {
                        throw new IllegalStateException(
                                "Invalid log entry that contains zero peers but is ENTRY_TYPE_CONFIGURATION type");
                    }
    
                    // Validate checksum
                    if (this.raftOptions.isEnableLogEntryChecksum() && logEntry.isCorrupted()) {
                        long realChecksum = logEntry.checksum();
                        LOG.error(
                                "Corrupted log entry received from leader, index={}, term={}, expectedChecksum={}, " +
                                 "realChecksum={}",
                                logEntry.getId().getIndex(), logEntry.getId().getTerm(), logEntry.getChecksum(),
                                realChecksum);
                        return RpcResponseFactory.newResponse(RaftError.EINVAL,
                                "The log entry is corrupted, index=%d, term=%d, expectedChecksum=%d, realChecksum=%d",
                                logEntry.getId().getIndex(), logEntry.getId().getTerm(), logEntry.getChecksum(),
                                realChecksum);
                    }
    
                    entries.add(logEntry);
                }
            }
            //存储日志,并回调返回response
            final FollowerStableClosure closure = new FollowerStableClosure(request, AppendEntriesResponse.newBuilder()
                .setTerm(this.currTerm), this, done, this.currTerm);
            this.logManager.appendEntries(entries, closure);
            // update configuration after _log_manager updated its memory status
            this.conf = this.logManager.checkAndSetConfiguration(this.conf);
            return null;
        } finally {
            if (doUnlock) {
                this.writeLock.unlock();
            }
            this.metrics.recordLatency("handle-append-entries", Utils.monotonicMs() - startMs);
            this.metrics.recordSize("handle-append-entries-count", entriesCount);
        }
    }
    

    handleAppendEntriesRequest方法写的很长,但是实际上做了很多校验的事情,具体的处理逻辑不多

    1. 校验当前的Node节点是否还处于活跃状态,如果不是的话,那么直接返回一个error的response
    2. 校验请求的serverId的格式是否正确,不正确则返回一个error的response
    3. 校验请求的任期是否小于当前的任期,如果是那么返回一个AppendEntriesResponse类型的response
    4. 调用checkStepDown方法检测当前节点的任期,以及状态,是否有leader等
    5. 如果请求的serverId和当前节点的leaderId是不是同一个,用来校验是不是leader发起的请求,如果不是返回一个AppendEntriesResponse
    6. 校验是否正在生成快照
    7. 获取请求的Index在当前节点中对应的LogEntry的任期是不是和请求传入的任期相同,不同的话则返回AppendEntriesResponse
    8. 如果传入的entriesCount为零,那么leader发送的可能是心跳或者发送的是sendEmptyEntry,返回AppendEntriesResponse,并将当前任期和最新index封装返回
    9. 请求的数据不为空,那么遍历所有的数据
    10. 实例化一个logEntry,并且将数据和属性设置到logEntry实例中,最后将logEntry放入到entries集合中
    11. 调用logManager将数据批量提交日志写入 RocksDB

    发送响应给leader

    最终发送给leader的响应是通过AppendEntriesRequestProcessor的sendSequenceResponse来发送的

    void sendSequenceResponse(final String groupId, final String peerId, final int seq,
                              final AsyncContext asyncContext, final BizContext bizContext, final Message msg) {
        final Connection connection = bizContext.getConnection();
        //获取context,维度是groupId和peerId
        final PeerRequestContext ctx = getPeerRequestContext(groupId, peerId, connection);
        final PriorityQueue<SequenceMessage> respQueue = ctx.responseQueue;
        assert (respQueue != null);
    
        synchronized (Utils.withLockObject(respQueue)) {
            //将要响应的数据放入到优先队列中
            respQueue.add(new SequenceMessage(asyncContext, msg, seq));
            //校验队列里面的数据是否超过了256
            if (!ctx.hasTooManyPendingResponses()) {
                while (!respQueue.isEmpty()) {
                    final SequenceMessage queuedPipelinedResponse = respQueue.peek();
                    //如果序列对应不上,那么就不发送响应
                    if (queuedPipelinedResponse.sequence != getNextRequiredSequence(groupId, peerId, connection)) {
                        // sequence mismatch, waiting for next response.
                        break;
                    }
                    respQueue.remove();
                    try {
                        //发送响应
                        queuedPipelinedResponse.sendResponse();
                    } finally {
                        //序列加一
                        getAndIncrementNextRequiredSequence(groupId, peerId, connection);
                    }
                }
            } else {
                LOG.warn("Closed connection to peer {}/{}, because of too many pending responses, queued={}, max={}",
                    ctx.groupId, peerId, respQueue.size(), ctx.maxPendingResponses);
                connection.close();
                // Close the connection if there are too many pending responses in queue.
                removePeerRequestContext(groupId, peerId);
            }
        }
    }
    

    这个方法会将要发送的数据依次压入到PriorityQueue优先队列中进行排序,然后获取序列号最小的元素和nextRequiredSequence比较,如果不相等,那么则是出现了乱序的情况,那么就不发送请求

    Leader处理日志复制的Response

    Leader收到Follower发过来的Response响应之后会调用Replicator的onRpcReturned方法

    static void onRpcReturned(final ThreadId id, final RequestType reqType, final Status status, final Message request,
                              final Message response, final int seq, final int stateVersion, final long rpcSendTime) {
        if (id == null) {
            return;
        }
        final long startTimeMs = Utils.nowMs();
        Replicator r;
        if ((r = (Replicator) id.lock()) == null) {
            return;
        }
        //检查版本号,因为每次resetInflights都会让version加一,所以检查一下
        if (stateVersion != r.version) {
            LOG.debug(
                "Replicator {} ignored old version response {}, current version is {}, request is {}
    , and response is {}
    , status is {}.",
                r, stateVersion, r.version, request, response, status);
            id.unlock();
            return;
        }
        //使用优先队列按seq排序,最小的会在第一个
        final PriorityQueue<RpcResponse> holdingQueue = r.pendingResponses;
        //这里用一个优先队列是因为响应是异步的,seq小的可能响应比seq大慢
        holdingQueue.add(new RpcResponse(reqType, seq, status, request, response, rpcSendTime));
        //默认holdingQueue队列里面的数量不能超过256
        if (holdingQueue.size() > r.raftOptions.getMaxReplicatorInflightMsgs()) {
            LOG.warn("Too many pending responses {} for replicator {}, maxReplicatorInflightMsgs={}",
                holdingQueue.size(), r.options.getPeerId(), r.raftOptions.getMaxReplicatorInflightMsgs());
            //重新发送探针
            //清空数据
            r.resetInflights();
            r.state = State.Probe;
            r.sendEmptyEntries(false);
            return;
        }
    
        boolean continueSendEntries = false;
    
        final boolean isLogDebugEnabled = LOG.isDebugEnabled();
        StringBuilder sb = null;
        if (isLogDebugEnabled) {
            sb = new StringBuilder("Replicator ").append(r).append(" is processing RPC responses,");
        }
        try {
            int processed = 0;
            while (!holdingQueue.isEmpty()) {
                //取出holdingQueue里seq最小的数据
                final RpcResponse queuedPipelinedResponse = holdingQueue.peek();
    
                //如果Follower没有响应的话就会出现次序对不上的情况,那么就不往下走了
                //sequence mismatch, waiting for next response.
                if (queuedPipelinedResponse.seq != r.requiredNextSeq) {
                    // 如果之前存在处理,则到此直接break循环
                    if (processed > 0) {
                        if (isLogDebugEnabled) {
                            sb.append("has processed ").append(processed).append(" responses,");
                        }
                        break;
                    } else {
                        //Do not processed any responses, UNLOCK id and return.
                        continueSendEntries = false;
                        id.unlock();
                        return;
                    }
                }
                //走到这里说明seq对的上,那么就移除优先队列里面seq最小的数据
                holdingQueue.remove();
                processed++;
                //获取inflights队列里的第一个元素
                final Inflight inflight = r.pollInflight();
                //发起一个请求的时候会将inflight放入到队列中
                //如果为空,那么就忽略
                if (inflight == null) {
                    // The previous in-flight requests were cleared.
                    if (isLogDebugEnabled) {
                        sb.append("ignore response because request not found:").append(queuedPipelinedResponse)
                            .append(",
    ");
                    }
                    continue;
                }
                //seq没有对上,说明顺序乱了,重置状态
                if (inflight.seq != queuedPipelinedResponse.seq) {
                    // reset state
                    LOG.warn(
                        "Replicator {} response sequence out of order, expect {}, but it is {}, reset state to try again.",
                        r, inflight.seq, queuedPipelinedResponse.seq);
                    r.resetInflights();
                    r.state = State.Probe;
                    continueSendEntries = false;
                    // 锁住节点,根据错误类别等待一段时间
                    r.block(Utils.nowMs(), RaftError.EREQUEST.getNumber());
                    return;
                }
                try {
                    switch (queuedPipelinedResponse.requestType) {
                        case AppendEntries:
                            //处理日志复制的response
                            continueSendEntries = onAppendEntriesReturned(id, inflight, queuedPipelinedResponse.status,
                                (AppendEntriesRequest) queuedPipelinedResponse.request,
                                (AppendEntriesResponse) queuedPipelinedResponse.response, rpcSendTime, startTimeMs, r);
                            break;
                        case Snapshot:
                            //处理快照的response
                            continueSendEntries = onInstallSnapshotReturned(id, r, queuedPipelinedResponse.status,
                                (InstallSnapshotRequest) queuedPipelinedResponse.request,
                                (InstallSnapshotResponse) queuedPipelinedResponse.response);
                            break;
                    }
                } finally {
                    if (continueSendEntries) {
                        // Success, increase the response sequence.
                        r.getAndIncrementRequiredNextSeq();
                    } else {
                        // The id is already unlocked in onAppendEntriesReturned/onInstallSnapshotReturned, we SHOULD break out.
                        break;
                    }
                }
            }
        } finally {
            if (isLogDebugEnabled) {
                sb.append(", after processed, continue to send entries: ").append(continueSendEntries);
                LOG.debug(sb.toString());
            }
            if (continueSendEntries) {
                // unlock in sendEntries.
                r.sendEntries();
            }
        }
    }
    
    1. 检查版本号,因为每次resetInflights都会让version加一,所以检查一下是不是同一批的数据
    2. 获取Replicator的pendingResponses队列,然后将当前响应的数据封装成RpcResponse实例加入到队列中
    3. 校验队列里面的元素是否大于256,大于256则清空数据重新同步
    4. 校验holdingQueue队列里面的seq最小的序列数据序列和当前的requiredNextSeq是否相同,不同的话如果是刚进入循环那么直接break退出循环
    5. 获取inflights队列中第一个元素,如果seq没有对上,说明顺序乱了,重置状态
    6. 调用onAppendEntriesReturned方法处理日志复制的response
    7. 如果处理成功,那么则调用sendEntries继续发送复制日志到Follower

    Replicator#onAppendEntriesReturned

    private static boolean onAppendEntriesReturned(final ThreadId id, final Inflight inflight, final Status status,
                                                   final AppendEntriesRequest request,
                                                   final AppendEntriesResponse response, final long rpcSendTime,
                                                   final long startTimeMs, final Replicator r) {
        //校验数据序列有没有错
        if (inflight.startIndex != request.getPrevLogIndex() + 1) {
            LOG.warn(
                "Replicator {} received invalid AppendEntriesResponse, in-flight startIndex={}, request prevLogIndex={}, reset the replicator state and probe again.",
                r, inflight.startIndex, request.getPrevLogIndex());
            r.resetInflights();
            r.state = State.Probe;
            // unlock id in sendEmptyEntries
            r.sendEmptyEntries(false);
            return false;
        }
        //度量
        // record metrics
        if (request.getEntriesCount() > 0) {
            r.nodeMetrics.recordLatency("replicate-entries", Utils.monotonicMs() - rpcSendTime);
            r.nodeMetrics.recordSize("replicate-entries-count", request.getEntriesCount());
            r.nodeMetrics.recordSize("replicate-entries-bytes", request.getData() != null ? request.getData().size()
                : 0);
        }
    
        final boolean isLogDebugEnabled = LOG.isDebugEnabled();
        StringBuilder sb = null;
        if (isLogDebugEnabled) {
            sb = new StringBuilder("Node "). //
                append(r.options.getGroupId()).append(":").append(r.options.getServerId()). //
                append(" received AppendEntriesResponse from "). //
                append(r.options.getPeerId()). //
                append(" prevLogIndex=").append(request.getPrevLogIndex()). //
                append(" prevLogTerm=").append(request.getPrevLogTerm()). //
                append(" count=").append(request.getEntriesCount());
        }
        //如果follower因为崩溃,RPC调用失败等原因没有收到成功响应
        //那么需要阻塞一段时间再进行调用
        if (!status.isOk()) {
            // If the follower crashes, any RPC to the follower fails immediately,
            // so we need to block the follower for a while instead of looping until
            // it comes back or be removed
            // dummy_id is unlock in block
            if (isLogDebugEnabled) {
                sb.append(" fail, sleep.");
                LOG.debug(sb.toString());
            }
            //如果注册了Replicator状态监听器,那么通知所有监听器
            notifyReplicatorStatusListener(r, ReplicatorEvent.ERROR, status);
            if (++r.consecutiveErrorTimes % 10 == 0) {
                LOG.warn("Fail to issue RPC to {}, consecutiveErrorTimes={}, error={}", r.options.getPeerId(),
                    r.consecutiveErrorTimes, status);
            }
            r.resetInflights();
            r.state = State.Probe;
            // unlock in in block
            r.block(startTimeMs, status.getCode());
            return false;
        }
        r.consecutiveErrorTimes = 0;
        //响应失败
        if (!response.getSuccess()) {
            // Leader 的切换,表明可能出现过一次网络分区,从新跟随新的 Leader
            if (response.getTerm() > r.options.getTerm()) {
                if (isLogDebugEnabled) {
                    sb.append(" fail, greater term ").append(response.getTerm()).append(" expect term ")
                        .append(r.options.getTerm());
                    LOG.debug(sb.toString());
                }
                // 获取当前本节点的表示对象——NodeImpl
                final NodeImpl node = r.options.getNode();
                r.notifyOnCaughtUp(RaftError.EPERM.getNumber(), true);
                r.destroy();
                // 调整自己的 term 任期值
                node.increaseTermTo(response.getTerm(), new Status(RaftError.EHIGHERTERMRESPONSE,
                    "Leader receives higher term heartbeat_response from peer:%s", r.options.getPeerId()));
                return false;
            }
            if (isLogDebugEnabled) {
                sb.append(" fail, find nextIndex remote lastLogIndex ").append(response.getLastLogIndex())
                    .append(" local nextIndex ").append(r.nextIndex);
                LOG.debug(sb.toString());
            }
            if (rpcSendTime > r.lastRpcSendTimestamp) {
                r.lastRpcSendTimestamp = rpcSendTime;
            }
            // Fail, reset the state to try again from nextIndex.
            r.resetInflights();
            //如果Follower最新的index小于下次要发送的index,那么设置为Follower响应的index
            // prev_log_index and prev_log_term doesn't match
            if (response.getLastLogIndex() + 1 < r.nextIndex) {
                LOG.debug("LastLogIndex at peer={} is {}", r.options.getPeerId(), response.getLastLogIndex());
                // The peer contains less logs than leader
                r.nextIndex = response.getLastLogIndex() + 1;
            } else {
                // The peer contains logs from old term which should be truncated,
                // decrease _last_log_at_peer by one to test the right index to keep
                if (r.nextIndex > 1) {
                    LOG.debug("logIndex={} dismatch", r.nextIndex);
                    r.nextIndex--;
                } else {
                    LOG.error("Peer={} declares that log at index=0 doesn't match, which is not supposed to happen",
                        r.options.getPeerId());
                }
            }
            //响应失败需要重新获取Follower的日志信息,用来重新同步
            // dummy_id is unlock in _send_heartbeat
            r.sendEmptyEntries(false);
            return false;
        }
        if (isLogDebugEnabled) {
            sb.append(", success");
            LOG.debug(sb.toString());
        }
        // success
        //响应成功检查任期
        if (response.getTerm() != r.options.getTerm()) {
            r.resetInflights();
            r.state = State.Probe;
            LOG.error("Fail, response term {} dismatch, expect term {}", response.getTerm(), r.options.getTerm());
            id.unlock();
            return false;
        }
        if (rpcSendTime > r.lastRpcSendTimestamp) {
            r.lastRpcSendTimestamp = rpcSendTime;
        }
        // 本次提交的日志数量
        final int entriesSize = request.getEntriesCount();
        if (entriesSize > 0) {
            // 节点确认提交
            r.options.getBallotBox().commitAt(r.nextIndex, r.nextIndex + entriesSize - 1, r.options.getPeerId());
            if (LOG.isDebugEnabled()) {
                LOG.debug("Replicated logs in [{}, {}] to peer {}", r.nextIndex, r.nextIndex + entriesSize - 1,
                    r.options.getPeerId());
            }
        } else {
            // The request is probe request, change the state into Replicate.
            r.state = State.Replicate;
        }
        r.nextIndex += entriesSize;
        r.hasSucceeded = true;
        r.notifyOnCaughtUp(RaftError.SUCCESS.getNumber(), false);
        // dummy_id is unlock in _send_entries
        if (r.timeoutNowIndex > 0 && r.timeoutNowIndex < r.nextIndex) {
            r.sendTimeoutNow(false, false);
        }
        return true;
    }
    

    onAppendEntriesReturned方法也非常的长,但是我们要有点耐心往下看

    1. 校验数据序列有没有错
    2. 进行度量和拼接日志操作
    3. 判断一下返回的状态如果不是正常的,那么就通知监听器,进行重置操作并阻塞一定时间后再发送
    4. 如果返回Success状态为false,那么校验一下任期,因为Leader 的切换,表明可能出现过一次网络分区,需要重新跟随新的 Leader;如果任期没有问题那么就进行重置操作,并根据Follower返回的最新的index来重新设值nextIndex
    5. 如果各种校验都没有问题的话,那么进行日志提交确认,更新最新的日志提交位置索引
  • 相关阅读:
    tgttg
    在OpenStack虚拟机实例中创建swap分区的一种方法
    产品:我想要的产品是网络存储+网络备份
    《哪来的天才?》读书笔记——天才源于练习,而且是针对性的练习
    一万小时理论的解读(神贴真开眼界:有意识的刻苦训练是必须的,要有精神动力,还必须有及时的反馈,对实力占优的活动比较有效;玩这样的活动是不行的)
    Cross-compiling Qt Embedded 5.5 for Raspberry Pi 2
    MSYS2的源配置
    关于iOS 5 Could not instantiate class named NSLayoutConstraint错误
    BAT线下战争:巨额投资或培养出自己最大对手(包括美团、58、饿了么在内的公司都在计划推出自己的支付工具和金融产品,腾讯只做2不做O)
    欢聚移动互联时代 在腾讯的夹缝中低调崛起
  • 原文地址:https://www.cnblogs.com/luozhiyun/p/12005975.html
Copyright © 2011-2022 走看看