探秘RocketMQ源码——Series1：Producer视角看事务消息

1. 前言

Apache RocketMQ作为广为人知的开源消息中间件，诞生于阿里巴巴，于2016年捐赠给了Apache。从RocketMQ 4.0到如今最新的v4.7.1，不论是在阿里巴巴内部还是外部社区，都赢得了广泛的关注和好评。
出于兴趣和工作的需要，近期本人对RocketMQ 4.7.1的部分代码进行了研读，其间产生了很多困惑，也收获了更多的启发。

本文将站在发送方视角，通过阅读RocketMQ Producer源码，来分析在事务消息发送中RocketMQ是如何工作的。需要说明的是，本文所贴代码，均来自4.7.1版本的RocketMQ源码。本文中所讨论的发送，仅指从Producer发送到Broker的过程，并不包含Broker将消息投递到Consumer的过程。

2. 宏观概览

RocketMQ事务消息发送流程：

图1

结合源码来看，RocketMQ的事务消息TransactionMQProducer的sendMessageInTransaction方法，实际调用了DefaultMQProducerImpl的sendMessageInTransaction方法。我们进入sendMessageInTransaction方法，整个事务消息的发送流程清晰可见：

首先，做发送前检查，并填入必要参数，包括设prepare事务消息。

源码清单-1

public TransactionSendResult sendMessageInTransaction(final Message msg,
    final LocalTransactionExecuter localTransactionExecuter, final Object arg)
    throws MQClientException {
    TransactionListener transactionListener = getCheckListener(); 
        if (null == localTransactionExecuter && null == transactionListener) {
        throw new MQClientException("tranExecutor is null", null);
    }

    // ignore DelayTimeLevel parameter
    if (msg.getDelayTimeLevel() != 0) {
        MessageAccessor.clearProperty(msg, MessageConst.PROPERTY_DELAY_TIME_LEVEL);
    }

    Validators.checkMessage(msg, this.defaultMQProducer);

    SendResult sendResult = null;
    MessageAccessor.putProperty(msg, MessageConst.PROPERTY_TRANSACTION_PREPARED, "true");
    MessageAccessor.putProperty(msg, MessageConst.PROPERTY_PRODUCER_GROUP, this.defaultMQProducer.getProducerGroup());

进入发送处理流程：

源码清单-2

    try {
        sendResult = this.send(msg);
    } catch (Exception e) {
        throw new MQClientException("send message Exception", e);
    }

根据broker返回的处理结果决策本地事务是否执行，半消息发送成功则开始本地事务执行：

源码清单-3

    LocalTransactionState localTransactionState = LocalTransactionState.UNKNOW;
    Throwable localException = null;
    switch (sendResult.getSendStatus()) {
        case SEND_OK: {
            try {
                if (sendResult.getTransactionId() != null) {
                    msg.putUserProperty("__transactionId__", sendResult.getTransactionId());
                }
                String transactionId = msg.getProperty(MessageConst.PROPERTY_UNIQ_CLIENT_MESSAGE_ID_KEYIDX);
                if (null != transactionId && !"".equals(transactionId)) {
                    msg.setTransactionId(transactionId);
                }
                if (null != localTransactionExecuter) { 
                    localTransactionState = localTransactionExecuter.executeLocalTransactionBranch(msg, arg);
                } else if (transactionListener != null) { 
                    log.debug("Used new transaction API");
                    localTransactionState = transactionListener.executeLocalTransaction(msg, arg); 
                }
                if (null == localTransactionState) {
                    localTransactionState = LocalTransactionState.UNKNOW;
                }

                if (localTransactionState != LocalTransactionState.COMMIT_MESSAGE) {
                    log.info("executeLocalTransactionBranch return {}", localTransactionState);
                    log.info(msg.toString());
                }
            } catch (Throwable e) {
                log.info("executeLocalTransactionBranch exception", e);
                log.info(msg.toString());
                localException = e;
            }
        }
        break;
        case FLUSH_DISK_TIMEOUT:
        case FLUSH_SLAVE_TIMEOUT:
        case SLAVE_NOT_AVAILABLE:  // 当备broker状态不可用时，半消息要回滚，不执行本地事务
            localTransactionState = LocalTransactionState.ROLLBACK_MESSAGE;
            break;
        default:
            break;
    }

本地事务执行结束，根据本地事务状态进行二阶段处理：

源码清单-4

    try {
        this.endTransaction(sendResult, localTransactionState, localException);
    } catch (Exception e) {
        log.warn("local transaction execute " + localTransactionState + ", but end broker transaction failed", e);
    }

    // 组装发送结果
    // ...
    return transactionSendResult;
}

接下来，我们深入每个阶段代码分析。

3. 深扒内幕

3.1 一阶段发送

重点分析send方法。进入send方法后，我们发现，RocketMQ的事务消息的一阶段，使用了SYNC同步模式：

源码清单-5

public SendResult send(Message msg,
    long timeout) throws MQClientException, RemotingException, MQBrokerException, InterruptedException {
    return this.sendDefaultImpl(msg, CommunicationMode.SYNC, null, timeout);
}

这一点很容易理解，毕竟事务消息是要根据一阶段发送结果来决定要不要执行本地事务的，所以一定要阻塞等待broker的ack。

我们进入DefaultMQProducerImpl.java中去看sendDefaultImpl方法的实现，通过读这个方法的代码，来尝试了解在事务消息的一阶段发送过程中producer的行为。值得注意的是，这个方法并非为事务消息定制，甚至不是为SYNC同步模式定制的，因此读懂了这段代码，基本可以对RocketMQ的消息发送机制有了一个较为全面的认识。
这段代码逻辑非常通畅，不忍切片。为了节省篇幅，将代码中较为繁杂但信息量不大的部分以注释代替，尽可能保留流程的完整性。个人认为较为重要或是容易被忽略的部分，以注释标出，后文还有部分细节的详细解读。

源码清单-6

private SendResult sendDefaultImpl(
    Message msg,
    final CommunicationMode communicationMode,
    final SendCallback sendCallback,
    final long timeout
    ) throws MQClientException, RemotingException, MQBrokerException, InterruptedException {
    this.makeSureStateOK();
    // 一、消息有效性校验。见后文
    Validators.checkMessage(msg, this.defaultMQProducer);
    final long invokeID = random.nextLong();
    long beginTimestampFirst = System.currentTimeMillis();
    long beginTimestampPrev = beginTimestampFirst;
    long endTimestamp = beginTimestampFirst;

    // 获取当前topic的发送路由信息，主要是要broker，如果没找到则从namesrv获取
    TopicPublishInfo topicPublishInfo = this.tryToFindTopicPublishInfo(msg.getTopic());
    if (topicPublishInfo != null && topicPublishInfo.ok()) {
        boolean callTimeout = false;
        MessageQueue mq = null;
        Exception exception = null;
        SendResult sendResult = null;
        // 二、发送重试机制。见后文
        int timesTotal = communicationMode == CommunicationMode.SYNC ? 1 + this.defaultMQProducer.getRetryTimesWhenSendFailed() : 1;
        int times = 0;
        String[] brokersSent = new String[timesTotal];
        for (; times < timesTotal; times++) {
            // 第一次发送是mq == null， 之后都是有broker信息的
            String lastBrokerName = null == mq ? null : mq.getBrokerName();
            // 三、rocketmq发送消息时如何选择队列？——broker异常规避机制 
            MessageQueue mqSelected = this.selectOneMessageQueue(topicPublishInfo, lastBrokerName);

            if (mqSelected != null) {
                mq = mqSelected;
                brokersSent[times] = mq.getBrokerName();
                try {
                    beginTimestampPrev = System.currentTimeMillis();
                    if (times > 0) {
                        //Reset topic with namespace during resend.
                        msg.setTopic(this.defaultMQProducer.withNamespace(msg.getTopic()));
                    }
                    long costTime = beginTimestampPrev - beginTimestampFirst;
                    if (timeout < costTime) {
                        callTimeout = true;
                        break;
                    }
                    // 发送核心代码
                    sendResult = this.sendKernelImpl(msg, mq, communicationMode, sendCallback, topicPublishInfo, timeout - costTime);
                    endTimestamp = System.currentTimeMillis();
                    // rocketmq 选择 broker 时的规避机制，开启 sendLatencyFaultEnable == true 才生效
                    this.updateFaultItem(mq.getBrokerName(), endTimestamp - beginTimestampPrev, false);

                    switch (communicationMode) {
                    // 四、RocketMQ的三种CommunicationMode。见后文
                        case ASYNC: // 异步模式
                            return null;
                        case ONEWAY: // 单向模式
                            return null;
                        case SYNC: // 同步模式
                            if (sendResult.getSendStatus() != SendStatus.SEND_OK) {
                                if (this.defaultMQProducer.isRetryAnotherBrokerWhenNotStoreOK()) {
                                    continue;
                                }
                            }
                            return sendResult;
                        default:
                            break;
                    }
                } catch (RemotingException e) {
                    // ...
                    // 自动重试
                } catch (MQClientException e) {
                    // ...
                    // 自动重试
                } catch (MQBrokerException e) {
                   // ...
                    // 仅返回码==NOT_IN_CURRENT_UNIT==205 时自动重试
                    // 其他情况不重试，抛异常
                } catch (InterruptedException e) {
                   // ...
                    // 不重试，抛异常
                }
            } else {
                break;
            }
        }

        if (sendResult != null) {
            return sendResult;
        }

        // 组装返回的info信息，最后以MQClientException抛出
        // ... ...

        // 超时场景抛RemotingTooMuchRequestException
        if (callTimeout) {
            throw new RemotingTooMuchRequestException("sendDefaultImpl call timeout");
        }

        // 填充MQClientException异常信息
        // ...
    }

    validateNameServerSetting();

    throw new MQClientException("No route info of this topic: " + msg.getTopic() + FAQUrl.suggestTodo(FAQUrl.NO_TOPIC_ROUTE_INFO),
        null).setResponseCode(ClientErrorCode.NOT_FOUND_TOPIC_EXCEPTION);
}

3.1.1 消息有效性校验

源码清单-7

 Validators.checkMessage(msg, this.defaultMQProducer);

在此方法中校验消息的有效性，包括对topic和消息体的校验。topic的命名必须符合规范，且避免使用内置的系统消息TOPIC。消息体长度 > 0 && 消息体长度 <= 102410244 = 4M 。

源码清单-8

public static void checkMessage(Message msg, DefaultMQProducer defaultMQProducer)
    throws MQClientException {
    if (null == msg) {
        throw new MQClientException(ResponseCode.MESSAGE_ILLEGAL, "the message is null");
    }
    // topic
    Validators.checkTopic(msg.getTopic());
    Validators.isNotAllowedSendTopic(msg.getTopic());

    // body
    if (null == msg.getBody()) {
        throw new MQClientException(ResponseCode.MESSAGE_ILLEGAL, "the message body is null");
    }

    if (0 == msg.getBody().length) {
        throw new MQClientException(ResponseCode.MESSAGE_ILLEGAL, "the message body length is zero");
    }

    if (msg.getBody().length > defaultMQProducer.getMaxMessageSize()) {
        throw new MQClientException(ResponseCode.MESSAGE_ILLEGAL,
            "the message body size over max value, MAX: " + defaultMQProducer.getMaxMessageSize());
    }
}

3.1.2 发送重试机制

Producer在消息发送不成功时，会自动重试，最多发送次数 = retryTimesWhenSendFailed + 1 = 3次。

值得注意的是，并非所有异常情况都会重试，从以上源码中可以提取到的信息告诉我们，在以下三种情况下，会自动重试：
1）发生RemotingException，MQClientException两种异常之一时。
2）发生MQBrokerException异常，且ResponseCode是NOT_IN_CURRENT_UNIT = 205时。
3）SYNC模式下，未发生异常且发送结果状态非 SEND_OK。

在每次发送消息之前，会先检查是否在前面这两步就已经耗时超长（超时时长默认3000ms），若是，则不再继续发送并且直接返回超时，不再重试。这里说明了2个问题：
1）producer内部自动重试对业务应用而言是无感知的，应用看到的发送耗时是包含所有重试的耗时在内的；
2）一旦超时意味着本次消息发送已经以失败告终，原因是超时。这个信息最后会以RemotingTooMuchRequestException的形式抛出。

这里需要指出的是，在RocketMQ官方文档中指出，发送超时时长是10s，即10000ms，网上许多人对rocketMQ的超时时间解读也认为是10s。然而代码中却明明白白写着3000ms，最终我debug之后确认，默认超时时间确实是3000ms。这里也建议RocketMQ团队对文档进行确认，如确有误，还是早日更正为好。

图2

3.1.3 broker的异常规避机制

源码清单-8

MessageQueue mqSelected = this.selectOneMessageQueue(topicPublishInfo, lastBrokerName);

这行代码是发送前选择queue的过程。

这里涉及RocketMQ消息发送高可用的的一个核心机制，latencyFaultTolerance。这个机制是Producer负载均衡的一部分，通过sendLatencyFaultEnable的值来控制，默认是false关闭状态，不启动broker故障延迟机制，值为true时启用broker故障延迟机制，可由Producer主动打开。

选择队列时，开启异常规避机制，则根据broker的工作状态避免选择当前状态不佳的broker代理，不健康的broker会在一段时间内被规避，不开启异常规避机制时，则按顺序选取下一个队列，但在重试场景下会尽量选择不同于上次发送broker的queue。每次消息发送都会通过updateFaultItem方法来维护broker的状态信息。

源码清单-9

public void updateFaultItem(final String brokerName, final long currentLatency, boolean isolation) {
    if (this.sendLatencyFaultEnable) {
        // 计算延迟多久，isolation表示是否需要隔离该broker，若是，则从30s往前找第一个比30s小的延迟值，再按下标判断规避的周期，若30s，则是10min规避；
        // 否则，按上一次发送耗时来决定规避时长；
        long duration = computeNotAvailableDuration(isolation ? 30000 : currentLatency);
        this.latencyFaultTolerance.updateFaultItem(brokerName, currentLatency, duration);
    }
}

深入到selectOneMessageQueue方法内部一探究竟：

源码清单-10

public MessageQueue selectOneMessageQueue(final TopicPublishInfo tpInfo, final String lastBrokerName) {
    if (this.sendLatencyFaultEnable) {
        // 开启异常规避
        try {
            int index = tpInfo.getSendWhichQueue().getAndIncrement();
            for (int i = 0; i < tpInfo.getMessageQueueList().size(); i++) {
                int pos = Math.abs(index++) % tpInfo.getMessageQueueList().size();
                if (pos < 0)
                    pos = 0;
                // 按顺序取下一个message queue作为发送的queue
                MessageQueue mq = tpInfo.getMessageQueueList().get(pos);
                // 当前queue所在的broker可用，且与上一个queue的broker相同，
                // 或者第一次发送，则使用这个queue
                if (latencyFaultTolerance.isAvailable(mq.getBrokerName())) {
                    if (null == lastBrokerName || mq.getBrokerName().equals(lastBrokerName))
                        return mq;
                }
            }

            final String notBestBroker = latencyFaultTolerance.pickOneAtLeast();
            int writeQueueNums = tpInfo.getQueueIdByBroker(notBestBroker);
            if (writeQueueNums > 0) {
                final MessageQueue mq = tpInfo.selectOneMessageQueue();
                if (notBestBroker != null) {
                    mq.setBrokerName(notBestBroker);
                    mq.setQueueId(tpInfo.getSendWhichQueue().getAndIncrement() % writeQueueNums);
                }
                return mq;
            } else {
                latencyFaultTolerance.remove(notBestBroker);
            }
        } catch (Exception e) {
            log.error("Error occurred when selecting message queue", e);
        }

        return tpInfo.selectOneMessageQueue();
    }
    // 不开启异常规避，则随机自增选择Queue
    return tpInfo.selectOneMessageQueue(lastBrokerName);
}

3.1.4 RocketMQ的三种CommunicationMode

源码清单-11

 public enum CommunicationMode {
    SYNC,
    ASYNC,
    ONEWAY,
}

以上三种模式指的都是消息从发送方到达broker的阶段，不包含broker将消息投递给订阅方的过程。
三种模式的发送方式的差异：

单向模式：ONEWAY。消息发送方只管发送，并不关心broker处理的结果如何。这种模式下，由于处理流程少，发送耗时非常小，吞吐量大，但不能保证消息可靠不丢，常用于流量巨大但不重要的消息场景，例如心跳发送等。
异步模式：ASYNC。消息发送方发送消息到broker后，无需等待broker处理，拿到的是null的返回值，而由一个异步的线程来做消息处理，处理完成后以回调的形式告诉发送方发送结果。异步处理时如有异常，返回发送方失败结果之前，会经过内部重试（默认3次，发送方不感知）。这种模式下，发送方等待时长较小，吞吐量较大，消息可靠，用于流量大但重要的消息场景。
同步模式：SYNC。消息发送方需等待broker处理完成并明确返回成功或失败，在消息发送方拿到消息发送失败的结果之前，也会经历过内部重试（默认3次，发送方不感知）。这种模式下，发送方会阻塞等待消息处理结果，等待时长较长，消息可靠，用于流量不大但重要的消息场景。需要强调的是，事务消息的一阶段半事务消息的处理是同步模式。

在sendKernelImpl方法中也可以看到具体的实现差异。ONEWAY模式最为简单，不做任何处理。负责发送的sendMessage方法参数中，相比同步模式，异步模式多了回调方法、包含topic发送路由元信息的topicPublishInfo、包含发送broker信息的instance、包含发送队列信息的producer、重试次数。另外，异步模式下，会对有压缩的消息先做copy。

源码清单-12

    switch (communicationMode) {
                case ASYNC:
                    Message tmpMessage = msg;
                    boolean messageCloned = false;
                    if (msgBodyCompressed) {
                        //If msg body was compressed, msgbody should be reset using prevBody.
                        //Clone new message using commpressed message body and recover origin massage.
                        //Fix bug:https://github.com/apache/rocketmq-externals/issues/66
                        tmpMessage = MessageAccessor.cloneMessage(msg);
                        messageCloned = true;
                        msg.setBody(prevBody);
                    }

                    if (topicWithNamespace) {
                        if (!messageCloned) {
                            tmpMessage = MessageAccessor.cloneMessage(msg);
                            messageCloned = true;
                        }
                        msg.setTopic(NamespaceUtil.withoutNamespace(msg.getTopic(), this.defaultMQProducer.getNamespace()));
                    }

                    long costTimeAsync = System.currentTimeMillis() - beginStartTime;
                    if (timeout < costTimeAsync) {
                        throw new RemotingTooMuchRequestException("sendKernelImpl call timeout");
                    }
                    sendResult = this.mQClientFactory.getMQClientAPIImpl().sendMessage(
                        brokerAddr,
                        mq.getBrokerName(),
                        tmpMessage,
                        requestHeader,
                        timeout - costTimeAsync,
                        communicationMode,
                        sendCallback,
                        topicPublishInfo,
                        this.mQClientFactory,
                        this.defaultMQProducer.getRetryTimesWhenSendAsyncFailed(),
                        context,
                        this);
                    break;
                case ONEWAY:
                case SYNC:
                    long costTimeSync = System.currentTimeMillis() - beginStartTime;
                    if (timeout < costTimeSync) {
                        throw new RemotingTooMuchRequestException("sendKernelImpl call timeout");
                    }
                    sendResult = this.mQClientFactory.getMQClientAPIImpl().sendMessage(
                        brokerAddr,
                        mq.getBrokerName(),
                        msg,
                        requestHeader,
                        timeout - costTimeSync,
                        communicationMode,
                        context,
                        this);
                    break;
                default:
                    assert false;
                    break;
            }

官方文档中有这样一张图，十分清晰的描述了异步通信的详细过程：

图3

3.2 二阶段发送

源码清单-3体现了本地事务的执行，localTransactionState将本地事务执行结果与事务消息二阶段的发送关联起来。
值得注意的是，如果一阶段的发送结果是SLAVE_NOT_AVAILABLE，即备broker不可用时，也会将localTransactionState置为Rollback，此时将不会执行本地事务。之后由endTransaction方法负责二阶段提交，见源码清单-4。具体到endTransaction的实现：

源码清单-13

public void endTransaction(
    final SendResult sendResult,
    final LocalTransactionState localTransactionState,
    final Throwable localException) throws RemotingException, MQBrokerException, InterruptedException, UnknownHostException {
    final MessageId id;
    if (sendResult.getOffsetMsgId() != null) {
        id = MessageDecoder.decodeMessageId(sendResult.getOffsetMsgId());
    } else {
        id = MessageDecoder.decodeMessageId(sendResult.getMsgId());
    }
    String transactionId = sendResult.getTransactionId();
    final String brokerAddr = this.mQClientFactory.findBrokerAddressInPublish(sendResult.getMessageQueue().getBrokerName());
    EndTransactionRequestHeader requestHeader = new EndTransactionRequestHeader();
    requestHeader.setTransactionId(transactionId);
    requestHeader.setCommitLogOffset(id.getOffset());
    switch (localTransactionState) {
        case COMMIT_MESSAGE:
            requestHeader.setCommitOrRollback(MessageSysFlag.TRANSACTION_COMMIT_TYPE);
            break;
        case ROLLBACK_MESSAGE:
            requestHeader.setCommitOrRollback(MessageSysFlag.TRANSACTION_ROLLBACK_TYPE);
            break;
        case UNKNOW:
            requestHeader.setCommitOrRollback(MessageSysFlag.TRANSACTION_NOT_TYPE);
            break;
        default:
            break;
    }

    requestHeader.setProducerGroup(this.defaultMQProducer.getProducerGroup());
    requestHeader.setTranStateTableOffset(sendResult.getQueueOffset());
    requestHeader.setMsgId(sendResult.getMsgId());
    String remark = localException != null ? ("executeLocalTransactionBranch exception: " + localException.toString()) : null;
    // 采用oneway的方式发送二阶段消息
    this.mQClientFactory.getMQClientAPIImpl().endTransactionOneway(brokerAddr, requestHeader, remark,
        this.defaultMQProducer.getSendMsgTimeout());
}

在二阶段发送时，之所以用oneway的方式发送，个人理解这正是因为事务消息有一个特殊的可靠机制——回查。

3.3 消息回复

当Broker经过了一个特定的时间，发现依然没有得到事务消息的二阶段是否要提交或者回滚的确切信息，Broker不知道Producer发生了什么情况（可能producer挂了，也可能producer发了commit但网络抖动丢了，也可能...），于是主动发起回查。
事务消息的回查机制，更多的是在broker端的体现。RocketMQ的broker以Half消息、Op消息、真实消息三个不同的topic来将不同发送阶段的事务消息进行了隔离，使得Consumer只能看到最终确认commit需要投递出去的消息。其中详细的实现逻辑在本文中暂不多赘述，后续可另开一篇专门来从Broker视角来解读。

回到Producer的视角，当收到了Broker的回查请求，Producer将根据消息检查本地事务状态，根据结果决定提交或回滚，这就要求Producer必须指定回查实现，以备不时之需。
当然，正常情况下，并不推荐主动发送UNKNOW状态，这个状态毫无疑问会给broker带来额外回查开销，只在出现不可预知的异常情况时才启动回查机制，是一种比较合理的选择。

另外，4.7.1版本的事务回查并非无限回查，而是最多回查15次：

源码清单-14

/**
 * The maximum number of times the message was checked, if exceed this value, this message will be discarded.
 */
@ImportantField
private int transactionCheckMax = 15;

附录

官方给出Producer的默认参数如下（其中超时时长的参数，在前文中也已经提到，debug的结果是默认3000ms，并非10000ms）:

图4

RocketMQ作为一款优秀的开源消息中间件，有很多开发者基于它做了二次开发，例如蚂蚁集团商业化产品SOFAStack MQ消息队列，就是基于RocketMQ内核进行的再次开发的金融级消息中间件，在消息管控、透明运维等方面做了大量优秀的工作。
愿RocketMQ在社区广大开发者的共创共建之下，能够不断发展壮大，迸发更强的生命力。

原文链接

本文为阿里云原创内容，未经允许不得转载。