  • RocketMQ message sending: queue selection and fault-tolerance strategy

    A topic has multiple queues spread across different brokers. When the producer sends a message, it has to pick one of these queues.

    Sequence diagram of the producer's whole send path: [figure not available]

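    Conclusions on queue selection and fault tolerance:

    • With fault tolerance off (the default), queues are picked round-robin; if a send fails, the retry skips the broker that failed.
    • With fault tolerance on, RocketMQ's prediction mechanism estimates whether each broker is available.
    • If the broker that failed last time is predicted to be available, a queue on that broker is still chosen.
    • If no queue qualifies, a broker is taken from the better half of the fault table (pickOneAtLeast) instead.
    • Every send records its elapsed time and whether it threw; RocketMQ uses these to predict when a broker becomes available again.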
    // Excerpt from the send retry loop in DefaultMQProducerImpl.sendDefaultImpl
    String lastBrokerName = null == mq ? null : mq.getBrokerName();
    MessageQueue tmpmq = this.selectOneMessageQueue(topicPublishInfo, lastBrokerName);
    if (tmpmq != null) {
        mq = tmpmq;
        //....
    }

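    As shown above, mq is still null on the first attempt, so lastBrokerName is null. If a send fails, the retry passes the failed broker's name as lastBrokerName into selectOneMessageQueue, which lands in MQFaultStrategy.selectOneMessageQueue: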

    public MessageQueue selectOneMessageQueue(final TopicPublishInfo tpInfo, final String lastBrokerName) {
        if (this.sendLatencyFaultEnable) {
            try {
                // Part 1: round-robin, keep the first queue whose broker is predicted available
                int index = tpInfo.getSendWhichQueue().getAndIncrement();
                for (int i = 0; i < tpInfo.getMessageQueueList().size(); i++) {
                    int pos = Math.abs(index++) % tpInfo.getMessageQueueList().size();
                    if (pos < 0)
                        pos = 0;
                    MessageQueue mq = tpInfo.getMessageQueueList().get(pos);
                    if (latencyFaultTolerance.isAvailable(mq.getBrokerName())) {
                        if (null == lastBrokerName || mq.getBrokerName().equals(lastBrokerName))
                            return mq;
                    }
                }

                // Part 2: fall back to the "least bad" broker in the fault table
                final String notBestBroker = latencyFaultTolerance.pickOneAtLeast();
                int writeQueueNums = tpInfo.getQueueIdByBroker(notBestBroker);
                if (writeQueueNums > 0) {
                    final MessageQueue mq = tpInfo.selectOneMessageQueue();
                    if (notBestBroker != null) {
                        mq.setBrokerName(notBestBroker);
                        mq.setQueueId(tpInfo.getSendWhichQueue().getAndIncrement() % writeQueueNums);
                    }
                    return mq;
                } else {
                    latencyFaultTolerance.remove(notBestBroker);
                }
            } catch (Exception e) {
                // exception swallowed in this excerpt; fall through to plain round-robin
            }

            return tpInfo.selectOneMessageQueue();
        }

        // fault tolerance disabled: round-robin that skips lastBrokerName
        return tpInfo.selectOneMessageQueue(lastBrokerName);
    }

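    It first checks sendLatencyFaultEnable, which defaults to false, to decide which path to take. With the flag off, it calls TopicPublishInfo.selectOneMessageQueue(lastBrokerName):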

    public MessageQueue selectOneMessageQueue(final String lastBrokerName) {
        // null means this is the first attempt, not a retry after a failure:
        // just round-robin over the queues
        if (lastBrokerName == null) {
            return selectOneMessageQueue();
        } else {
            // same as selectOneMessageQueue(), but skip queues on lastBrokerName
            int index = this.sendWhichQueue.getAndIncrement();
            for (int i = 0; i < this.messageQueueList.size(); i++) {
                int pos = Math.abs(index++) % this.messageQueueList.size();
                if (pos < 0)
                    pos = 0;
                MessageQueue mq = this.messageQueueList.get(pos);
                if (!mq.getBrokerName().equals(lastBrokerName)) {
                    return mq;
                }
            }
            // every queue is on lastBrokerName: fall back to plain round-robin
            return selectOneMessageQueue();
        }
    }

    public MessageQueue selectOneMessageQueue() {
        int index = this.sendWhichQueue.getAndIncrement();
        int pos = Math.abs(index) % this.messageQueueList.size();
        // Math.abs(Integer.MIN_VALUE) is still negative, hence this guard
        if (pos < 0)
            pos = 0;
        return this.messageQueueList.get(pos);
    }

    In short, both are round-robin; one skips the queues on the failed lastBrokerName and the other does not.
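A minimal, self-contained sketch of the two round-robin variants follows. QueueRef and RoundRobinQueueSelector are hypothetical stand-ins for MessageQueue and TopicPublishInfo, not RocketMQ classes. One deliberate difference: the sketch uses Math.floorMod instead of Math.abs plus the pos < 0 guard to keep the index non-negative after the counter overflows.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for MessageQueue: a broker name and a queue id.
final class QueueRef {
    final String brokerName;
    final int queueId;

    QueueRef(String brokerName, int queueId) {
        this.brokerName = brokerName;
        this.queueId = queueId;
    }
}

// Hypothetical stand-in for TopicPublishInfo's two selection methods.
final class RoundRobinQueueSelector {
    private final List<QueueRef> queues;
    private final AtomicInteger sendWhichQueue = new AtomicInteger(0);

    RoundRobinQueueSelector(List<QueueRef> queues) {
        this.queues = queues;
    }

    // Plain round-robin: take the next queue in order.
    QueueRef selectOne() {
        int index = sendWhichQueue.getAndIncrement();
        // floorMod stays non-negative even after the counter overflows.
        return queues.get(Math.floorMod(index, queues.size()));
    }

    // Round-robin that skips queues on the broker that failed last time.
    QueueRef selectOne(String lastBrokerName) {
        if (lastBrokerName == null) {
            return selectOne(); // first attempt, no failure yet
        }
        int index = sendWhichQueue.getAndIncrement();
        for (int i = 0; i < queues.size(); i++) {
            QueueRef q = queues.get(Math.floorMod(index++, queues.size()));
            if (!q.brokerName.equals(lastBrokerName)) {
                return q;
            }
        }
        // Every queue is on lastBrokerName: fall back to plain round-robin.
        return selectOne();
    }
}
```

With queues broker-a#0, broker-a#1, broker-b#0, broker-b#1, the first call returns broker-a#0, and a retry with lastBrokerName = "broker-a" skips broker-a#1 and returns broker-b#0.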

    When sendLatencyFaultEnable is on:

    • Part 1
    int index = tpInfo.getSendWhichQueue().getAndIncrement();
    for (int i = 0; i < tpInfo.getMessageQueueList().size(); i++) {
        int pos = Math.abs(index++) % tpInfo.getMessageQueueList().size();
        if (pos < 0)
            pos = 0;
        MessageQueue mq = tpInfo.getMessageQueueList().get(pos);
        // Is this broker predicted to be available? If no queue qualifies, fall through to Part 2
        if (latencyFaultTolerance.isAvailable(mq.getBrokerName())) {
            // first attempt (not a retry): return the queue directly
            // retry: return it only if its broker is the same as last time's (lastBrokerName)
            if (null == lastBrokerName || mq.getBrokerName().equals(lastBrokerName))
                return mq;
        }
    }
    • Part 2
    // take a broker from the fault-tolerance table
    final String notBestBroker = latencyFaultTolerance.pickOneAtLeast();
    int writeQueueNums = tpInfo.getQueueIdByBroker(notBestBroker);
    if (writeQueueNums > 0) { // the broker has writable queues
        // take the next queue in round-robin order
        final MessageQueue mq = tpInfo.selectOneMessageQueue();
        if (notBestBroker != null) {
            // point the queue at the picked broker
            mq.setBrokerName(notBestBroker);
            // recompute the queue id within that broker's writable queues
            mq.setQueueId(tpInfo.getSendWhichQueue().getAndIncrement() % writeQueueNums);
        }
        return mq;
    } else {
        latencyFaultTolerance.remove(notBestBroker);
    }

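    Part 1 picks a queue whose broker is available and, on a retry, whose brokerName equals lastBrokerName. This looks odd: lastBrokerName is only non-null after a failure, so why prefer an available queue on that same broker? My guess is that although the previous queue on this broker failed, the next queue on it may succeed, and since the latency fault-tolerance mechanism has just confirmed the broker is available, a different queue on it is tried.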

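    Suppose no queue is found. On a retry this has only one cause:

    • the latency fault-tolerance mechanism considers the broker lastBrokerName unavailable (on a first attempt, it would mean every broker is considered unavailable)

    The code then enters Part 2: it calls pickOneAtLeast to get a broker, then selectOneMessageQueue to take the next queue in round-robin order. If pickOneAtLeast returned a broker, that queue's brokerName and queueId are overwritten to point at the picked broker. If the picked broker has no writable queues, it is removed from the fault table and the method falls back to plain round-robin.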
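The two parts can be condensed into a sketch under a few stated assumptions: broker availability, pickOneAtLeast and the per-broker writable queue counts are injected as a Predicate, a Supplier and a Map (hypothetical parameters, not RocketMQ's API); removing a broker with no writable queues is left out; and Part 2 advances the counter only once. It also builds a new queue reference instead of calling setters on a queue object shared with other sends.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical stand-in for MessageQueue: a broker name and a queue id.
final class MqRef {
    final String brokerName;
    final int queueId;

    MqRef(String brokerName, int queueId) {
        this.brokerName = brokerName;
        this.queueId = queueId;
    }
}

// Sketch of the sendLatencyFaultEnable == true path with hypothetical injected collaborators.
final class FaultAwareSelector {
    private final List<MqRef> queues;
    private final Map<String, Integer> writeQueueNums; // writable queues per broker
    private final Predicate<String> isAvailable;        // broker predicted available?
    private final Supplier<String> pickOneAtLeast;      // "least bad" broker, may be null
    private final AtomicInteger sendWhichQueue = new AtomicInteger(0);

    FaultAwareSelector(List<MqRef> queues, Map<String, Integer> writeQueueNums,
                       Predicate<String> isAvailable, Supplier<String> pickOneAtLeast) {
        this.queues = queues;
        this.writeQueueNums = writeQueueNums;
        this.isAvailable = isAvailable;
        this.pickOneAtLeast = pickOneAtLeast;
    }

    MqRef select(String lastBrokerName) {
        // Part 1: round-robin; return the first queue whose broker is available
        // and, on a retry, whose broker equals lastBrokerName.
        int index = sendWhichQueue.getAndIncrement();
        for (int i = 0; i < queues.size(); i++) {
            MqRef mq = queues.get(Math.floorMod(index++, queues.size()));
            if (isAvailable.test(mq.brokerName)
                    && (lastBrokerName == null || mq.brokerName.equals(lastBrokerName))) {
                return mq;
            }
        }
        // Part 2: retarget to the least bad broker if it has writable queues.
        String notBestBroker = pickOneAtLeast.get();
        int nums = notBestBroker == null ? 0 : writeQueueNums.getOrDefault(notBestBroker, 0);
        if (nums > 0) {
            // New reference instead of mutating a shared queue object.
            return new MqRef(notBestBroker, Math.floorMod(sendWhichQueue.getAndIncrement(), nums));
        }
        // Fallback: plain round-robin over all queues.
        return queues.get(Math.floorMod(sendWhichQueue.getAndIncrement(), queues.size()));
    }
}
```

For example, with broker-a predicted unavailable, a retry with lastBrokerName = "broker-a" finds nothing in Part 1 and is redirected to whatever broker pickOneAtLeast returns.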

    Fault-tolerance strategy

    How to tell whether a broker is available

    public boolean isAvailable(final String name) {
        final FaultItem faultItem = this.faultItemTable.get(name);
        if (faultItem != null) {
            return faultItem.isAvailable();
        }
        return true;
    }

    This breaks down into two parts:

    • when entries are put into faultItemTable
    • how FaultItem.isAvailable is implemented

    isAvailable implementation

    public boolean isAvailable() {
        return (System.currentTimeMillis() - startTimestamp) >= 0;
    }

    It only checks whether the current time has reached startTimestamp. Why is a single timestamp comparison enough to know whether a broker is available?

    faultItemTable

    Searching for where faultItemTable is used leads to the updateFaultItem method:

    public void updateFaultItem(final String name/*brokerName*/, final long currentLatency, final long notAvailableDuration) {
        FaultItem old = this.faultItemTable.get(name);
        if (null == old) {
            final FaultItem faultItem = new FaultItem(name);
            faultItem.setCurrentLatency(currentLatency);
            faultItem.setStartTimestamp(System.currentTimeMillis() + notAvailableDuration);

            old = this.faultItemTable.putIfAbsent(name, faultItem);
            if (old != null) {
                old.setCurrentLatency(currentLatency);
                old.setStartTimestamp(System.currentTimeMillis() + notAvailableDuration);
            }
        } else {
            old.setCurrentLatency(currentLatency);
            old.setStartTimestamp(System.currentTimeMillis() + notAvailableDuration);
        }
    }

    It looks up the FaultItem by brokerName and sets startTimestamp = current time + notAvailableDuration. To find out what notAvailableDuration is, look at the caller of updateFaultItem, MQFaultStrategy.updateFaultItem(String, long, boolean):
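This also answers the question about isAvailable: startTimestamp already stores "time of the last update + penalty", so one comparison with the current time is enough. Below is a minimal sketch of such a fault table. SimpleFaultItem and SimpleFaultTable are hypothetical names; the clock is injected as a LongSupplier (an assumption made so the behaviour can be tested without sleeping), and ConcurrentHashMap.computeIfAbsent replaces the get/putIfAbsent sequence above.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.LongSupplier;

// Hypothetical simplified FaultItem.
final class SimpleFaultItem {
    final String name;
    volatile long currentLatency;  // latency of the latest send, in ms
    volatile long startTimestamp;  // the broker counts as available from this time on

    SimpleFaultItem(String name) {
        this.name = name;
    }

    boolean isAvailable(long now) {
        return now - startTimestamp >= 0;
    }
}

// Hypothetical simplified latency fault table keyed by brokerName.
final class SimpleFaultTable {
    private final ConcurrentMap<String, SimpleFaultItem> faultItemTable = new ConcurrentHashMap<>();
    private final LongSupplier clock; // injected clock in milliseconds

    SimpleFaultTable(LongSupplier clock) {
        this.clock = clock;
    }

    void updateFaultItem(String name, long currentLatency, long notAvailableDuration) {
        // One atomic get-or-create instead of get / putIfAbsent.
        SimpleFaultItem item = faultItemTable.computeIfAbsent(name, SimpleFaultItem::new);
        item.currentLatency = currentLatency;
        item.startTimestamp = clock.getAsLong() + notAvailableDuration;
    }

    boolean isAvailable(String name) {
        SimpleFaultItem item = faultItemTable.get(name);
        // No record yet: the broker has never been penalised, so it is available.
        return item == null || item.isAvailable(clock.getAsLong());
    }
}
```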

    public void updateFaultItem(final String brokerName, final long currentLatency, boolean isolation) {
        if (this.sendLatencyFaultEnable) { // latency fault tolerance enabled
            // isolation == true replaces the measured latency with 30000 ms
            long duration = computeNotAvailableDuration(isolation ? 30000 : currentLatency);
            this.latencyFaultTolerance.updateFaultItem(brokerName, currentLatency, duration);
        }
    }

    private long computeNotAvailableDuration(final long currentLatency) {
        // scan from the largest threshold down; the first one <= currentLatency wins
        for (int i = latencyMax.length - 1; i >= 0; i--) {
            if (currentLatency >= latencyMax[i]) return this.notAvailableDuration[i];
        }
        return 0;
    }

    Some fields of MQFaultStrategy.java:

    public class MQFaultStrategy {
        private final static Logger log = ClientLogger.getLog();
        /**
         * Latency fault tolerance: tracks each broker's send latency
         * key: brokerName
         */
        private final LatencyFaultTolerance<String> latencyFaultTolerance = new LatencyFaultToleranceImpl();
        /**
         * Switch for send-latency fault tolerance
         */
        private boolean sendLatencyFaultEnable = false;
        /**
         * Latency thresholds (ms)
         */
        private long[] latencyMax = {50L, 100L, 550L, 1000L, 2000L, 3000L, 15000L};
        /**
         * Not-available durations (ms), one per latencyMax entry
         */
        private long[] notAvailableDuration = {0L, 0L, 30000L, 60000L, 120000L, 180000L, 600000L};

        .....
    }

    The duration is the entry of the notAvailableDuration array at the matching position. The two arrays pair up as follows:

     
    latencyMax (ms)    notAvailableDuration (ms)
    50                 0
    100                0
    550                30000
    1000               60000
    2000               120000
    3000               180000
    15000              600000

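    • currentLatency below 50: no threshold matches, so notAvailableDuration is 0
    • currentLatency at least 50 and below 100: notAvailableDuration is 0
    • currentLatency at least 100 and below 550: notAvailableDuration is 0
    • currentLatency at least 550 and below 1000: notAvailableDuration is 30000
    • …and so on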

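    If isolation is true, currentLatency is replaced by 30000, which is at least 15000, so notAvailableDuration becomes 600000 ms (10 minutes).
    Combined with isAvailable, the overall flow is: RocketMQ predicts for each broker a time at which it becomes available again (current time + notAvailableDuration), and the broker counts as available only once the current time reaches that point. The seven notAvailableDuration entries correspond one-to-one with the latencyMax thresholds, so the measured currentLatency decides how far in the future that point lies.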
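The latency-to-penalty mapping is small enough to check by hand. The sketch below copies the two arrays listed above; NotAvailableDurations and forSend are hypothetical names, and forSend folds in the isolation rule from MQFaultStrategy.updateFaultItem.

```java
// Sketch of the latency -> not-available-duration mapping, using the array values above.
final class NotAvailableDurations {
    private static final long[] LATENCY_MAX = {50L, 100L, 550L, 1000L, 2000L, 3000L, 15000L};
    private static final long[] NOT_AVAILABLE_DURATION = {0L, 0L, 30000L, 60000L, 120000L, 180000L, 600000L};

    private NotAvailableDurations() {
    }

    // Scan from the largest threshold down; the first threshold <= currentLatency wins.
    static long compute(long currentLatency) {
        for (int i = LATENCY_MAX.length - 1; i >= 0; i--) {
            if (currentLatency >= LATENCY_MAX[i]) {
                return NOT_AVAILABLE_DURATION[i];
            }
        }
        return 0L; // below 50 ms
    }

    // Hypothetical helper: a failed send (isolation) is priced as a 30000 ms latency.
    static long forSend(long currentLatency, boolean isolation) {
        return compute(isolation ? 30000L : currentLatency);
    }
}
```

For example, a send that took 700 ms yields 30000, so the broker is skipped by Part 1 for 30 seconds, while any failed send yields 600000 (10 minutes).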

    Next, look at where updateFaultItem is called to see what currentLatency is:

    try {
        beginTimestampPrev = System.currentTimeMillis();
        sendResult = this.sendKernelImpl(msg, mq, communicationMode, sendCallback, topicPublishInfo, timeout);
        endTimestamp = System.currentTimeMillis();
        // 1. send succeeded: record the elapsed time, isolation = false
        this.updateFaultItem(mq.getBrokerName(), endTimestamp - beginTimestampPrev, false);
    } catch (xxException e) {
        endTimestamp = System.currentTimeMillis();
        // 2. send failed: record the elapsed time, isolation = true
        this.updateFaultItem(mq.getBrokerName(), endTimestamp - beginTimestampPrev, true);
    }

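    currentLatency is the elapsed time of the send, and the interval it falls into sets the penalty. Any latency below 550 ms gives a notAvailableDuration of 0, so the broker stays available; above that, the unavailable window grows with the latency. When the send throws, isolation is true, so the broker is not considered available again for 600000 ms.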

    pickOneAtLeast

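    When a broker really is marked unavailable for 600000 ms, Part 1 of selectOneMessageQueue can no longer return it, so the code falls through to Part 2, which first calls pickOneAtLeast to get a broker: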

    public String pickOneAtLeast() {
        final Enumeration<FaultItem> elements = this.faultItemTable.elements();
        List<FaultItem> tmpList = new LinkedList<FaultItem>();
        // copy every FaultItem in faultItemTable into a list
        while (elements.hasMoreElements()) {
            final FaultItem faultItem = elements.nextElement();
            tmpList.add(faultItem);
        }

        if (!tmpList.isEmpty()) {
            // shuffle first, then sort (FaultItem is Comparable)
            Collections.shuffle(tmpList);
            Collections.sort(tmpList);

            final int half = tmpList.size() / 2;
            if (half <= 0) { // only one element
                return tmpList.get(0).getName();
            } else { // rotate over the better half: counter modulo half
                final int i = this.whichItemWorst.getAndIncrement() % half;
                return tmpList.get(i).getName();
            }
        }
        return null;
    }
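pickOneAtLeast sorts FaultItem, but the article does not show FaultItem.compareTo. The sketch below assumes an ordering of available entries first, then lower currentLatency, then earlier startTimestamp. FaultEntry and LeastBadPicker are hypothetical names, availability is precomputed as a flag, and Math.floorMod keeps the index non-negative if the counter overflows (the excerpt's plain % could go negative at that point).

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical fault-table entry.
final class FaultEntry {
    final String name;
    final boolean available;  // precomputed here; RocketMQ derives it from startTimestamp
    final long currentLatency;
    final long startTimestamp;

    FaultEntry(String name, boolean available, long currentLatency, long startTimestamp) {
        this.name = name;
        this.available = available;
        this.currentLatency = currentLatency;
        this.startTimestamp = startTimestamp;
    }
}

final class LeastBadPicker {
    // ASSUMPTION: available first, then lower latency, then earlier startTimestamp.
    private static final Comparator<FaultEntry> ORDER =
            Comparator.comparing((FaultEntry e) -> !e.available)
                    .thenComparingLong(e -> e.currentLatency)
                    .thenComparingLong(e -> e.startTimestamp);

    private final AtomicInteger whichItemWorst = new AtomicInteger(0);

    String pickOneAtLeast(Collection<FaultEntry> faultItems) {
        if (faultItems.isEmpty()) {
            return null;
        }
        List<FaultEntry> tmpList = new ArrayList<>(faultItems);
        // Shuffle first so entries that compare equal end up in random order, then sort.
        Collections.shuffle(tmpList);
        tmpList.sort(ORDER);
        int half = tmpList.size() / 2;
        if (half <= 0) {
            return tmpList.get(0).name; // a single entry
        }
        // Rotate over the better half of the sorted list.
        int i = Math.floorMod(whichItemWorst.getAndIncrement(), half);
        return tmpList.get(i).name;
    }
}
```

Because only the better half is rotated through, the result is the least bad candidate, which need not actually be available yet.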
  • Original article: https://www.cnblogs.com/laowz/p/10780910.html