zoukankan      html  css  js  c++  java
  • storm-kafka-0.8-plus 源码解析

    https://github.com/wurstmeister/storm-kafka-0.8-plus

    http://blog.csdn.net/xeseo/article/details/18615761

     

    准备,一些相关类

    GlobalPartitionInformation (storm.kafka.trident)

    记录partitionid和broker的关系

    GlobalPartitionInformation info = new GlobalPartitionInformation();
    
    info.addPartition(0, new Broker("10.1.110.24",9092));
    
    info.addPartition(0, new Broker("10.1.110.21",9092));

    可以静态的生成GlobalPartitionInformation,向上面代码一样
    也可以动态的从zk获取,推荐这种方式
    从zk获取就会用到DynamicBrokersReader

     

    DynamicBrokersReader

    核心就是从zk上读出partition和broker的对应关系
    操作zk都是使用curator框架

    核心函数,

        /**
         * Get all partitions with their current leaders
         */
        public GlobalPartitionInformation getBrokerInfo() {
            GlobalPartitionInformation globalPartitionInformation = new GlobalPartitionInformation();
            try {
                int numPartitionsForTopic = getNumPartitions(); //从zk取得partition的数目
                String brokerInfoPath = brokerPath();
                for (int partition = 0; partition < numPartitionsForTopic; partition++) {
                    int leader = getLeaderFor(partition); //从zk获取partition的leader broker
                    String path = brokerInfoPath + "/" + leader;
                    try {
                        byte[] brokerData = _curator.getData().forPath(path);
                        Broker hp = getBrokerHost(brokerData); //从zk获取broker的host:port
                        globalPartitionInformation.addPartition(partition, hp);//生成GlobalPartitionInformation 
                    } catch (org.apache.zookeeper.KeeperException.NoNodeException e) {
                        LOG.error("Node {} does not exist ", path);
                    }
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
            LOG.info("Read partition info from zookeeper: " + globalPartitionInformation);
            return globalPartitionInformation;
        }

     

    DynamicPartitionConnections

    维护到每个broker的connection,并记录下每个broker上对应的partitions

    核心数据结构,为每个broker维持一个ConnectionInfo

    Map<Broker, ConnectionInfo> _connections = new HashMap();

    ConnectionInfo的定义,包含连接该broker的SimpleConsumer和记录partitions的set

        static class ConnectionInfo {
            SimpleConsumer consumer;
            Set<Integer> partitions = new HashSet();
    
            public ConnectionInfo(SimpleConsumer consumer) {
                this.consumer = consumer;
            }
        }

    核心函数,就是register

        public SimpleConsumer register(Broker host, int partition) {
            if (!_connections.containsKey(host)) {
                _connections.put(host, new ConnectionInfo(new SimpleConsumer(host.host, host.port, _config.socketTimeoutMs, _config.bufferSizeBytes, _config.clientId)));
            }
            ConnectionInfo info = _connections.get(host);
            info.partitions.add(partition);
            return info.consumer;
        }

     

     

    PartitionManager

    关键核心逻辑,用于管理一个partiiton的读取状态
    先理解下面几个变量,

    Long _emittedToOffset;
    Long _committedTo;
    SortedSet<Long> _pending = new TreeSet<Long>();
    LinkedList<MessageAndRealOffset> _waitingToEmit = new LinkedList<MessageAndRealOffset>();

    kafka对于一个partition,一定是从offset从小到大按顺序读的,并且这里为了保证不读丢数据,会定期的将当前状态即offset写入zk

    几个中间状态,

    从kafka读到的offset,_emittedToOffset
    从kafka读到的messages会放入_waitingToEmit,放入这个list,我们就认为一定会被emit,所以emittedToOffset可以认为是从kafka读到的offset

    已经成功处理的offset,lastCompletedOffset
    由于message是要在storm里面处理的,其中是可能fail的,所以正在处理的offset是缓存在_pending中的
    如果_pending为空,那么lastCompletedOffset=_emittedToOffset
    如果_pending不为空,那么lastCompletedOffset为pending list里面第一个offset,因为后面都还在等待ack

        public long lastCompletedOffset() {
            if (_pending.isEmpty()) {
                return _emittedToOffset;
            } else {
                return _pending.first();
            }
        }

     

    已经写入zk的offset,_committedTo
    我们需要定期将lastCompletedOffset,写入zk,否则crash后,我们不知道上次读到哪儿了
    所以_committedTo <= lastCompletedOffset

    完整过程,

    1. 初始化,

    关键就是注册partition,然后初始化offset,以知道从哪里开始读

        public PartitionManager(DynamicPartitionConnections connections, String topologyInstanceId, ZkState state, Map stormConf, SpoutConfig spoutConfig, Partition id) {
            _partition = id;
            _connections = connections;
            _spoutConfig = spoutConfig;
            _topologyInstanceId = topologyInstanceId;
            _consumer = connections.register(id.host, id.partition); //注册partition到connections,并生成simpleconsumer
            _state = state;
            _stormConf = stormConf;
    
            String jsonTopologyId = null;
            Long jsonOffset = null;
            String path = committedPath();
            try {
                Map<Object, Object> json = _state.readJSON(path);
                LOG.info("Read partition information from: " + path +  " --> " + json );
                if (json != null) {
                    jsonTopologyId = (String) ((Map<Object, Object>) json.get("topology")).get("id");
                    jsonOffset = (Long) json.get("offset"); // 从zk中读出commited offset
                }
            } catch (Throwable e) {
                LOG.warn("Error reading and/or parsing at ZkNode: " + path, e);
            }
    
            if (jsonTopologyId == null || jsonOffset == null) { // zk中没有记录,那么根据spoutConfig.startOffsetTime设置offset,Earliest或Latest
                _committedTo = KafkaUtils.getOffset(_consumer, spoutConfig.topic, id.partition, spoutConfig);
                LOG.info("No partition information found, using configuration to determine offset");
            } else if (!topologyInstanceId.equals(jsonTopologyId) && spoutConfig.forceFromStart) {
                _committedTo = KafkaUtils.getOffset(_consumer, spoutConfig.topic, id.partition, spoutConfig.startOffsetTime);
                LOG.info("Topology change detected and reset from start forced, using configuration to determine offset");
            } else {
                _committedTo = jsonOffset;
            }
    
            _emittedToOffset = _committedTo; // 初始化时,中间状态都是一致的
        }

     

    2. 从kafka读取messages,放到_waitingToEmit

    从kafka中读到数据ByteBufferMessageSet,
    把需要emit的msg,MessageAndRealOffset,放到_waitingToEmit
    把没完成的offset放到pending
    更新emittedToOffset

        private void fill() {
            ByteBufferMessageSet msgs = KafkaUtils.fetchMessages(_spoutConfig, _consumer, _partition, _emittedToOffset);
            for (MessageAndOffset msg : msgs) {
                _pending.add(_emittedToOffset);
                _waitingToEmit.add(new MessageAndRealOffset(msg.message(), _emittedToOffset));
                _emittedToOffset = msg.nextOffset();
            }
        }

    其中fetch message的逻辑如下,

        public static ByteBufferMessageSet fetchMessages(KafkaConfig config, SimpleConsumer consumer, Partition partition, long offset) {
            ByteBufferMessageSet msgs = null;
            String topic = config.topic;
            int partitionId = partition.partition;
            for (int errors = 0; errors < 2 && msgs == null; errors++) { // 容忍两次错误
                FetchRequestBuilder builder = new FetchRequestBuilder();
                FetchRequest fetchRequest = builder.addFetch(topic, partitionId, offset, config.fetchSizeBytes).
                        clientId(config.clientId).build();
                FetchResponse fetchResponse;
                try {
                    fetchResponse = consumer.fetch(fetchRequest);
                } catch (Exception e) {
                    if (e instanceof ConnectException) {
                        throw new FailedFetchException(e);
                    } else {
                        throw new RuntimeException(e);
                    }
                }
                if (fetchResponse.hasError()) { // 主要处理offset outofrange的case,通过getOffset从earliest或latest读
                    KafkaError error = KafkaError.getError(fetchResponse.errorCode(topic, partitionId));
                    if (error.equals(KafkaError.OFFSET_OUT_OF_RANGE) && config.useStartOffsetTimeIfOffsetOutOfRange && errors == 0) {
                        long startOffset = getOffset(consumer, topic, partitionId, config.startOffsetTime);
                        LOG.warn("Got fetch request with offset out of range: [" + offset + "]; " +
                                "retrying with default start offset time from configuration. " +
                                "configured start offset time: [" + config.startOffsetTime + "] offset: [" + startOffset + "]");
                        offset = startOffset;
                    } else {
                        String message = "Error fetching data from [" + partition + "] for topic [" + topic + "]: [" + error + "]";
                        LOG.error(message);
                        throw new FailedFetchException(message);
                    }
                } else {
                    msgs = fetchResponse.messageSet(topic, partitionId);
                }
            }
            return msgs;
        }

     

    3. emit msg

    从_waitingToEmit中取到msg,转换成tuple,然后通过collector.emit发出去

        public EmitState next(SpoutOutputCollector collector) {
            if (_waitingToEmit.isEmpty()) {
                fill();
            }
            while (true) {
                MessageAndRealOffset toEmit = _waitingToEmit.pollFirst();
                if (toEmit == null) {
                    return EmitState.NO_EMITTED;
                }
                Iterable<List<Object>> tups = KafkaUtils.generateTuples(_spoutConfig, toEmit.msg);
                if (tups != null) {
                    for (List<Object> tup : tups) {
                        collector.emit(tup, new KafkaMessageId(_partition, toEmit.offset));
                    }
                    break;
                } else {
                    ack(toEmit.offset);
                }
            }
            if (!_waitingToEmit.isEmpty()) {
                return EmitState.EMITTED_MORE_LEFT;
            } else {
                return EmitState.EMITTED_END;
            }
        }

    可以看看转换tuple的过程,
    可以看到是通过kafkaConfig.scheme.deserialize来做转换

        public static Iterable<List<Object>> generateTuples(KafkaConfig kafkaConfig, Message msg) {
            Iterable<List<Object>> tups;
            ByteBuffer payload = msg.payload();
            ByteBuffer key = msg.key();
            if (key != null && kafkaConfig.scheme instanceof KeyValueSchemeAsMultiScheme) {
                tups = ((KeyValueSchemeAsMultiScheme) kafkaConfig.scheme).deserializeKeyAndValue(Utils.toByteArray(key), Utils.toByteArray(payload));
            } else {
                tups = kafkaConfig.scheme.deserialize(Utils.toByteArray(payload));
            }
            return tups;
        }

    所以你使用时,需要定义scheme逻辑,

    spoutConfig.scheme = new SchemeAsMultiScheme(new TestMessageScheme());
    
    public class TestMessageScheme implements Scheme {
        private static final Logger LOGGER = LoggerFactory.getLogger(TestMessageScheme.class);
        
        @Override
        public List<Object> deserialize(byte[] bytes) {
        try {
            String msg = new String(bytes, "UTF-8");
            return new Values(msg);
        } catch (InvalidProtocolBufferException e) {
             LOGGER.error("Cannot parse the provided message!");
        }
            return null;
        }
    
        @Override
        public Fields getOutputFields() {
            return new Fields("msg");
        }
    }

     

    4. 定期的commit offset

        public void commit() {
            long lastCompletedOffset = lastCompletedOffset();
            if (lastCompletedOffset != lastCommittedOffset()) {
                Map<Object, Object> data = ImmutableMap.builder()
                        .put("topology", ImmutableMap.of("id", _topologyInstanceId,
                                "name", _stormConf.get(Config.TOPOLOGY_NAME)))
                        .put("offset", lastCompletedOffset)
                        .put("partition", _partition.partition)
                        .put("broker", ImmutableMap.of("host", _partition.host.host,
                                "port", _partition.host.port))
                        .put("topic", _spoutConfig.topic).build();
                _state.writeJSON(committedPath(), data);
                _committedTo = lastCompletedOffset;
            } else {
                LOG.info("No new offset for " + _partition + " for topology: " + _topologyInstanceId);
            }
        }

     

    5. 最后关注一下,fail时的处理

    首先作者没有cache message,而只是cache offset
    所以fail的时候,他是无法直接replay的,在他的注释里面写了,不这样做的原因是怕内存爆掉

    所以他的做法是,当一个offset fail的时候, 直接将_emittedToOffset回滚到当前fail的这个offset
    下次从Kafka fetch的时候会从_emittedToOffset开始读,这样做的好处就是依赖kafka做replay,问题就是会有重复问题
    所以使用时,一定要考虑,是否可以接受重复问题

        public void fail(Long offset) {
            //TODO: should it use in-memory ack set to skip anything that's been acked but not committed???
            // things might get crazy with lots of timeouts
            if (_emittedToOffset > offset) {
                _emittedToOffset = offset;
                _pending.tailSet(offset).clear();
            }
        }

     

    KafkaSpout

    最后来看看KafkaSpout

    1. 初始化
    关键就是初始化DynamicPartitionConnections和_coordinator

        public void open(Map conf, final TopologyContext context, final SpoutOutputCollector collector) {
            _collector = collector;
    
            Map stateConf = new HashMap(conf);
            List<String> zkServers = _spoutConfig.zkServers;
            if (zkServers == null) {
                zkServers = (List<String>) conf.get(Config.STORM_ZOOKEEPER_SERVERS);
            }
            Integer zkPort = _spoutConfig.zkPort;
            if (zkPort == null) {
                zkPort = ((Number) conf.get(Config.STORM_ZOOKEEPER_PORT)).intValue();
            }
            stateConf.put(Config.TRANSACTIONAL_ZOOKEEPER_SERVERS, zkServers);
            stateConf.put(Config.TRANSACTIONAL_ZOOKEEPER_PORT, zkPort);
            stateConf.put(Config.TRANSACTIONAL_ZOOKEEPER_ROOT, _spoutConfig.zkRoot);
            _state = new ZkState(stateConf);
    
            _connections = new DynamicPartitionConnections(_spoutConfig, KafkaUtils.makeBrokerReader(conf, _spoutConfig));
    
            // using TransactionalState like this is a hack
            int totalTasks = context.getComponentTasks(context.getThisComponentId()).size();
            if (_spoutConfig.hosts instanceof StaticHosts) {
                _coordinator = new StaticCoordinator(_connections, conf, _spoutConfig, _state, context.getThisTaskIndex(), totalTasks, _uuid);
            } else {
                _coordinator = new ZkCoordinator(_connections, conf, _spoutConfig, _state, context.getThisTaskIndex(), totalTasks, _uuid);
            }
        }

    看看_coordinator 是干嘛的?
    这很关键,因为我们一般都会开多个并发的kafkaspout,类似于high-level中的consumer group,如何保证这些并发的线程不冲突?
    使用和highlevel一样的思路,一个partition只会有一个spout消费,这样就避免处理麻烦的访问互斥问题(kafka做访问互斥很麻烦,试着想想)
    是根据当前spout的task数和partition数来分配,task和partitioin的对应关系的,并且为每个partition建立PartitionManager

    这里首先看到totalTasks就是当前这个spout component的task size
    StaticCoordinator和ZkCoordinator的差别就是, 从StaticHost还是从Zk读到partition的信息,简单起见,看看StaticCoordinator实现

    public class StaticCoordinator implements PartitionCoordinator {
        Map<Partition, PartitionManager> _managers = new HashMap<Partition, PartitionManager>();
        List<PartitionManager> _allManagers = new ArrayList();
    
        public StaticCoordinator(DynamicPartitionConnections connections, Map stormConf, SpoutConfig config, ZkState state, int taskIndex, int totalTasks, String topologyInstanceId) {
            StaticHosts hosts = (StaticHosts) config.hosts;
            List<Partition> myPartitions = KafkaUtils.calculatePartitionsForTask(hosts.getPartitionInformation(), totalTasks, taskIndex);
            for (Partition myPartition : myPartitions) {// 建立PartitionManager
                _managers.put(myPartition, new PartitionManager(connections, topologyInstanceId, state, stormConf, config, myPartition));
            }
            _allManagers = new ArrayList(_managers.values());
        }
    
        @Override
        public List<PartitionManager> getMyManagedPartitions() {
            return _allManagers;
        }
    
        public PartitionManager getManager(Partition partition) {
            return _managers.get(partition);
        }
    
    }

    其中分配的逻辑在calculatePartitionsForTask

        public static List<Partition> calculatePartitionsForTask(GlobalPartitionInformation partitionInformation, int totalTasks, int taskIndex) {
            Preconditions.checkArgument(taskIndex < totalTasks, "task index must be less that total tasks");
            List<Partition> partitions = partitionInformation.getOrderedPartitions();
            int numPartitions = partitions.size();
            List<Partition> taskPartitions = new ArrayList<Partition>();
            for (int i = taskIndex; i < numPartitions; i += totalTasks) {// 平均分配,
                Partition taskPartition = partitions.get(i);
                taskPartitions.add(taskPartition);
            }
            logPartitionMapping(totalTasks, taskIndex, taskPartitions);
            return taskPartitions;
        }

     

    2. nextTuple

    逻辑写的比较tricky,其实只要从一个partition读成功一次
    只所以要for,是当EmitState.NO_EMITTED时,需要遍历后面的partition以保证读成功一次

        @Override
        public void nextTuple() {
            List<PartitionManager> managers = _coordinator.getMyManagedPartitions();
            for (int i = 0; i < managers.size(); i++) {
    
                // in case the number of managers decreased
                _currPartitionIndex = _currPartitionIndex % managers.size(); //_currPartitionIndex初始为0,每次依次读一个partition
                EmitState state = managers.get(_currPartitionIndex).next(_collector); //调用PartitonManager.next去emit数据
                if (state != EmitState.EMITTED_MORE_LEFT) { //当EMITTED_MORE_LEFT时,还有数据,可以继续读,不需要+1
                    _currPartitionIndex = (_currPartitionIndex + 1) % managers.size();
                }
                if (state != EmitState.NO_EMITTED) { //当EmitState.NO_EMITTED时,表明partition的数据已经读完,也就是没有读到数据,所以不能break
                    break;
                }
            }
    
            long now = System.currentTimeMillis();
            if ((now - _lastUpdateMs) > _spoutConfig.stateUpdateIntervalMs) {
                commit(); //定期commit
            }
        }

    定期commit的逻辑,遍历去commit每个PartitionManager

        private void commit() {
            _lastUpdateMs = System.currentTimeMillis();
            for (PartitionManager manager : _coordinator.getMyManagedPartitions()) {
                manager.commit();
            }
        }

     

    3. Ack和Fail

    直接调用PartitionManager

        @Override
        public void ack(Object msgId) {
            KafkaMessageId id = (KafkaMessageId) msgId;
            PartitionManager m = _coordinator.getManager(id.partition);
            if (m != null) {
                m.ack(id.offset);
            }
        }
    
        @Override
        public void fail(Object msgId) {
            KafkaMessageId id = (KafkaMessageId) msgId;
            PartitionManager m = _coordinator.getManager(id.partition);
            if (m != null) {
                m.fail(id.offset);
            }
        }

     

    4. declareOutputFields
    所以在scheme里面需要定义,deserialize和getOutputFields

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(_spoutConfig.scheme.getOutputFields());
        }

     

    Metrics

    再来看下Metrics,关键学习一下如何在storm里面加metrics
    在spout.open里面初始化了下面两个metrics

    kafkaOffset
    反映出每个partition的earliestTimeOffset,latestTimeOffset,和latestEmittedOffset,其中latestTimeOffset - latestEmittedOffset就是spout lag
    除了反映出每个partition的,还会算出所有的partitions的总数据

            context.registerMetric("kafkaOffset", new IMetric() {
                KafkaUtils.KafkaOffsetMetric _kafkaOffsetMetric = new KafkaUtils.KafkaOffsetMetric(_spoutConfig.topic, _connections);
    
                @Override
                public Object getValueAndReset() {
                    List<PartitionManager> pms = _coordinator.getMyManagedPartitions(); //从coordinator获取pms的信息
                    Set<Partition> latestPartitions = new HashSet();
                    for (PartitionManager pm : pms) {
                        latestPartitions.add(pm.getPartition());
                    }
                    _kafkaOffsetMetric.refreshPartitions(latestPartitions); //根据最新的partition信息删除metric中已经不存在的partition的统计信息
                    for (PartitionManager pm : pms) {
                        _kafkaOffsetMetric.setLatestEmittedOffset(pm.getPartition(), pm.lastCompletedOffset()); //更新metric中每个partition的已经完成的offset
                    }
                    return _kafkaOffsetMetric.getValueAndReset();
                }
            }, _spoutConfig.metricsTimeBucketSizeInSecs);

    _kafkaOffsetMetric.getValueAndReset,其实只是get,不需要reset

    @Override
            public Object getValueAndReset() {
                try {
                    long totalSpoutLag = 0;
                    long totalEarliestTimeOffset = 0;
                    long totalLatestTimeOffset = 0;
                    long totalLatestEmittedOffset = 0;
                    HashMap ret = new HashMap();
                    if (_partitions != null && _partitions.size() == _partitionToOffset.size()) {
                        for (Map.Entry<Partition, Long> e : _partitionToOffset.entrySet()) {
                            Partition partition = e.getKey();
                            SimpleConsumer consumer = _connections.getConnection(partition);
                            long earliestTimeOffset = getOffset(consumer, _topic, partition.partition, kafka.api.OffsetRequest.EarliestTime());
                            long latestTimeOffset = getOffset(consumer, _topic, partition.partition, kafka.api.OffsetRequest.LatestTime());
                            long latestEmittedOffset = e.getValue();
                            long spoutLag = latestTimeOffset - latestEmittedOffset;
                            ret.put(partition.getId() + "/" + "spoutLag", spoutLag);
                            ret.put(partition.getId() + "/" + "earliestTimeOffset", earliestTimeOffset);
                            ret.put(partition.getId() + "/" + "latestTimeOffset", latestTimeOffset);
                            ret.put(partition.getId() + "/" + "latestEmittedOffset", latestEmittedOffset);
                            totalSpoutLag += spoutLag;
                            totalEarliestTimeOffset += earliestTimeOffset;
                            totalLatestTimeOffset += latestTimeOffset;
                            totalLatestEmittedOffset += latestEmittedOffset;
                        }
                        ret.put("totalSpoutLag", totalSpoutLag);
                        ret.put("totalEarliestTimeOffset", totalEarliestTimeOffset);
                        ret.put("totalLatestTimeOffset", totalLatestTimeOffset);
                        ret.put("totalLatestEmittedOffset", totalLatestEmittedOffset);
                        return ret;
                    } else {
                        LOG.info("Metrics Tick: Not enough data to calculate spout lag.");
                    }
                } catch (Throwable t) {
                    LOG.warn("Metrics Tick: Exception when computing kafkaOffset metric.", t);
                }
                return null;
            }

     

    kafkaPartition
    反映出从Kafka fetch数据的情况,fetchAPILatencyMax,fetchAPILatencyMean,fetchAPICallCount 和 fetchAPIMessageCount

            context.registerMetric("kafkaPartition", new IMetric() {
                @Override
                public Object getValueAndReset() {
                    List<PartitionManager> pms = _coordinator.getMyManagedPartitions();
                    Map concatMetricsDataMaps = new HashMap();
                    for (PartitionManager pm : pms) {
                        concatMetricsDataMaps.putAll(pm.getMetricsDataMap());
                    }
                    return concatMetricsDataMaps;
                }
            }, _spoutConfig.metricsTimeBucketSizeInSecs);

    pm.getMetricsDataMap(),

    public Map getMetricsDataMap() {
            Map ret = new HashMap();
            ret.put(_partition + "/fetchAPILatencyMax", _fetchAPILatencyMax.getValueAndReset());
            ret.put(_partition + "/fetchAPILatencyMean", _fetchAPILatencyMean.getValueAndReset());
            ret.put(_partition + "/fetchAPICallCount", _fetchAPICallCount.getValueAndReset());
            ret.put(_partition + "/fetchAPIMessageCount", _fetchAPIMessageCount.getValueAndReset());
            return ret;
        }

    更新的逻辑如下,

        private void fill() {
            long start = System.nanoTime();
            ByteBufferMessageSet msgs = KafkaUtils.fetchMessages(_spoutConfig, _consumer, _partition, _emittedToOffset);
            long end = System.nanoTime();
            long millis = (end - start) / 1000000;
            _fetchAPILatencyMax.update(millis);
            _fetchAPILatencyMean.update(millis);
            _fetchAPICallCount.incr();
            int numMessages = countMessages(msgs);
            _fetchAPIMessageCount.incrBy(numMessages);
    }

     

    我们在读取kafka时,

    首先是关心,每个partition的读取状况,这个通过取得KafkaOffset Metrics就可以知道

    再者,我们需要replay数据,使用high-level接口的时候可以通过系统提供的工具,这里如何搞?

    看下下面的代码,
    第一个if,是从配置文件里面没有读到配置的情况
    第二个else if,当topologyInstanceId发生变化时,并且forceFromStart为true时,就会取startOffsetTime指定的offset(Latest或Earliest)
    这个topologyInstanceId, 每次KafkaSpout对象生成的时候随机产生,
    String _uuid = UUID.randomUUID().toString();
    Spout对象是在topology提交时,在client端生成一次的,所以如果topology停止,再重新启动,这个id一定会发生变化

    所以应该是只需要把forceFromStart设为true,再重启topology,就可以实现replay

            if (jsonTopologyId == null || jsonOffset == null) { // failed to parse JSON?
                _committedTo = KafkaUtils.getOffset(_consumer, spoutConfig.topic, id.partition, spoutConfig);
                LOG.info("No partition information found, using configuration to determine offset");
            } else if (!topologyInstanceId.equals(jsonTopologyId) && spoutConfig.forceFromStart) {
                _committedTo = KafkaUtils.getOffset(_consumer, spoutConfig.topic, id.partition, spoutConfig.startOffsetTime);
                LOG.info("Topology change detected and reset from start forced, using configuration to determine offset");
            } else {
                _committedTo = jsonOffset;
                LOG.info("Read last commit offset from zookeeper: " + _committedTo + "; old topology_id: " + jsonTopologyId + " - new topology_id: " + topologyInstanceId );
            }

     

    代码例子

    storm-kafka的文档很差,最后附上使用的例子

    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.BrokerHosts;
    import storm.kafka.ZkHosts;
    import storm.kafka.KeyValueSchemeAsMultiScheme;
    import storm.kafka.KeyValueScheme;
    
        public static class SimplekVScheme implements KeyValueScheme { //定义scheme
            @Override
            public List<Object> deserializeKeyAndValue(byte[] key, byte[] value){
                ArrayList tuple = new ArrayList();
                tuple.add(key);
                tuple.add(value);
                return tuple;
            }
            
            @Override
            public List<Object> deserialize(byte[] bytes) {
                ArrayList tuple = new ArrayList();
                tuple.add(bytes);
                return tuple;
            }
    
            @Override
            public Fields getOutputFields() {
                return new Fields("key","value");
            }
    
        }   
    
            String topic = “test”;  //
            String zkRoot = “/kafkastorm”; //
            String spoutId = “id”; //读取的status会被存在,/kafkastorm/id下面,所以id类似consumer group
            
            BrokerHosts brokerHosts = new ZkHosts("10.1.110.24:2181,10.1.110.22:2181"); 
    
            SpoutConfig spoutConfig = new SpoutConfig(brokerHosts, topic, zkRoot, spoutId);
            spoutConfig.scheme = new KeyValueSchemeAsMultiScheme(new SimplekVScheme());
            
            /*spoutConfig.zkServers = new ArrayList<String>(){{ //只有在local模式下需要记录读取状态时,才需要设置
                add("10.118.136.107");
            }};
            spoutConfig.zkPort = 2181;*/
            
            spoutConfig.forceFromStart = false; 
            spoutConfig.startOffsetTime = kafka.api.OffsetRequest.EarliestTime();    
            spoutConfig.metricsTimeBucketSizeInSecs = 6;
    
            builder.setSpout(SqlCollectorTopologyDef.KAFKA_SPOUT_NAME, new KafkaSpout(spoutConfig), 1);
  • 相关阅读:
    2020.05.27
    static{}静态代码块与{}普通代码块之间的区别
    Spring 注解@Autowired注解
    java:List的深拷贝
    IDEA中MAVEN无法自动加载的问题
    java Comparator接口
    JAVA ArrayList<E>
    JAVA BigInteger
    JAVA输入输出
    JAVA String,StringBuilder的一些API
  • 原文地址:https://www.cnblogs.com/fxjwind/p/3808346.html
Copyright © 2011-2022 走看看