  • Spark Streaming + Kafka integration

    References first:

    Spark Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher):http://spark.apache.org/docs/2.2.0/streaming-kafka-0-10-integration.html

    Kafka Producer API (Java client): http://kafka.apache.org/documentation/#producerapi

    Versions:

    spark: 2.1.1
    scala: 2.11.12
    kafka (running version): 2.3.0
    spark-streaming-kafka-0-10_2.11: 2.2.0

    Development environment:

      Kafka is deployed on three virtual machines with hostnames coo1, coo2, and coo3, running the versions listed above; the ZooKeeper version is 3.4.7.

      Create the topic xzrz on Kafka with a replication factor of 3 and 4 partitions:

    ./kafka-topics.sh --bootstrap-server coo3:9092,coo2:9092,coo1:9092 --create --topic xzrz --replication-factor 3 --partitions 4
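
    If it is more convenient to create the topic from code, here is a minimal sketch using the AdminClient from kafka-clients 2.3 (written in Scala to match the consumer code later; the object name is illustrative, while the topic name, partition count and replication factor mirror the command above):

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

    object CreateTopicSketch {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "coo1:9092,coo2:9092,coo3:9092")
        val admin = AdminClient.create(props)
        try {
          // xzrz: 4 partitions, replication factor 3, same as the kafka-topics.sh command above
          val topic = new NewTopic("xzrz", 4, 3.toShort)
          admin.createTopics(Collections.singleton(topic)).all().get()
        } finally {
          admin.close()
        }
      }
    }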

      Prepare the code:

      One part is the Java-side Kafka producer: KafkaSender.java

      The other is the Scala-side Spark Streaming Kafka consumer: KafkaStreaming.scala

    Kafka producer:

      Maven configuration:

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <groupId>kafkaTest</groupId>
        <artifactId>kafkaTest</artifactId>
        <version>1.0-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>2.3.0</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.13-beta-2</version>
            <scope>compile</scope>
        </dependency>
    </dependencies>
    </project>
    KafkaSender.java
    package gjm;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.junit.Test;

    import java.util.Properties;
    import java.util.concurrent.ExecutionException;

    public class KafkaSender {
        @Test
        public void producer() throws InterruptedException, ExecutionException {
            Properties props = new Properties();
            props.put("key.serializer", "org.apache.kafka.common.serialization.IntegerSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "coo1:9092,coo2:9092,coo3:9092");
    //        props.put(ProducerConfig.BATCH_SIZE_CONFIG,"1024");
    //        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG,"0");
            // configure the producer for exactly-once semantics
            props.put("acks", "all");
            props.put("enable.idempotence", "true");
            Producer<Integer, String> kafkaProducer = new KafkaProducer<Integer, String>(props);
            for (int j = 0; j < 1; j++)
                for (int i = 0; i < 100; i++) {
                    ProducerRecord<Integer, String> message = new ProducerRecord<Integer, String>("xzrz", "{wo|2019-12-12|1|2|0|5}");
                    kafkaProducer.send(message);
                }
            // flush() and close() must be called, much like closing a stream:
            // the producer buffers records in batches, and without these two lines resources are wasted
            // and messages may stay in the local batch instead of ever reaching Kafka
            kafkaProducer.flush();
            kafkaProducer.close();
        }
    }
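
    The send() calls above are fire-and-forget: a broker-side failure would only show up in the producer's own log. Below is a small hedged sketch of attaching a callback so each record's outcome is reported (written in Scala like the other sketches in this post; the helper name is illustrative, and the producer is assumed to be configured with the same Properties as KafkaSender above):

    import org.apache.kafka.clients.producer.{Callback, KafkaProducer, ProducerRecord, RecordMetadata}

    // assumes a producer built with the same Properties as in KafkaSender
    def sendWithCallback(producer: KafkaProducer[Integer, String], value: String): Unit = {
      val record = new ProducerRecord[Integer, String]("xzrz", value)
      producer.send(record, new Callback {
        override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit = {
          if (exception != null)
            println(s"send failed: ${exception.getMessage}")
          else
            println(s"sent to partition ${metadata.partition()} at offset ${metadata.offset()}")
        }
      })
    }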

    Kafka consumer --> spark-streaming-kafka --> KafkaStreaming.scala:

      Maven configuration:

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>

        <groupId>sparkMVN</groupId>
        <artifactId>sparkMVN</artifactId>
        <version>1.0-SNAPSHOT</version>


        <properties>
            <spark.version>2.1.1</spark.version>
            <hadoop.version>2.7.3</hadoop.version>
            <hbase.version>0.98.17-hadoop2</hbase.version>
        </properties>
    <dependencies>

        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>${spark.version}</version>
            <!-- In local mode this scope must stay commented out, otherwise classes such as SparkContext cannot be found at runtime -->
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>

        <!-- hadoop -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.version}</version>
        </dependency>

        <!--hbase-->
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>${hbase.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>${hbase.version}</version>
        </dependency>
    </dependencies>
    </project>
    KafkaStreaming.scala:
    package gjm.sparkDemos

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.slf4j.LoggerFactory

    object KafkaStreaming {
      def main(args: Array[String]): Unit = {
        val LOG = LoggerFactory.getLogger(KafkaStreaming.getClass)
        LOG.info("Streaming start----->")

        val conf = new SparkConf().setMaster("local[6]") // set the local thread count to 6 here and see what effect it has on consumption
          .setAppName("KafkaStreaming")
        val sc = new SparkContext(conf)
        val ssc = new StreamingContext(sc, Seconds(3))
        val topics = Array("xzrz")
        val kafkaParams = Map[String, Object](
          "bootstrap.servers" -> "coo1:9092,coo2:9092,coo3:9092",
          "key.deserializer" -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id" -> "fwjkcx",
          "auto.offset.reset" -> "earliest",
          "enable.auto.commit" -> (false: java.lang.Boolean)
          //      "heartbeat.interval.ms" -> (90000: java.lang.Integer),
          //      "session.timeout.ms" -> (120000: java.lang.Integer),
          //      "group.max.session.timeout.ms" -> (120000: java.lang.Integer),
          //      "request.timeout.ms" -> (130000: java.lang.Integer),
          //      "fetch.max.wait.ms" -> (120000: java.lang.Integer)
        )

        val stream = KafkaUtils.createDirectStream[String, String](
          ssc,
          PreferConsistent,
          Subscribe[String, String](topics, kafkaParams)
        )
        LOG.info("Streaming had Created----->")
        LOG.info("Streaming Consuming msg----->")
        stream.foreachRDD { rdd =>
          val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
          rdd.foreachPartition(recordIt => {
            for (record <- recordIt) {
              LOG.info("Message recode info : Topics-->{},partition-->{}, checkNum-->{}, offset-->{}, value-->{}", record.topic(), record.partition().toString, record.checksum().toString, record.offset().toString, record.value())
            }
          })
          // some time later, after outputs have completed
          stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
        }
        ssc.start()
        ssc.awaitTermination()
      }
    }
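
    Because enable.auto.commit is false and commitAsync is called only after the batch's output has completed, the committed offsets for group fwjkcx are stored back in Kafka, which is why kafka-consumer-groups.sh can report this job's lag in the tests below. As a small hedged sketch, the per-partition offset range of each batch could also be logged inside the foreachRDD block above (the fields come from org.apache.spark.streaming.kafka010.OffsetRange):

    // inside stream.foreachRDD { rdd => ... }, right after obtaining offsetRanges:
    offsetRanges.foreach { range =>
      // one OffsetRange per Kafka partition in this batch
      LOG.info(s"batch range: topic=${range.topic} partition=${range.partition} from=${range.fromOffset} until=${range.untilOffset}")
    }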

    Verification tests:

    1. Send 100 messages with the producer.
    2. Start Kafka's built-in console consumer, with group id test:
    sh kafka-console-consumer.sh --bootstrap-server coo3:9092,coo2:9092,coo1:9092 --topic xzrz --from-beginning --group test
    3. Start the Spark Streaming Kafka job; the batch interval is set to 3 seconds in the code.
    4. Query the consumption status of each group with Kafka's built-in consumer-group tool:
    ./kafka-consumer-groups.sh --bootstrap-server coo3:9092,coo2:9092,coo1:9092 --describe --group test
    ./kafka-consumer-groups.sh --bootstrap-server coo3:9092,coo2:9092,coo1:9092 --describe --group fwjkcx
    Results:
    1. First, the consumption status of the test console consumer:
    [root@coo3 bin]# ./kafka-consumer-groups.sh --bootstrap-server coo3:9092,coo2:9092,coo1:9092 --describe --group test
    
    GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                     HOST            CLIENT-ID
    test            xzrz            0          25              25              0               consumer-1-4adfdb85-45ef-40a5-9127-7bb6239e0e29 /192.168.0.217  consumer-1
    test            xzrz            1          25              25              0               consumer-1-4adfdb85-45ef-40a5-9127-7bb6239e0e29 /192.168.0.217  consumer-1
    test            xzrz            2          25              25              0               consumer-1-4adfdb85-45ef-40a5-9127-7bb6239e0e29 /192.168.0.217  consumer-1
    test            xzrz            3          25              25              0               consumer-1-4adfdb85-45ef-40a5-9127-7bb6239e0e29 /192.168.0.217  consumer-1
    All four partitions were assigned to the same consumer instance, and 100 messages were consumed in total.
    2. The Spark Streaming consumer (group fwjkcx):
    [root@coo3 bin]# ./kafka-consumer-groups.sh --bootstrap-server coo3:9092,coo2:9092,coo1:9092 --describe --group fwjkcx
    
    GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                     HOST            CLIENT-ID
    fwjkcx          xzrz            0          25              25              0               consumer-1-0cca92be-5970-4030-abd1-b8552dea9718 /192.168.0.60   consumer-1
    fwjkcx          xzrz            1          25              25              0               consumer-1-0cca92be-5970-4030-abd1-b8552dea9718 /192.168.0.60   consumer-1
    fwjkcx          xzrz            2          25              25              0               consumer-1-0cca92be-5970-4030-abd1-b8552dea9718 /192.168.0.60   consumer-1
    fwjkcx          xzrz            3          25              25              0               consumer-1-0cca92be-5970-4030-abd1-b8552dea9718 /192.168.0.60   consumer-1
    Each of the four partitions consumed 25 messages, which matches expectations.
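    The even split of 25 messages per partition comes from the producer side: the ProducerRecord in KafkaSender is created without a key, and in kafka-clients 2.3 the default partitioner distributes keyless records across partitions round-robin, so 100 messages become 25 per partition. For contrast, here is a hedged one-line sketch of the keyed form, which instead chooses the partition by hashing the key, so all records with the same key land on one partition (the key value 42 is just an example):

    import org.apache.kafka.clients.producer.ProducerRecord

    // keyed record: the partition is derived from the key, not assigned round-robin
    val keyed = new ProducerRecord[Integer, String]("xzrz", Integer.valueOf(42), "{wo|2019-12-12|1|2|0|5}")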
    3. One more experiment: change the Spark Streaming master to local[3] and the consumer group to fwjkcx01, then observe consumption:
    [root@coo3 bin]# ./kafka-consumer-groups.sh --bootstrap-server coo3:9092,coo2:9092,coo1:9092 --describe --group fwjkcx01
    
    GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                     HOST            CLIENT-ID
    fwjkcx01        xzrz            0          25              25              0               consumer-1-03542086-c95d-41c2-b199-24a158708b65 /192.168.0.60   consumer-1
    fwjkcx01        xzrz            1          25              25              0               consumer-1-03542086-c95d-41c2-b199-24a158708b65 /192.168.0.60   consumer-1
    fwjkcx01        xzrz            2          25              25              0               consumer-1-03542086-c95d-41c2-b199-24a158708b65 /192.168.0.60   consumer-1
    fwjkcx01        xzrz            3          25              25              0               consumer-1-03542086-c95d-41c2-b199-24a158708b65 /192.168.0.60   consumer-1
    The output still shows all four partitions being consumed exactly as before, so the thread count given in local[] does not change how the topic's partitions are consumed.
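    This matches how the direct stream works: each batch RDD has exactly one partition per Kafka partition of the subscribed topic, so for xzrz there are always 4 read tasks per batch, and local[N] only bounds how many of those tasks run concurrently; it does not add or remove Kafka consumers. A hedged sketch of how this could be checked, and how downstream processing parallelism could be raised independently of the Kafka partition count (repartition shuffles the data, it does not create extra consumers):

    stream.foreachRDD { rdd =>
      // with the direct stream this equals the topic's partition count (4 for xzrz)
      LOG.info(s"kafka-backed partitions in this batch: ${rdd.getNumPartitions}")

      // heavier downstream work can be spread over more tasks, at the cost of a shuffle
      val widened = rdd.map(_.value()).repartition(12)
      LOG.info(s"partitions after repartition: ${widened.getNumPartitions}")
    }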
    4. Now send another 1000 messages and observe the consumption status of group fwjkcx:
    [root@coo3 bin]# ./kafka-consumer-groups.sh --bootstrap-server coo3:9092,coo2:9092,coo1:9092 --describe --group fwjkcx
    
    GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                     HOST            CLIENT-ID
    fwjkcx          xzrz            0          275             275             0               consumer-1-fc238353-1b55-4efa-9c4f-54580ed81b0e /192.168.0.60   consumer-1
    fwjkcx          xzrz            1          275             275             0               consumer-1-fc238353-1b55-4efa-9c4f-54580ed81b0e /192.168.0.60   consumer-1
    fwjkcx          xzrz            2          275             275             0               consumer-1-fc238353-1b55-4efa-9c4f-54580ed81b0e /192.168.0.60   consumer-1
    fwjkcx          xzrz            3          275             275             0               consumer-1-fc238353-1b55-4efa-9c4f-54580ed81b0e /192.168.0.60   consumer-1
    Everything looks normal: each partition's offset advances from 25 to 275, so the additional 1000 messages were again spread evenly, 250 per partition.
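
    For completeness, a hedged sketch of reading the same committed offsets programmatically with the AdminClient from kafka-clients 2.3 (comparing these values with the log-end offsets gives the lag shown by kafka-consumer-groups.sh):

    import java.util.Properties
    import scala.collection.JavaConverters._
    import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig}

    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "coo1:9092,coo2:9092,coo3:9092")
    val admin = AdminClient.create(props)
    // committed offset per partition for the fwjkcx group
    val committed = admin.listConsumerGroupOffsets("fwjkcx").partitionsToOffsetAndMetadata().get()
    committed.asScala.foreach { case (tp, om) => println(s"$tp -> ${om.offset()}") }
    admin.close()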